## 🛠️ Tool Stubbing for Reliable Evals
When running offline evals, your agent may call tools like `weather`, `search`, or `lookup_customer`. If those tools hit live systems, you'll get non-deterministic results, which makes evaluation noisy and hard to compare across runs.
To solve this, Steel Thread supports stubbing tools with fixed, fake responses.
This makes your tests:
- ✅ Deterministic — the same input always produces the same output
- ✅ Isolated — no external API calls or flaky systems
- ✅ Repeatable — easy to track regressions across changes
### 🤔 When to Stub
Use tool stubs when:
- You're writing offline evaluations
- The tool response affects the plan or output
- You want consistent scoring across iterations
### ⚙️ How Tool Stubbing Works
Steel Thread provides a `ToolStubRegistry`, a drop-in replacement for Portia's default tool registry. You wrap your existing registry and selectively override individual tools by ID.
### 🔁 Example: Stubbing the Weather Tool
```python
from portia import Config, Portia, ToolRunContext, DefaultToolRegistry
from steelthread.portia.tools import ToolStubRegistry

config = Config.from_default()


# Define stub behavior
def weather_stub(index, ctx: ToolRunContext, args, kwargs):
    city = kwargs.get("city", "").lower()
    return {
        "sydney": "33.28",
        "london": "2.00",
    }.get(city, f"Unknown city: {city}")


# Wrap your registry with a stub registry so only "weather_tool" is faked
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": weather_stub,
        },
    ),
)
```
✅ The rest of your tools still work as normal — only the stubbed one is overridden.
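With the registry wrapped, you run the agent exactly as you normally would. A quick sanity check, assuming the standard `portia.run` entry point (the query string here is illustrative):

```python
# The agent resolves "weather_tool" calls via the stub, not a live API
plan_run = portia.run("What's the weather in Sydney?")
print(plan_run.model_dump_json(indent=2))
```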
### 🔬 Tool Stub Function Signature
Tool stubs are simple Python functions:
```python
from typing import Any

from portia import ToolRunContext


def my_stub(
    tool_call_index: int,
    ctx: ToolRunContext,
    args: tuple,
    kwargs: dict,
) -> Any:
    ...
```
You can use:
- `tool_call_index` to vary behavior across iterations
- `ctx` to access plan metadata, the end user, etc.
- `args` and `kwargs` for tool inputs
Return any value your real tool would return — a string, object, dict, etc.
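For example, here is a minimal sketch of a stub that uses `tool_call_index` to vary behavior across iterations (the `alternating_weather_stub` name and its values are illustrative, not part of Steel Thread):

```python
from portia import ToolRunContext


def alternating_weather_stub(
    tool_call_index: int,
    ctx: ToolRunContext,
    args: tuple,
    kwargs: dict,
) -> str:
    # Return a different reading on alternating eval iterations
    return "33.28" if tool_call_index % 2 == 0 else "2.00"
```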
### 🧪 Using Tool Stubs in Offline Eval Runs
```python
from steelthread.steelthread import SteelThread, OfflineEvalConfig

SteelThread().run_offline(
    portia,
    OfflineEvalConfig(
        data_set_name="offline_eval_v1",
        config=config,
        iterations=3,
    ),
)
```
With the stubbed tool in place, your offline evals will be clean, fast, and reproducible.
### ✅ Best Practices
- ✅ Stub only the tools that matter for evaluation
- ✅ Use consistent return types (e.g. the same as the real tool)
- ✅ Use `tool_call_index` if you want per-run variance
- ✅ Combine stubbing with assertions to detect misuse (e.g. a tool called too many times), as sketched below
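One way to implement that last check is to wrap a stub in a counting guard that fails loudly past a threshold. The `make_limited_stub` helper below is a hypothetical sketch, not a Steel Thread API:

```python
from typing import Any, Callable

from portia import ToolRunContext

StubFn = Callable[[int, ToolRunContext, tuple, dict], Any]


def make_limited_stub(stub_fn: StubFn, max_calls: int = 3) -> StubFn:
    """Wrap a stub so the eval fails if the agent calls the tool too often."""
    count = 0

    def limited_stub(
        tool_call_index: int,
        ctx: ToolRunContext,
        args: tuple,
        kwargs: dict,
    ) -> Any:
        nonlocal count
        count += 1
        if count > max_calls:
            raise AssertionError(
                f"Tool called {count} times; expected at most {max_calls}"
            )
        return stub_fn(tool_call_index, ctx, args, kwargs)

    return limited_stub


# Usage: stubs={"weather_tool": make_limited_stub(weather_stub, max_calls=2)}
```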