# Introducing Steel Thread
Steel Thread is a lightweight, extensible framework for evaluating LLM agents, designed to help teams measure quality, catch regressions, and improve performance with minimal friction.
It supports offline and online evals, integrates deeply with Portia Cloud, and is built from the ground up for real-world agentic workflows.
## Why We Built Steel Thread
Evaluating agents isn't hard because the models are bad; it's hard because:
- The output space is non-deterministic
- The tool usage is complex and multi-step
- The definition of "correct" is subjective
- And most of all: curating test data is painful
We found that most eval frameworks fall down not on logic or metrics, but on data. They assume someone else is maintaining clean eval datasets.
That's the bottleneck.
So we flipped the problem on its head.
Instead of asking teams to build new datasets from scratch, Steel Thread plugs directly into the data you already generate in Portia Cloud:
- Plans
- Plan Runs
- Tool Calls
- User IDs
- Metadata and outputs
Now, every agent execution can become an eval, either retrospectively or in real time.
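To illustrate the idea (with hypothetical class and field names, not Steel Thread's actual API), a recorded production run already carries everything an eval case needs: the plan that ran, the output it produced, and the context around it.

```python
from dataclasses import dataclass, field


# Hypothetical record shapes -- illustrative only, not Steel Thread's real classes.
@dataclass
class PlanRunRecord:
    plan_id: str
    plan_run_id: str
    user_id: str
    tool_calls: list = field(default_factory=list)
    final_output: str = ""
    metadata: dict = field(default_factory=dict)


@dataclass
class EvalCase:
    """An eval case derived directly from a recorded production run."""

    input_plan_id: str
    reference_output: str
    tags: dict


def run_to_eval_case(run: PlanRunRecord) -> EvalCase:
    # The recorded output becomes the reference answer;
    # user and metadata become tags for slicing results later.
    return EvalCase(
        input_plan_id=run.plan_id,
        reference_output=run.final_output,
        tags={"user": run.user_id, **run.metadata},
    )
```

The point is that no one has to hand-author the case: the production trace is the dataset.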
## What Does It Do?
Steel Thread helps you answer the question:
"Is my agent getting better or worse?"
It does this by providing:
### Offline Evals
Run against curated static datasets. Useful for:
- Iterating on prompts
- Testing new toolchains
- Benchmarking models
- Catching regressions before deployment
### Online Evals
Run against your live or recent production runs. Useful for:
- Monitoring quality in real usage
- Tracking performance across time or model changes
- Detecting silent failures
### Custom Metrics
Use rules, thresholds, or even LLMs-as-judges to compute:
- Accuracy
- Completeness
- Clarity
- Efficiency
- Latency
- Tool usage
- ...or domain-specific checks
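A rule-based metric from that list can be very small. Here is a sketch of a latency check (the `Metric` shape and threshold logic are illustrative assumptions, not Steel Thread's actual API):

```python
from dataclasses import dataclass


# Illustrative metric container -- not Steel Thread's real class.
@dataclass
class Metric:
    name: str
    score: float
    description: str


def latency_metric(latency_ms: float, budget_ms: float = 2000.0) -> Metric:
    # Full score when the run finishes within budget,
    # scaled down linearly once it exceeds the budget.
    score = min(1.0, budget_ms / latency_ms) if latency_ms > 0 else 1.0
    return Metric(
        name="latency",
        score=round(score, 2),
        description=f"{latency_ms:.0f}ms against a {budget_ms:.0f}ms budget",
    )
```

An LLM-as-judge metric would have the same shape, just with a model call in place of the arithmetic.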
## Built for Portia
Steel Thread is deeply integrated with the Portia agentic platform.
It works natively with:
- Plan and PlanRun IDs
- ToolCall metadata
- End user context
- Agent outputs (e.g. final outputs, intermediate values)
- APIs and UI features in Portia Cloud
This means you don't need to create new test harnesses or annotate synthetic datasets; you can evaluate what's already happening.
Just point Steel Thread at your Portia instance, and start measuring.
## Flexible & Extensible
Steel Thread is designed to be modular:
- Drop in custom metrics
- Stub or override tool behavior
- Run in CI or ad hoc from the CLI
- Mix and match online and offline strategies
- Save metrics wherever you like: log, database, dashboard
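The "save metrics wherever you like" point is the kind of thing a minimal sink interface makes easy. This sketch uses hypothetical names (not the framework's real API) to show the shape: any destination with a `write` method works.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)


# Hypothetical sink protocol: anything with a write(metric_dict) method.
class LogSink:
    """Emits each metric as a structured log line."""

    def write(self, metric: dict) -> None:
        logging.info("metric %s", json.dumps(metric))


class InMemorySink:
    """Collects metrics for later inspection, e.g. in a test or a dashboard job."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def write(self, metric: dict) -> None:
        self.records.append(metric)


def publish(metric: dict, sinks: list) -> None:
    # Fan the same metric out to every configured destination.
    for sink in sinks:
        sink.write(metric)
```

Swapping in a database or dashboard client is then just another sink class.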
It plays well with teams at any stage of maturity, whether you're just getting started with agents or deploying them in production.
## Get Started
Once installed, you can start running evals in just a few lines:

```python
from steelthread.steelthread import SteelThread, OnlineEvalConfig
from portia import Config

config = Config.from_default()
SteelThread().run_online(
    OnlineEvalConfig(data_set_name="prod-evals", config=config)
)
```
Or, define a custom offline dataset with your own metrics and stubs:

```python
from steelthread.offline_evaluators.evaluator import OfflineEvaluator
from steelthread.metrics.metric import Metric


class MyEvaluator(OfflineEvaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        return Metric(name="custom", score=1.0, description="Always passes!")
```
## Why It's Different
Steel Thread isn't just another eval runner. It's an opinionated framework focused on:
- Using your real production data
- Supporting deep introspection into agent behavior
- Making evals easy to write and easy to trust
We believe the best way to scale intelligent agents is not just to deploy them, but to hold them accountable.
Steel Thread helps you do just that.