π Getting Started with Steel Thread
Steel Thread lets you evaluate your agents β both during development and in production β using real data, real metrics, and minimal boilerplate.
This page walks you through getting started in three steps:
1οΈβ£ Sign Up on Portia Cloudβ
To use Steel Thread, you need data from your agent runs. This means integrating them with Portia Cloud.
- Go to app.portialabs.ai
- Log in or create a new account
- Create a new API Key.
- Run your agent making sure to enable the Portia Cloud integration by setting an API key in the config.
- These runs β plans, tool calls, outputs β will form the basis of your evaluation datasets
π‘ Every run you create in Portia can be used as an evaluation input β no extra data labeling required.
2οΈβ£ Create a Dataset to Evaluateβ
There are two types of datasets that can be created via the UI.
π¦ Offline Datasetβ
Use this to create a static, repeatable set of evals.
- In the Portia UI, create a new Offline Eval Set with a distinct name e.g.
offline_evals_v1
. - Add new test cases to the newly created dataset using the
Add to Dataset
wizard.
π Online Datasetβ
Use this to continuously evaluate production runs.
- In the Portia UI, create a new Online Eval Set with a distinct name e.g.
online_evals_v1
. - Test cases will be automatically sampled based on the config you provide.
- [n.b.] Online Datasets only sample data after creation, so you will need to generate new data after creating the dataset.
3οΈβ£ Run Evals Locally Using the SDKβ
First, install the Steel Thread SDK:
# Using pip
pip install steel-thread
# Using poetry
poetry add steel-thread
# Using uv
uv pip install steel-thread
Then set the correct env vars. At a minimum you need PORTIA_API_KEY
set and one LLM provider key (i.e. OPENAI_API_KEY
).
export OPENAI_API_KEY=""
export PORTIA_API_KEY=""
export ANTHROPIC_API_KEY=""
export MISTRAL_API_KEY=""
export GOOGLE_API_KEY=""
export AZURE_OPENAI_API_KEY=""
export AZURE_OPENAI_ENDPOINT=""
Then, run your evals:
π§ͺ Offline Exampleβ
from portia import Config, Portia
from steelthread.steelthread import SteelThread, OfflineEvalConfig
config = Config.from_default()
SteelThread().run_offline(
Portia(config),
OfflineEvalConfig(
data_set_name="offline_evals_v1",
config=config,
iterations=3,
),
)
π Online Exampleβ
from portia import Config
from steelthread.steelthread import SteelThread, OnlineEvalConfig
config = Config.from_default()
SteelThread().run_online(
OnlineEvalConfig(
data_set_name="prod_online_evals",
config=config,
)
)
β Youβre Set!β
Once you're running evals, you can:
- View metrics in your terminal or save them to dashboards
- Catch regressions before they ship
- Monitor live agent quality in production
- Iterate faster with confidence