# Default Evaluators
Steel Thread provides two built-in evaluators to help you get started quickly:
- `LLMJudgeOnlineEvaluator` for online evaluation using LLM-as-Judge
- `DefaultOfflineEvaluator` for assertion-based offline testing
These evaluators provide baseline capabilities for scoring the quality of your agents' plans and executions, and serve as a foundation you can extend with custom evaluators.
## Online: `LLMJudgeOnlineEvaluator`
This evaluator uses a Large Language Model (LLM) as a judge to assess the quality of:
- a Plan (before execution)
- a PlanRun (after execution)
### When to Use It
Use this evaluator if:
- You want subjective scoring based on high-level properties like clarity or correctness.
- You're monitoring production behavior using Online Evals.
### Scored Metrics
#### For Plans

| Metric | Description |
|---|---|
| correctness | Are the steps logically valid? |
| completeness | Are all necessary steps included? |
| clearness | Are the steps clearly written and easy to follow? |
#### For PlanRuns

| Metric | Description |
|---|---|
| success | Did the run accomplish its intended goal? |
| efficiency | Were the steps necessary and minimal? |
These metrics are scored by passing your plan or run JSON to an LLM and asking it to rate each one.
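To make the mechanism concrete, here is a minimal sketch of the general LLM-as-judge pattern. The prompt wording and the `build_judge_prompt` / `parse_judge_response` helpers are illustrative only, not Steel Thread's internal implementation:

```python
import json


def build_judge_prompt(plan_json: dict) -> str:
    """Ask an LLM to score a plan on correctness, completeness, and clearness (0-1)."""
    return (
        "You are evaluating an agent plan. Score each metric from 0 to 1 and reply "
        'with JSON of the form {"correctness": ..., "completeness": ..., "clearness": ...}.\n\n'
        f"Plan:\n{json.dumps(plan_json, indent=2)}"
    )


def parse_judge_response(response_text: str) -> dict:
    """Parse the judge's JSON reply into a metric-name -> score mapping."""
    return {name: float(score) for name, score in json.loads(response_text).items()}


# Build the prompt from plan JSON, then parse a (canned) judge reply.
prompt = build_judge_prompt({"steps": [{"task": "Search flights"}, {"task": "Book the cheapest option"}]})
scores = parse_judge_response('{"correctness": 0.9, "completeness": 0.8, "clearness": 1.0}')
print(scores)  # {'correctness': 0.9, 'completeness': 0.8, 'clearness': 1.0}
```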
### Example
```python
from steelthread.online_evaluators.llm_as_judge import LLMJudgeOnlineEvaluator

evaluator = LLMJudgeOnlineEvaluator(config)

# Score a plan before execution
plan_metrics = evaluator.eval_plan(plan)

# Score a plan run after execution
run_metrics = evaluator.eval_plan_run(plan, plan_run)
```
## Offline: `DefaultOfflineEvaluator`
Offline evaluation is assertion-based. You define what should happen in each test case, and the `DefaultOfflineEvaluator` checks whether that actually occurred.
### When to Use It
Use this when:
- You want precise, rule-based tests (like latency thresholds or tool usage).
- You're running Offline Evals against fixed datasets.
### Supported Assertion Types
| Assertion Type | Description |
|---|---|
| outcome | Checks whether the final status matches an expected value (e.g. COMPLETE). |
| final_output | Compares the final output to an expected string, either exactly or partially, or via an LLM judge. |
| latency | Compares latency against a threshold using normalized scoring. |
| tool_calls | Verifies which tools were or weren't used during the run. |
| custom | Allows additional user-defined metadata for evaluators to interpret. |
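To illustrate how these assertion types might appear together, here is a hypothetical test case. The field names below are illustrative and do not reflect Steel Thread's actual test-case schema; your Offline Evals dataset defines the real structure:

```python
# Hypothetical test-case definition; field names are illustrative only.
example_test_case = {
    "input": "Book the cheapest flight to Lisbon next Friday",
    "assertions": [
        {"type": "outcome", "expected": "COMPLETE"},
        {"type": "final_output", "mode": "partial_match", "expected": "Lisbon"},
        {"type": "latency", "threshold_seconds": 30},
        {"type": "tool_calls", "expected": ["flight_search"], "forbidden": ["send_email"]},
        {"type": "custom", "metadata": {"suite": "travel-agent"}},
    ],
}
```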
### Scoring Logic
- **Outcome**: 1.0 if the status matches, else 0.0
- **Final Output**:
  - `exact_match`: strict equality
  - `partial_match`: the expected string must be a substring of the output
  - `llm_judge`: an LLM rates similarity
- **Latency**: uses the normalized difference between actual and expected latency (see the sketch after this list)
- **Tool Calls**: penalizes missing or unexpected tool invocations
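The exact formulas aren't spelled out here, so the sketch below shows one plausible interpretation of the latency and tool-call rules. The `latency_score` and `tool_call_score` helpers are illustrative names, not Steel Thread APIs:

```python
# Illustrative scoring helpers, assuming one plausible reading of the rules above.


def latency_score(actual_s: float, expected_s: float) -> float:
    """Score 1.0 when actual <= expected, decaying toward 0 as latency exceeds the threshold."""
    if actual_s <= expected_s:
        return 1.0
    # Normalized difference: how far over the threshold we are, relative to the threshold.
    overshoot = (actual_s - expected_s) / expected_s
    return max(0.0, 1.0 - overshoot)


def tool_call_score(expected: set, actual: set) -> float:
    """Penalize both missing expected tools and unexpected extra tools."""
    if not expected and not actual:
        return 1.0
    missing = expected - actual
    unexpected = actual - expected
    penalties = len(missing) + len(unexpected)
    return max(0.0, 1.0 - penalties / max(len(expected | actual), 1))


print(latency_score(45.0, 30.0))                          # 0.5
print(tool_call_score({"search"}, {"search", "email"}))   # 0.5
```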
### Example
```python
from steelthread.offline_evaluators.default_evaluator import DefaultOfflineEvaluator

evaluator = DefaultOfflineEvaluator(config)

# Score a single test case against its plan and plan run
metrics = evaluator.eval_test_case(test_case, plan, plan_run, metadata)
```
## How It Fits in Eval Runs
These evaluators are the default components for Steel Thread's `OfflineEvalConfig` and `OnlineEvalConfig`.
You can override them with your own evaluators, or chain multiple evaluators together for deeper analysis.
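As a rough sketch of chaining, the wrapper below runs several online evaluators and collects their metrics. The wrapper itself is hypothetical and assumes each evaluator returns a list of metric objects; only the `eval_plan` / `eval_plan_run` method names come from the examples above:

```python
# Hypothetical chaining wrapper; not part of Steel Thread's API.


class ChainedOnlineEvaluator:
    def __init__(self, evaluators):
        self.evaluators = evaluators

    def eval_plan(self, plan):
        # Run every evaluator on the plan and flatten the resulting metrics.
        return [m for e in self.evaluators for m in e.eval_plan(plan)]

    def eval_plan_run(self, plan, plan_run):
        # Same idea for plan runs.
        return [m for e in self.evaluators for m in e.eval_plan_run(plan, plan_run)]


# Usage (assuming `config` and `my_custom_evaluator` already exist):
# chained = ChainedOnlineEvaluator([LLMJudgeOnlineEvaluator(config), my_custom_evaluator])
# metrics = chained.eval_plan(plan)
```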