
📊 Default Evaluators

Steel Thread provides two built-in evaluators to help you get started quickly:

  • 🧠 LLMJudgeOnlineEvaluator for online evaluation using LLM-as-Judge
  • 🧪 DefaultOfflineEvaluator for assertion-based offline testing

These evaluators provide baseline capabilities for scoring the quality of your agents' plans and executions, and they serve as a foundation you can extend with custom evaluators.


🧠 Online: LLMJudgeOnlineEvaluator

This evaluator uses a Large Language Model (LLM) as a judge to assess the quality of:

  • a Plan (before execution)
  • a PlanRun (after execution)

✅ When to Use It

Use this evaluator if:

  • You want subjective scoring based on high-level properties like clarity or correctness.
  • You're monitoring production behavior using Online Evals.

✍️ Scored Metrics

For Plans:

| Metric | Description |
| --- | --- |
| correctness | Are the steps logically valid? |
| completeness | Are all necessary steps included? |
| clearness | Are the steps clearly written and easy to follow? |

For PlanRuns:

| Metric | Description |
| --- | --- |
| success | Did the run accomplish its intended goal? |
| efficiency | Were the steps necessary and minimal? |
These metrics are scored by serializing your plan or plan run to JSON, passing it to an LLM judge, and asking the model to rate each metric.
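
Conceptually, the approach looks roughly like the sketch below. This is an illustrative approximation only, not Steel Thread's actual prompt or internals; the prompt wording and the call_llm helper are assumptions.

import json

# Illustrative sketch of LLM-as-judge plan scoring (not Steel Thread's implementation).
# `call_llm` is a placeholder for whatever LLM client you use, and `plan` is assumed
# to be JSON-serializable here.
PLAN_JUDGE_PROMPT = """You are grading an agent plan.
Rate each metric from 0.0 to 1.0 and reply with JSON:
{{"correctness": ..., "completeness": ..., "clearness": ...}}

Plan:
{plan_json}
"""

def judge_plan(plan, call_llm) -> dict:
    prompt = PLAN_JUDGE_PROMPT.format(plan_json=json.dumps(plan, indent=2))
    response = call_llm(prompt)   # the judge model's raw text reply
    return json.loads(response)   # e.g. {"correctness": 0.9, "completeness": 0.8, "clearness": 1.0}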

🧪 Example

from steelthread.online_evaluators.llm_as_judge import LLMJudgeOnlineEvaluator

# `config` is your Steel Thread configuration; `plan` and `plan_run` come from
# the agent run you want to evaluate.
evaluator = LLMJudgeOnlineEvaluator(config)
plan_metrics = evaluator.eval_plan(plan)  # correctness, completeness, clearness
run_metrics = evaluator.eval_plan_run(plan, plan_run)  # success, efficiency

🧪 Offline: DefaultOfflineEvaluator

Offline evaluation is assertion-based. You define what should happen in each test case, and the DefaultOfflineEvaluator checks whether that actually occurred.

✅ When to Use It

Use this when:

  • You want precise, rule-based tests (like latency thresholds or tool usage).
  • You're running Offline Evals against fixed datasets.

📝 Supported Assertion Types

| Assertion Type | Description |
| --- | --- |
| outcome | Checks whether the final status matches an expected value (e.g. COMPLETE). |
| final_output | Compares the final output to an expected string: exact match, partial (substring) match, or an LLM-judged comparison. |
| latency | Compares latency against a threshold using normalized scoring. |
| tool_calls | Verifies which tools were or weren't used during the run. |
| custom | Allows additional user-defined metadata for evaluators to interpret. |
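
To make these assertion types concrete, a test case might pair an input with a set of expected behaviors, along the lines of the hypothetical example below. The field names and structure are illustrative assumptions, not Steel Thread's actual dataset schema.

# Hypothetical test case showing one assertion of each supported type.
# The keys used here are assumptions for illustration only.
test_case = {
    "input": "Summarise the latest sales report and email it to the team.",
    "assertions": [
        {"type": "outcome", "expected": "COMPLETE"},
        {"type": "final_output", "mode": "partial_match", "expected": "sales summary"},
        {"type": "latency", "threshold_ms": 5000},
        {"type": "tool_calls", "expected": ["report_reader", "email_sender"]},
        {"type": "custom", "metadata": {"reviewer": "qa-team"}},
    ],
}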

🧮 Scoring Logic

  • Outcome: 1.0 if the final status matches, else 0.0
  • Final Output:
    • exact_match: strict equality with the expected string
    • partial_match: the expected string must appear as a substring of the final output
    • llm_judge: an LLM rates the similarity between the actual and expected output
  • Latency: uses the normalized difference between actual and expected latency
  • Tool Calls: penalizes missing or unexpected tool invocations
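
The snippet below sketches one way these rules could be implemented. It is a simplified illustration of the scoring logic described above, not the library's code; in particular, the exact latency normalization and tool-call penalty Steel Thread applies may differ.

def score_outcome(actual_status: str, expected_status: str) -> float:
    # Outcome: all-or-nothing match on the final status.
    return 1.0 if actual_status == expected_status else 0.0

def score_final_output(actual: str, expected: str, mode: str = "exact_match") -> float:
    # Final output: strict equality or substring check; llm_judge would instead
    # ask an LLM to rate similarity.
    if mode == "exact_match":
        return 1.0 if actual == expected else 0.0
    if mode == "partial_match":
        return 1.0 if expected in actual else 0.0
    raise NotImplementedError("llm_judge delegates scoring to an LLM")

def score_latency(actual_ms: float, expected_ms: float) -> float:
    # Latency: full score at or below the threshold, decaying as the
    # normalized overshoot grows (one plausible normalization).
    if actual_ms <= expected_ms:
        return 1.0
    return max(0.0, 1.0 - (actual_ms - expected_ms) / expected_ms)

def score_tool_calls(actual: set, expected: set) -> float:
    # Tool calls: penalize both missing and unexpected tool invocations.
    missing = expected - actual
    unexpected = actual - expected
    total = len(expected | actual) or 1
    return max(0.0, 1.0 - (len(missing) + len(unexpected)) / total)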

🧪 Example

from steelthread.offline_evaluators.default_evaluator import DefaultOfflineEvaluator

# `test_case` is the dataset entry whose assertions should be checked;
# `metadata` carries additional run details for custom assertions.
evaluator = DefaultOfflineEvaluator(config)
metrics = evaluator.eval_test_case(test_case, plan, plan_run, metadata)

🔧 How It Fits in Eval Runs

These evaluators are the default components for Steel Thread's OfflineEvalConfig and OnlineEvalConfig.

You can override them with your own evaluators, or chain multiple evaluators together for deeper analysis.
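
For instance, a custom offline evaluator could subclass the default so it keeps the assertion checks and layers extra metrics on top. The sketch below is hypothetical: it reuses the eval_test_case signature shown in the example above, but how evaluators are registered on OfflineEvalConfig may differ from what is implied here.

from steelthread.offline_evaluators.default_evaluator import DefaultOfflineEvaluator

# Hypothetical custom evaluator that extends the default assertion checks.
class BrandToneEvaluator(DefaultOfflineEvaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        metrics = super().eval_test_case(test_case, plan, plan_run, metadata)
        # Add domain-specific checks here (e.g. scoring output tone or formatting)
        # and append them to the returned metrics.
        return metrics

If your eval config accepts a list of evaluators, registering both the default and a custom evaluator like this lets every test case be scored by each.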