## 🧪 What Are Evals?
As language models and agentic systems become more powerful, evaluating their behavior becomes increasingly critical.
## 💻 Offline vs Online Evals
When working with LLM-based systems and agents, it's not enough to evaluate performance once: you need ongoing, structured feedback. This is where offline and online evals come in.
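To make the distinction concrete, here is a minimal sketch, assuming a simple `run_agent` entry point and an exact-match scorer (both hypothetical stand-ins, not Steel Thread's API): offline evals score the agent against a fixed golden dataset on demand, while online evals score a sample of live traffic, usually without ground-truth answers.

```python
import random

# Hypothetical agent entry point; stands in for whatever your agent exposes.
def run_agent(query: str) -> str:
    return f"stub answer to: {query}"

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# Offline eval: a fixed, curated dataset, run on demand (e.g. in CI).
golden_set = [
    {"query": "What is 2 + 2?", "expected": "stub answer to: What is 2 + 2?"},
]
scores = [exact_match(run_agent(c["query"]), c["expected"]) for c in golden_set]
print("offline avg score:", sum(scores) / len(scores))

# Online eval: sample live traffic and score it continuously, usually
# without ground truth, so checks must be reference-free.
production_queries = ["How do I reset my password?", "What is the refund policy?"]
for query in production_queries:
    if random.random() < 0.5:  # e.g. sample 50% of live runs
        output = run_agent(query)
        print(query, "->", "ok" if output.strip() else "empty output")
```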
## 🧵 Introducing Steel Thread
Steel Thread is a lightweight, extensible framework for evaluating LLM agents, designed to help teams measure quality, catch regressions, and improve performance with minimal friction.
## Getting Started with Steel Thread
Steel Thread lets you evaluate your agents, both during development and in production, using real data, real metrics, and minimal boilerplate.
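As a sketch of the workflow this implies (the `EvalCase`, `EvalResult`, and `run_evals` names are illustrative assumptions, not Steel Thread's confirmed entry points), the core loop is: run each case through your agent, score the output, and collect the results.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names: not Steel Thread's actual API.

@dataclass
class EvalCase:
    query: str
    expected: str

@dataclass
class EvalResult:
    case: EvalCase
    output: str
    score: float

def run_evals(agent: Callable[[str], str],
              cases: list[EvalCase],
              scorer: Callable[[str, str], float]) -> list[EvalResult]:
    """Run every case through the agent and score each output."""
    results = []
    for case in cases:
        output = agent(case.query)
        results.append(EvalResult(case, output, scorer(output, case.expected)))
    return results

# Usage: a stub agent plus an exact-match scorer.
agent = lambda q: "4" if "2 + 2" in q else "unknown"
cases = [EvalCase("What is 2 + 2?", "4")]
for result in run_evals(agent, cases, lambda out, exp: float(out == exp)):
    print(result.case.query, "->", result.output, "| score:", result.score)
```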
## Default Evaluators
Steel Thread provides two built-in evaluators to help you get started quickly.
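The two built-ins aren't named here, so rather than guess at them, the sketch below is a generic illustration of the two shapes default evaluators commonly take: a deterministic, reference-based comparison and a model-graded ("LLM judge") check, with the grader call mocked so the snippet runs offline.

```python
# Illustrative only: these are NOT Steel Thread's built-in evaluators.

def deterministic_eval(output: str, expected: str) -> float:
    """Reference-based: compare against a known-good answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def judge_eval(output: str, rubric: str) -> float:
    """Reference-free: ask a grader model to score against a rubric."""
    def mock_grader(prompt: str) -> str:
        return "0.8"  # a real implementation would call an LLM here
    return float(mock_grader(f"Rubric: {rubric}\nOutput: {output}\nScore 0-1:"))

print(deterministic_eval("Paris", " paris "))                               # 1.0
print(judge_eval("Paris is the capital of France.", "factually accurate"))  # 0.8
```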
## Writing Custom Evaluators
Steel Thread makes it easy to define your own logic for evaluating agent runs.
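A custom evaluator typically just implements a scoring hook over a completed agent run. The base class, hook name, and `AgentRun` fields below are assumptions for illustration, not Steel Thread's confirmed interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical interface; Steel Thread's real class and field names may differ.

@dataclass
class AgentRun:
    query: str
    output: str
    tool_calls: list[str] = field(default_factory=list)

class Evaluator(ABC):
    @abstractmethod
    def score(self, run: AgentRun) -> float: ...

class NoForbiddenToolsEvaluator(Evaluator):
    """Custom logic: fail any run that invoked a tool on the deny list."""
    def __init__(self, forbidden: set[str]):
        self.forbidden = forbidden

    def score(self, run: AgentRun) -> float:
        return 0.0 if self.forbidden & set(run.tool_calls) else 1.0

run = AgentRun("delete my account", "Done.", tool_calls=["delete_user"])
print(NoForbiddenToolsEvaluator({"delete_user"}).score(run))  # 0.0
```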
## 🛠️ Tool Stubbing for Reliable Evals
When running offline evals, your agent may call tools like `weather`, `search`, or `lookup_customer`. If those tools hit live systems, you'll get non-deterministic results, which can make evaluation noisy and inconsistent.
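A minimal sketch of the idea, using the tool names above (the registry-swap mechanism is an illustrative assumption, not Steel Thread's actual stubbing API): keep a live tool registry for production and a stubbed one that returns fixed, canned responses for offline evals.

```python
from typing import Callable

# Live tools hit external systems and can answer differently on every run.
live_tools: dict[str, Callable[[str], str]] = {
    "weather": lambda city: "whatever the weather API returns right now",
    "search": lambda q: "whatever the search index returns right now",
    "lookup_customer": lambda cid: "whatever the CRM returns right now",
}

# Stubs return canned responses so every eval run sees identical inputs.
stubbed_tools: dict[str, Callable[[str], str]] = {
    "weather": lambda city: f"Sunny, 22C in {city}",
    "search": lambda q: f"Top result for '{q}': example.com/docs",
    "lookup_customer": lambda cid: f"Customer {cid}: Jane Doe, premium tier",
}

def call_tool(tools: dict[str, Callable[[str], str]], name: str, arg: str) -> str:
    return tools[name](arg)

# During offline evals, hand the agent the stubbed registry instead of the
# live one; tool outputs are then deterministic across runs.
print(call_tool(stubbed_tools, "weather", "Paris"))
print(call_tool(stubbed_tools, "lookup_customer", "c_123"))
```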