## 🧪 What Are Evals?
As language models and agentic systems become more powerful, evaluating their behavior becomes increasingly critical.
## 💻 Offline vs Online Evals
When working with LLM-based systems and agents, it's not enough to evaluate performance once: you need ongoing, structured feedback. This is where offline and online evals come in.
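To make the distinction concrete, here is a minimal sketch, assuming a simple `run_agent` entry point and an exact-match scorer (both hypothetical stand-ins, not Steel Thread's API): offline evals score the agent against a fixed golden dataset on demand, while online evals score a sample of live traffic, usually without ground-truth answers.

```python
import random

# Hypothetical agent entry point; stands in for whatever your agent exposes.
def run_agent(query: str) -> str:
    return f"stub answer to: {query}"

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# Offline eval: a fixed, curated dataset, run on demand (e.g. in CI).
golden_set = [
    {"query": "What is 2 + 2?", "expected": "stub answer to: What is 2 + 2?"},
]
scores = [exact_match(run_agent(c["query"]), c["expected"]) for c in golden_set]
print("offline avg score:", sum(scores) / len(scores))

# Online eval: sample live traffic and score it continuously, usually
# without ground truth, so checks must be reference-free.
production_queries = ["How do I reset my password?", "What is the refund policy?"]
for query in production_queries:
    if random.random() < 0.5:  # e.g. sample 50% of live runs
        output = run_agent(query)
        print(query, "->", "ok" if output.strip() else "empty output")
```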
## 🧵 Introducing Steel Thread
Steel Thread is a lightweight, extensible framework for evaluating LLM agents, designed to help teams measure quality, catch regressions, and improve performance with minimal friction.
## Getting Started with Steel Thread
Steel Thread lets you evaluate your agents, both during development and in production, using real data, real metrics, and minimal boilerplate.
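As a sketch of the workflow this implies (the `EvalCase`, `EvalResult`, and `run_evals` names are illustrative assumptions, not Steel Thread's confirmed entry points), the core loop is: run each case through your agent, score the output, and collect the results.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names: not Steel Thread's actual API.

@dataclass
class EvalCase:
    query: str
    expected: str

@dataclass
class EvalResult:
    case: EvalCase
    output: str
    score: float

def run_evals(agent: Callable[[str], str],
              cases: list[EvalCase],
              scorer: Callable[[str, str], float]) -> list[EvalResult]:
    """Run every case through the agent and score each output."""
    results = []
    for case in cases:
        output = agent(case.query)
        results.append(EvalResult(case, output, scorer(output, case.expected)))
    return results

# Usage: a stub agent plus an exact-match scorer.
agent = lambda q: "4" if "2 + 2" in q else "unknown"
cases = [EvalCase("What is 2 + 2?", "4")]
for result in run_evals(agent, cases, lambda out, exp: float(out == exp)):
    print(result.case.query, "->", result.output, "| score:", result.score)
```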
## Default Evaluators
Steel Thread provides two built-in evaluators to help you get started quickly.
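The two built-ins aren't named here, so rather than guess at them, the sketch below is a generic illustration of the two shapes default evaluators commonly take: a deterministic, reference-based comparison and a model-graded ("LLM judge") check, with the grader call mocked so the snippet runs offline.

```python
# Illustrative only: these are NOT Steel Thread's built-in evaluators.

def deterministic_eval(output: str, expected: str) -> float:
    """Reference-based: compare against a known-good answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def judge_eval(output: str, rubric: str) -> float:
    """Reference-free: ask a grader model to score against a rubric."""
    def mock_grader(prompt: str) -> str:
        return "0.8"  # a real implementation would call an LLM here
    return float(mock_grader(f"Rubric: {rubric}\nOutput: {output}\nScore 0-1:"))

print(deterministic_eval("Paris", " paris "))                               # 1.0
print(judge_eval("Paris is the capital of France.", "factually accurate"))  # 0.8
```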
## Writing Custom Evaluators
Steel Thread makes it easy to define your own logic for evaluating agent runs.
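A custom evaluator typically just implements a scoring hook over a completed agent run. The base class, hook name, and `AgentRun` fields below are assumptions for illustration, not Steel Thread's confirmed interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

# Hypothetical interface; Steel Thread's real class and field names may differ.

@dataclass
class AgentRun:
    query: str
    output: str
    tool_calls: list[str] = field(default_factory=list)

class Evaluator(ABC):
    @abstractmethod
    def score(self, run: AgentRun) -> float: ...

class NoForbiddenToolsEvaluator(Evaluator):
    """Custom logic: fail any run that invoked a tool on the deny list."""
    def __init__(self, forbidden: set[str]):
        self.forbidden = forbidden

    def score(self, run: AgentRun) -> float:
        return 0.0 if self.forbidden & set(run.tool_calls) else 1.0

run = AgentRun("delete my account", "Done.", tool_calls=["delete_user"])
print(NoForbiddenToolsEvaluator({"delete_user"}).score(run))  # 0.0
```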
## 🛠️ Tool Stubbing for Reliable Evals
When running offline evals, your agent may call tools like `weather`, `search`, or `lookup_customer`. If those tools hit live systems, you'll get non-deterministic results, which can make evaluation noisy and inconsistent.
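A minimal sketch of the idea, using the tool names above (the registry-swap mechanism is an illustrative assumption, not Steel Thread's actual stubbing API): keep a live tool registry for production and a stubbed one that returns fixed, canned responses for offline evals.

```python
from typing import Callable

# Live tools hit external systems and can answer differently on every run.
live_tools: dict[str, Callable[[str], str]] = {
    "weather": lambda city: "whatever the weather API returns right now",
    "search": lambda q: "whatever the search index returns right now",
    "lookup_customer": lambda cid: "whatever the CRM returns right now",
}

# Stubs return canned responses so every eval run sees identical inputs.
stubbed_tools: dict[str, Callable[[str], str]] = {
    "weather": lambda city: f"Sunny, 22C in {city}",
    "search": lambda q: f"Top result for '{q}': example.com/docs",
    "lookup_customer": lambda cid: f"Customer {cid}: Jane Doe, premium tier",
}

def call_tool(tools: dict[str, Callable[[str], str]], name: str, arg: str) -> str:
    return tools[name](arg)

# During offline evals, hand the agent the stubbed registry instead of the
# live one; tool outputs are then deterministic across runs.
print(call_tool(stubbed_tools, "weather", "Paris"))
print(call_tool(stubbed_tools, "lookup_customer", "c_123"))
```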