How we build AI agents that hold up in production
A demo agent and a production agent share almost no code in spirit. The demo answers a happy-path prompt in front of a friendly audience. The production agent runs unattended, against real data, with real consequences when it gets something wrong. Here is the architecture we use to get from one to the other.
Most agent prototypes work on the first try and then quietly fall apart the moment they meet reality: a tool returns an error the model has never seen, a user pastes something adversarial, a retry storm doubles your bill, and nobody can explain why the agent did what it did three steps ago. None of that is an "LLM problem." It is a systems problem, and it yields to the same engineering discipline we apply everywhere else. Below is how we structure agents we are willing to put our name on.
The control loop is the agent
An agent is not a prompt. It is a control loop wrapped around a model: perceive the current state, plan the next action, act by calling a tool, observe the result, and repeat until the goal is met or a limit is hit. Everything else — the prompt, the tools, the memory — is configuration of that loop. So we design the loop first and treat it as ordinary software with invariants we can reason about.
The single most important invariant is that the loop is bounded. An unbounded agent is a way to spend money and time you did not budget. We cap the number of iterations, the wall-clock duration, and the cumulative token spend per run, and we make exceeding any of them a defined, observable outcome rather than a hang.
    def run_agent(goal, tools, max_steps=12, token_budget=120_000):
        state = init_state(goal)
        spent = 0
        for step in range(max_steps):
            plan = model.decide(state)                 # perceive + plan
            spent += plan.tokens_used
            if spent > token_budget:
                return degrade(state, reason="budget")
            if plan.is_final:
                return finalize(plan)
            result = call_tool(plan.tool, plan.args)   # act
            state = observe(state, result)             # observe
        return degrade(state, reason="max_steps")      # bounded exit
Note that the loop as written has two exits that are not "success": budget and step-count. Both route to a deliberate degradation path, never to an exception that bubbles up as a 500. The wall-clock cap is left out of the snippet for brevity; it would be a third exit handled the same way. An agent that knows how to stop is worth more than an agent that knows one more trick.
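The loop above caps steps and tokens but not wall-clock time. Here is a minimal, self-contained sketch of how all three limits might sit together. The names `run_bounded`, `Plan`, `Outcome`, and `deadline_s` are illustrative assumptions, and `decide` stands in for the model call. The clock is injectable so the deadline path can be tested without waiting.

```python
import time
from dataclasses import dataclass


@dataclass
class Plan:
    tokens_used: int
    is_final: bool = False
    answer: str = ""


@dataclass
class Outcome:
    status: str       # "ok" or "degraded"
    reason: str = ""  # which limit fired, if any
    answer: str = ""
    steps: int = 0


def run_bounded(decide, max_steps=12, token_budget=120_000,
                deadline_s=30.0, clock=time.monotonic):
    """Call decide(step) until it returns a final plan or any limit is hit."""
    started = clock()
    spent = 0
    for step in range(max_steps):
        # Wall-clock cap: checked before each model call
        if clock() - started > deadline_s:
            return Outcome("degraded", "deadline", steps=step)
        plan = decide(step)
        spent += plan.tokens_used
        # Token cap: cumulative across the whole run
        if spent > token_budget:
            return Outcome("degraded", "budget", steps=step + 1)
        if plan.is_final:
            return Outcome("ok", answer=plan.answer, steps=step + 1)
    # Step cap: the loop ran out without a final answer
    return Outcome("degraded", "max_steps", steps=max_steps)
```

Every exit returns an `Outcome`, so the caller can log which limit fired and choose a fallback instead of catching an exception. Note that this only checks the deadline between steps; a single slow model or tool call would still need its own timeout.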
Designing tools the model can actually use
The model is only as good as the actions you give it, and tool design is where most of the real engineering lives. Our rules of thumb:
- Narrow over general. `refund_order(order_id, amount)` beats a do-everything `call_api(method, path, body)`. Narrow tools constrain the model toward correct behavior and make permissioning tractable.
- Typed inputs and outputs. Every tool has a strict schema. The model's arguments are validated before the tool runs; the tool's response is structured, not free prose.
- Idempotent where it can be. If a retry could fire the same tool twice, an idempotency key makes the second call a no-op. This is the difference between a flaky network and a double refund.
- Error messages written for the model. A tool error is not a stack trace; it is feedback the model will read and act on. "order_id not found; expected format ORD-XXXXXX" lets the agent self-correct. "KeyError" does not.
Treat the tool layer as your real API design problem. If a competent junior engineer would be confused by a tool's name, arguments, or error text, the model will be too — it just won't tell you. We review tool definitions with the same rigor as a public SDK.
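To make those rules concrete, here is a rough sketch of a narrow tool that validates its arguments, returns a structured result, and treats a repeated idempotency key as a no-op. Everything in it is an assumption for illustration: the `ToolResult` shape, the `amount_cents` and `idempotency_key` parameters, the in-memory stores, and the exact order ID pattern (inferred from the "ORD-XXXXXX" error text above).

```python
import re
from dataclasses import dataclass

# Assumed format: "ORD-" followed by six uppercase letters or digits
ORDER_ID = re.compile(r"ORD-[A-Z0-9]{6}")


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: dict
    error: str = ""  # written for the model to read and act on


# In-memory stand-ins; a real system would use durable storage
_completed: dict[str, ToolResult] = {}
refunds_issued: list[tuple[str, int]] = []


def refund_order(order_id: str, amount_cents: int, idempotency_key: str) -> ToolResult:
    # A repeated key returns the first result and does not refund again
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        return ToolResult(False, {}, "order_id is malformed; expected format ORD-XXXXXX")
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        return ToolResult(False, {}, "amount_cents must be a positive integer, e.g. 1999 for 19.99")
    refunds_issued.append((order_id, amount_cents))  # the side effect happens once
    result = ToolResult(True, {"order_id": order_id, "refunded_cents": amount_cents})
    _completed[idempotency_key] = result
    return result
```

Only successful calls are stored under the key here, so the model can fix a bad argument and retry with the same key. Whether failures should also be remembered is a design choice this sketch does not settle.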
Guardrails: assume everything is hostile
The security model for an agent is straightforward once you accept one premise: every byte that enters the model from outside the system prompt is untrusted. That includes the user's input, and — critically — the output of every tool. A web page the agent fetched, a row from a database, a support ticket: any of these can carry instructions aimed at hijacking the agent. Prompt injection is not an edge case; it is the default threat.
So we layer defenses:
- Input validation on the way in — length caps, encoding checks, and stripping of obvious control sequences.
- Output schema validation on the way out — the model's final answer must parse against a schema before anything downstream consumes it.
- Allowlists and permissioning — the agent can only call the tools its role grants, and sensitive tools (anything that moves money, deletes data, or sends external messages) sit behind explicit approval or a human in the loop.
- Tool output is data, never instructions. We fence retrieved content clearly and never let it silently expand the agent's permissions.
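    from typing import Literal

    from pydantic import BaseModel, ValidationError

    class OutputSchemaError(Exception):
        """Raised when the model's final answer does not match the schema."""
        def __init__(self, detail):
            super().__init__("model output failed schema validation")
            self.detail = detail

    class Resolution(BaseModel):
        action: Literal["refund", "replace", "escalate"]
        order_id: str
        reason: str

    def validate_output(raw: str) -> Resolution:
        try:
            return Resolution.model_validate_json(raw)
        except ValidationError as e:
            # hand the error back to the model for one repair attempt,
            # then fall back to human escalation if it still fails
            raise OutputSchemaError(detail=e.errors()) from e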
Schema validation does double duty: it catches malformed model output, and it is your last line of defense against an injected instruction producing an action you never authorized. If the action isn't one of three allowed verbs, it doesn't happen.
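The allowlist and fencing layers from the list above might look roughly like the following. This is a sketch under stated assumptions, not a complete defense: the role names, the `ROLE_TOOLS` and `SENSITIVE` tables, the `<tool_output>` delimiter, and the escaping rule are all hypothetical. Delimiters reduce the risk of retrieved text being read as instructions but do not eliminate it, which is why the permission check does not depend on them.

```python
# Hypothetical role-to-tool grants and the set of tools that need approval
ROLE_TOOLS = {
    "support_agent": {"lookup_order", "refund_order"},
    "readonly": {"lookup_order"},
}
SENSITIVE = {"refund_order"}


class ToolDenied(Exception):
    """Raised when a tool call is not permitted for the current role."""


def authorize(role: str, tool: str, approved: bool = False) -> None:
    # Allowlist: unknown roles and ungranted tools are denied by default
    if tool not in ROLE_TOOLS.get(role, set()):
        raise ToolDenied(f"{tool} is not granted to role {role}")
    # Sensitive tools also need an explicit approval flag from outside the model
    if tool in SENSITIVE and not approved:
        raise ToolDenied(f"{tool} requires explicit approval")


def fence(tool_name: str, content: str) -> str:
    """Wrap retrieved content in delimiters so the prompt marks it as data."""
    # Escape "<" so the content cannot close the fence early
    body = content.replace("<", "&lt;")
    return f'<tool_output name="{tool_name}">\n{body}\n</tool_output>'
```

The `approved` flag is meant to be set by the surrounding system (for example after a human clicks approve), never derived from model or tool output.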
Failure handling is most of the work
Tools fail. Models occasionally return garbage. Networks flake. A production agent treats failure as the normal case and has a planned response for each kind:
- Retries with backoff and jitter for transient errors — but only on idempotent operations, and with a hard cap so a failing dependency can't spiral into a retry storm.
- Fallbacks when a tool or model is unavailable: a cheaper model, a cached result, or a simpler deterministic path.
- Circuit breakers around flaky dependencies so the agent stops hammering a service that is already down and fails fast instead.
- Graceful degradation as the contract: when the agent cannot complete the goal, it returns a clear, partial, honest result and escalates — it never fabricates a confident answer to avoid admitting it is stuck.
The test we hold ourselves to: pull the plug on any single dependency and the agent should still behave defensibly. That mindset is the same one behind our Sweep iOS app, which ships with zero third-party dependencies, runs 100% on-device, and passes 23 of 23 tests — fewer moving parts, fewer ways to fail, every failure mode accounted for.
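As an illustration of the retry and circuit-breaker items above, here is a compact sketch using only the standard library. The names `TransientError`, `retry`, and `CircuitBreaker`, along with every default value, are assumptions chosen for the example. The jitter strategy shown is "full jitter": a random wait between zero and the capped exponential delay.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, 503s)."""


def retry(op, attempts=4, base_s=0.2, cap_s=5.0, sleep=time.sleep, rng=random.random):
    """Call op() up to `attempts` times. Only safe for idempotent operations."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # hard cap reached: surface the failure
            # Full jitter: random wait up to the capped exponential delay
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))


class CircuitOpen(Exception):
    """Raised instead of calling a dependency that is cooling down."""


class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures = 0
        self.opened_at = None

    def call(self, op):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpen("dependency is cooling down")  # fail fast
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

Because the failure count is only reset on success, a failed trial call after the cooldown reopens the circuit immediately.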
Cost and latency are design constraints, not afterthoughts
An agent that is correct but slow and expensive will not survive contact with a finance review. We treat cost and latency as first-class budgets, set at design time:
- Model tiering. Route easy steps to a small, fast model and reserve the frontier model for genuine reasoning. Most steps in a real agent are routing and formatting, not deep thought.
- Prompt caching. System prompts, tool definitions, and stable context get cached so you are not paying full freight to re-read the same tokens on every turn.
- Parallel tool calls. When the next actions are independent, fire them concurrently rather than serially. Latency is often dominated by round-trips, not compute.
- Token budgets per run, enforced in the loop as shown above, so a pathological case fails loudly and cheaply instead of silently and expensively.
The cheapest token is the one you never send. Before reaching for a bigger model, we ask whether the step needs the model at all — a lot of "agent" work is plumbing that a deterministic function does faster, cheaper, and more reliably.
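Two of the levers above, model tiering and parallel tool calls, can be sketched in a few lines. The tier names, the step kinds, and the fallback policy are hypothetical. `call_tool` here is a stand-in that sleeps instead of doing real I/O, so the example runs on its own.

```python
import asyncio

# Hypothetical tier names; the step kinds are illustrative, not a fixed taxonomy
TIER_BY_STEP = {"route": "small", "format": "small", "reason": "frontier"}


def pick_tier(step_kind: str) -> str:
    # Unknown step kinds fall back to the stronger model (an assumed policy)
    return TIER_BY_STEP.get(step_kind, "frontier")


async def call_tool(name: str, delay_s: float) -> dict:
    """Stand-in for a network-bound tool: sleeps instead of doing real I/O."""
    await asyncio.sleep(delay_s)
    return {"tool": name, "ok": True}


async def run_parallel(calls: list[tuple[str, float]]) -> list:
    # gather() preserves input order; return_exceptions=True returns failures
    # as values so one failed tool does not discard the other results
    return await asyncio.gather(
        *(call_tool(name, delay) for name, delay in calls),
        return_exceptions=True,
    )
```

With three independent calls of similar duration, `asyncio.run(run_parallel([...]))` should take roughly as long as the slowest call rather than the sum of all three. This only applies when the calls really are independent; a call that needs another call's result still has to wait for it.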
Evals are the real spec
You cannot improve what you cannot measure, and "it looked good when I tried it" is not measurement. For us, the eval set is the specification. We build an offline suite of representative inputs with expected outcomes — happy paths, known-hard cases, and the adversarial inputs that bit us in the past — and we score every change against it before it ships.
Those evals become regression gates in CI. A prompt tweak that fixes one case and quietly breaks five others gets caught before it reaches a user, not after. This is the discipline that turns prompt-tuning from folklore into engineering: every claim about the agent getting "better" is backed by a number that moved on a fixed test set.
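A regression gate of this kind can be very small. The sketch below assumes the agent can be called as a function from prompt to action string and that a baseline pass rate is stored from the last accepted run. `EvalCase`, `run_suite`, and `gate` are illustrative names, and exact-match scoring is the simplest possible grader, not a recommendation for every task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    expected_action: str


def run_suite(agent, cases):
    """Return the names of cases where the agent's action differs from the expected one."""
    return [c.name for c in cases if agent(c.prompt) != c.expected_action]


def gate(agent, cases, baseline_pass_rate):
    """Pass only if the pass rate has not dropped below the recorded baseline."""
    failed = run_suite(agent, cases)
    pass_rate = 1 - len(failed) / len(cases)
    return pass_rate >= baseline_pass_rate, pass_rate, failed
```

In CI, a failing gate would exit non-zero and print the failed case names, so a reviewer sees which cases regressed and not just that the number moved.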
Observability: if you can't replay it, you can't trust it
When an agent does something surprising in production, "the model decided to" is not an acceptable answer. We instrument every run so that any single execution can be reconstructed end to end: the inputs, every model call and its tokens, every tool invocation with arguments and results, the decisions at each step, and the final output. Each step carries a trace ID; the whole run is replayable.
That replay capability is what makes agents debuggable instead of mystical. It lets us reproduce a bad run deterministically, turn it into a new eval case, fix it, and prove the fix holds. Without it, you are not operating an agent — you are hoping at scale.
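One way to picture record-and-replay is a trace that stores each tool call with its arguments and result, plus a replayer that serves those results back in order. The `Trace`, `Step`, and `replay_tools` names and the JSON layout are assumptions for this sketch. It covers tool calls only; model outputs would need to be recorded the same way for a run to be fully reproducible.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    index: int
    tool: str
    args: dict
    result: dict


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list = field(default_factory=list)

    def record(self, tool: str, args: dict, result: dict) -> None:
        self.steps.append(Step(len(self.steps), tool, args, result))

    def dumps(self) -> str:
        # asdict() converts the nested Step objects to plain dicts
        return json.dumps(asdict(self))


def replay_tools(trace_json: str):
    """Build a call_tool stand-in that returns the recorded results in order."""
    cursor = iter(json.loads(trace_json)["steps"])

    def call_tool(tool: str, args: dict) -> dict:
        step = next(cursor)
        # A mismatch means the rerun took a different path than the original
        if (step["tool"], step["args"]) != (tool, args):
            raise RuntimeError(f"replay diverged at step {step['index']}")
        return step["result"]

    return call_tool
```

Raising on divergence is deliberate: if a rerun asks for a different tool or different arguments than the recorded run, that difference is usually the thing being debugged.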
Key takeaways
- The agent is the control loop, and the loop must be bounded — cap steps, time, and tokens, with deliberate exits for each.
- Tool design is API design: narrow, typed, idempotent where possible, with error messages written for the model to act on.
- Treat all input — including tool output — as untrusted; validate schemas and gate sensitive actions behind permissions.
- Plan for failure with retries, fallbacks, circuit breakers, and honest graceful degradation instead of confident fabrication.
- Make cost and latency design-time budgets via model tiering, caching, parallel calls, and per-run token limits.
- Evals are the spec and belong in CI; full tracing and replay are what make an agent trustworthy in production.