How we build AI agents that hold up in production

A demo agent and a production agent share almost nothing in spirit, even when they share code. The demo answers a happy-path prompt in front of a friendly audience. The production agent runs unattended, against real data, with real consequences when it gets something wrong. Here is the architecture we use to get from one to the other.

Most agent prototypes work on the first try and then quietly fall apart the moment they meet reality: a tool returns an error the model has never seen, a user pastes something adversarial, a retry storm doubles your bill, and nobody can explain why the agent did what it did three steps ago. None of that is an "LLM problem." It is a systems problem, and it yields to the same engineering discipline we apply everywhere else. Below is how we structure agents we are willing to put our name on.

The control loop is the agent

An agent is not a prompt. It is a control loop wrapped around a model: perceive the current state, plan the next action, act by calling a tool, observe the result, and repeat until the goal is met or a limit is hit. Everything else — the prompt, the tools, the memory — is configuration of that loop. So we design the loop first and treat it as ordinary software with invariants we can reason about.

The single most important invariant is that the loop is bounded. An unbounded agent is a way to spend money and time you did not budget. We cap the number of iterations, the wall-clock duration, and the cumulative token spend per run, and we make exceeding any of them a defined, observable outcome rather than a hang.

def run_agent(goal, tools, max_steps=12, token_budget=120_000):
    """Bounded control loop.

    model, init_state, call_tool, observe, finalize and degrade are
    application-specific helpers that are not shown here.
    """
    state = init_state(goal)
    spent = 0
    for step in range(max_steps):
        plan = model.decide(state)                # perceive + plan
        spent += plan.tokens_used
        if spent > token_budget:
            return degrade(state, reason="budget")
        if plan.is_final:
            return finalize(plan)
        result = call_tool(plan.tool, plan.args)  # act
        state = observe(state, result)            # observe
    return degrade(state, reason="max_steps")     # bounded exit

Note that the loop has two exits that are not "success": budget and step-count. Both route to a deliberate degradation path, never to an exception that bubbles up as a 500. An agent that knows how to stop is worth more than an agent that knows one more trick.
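
The wall-clock cap fits the same shape as the step and token caps. What follows is a minimal, runnable sketch rather than our production loop: the `Decision` and `Outcome` records, the `decide` callable, and the `max_seconds` and `clock` parameters are all assumptions made for illustration. The clock is injectable so the timeout path can be tested without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    # Hypothetical shape of one model decision; not a real library type.
    tokens_used: int
    is_final: bool = False
    answer: Optional[str] = None


@dataclass
class Outcome:
    status: str              # "done" or "degraded"
    reason: Optional[str]    # which limit fired, if any
    steps: int               # steps completed before exiting
    answer: Optional[str] = None


def run_bounded(decide: Callable[[int], Decision],
                max_steps: int = 12,
                token_budget: int = 120_000,
                max_seconds: float = 30.0,
                clock: Callable[[], float] = time.monotonic) -> Outcome:
    """Run the loop until it finishes or any one of three limits is hit."""
    deadline = clock() + max_seconds
    spent = 0
    for step in range(max_steps):
        if clock() >= deadline:                       # wall-clock cap
            return Outcome("degraded", "timeout", step)
        decision = decide(step)
        spent += decision.tokens_used
        if spent > token_budget:                      # token cap
            return Outcome("degraded", "budget", step + 1)
        if decision.is_final:
            return Outcome("done", None, step + 1, decision.answer)
    return Outcome("degraded", "max_steps", max_steps)  # step cap
```

The deadline is only checked between steps, so this sketch does not interrupt a model or tool call that is already in flight; per-call timeouts would be a separate layer.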

Designing tools the model can actually use

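The model is only as good as the actions you give it, and tool design is where most of the real engineering lives. Our rules of thumb: keep each tool narrow, typed, and idempotent where possible, and write its error messages for the model to act on.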

Treat the tool layer as your real API design problem. If a competent junior engineer would be confused by a tool's name, arguments, or error text, the model will be too — it just won't tell you. We review tool definitions with the same rigor as a public SDK.

Guardrails: assume everything is hostile

The security model for an agent is straightforward once you accept one premise: every byte that enters the model from outside the system prompt is untrusted. That includes the user's input, and — critically — the output of every tool. A web page the agent fetched, a row from a database, a support ticket: any of these can carry instructions aimed at hijacking the agent. Prompt injection is not an edge case; it is the default threat.

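So we layer defenses: treat tool output as untrusted data, gate sensitive actions behind permissions, and validate every model output against a schema before acting on it. The last of those layers looks like this: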

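from typing import Literal

from pydantic import BaseModel, ValidationError

class OutputSchemaError(Exception):
    """Model output did not match the Resolution schema."""
    def __init__(self, detail):
        super().__init__("model output failed schema validation")
        self.detail = detail

class Resolution(BaseModel):
    action: Literal["refund", "replace", "escalate"]  # the only allowed verbs
    order_id: str
    reason: str

def validate_output(raw: str) -> Resolution:
    try:
        return Resolution.model_validate_json(raw)
    except ValidationError as e:
        # hand the error back to the model for one repair attempt,
        # then fall back to human escalation if it still fails
        raise OutputSchemaError(detail=e.errors()) from e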

Schema validation does double duty: it catches malformed model output, and it is your last line of defense against an injected instruction producing an action you never authorized. If the action isn't one of three allowed verbs, it doesn't happen.
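
The repair-then-escalate policy mentioned in the code comment above can be sketched as a small loop. To keep this example dependency-free it swaps the pydantic model for a hand-written check using the standard `json` module; the `check` and `resolve` functions, the `generate` callable, and the shape of the fallback record are assumptions for illustration only.

```python
import json
from typing import Callable, Optional

ALLOWED_ACTIONS = {"refund", "replace", "escalate"}


def check(raw: str) -> Optional[str]:
    """Return None if raw is acceptable, else an error message for the model."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return f"Output was not valid JSON: {e.msg}"
    if not isinstance(obj, dict):
        return "Output must be a JSON object."
    action = obj.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return f"action must be one of {sorted(ALLOWED_ACTIONS)}."
    for name in ("order_id", "reason"):
        if not isinstance(obj.get(name), str):
            return f"{name} must be a string."
    return None


def resolve(generate: Callable[[Optional[str]], str],
            order_id: str,
            max_repairs: int = 1) -> dict:
    """Ask the model, allow a limited number of repairs, then escalate."""
    feedback: Optional[str] = None
    for _ in range(max_repairs + 1):
        raw = generate(feedback)     # feedback is None on the first attempt
        feedback = check(raw)
        if feedback is None:
            return json.loads(raw)
    # Still invalid after the repair attempts: hand off to a human.
    return {"action": "escalate", "order_id": order_id,
            "reason": f"invalid model output: {feedback}"}
```

The fallback is itself one of the allowed verbs, so downstream code only ever sees a record that passed the same check.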

Failure handling is most of the work

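Tools fail. Models occasionally return garbage. Networks flake. A production agent treats failure as the normal case and has a planned response for each kind: retries for transient errors, fallbacks when a dependency stays down, circuit breakers to stop retry storms, and honest graceful degradation instead of confident fabrication.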

The test we hold ourselves to: pull the plug on any single dependency and the agent should still behave defensibly. That mindset is the same one behind our Sweep iOS app, which ships with zero third-party dependencies, runs 100% on-device, and passes 23 of 23 tests — fewer moving parts, fewer ways to fail, every failure mode accounted for.
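
One way to make "pull the plug on any single dependency" concrete is to wrap each tool call in bounded retries and a circuit breaker, with a fallback value when both give up. This is a sketch under stated assumptions, not a description of any particular library: `CircuitBreaker`, `with_retries`, `call_tool_safely`, and all thresholds are hypothetical, and the clock and sleep functions are injectable so the behavior can be tested without delays.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a dependency is being skipped after repeated failures."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("dependency temporarily disabled")
            self.opened_at = None        # cool-down over: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                # any success closes the circuit
        return result


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential backoff; never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpen:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")


def call_tool_safely(breaker: CircuitBreaker, fn: Callable[[], T], fallback: T,
                     **retry_kwargs) -> T:
    """Return the tool result, or the fallback if retries and breaker give up."""
    try:
        return with_retries(lambda: breaker.call(fn), **retry_kwargs)
    except Exception:
        return fallback
```

The fallback should be something the agent can report honestly, such as an explicit "unavailable" marker, rather than a value that looks like real data.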

Cost and latency are design constraints, not afterthoughts

An agent that is correct but slow and expensive will not survive contact with a finance review. We treat cost and latency as first-class budgets, set at design time, and we meet them with model tiering, caching, parallel calls, and per-run token limits.

The cheapest token is the one you never send. Before reaching for a bigger model, we ask whether the step needs the model at all — a lot of "agent" work is plumbing that a deterministic function does faster, cheaper, and more reliably.
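
A small sketch of that question applied to a single step, finding an order ID in a message. It tries a deterministic regex first, then a small model, then a large one, and caches the result. The order ID format, the `make_order_id_step` factory, and the two model callables are hypothetical stand-ins; the convention that a model returns `None` when it cannot answer is also an assumption.

```python
import re
from functools import lru_cache
from typing import Callable, Optional

# Assumed order ID format for this example: one capital letter, dash, four digits.
ORDER_ID = re.compile(r"\b[A-Z]-\d{4}\b")

ModelFn = Callable[[str], Optional[str]]


def make_order_id_step(small_model: ModelFn, large_model: ModelFn) -> ModelFn:
    """Build a cached step that escalates through tiers only when needed."""

    @lru_cache(maxsize=1024)
    def step(message: str) -> Optional[str]:
        found = ORDER_ID.search(message)
        if found:
            return found.group(0)          # tier 0: plain code, zero tokens
        # Tier 1 is the small model; tier 2 runs only if tier 1 returns None.
        return small_model(message) or large_model(message)

    return step
```

Because the cache key is the exact message text, it only helps with verbatim repeats; a production cache would need a deliberate policy on normalization and expiry.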

Evals are the real spec

You cannot improve what you cannot measure, and "it looked good when I tried it" is not measurement. For us, the eval set is the specification. We build an offline suite of representative inputs with expected outcomes — happy paths, known-hard cases, and the adversarial inputs that bit us in the past — and we score every change against it before it ships.

Those evals become regression gates in CI. A prompt tweak that fixes one case and quietly breaks five others gets caught before it reaches a user, not after. This is the discipline that turns prompt-tuning from folklore into engineering: every claim about the agent getting "better" is backed by a number that moved on a fixed test set.
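
A regression gate of this kind can be very small. The sketch below assumes exact-match scoring and a stored baseline of per-case pass/fail results; the `Case` record and the `run_evals`, `regressions`, and `pass_rate` helpers are hypothetical names, and real suites often need fuzzier scoring than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Score every case; True means the agent produced the expected outcome."""
    return {case.name: agent(case.prompt) == case.expected for case in cases}


def regressions(results: Dict[str, bool], baseline: Dict[str, bool]) -> List[str]:
    """Names of cases that passed in the baseline but do not pass now."""
    return sorted(name for name, passed in baseline.items()
                  if passed and not results.get(name, False))


def pass_rate(results: Dict[str, bool]) -> float:
    return sum(results.values()) / len(results) if results else 0.0
```

In CI, a script would exit non-zero whenever `regressions(...)` is non-empty, so a change that raises the overall pass rate while breaking a previously passing case still fails the build.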

Observability: if you can't replay it, you can't trust it

When an agent does something surprising in production, "the model decided to" is not an acceptable answer. We instrument every run so that any single execution can be reconstructed end to end: the inputs, every model call and its tokens, every tool invocation with arguments and results, the decisions at each step, and the final output. Each step carries a trace ID; the whole run is replayable.

That replay capability is what makes agents debuggable instead of mystical. It lets us reproduce a bad run deterministically, turn it into a new eval case, fix it, and prove the fix holds. Without it, you are not operating an agent — you are hoping at scale.
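
To show the idea of record-then-replay in miniature, the sketch below routes every model and tool call through a tracer. In live mode the tracer calls the dependency and records the result; in replay mode it serves the recorded result instead. The `Step`, `Trace`, and `Replayer` classes and the `record` method are assumptions for illustration, and the trace is assumed to contain only JSON-serializable values.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    kind: str                  # "model" or "tool"
    name: str
    args: Dict[str, Any]
    result: Any
    tokens: int = 0


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: List[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, args: Dict[str, Any],
               fn: Callable[..., Any], tokens: int = 0) -> Any:
        """Call the live dependency and append what happened to the trace."""
        result = fn(**args)
        self.steps.append(Step(kind, name, args, result, tokens))
        return result

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Trace":
        data = json.loads(raw)
        return cls(trace_id=data["trace_id"],
                   steps=[Step(**s) for s in data["steps"]])


class Replayer:
    """Serves recorded results in order instead of calling live dependencies."""

    def __init__(self, trace: Trace):
        self._steps = iter(trace.steps)

    def record(self, kind: str, name: str, args: Dict[str, Any],
               fn: Callable[..., Any], tokens: int = 0) -> Any:
        step = next(self._steps)
        if (step.kind, step.name, step.args) != (kind, name, args):
            raise RuntimeError(f"replay diverged at step '{name}'")
        return step.result         # fn is never called during replay
```

Agent code written against the `record` interface runs unchanged in both modes, which is what lets a stored bad run become a deterministic test case.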

Key takeaways

  • The agent is the control loop, and the loop must be bounded — cap steps, time, and tokens, with deliberate exits for each.
  • Tool design is API design: narrow, typed, idempotent where possible, with error messages written for the model to act on.
  • Treat all input — including tool output — as untrusted; validate schemas and gate sensitive actions behind permissions.
  • Plan for failure with retries, fallbacks, circuit breakers, and honest graceful degradation instead of confident fabrication.
  • Make cost and latency design-time budgets via model tiering, caching, parallel calls, and per-run token limits.
  • Evals are the spec and belong in CI; full tracing and replay are what make an agent trustworthy in production.
