Idea to production in weeks: how a small senior team ships AI

Most AI projects don't fail because the model isn't good enough. They fail because the scope was a platform, the spec was a paragraph, and the demo never had to survive contact with a real user. Here is how we take one use case from idea to production in weeks — without the corners that come back to bite you in month three.

"Weeks not quarters" is a scope decision, not a heroics decision

When someone says they shipped AI in weeks, the instinct is to assume crunch: a few people grinding nights to brute-force a quarter of work into a sprint. That is the wrong model, and it produces software that nobody can maintain. Speed in AI delivery is almost entirely upstream of the engineering — it comes from what you decide not to build.

The slow projects we see are slow because they tried to build a capability before they had shipped a single feature. They scoped "an AI assistant for our operations team" and then spent a quarter on connectors, a vector store, a config UI, role-based access, and an evaluation dashboard — none of which a user had touched. The fast version of that same project ships one question, answered well, for one team, behind a flag, in three weeks. Everything else is a follow-on, funded by the fact that the first slice worked.

So the discipline isn't moving faster. It's refusing to widen the cut.

Pick one thin vertical slice

A vertical slice means one use case, one real data source, and one success metric — end to end, all the way to a user. Not a horizontal layer (an ingestion pipeline, a "model service") that serves no one until three other layers land. The test we apply: can a real user get a real answer to a real question on day fourteen? If the architecture diagram has to be complete before anyone benefits, the slice is too wide.

Concretely, scoping a slice means forcing three answers before any code is written: which single use case, which real data source, and which success metric.

If you can't name the metric, you can't ship the slice. "Make it smarter" is not a target you can pass or fail. When a stakeholder can't tell us how they'd grade the output, that's not a delay — it's the most valuable conversation of the engagement, and it happens in week one instead of at the demo.
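One way to make that rule hard to skip is to write the three answers down as data and refuse to proceed while any is blank. The sketch below is illustrative only: `SliceSpec`, `missing_answers`, and the example values are hypothetical names, not part of any tool described here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SliceSpec:
    """The three answers a slice needs before any code is written."""
    use_case: str        # the one question being answered, and for whom
    data_source: str     # the one real source the answer draws on
    success_metric: str  # how a stakeholder would grade the output


def missing_answers(spec: SliceSpec) -> list[str]:
    """Return the names of any answers that were left blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Hypothetical example values, for illustration only.
spec = SliceSpec(
    use_case="Answer refund-eligibility questions for the support team",
    data_source="Refund policy document",
    success_metric="",  # "make it smarter" does not count as an answer
)
print(missing_answers(spec))
```

A non-empty result is the signal to go back to the stakeholder conversation rather than forward to the build.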

Evals are the spec

The single biggest difference between an AI prototype and an AI product is that the product has an eval set, and "done" means passing it. We write the evals with the build — often before — because in an LLM system the eval set is the specification. It's the only artifact that encodes, in a way you can run, what correct behavior looks like.

This isn't a model-accuracy leaderboard. It's a curated set of real inputs paired with what should happen, graded by exact match where you can and by a model-graded rubric where the output is open-ended. You start small and honest: twenty to fifty cases hand-built from real data, including the ugly ones (empty input, the adversarial prompt, the ticket in the wrong language, the question the system should refuse). Every bug a user finds becomes a new eval case, so a known failure has to get past the gate before it can ship again.

cases:
  - id: refund-out-of-policy
    input: "I want a refund, I bought it 90 days ago"
    expect:
      must_contain: ["30-day", "policy"]
      must_not: ["sure, processing your refund"]
      grader: rubric   # model-graded: polite, cites policy, no false promise

  - id: pii-redaction
    input: "my card is 4111 1111 1111 1111, charge failed"
    expect:
      must_not_contain_raw_pii: true
      grader: regex

gate:
  pass_threshold: 0.95   # CI fails the build below this
  block_on: [pii-redaction]   # safety cases are non-negotiable
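A config like the one above could be enforced by a small runner along these lines. This is a minimal sketch under stated assumptions: the cases are written as plain dicts to avoid a YAML dependency, `fake_system` is a hypothetical stand-in for the real LLM call, "raw PII" is narrowed to a 16-digit card number, and the model-graded rubric is left out.

```python
import re

# Cases mirror the config above, written as plain dicts.
CASES = [
    {"id": "refund-out-of-policy",
     "input": "I want a refund, I bought it 90 days ago",
     "must_contain": ["30-day", "policy"],
     "must_not": ["sure, processing your refund"]},
    {"id": "pii-redaction",
     "input": "my card is 4111 1111 1111 1111, charge failed",
     "must_not_contain_raw_pii": True},
]
GATE = {"pass_threshold": 0.95, "block_on": ["pii-redaction"]}

# Assumption: "raw PII" means a 16-digit card number with optional separators.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def grade(case: dict, output: str) -> bool:
    """Apply the deterministic checks only; rubric grading is out of scope here."""
    text = output.lower()
    if any(s.lower() not in text for s in case.get("must_contain", [])):
        return False
    if any(s.lower() in text for s in case.get("must_not", [])):
        return False
    if case.get("must_not_contain_raw_pii") and CARD_RE.search(output):
        return False
    return True


def gate(results: dict[str, bool], gate_cfg: dict) -> bool:
    """Pass only if no blocking case failed and the pass rate clears the threshold."""
    if any(not results.get(case_id, False) for case_id in gate_cfg["block_on"]):
        return False
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= gate_cfg["pass_threshold"]


def fake_system(prompt: str) -> str:
    """Hypothetical stand-in for the real LLM call."""
    if "refund" in prompt:
        return "Our 30-day policy means this purchase is no longer eligible."
    return "I see a failed charge on the card ending in 1111."


results = {c["id"]: grade(c, fake_system(c["input"])) for c in CASES}
print(results, "gate passed:", gate(results, GATE))
```

In CI, a falsy return from `gate` would translate into a non-zero exit code so the build fails. Checking `block_on` first means a safety case can fail the build even when the overall pass rate is above the threshold.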

Once the set exists, it stops being a test and becomes the steering wheel. Changed the prompt? Run the evals. Swapped the model to cut cost? Run the evals. The number tells you whether you improved the system or just moved the failures somewhere you weren't looking. Without it, every prompt tweak is a vibe and every regression is a surprise in production.
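Alongside the aggregate pass rate, a per-case comparison shows when a change fixed one case and broke another. The sketch below assumes each eval run is summarised as a mapping from case id to pass/fail; the function names and the `wrong-language` case id are hypothetical.

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that passed before the change and fail after it."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False)
    )


def fixes(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case ids that failed before the change and pass after it."""
    return sorted(
        cid for cid, ok in candidate.items() if ok and not baseline.get(cid, False)
    )


# Hypothetical runs before and after a prompt change, for illustration only.
baseline = {"refund-out-of-policy": False, "pii-redaction": True, "wrong-language": True}
candidate = {"refund-out-of-policy": True, "pii-redaction": True, "wrong-language": False}

print("fixed:", fixes(baseline, candidate))
print("regressed:", regressions(baseline, candidate))
```

In this made-up example both runs pass two of three cases, so the headline number does not move, yet the failure has shifted from one case to another. That is the situation the per-case view is meant to expose.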

Build vs buy: don't fine-tune on day one

The most expensive early mistake is reaching for the heaviest tool first. We default to the cheapest mechanism that could plausibly work and only climb the ladder when the evals say we have to:

  1. Prompting first. A strong frontier model with a well-structured prompt and a few examples clears a startling number of use cases outright. This is a day, not a sprint.
  2. Then retrieval. If the gap is missing knowledge — your docs, your policies, your data — that's a retrieval problem, not a training problem. Give the model the right context at inference time.
  3. Then tools and structure. If it needs to do things or return strict shapes, add tool calls and schema-constrained output before you touch weights.
  4. Fine-tuning last, and rarely. Only when prompting plus retrieval has plateaued below the bar and you have the labeled data to justify it. Fine-tuning on day one buys you a frozen snapshot of a problem you don't understand yet, plus a retraining bill every time the world changes.
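The ladder above can be read as a simple loop: try the cheapest rung, score it against the eval set, and stop at the first one that clears the bar. The sketch below is an illustration under assumptions, not a prescribed tool. The `evaluate` callback stands in for building that rung and running the evals, and the scores are invented.

```python
from typing import Callable, Optional

# Rungs in the order described above, cheapest first.
LADDER = ["prompting", "retrieval", "tools_and_structure", "fine_tuning"]


def first_rung_that_clears(
    evaluate: Callable[[str], float], bar: float
) -> Optional[str]:
    """Try each rung in order and stop at the first whose eval score meets the bar."""
    for rung in LADDER:
        if evaluate(rung) >= bar:
            return rung
    return None  # no rung cleared the bar


# Hypothetical scores, for illustration only; real ones come from running the eval set.
example_scores = {
    "prompting": 0.81,
    "retrieval": 0.96,
    "tools_and_structure": 0.97,
    "fine_tuning": 0.97,
}
chosen = first_rung_that_clears(example_scores.__getitem__, bar=0.95)
print(chosen)
```

In practice the rungs are cumulative (retrieval is added on top of a good prompt, not instead of it), so each call to `evaluate` would score the system with everything up to and including that rung.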

The same restraint applies to infrastructure. Call a managed API before you self-host a model; use a hosted vector store before you operate one; reach for an existing framework before you write one. Buy the undifferentiated parts so your senior time goes into the slice that is actually yours. We hold the opposite bar only where it earns its keep — our own Sweep iOS app ships with zero third-party dependencies and runs 100% on-device because privacy is the product there. That's a deliberate trade, not a default.

Ship behind a flag, to real users, early

A demo that only the team has seen is a hypothesis, not a result. We get the slice in front of a small set of real users behind a feature flag as soon as it clears the eval gate — not when it's polished. Early exposure does two things no internal testing can: it surfaces the inputs you didn't imagine, and it tells you whether the output is actually useful or merely correct.

A flag is also your safety rail. It means you can ship to ten people, watch the traces, and pull it back to zero in seconds if something drifts — no redeploy, no incident. "Production" doesn't mean "everyone at once." It means a real user, in the real system, with a real path to turn it off.
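A flag check of this kind can be very small. The following sketch assumes a percentage rollout with stable hashing plus an explicit pilot allowlist; the function names, flag name, and user ids are hypothetical, and in a real system `rollout_percent` would be read from a flag service or config store at request time so that changing it needs no redeploy.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) so exposure does not flip between requests."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(
    user_id: str,
    flag: str,
    rollout_percent: int,
    allowlist: frozenset[str] = frozenset(),
) -> bool:
    """Decide whether this user sees the flagged feature."""
    if rollout_percent <= 0:
        return False  # kill switch: zero means off for everyone, allowlist included
    return user_id in allowlist or bucket(user_id, flag) < rollout_percent


# Hypothetical pilot group of real users.
pilot = frozenset({"u-001", "u-002", "u-003"})
print(is_enabled("u-001", "ai-answers", rollout_percent=1, allowlist=pilot))
print(is_enabled("u-001", "ai-answers", rollout_percent=0, allowlist=pilot))
```

Treating zero as a hard off, even for allowlisted users, is a design choice made here so that "pull it back to zero" has exactly one meaning.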

The production-readiness checklist a buyer should demand

Here is the bar we hold for ourselves, and the one a technical buyer should make any AI vendor meet before the word "production" gets used. If a supplier can't check these off, what they have is a demo.

[ ] Tests + CI            unit + integration tests run on every PR; red = blocked
[ ] Eval regression gate  eval suite runs in CI; build fails below threshold
[ ] Observability         every LLM call traced: inputs, outputs, tokens, latency
[ ] Cost budget           per-request + monthly ceiling, alerting on breach
[ ] Latency budget        p95 target defined and measured, not hoped for
[ ] Guardrails            input/output filtering, PII handling, prompt-injection review
[ ] Security review       secrets management, data retention, access scoped
[ ] Rollback              feature flag + one-step revert; no "redeploy to undo"
[ ] Owned code + docs     you hold the repo, the keys, and a README that onboards
[ ] On-call owner         a named human who gets paged when it breaks

None of this is exotic. It's the same discipline good teams apply to any production service — applied to a system that is non-deterministic, so the eval gate and the tracing matter more, not less. As a proof point on our own work, the Sweep app we built ships with 23/23 passing tests; we don't ask clients to hold a bar we don't hold ourselves.
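The observability, cost, and latency lines of the checklist become checkable once every call leaves a record. The sketch below assumes one trace per LLM call and a nearest-rank p95; the field names, thresholds, and sample data are illustrative, not taken from any particular tracing product.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One record per LLM call. Field names are illustrative."""
    prompt: str
    output: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def budget_breaches(
    traces: list[Trace],
    max_cost_per_request: float,
    monthly_ceiling: float,
    p95_latency_target_ms: float,
) -> list[str]:
    """Return the names of any budgets the given month of traces has breached."""
    if not traces:
        return []
    breaches = []
    if any(t.cost_usd > max_cost_per_request for t in traces):
        breaches.append("per-request cost")
    if sum(t.cost_usd for t in traces) > monthly_ceiling:
        breaches.append("monthly cost")
    if p95([t.latency_ms for t in traces]) > p95_latency_target_ms:
        breaches.append("p95 latency")
    return breaches


# Invented sample: 20 calls with latencies of 100 ms, 200 ms, ... 2000 ms.
sample = [Trace("q", "a", 200, 50, 100.0 * (i + 1), 0.01) for i in range(20)]
print(budget_breaches(sample, 0.02, 100.0, 1500.0))
```

Each name in the returned list would map to an alert, which is the "alerting on breach" part of the cost line.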

"Owned code" is the line that protects you. When the engagement ends you should hold the repository, the API keys, the eval set, and documentation that lets your own team take over. If switching vendors means starting over, you didn't buy software — you rented a dependency.

How a small senior team avoids the coordination tax

There's a reason the same slice that takes a large org a quarter takes a small senior team weeks, and it isn't talent in the abstract — it's the coordination tax. Every handoff between a product manager, an architect, a backend team, an ML team, and a QA team is a queue, a meeting, and a translation loss. The work spends most of its life waiting in someone's backlog, not being done.

A small team of senior engineers collapses those handoffs. The person scoping the slice is the person writing the evals is the person shipping the code. Decisions that would be a cross-team ticket are a conversation. There's no junior bait-and-switch — the people who sold you the senior team are the people doing the work, which is exactly why the estimate holds. Beyond a handful of people, you start paying more in coordination than you gain in throughput; for one well-scoped slice, small wins outright.
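One rough way to picture the coordination cost, offered as an illustration and not as a claim made by this article, is to count the pairwise communication paths in a team. With n people there are n(n-1)/2 possible pairs, so the number of paths grows much faster than headcount. The function name below is hypothetical.

```python
def handoff_paths(people: int) -> int:
    """Number of distinct pairs in a team of the given size: n * (n - 1) / 2."""
    return people * (people - 1) // 2


for size in (3, 5, 10, 20):
    print(size, "people:", handoff_paths(size), "possible pairwise paths")
```

This counts potential paths only. It says nothing about how many of them a given organisation actually uses, so it is an upper bound on the shape of the problem, not a measurement.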

That's the whole method: cut the scope to one slice, make the eval set the spec, buy the boring parts, ship behind a flag, and hold a real production bar with a team small enough that nothing waits in a queue. It isn't fast because it's reckless. It's fast because it's disciplined.

Key takeaways

  • Shipping AI in weeks is a scope decision, not a matter of working harder: ship one thin vertical slice, not a platform.
  • A slice is one use case, one real data source, and one success metric, all the way to a real user.
  • Treat the eval set as the spec: write it with the build, and "done" means it passes in CI.
  • Climb the build ladder in order — prompting, then retrieval, then tools — and fine-tune last, if ever.
  • Ship behind a feature flag to real users early; a flag is both your feedback loop and your rollback.
  • Demand the full production checklist — tests, eval gate, tracing, cost/latency budgets, guardrails, rollback, owned code — before anyone says "production."
