RAG that actually retrieves: patterns we use in production
Almost every "the model hallucinated" bug we get handed turns out to be a retrieval miss. The model answered faithfully — it just never saw the right chunk. Here are the retrieval patterns we reach for to make retrieval-augmented generation reliable in production.
The bug is almost never the model
When a RAG system gives a confidently wrong answer, the instinct is to blame the language model. In our experience that instinct is wrong most of the time. We instrument retrieval and generation as two separate stages, and when we trace a bad answer back through the pipeline, the failure has usually already happened by the time the model is invoked: the relevant passage was never in the top-k, or it was buried below noise, or it was chunked so badly that the one sentence that mattered got split across two fragments and neither made the cut.
A useful mental model: the model can only be as good as its context window. If the answer to the question is not in the retrieved context, a well-behaved model will either refuse or improvise — and improvising is what gets logged as a hallucination. So before touching prompts, sampling, or swapping models, we ask one question: did the right chunk make it into the context? You can answer that directly by logging retrieved chunk IDs alongside every answer and spot-checking the failures. The fix is nearly always upstream of the model.
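A minimal sketch of that logging, assuming retrieval returns (chunk ID, score) pairs in rank order. The record fields and the helper names `build_trace` and `gold_chunk_rank` are illustrative, not part of any particular library:

```python
import json
import time
import uuid


def build_trace(question, retrieved, answer):
    """Build one log record linking an answer to the chunks it was given.

    `retrieved` is a list of (chunk_id, score) pairs in rank order.
    """
    return {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "question": question,
        "chunk_ids": [chunk_id for chunk_id, _ in retrieved],
        "scores": [round(score, 4) for _, score in retrieved],
        "answer": answer,
    }


def gold_chunk_rank(trace, gold_chunk_id):
    """Return the 1-based rank of the known-correct chunk, or None if it was never retrieved."""
    ids = trace["chunk_ids"]
    return ids.index(gold_chunk_id) + 1 if gold_chunk_id in ids else None


trace = build_trace(
    "What is the partial refund window?",
    [("billing-12", 0.81), ("faq-03", 0.77)],
    "Refunds must be requested within 30 days.",
)
print(json.dumps(trace))  # one JSON line per answer, ready for any log sink
```

When spot-checking a bad answer, `gold_chunk_rank` returning None means the right chunk never reached the context, which points at retrieval rather than generation.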
Chunking: respect the document, not the token count
Fixed-size chunking (split every 500 tokens with a 50-token overlap) is the default in most tutorials, and it is where most retrieval quality goes to die. A fixed-size window cheerfully cuts through the middle of a table, separates a heading from the paragraph it introduces, or splits a function signature from its body. The embedding of half a thought is a bad embedding.
We chunk structurally first and only fall back to size limits as a backstop. Markdown and HTML give you headings, lists, and code fences for free; PDFs give you layout if you parse them properly rather than flattening to a text blob. We split on those natural boundaries, keep each chunk to a single coherent unit, and carry a little overlap (typically one sentence or one heading of context) so a chunk is never stranded without the thing it refers to. Three rules we hold to:
- Never split a structural unit mid-stream. A table, a code block, a list item, or a short section stays whole even if it pushes a chunk slightly over the nominal size.
- Prepend context to the chunk body. We stuff the document title and the heading trail ("Billing > Refunds > Partial refunds") into the embedded text. A chunk that reads "you must request this within 30 days" is useless on its own; with its heading trail it is retrievable.
- Keep metadata on every chunk. Source, section, last-updated date, access scope. You will need it for filtering and for citations, and bolting it on later is painful.
Chunk size is a genuine tradeoff, not a constant to copy from a blog post. Smaller chunks embed more precisely but fragment context; larger chunks preserve context but dilute the embedding and waste tokens. We tune it against an evaluation set (below) rather than guessing.
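As an illustration of structural chunking with a heading trail, here is a minimal sketch for Markdown input. It assumes ATX-style `#` headings and backtick code fences. The function name `chunk_markdown` and the chunk dictionary shape are hypothetical, and a real pipeline would add the size backstop, overlap, and metadata described above:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_markdown(text, doc_title):
    """Split Markdown at headings; each chunk carries its heading trail.

    Fenced code blocks are never split, even if they contain '#' lines.
    """
    chunks, trail, body, in_fence = [], [], [], False

    def flush():
        content = "\n".join(body).strip()
        if content:
            path = " > ".join([doc_title] + [title for _, title in trail])
            chunks.append({"trail": path, "text": f"{path}\n\n{content}"})
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own trail
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

The `text` field is what gets embedded, so the "30 days" sentence is stored together with its "Billing > Refunds > Partial refunds" trail instead of on its own.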
Embeddings: quality and domain fit both matter
The embedding model decides what "similar" means for your entire corpus, and not all embedding models are equally good — nor equally suited to your domain. A general-purpose model trained on web text will happily tell you that "Java the language" and "Java the island" are neighbours, which is exactly wrong in a developer-docs corpus. We treat embedding choice as an empirical question: build a small labelled set of query/relevant-passage pairs from your actual domain and measure recall before committing.
Two practical notes. First, the embedding you store and the embedding you query with must come from the same model and the same preprocessing. It is an obvious point that nonetheless breaks systems when someone upgrades the model on the query side and forgets to re-index the corpus. Second, index size and query latency grow with both embedding dimensionality and corpus size; a marginally better model that triples your index size and latency may not be worth it. Measure, don't assume.
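One way to guard against the first failure is to store a fingerprint of the embedding setup with the index and compare it at query time. This is a sketch under assumptions: the function names and the fields that make up the fingerprint are hypothetical, and the model names in the usage note are placeholders.

```python
import hashlib
import json


def embedding_fingerprint(model_name, dimensions, preprocessing):
    """Hash everything that changes what a stored vector means."""
    payload = json.dumps(
        {"model": model_name, "dim": dimensions, "prep": preprocessing},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def assert_compatible(index_fingerprint, query_fingerprint):
    """Refuse to search an index built with a different embedding setup."""
    if index_fingerprint != query_fingerprint:
        raise RuntimeError(
            "Embedding config mismatch: re-index the corpus before querying "
            f"(index={index_fingerprint}, query={query_fingerprint})"
        )
```

For example, an index built with `embedding_fingerprint("embed-v1", 768, "lowercase")` would fail loudly against a query side that has moved to `"embed-v2"`, rather than silently returning poor neighbours.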
Hybrid search: dense vectors miss the exact terms
Pure vector search is excellent at semantic matching and quietly terrible at exact matching. Dense embeddings smear meaning into a continuous space, which is what lets them match "how do I cancel" to "subscription termination." But that same smearing is why they fumble the things that must match literally: part numbers, error codes, function names, acronyms, ticket IDs, a specific SKU. Ask a vector index for ERR_4012 and it will return passages about errors in general — semantically close, exactly useless.
Lexical search (BM25 and friends) has the opposite profile: it nails exact tokens and ranks by term frequency, but it has no idea that "cancel" and "terminate" are the same intent. The two approaches fail in complementary ways, which is precisely why we run both and fuse the results. Dense search supplies the semantic recall; BM25 supplies the precision on the literal tokens that embeddings drop.
A compact way to combine them is Reciprocal Rank Fusion, which merges ranked lists without needing to calibrate score scales between two very different systems:
# Hybrid retrieve: dense + lexical, fused by reciprocal rank.
# Assumes both searches return doc IDs, best match first.
dense_hits = vector_index.search(embed(query), top_k=50)   # semantic recall
lexical_hits = bm25_index.search(query, top_k=50)          # exact-term precision

def rrf(ranked_lists, k=60):
    scores = {}
    for hits in ranked_lists:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

candidates = rrf([dense_hits, lexical_hits])[:50]          # one merged candidate set
RRF is deliberately dumb — it only looks at ranks, not raw scores — and that is its strength: there is nothing to tune between the two backends and it is robust to wildly different score distributions. The merged candidate set then goes to the reranker.
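Written out, and assuming 1-based ranks as in the usual statement of RRF, the fused score of a document d over the set of ranked lists R is:

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}, \qquad k = 60
```

A list that does not contain d contributes nothing to the sum. With k = 60, a chunk ranked 1st by dense search and 3rd by BM25 scores 1/61 + 1/63 ≈ 0.0323, while a chunk ranked 1st by only one backend scores 1/61 ≈ 0.0164. Agreement between the two backends therefore counts for more than a single top rank.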
Reranking: the highest-ROI upgrade we know
If we could make exactly one change to a struggling RAG pipeline, it would be adding a cross-encoder reranker. It is the single highest-return upgrade in retrieval, and it is almost embarrassingly simple to bolt on.
Here is why it works. The embedding step is a bi-encoder: it embeds the query and each document separately and compares vectors, which is fast enough to search millions of chunks but loses information because the query and the document never actually meet. A cross-encoder reranker takes the query and a candidate chunk together as one input and scores their relevance directly. That joint attention is far more accurate at judging relevance — and far too slow to run over the whole corpus. So you use it surgically: retrieve a generous candidate set cheaply (hybrid search, top 50), then rerank just those 50 with the cross-encoder and keep the top 5.
# Rerank the fused candidates, then ground generation on the survivors.
ranked = reranker.score(query, candidates) # cross-encoder: query+chunk together
top_chunks = [c for c, _ in ranked[:5]] # keep the best handful
context = format_with_citations(top_chunks) # each chunk tagged with its source ID
answer = llm.generate(system=GROUNDED_PROMPT,
                      context=context,
                      question=query)
The pattern — cheap high-recall retrieval, then expensive high-precision reranking on a small set — is the backbone of essentially every production RAG system we ship. It lets you set the first-stage top_k generously (recall is what matters there) and lean on the reranker to do the precision work before anything reaches the model.
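The second stage can be sketched independently of any particular model. The version below assumes a `score_pairs` callable that takes a list of (query, text) tuples and returns one relevance score per pair. That callable is a placeholder, although the `CrossEncoder.predict` method in the sentence-transformers library accepts pairs in this shape. The `rerank` name and the chunk dictionary layout are illustrative.

```python
def rerank(query, candidates, score_pairs, keep=5):
    """Score (query, chunk) pairs jointly and keep the best `keep` chunks.

    candidates:  list of dicts with "id" and "text" keys.
    score_pairs: callable taking a list of (query, text) tuples and
                 returning one relevance score per pair (assumed interface).
    Returns a list of (chunk, score) pairs, best first.
    """
    if not candidates:
        return []
    scores = score_pairs([(query, c["text"]) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]
```

Because the scorer only ever sees the candidate set, its cost is bounded by the first-stage top_k rather than by corpus size, which is what makes a generous first stage affordable.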
Query understanding: the user's words aren't the query
Users do not phrase questions the way documents phrase answers. They type fragments, pile multiple questions into one sentence, lean on pronouns from three turns ago, and use their own vocabulary rather than yours. Embedding the raw user string and hoping for the best leaves a lot of recall on the table. We do light query rewriting before retrieval:
- Rewriting and expansion. Resolve conversational references ("it," "that one") against history into a standalone query, and expand with synonyms and likely domain terms so lexical search has more to grab onto.
- Decomposition. A compound question ("what's the refund window and how do I start one?") is split into sub-queries, retrieved separately, and the results merged. One vector search cannot serve two distinct information needs well.
- HyDE-style expansion. For sparse or jargon-heavy queries, have a model draft a short hypothetical answer and embed that instead of the question. A hypothetical answer lives in the same vocabulary space as the real documents, so it often retrieves better than the terse question did.
These steps cost an extra model call or two, so we apply them judiciously: a crisp keyword query needs none of them, while a vague multi-part question benefits from all three. The point is that the string the user typed is an input to the query, not the query itself.
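The decomposition step above could be sketched as follows. Both callables are assumptions: `decompose` stands in for a model call that splits a compound question, and `retrieve` stands in for whatever retrieval path is in use. Interleaving the results by rank is one simple merge policy among several, chosen here so each sub-query gets its best chunk in early.

```python
def retrieve_decomposed(question, decompose, retrieve, per_query_k=5):
    """Retrieve for each sub-query separately, then merge without duplicates.

    decompose: callable str -> list[str] of standalone sub-queries (assumed).
    retrieve:  callable (str, int) -> list of chunk IDs, best first (assumed).
    """
    sub_queries = decompose(question) or [question]
    results = [retrieve(q, per_query_k) for q in sub_queries]
    merged, seen = [], set()
    # Interleave by rank: take every sub-query's 1st hit, then every 2nd, and so on.
    for rank in range(per_query_k):
        for hits in results:
            if rank < len(hits) and hits[rank] not in seen:
                seen.add(hits[rank])
                merged.append(hits[rank])
    return merged
```

If `decompose` returns an empty list, the original question is used as the only query, so a crisp single-intent question passes through unchanged.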
Grounding and citations: make the model show its work
Retrieval can be perfect and the system can still mislead if the model is allowed to answer from its parameters instead of the context. We constrain generation hard. The system prompt instructs the model to answer only from the supplied context, to cite the chunk ID behind each claim, and — critically — to say "I don't have enough information to answer that" when the context doesn't support an answer. That refusal path is a feature, not a failure: a system that declines when it should is far more trustworthy than one that always produces something.
Citations do double duty. For the user, they make answers auditable — every claim links back to a source they can open and verify. For us, they are a debugging and evaluation signal: if the model cites chunk 7 and chunk 7 is irrelevant, we know retrieval misfired even when the prose sounds plausible. We render citations inline and treat an uncited claim in a grounded context as a bug to investigate. Modern models — Claude among them — follow this kind of grounding instruction well when the context is clean and the chunks carry clear source tags, which is one more reason the upstream chunking and reranking work pays off here.
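A rough sketch of how the "uncited claim is a bug" check might be automated. It assumes citations appear inline as `[chunk-id]` before the sentence's closing punctuation, and that the refusal string matches the wording quoted above. The helper names, including this version of `format_with_citations`, are illustrative and not a fixed interface.

```python
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")
REFUSAL = "I don't have enough information to answer that"


def format_with_citations(chunks):
    """Render chunks as '[id] text' blocks so the model can cite by ID."""
    return "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)


def audit_citations(answer, allowed_ids):
    """Return (uncited_sentences, unknown_ids) for a generated answer."""
    if answer.strip().rstrip(".") == REFUSAL:
        return [], []  # a clean refusal needs no citations
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited = {chunk_id for s in sentences for chunk_id in CITATION.findall(s)}
    return uncited, sorted(cited - set(allowed_ids))
```

Uncited sentences flag possible answers from the model's memory. Unknown IDs flag citations to chunks that were never in the context, which is a stronger signal that something is wrong.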
Freshness: an index is a cache that goes stale
A vector index is a snapshot, and snapshots rot. The day a document changes, every chunk embedded from its old version is now actively wrong, and a confident citation to stale content is worse than no answer. We treat indexing as an incremental pipeline, not a nightly full rebuild: when a source document changes, we re-chunk and re-embed just that document and atomically swap its chunks, keying on a content hash so unchanged content is never needlessly re-embedded. We also stamp every chunk with a last-updated date in its metadata, which lets us filter stale content out of retrieval and surface freshness to the user. For corpora where recency is decisive, that date becomes a ranking signal in its own right.
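The hash-keyed update can be sketched with a plain dictionary standing in for the index. Everything here is an assumption for illustration: the `sync_document` name, the record layout, and the injected `chunk_fn` and `embed_fn` callables. A real vector store would need its own transactional swap.

```python
import hashlib


def content_hash(text):
    """Stable digest of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sync_document(index, doc_id, text, chunk_fn, embed_fn, updated_at):
    """Re-chunk and re-embed one document only if its content changed.

    index: dict doc_id -> {"hash", "updated_at", "chunks"}; stands in for a real store.
    Returns True if the document was re-indexed, False if it was skipped.
    """
    digest = content_hash(text)
    current = index.get(doc_id)
    if current and current["hash"] == digest:
        return False  # unchanged: skip chunking and embedding entirely
    # Build the full replacement first, then swap it in with one assignment
    # so readers never see a mix of old and new chunks.
    new_chunks = [
        {"text": chunk, "vector": embed_fn(chunk), "updated_at": updated_at}
        for chunk in chunk_fn(text)
    ]
    index[doc_id] = {"hash": digest, "updated_at": updated_at, "chunks": new_chunks}
    return True
```

Each chunk keeps its own `updated_at` stamp, so the same field can drive stale-content filtering or act as a recency signal at ranking time.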
Evaluation: you cannot improve what you don't measure
Every change discussed above (chunk size, embedding model, fusion constant, reranker, query rewriting) is a knob, and turning knobs by vibes is how RAG systems quietly regress. We wire a golden evaluation set into CI and measure two layers separately, because they fail for different reasons:
- Retrieval quality. A labelled set of queries with their known-relevant chunk IDs, scored on precision@k and recall@k. This tells you whether the right chunk is making it into the context at all — independent of what the model does with it.
- Answer quality. Faithfulness (is every claim supported by the retrieved context?) and groundedness (does the answer stick to the context rather than the model's memory?), plus correctness against reference answers. We use a model-as-judge for the faithfulness scoring, which scales far better than manual review.
# Retrieval recall@k over the golden set: the metric that catches regressions.
# Scored here as a hit rate: a query counts if any gold chunk reaches the top k.
def recall_at_k(eval_set, k=5):
    hits = 0
    for q in eval_set:
        retrieved = pipeline.retrieve(q.query, k=k)   # full hybrid+rerank path
        got = {c.id for c in retrieved}
        if got & set(q.relevant_chunk_ids):           # did any gold chunk survive?
            hits += 1
    return hits / len(eval_set)

assert recall_at_k(GOLDEN_SET, k=5) >= 0.90   # fail the build on a retrieval regression
The split matters: if answer quality drops, the two-layer score tells you immediately whether retrieval got worse or the generation prompt did, instead of leaving you guessing. We hold an engineering bar across everything we ship — our Sweep iOS app, for instance, runs 23/23 tests with zero third-party dependencies and 100% on-device — and RAG is no exception. A golden set in CI is what turns "it feels better" into a number you can defend.
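For queries with several relevant chunks, per-query precision@k and recall@k can be computed directly from ID lists. This is a minimal sketch using the standard definitions. The function names are illustrative, and retrieved IDs are assumed to be unique within a result list.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    found = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = found / k if k else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall


def mean_metrics(runs, k):
    """Average precision@k and recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0, 0.0
    scores = [precision_recall_at_k(retrieved, gold, k) for retrieved, gold in runs]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```

Precision@k shows how much of the context is noise, and recall@k shows how much of the known-relevant material made it in. Tracking both makes it visible when a change trades one for the other.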
Key takeaways
- Most "the model hallucinated" bugs are retrieval misses — log retrieved chunk IDs for every answer and the cause becomes visible.
- Chunk along document structure, carry headings and metadata into the chunk, and never split a table or code block mid-stream.
- Run hybrid search: dense vectors for semantic recall, BM25 for the exact terms, IDs, and acronyms that embeddings drop — fuse with RRF.
- A cross-encoder reranker over the top-k is the single highest-ROI upgrade: retrieve wide and cheap, then rerank narrow and precise.
- Rewrite, decompose, and HyDE-expand queries when they're vague; force the model to cite chunks and to refuse when context is insufficient.
- Index incrementally to fight staleness, and wire a golden eval set into CI measuring retrieval precision/recall@k and answer faithfulness separately.