EXPLETUS
AI Architecture · 4 min read

RAG is a data system, not a prompt trick

Why reliable retrieval-augmented generation depends more on document structure, retrieval policy, evaluation, and access control than on a clever prompt.

Retrieval-augmented generation is often explained as a simple recipe: split documents into chunks, embed them, retrieve the nearest matches, and place them in a prompt.

That recipe is enough for a demo. It is not enough for a dependable system.

Production RAG is a data system with a language model at the end. Its quality depends on ingestion, document structure, permissions, ranking, freshness, citation, and evaluation. A stronger model can improve the final answer, but it cannot reliably repair missing or wrongly retrieved evidence.

The real pipeline

A useful RAG pipeline has at least five stages:

  1. Ingestion: collect documents, preserve metadata, and detect changes.
  2. Representation: decide what a retrievable unit means for each document type.
  3. Retrieval: find candidate evidence using lexical, semantic, or structured search.
  4. Assembly: select, order, and compress evidence for the model.
  5. Generation: answer from the evidence and expose uncertainty or citations.

Every stage can fail independently. Treating the whole pipeline as “vector search” hides where the errors actually come from.
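
One way to keep those failures visible is to record what each stage produced. The sketch below covers only the three query-time stages and uses toy stand-ins; `run_with_trace`, `CORPUS`, and the stage functions are hypothetical names for illustration, not part of any library.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_with_trace(query: str, stages: list[tuple[str, Stage]]) -> tuple[Any, list[tuple[str, Any]]]:
    """Run the stages in order and record each stage's output."""
    trace: list[tuple[str, Any]] = []
    value: Any = query
    for name, stage in stages:
        value = stage(value)
        trace.append((name, value))
    return value, trace

# Toy corpus standing in for the output of ingestion and representation.
CORPUS = {
    "doc-1": "Refunds are issued within 14 days.",
    "doc-2": "Shipping takes 3 days.",
}

def retrieve(query: str) -> list[str]:
    """Naive word-overlap retrieval (an assumption, not a real ranker)."""
    words = set(query.lower().split())
    return [
        doc_id
        for doc_id, text in CORPUS.items()
        if words & set(text.lower().rstrip(".").split())
    ]

def assemble(doc_ids: list[str]) -> list[tuple[str, str]]:
    """Keep only the first candidate as evidence."""
    return [(doc_id, CORPUS[doc_id]) for doc_id in doc_ids[:1]]

def generate(evidence: list[tuple[str, str]]) -> str:
    """Stand-in for the model: quote the evidence with a citation, or abstain."""
    if not evidence:
        return "No supporting evidence found."
    doc_id, text = evidence[0]
    return f"{text} [{doc_id}]"

STAGES = [("retrieval", retrieve), ("assembly", assemble), ("generation", generate)]
```

With a trace like this, a wrong answer can be attributed to the stage whose output first went wrong, rather than to "vector search" as a whole.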

Chunking is a modeling decision

Fixed chunks of 500 tokens are convenient, but documents are not naturally made of equal rectangles.

A policy document has sections and exceptions. Source code has symbols and dependencies. A support thread has chronology. A spreadsheet has rows, columns, formulas, and headers. Splitting all of them at a fixed token or character count destroys that structure.

Good chunking follows the information:

  • Keep headings with the paragraphs they govern.
  • Preserve table headers with relevant rows.
  • Index code by symbol while retaining file and dependency metadata.
  • Keep conversation turns in temporal order.
  • Store parent-child relationships so retrieval can expand around a match.

The embedding is only as meaningful as the unit it represents.
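
As a minimal sketch of two of these ideas (headings kept with their paragraphs, and parent-child links for expansion), the code below assumes a simplified input where heading lines start with "# ". That marker, the `Chunk` fields, and both function names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str  # the section this paragraph belongs to
    heading: str    # heading that governs the paragraph
    text: str

def chunk_by_heading(doc_id: str, lines: list[str]) -> list[Chunk]:
    """One chunk per paragraph line, each carrying its section heading."""
    chunks: list[Chunk] = []
    heading = ""
    section = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):  # assumed heading marker
            heading = line[2:]
            section += 1
            continue
        parent_id = f"{doc_id}/s{section}"
        chunk_id = f"{parent_id}/p{len(chunks)}"
        chunks.append(Chunk(chunk_id, parent_id, heading, line))
    return chunks

def expand_to_parent(match: Chunk, chunks: list[Chunk]) -> str:
    """Return the whole section around a matched chunk, heading first."""
    siblings = [c.text for c in chunks if c.parent_id == match.parent_id]
    return match.heading + "\n" + "\n".join(siblings)
```

A match on an exception clause can then be expanded to the full section it modifies, instead of being handed to the model without the rule it qualifies.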

Semantic similarity is not relevance

Embedding search retrieves text with a related meaning. That is not always the evidence the user needs.

Exact identifiers, dates, product codes, legal clauses, and error messages often benefit from lexical search. Broader questions benefit from semantic search. Filters may be required for tenant, language, date, document status, or permission level.

This is why mature systems often use hybrid retrieval:

Layer             What it is good at
Lexical search    Exact names, codes, phrases, and rare terms
Vector search     Paraphrases and conceptual similarity
Metadata filters  Scope, permissions, freshness, and document type
Reranking         Comparing candidates against the complete question

Retrieval should be a policy, not one database query.
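
To illustrate what a retrieval policy might look like, the sketch below filters two ranked lists by an allowed set and merges them with reciprocal rank fusion. RRF is one common fusion method chosen here for illustration; the article does not prescribe it. The function names are hypothetical, the constant `k = 60` is a conventional default, and reranking is left out.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def hybrid_retrieve(
    lexical: list[str],
    vector: list[str],
    allowed: set[str],
    top_n: int = 3,
) -> list[str]:
    """Apply the metadata filter first, then fuse the two rankings."""
    lexical = [d for d in lexical if d in allowed]
    vector = [d for d in vector if d in allowed]
    return reciprocal_rank_fusion([lexical, vector])[:top_n]
```

Documents that rank well in both lists rise to the top, while a document outside the allowed set never reaches the fused list at all.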

More context can make the answer worse

Adding the top twenty chunks feels safer than adding the top five. It can instead dilute the useful evidence, introduce contradictions, increase latency, and make citations ambiguous.

The goal is not maximum context. It is the smallest sufficient evidence set.

A practical assembler can remove duplicates, group passages from the same source, prefer current documents, and reserve space for instructions and tool results. For long sources, it can retrieve a precise passage and then expand to its parent section.
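
A rough sketch of such an assembler follows. It assumes passages arrive already scored, counts tokens by whitespace (a crude stand-in for a real tokenizer), and uses hypothetical names throughout (`Passage`, `assemble_context`, `is_current`).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    source: str
    text: str
    score: float
    is_current: bool = True

def assemble_context(
    passages: list[Passage],
    budget_tokens: int,
    reserved_tokens: int = 0,
) -> list[Passage]:
    """Deduplicate, prefer current documents, respect a budget, group by source."""
    available = budget_tokens - reserved_tokens  # space kept for instructions
    seen: set[str] = set()
    selected: list[Passage] = []
    used = 0
    # Current documents first, then by descending retrieval score.
    ranked = sorted(passages, key=lambda p: (not p.is_current, -p.score))
    for p in ranked:
        key = " ".join(p.text.lower().split())  # normalized text for dedup
        if key in seen:
            continue
        cost = len(p.text.split())  # assumption: whitespace token count
        if used + cost > available:
            continue
        seen.add(key)
        selected.append(p)
        used += cost
    # Group passages from the same source, keeping first-seen source order.
    order: dict[str, int] = {}
    for p in selected:
        order.setdefault(p.source, len(order))
    return sorted(selected, key=lambda p: order[p.source])
```

Because outdated passages are considered last, a tight budget tends to exclude them, which reduces the chance of handing the model two contradictory versions of the same rule.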

Evaluate retrieval separately from generation

Teams often read a few answers and decide that the system “feels good.” That mixes two different questions:

  1. Did the system retrieve the evidence needed to answer?
  2. Did the model use that evidence correctly?

Build a small set of real questions with known relevant sources. Measure whether those sources appear in the candidates, whether the final context contains them, and whether the answer is supported.

Useful failure labels include:

  • Source was never ingested.
  • Correct chunk existed but was not retrieved.
  • Correct candidate was removed during assembly.
  • Retrieved sources conflicted.
  • Model ignored or misread correct evidence.
  • Answer should have abstained.

These labels tell an engineering team what to fix.
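
Several of these labels can be assigned mechanically once each test question records what was ingested, retrieved, and passed to the model. The sketch below assumes a hypothetical case format (a dict with `relevant`, `ingested`, `candidates`, `context`, and a human- or judge-supplied `answer_supported` flag). Conflicting sources and missed abstentions need judgment and are not covered.

```python
from collections import Counter

def recall_at_k(relevant: list[str], candidates: list[str], k: int) -> float:
    """Fraction of the known relevant sources found in the top k candidates."""
    wanted = set(relevant)
    if not wanted:
        return 0.0
    return len(wanted & set(candidates[:k])) / len(wanted)

def label_case(case: dict) -> str:
    """Name the first stage at which the needed evidence was lost."""
    relevant = set(case["relevant"])
    if not relevant <= set(case["ingested"]):
        return "not_ingested"
    if not relevant <= set(case["candidates"]):
        return "not_retrieved"
    if not relevant <= set(case["context"]):
        return "dropped_in_assembly"
    if not case["answer_supported"]:
        return "generation_error"
    return "ok"

def summarize(cases: list[dict]) -> Counter:
    """Count failure labels across the evaluation set."""
    return Counter(label_case(case) for case in cases)
```

Tracking these counts over time shows whether a change to chunking or ranking moved failures from one stage to another or actually removed them.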

Permissions belong inside retrieval

Filtering a generated answer after retrieval is too late. By then, private information has already entered the model context.

Authorization must constrain the candidate set before content is retrieved. The retrieval layer needs the same seriousness as an API or database: tenant boundaries, user roles, audit logs, deletion behavior, and source-level permissions.
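
A minimal sketch of that ordering, assuming a simple tenant-plus-role model: the authorized set is computed first, and only documents inside it are ever scored or read. `Doc`, `User`, `authorized_ids`, and the word-overlap matching are hypothetical, and a real system would push the filter into the index query rather than loop in application code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    allowed_roles: frozenset
    text: str

@dataclass(frozen=True)
class User:
    tenant: str
    roles: frozenset

def authorized_ids(user: User, index: list[Doc]) -> set[str]:
    """Ids the user may see: same tenant and at least one shared role."""
    return {
        d.doc_id
        for d in index
        if d.tenant == user.tenant and d.allowed_roles & user.roles
    }

def search(query: str, user: User, index: list[Doc]) -> list[str]:
    """Score only authorized documents; the rest never become candidates."""
    allowed = authorized_ids(user, index)
    words = set(query.lower().split())
    hits = []
    for d in index:
        if d.doc_id not in allowed:
            continue  # text of unauthorized documents is never inspected
        if words & set(d.text.lower().split()):
            hits.append(d.doc_id)
    return hits
```

The point of the design is that nothing downstream of `search`, including the prompt, can contain text the user was not allowed to retrieve.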

Our opinion

RAG is useful, but it is frequently sold as a shortcut around information architecture. It is the opposite. RAG makes the quality of your information architecture visible.

The systems that last will not be the ones with the cleverest “chat with your documents” prompt. They will be the ones that can explain which sources were available, why particular evidence was selected, what permissions were applied, and how retrieval quality is measured over time.