Building a RAG Pipeline That Actually Cites Its Sources

Most RAG demos return an answer and call it done. Production systems can’t afford that. If an AI tells a user their insurance covers something it doesn’t, you need to know why it said that — and so does the user.

Citation-grounded RAG is the difference between an AI that sounds confident and one that can be audited. This post walks through how I built a pipeline where every answer is traceable to an exact document, page number, and excerpt — based on my Healthcare Plan RAG Demo using FastAPI, PostgreSQL, and an LLM.


Why Most RAG Pipelines Skip Citations

The standard RAG pattern is:

  1. Embed the user’s query
  2. Retrieve the top-k similar chunks from a vector store
  3. Stuff those chunks into a prompt
  4. Return the LLM’s response

That works for demos. It fails in production because you lose the connection between what the model said and where it came from. Chunks go in, an answer comes out, and there’s no way to verify it or explain it to a user.
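To make the gap concrete, here is a minimal sketch of the four steps above. The `retrieve` and `generate` callables and the `text` key are placeholders I'm assuming for illustration, not part of any particular library.

```python
from typing import Callable

def naive_rag(
    question: str,
    retrieve: Callable[[str, int], list[dict]],  # hypothetical: embeds the query, returns top-k chunks
    generate: Callable[[str], str],              # hypothetical: calls the LLM with a prompt
    top_k: int = 4,
) -> str:
    # Steps 1-2: embed the query and fetch the top-k chunks (both hidden inside `retrieve`).
    chunks = retrieve(question, top_k)
    # Step 3: stuff the chunk text into a prompt. Provenance is dropped here:
    # only the text survives, not the document or page it came from.
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Step 4: return a bare string; nothing links it back to a source.
    return generate(prompt)
```

Even if the retriever returns `source_doc` and `page_number`, the join in step 3 throws them away, so neither the model nor the caller can say where a claim came from.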

Adding citations requires a bit more structure at every stage — but it’s not complicated once you design for it from the start.


Step 1: Store Metadata Alongside Embeddings

Citations start at ingestion time. When you chunk your documents, each chunk needs to carry its provenance with it — not just its text and embedding vector.

In my healthcare demo, each chunk stored in PostgreSQL includes:

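-- The VECTOR type comes from the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE plan_chunks (
    id          SERIAL PRIMARY KEY,
    plan_id     TEXT NOT NULL,
    source_doc  TEXT NOT NULL,   -- e.g. "SBC_BlueCross_Gold.pdf"
    page_number INT,
    section     TEXT,            -- e.g. "Out-of-Network Emergency Care"
    chunk_text  TEXT NOT NULL,
    embedding   VECTOR(1536)     -- dimension must match the embedding model
);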

Every chunk knows which plan it belongs to, which document it came from, and which page. That metadata is retrieved alongside the chunk text — so you always know where your context came from before you even call the LLM.
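On the ingestion side, one way to keep page numbers exact is to chunk page by page so that no chunk spans two pages. The sketch below is an illustration under that assumption, not the demo's actual ingestion code: `ChunkRecord`, `chunk_pages`, and the 800-character limit are names and values I've made up, and section detection is left out (the field stays `None`).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkRecord:
    plan_id: str
    source_doc: str
    page_number: int
    section: Optional[str]
    chunk_text: str

def chunk_pages(pages: list[str], plan_id: str, source_doc: str,
                max_chars: int = 800) -> list[ChunkRecord]:
    """Split each page into paragraph-aligned chunks that never cross a page boundary."""
    records: list[ChunkRecord] = []
    for page_number, page_text in enumerate(pages, start=1):
        buffer = ""
        for para in (p.strip() for p in page_text.split("\n\n")):
            if not para:
                continue
            # Flush the buffer when adding this paragraph would exceed the limit.
            if buffer and len(buffer) + len(para) + 2 > max_chars:
                records.append(ChunkRecord(plan_id, source_doc, page_number, None, buffer))
                buffer = ""
            buffer = f"{buffer}\n\n{para}" if buffer else para
        # Flush whatever is left at the end of the page.
        if buffer:
            records.append(ChunkRecord(plan_id, source_doc, page_number, None, buffer))
    return records
```

Each record maps one-to-one onto a row of `plan_chunks` once its embedding is computed. The trade-off is that a sentence split across a page break ends up in two chunks, which I'd accept here in exchange for a citation that always points at a single page.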


Step 2: Return Chunks with Their Metadata

When you retrieve chunks at query time, don’t just pass the text to the LLM. Keep the full structured result:

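async def retrieve_chunks(query: str, plan_ids: list[str], top_k: int = 8):
    # `embed` and `db` are defined elsewhere in the app (embedding call and database handle).
    query_embedding = await embed(query)
    # `<=>` is pgvector's cosine distance operator, so 1 - distance is cosine similarity.
    results = await db.fetch("""
        SELECT chunk_text, plan_id, source_doc, page_number, section,
               1 - (embedding <=> $1) AS similarity
        FROM plan_chunks
        WHERE plan_id = ANY($2)
        ORDER BY embedding <=> $1
        LIMIT $3
    """, query_embedding, plan_ids, top_k)
    # Keep every column: the metadata travels with the text from here on.
    return [dict(r) for r in results]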

Each item in the returned list has both the text the LLM will read and the metadata you’ll include in the citation. Don’t discard the metadata before the LLM call — you need it after.
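If you want to unit-test the retrieval contract without a database, the same ranking can be approximated in memory. This is a rough stand-in I'm assuming for illustration (`rank_chunks` and the row layout are hypothetical); it mirrors the query's plan filter, cosine-similarity ordering, and `LIMIT`, and it keeps the metadata on every row.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Returning 0.0 for a zero-length vector is a choice made for this sketch.
    return dot / norm if norm else 0.0

def rank_chunks(query_embedding: list[float], rows: list[dict],
                plan_ids: list[str], top_k: int = 8) -> list[dict]:
    """In-memory equivalent of the SQL: filter by plan, score, sort, limit."""
    scored = []
    for row in rows:
        if row["plan_id"] not in plan_ids:
            continue
        # Copy every field except the raw vector, then attach the score.
        result = {k: v for k, v in row.items() if k != "embedding"}
        result["similarity"] = cosine_similarity(query_embedding, row["embedding"])
        scored.append(result)
    scored.sort(key=lambda r: r["similarity"], reverse=True)
    return scored[:top_k]
```

The point of the test double is the shape of the output: a list of dicts where text, provenance, and score sit side by side, which is exactly what the later steps rely on.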


Step 3: Tell the LLM to Cite Its Sources

The prompt is where you enforce citation behavior. The key is to number the chunks and instruct the model to reference them explicitly — and to say so when the answer isn’t in the provided context.

def build_prompt(question: str, chunks: list[dict]) -> str:
    context_blocks = "\n\n".join([
        f"[{i+1}] Plan: {c['plan_id']} | Doc: {c['source_doc']} | Page: {c['page_number']}\n{c['chunk_text']}"
        for i, c in enumerate(chunks)
    ])
    return f"""You are answering questions about health insurance plans.
Use ONLY the excerpts below. For each claim, cite the excerpt number like [1], [2].
If the answer varies by plan, say so explicitly.
If the information is not present in the excerpts, say: "This information is not available in the provided plan documents."

Excerpts:
{context_blocks}

Question: {question}
Answer:"""

This prompt does three things: constrains the model to the provided context, forces explicit citation markers, and gives it an honest escape hatch when the answer isn’t there.
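The escape hatch is only useful if the rest of the pipeline can recognize it. Because the prompt fixes the exact wording, a simple string check is usually enough to detect a "not available" answer and skip citation parsing. The helper below is a sketch under that assumption; `is_not_available` is a name I'm introducing, and a model that paraphrases the sentence would slip past it.

```python
import re

# Must match the sentence given to the model in the prompt.
NOT_AVAILABLE = "This information is not available in the provided plan documents."

def _normalize(text: str) -> str:
    # Collapse whitespace, lowercase, and drop a trailing period so that
    # minor formatting differences do not defeat the comparison.
    return re.sub(r"\s+", " ", text).strip().lower().rstrip(".")

def is_not_available(answer: str) -> bool:
    """True if the answer contains the prompt's fixed 'not available' sentence."""
    return _normalize(NOT_AVAILABLE) in _normalize(answer)
```

When this returns True, the API can return an empty citations list and the UI can show a "not found in your plan documents" state instead of an answer with no sources.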


Step 4: Parse Citations from the Response

After the LLM responds, extract the citation numbers and map them back to your chunk metadata:

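import re

def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    # Markers are 1-based in the prompt; convert to 0-based list indices.
    # sorted() keeps the citation order stable across runs.
    cited_indices = sorted({int(n) - 1 for n in re.findall(r'\[(\d+)\]', answer)})
    return [
        {
            "plan_id":    chunks[i]["plan_id"],
            "source_doc": chunks[i]["source_doc"],
            "page":       chunks[i]["page_number"],
            "section":    chunks[i]["section"],
            "excerpt":    chunks[i]["chunk_text"][:300]
        }
        # Skip markers that match no excerpt. The lower bound matters: "[0]" would
        # otherwise become index -1 and silently cite the last chunk.
        for i in cited_indices if 0 <= i < len(chunks)
    ]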
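Models occasionally emit markers that match no excerpt, or answer without citing anything at all. Both cases are worth flagging before the response goes out. The check below is a sketch of one way to do that; `check_citations` and its return shape are my own assumptions, not part of the demo.

```python
import re

MARKER = re.compile(r"\[(\d+)\]")

def check_citations(answer: str, num_chunks: int) -> dict:
    """Classify the [n] markers in an answer against the number of excerpts supplied."""
    numbers = {int(n) for n in MARKER.findall(answer)}
    valid = sorted(n for n in numbers if 1 <= n <= num_chunks)
    invalid = sorted(n for n in numbers if not 1 <= n <= num_chunks)
    return {
        "valid": valid,          # markers that map to a supplied excerpt
        "invalid": invalid,      # markers pointing at excerpts that were never supplied
        "uncited": not valid,    # True when no claim can be traced to a source
    }
```

What to do with the result is a product decision: you might log invalid markers, retry the generation, or decline to show an uncited answer in a high-stakes domain.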

The final API response then looks like this:

{
  "answer": "Out-of-network emergency care is covered under all three plans [1][2], though cost-sharing varies. Plan A covers it at 80% after deductible [1], while Plan B requires a $500 copay [2].",
  "citations": [
    {
      "plan_id": "PLAN_A",
      "source_doc": "SBC_PlanA_2026.pdf",
      "page": 4,
      "section": "Emergency Care",
      "excerpt": "Emergency services provided by out-of-network providers are covered at 80% of allowed amount after the annual deductible is met..."
    },
    {
      "plan_id": "PLAN_B",
      "source_doc": "SBC_PlanB_2026.pdf",
      "page": 6,
      "section": "Out-of-Network Benefits",
      "excerpt": "Out-of-network emergency room visits require a $500 copayment, which is waived if admitted..."
    }
  ]
}


Step 5: Render Citations in the UI

On the frontend, render the citations as an expandable accordion — visible but not overwhelming. The user sees the answer first. If they want to verify it, the source is one click away with the exact document, page, and excerpt.

This is what I've built in the HCGov Demo — try asking "Is out-of-network covered for emergency care?" and expanding the citations panel to see exactly which plan documents the answer came from.
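For a server-rendered page, the accordion can be as small as a native HTML `<details>` element, which is collapsed by default and needs no JavaScript. The function below is a minimal sketch assuming the citation dicts from Step 4; `render_citations` and the `citations` CSS class are hypothetical, and a JavaScript frontend would apply the same structure in its own templating.

```python
from html import escape

def render_citations(citations: list[dict]) -> str:
    """Render citations as a collapsed <details> accordion (empty string if none)."""
    if not citations:
        return ""
    items = []
    for c in citations:
        # page and section are nullable in the schema, so handle missing values.
        page = f"page {c['page']}" if c.get("page") is not None else "page unknown"
        label = f"{c['source_doc']}, {page}"
        if c.get("section"):
            label += f", {c['section']}"
        # Escape everything that originated in a document or the database.
        items.append(
            "<li><strong>" + escape(label) + "</strong>"
            "<blockquote>" + escape(c["excerpt"]) + "</blockquote></li>"
        )
    return (
        '<details class="citations">'
        f"<summary>Sources ({len(items)})</summary>"
        "<ol>" + "".join(items) + "</ol></details>"
    )
```

Escaping matters here because excerpts are raw document text; a stray angle bracket in a PDF should not be able to break the page.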


Why This Matters for Production

In 2026, the retrieval step is increasingly recognized as the critical bottleneck in RAG pipelines, more so than generation. Evaluation frameworks such as RAGAS treat faithfulness (is the answer supported by the retrieved context?) as a first-class metric, and explicit citations are what let a human reviewer run the same check by hand. Systems that can't be audited can't be trusted in high-stakes domains.
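A cheap proxy you can track alongside a proper evaluation is citation coverage: the share of sentences in an answer that carry at least one marker. To be clear, this is not a RAGAS metric and says nothing about whether the cited excerpt actually supports the claim; it is a rough sketch, and `citation_coverage` is a name I'm introducing here.

```python
import re

def citation_coverage(answer: str) -> float:
    """Fraction of sentences containing at least one [n] marker (0.0 for empty input)."""
    # Naive sentence split on ., !, or ? followed by whitespace. This assumes
    # markers are placed before the closing punctuation, as in "... copay [2]."
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    return cited / len(sentences)
```

A sudden drop in this number after a prompt or model change is a useful early warning that answers are drifting away from their sources.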

Citation-grounded RAG is also the foundation for everything more advanced: user-defined weighting, personalized recommendations, explainable AI decisions. You can't build on top of a black box.


Summary

  1. Store metadata at ingestion — every chunk should know its document, page, and section.
  2. Retrieve metadata alongside text — don't strip it before the LLM call.
  3. Prompt for explicit citation markers — number your chunks and tell the model to reference them.
  4. Parse and map citations back — extract citation numbers and attach the full source metadata.
  5. Surface citations in the UI — give users a way to verify what the AI told them.

If you're building a RAG system that needs to hold up under real scrutiny, this pattern is the foundation. Happy to dig into any part of it — reach out via the contact page or explore the live demo.


Martin Baker

Martin Baker — Solutions Architect specializing in AI, RAG systems, and WordPress engineering. 15+ years building systems that hold up under real business pressure.
