Most RAG demos return an answer and call it done. Production systems can’t afford that. If an AI tells a user their insurance covers something it doesn’t, you need to know why it said that — and so does the user.
Citation-grounded RAG is the difference between an AI that sounds confident and one that can be audited. This post walks through how I built a pipeline where every answer is traceable to an exact document, page number, and excerpt — based on my Healthcare Plan RAG Demo using FastAPI, PostgreSQL, and an LLM.
Why Most RAG Pipelines Skip Citations
The standard RAG pattern is:
- Embed the user’s query
- Retrieve the top-k similar chunks from a vector store
- Stuff those chunks into a prompt
- Return the LLM’s response
That works for demos. It fails in production because you lose the connection between what the model said and where it came from. Chunks go in, an answer comes out, and there’s no way to verify it or explain it to a user.
Adding citations requires a bit more structure at every stage — but it’s not complicated once you design for it from the start.
Step 1: Store Metadata Alongside Embeddings
Citations start at ingestion time. When you chunk your documents, each chunk needs to carry its provenance with it — not just its text and embedding vector.
In my healthcare demo, each chunk stored in PostgreSQL includes:
CREATE TABLE plan_chunks (
    id          SERIAL PRIMARY KEY,
    plan_id     TEXT NOT NULL,
    source_doc  TEXT NOT NULL,   -- e.g. "SBC_BlueCross_Gold.pdf"
    page_number INT,
    section     TEXT,            -- e.g. "Out-of-Network Emergency Care"
    chunk_text  TEXT NOT NULL,
    embedding   VECTOR(1536)     -- VECTOR type comes from the pgvector extension
);
Every chunk knows which plan it belongs to, which document it came from, and which page. That metadata is retrieved alongside the chunk text — so you always know where your context came from before you even call the LLM.
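As a rough sketch of what that looks like at ingestion time, the snippet below turns per-page text into chunk records that carry their provenance. The `ChunkRecord` type, the `chunk_pages` helper, and the fixed-size character splitting are assumptions for illustration, not the demo's actual ingestion code; embeddings and section titles would be filled in by later steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChunkRecord:
    """One row destined for plan_chunks (the embedding is added later)."""
    plan_id: str
    source_doc: str
    page_number: int
    section: Optional[str]
    chunk_text: str


def chunk_pages(plan_id: str, source_doc: str, pages: list[str],
                max_chars: int = 800) -> list[ChunkRecord]:
    """Split each page into fixed-size pieces, tagging every piece with its page.

    `pages` is assumed to hold the extracted text of each page, in order.
    Chunks never span a page boundary, so page_number is always exact.
    """
    records = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        for start in range(0, len(text), max_chars):
            piece = text[start:start + max_chars].strip()
            if piece:
                records.append(
                    ChunkRecord(plan_id, source_doc, page_number, None, piece)
                )
    return records


records = chunk_pages(
    "PLAN_A",
    "SBC_PlanA_2026.pdf",
    ["Deductible details. " * 50, "Emergency care is covered."],
)
```

Splitting per page rather than across the whole document trades slightly worse chunk boundaries for a page number that is never ambiguous, which is what the citation needs.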
Step 2: Return Chunks with Their Metadata
When you retrieve chunks at query time, don’t just pass the text to the LLM. Keep the full structured result:
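async def retrieve_chunks(query: str, plan_ids: list[str], top_k: int = 8):
    # embed() and db are the app's embedding helper and database connection
    query_embedding = await embed(query)
    # <=> is pgvector's cosine distance, so 1 - distance is cosine similarity
    results = await db.fetch("""
        SELECT chunk_text, plan_id, source_doc, page_number, section,
               1 - (embedding <=> $1) AS similarity
        FROM plan_chunks
        WHERE plan_id = ANY($2)
        ORDER BY embedding <=> $1
        LIMIT $3
    """, query_embedding, plan_ids, top_k)
    # Keep every column: the metadata is needed again after the LLM call
    return [dict(r) for r in results]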
Each item in the returned list has both the text the LLM will read and the metadata you’ll include in the citation. Don’t discard the metadata before the LLM call — you need it after.
Step 3: Tell the LLM to Cite Its Sources
The prompt is where you enforce citation behavior. The key is to number the chunks and instruct the model to reference them explicitly — and to say so when the answer isn’t in the provided context.
def build_prompt(question: str, chunks: list[dict]) -> str:
    # Number excerpts from 1 so a marker like [1] maps back to chunks[0]
    context_blocks = "\n\n".join([
        f"[{i+1}] Plan: {c['plan_id']} | Doc: {c['source_doc']} | Page: {c['page_number']}\n{c['chunk_text']}"
        for i, c in enumerate(chunks)
    ])
    return f"""You are answering questions about health insurance plans.
Use ONLY the excerpts below. For each claim, cite the excerpt number like [1], [2].
If the answer varies by plan, say so explicitly.
If the information is not present in the excerpts, say: "This information is not available in the provided plan documents."
Excerpts:
{context_blocks}
Question: {question}
Answer:"""
This prompt does three things: constrains the model to the provided context, forces explicit citation markers, and gives it an honest escape hatch when the answer isn’t there.
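Because the fallback sentence is fixed, the calling code can detect it and treat the response as "no answer" instead of parsing it for citations. A minimal sketch, assuming the model repeats the sentence close to verbatim (the `is_unanswered` helper is hypothetical):

```python
NOT_AVAILABLE = "This information is not available in the provided plan documents."


def is_unanswered(answer: str) -> bool:
    """Return True when the answer contains the fallback sentence from the prompt.

    The comparison ignores case, extra whitespace, and the trailing period,
    since models do not always reproduce the sentence exactly. It also matches
    when the sentence is embedded in a longer answer.
    """
    normalized = " ".join(answer.lower().split()).rstrip(".")
    target = NOT_AVAILABLE.lower().rstrip(".")
    return target in normalized
```

A paraphrased refusal would slip past this check, so it is best treated as a convenience rather than a guarantee.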
Step 4: Parse Citations from the Response
After the LLM responds, extract the citation numbers and map them back to your chunk metadata:
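import re

def extract_citations(answer: str, chunks: list[dict]) -> list[dict]:
    # Markers are 1-based in the prompt; convert to 0-based list indices
    cited_indices = {int(n) - 1 for n in re.findall(r'\[(\d+)\]', answer)}
    return [
        {
            "plan_id": chunks[i]["plan_id"],
            "source_doc": chunks[i]["source_doc"],
            "page": chunks[i]["page_number"],
            "section": chunks[i]["section"],
            "excerpt": chunks[i]["chunk_text"][:300]
        }
        # sorted() keeps the order stable; the lower bound rejects a stray [0],
        # which would otherwise become index -1 and cite the last chunk
        for i in sorted(cited_indices) if 0 <= i < len(chunks)
    ]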
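The parser above quietly ignores markers that do not map to a chunk. It can be useful to surface those instead, since a citation like [9] against eight excerpts means the model pointed at a source it was never given. One possible check is sketched below; the `find_invalid_citations` name is an assumption.

```python
import re


def find_invalid_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited numbers that do not correspond to any provided excerpt.

    Excerpts are numbered from 1 in the prompt, so valid markers are
    1..num_chunks inclusive.
    """
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


answer = "Plan A covers it at 80% [1]. Plan C has no coverage [9][0]."
invalid = find_invalid_citations(answer, num_chunks=8)
print(invalid)  # [0, 9]
```

A non-empty result is a reasonable trigger for logging the response or retrying the generation.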
The final API response then looks like this:
{
  "answer": "Out-of-network emergency care is covered under all three plans [1][2], though cost-sharing varies. Plan A covers it at 80% after deductible [1], while Plan B requires a $500 copay [2].",
  "citations": [
    {
      "plan_id": "PLAN_A",
      "source_doc": "SBC_PlanA_2026.pdf",
      "page": 4,
      "section": "Emergency Care",
      "excerpt": "Emergency services provided by out-of-network providers are covered at 80% of allowed amount after the annual deductible is met..."
    },
    {
      "plan_id": "PLAN_B",
      "source_doc": "SBC_PlanB_2026.pdf",
      "page": 6,
      "section": "Out-of-Network Benefits",
      "excerpt": "Out-of-network emergency room visits require a $500 copayment, which is waived if admitted..."
    }
  ]
}
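Putting the steps together, the request handler might look roughly like the sketch below. To keep it self-contained, the retrieval, prompt-building, generation, and parsing steps are passed in as callables; `answer_question` and its parameter names are assumptions, and the stubs stand in for the real database and LLM calls.

```python
import asyncio
from typing import Awaitable, Callable

Chunk = dict


async def answer_question(
    question: str,
    plan_ids: list[str],
    retrieve: Callable[[str, list[str]], Awaitable[list[Chunk]]],
    build_prompt: Callable[[str, list[Chunk]], str],
    generate: Callable[[str], Awaitable[str]],
    extract_citations: Callable[[str, list[Chunk]], list[dict]],
) -> dict:
    """Run retrieve -> prompt -> generate -> parse and shape the API response."""
    chunks = await retrieve(question, plan_ids)
    prompt = build_prompt(question, chunks)
    answer = await generate(prompt)
    # The same `chunks` list feeds both the prompt and the parser, so
    # marker [n] always refers to the chunk the model saw as excerpt n.
    return {"answer": answer, "citations": extract_citations(answer, chunks)}


# Stubs for illustration only; real versions would hit PostgreSQL and an LLM.
async def fake_retrieve(question: str, plan_ids: list[str]) -> list[Chunk]:
    return [{"plan_id": "PLAN_A", "source_doc": "SBC_PlanA_2026.pdf",
             "page_number": 4, "section": "Emergency Care",
             "chunk_text": "Covered at 80% after deductible."}]


async def fake_generate(prompt: str) -> str:
    return "Plan A covers it at 80% after deductible [1]."


def simple_prompt(question: str, chunks: list[Chunk]) -> str:
    return question


def simple_extract(answer: str, chunks: list[Chunk]) -> list[dict]:
    return [{"plan_id": c["plan_id"], "page": c["page_number"]} for c in chunks]


response = asyncio.run(answer_question(
    "Is out-of-network emergency care covered?", ["PLAN_A"],
    fake_retrieve, simple_prompt, fake_generate, simple_extract,
))
```

The detail that matters is that the chunk list is never re-fetched or re-ordered between building the prompt and parsing the answer; otherwise the numbers in the answer stop lining up with the metadata.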
Step 5: Render Citations in the UI
On the frontend, render the citations as an expandable accordion — visible but not overwhelming. The user sees the answer first. If they want to verify it, the source is one click away with the exact document, page, and excerpt.
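One lightweight way to get that accordion behavior is the HTML `<details>` element, which is collapsed by default and expands on click without any JavaScript. The sketch below renders a citation list server-side; the `render_citations` helper and the markup are assumptions, not the demo's actual frontend.

```python
from html import escape


def render_citations(citations: list[dict]) -> str:
    """Render each citation as a collapsed <details> block.

    Every value is HTML-escaped because excerpts come from ingested
    documents and should not be trusted as markup.
    """
    blocks = []
    for c in citations:
        summary = f"{escape(str(c['source_doc']))}, page {escape(str(c['page']))}"
        blocks.append(
            "<details>"
            f"<summary>{summary}</summary>"
            f"<p><strong>{escape(str(c['section']))}</strong></p>"
            f"<blockquote>{escape(str(c['excerpt']))}</blockquote>"
            "</details>"
        )
    return "\n".join(blocks)


html_out = render_citations([{
    "source_doc": "SBC_PlanA_2026.pdf",
    "page": 4,
    "section": "Emergency Care",
    "excerpt": "Covered at 80% <after> deductible",
}])
```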
This is what I've built in the HCGov Demo — try asking "Is out-of-network covered for emergency care?" and expanding the citations panel to see exactly which plan documents the answer came from.
Why This Matters for Production
In 2026, retrieval, not generation, is increasingly recognized as the critical bottleneck in RAG pipelines. The RAGAS evaluation framework treats faithfulness (is every claim in the answer supported by the retrieved context?) as a first-class metric, alongside retrieval metrics such as context precision and context recall. Systems that can't be audited can't be trusted in high-stakes domains.
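As a much cruder proxy than a full faithfulness evaluation, you can at least track how many sentences in an answer carry a citation marker. The sketch below is not how RAGAS computes its metrics; `citation_coverage` and its naive sentence splitting are assumptions for illustration.

```python
import re


def citation_coverage(answer: str) -> float:
    """Fraction of sentences that contain at least one [n] marker.

    Sentences are split naively on ., !, or ? followed by whitespace, so
    abbreviations can be split incorrectly. Returns 0.0 for an empty answer.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    return cited / len(sentences)


answer = "Plan A covers it at 80% after deductible [1]. Plan B is cheaper overall."
coverage = citation_coverage(answer)
print(coverage)  # 0.5
```

A low score does not prove the answer is wrong, and a high score does not prove the cited excerpts actually support the claims; it only shows where a reviewer should look first.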
Citation-grounded RAG is also the foundation for everything more advanced: user-defined weighting, personalized recommendations, explainable AI decisions. You can't build on top of a black box.
Summary
- Store metadata at ingestion — every chunk should know its document, page, and section.
- Retrieve metadata alongside text — don't strip it before the LLM call.
- Prompt for explicit citation markers — number your chunks and tell the model to reference them.
- Parse and map citations back — extract citation numbers and attach the full source metadata.
- Surface citations in the UI — give users a way to verify what the AI told them.
If you're building a RAG system that needs to hold up under real scrutiny, this pattern is the foundation. Happy to dig into any part of it — reach out via the contact page or explore the live demo.

Martin Baker — Solutions Architect specializing in AI, RAG systems, and WordPress engineering. 15+ years building systems that hold up under real business pressure.