A practical RAG readiness playbook: document prep, chunking, citations, evaluation, and rollout. Built for operators and engineers.
Last updated: February 27, 2026
Summary
This playbook helps you decide if you are actually ready for RAG, or if you are about to build a search problem with extra steps.
It is designed for operators and engineers who want a practical path: what to check, what to fix, and what to measure.
Use the LLM Safety Review Checklist for data and output guardrails, and the AI Toolkit for tool picks.
Run this before you commit to a RAG build, and again before you roll it out to real users.
Who it is for
Operators and engineers building internal knowledge tools.
Founders selling a product that needs grounded answers with citations.
Teams with a pile of docs and no way to find the right answer fast.
Anyone getting burned by “the model made it up.”
What you get
A readiness checklist that forces clarity on the job to be done.
A document and data quality checklist that prevents garbage retrieval.
A minimal architecture outline: ingest, chunk, retrieve, cite, refuse.
Evaluation metrics you can track without a research team.
Templates for test queries and acceptance criteria.
Steps
Define the job to be done, not the tech. Write one sentence:
“Users want to [task] using [source], and the result must be [quality bar].”
If you cannot state the task, you cannot evaluate success.
Inventory the sources and the truth level. List each source and label it:
Authoritative: policies, contracts, system docs, verified runbooks
Helpful: meeting notes, tickets, chat logs
Untrusted: random docs, outdated files, duplicated content
RAG is only as trustworthy as the sources you feed it.
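If it helps to make the labels concrete, here is a minimal sketch of the inventory as code. The TrustLevel values mirror the list above; the Source fields, example names, and URLs are placeholders, not a required schema.

```python
from dataclasses import dataclass
from enum import Enum

class TrustLevel(Enum):
    AUTHORITATIVE = "authoritative"  # policies, contracts, system docs, verified runbooks
    HELPFUL = "helpful"              # meeting notes, tickets, chat logs
    UNTRUSTED = "untrusted"          # random docs, outdated files, duplicated content

@dataclass
class Source:
    name: str
    url: str
    owner: str
    trust: TrustLevel
    last_updated: str  # ISO date string, empty if unknown

# Placeholder entries; replace with your real inventory.
inventory = [
    Source("Refund policy", "https://wiki.example.com/refund-policy", "ops",
           TrustLevel.AUTHORITATIVE, "2026-01-10"),
    Source("Support chat export", "https://wiki.example.com/chat-export", "support",
           TrustLevel.UNTRUSTED, ""),
]

# Index authoritative sources first; add helpful ones once retrieval quality holds up.
indexable = [s for s in inventory if s.trust is TrustLevel.AUTHORITATIVE]
```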
Fix content quality before you build retrieval. Do a quick cleanup pass:
Remove duplicates
Remove dead docs and old versions
Standardize titles and section headers
Add “last updated” where possible
If the library is messy, retrieval will be messy.
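For the duplicate pass, even a crude exact-match sweep catches a lot. A minimal sketch, assuming your library is a folder of Markdown files; the directory path and file glob are placeholders.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivially different copies collide.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(doc_dir: str) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(doc_dir).rglob("*.md"):
        groups[content_hash(path.read_text(encoding="utf-8"))].append(path)
    # Keep only hashes that map to more than one file.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for paths in find_duplicates("./library").values():
        print("Duplicate set:", [str(p) for p in paths])
```

This only finds exact copies after whitespace normalization; near-duplicates (old versions with small edits) still need a human pass.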
Choose a minimum viable retrieval design. Start simple:
Chunk by headings, not by arbitrary size
Store source URL and section title per chunk
Retrieve top K and cite them
Refuse when there is no evidence
You can add reranking later. Do not start with complexity.
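To make "chunk by headings, keep the metadata" concrete, here is a minimal sketch for Markdown sources. The Chunk fields mirror the list above; the heading regex and the "Introduction" label for text before the first heading are assumptions about your document format.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    source_url: str
    section_title: str
    text: str

def chunk_by_headings(markdown: str, source_url: str) -> list[Chunk]:
    """Split a Markdown document at headings so each chunk stays inside one section."""
    chunks: list[Chunk] = []
    title, lines = "Introduction", []  # "Introduction" covers any preamble before the first heading
    for line in markdown.splitlines():
        if re.match(r"^#{1,3}\s+", line):  # a new section starts at an H1-H3 heading
            if lines:
                chunks.append(Chunk(source_url, title, "\n".join(lines).strip()))
            title, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if lines:
        chunks.append(Chunk(source_url, title, "\n".join(lines).strip()))
    return [c for c in chunks if c.text]
```

Because every chunk carries its source URL and section title, citations and the "no evidence" check come almost for free later.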
Define evaluation before you ship. Create a test set of 25 questions:
10 easy (direct lookup)
10 medium (needs synthesis across sections)
5 hard (edge cases, ambiguous)
For each question, define what a correct answer must cite.
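One way to keep the eval honest is to encode "what a correct answer must cite" as data and rerun it on every change. A minimal sketch; answer_fn and the pass rule (every required section must appear in the citations) are assumptions about your pipeline.

```python
from dataclasses import dataclass

@dataclass
class TestQuestion:
    question: str
    difficulty: str       # easy | medium | hard
    must_cite: list[str]  # section titles a correct answer must reference

def run_eval(questions: list[TestQuestion], answer_fn) -> float:
    """answer_fn(question) is assumed to return (answer_text, cited_section_titles)."""
    passed = 0
    for q in questions:
        _, cited = answer_fn(q.question)
        if all(section in cited for section in q.must_cite):
            passed += 1
        else:
            print(f"FAIL [{q.difficulty}] {q.question} -> cited {cited}")
    return passed / len(questions)
```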
Ship with guardrails and logging. Your first version should:
Show citations for every claim
Say “Not in library” when it cannot prove it
Log the question, top docs, and whether the user was satisfied
This is how you improve relevance and trust over time.
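A minimal sketch of the refuse-and-log wrapper, assuming you already have a retriever and a generator callable. The score threshold, log format, and field names are placeholders; top_docs assumes chunks carry a section_title as in the chunking sketch above.

```python
import json
import time

REFUSAL = "Not in library."

def answer_with_guardrails(question: str, retriever, generator, min_score: float = 0.35,
                           log_path: str = "rag_log.jsonl") -> str:
    # retriever(question) -> list of (chunk, score); generator(question, chunks) -> answer text.
    # min_score is an arbitrary placeholder threshold; tune it against your test set.
    hits = [(chunk, score) for chunk, score in retriever(question) if score >= min_score]
    answer = generator(question, [c for c, _ in hits]) if hits else REFUSAL

    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "question": question,
            "top_docs": [c.section_title for c, _ in hits],
            "refused": answer == REFUSAL,
            "user_satisfied": None,  # filled in later from user feedback
        }) + "\n")
    return answer
```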
Templates
Copy these into your doc, then fill them with your actual system details.
Template 1: RAG readiness snapshot
One page that says if you are ready and why.
RAG readiness snapshot
Job to be done:
Users:
Primary sources:
Success criteria:
Sources inventory:
- Source | type (authoritative/helpful/untrusted) | owner | last updated
Data quality issues:
- duplicates:
- outdated:
- missing metadata:
- access problems:
MVP retrieval plan:
- chunking strategy:
- top K:
- citations:
- refusal behavior:
Evaluation plan:
- test questions count:
- acceptance criteria:
- metrics tracked:
Decision:
- Ready / Not ready
Next actions:
Template 2: Test questions table
Use this to build your eval set.
Test questions
Format:
Question | difficulty | expected sources | must-cite sections | answer notes
Examples:
- What is our refund policy? | easy | policy doc | section 2 | must cite
- How do we handle access exceptions? | medium | runbook + policy | both | synthesize
- What should we do if the system is down and logs are missing? | hard | incident doc | section 4 | edge case
Template 3: Acceptance criteria
Keep it measurable and strict.
Acceptance criteria (MVP)
- Every answer includes citations for factual claims.
- If no evidence is retrieved, the system refuses and says "Not in library."
- At least one of the top 3 retrieved chunks is relevant for at least 70% of test questions.
- Correct answer rate is at least 60% on the 25-question test set.
- Users can report "wrong" and we capture: question, retrieved docs, and why it failed.
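If you record per-question results from the eval set, the ready / not ready call is a few lines. A minimal sketch; the field names in the result dicts are illustrative.

```python
def acceptance_report(results: list[dict]) -> None:
    """results: one dict per test question, e.g.
    {"top3_has_relevant": True, "answer_correct": False} (field names are illustrative)."""
    n = len(results)
    top3_rate = sum(r["top3_has_relevant"] for r in results) / n
    correct_rate = sum(r["answer_correct"] for r in results) / n
    print(f"Top-3 relevance: {top3_rate:.0%} (target >= 70%)")
    print(f"Correct answers: {correct_rate:.0%} (target >= 60%)")
    print("PASS" if top3_rate >= 0.70 and correct_rate >= 0.60 else "NOT READY")
```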
Common mistakes
Building RAG when search would solve it. Start with good search first.
Messy sources. Duplicates and outdated docs destroy trust.
No refusal mode. If it cannot prove it, it should not answer.
No evaluation set. You cannot improve what you cannot measure.
Chunking by character count only. Headings and structure matter.
No citations. Without citations you will not earn trust.