A practical RAG readiness playbook: document prep, chunking, citations, evaluation, and rollout. Built for operators and engineers.
Last updated: February 27, 2026
Summary
This playbook helps you decide if you are actually ready for RAG, or if you are about to build a search problem with extra steps.
It is designed for operators and engineers who want a practical path: what to check, what to fix, and what to measure.
Use the LLM Safety Review Checklist for data and output guardrails, and the AI Toolkit for tool picks.
Run this before you commit to a RAG build, and again before you roll it out to real users.
Who it is for
Operators and engineers building internal knowledge tools.
Founders selling a product that needs grounded answers with citations.
Teams with a pile of docs and no way to find the right answer fast.
Anyone getting burned by “the model made it up.”
What you get
A readiness checklist that forces clarity on the job to be done.
A document and data quality checklist that prevents garbage retrieval.
A minimal architecture outline: ingest, chunk, retrieve, cite, refuse.
Evaluation metrics you can track without a research team.
Templates for test queries and acceptance criteria.
Steps
Define the job to be done, not the tech. Write one sentence:
“Users want to [task] using [source], and the result must be [quality bar].”
If you cannot state the task, you cannot evaluate success.
Inventory the sources and the truth level. List each source and label it:
Authoritative: policies, contracts, system docs, verified runbooks
Helpful: meeting notes, tickets, chat logs
Untrusted: random docs, outdated files, duplicated content
RAG is only as trustworthy as the sources you feed it.
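If it helps to make the labels concrete, here is a minimal sketch of the inventory as code. The TrustLevel values mirror the list above; the Source fields, example names, and URLs are placeholders, not a required schema.

```python
from dataclasses import dataclass
from enum import Enum

class TrustLevel(Enum):
    AUTHORITATIVE = "authoritative"  # policies, contracts, system docs, verified runbooks
    HELPFUL = "helpful"              # meeting notes, tickets, chat logs
    UNTRUSTED = "untrusted"          # random docs, outdated files, duplicated content

@dataclass
class Source:
    name: str
    url: str
    owner: str
    trust: TrustLevel
    last_updated: str  # ISO date string, empty if unknown

# Placeholder entries; replace with your real inventory.
inventory = [
    Source("Refund policy", "https://wiki.example.com/refund-policy", "ops",
           TrustLevel.AUTHORITATIVE, "2026-01-10"),
    Source("Support chat export", "https://wiki.example.com/chat-export", "support",
           TrustLevel.UNTRUSTED, ""),
]

# Index authoritative sources first; add helpful ones once retrieval quality holds up.
indexable = [s for s in inventory if s.trust is TrustLevel.AUTHORITATIVE]
```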
Fix content quality before you build retrieval. Do a quick cleanup pass:
Remove duplicates
Remove dead docs and old versions
Standardize titles and section headers
Add “last updated” where possible
If the library is messy, retrieval will be messy.
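For the duplicate pass, even a crude exact-match sweep catches a lot. A minimal sketch, assuming your library is a folder of Markdown files; the directory path and file glob are placeholders.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivially different copies collide.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(doc_dir: str) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(doc_dir).rglob("*.md"):
        groups[content_hash(path.read_text(encoding="utf-8"))].append(path)
    # Keep only hashes that map to more than one file.
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

if __name__ == "__main__":
    for paths in find_duplicates("./library").values():
        print("Duplicate set:", [str(p) for p in paths])
```

This only finds exact copies after whitespace normalization; near-duplicates (old versions with small edits) still need a human pass.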
Choose a minimum viable retrieval design. Start simple:
Chunk by headings, not by arbitrary size
Store source URL and section title per chunk
Retrieve top K and cite them
Refuse when there is no evidence
You can add reranking later. Do not start with complexity.
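To make "chunk by headings, keep the metadata" concrete, here is a minimal sketch for Markdown sources. The Chunk fields mirror the list above; the heading regex and the "Introduction" label for text before the first heading are assumptions about your document format.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    source_url: str
    section_title: str
    text: str

def chunk_by_headings(markdown: str, source_url: str) -> list[Chunk]:
    """Split a Markdown document at headings so each chunk stays inside one section."""
    chunks: list[Chunk] = []
    title, lines = "Introduction", []  # "Introduction" covers any preamble before the first heading
    for line in markdown.splitlines():
        if re.match(r"^#{1,3}\s+", line):  # a new section starts at an H1-H3 heading
            if lines:
                chunks.append(Chunk(source_url, title, "\n".join(lines).strip()))
            title, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    if lines:
        chunks.append(Chunk(source_url, title, "\n".join(lines).strip()))
    return [c for c in chunks if c.text]
```

Because every chunk carries its source URL and section title, citations and the "no evidence" check come almost for free later.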
Define evaluation before you ship. Create a test set of 25 questions:
10 easy (direct lookup)
10 medium (needs synthesis across sections)
5 hard (edge cases, ambiguous)
For each question, define what a correct answer must cite.
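One way to keep the eval honest is to encode "what a correct answer must cite" as data and rerun it on every change. A minimal sketch; answer_fn and the pass rule (every required section must appear in the citations) are assumptions about your pipeline.

```python
from dataclasses import dataclass

@dataclass
class TestQuestion:
    question: str
    difficulty: str       # easy | medium | hard
    must_cite: list[str]  # section titles a correct answer must reference

def run_eval(questions: list[TestQuestion], answer_fn) -> float:
    """answer_fn(question) is assumed to return (answer_text, cited_section_titles)."""
    passed = 0
    for q in questions:
        _, cited = answer_fn(q.question)
        if all(section in cited for section in q.must_cite):
            passed += 1
        else:
            print(f"FAIL [{q.difficulty}] {q.question} -> cited {cited}")
    return passed / len(questions)
```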
Ship with guardrails and logging. Your first version should:
Show citations for every claim
Say “Not in library” when it cannot prove it
Log the question, top docs, and whether the user was satisfied
This is how you improve relevance and trust over time.
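A minimal sketch of the refuse-and-log wrapper, assuming you already have a retriever and a generator callable. The score threshold, log format, and field names are placeholders; top_docs assumes chunks carry a section_title as in the chunking sketch above.

```python
import json
import time

REFUSAL = "Not in library."

def answer_with_guardrails(question: str, retriever, generator, min_score: float = 0.35,
                           log_path: str = "rag_log.jsonl") -> str:
    # retriever(question) -> list of (chunk, score); generator(question, chunks) -> answer text.
    # min_score is an arbitrary placeholder threshold; tune it against your test set.
    hits = [(chunk, score) for chunk, score in retriever(question) if score >= min_score]
    answer = generator(question, [c for c, _ in hits]) if hits else REFUSAL

    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "question": question,
            "top_docs": [c.section_title for c, _ in hits],
            "refused": answer == REFUSAL,
            "user_satisfied": None,  # filled in later from user feedback
        }) + "\n")
    return answer
```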
Templates
Copy these into your doc, then fill them with your actual system details.
Template 1: RAG readiness snapshot
One page that says if you are ready and why.
RAG readiness snapshot
Job to be done:
Users:
Primary sources:
Success criteria:
Sources inventory:
- Source | type (authoritative/helpful/untrusted) | owner | last updated
Data quality issues:
- duplicates:
- outdated:
- missing metadata:
- access problems:
MVP retrieval plan:
- chunking strategy:
- top K:
- citations:
- refusal behavior:
Evaluation plan:
- test questions count:
- acceptance criteria:
- metrics tracked:
Decision:
- Ready / Not ready
Next actions:
Template 2: Test questions table
Use this to build your eval set.
Test questions
Format:
Question | difficulty | expected sources | must-cite sections | answer notes
Examples:
- What is our refund policy? | easy | policy doc | section 2 | must cite
- How do we handle access exceptions? | medium | runbook + policy | both | synthesize
- What should we do if the system is down and logs are missing? | hard | incident doc | section 4 | edge case
Template 3: Acceptance criteria
Keep it measurable and strict.
Acceptance criteria (MVP)
- Every answer includes citations for factual claims.
- If no evidence is retrieved, the system refuses and says "Not in library."
- At least one of the top 3 retrieved chunks is relevant for at least 70% of test questions.
- Correct answer rate is at least 60% on the 25-question test set.
- Users can report "wrong" and we capture: question, retrieved docs, and why it failed.
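If you record per-question results from the eval set, the ready / not ready call is a few lines. A minimal sketch; the field names in the result dicts are illustrative.

```python
def acceptance_report(results: list[dict]) -> None:
    """results: one dict per test question, e.g.
    {"top3_has_relevant": True, "answer_correct": False} (field names are illustrative)."""
    n = len(results)
    top3_rate = sum(r["top3_has_relevant"] for r in results) / n
    correct_rate = sum(r["answer_correct"] for r in results) / n
    print(f"Top-3 relevance: {top3_rate:.0%} (target >= 70%)")
    print(f"Correct answers: {correct_rate:.0%} (target >= 60%)")
    print("PASS" if top3_rate >= 0.70 and correct_rate >= 0.60 else "NOT READY")
```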
Common mistakes
Building RAG when search would solve it. Start with good search first.
Messy sources. Duplicates and outdated docs destroy trust.
No refusal mode. If it cannot prove it, it should not answer.
No evaluation set. You cannot improve what you cannot measure.
Chunking by character count only. Headings and structure matter.
No citations. Without citations you will not earn trust.