From RAG to CAG: Cache-Augmented Generation in AI
For the past few years, RAG (Retrieval-Augmented Generation) has been the de facto standard for giving language models access to up-to-date knowledge. But in 2025, a compelling alternative emerged that is reshaping how AI systems are built: CAG (Cache-Augmented Generation). If you work with AI in your projects, understanding this evolution — and knowing when each approach wins — is now essential.
What is RAG and what's the real problem?
RAG solves the "static knowledge" problem of LLMs. When a model doesn't know something (because its training cutoff is outdated, or the data is private), RAG fetches it at query time: it chunks documents, generates embeddings, stores them in a vector database, and retrieves the most relevant pieces for each query.
Classic RAG pipeline:
User → Question → Embed query → Vector search →
Retrieve chunks → Build prompt → LLM → Answer
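In code, the per-query path looks roughly like this (a minimal sketch; embedQuery, vectorStore, and llm are hypothetical helpers standing in for your embedding model, vector database, and LLM client):

async function ragQuery(question: string): Promise<string> {
  // 1. Embed the user's question
  const queryVector = await embedQuery(question);
  // 2. Retrieve the top-k most similar chunks from the vector database
  const chunks = await vectorStore.search(queryVector, { topK: 5 });
  // 3. Build the prompt from the retrieved fragments
  const prompt = `Answer using only this context:\n\n${chunks.map(c => c.text).join('\n---\n')}\n\nQuestion: ${question}`;
  // 4. Generate the answer
  return await llm.complete(prompt);
}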
It works, but it carries significant structural limitations:
- Double latency: retrieval time stacks on top of inference time
- Fragile chunking: poor document splits degrade answer quality
- Fragmented context: the model only sees pieces, never the full document
- Operational overhead: keeping vector indexes current has a real cost
- Hallucinations from retrieval: if the wrong chunk is returned, the answer fails
What is CAG and why is it different?
CAG (Cache-Augmented Generation) takes advantage of ultra-long context models (Gemini 1.5 Pro at 1M tokens, Claude at 200K, GPT-4o at 128K) to eliminate the retrieval step entirely. Instead of searching for relevant pieces, you preload the entire knowledge base into the model's context once, generating a KV (Key-Value) cache that persists across queries.
CAG pipeline:
[Once] Full documents → Preload into KV cache → LLM
[Per query] User → Question → LLM (context already available) → Answer
The technical core is the KV cache: when the model processes the initial context, it stores the computed attention representations (keys and values) internally. Subsequent queries reuse that cache without reprocessing the documents, dramatically cutting per-query latency.
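Conceptually, a CAG client separates the two phases: a one-time preload that pays the processing cost, and cheap per-query calls that reuse it. Here is a minimal sketch, assuming a hypothetical llm client with explicit prefix (KV) caching; real providers expose this through prompt caching options or implicit caching instead:

class CAGClient {
  private cacheId: string | null = null;

  // Phase 1 (once): process the full knowledge base and keep its KV cache
  async preload(documents: string[]): Promise<void> {
    // createPrefixCache is a hypothetical API standing in for provider-specific caching
    this.cacheId = await llm.createPrefixCache(documents.join('\n\n'));
  }

  // Phase 2 (per query): only the question is new; the cached prefix is reused
  async ask(question: string): Promise<string> {
    if (!this.cacheId) throw new Error('Call preload() before ask()');
    return await llm.complete(question, { prefixCacheId: this.cacheId });
  }
}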
Real advantages of CAG
| Aspect | Traditional RAG | CAG |
|---|---|---|
| Architecture | Complex pipeline (embed, index, retrieve) | Simple (preload + query) |
| Per-query latency | High (retrieval + inference) | Low (inference only) |
| Context quality | Fragmented (chunks) | Complete (full document) |
| Maintenance | Continuous vector index updates | Periodic cache refresh |
| Consistency | Variable (depends on retrieval) | Deterministic |
| Setup cost | High (embeddings + vector DB) | Low |
When to use CAG vs RAG: making the right call
CAG does not replace RAG in every scenario. The choice depends on three variables: knowledge size, update frequency, and latency requirements.
Use CAG when:
- Your knowledge base fits within the model's context window (~200K–1M tokens)
- You need coherent answers with a global view of the document
- Low latency is critical (chatbots, real-time assistants)
- You want a simple architecture with no vector infrastructure
- Content updates are periodic, not continuous
Use RAG when:
- Your knowledge base is massive (millions of documents)
- You need precise semantic search across highly heterogeneous corpora
- Token cost is a hard constraint
- Data changes in real time (live news feeds, streaming logs)
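These criteria can be folded into a rough starting heuristic. The thresholds below are illustrative assumptions, not hard rules:

interface KnowledgeProfile {
  approxTokens: number;   // estimated size of the knowledge base
  updatesPerDay: number;  // how often the content actually changes
}

// Illustrative decision helper; tune the thresholds to your model and workload
function chooseArchitecture(
  profile: KnowledgeProfile,
  contextWindow = 200_000
): 'CAG' | 'RAG' | 'HYBRID' {
  // Leave ~20% headroom for the question, instructions, and the answer
  const fitsInContext = profile.approxTokens <= contextWindow * 0.8;
  if (!fitsInContext) return 'RAG';
  // Content changing many times a day: keep the volatile part in retrieval
  if (profile.updatesPerDay > 24) return 'HYBRID';
  return 'CAG';
}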
The hybrid RAG+CAG approach
For intermediate cases, the most powerful architecture combines both (in the code below, vectorSearch and llm stand in for your retrieval layer and model client):
async function hybridQuery(question: string, knowledgeBase: Document[]) {
  // CAG: critical documents always in context
  const coreContext = knowledgeBase
    .filter(doc => doc.priority === 'critical')
    .map(doc => doc.content)
    .join('\n\n');

  // RAG: secondary documents only when relevant
  const retrieved = await vectorSearch(
    question,
    knowledgeBase.filter(doc => doc.priority !== 'critical')
  );

  const prompt = `
=== CORE KNOWLEDGE (always available) ===
${coreContext}

=== ADDITIONAL RETRIEVED CONTEXT ===
${retrieved.map(d => d.content).join('\n')}

=== QUESTION ===
${question}
`;

  return await llm.complete(prompt);
}
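In use, what separates the two layers is simply the priority field on each document. For example (the exact Document shape and the text variables here are assumptions):

const answer = await hybridQuery("What is the refund window for annual plans?", [
  { priority: 'critical', content: refundPolicyText },   // always in context (CAG layer)
  { priority: 'normal', content: archivedBlogPostText }, // only pulled in when relevant (RAG layer)
]);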
Applying CAG in your AI projects: real examples
Example 1: Customer support chatbot with CAG
The most immediate use case. If you run a business with product documentation, FAQs, policies, and a catalog, CAG lets you load everything into context and deliver perfectly coherent answers — no vector infrastructure needed.
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";

const client = new Anthropic();

// Preload: prepare all business knowledge
async function buildCacheContext(docsDir: string): Promise<string> {
  const files = fs.readdirSync(docsDir);
  const sections = files.map(file => {
    const content = fs.readFileSync(`${docsDir}/${file}`, 'utf-8');
    return `=== ${file.replace('.txt', '').toUpperCase()} ===\n${content}`;
  });
  return sections.join('\n\n');
}

// Query: reuse the cached context
async function queryWithCAG(
  userQuestion: string,
  cachedContext: string
): Promise<string> {
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    system: `You are an expert assistant. Use ONLY the following knowledge base to answer. If the information is not there, say so clearly.\n\n${cachedContext}`,
    messages: [{ role: "user", content: userQuestion }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}

// Usage in a Next.js API Route
export async function POST(req: Request) {
  const { question } = await req.json();
  // In production: cache in Redis or memory, don't rebuild on every request
  const context = await buildCacheContext('./data/knowledge');
  const answer = await queryWithCAG(question, context);
  return Response.json({ answer });
}
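One caveat: the sketch above re-sends the full context string on every request, so the provider reprocesses it each time. With the Anthropic API you can mark the knowledge base as a cacheable prefix so the server-side KV cache is reused across calls (assuming prompt caching is available for your chosen model); the call inside queryWithCAG would become:

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are an expert assistant. Use ONLY the following knowledge base to answer. If the information is not there, say so clearly.\n\n${cachedContext}`,
      // Ask the API to cache this prefix and reuse it on subsequent requests
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuestion }],
});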
Example 2: Code assistant with repository context
A place where CAG shines: loading the full context of a small-to-medium codebase so the assistant understands the entire architecture without losing coherence between files.
import Anthropic from "@anthropic-ai/sdk";
import { glob } from "glob";
import fs from "fs";

const client = new Anthropic();

interface RepoContext {
  files: { path: string; content: string }[];
  totalTokens: number;
}

async function loadRepoContext(repoPath: string): Promise<RepoContext> {
  // Selective loading: only relevant files to avoid saturating the context
  const files = await glob(`${repoPath}/src/**/*.{ts,tsx,js}`, {
    ignore: ['**/node_modules/**', '**/*.test.*', '**/dist/**']
  });
  const loaded = files.map(file => ({
    path: file.replace(repoPath, ''),
    content: fs.readFileSync(file, 'utf-8')
  }));
  return {
    files: loaded,
    // Rough token estimate: ~4 characters per token
    totalTokens: loaded.reduce((acc, f) => acc + f.content.length / 4, 0)
  };
}

async function askAboutCode(
  question: string,
  repoCtx: RepoContext
): Promise<string> {
  const contextStr = repoCtx.files
    .map(f => `// FILE: ${f.path}\n${f.content}`)
    .join('\n\n---\n\n');
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 2048,
    system: `You are an expert on this repository. You have full access to the source code:\n\n${contextStr}`,
    messages: [{ role: "user", content: question }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}
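Since totalTokens is only a rough character-based estimate, it is worth checking it against the model's context window before sending. A minimal guard, assuming a 200K-token window (adjust to your model):

// Guard against overflowing the context window before calling the model
async function askAboutCodeSafely(
  question: string,
  repoCtx: RepoContext
): Promise<string> {
  const CONTEXT_BUDGET = 200_000 * 0.8; // leave headroom for the question and the answer
  if (repoCtx.totalTokens > CONTEXT_BUDGET) {
    throw new Error(
      `Repository context (~${Math.round(repoCtx.totalTokens)} tokens) exceeds the budget; ` +
      `split the repo or move the least critical files to retrieval.`
    );
  }
  return askAboutCode(question, repoCtx);
}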
Example 3: CAG with incremental cache updates
In real projects, knowledge changes. This pattern handles updates without rebuilding the entire cache from scratch:
interface CacheEntry {
  context: string;
  version: string;
  createdAt: Date;
  ttlHours: number;
}

class CAGCacheManager {
  private cache = new Map<string, CacheEntry>();

  async getOrBuild(
    cacheKey: string,
    buildFn: () => Promise<string>,
    ttlHours = 24
  ): Promise<string> {
    const cached = this.cache.get(cacheKey);
    const isExpired = cached &&
      (Date.now() - cached.createdAt.getTime()) > ttlHours * 3600 * 1000;
    if (cached && !isExpired) {
      return cached.context;
    }
    // Rebuild only when expired or missing
    const freshContext = await buildFn();
    this.cache.set(cacheKey, {
      context: freshContext,
      version: Date.now().toString(),
      createdAt: new Date(),
      ttlHours
    });
    return freshContext;
  }

  invalidate(cacheKey: string): void {
    this.cache.delete(cacheKey);
  }
}

// In your Next.js API
const cacheManager = new CAGCacheManager();

export async function POST(req: Request) {
  const { question, namespace } = await req.json();
  // loadKnowledgeBase is your own loader (e.g. buildCacheContext from Example 1);
  // queryWithCAG is the helper defined in Example 1
  const context = await cacheManager.getOrBuild(
    namespace,
    () => loadKnowledgeBase(namespace),
    12
  );
  const answer = await queryWithCAG(question, context);
  return Response.json({ answer });
}
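When the underlying documents change (after a CMS update, for example), you can expose an invalidation endpoint in the same route file so the next query rebuilds the context. A hypothetical sketch:

// Hypothetical invalidation endpoint: call it whenever the knowledge base changes
export async function DELETE(req: Request) {
  const { namespace } = await req.json();
  cacheManager.invalidate(namespace);
  return Response.json({ invalidated: namespace });
}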
The future: CAG as the new standard in AI applications
The trend is clear. Models with context windows of 1 million tokens (and growing) are making CAG viable for increasingly large knowledge bases. What requires RAG with chunking today will fit in context directly tomorrow.
For medium-sized projects (product documentation, technical manuals, catalogs, codebases), CAG is already the most pragmatic architecture: less complexity, better answer quality, lower latency.
The roadmap for 2026:
- New projects: Start with CAG. Add RAG only if knowledge exceeds available context.
- Existing RAG projects: Check whether your knowledge base fits in context. If it does, migrating to CAG will simplify your stack and improve results.
- Complex cases: Implement the CAG+RAG hybrid with critical documents always in context and the rest in retrieval.
Want to implement CAG in your project or migrate from an existing RAG architecture? At dailymp.es I help teams design and implement AI architectures tailored to their real needs. Also explore how AI-driven development can transform your entire engineering workflow. Get in touch — let's build something that actually works.