From RAG to CAG: Cache-Augmented Generation in AI
For the past few years, RAG (Retrieval-Augmented Generation) has been the de facto standard for giving language models access to up-to-date knowledge. But in 2025, a compelling alternative emerged that is reshaping how AI systems are built: CAG (Cache-Augmented Generation). If you work with AI in your projects, understanding this evolution — and knowing when each approach wins — is now essential.
What is RAG and what's the real problem?
RAG solves the "static knowledge" problem of LLMs. When a model doesn't know something (because its training cutoff is outdated, or the data is private), RAG fetches it at query time: it chunks documents, generates embeddings, stores them in a vector database, and retrieves the most relevant pieces for each query.
Classic RAG pipeline:
User → Question → Embed query → Vector search →
Retrieve chunks → Build prompt → LLM → Answer
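In code, the per-query path looks roughly like this (a minimal sketch; embedQuery, vectorStore, and llm are hypothetical helpers standing in for your embedding model, vector database, and LLM client):

async function ragQuery(question: string): Promise<string> {
  // 1. Embed the user's question
  const queryVector = await embedQuery(question);
  // 2. Retrieve the top-k most similar chunks from the vector database
  const chunks = await vectorStore.search(queryVector, { topK: 5 });
  // 3. Build the prompt from the retrieved fragments
  const prompt = `Answer using only this context:\n\n${chunks.map(c => c.text).join('\n---\n')}\n\nQuestion: ${question}`;
  // 4. Generate the answer
  return await llm.complete(prompt);
}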
It works, but it carries significant structural limitations:
- Double latency: retrieval time stacks on top of inference time
- Fragile chunking: poor document splits degrade answer quality
- Fragmented context: the model only sees pieces, never the full document
- Operational overhead: keeping vector indexes current has a real cost
- Hallucinations from retrieval: if the wrong chunk is returned, the answer fails
What is CAG and why is it different?
CAG (Cache-Augmented Generation) takes advantage of ultra-long context models (Gemini 1.5 Pro at 1M tokens, Claude at 200K, GPT-4o at 128K) to eliminate the retrieval step entirely. Instead of searching for relevant pieces, you preload the entire knowledge base into the model's context once, generating a KV (Key-Value) cache that persists across queries.
CAG pipeline:
[Once] Full documents → Preload into KV cache → LLM
[Per query] User → Question → LLM (context already available) → Answer
The technical core is the KV cache: when the model processes the initial context, it stores the computed attention representations (keys and values) internally. Subsequent queries reuse that cache without reprocessing the documents, dramatically cutting per-query latency.
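Conceptually, a CAG client separates the two phases: a one-time preload that pays the processing cost, and cheap per-query calls that reuse it. Here is a minimal sketch, assuming a hypothetical llm client with explicit prefix (KV) caching; real providers expose this through prompt caching options or implicit caching instead:

class CAGClient {
  private cacheId: string | null = null;

  // Phase 1 (once): process the full knowledge base and keep its KV cache
  async preload(documents: string[]): Promise<void> {
    // createPrefixCache is a hypothetical API standing in for provider-specific caching
    this.cacheId = await llm.createPrefixCache(documents.join('\n\n'));
  }

  // Phase 2 (per query): only the question is new; the cached prefix is reused
  async ask(question: string): Promise<string> {
    if (!this.cacheId) throw new Error('Call preload() before ask()');
    return await llm.complete(question, { prefixCacheId: this.cacheId });
  }
}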
Real advantages of CAG
| Aspect | Traditional RAG | CAG |
|---|---|---|
| Architecture | Complex pipeline (embed, index, retrieve) | Simple (preload + query) |
| Per-query latency | High (retrieval + inference) | Low (inference only) |
| Context quality | Fragmented (chunks) | Complete (full document) |
| Maintenance | Continuous vector index updates | Periodic cache refresh |
| Consistency | Variable (depends on retrieval) | Deterministic |
| Setup cost | High (embeddings + vector DB) | Low |
When to use CAG vs RAG: making the right call
CAG does not replace RAG in every scenario. The choice depends on three variables: knowledge size, update frequency, and latency requirements.
Use CAG when:
- Your knowledge base fits within the model's context window (~200K–1M tokens)
- You need coherent answers with a global view of the document
- Low latency is critical (chatbots, real-time assistants)
- You want a simple architecture with no vector infrastructure
- Content updates are periodic, not continuous
Use RAG when:
- Your knowledge base is massive (millions of documents)
- You need precise semantic search across highly heterogeneous corpora
- Token cost is a hard constraint
- Data changes in real time (live news feeds, streaming logs)
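These criteria can be folded into a rough starting heuristic. The thresholds below are illustrative assumptions, not hard rules:

interface KnowledgeProfile {
  approxTokens: number;   // estimated size of the knowledge base
  updatesPerDay: number;  // how often the content actually changes
}

// Illustrative decision helper; tune the thresholds to your model and workload
function chooseArchitecture(
  profile: KnowledgeProfile,
  contextWindow = 200_000
): 'CAG' | 'RAG' | 'HYBRID' {
  // Leave ~20% headroom for the question, instructions, and the answer
  const fitsInContext = profile.approxTokens <= contextWindow * 0.8;
  if (!fitsInContext) return 'RAG';
  // Content changing many times a day: keep the volatile part in retrieval
  if (profile.updatesPerDay > 24) return 'HYBRID';
  return 'CAG';
}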
The hybrid RAG+CAG approach
For intermediate cases, the most powerful architecture combines both (in the code below, vectorSearch and llm stand in for your retrieval layer and model client):
async function hybridQuery(question: string, knowledgeBase: Document[]) {
  // CAG: critical documents always in context
  const coreContext = knowledgeBase
    .filter(doc => doc.priority === 'critical')
    .map(doc => doc.content)
    .join('\n\n');

  // RAG: secondary documents only when relevant
  const retrieved = await vectorSearch(
    question,
    knowledgeBase.filter(doc => doc.priority !== 'critical')
  );

  const prompt = `
=== CORE KNOWLEDGE (always available) ===
${coreContext}

=== ADDITIONAL RETRIEVED CONTEXT ===
${retrieved.map(d => d.content).join('\n')}

=== QUESTION ===
${question}
`;

  return await llm.complete(prompt);
}
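In use, what separates the two layers is simply the priority field on each document. For example (the exact Document shape and the text variables here are assumptions):

const answer = await hybridQuery("What is the refund window for annual plans?", [
  { priority: 'critical', content: refundPolicyText },   // always in context (CAG layer)
  { priority: 'normal', content: archivedBlogPostText }, // only pulled in when relevant (RAG layer)
]);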
Applying CAG in your AI projects: real examples
Example 1: Customer support chatbot with CAG
The most immediate use case. If you run a business with product documentation, FAQs, policies, and a catalog, CAG lets you load everything into context and deliver perfectly coherent answers — no vector infrastructure needed.
import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";

const client = new Anthropic();

// Preload: prepare all business knowledge
async function buildCacheContext(docsDir: string): Promise<string> {
  const files = fs.readdirSync(docsDir);
  const sections = files.map(file => {
    const content = fs.readFileSync(`${docsDir}/${file}`, 'utf-8');
    return `=== ${file.replace('.txt', '').toUpperCase()} ===\n${content}`;
  });
  return sections.join('\n\n');
}

// Query: reuse the cached context
async function queryWithCAG(
  userQuestion: string,
  cachedContext: string
): Promise<string> {
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    system: `You are an expert assistant. Use ONLY the following knowledge base to answer. If the information is not there, say so clearly.\n\n${cachedContext}`,
    messages: [{ role: "user", content: userQuestion }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}

// Usage in a Next.js API Route
export async function POST(req: Request) {
  const { question } = await req.json();
  // In production: cache in Redis or memory, don't rebuild on every request
  const context = await buildCacheContext('./data/knowledge');
  const answer = await queryWithCAG(question, context);
  return Response.json({ answer });
}
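One caveat: the sketch above re-sends the full context string on every request, so the provider reprocesses it each time. With the Anthropic API you can mark the knowledge base as a cacheable prefix so the server-side KV cache is reused across calls (assuming prompt caching is available for your chosen model); the call inside queryWithCAG would become:

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: `You are an expert assistant. Use ONLY the following knowledge base to answer. If the information is not there, say so clearly.\n\n${cachedContext}`,
      // Ask the API to cache this prefix and reuse it on subsequent requests
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userQuestion }],
});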
Example 2: Code assistant with repository context
A place where CAG shines: loading the full context of a small-to-medium codebase so the assistant understands the entire architecture without losing coherence between files.
import Anthropic from "@anthropic-ai/sdk";
import { glob } from "glob";
import fs from "fs";

const client = new Anthropic();

interface RepoContext {
  files: { path: string; content: string }[];
  totalTokens: number;
}

async function loadRepoContext(repoPath: string): Promise<RepoContext> {
  // Selective loading: only relevant files to avoid saturating the context
  const files = await glob(`${repoPath}/src/**/*.{ts,tsx,js}`, {
    ignore: ['**/node_modules/**', '**/*.test.*', '**/dist/**']
  });
  const loaded = files.map(file => ({
    path: file.replace(repoPath, ''),
    content: fs.readFileSync(file, 'utf-8')
  }));
  return {
    files: loaded,
    // Rough token estimate: ~4 characters per token
    totalTokens: loaded.reduce((acc, f) => acc + f.content.length / 4, 0)
  };
}

async function askAboutCode(
  question: string,
  repoCtx: RepoContext
): Promise<string> {
  const contextStr = repoCtx.files
    .map(f => `// FILE: ${f.path}\n${f.content}`)
    .join('\n\n---\n\n');
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 2048,
    system: `You are an expert on this repository. You have full access to the source code:\n\n${contextStr}`,
    messages: [{ role: "user", content: question }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}
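Since totalTokens is only a rough character-based estimate, it is worth checking it against the model's context window before sending. A minimal guard, assuming a 200K-token window (adjust to your model):

// Guard against overflowing the context window before calling the model
async function askAboutCodeSafely(
  question: string,
  repoCtx: RepoContext
): Promise<string> {
  const CONTEXT_BUDGET = 200_000 * 0.8; // leave headroom for the question and the answer
  if (repoCtx.totalTokens > CONTEXT_BUDGET) {
    throw new Error(
      `Repository context (~${Math.round(repoCtx.totalTokens)} tokens) exceeds the budget; ` +
      `split the repo or move the least critical files to retrieval.`
    );
  }
  return askAboutCode(question, repoCtx);
}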
Example 3: CAG with incremental cache updates
In real projects, knowledge changes. This pattern handles updates without rebuilding the entire cache from scratch:
interface CacheEntry {
  context: string;
  version: string;
  createdAt: Date;
  ttlHours: number;
}

class CAGCacheManager {
  private cache = new Map<string, CacheEntry>();

  async getOrBuild(
    cacheKey: string,
    buildFn: () => Promise<string>,
    ttlHours = 24
  ): Promise<string> {
    const cached = this.cache.get(cacheKey);
    const isExpired = cached &&
      (Date.now() - cached.createdAt.getTime()) > ttlHours * 3600 * 1000;
    if (cached && !isExpired) {
      return cached.context;
    }
    // Rebuild only when expired or missing
    const freshContext = await buildFn();
    this.cache.set(cacheKey, {
      context: freshContext,
      version: Date.now().toString(),
      createdAt: new Date(),
      ttlHours
    });
    return freshContext;
  }

  invalidate(cacheKey: string): void {
    this.cache.delete(cacheKey);
  }
}

// In your Next.js API
const cacheManager = new CAGCacheManager();

export async function POST(req: Request) {
  const { question, namespace } = await req.json();
  // loadKnowledgeBase is your own loader (e.g. buildCacheContext from Example 1);
  // queryWithCAG is the helper defined in Example 1
  const context = await cacheManager.getOrBuild(
    namespace,
    () => loadKnowledgeBase(namespace),
    12
  );
  const answer = await queryWithCAG(question, context);
  return Response.json({ answer });
}
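When the underlying documents change (after a CMS update, for example), you can expose an invalidation endpoint in the same route file so the next query rebuilds the context. A hypothetical sketch:

// Hypothetical invalidation endpoint: call it whenever the knowledge base changes
export async function DELETE(req: Request) {
  const { namespace } = await req.json();
  cacheManager.invalidate(namespace);
  return Response.json({ invalidated: namespace });
}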
The future: CAG as the new standard in AI applications
The trend is clear. Models with context windows of 1 million tokens (and growing) are making CAG viable for increasingly large knowledge bases. What requires RAG with chunking today will fit in context directly tomorrow.
For medium-sized projects (product documentation, technical manuals, catalogs, codebases), CAG is already the most pragmatic architecture: less complexity, better answer quality, lower latency.
The roadmap for 2026:
- New projects: Start with CAG. Add RAG only if knowledge exceeds available context.
- Existing RAG projects: Check whether your knowledge base fits in context. If it does, migrating to CAG will simplify your stack and improve results.
- Complex cases: Implement the CAG+RAG hybrid with critical documents always in context and the rest in retrieval.
Want to implement CAG in your project or migrate from an existing RAG architecture? At dailymp.es I help teams design and implement AI architectures tailored to their real needs. Also explore how AI-driven development can transform your entire engineering workflow. Get in touch — let's build something that actually works.