AI Integration in React & Next.js: Best Practices 2026 Guide
Integrating large language models (LLMs) into frontend apps has transformed how we build user interfaces. In this post, we explore best practices for bringing AI into your React and Next.js applications.
Why integrate AI in the frontend?
Moving AI interactions closer to the user provides:
- Lower latency: streamed responses reach the user without waiting for full server roundtrips
- More privacy: with on-device or edge inference, user data does not have to leave the device
- Better UX: real-time conversational experiences
Recommended stack for 2026
Framework
- React 19+ or Next.js 16+ for maximum performance
- Server Components for sensitive logic
- Client Components for real-time interactivity
LLMs
A minimal, non-streaming call with the Anthropic SDK (@anthropic-ai/sdk):
import Anthropic from "@anthropic-ai/sdk";
// The client reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();
const message = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    { role: "user", content: "Hi, how are you?" }
  ],
});
Streaming responses
One of the biggest UX upgrades is enabling streaming:
const stream = client.messages.stream({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello" }],
});
for await (const event of stream) {
  // Only text deltas carry generated text; other delta types exist (e.g. tool input)
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    console.log(event.delta.text);
  }
}
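In Next.js, this streaming call usually lives in a Route Handler so the API key stays on the server, and the browser consumes the result with fetch. A minimal sketch, assuming a route at app/api/ai/stream/route.ts (the path and request shape are illustrative):
// app/api/ai/stream/route.ts (assumed path)
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic(); // API key never reaches the client
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const stream = anthropic.messages.stream({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  const encoder = new TextEncoder();
  // Re-emit only the text deltas as a plain text stream the browser can read
  const body = new ReadableStream<Uint8Array>({
    async start(controller) {
      for await (const event of stream) {
        if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
          controller.enqueue(encoder.encode(event.delta.text));
        }
      }
      controller.close();
    },
  });
  return new Response(body, { headers: { "Content-Type": "text/plain; charset=utf-8" } });
}
The client then reads this stream with response.body.getReader(), exactly as in the error handling section later in this post.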
Latency optimization
1. Token budgets
Do not send the whole context. Calculate how many tokens you really need:
const estimateTokens = (text: string) => {
  return Math.ceil(text.length / 4); // Rough estimate
};
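Building on that estimate, one way to keep a chat history inside a budget is to drop the oldest messages until the estimate fits. A minimal sketch reusing the estimateTokens helper above (the 3,000-token budget and message shape are assumptions):
type ChatMessage = { role: "user" | "assistant"; content: string };
const TOKEN_BUDGET = 3_000; // assumed budget, tune per model and use case
function trimToBudget(messages: ChatMessage[], budget = TOKEN_BUDGET): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let total = 0;
  // Walk from the newest message backwards so recent context survives
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (total + cost > budget) break;
    kept.unshift(messages[i]);
    total += cost;
  }
  return kept;
}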
2. Context caching
Reuse frequent contexts to reduce cost and latency:
const cachedContext = {
  userProfile: `Name: Daily, AI expert`,
  systemPrompt: `You are a helpful assistant...`,
};
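With the Anthropic API specifically, prompt caching lets you mark a stable system prompt so repeated requests reuse it instead of reprocessing it on every call. A sketch using the cachedContext object above (caching behavior and savings depend on the model and your usage):
const response = await client.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: cachedContext.systemPrompt,
      // Marks this block as cacheable so identical prefixes are reused across requests
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Summarize my profile" }],
});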
3. Parallel requests
Process multiple requests simultaneously whenever possible.
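For example, independent prompts (a summary, a title, follow-up suggestions) can run concurrently with Promise.all instead of one after another. A sketch, where callModel and articleText are illustrative:
// callModel wraps a single non-streaming request and returns the text
async function callModel(prompt: string): Promise<string> {
  const message = await client.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
  });
  // Concatenate the text blocks of the response
  return message.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("");
}
// The three prompts are independent, so run them in parallel
const [summary, title, followUps] = await Promise.all([
  callModel(`Summarize this article:\n${articleText}`),
  callModel(`Suggest a title for this article:\n${articleText}`),
  callModel(`Suggest 3 follow-up questions about:\n${articleText}`),
]);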
Practical use cases
Conversational chat
Replace static forms with natural conversations where the LLM extracts data automatically.
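One hedged way to do the extraction is to ask the model for a strict JSON object that mirrors your form fields and parse it before updating state. A sketch where the field names and prompt are illustrative:
type BookingForm = { name: string | null; date: string | null; partySize: number | null };
async function extractBookingFields(conversation: string): Promise<BookingForm> {
  const message = await client.messages.create({
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 256,
    messages: [{
      role: "user",
      content:
        `Extract the booking details from this conversation as JSON with keys ` +
        `"name", "date" and "partySize". Use null for anything not mentioned. ` +
        `Reply with JSON only.\n\n${conversation}`,
    }],
  });
  const first = message.content[0];
  const text = first.type === "text" ? first.text : "{}";
  return JSON.parse(text) as BookingForm; // validate with a schema library in real code
}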
Generative UI
The LLM generates dynamic React components based on user context:
const generatedComponent = await generateUI(userContext);
return <>{generatedComponent}</>;
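generateUI above is a placeholder. One safe way to implement it is to have the model return a small JSON spec and map it onto a whitelist of your own components, rather than rendering model-generated JSX directly. A sketch where the component names and spec shape are assumptions:
import { type ReactNode } from "react";
// Only components you wrote can ever be rendered
const registry: Record<string, (props: any) => ReactNode> = {
  PricingTable: ({ plan }) => <div>Pricing for {plan}</div>,
  FaqList: ({ topic }) => <div>FAQ about {topic}</div>,
};
// The shape the LLM is asked to return (as JSON), not raw JSX
type UISpec = { component: string; props: Record<string, unknown> };
function renderSpec(spec: UISpec): ReactNode {
  const Component = registry[spec.component];
  // Unknown component names are ignored instead of executed
  if (!Component) return null;
  return <Component {...spec.props} />;
}
Because the model only picks from a registry you control, generative UI stays declarative instead of turning into arbitrary code execution.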
RAG (Retrieval-Augmented Generation)
RAG connects an LLM to your own knowledge base so it answers from your data, not from its training data. The frontend pattern in React:
// 1. User asks a question
// 2. Embed the question and search your vector DB
// 3. Inject retrieved docs into the prompt
// 4. Stream the answer
async function ragQuery(userQuestion: string) {
  // Step 1: embed the question
  const embedding = await embedText(userQuestion);
  // Step 2: find relevant chunks from your knowledge base
  const relevantDocs = await vectorDB.search(embedding, { topK: 5 });
  // Step 3: build context-enriched prompt
  const context = relevantDocs.map(d => d.content).join('\n\n');
  // Step 4: stream answer to the UI
  const stream = anthropic.messages.stream({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Answer based on this context:\n\n${context}\n\nQuestion: ${userQuestion}`
    }]
  });
  return stream; // pipe directly to a ReadableStream in your route handler
}
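embedText and vectorDB are placeholders for your embedding provider and vector store. For small knowledge bases you can even do the search in memory with cosine similarity. A sketch where the Doc shape is an assumption:
type Doc = { content: string; embedding: number[] };
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// In-memory stand-in for vectorDB.search(embedding, { topK })
function searchDocs(docs: Doc[], queryEmbedding: number[], topK: number): Doc[] {
  return [...docs]
    .sort((a, b) =>
      cosineSimilarity(b.embedding, queryEmbedding) -
      cosineSimilarity(a.embedding, queryEmbedding)
    )
    .slice(0, topK);
}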
When to use RAG on the frontend:
- Product documentation chatbots (answers from your docs, not hallucinations)
- Support agents with access to your FAQ and ticket history
- Internal tools that query company knowledge bases
Connecting AI to external tools with MCPs
Model Context Protocol (MCP) is an open standard for giving LLMs structured access to tools, APIs, and data sources. Instead of hardcoding every integration, the model decides which tool to call and when. The underlying pattern is the same tool-use flow the Messages API exposes directly:
// Define tools the LLM can use
const tools = [
  {
    name: 'get_user_orders',
    description: 'Fetch order history for a user ID',
    input_schema: {
      type: 'object',
      properties: {
        userId: { type: 'string', description: 'The user ID' },
        limit: { type: 'number', description: 'Max orders to return' }
      },
      required: ['userId']
    }
  },
  {
    name: 'update_shipping_address',
    description: 'Update the shipping address for an order',
    input_schema: {
      type: 'object',
      properties: {
        orderId: { type: 'string' },
        address: { type: 'string' }
      },
      required: ['orderId', 'address']
    }
  }
];
// The model decides when to call them
const response = await anthropic.messages.create({
  model: 'claude-opus-4-5',
  max_tokens: 1024,
  tools,
  messages: [{ role: 'user', content: 'What are my last 3 orders?' }]
});
// Handle a tool call from the model
if (response.stop_reason === 'tool_use') {
  const toolUse = response.content.find(b => b.type === 'tool_use');
  if (toolUse && toolUse.type === 'tool_use') {
    const result = await executeToolCall(toolUse.name, toolUse.input);
    // Feed the result back and continue the conversation (see the round-trip sketch below)
  }
}
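The round trip back to the model uses a tool_result content block that references the tool_use id, so Claude can turn the raw data into a natural-language answer. A sketch continuing the example above (toolUse and result come from the branch just shown):
const followUp = await anthropic.messages.create({
  model: 'claude-opus-4-5',
  max_tokens: 1024,
  tools,
  messages: [
    { role: 'user', content: 'What are my last 3 orders?' },
    // Echo back the assistant turn that requested the tool
    { role: 'assistant', content: response.content },
    {
      // Tool results are sent as a user turn referencing the tool_use id
      role: 'user',
      content: [{
        type: 'tool_result',
        tool_use_id: toolUse.id,
        content: JSON.stringify(result),
      }],
    },
  ],
});
// followUp.content now holds the natural-language answer (or another tool call)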
Tool use and MCPs are the reason AI chat UIs can now perform real actions (place orders, update records, call internal APIs) without you writing a custom parser for every use case.
Error handling patterns designed for AI latency
Standard API error handling doesn't work for LLMs. You need patterns built around three new realities: responses take 2-15 seconds, they can stop mid-generation, and rate limits are counted in tokens as well as requests.
// ❌ What breaks with LLMs
try {
  const data = await fetch('/api/ai').then(r => r.json());
  setContent(data.text);
} catch (e) {
  setError('Something went wrong');
}
// ✅ What actually works
async function streamWithRecovery(prompt: string, retriesLeft = 1) {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), 30_000); // 30s hard limit
  try {
    const response = await fetch('/api/ai/stream', {
      method: 'POST',
      body: JSON.stringify({ prompt }),
      signal: controller.signal
    });
    if (!response.ok) {
      // Differentiate between rate limit (429) and model error (500)
      if (response.status === 429 && retriesLeft > 0) {
        const retryAfter = response.headers.get('retry-after') ?? '5';
        await delay(parseInt(retryAfter) * 1000);
        return streamWithRecovery(prompt, retriesLeft - 1); // retry once
      }
      throw new Error(`AI service error: ${response.status}`);
    }
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      const chunk = decoder.decode(value, { stream: true });
      setContent(prev => prev + chunk); // update UI incrementally
    }
  } catch (e) {
    if (e instanceof DOMException && e.name === 'AbortError') {
      setError('Response timed out. Try a shorter question.');
    } else {
      setError('AI service unavailable. Please try again.');
    }
  } finally {
    clearTimeout(timeoutId);
  }
}
The 3 loading states every AI feature needs (not the usual 1):
- thinking: request sent, waiting for first token (spinner)
- streaming: tokens arriving, show them progressively (no spinner, show partial text)
- complete: full response, enable copy/share/follow-up actions
Users tolerate 10-15 seconds of wait time when they can see tokens appearing. The same wait with a spinner causes abandonment.
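A minimal sketch of those three states in a client component, reading the same /api/ai/stream route as above (error handling is omitted here; combine with streamWithRecovery in practice):
'use client';
import { useState } from 'react';
type AIStatus = 'idle' | 'thinking' | 'streaming' | 'complete';
export function AskAI() {
  const [status, setStatus] = useState<AIStatus>('idle');
  const [content, setContent] = useState('');
  async function ask(prompt: string) {
    setStatus('thinking'); // request sent, no tokens yet
    setContent('');
    const response = await fetch('/api/ai/stream', {
      method: 'POST',
      body: JSON.stringify({ prompt }),
    });
    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      setStatus('streaming'); // first token arrived, drop the spinner
      setContent(prev => prev + decoder.decode(value, { stream: true }));
    }
    setStatus('complete'); // enable copy/share/follow-up actions
  }
  return (
    <div>
      {status === 'thinking' && <p>Thinking…</p>}
      <p>{content}</p>
      {status === 'complete' && (
        <button onClick={() => navigator.clipboard.writeText(content)}>Copy</button>
      )}
      <button onClick={() => ask('Explain RAG in one paragraph')}>Ask</button>
    </div>
  );
}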
Conclusions
AI integration on the frontend is the natural evolution of modern web development. The difference between a good implementation and a frustrating one is mostly in the details: streaming from the first token, error states designed for AI latency, and RAG that gives the model your actual data.
Key takeaways:
- ✅ Use streaming — it's the single biggest UX improvement
- ✅ Design 3 loading states, not 1
- ✅ Use RAG to ground answers in your data
- ✅ Use MCPs to give the model real actions, not just text
- ✅ Handle rate limits and timeouts explicitly
Need help integrating AI into your React or Next.js app? Contact me — I specialize in exactly this.