Building a Production RAG Pipeline That Doesn't Hallucinate

The Problem With Most RAG Tutorials

Most RAG tutorials follow the same script: load a PDF, chunk it with RecursiveCharacterTextSplitter, stuff it into Pinecone, and call it done. The demo works perfectly on the author’s machine with their carefully chosen sample document. Then you try it on real data and:

The retrieval returns irrelevant chunks from a different document
The LLM confidently cites sources that say the opposite of what it claims
Your Pinecone bill is $400 because you’re embedding the same documents on every deploy
There’s no way to know if the answer is actually good

I’ve built RAG systems for enterprise learning platforms, legal document review, and recipe recommendation. Here’s what actually matters in production.

1. Chunking Is Everything

The single most impactful decision in your RAG pipeline isn’t the embedding model or the vector database — it’s how you split your documents.

The naive approach (don’t do this)

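// ❌ Default separators: no sentence awareness
// (older LangChain releases export this from "langchain/text_splitter")
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

With no separators given, the splitter falls back to its defaults (paragraph break, line break, space), none of which knows what a sentence is. Any paragraph longer than chunkSize gets cut at whichever space falls nearest the limit, so chunks routinely begin and end mid-sentence. Your embeddings then represent sentence fragments, not ideas.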

What actually works

// ✅ Separator-aware chunking that prefers paragraph and sentence boundaries
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
  separators: ["\n\n", "\n", ". ", " ", ""], // paragraph → line → sentence → word → character
});

The separators array is the secret. The splitter tries paragraph boundaries first, then line breaks, then sentence ends, then spaces. Only as a last resort does it split mid-word. Most chunks therefore end on a natural boundary and read as a coherent unit of meaning.

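Key insight: Chunk overlap isn’t just about context. It makes sure that an idea spanning a chunk boundary is captured whole in at least one chunk. With chunkSize 1000 and 200 characters of overlap, chunk 2 starts around position 800 of chunk 1, so a sentence that begins at position 900 and is cut off at the end of chunk 1 reappears in full about 100 characters into chunk 2. Exact offsets shift a little because the splitter snaps to separators.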
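
To make the overlap arithmetic concrete, here is a minimal sketch. It uses a hypothetical fixed-window chunker (slidingChunks is not a LangChain function) that ignores separators, so the offsets come out exact. The real splitter will land near these positions rather than exactly on them.

```typescript
interface Chunk {
  start: number; // offset of the chunk's first character in the source text
  text: string;
}

// Fixed-window chunker. Unlike RecursiveCharacterTextSplitter it ignores
// separators, which keeps the offsets exact for this illustration.
function slidingChunks(text: string, chunkSize: number, chunkOverlap: number): Chunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const stride = chunkSize - chunkOverlap; // distance between consecutive chunk starts
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push({ start, text: text.slice(start, start + chunkSize) });
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

// A 72-character sentence that starts at offset 950, so it straddles
// the end of the first 1000-character window.
const sentence = "This sentence straddles the boundary between the first and second chunk.";
const doc = "x".repeat(950) + sentence + "y".repeat(1000);
const chunks = slidingChunks(doc, 1000, 200);

console.log(chunks.map((c) => c.start));        // [0, 800, 1600]
console.log(chunks[0].text.includes(sentence)); // false: chunk 1 holds only its first 50 characters
console.log(chunks[1].text.indexOf(sentence));  // 150: chunk 2 holds the whole sentence
```

The guarantee only covers spans shorter than the overlap: a passage longer than 200 characters that crosses a boundary can still be cut in every chunk that touches it.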

2. Embed Once, Not on Every Deploy

The most expensive mistake I see: regenerating embeddings on every CI/CD deploy.

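// ✅ Idempotent ingestion with deterministic document IDs
import { createHash } from "node:crypto";

export function documentId(source: string): string {
  return createHash("sha256").update(source).digest("hex").slice(0, 16);
}

// Before embedding, check if this document version already exists.
// `content` is the raw document text; the filter shape below is store-specific.
const contentHash = createHash("sha256").update(content).digest("hex");
const docId = documentId(`${source}:${contentHash}`);
// Note: the empty query is still sent to the embedding model (some providers
// may reject empty input); prefer a metadata-only lookup if your store has one.
const existing = await store.similaritySearchWithScore("", 1, {
  documentId: docId,
});
if (existing.length > 0) {
  console.log(`Document ${source} already indexed, skipping`);
  return;
}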

Hash the source path + content hash to generate a stable document ID. Before embedding, check if vectors for this ID already exist. This turns a $400/month embedding bill into $40.
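
One way to avoid querying the vector store at all is to keep a small manifest of what has already been embedded. The sketch below is an assumption-laden variation, not the starter kit’s implementation: Manifest, versionId, and planIngestion are hypothetical names, and the manifest is an in-memory Map that in practice would be persisted in a database or a file.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest: source path -> ID of the version that was last embedded.
type Manifest = Map<string, string>;

function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

// Stable ID for one version of one document: hash of "path:contentHash".
function versionId(source: string, content: string): string {
  return sha256(`${source}:${sha256(content)}`).slice(0, 16);
}

interface IngestPlan {
  toEmbed: string[]; // sources whose content is new or changed
  skipped: string[]; // sources already indexed at exactly this version
}

// Pure planning step: decides what needs embedding without touching the manifest.
function planIngestion(docs: Record<string, string>, manifest: Manifest): IngestPlan {
  const plan: IngestPlan = { toEmbed: [], skipped: [] };
  for (const [source, content] of Object.entries(docs)) {
    if (manifest.get(source) === versionId(source, content)) {
      plan.skipped.push(source);
    } else {
      plan.toEmbed.push(source);
    }
  }
  return plan;
}

const manifest: Manifest = new Map();
const docs: Record<string, string> = { "guide.md": "v1 text", "faq.md": "faq text" };

// First deploy: everything is new.
const firstRun = planIngestion(docs, manifest);
// Record each version only after its embedding call has succeeded.
for (const source of firstRun.toEmbed) manifest.set(source, versionId(source, docs[source]));

// Second deploy: only the edited file needs re-embedding.
const secondRun = planIngestion({ ...docs, "faq.md": "faq text, edited" }, manifest);
console.log(secondRun); // { toEmbed: ["faq.md"], skipped: ["guide.md"] }
```

Because the content hash is part of the ID, editing a file changes its ID and triggers a re-embed, while an untouched file is skipped no matter how often you deploy.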

3. Retrieval: More Isn’t Better

The default in most tutorials is topK: 4 or topK: 5. But retrieval quality isn’t about how many chunks you return — it’s about returning the right chunks.

The score threshold pattern

const results = await store.similaritySearchWithScore(query, topK * 2); // oversample
const relevant = results
  .filter(([_, score]) => score >= 0.7) // assumes higher = more similar; flip for distance-based stores
  .slice(0, topK); // results arrive best-first, so this keeps the top K

Oversample (fetch 2x what you need), filter by relevance score, then take the top K. This prevents low-relevance chunks from diluting the LLM’s context window.

Real numbers from production: With a 0.7 threshold on text-embedding-3-small + cosine similarity, we filtered out ~40% of initially retrieved chunks while improving answer accuracy by 22%.
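
The same pattern can be pulled into a small, testable helper. This is a sketch under assumptions: selectRelevant and its options are hypothetical names, and the higherIsBetter flag is there because some vector stores return a distance (lower is better) instead of a similarity.

```typescript
// [document, score] pairs, the shape LangChain's similaritySearchWithScore resolves to.
type Scored<T> = [doc: T, score: number];

interface ThresholdOptions {
  topK: number;
  threshold: number;        // e.g. 0.7 for cosine similarity
  higherIsBetter?: boolean; // set to false when the store returns a distance
}

function selectRelevant<T>(results: Scored<T>[], opts: ThresholdOptions): Scored<T>[] {
  const { topK, threshold, higherIsBetter = true } = opts;
  const passes = (score: number) => (higherIsBetter ? score >= threshold : score <= threshold);
  return results
    .filter(([, score]) => passes(score))                     // drop low-relevance chunks
    .sort(([, a], [, b]) => (higherIsBetter ? b - a : a - b)) // best first (filter already copied the array)
    .slice(0, topK);                                          // then take the top K
}

const hits: Scored<string>[] = [
  ["a", 0.91],
  ["b", 0.64],
  ["c", 0.78],
  ["d", 0.72],
  ["e", 0.55],
];
const kept = selectRelevant(hits, { topK: 2, threshold: 0.7 });
console.log(kept); // [["a", 0.91], ["c", 0.78]]
```

Note that the helper can return fewer than topK results, or none. An empty result is a useful signal: it is usually better to answer "I don’t know" than to generate from chunks that failed the threshold.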

Source deduplication

// Keep only the highest-scoring chunk per source document
const seenSources = new Set<string>();
const deduped: typeof chunks = [];
// Sort a copy so the caller's array keeps its original order
for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
  if (!seenSources.has(chunk.source)) {
    seenSources.add(chunk.source);
    deduped.push(chunk);
  }
}

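Without this, one long document can fill every slot: a 50-page PDF about “Kubernetes networking” can dominate your results even if the user asked about “database indexing” and that PDF only mentions it in passing, because overlapping chunks around the same passage all score alike. The trade-off is that you keep a single chunk per source, which can be too strict when the answer needs several passages from one document.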
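
If one chunk per source turns out to be too aggressive, a cap per source is a small generalization. The following is a sketch, not the starter kit’s code: capPerSource and the ScoredChunk shape are assumptions made for illustration.

```typescript
interface ScoredChunk {
  source: string;
  text: string;
  score: number; // higher = more relevant
}

// Keep at most `maxPerSource` chunks from each source, best-scoring first.
function capPerSource(chunks: ScoredChunk[], maxPerSource: number): ScoredChunk[] {
  const counts = new Map<string, number>();
  const kept: ScoredChunk[] = [];
  // Sort a copy so the caller's array is left untouched.
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const seen = counts.get(chunk.source) ?? 0;
    if (seen < maxPerSource) {
      counts.set(chunk.source, seen + 1);
      kept.push(chunk);
    }
  }
  return kept;
}

const retrieved: ScoredChunk[] = [
  { source: "k8s-networking.pdf", text: "…", score: 0.8 },
  { source: "postgres-indexing.md", text: "…", score: 0.77 },
  { source: "k8s-networking.pdf", text: "…", score: 0.82 },
  { source: "k8s-networking.pdf", text: "…", score: 0.79 },
];

console.log(capPerSource(retrieved, 1).map((c) => c.source));
// ["k8s-networking.pdf", "postgres-indexing.md"]
console.log(capPerSource(retrieved, 2).map((c) => c.score));
// [0.82, 0.8, 0.77]
```

With maxPerSource set to 1 this behaves like the loop above; raising it trades diversity for depth within a single document.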

4. The Hallucination Detection Layer

This is what separates a demo from production. After the LLM generates an answer, run a second, cheaper model to verify every claim against the source material:

const HALLUCINATION_CHECK_PROMPT = `Verify whether every claim in the answer
below is supported by the provided sources.

Answer: {answer}
Sources: {sources}

Unsupported claims (one per line, or "ALL_CLAIMS_SUPPORTED"):`;

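// Use a cheap, fast model for verification
import { ChatOpenAI } from "@langchain/openai";

const checker = new ChatOpenAI({
  modelName: "gpt-4o-mini",
  temperature: 0, // minimizes run-to-run variation
});

// Fill the template; `answer` and `sources` are strings built earlier in the pipeline
const prompt = HALLUCINATION_CHECK_PROMPT
  .replace("{answer}", () => answer)
  .replace("{sources}", () => sources);

// invoke() returns a message object, so compare its text content, not the object
const result = await checker.invoke(prompt);
const verdict = typeof result.content === "string" ? result.content.trim() : "";
if (verdict !== "ALL_CLAIMS_SUPPORTED") {
  // Flag for human review or regenerate with stricter prompt
}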

In my experience, this catches ~15-20% of hallucinations that would otherwise reach users. The cost is negligible — gpt-4o-mini verification costs about $0.0001 per check.
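
The verifier’s reply is still free text, so it helps to parse it defensively before acting on it. Below is a minimal sketch; parseVerifierOutput and the Verdict type are hypothetical names, and the parsing rules (strip list markers, tolerate quotes around the sentinel, treat anything unexpected as a failure) are assumptions about how a model might format its reply.

```typescript
type Verdict =
  | { supported: true }
  | { supported: false; unsupportedClaims: string[] };

const SENTINEL = "ALL_CLAIMS_SUPPORTED";

// True when a line is the sentinel, optionally wrapped in quotes.
const isSentinel = (line: string) => line.replace(/^["']+|["']+$/g, "") === SENTINEL;

function parseVerifierOutput(raw: string): Verdict {
  const lines = raw
    .split(/\r?\n/)
    // Strip leading bullets ("- ", "* ") and numbering ("1. ", "2) ").
    .map((line) => line.replace(/^\s*(?:[-*•]|\d+[.)])\s+/, "").trim())
    .filter((line) => line.length > 0);

  if (lines.length === 1 && isSentinel(lines[0])) {
    return { supported: true };
  }
  // Fail closed: an empty or mixed reply counts as "not verified".
  return { supported: false, unsupportedClaims: lines.filter((l) => !isSentinel(l)) };
}

console.log(parseVerifierOutput("ALL_CLAIMS_SUPPORTED\n"));
// { supported: true }
console.log(parseVerifierOutput("- The API supports 3.5 million users\n2. Pricing starts at $10"));
// { supported: false, unsupportedClaims: ["The API supports 3.5 million users", "Pricing starts at $10"] }
```

Failing closed matters here: if the verifier times out or returns something unparseable, the answer goes to review instead of silently passing.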

5. Structured Output Is Non-Negotiable

Free-text LLM responses are hard to validate programmatically. Use function calling (or structured output mode) to enforce a schema:

import { z } from "zod";

const answerSchema = z.object({
  answer: z.string(),
  citations: z.array(z.object({
    index: z.number(),
    source: z.string(),
    quote: z.string(), // exact text from source
  })),
  confidence: z.enum(["high", "medium", "low"]),
  needsMoreInfo: z.boolean(),
});

This gives you:

Machine-verifiable citations — every claim links to a source quote
Confidence signaling — the LLM self-assesses, and you can route low-confidence answers to human review
Programmatic fallback — if needsMoreInfo: true, trigger a follow-up retrieval or ask the user for clarification
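
Here is one way the three benefits above could be wired together. This is a sketch with hypothetical names (findBadCitations, routeAnswer, Route): the interfaces mirror the schema fields as plain TypeScript types so the example has no dependencies, and the routing order is an assumption rather than a rule.

```typescript
interface Citation {
  index: number;
  source: string;
  quote: string;
}

interface StructuredAnswer {
  answer: string;
  citations: Citation[];
  confidence: "high" | "medium" | "low";
  needsMoreInfo: boolean;
}

type Route = "return" | "retrieve_more" | "human_review";

// Collapse whitespace and case so line wraps in the source do not break matching.
const normalize = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();

// Returns the citations whose quote does NOT appear in the source they cite.
function findBadCitations(answer: StructuredAnswer, sources: Map<string, string>): Citation[] {
  return answer.citations.filter((c) => {
    const sourceText = sources.get(c.source);
    const quote = normalize(c.quote);
    if (sourceText === undefined || quote.length === 0) return true;
    return !normalize(sourceText).includes(quote);
  });
}

function routeAnswer(answer: StructuredAnswer, sources: Map<string, string>): Route {
  if (answer.needsMoreInfo) return "retrieve_more";
  if (answer.citations.length === 0) return "human_review"; // nothing to verify against
  if (findBadCitations(answer, sources).length > 0) return "human_review";
  if (answer.confidence === "low") return "human_review";
  return "return";
}

const sourceTexts = new Map([["handbook.md", "Employees accrue  1.5 vacation days\nper month."]]);
const goodAnswer: StructuredAnswer = {
  answer: "You earn 1.5 vacation days each month.",
  citations: [{ index: 1, source: "handbook.md", quote: "accrue 1.5 vacation days per month" }],
  confidence: "high",
  needsMoreInfo: false,
};
console.log(routeAnswer(goodAnswer, sourceTexts)); // "return"
```

The quote check is a plain substring match, so it verifies that the quote exists in the source, not that the quote supports the claim. That second question is what the verifier model is for.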

6. Observability: You Can’t Fix What You Can’t See

console.log is not observability. In production, you need to trace every step of the pipeline:

// Langfuse's callback handler instruments LangChain calls
import { CallbackHandler } from "langfuse-langchain";

const handler = new CallbackHandler({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Each call that receives the handler is traced; pass it to chains and retrievers too
await model.invoke(prompt, { callbacks: [handler] });

Once verifier verdicts and user feedback are logged as scores, Langfuse gives you a dashboard showing:

Latency per pipeline stage (retrieval vs generation vs verification)
Cost per query (token counts × model pricing)
Hallucination rate over time (are things getting better or worse?)
User feedback (thumbs up/down on answers)
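
Cost per query is token counts multiplied by per-token prices, summed over the stages. A minimal sketch of that arithmetic, assuming you collect token usage per stage yourself; queryCostUsd is a hypothetical helper, and the model names and prices are placeholders, not values from any real price list.

```typescript
interface StageUsage {
  stage: "retrieval" | "generation" | "verification";
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// USD per 1M tokens. PLACEHOLDER numbers for illustration only;
// replace them with your provider's current prices.
const PRICING: Record<string, { input: number; output: number }> = {
  "example-large": { input: 2.5, output: 10 },
  "example-small": { input: 0.15, output: 0.6 },
};

function queryCostUsd(usage: StageUsage[]): number {
  return usage.reduce((total, u) => {
    const price = PRICING[u.model];
    if (!price) throw new Error(`No pricing configured for model ${u.model}`);
    const stageCost = (u.inputTokens * price.input + u.outputTokens * price.output) / 1_000_000;
    return total + stageCost;
  }, 0);
}

const usage: StageUsage[] = [
  { stage: "generation", model: "example-large", inputTokens: 3000, outputTokens: 400 },
  { stage: "verification", model: "example-small", inputTokens: 3500, outputTokens: 50 },
];

// generation:   (3000 * 2.5  + 400 * 10)  / 1e6 = 0.0115
// verification: (3500 * 0.15 + 50  * 0.6) / 1e6 = 0.000555
console.log(queryCostUsd(usage).toFixed(6)); // "0.012055"
```

Tracking the stages separately shows where the money goes. In this made-up example the verifier adds under 5% to the cost of a query.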

I chose Langfuse over LangSmith because it’s open-source and self-hostable, so your data stays yours.

The Full Pipeline (All Together)

Here’s the complete production RAG flow:

User Query
    │
    ▼
Retrieval (oversample → score filter → deduplicate)
    │
    ▼
Context Assembly (ranked chunks + source metadata)
    │
    ▼
Answer Generation (LLM with function-calling for structured output)
    │
    ▼
Hallucination Check (gpt-4o-mini verifier)
    │
    ├── Pass → Return answer with citations
    │
    └── Fail → Regenerate with stricter prompt, or flag for review
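
The flow above can be sketched as a single orchestration function. This is an illustration under assumptions, not the starter kit’s code: answerQuery, PipelineDeps, and the two-attempt retry policy are hypothetical, and each stage is injected as a function so the control flow can be tested with stubs.

```typescript
interface PipelineDeps {
  retrieve: (query: string) => Promise<string[]>; // oversample, filter, deduplicate
  generate: (query: string, context: string[], strict: boolean) => Promise<string>;
  verify: (answer: string, context: string[]) => Promise<boolean>; // true = all claims supported
}

interface PipelineResult {
  status: "ok" | "needs_review";
  answer: string;
  attempts: number; // how many generations were run
}

async function answerQuery(
  query: string,
  deps: PipelineDeps,
  maxAttempts = 2,
): Promise<PipelineResult> {
  const context = await deps.retrieve(query);
  let answer = "";
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts++;
    const strict = attempts > 1; // after a failed check, regenerate with the stricter prompt
    answer = await deps.generate(query, context, strict);
    if (await deps.verify(answer, context)) {
      return { status: "ok", answer, attempts };
    }
  }
  // Every attempt failed verification: hand the last answer to a human.
  return { status: "needs_review", answer, attempts };
}

// Stub stages: the first draft fails verification, the strict retry passes.
const stubDeps: PipelineDeps = {
  retrieve: async () => ["Employees accrue 1.5 vacation days per month."],
  generate: async (_query, _context, strict) => (strict ? "1.5 days per month" : "2 days per month"),
  verify: async (answer, context) => context.some((c) => c.includes(answer.split(" ")[0])),
};

answerQuery("How much vacation do I accrue?", stubDeps).then((result) => console.log(result));
// { status: "ok", answer: "1.5 days per month", attempts: 2 }
```

Retrieval runs once and is reused across attempts, so a retry only pays for generation and verification.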

Every step is traced, metered, and measurable. This isn’t a demo — it’s infrastructure.

What’s Next

This is the first in a three-part series. In the next post, I’ll cover MCP (Model Context Protocol) servers — how to give your agents tools without coupling them to specific APIs.

The full source code is available at github.com/mamenesia/ai-agent-starter-kit. Star it if you’re building something similar.

This post is part of my AI Agent Starter Kit project. I’m building production AI infrastructure in public — follow along on LinkedIn or GitHub.