Preventing RAG Regressions: Eval Harnesses and Production Gates

Part 3 forced quote-first answers. That fixed one failure mode. It did not give me a way to know whether I was still moving forward.

RAG pipelines moonwalk easily. Retrieval and generation are separate layers, but each one can play a horrible game I like to call moonwalk-whack-a-mole. A retrieval fix that surfaces the right chunk for one question can knock another question off the list that used to pass. A prompt tweak that nails one ruling can break a different ruling that worked last week. You feel like you are moving forward because the case in front of you is fixed. Without something to replay after each change, you are guessing.

Moonwalking backward while appearing to move forward

This post covers what I built to stop guessing, then a second section of operational pieces that did not fit cleanly into the earlier installments.

Preventing Regressions

The Reality Check

Fixing the bug in front of you is seductive. You ship the patch, run one question that used to fail, celebrate, and move on. Two weeks later a different question regresses and you have no idea which change caused it.

A golden set is the usual antidote: a curated list of questions with known correct answers you re-run after every meaningful change. Textbook guidance says aim for 50 to 150 examples, weighted toward retrieval (60% to 70%) with the rest covering generation.

I did not have time for that. I had a corpus, a retrieval stack I was still changing, and a prompt I was tuning by hand.

What I Built Instead

I wrote a script that asks an LLM to generate player-style questions from random chunks in the index. For each chunk it produces two variants:

Direct: vocabulary stays close to the source section.
Indirect: same rule, different phrasing (the Alias Gap shape from Part 2 without naming every lane again).

Because each generated question is tied to a source chunk, the test is simple: did retrieval return that chunk?

The harness reports Recall@5 (did the expected chunk appear anywhere in the top five retrieved results?), plus Recall@1, Recall@3, and MRR (mean reciprocal rank: how high the first correct chunk landed). On a baseline run in early June, 96 question pairs scored Recall@5 = 0.74 (0.79 direct, 0.69 indirect).
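As a rough sketch, the metrics above could be computed like this. The EvalCase shape and the function names are illustrative assumptions, not the harness's actual code; the only thing taken from the post is that each question is tied to one expected chunk.

```typescript
// Hypothetical shape: one generated question's expected chunk plus
// the ranked chunk ids that retrieval returned for it.
type EvalCase = {
  expectedChunkId: string
  retrievedChunkIds: string[] // ranked, best first
}

// 1-based rank of the expected chunk, or null if it was never retrieved.
function rankOf(c: EvalCase): number | null {
  const i = c.retrievedChunkIds.indexOf(c.expectedChunkId)
  return i === -1 ? null : i + 1
}

// Recall@k: fraction of cases whose expected chunk landed in the top k.
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0
  const hits = cases.filter((c) => {
    const r = rankOf(c)
    return r !== null && r <= k
  }).length
  return hits / cases.length
}

// MRR: average of 1/rank per case, counting 0 when the chunk is missing.
function meanReciprocalRank(cases: EvalCase[]): number {
  if (cases.length === 0) return 0
  const total = cases.reduce((sum, c) => {
    const r = rankOf(c)
    return sum + (r === null ? 0 : 1 / r)
  }, 0)
  return total / cases.length
}
```

Running the same functions over only the direct or only the indirect questions gives the per-variant split.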

That number is a baseline, not a victory lap. The eval runner uses a simpler retrieval path than production: one dense embedding and lexical search, no alias expansion, no dual dense lanes, no core rulebook boost. I use it to measure ingest and raw retrieval quality. Production adds the lanes from Part 2 on top.

I have added hand-written questions to the set as real failures showed up. Generation is a different story. The harness records a keyword hit rate on answers (0.35 on that same baseline run), but I do not treat that as a generation eval. Every generation check still costs an LLM call and counts against the free-tier rate limits I am trying to stay inside. I re-run interesting questions manually, including the dice question from Part 3. I have not built an automated generation suite. The cost and quota math did not justify it for how I was working at the time.

Logging to Grow Signals

Shortly before beta I added query logging. Every completed call stores the question, the answer, latency, token counts, raw model output, and two chunk lists:

Retrieved: everything in the top-ten context window.
Cited: the subset the model actually used in its answer.

That split is what I reach for when someone says the ruling was wrong. If the right chunk was never retrieved, I look at ingest, alias tables, or search lanes. If it was retrieved but not cited, I look at the quote-first prompt, provider behavior, or reasoning drift.

// query_logs: simplified row shape (not DDL; token-count fields omitted)
type QueryLog = {
  id: string
  question: string
  answer: string
  retrieved_chunk_ids: string[] // top-10 context window
  cited_chunk_ids: string[] // subset the model actually used
  raw_model_output: string
  latency_ms: number
}
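The retrieved-versus-cited split lends itself to a small triage helper. This is a sketch under assumptions: the function, the Layer labels, and the idea of passing in the chunk a reviewer expected are all hypothetical, and it uses only the two chunk-list fields from the row shape.

```typescript
// Minimal slice of the logged row; only the two chunk lists matter here.
type ChunkLists = {
  retrieved_chunk_ids: string[]
  cited_chunk_ids: string[]
}

// Which layer to inspect first (hypothetical labels).
type Layer = 'retrieval' | 'generation' | 'retrieved_and_cited'

// Given the chunk a reviewer says should have grounded the answer,
// point at the layer to look at first.
function triage(log: ChunkLists, expectedChunkId: string): Layer {
  // Never retrieved: look at ingest, alias tables, or search lanes.
  if (log.retrieved_chunk_ids.indexOf(expectedChunkId) === -1) return 'retrieval'
  // Retrieved but not cited: look at the prompt, provider, or reasoning drift.
  if (log.cited_chunk_ids.indexOf(expectedChunkId) === -1) return 'generation'
  // Retrieved and cited: the chunk was used, so the problem is elsewhere.
  return 'retrieved_and_cited'
}
```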

The UI has thumbs up and thumbs down on responses. An admin review route lets me filter and sort logged queries without opening the database. Logging is a tradeoff: enough signal to debug the failures I can anticipate, not so much storage that a side project turns into a logging bill. The schema will grow when a real blind spot shows up.

New logs also feed the golden set. Retrieval examples are cheap to add: even some wrong answers still retrieved the right chunk. Generation examples are expensive to validate, so I only keep questions that represent a failure class I care about.

Residual Production Realities

This section is a grab bag: interesting operational pieces that did not fit cleanly into the ingest, retrieval, or prompting posts.

Surgical Overrides

I try to fix classes of problems, not one-off PDF quirks. Still, I added an override table. If the vision parse garbles one stat line, re-ingesting an entire book to fix one number is wasteful. Overrides store corrected text for a specific chunk, leave the original embedding alone, and swap the text only when assembling context for the answer generator. Delete the row and the original OCR text comes back. I have not needed it yet. It is insurance.
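A minimal sketch of that swap, assuming overrides are loaded into a plain object keyed by chunk id. The Chunk and Overrides shapes and the function name are hypothetical; the real table layout is not shown in this post.

```typescript
type Chunk = { id: string; text: string }

// Hypothetical in-memory view of the override table: chunk id -> corrected text.
type Overrides = Record<string, string>

// Swap in corrected text at context-assembly time only.
// The input chunks are not mutated, so stored text and embeddings stay as ingested.
function applyOverrides(chunks: Chunk[], overrides: Overrides): Chunk[] {
  return chunks.map((c) =>
    Object.prototype.hasOwnProperty.call(overrides, c.id)
      ? { ...c, text: overrides[c.id] }
      : c,
  )
}
```

Because the swap happens on a copy, removing an entry from the overrides restores the original text on the next request with no re-ingest.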

Provider Registry: Surviving the 413 Incident

Early in development I hit an HTTP 413 (Payload Too Large) from Groq. A short question had pulled several large rule chunks into context. The free-tier payload ceiling was lower than my assembled context. The API route returned 500 and the feature went dark.

What swapping models bought: Immediate recovery. I could point the answer call at a provider with a higher context ceiling and keep testing.

What hardcoding a single provider would have cost: Every outage or pricing change becomes a redeploy. I was iterating on prompts daily.

Why a registry won: One environment variable selects the answer backend. The rest of the stack expects a structured answer object and does not care which API fulfilled it.

// features/rules-qa/_lib/providers/index.ts
export function getOracleProvider(): OracleProvider {
  const provider = process.env.LLM_PROVIDER ?? 'groq'
  if (provider === 'deepseek') return new DeepSeekProvider()
  return new GroqProvider() // default + fallback
}

Today that is Groq (qwen/qwen3-32b) or DeepSeek (deepseek-v4-flash). Latency, token billing, and occasional empty responses still differ. Provider choice is operations config, not architecture gospel.

Defensive Gates

A pre-flight relevance classifier is a tiny model call that answers one question: is this about tabletop game rules? I run it before embedding, retrieval, and the main answer model. Llama 3.1 8B Instant on Groq, five tokens max, temperature 0. Junk queries get rejected before they touch pgvector.

If the classifier call errors, the route fails open. I would rather waste one retrieval than block a legitimate question because the classifier hiccuped.
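The fail-open behavior can be sketched as a small wrapper. The classifier is passed in as a function so the example stays self-contained; its signature and the wrapper's name are assumptions, not the route's actual code.

```typescript
// Hypothetical classifier signature: resolves true when the question
// is about tabletop game rules, false when it is junk.
type RelevanceClassifier = (question: string) => Promise<boolean>

// Returns true when the request should continue on to retrieval.
// Only an explicit "no" rejects; a classifier error counts as "allow".
async function passesRelevanceGate(
  question: string,
  classify: RelevanceClassifier,
): Promise<boolean> {
  try {
    return await classify(question)
  } catch {
    // Fail open: one wasted retrieval beats blocking a legitimate question.
    return true
  }
}
```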

For prompt injection, the user's words arrive inside <question> tags. The system prompt treats that region as a question only. That covers direct injection in the user box. Hostile text inside a retrieved chunk is a different problem. Citation filtering and structural truth at ingest still matter.
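A sketch of the wrapping step. The tag name comes from the paragraph above. Stripping literal question tags out of the user's text first is an added assumption here, not something this post confirms the route does. The function name is hypothetical.

```typescript
// Wrap user text in <question> tags for the prompt.
// Assumption: literal <question> / </question> tags are stripped from the input
// first, so user text cannot close the region early and continue as instructions.
function wrapQuestion(userText: string): string {
  const tagPattern = /<\/?question\s*>/gi
  let sanitized = userText
  let previous: string
  // Repeat until stable so split tags like "<ques<question>tion>" cannot reassemble.
  do {
    previous = sanitized
    sanitized = sanitized.replace(tagPattern, '')
  } while (sanitized !== previous)
  return `<question>${sanitized}</question>`
}
```

This only addresses the user box. It does nothing about hostile text that arrives inside a retrieved chunk.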

Beta Safeguards and Economics

Hosted inference costs money. This is a side project I pay for myself. I am not interested in a runaway bill because a bot farm found an open endpoint.

The tool is invite-only behind Google sign-in. New accounts land in a pending state until I approve them manually. Beyond that:

Daily query caps: approved free-tier users get ten queries per day.
Circuit breakers: a config flag can shut off model calls site-wide without a redeploy.
Pre-allocated usage: the API reserves a quota slot before the model call starts so parallel requests cannot race the daily limit.
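The reserve-before-call idea from the last item can be sketched in memory. This is an illustration only: the ledger, its key format, and the function names are hypothetical, and a real deployment would need the same check-and-increment to happen atomically in the database.

```typescript
// In-memory stand-in for a usage table. Hypothetical; the post does not show the real one.
function createQuotaLedger(dailyCap: number) {
  const used: Record<string, number | undefined> = {}
  return {
    // Check and increment in one synchronous step, before any await,
    // so two parallel requests cannot both see the same remaining count.
    tryReserve(userId: string, day: string): boolean {
      const key = `${userId}:${day}`
      const current = used[key] ?? 0
      if (current >= dailyCap) return false
      used[key] = current + 1
      return true
    },
  }
}
```

A route would call tryReserve first and start the model call only when it returns true. With the cap of ten from the list above, the eleventh request on the same day is refused before any tokens are spent.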

At about one hundred questions per day with six thousand tokens of context per answer, generation lands around four dollars per month in my estimates. Usage is much lighter than that today, which is the point of the safeguards: caps and circuit breakers keep a side project from turning into a surprise invoice while I still have headroom if a few beta users actually use it. I would rather throttle early than optimize for scale I don't have yet.
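The arithmetic behind that estimate can be written out. The 30-day month is an assumption. The calculation counts context tokens only, ignoring output tokens and classifier calls. The per-million rate is simply what the four-dollar figure implies, not a quoted provider price.

```typescript
// Context tokens sent to the answer model per month (assumes a 30-day month).
function monthlyContextTokens(
  questionsPerDay: number,
  tokensPerAnswer: number,
  days: number = 30,
): number {
  return questionsPerDay * tokensPerAnswer * days
}

// 100 questions/day * 6,000 tokens * 30 days = 18,000,000 tokens per month.
const tokensPerMonth = monthlyContextTokens(100, 6000)

// Blended rate implied by the ~$4/month estimate: roughly $0.22 per million tokens.
const impliedUsdPerMillion = 4 / (tokensPerMonth / 1e6)
```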

Where the Series Lands

I never expected a pipeline over a hostile PDF library to be right on the first try. I wanted a pipeline where each stage has a defined job and a defined place to look when that stage fails.

Ingest (Part 1) buys structural truth and caches it so iteration stays cheap.
Retrieval (Part 2) fuses lanes instead of betting on one embedding.
Generation (Part 3) forces evidence before conclusions.
Measurement (this post) separates retrieval drift from generation drift and keeps providers swappable when APIs have bad days.

The test corpus is a wargame PDF library I own and play. I picked it because the layout is messy in useful ways. When this opens wider, the books are swappable. The architecture is what I wanted to keep.

The pipeline is instrumented enough now that the next failure should point at a layer instead of a mood. That is about what I was hoping for on a side project.

Earlier in this series: Part 1: Ingest · Part 2: Retrieval · Part 3: Prompting