Fielding Questions: Using JSON Constraints to Force Grounded Answers

New to this series? Open for context

Engineering the Rules Oracle

Modern tabletop games rely on massive, highly fragmented ecosystems of rulebooks, supplements, and constantly updating FAQs. When an obscure rule interaction or edge case arises mid game, it takes players out of the fun. I am building the Rules Oracle to solve this: a hosted Q&A engine that provides cited, page-referenced answers to complex rules questions. This series will cover my thoughts and learnings on engineering the initial work on the Rules Oracle.

Access is currently invite only during the beta period.

  1. Part 1: In Good Character: Designing an Ingestion Pipeline for Hostile Tabletop Rules
  2. Part 2: Pulling Rank: Using Fused Retrieval to Bridge the Alias Gap
  3. Part 3: Fielding Questions: Using JSON Constraints to Force Grounded Answers ← you are here
  4. Part 4: Preventing RAG Regressions: Eval Harnesses and Production Gates

You are mid-game. Someone asks a rules question. Everyone waits while someone thumbs through a PDF or argues from memory.

I tried the obvious shortcut first: ask a general-purpose AI. For a mainstream game like D&D, that sometimes works. For a wargame ruleset with a thinner training footprint, the answers come back just as confident and wrong as often as not. Push it to cite a page or quote the rule, and it will often invent one. Confident prose with fabricated evidence is worse than admitting uncertainty.

That is why I built citation-first RAG. Retrieval's job is to put real rule text in front of the model. Generation's job is to answer from that text. I also wanted a fallback when generation drifts: if retrieval did its job, the response should still show the quoted passages. Your group can read the actual rules and make the call without opening the book.

A RAG stack still has two jobs that fail independently. Retrieval can miss the relevant chunk. Generation can do something subtler: the right chunks are in context, but the model latches onto a nearby passage, skips a threshold rule, and states a conclusion that sounds authoritative. This post is about the second problem. Fielding the question once the evidence is (hopefully) on the page.

Most of my test questions came from a Discord server where people play the game. One of them:

I was checking a rule in Saga Age of Magic and I'm not sure: do Creatures continue to generate saga dice down to the last model in the unit?

That question chains a unit type (Creatures), a modifier (Presence), and a cutoff rule (when a unit stops generating dice). The kind of ruling where a general chatbot will pattern-match and sound sure. I needed the answer call to quote the passages first and show the chain before it committed to yes or no.


The JSON Constraint: Evidence Before Conclusions

The Reality Check

Free-form answers hide the reasoning chain. On modifier math (a unit counts as two models for one rule but one for another), the model can jump straight to a number that sounds right. You cannot tell whether it quoted the threshold rule, applied the modifier, or guessed.

Think of a deposition, not an essay. The witness lists the exhibits first. Then explains how they connect. Only then states the conclusion. Reorder that sequence and the conclusion stops being evidence-backed.

The Technical Evidence

The system prompt requires a single JSON object and nothing else: no preamble, no markdown fences. Fields must appear in this order: rules, reasoning, answer, confidence, citedChunkIds, caveat. The model is not allowed to reach answer until rules and reasoning are on the page.

The rules field is an array of objects. Each entry binds a chunk ID to a verbatim quote from that chunk. Paraphrase is forbidden. If the model cannot find a quote that bears on the question, it cannot build a rule entry for that chunk.

The reasoning field must work only from those quotes. When a rule modifies how models are counted (figures, wounds, remaining models in a unit), the prompt requires an explicit chain: raw count, modified count, then whether a threshold is met. The model commits to a conclusion only after that chain is written.

For the dice question, the model must return a JSON object like this before the client formats it for display:

{
  "rules": [
    {
      "chunkId": "d5d6602f-474a-4bc3-b19d-8aefe3a6f84a",
      "text": "The Presence rule which affects every Creature unit means that each figure in the unit counts as 4 figures when determining the effects of Saga abilities."
    },
    {
      "chunkId": "9d642c4b-dab3-4482-9c5d-20cdfbe07e2d",
      "text": "As soon as a unit of Mercenary Warriors has 3 figures or less, it no longer generates a Saga die."
    }
  ],
  "reasoning": "Saga dice generation is an effect of Saga abilities. Presence makes each Creature figure count as 4 figures for those effects. One remaining Creature figure counts as 4, which is above the 3-or-fewer cutoff where a unit stops generating its Saga die.",
  "answer": "Yes. Creatures continue to generate Saga dice down to the last model. Presence counts each Creature figure as 4 figures when determining Saga ability effects, including dice generation. A single remaining figure still counts as 4, above the threshold where generation stops.",
  "confidence": "high",
  "citedChunkIds": [
    "d5d6602f-474a-4bc3-b19d-8aefe3a6f84a",
    "9d642c4b-dab3-4482-9c5d-20cdfbe07e2d"
  ],
  "caveat": null
}

On the client the player sees the answer prose plus a citation panel: book title, heading path, page numbers, and the chunk text for each cited passage. If the conclusion is wrong but retrieval was solid, those quotes are still on screen. That is the whole point. The table can disagree with the model and still have the evidence in front of them.

If the threshold FAQ chunk had been missing from context, the honest path is an empty or partial rules array and a short answer saying the books provided do not cover it. That is the behavior I want when retrieval misses. The generation layer should not paper over a gap with confident prose.

The player's question arrives inside <question> tags in the provider message. The system prompt tells the model to treat that region as a question only, not as instructions that override the system prompt. That is a narrow guard against direct injection in the user box. Hostile text inside a retrieved chunk is a different problem.

The Gotcha

The field order is a prompt convention, not API-level schema enforcement. I ask for rules before reasoning before answer because that sequence is the behavior I want. The model still returns JSON I have to parse. Prompt order and parsed order are not the same guarantee a strict schema would give you.

Models also leak internal chunk IDs into reasoning or answer despite instructions not to. A cleanup pass for that is in the next update. Until then, I filter cited IDs to the retrieved set so stray UUIDs do not become bogus citations.

Where This Lives in the Lifecycle

Heavy structural work (vision parse, alias extraction, supersession flags) runs at ingest: fix vocabulary and layout once, query many times. Quote-first JSON constraints belong on the hot path. Every question pays for this prompt overhead, so I kept the schema to fields I actually read when debugging a bad answer.


Conclusion: Retrieved Is Not Cited

Quote-first output only works when the right chunk is in the window. When it is, the quotes land on the client even if the conclusion is off. An empty rules array with an honest "not in context" answer is the generation layer confirming retrieval missed.

When a player says the answer was wrong, I need to split the failure:

  • Retrieved but not cited: the Presence chunk or the 3-figures FAQ was in the top ten, but the model never quoted it. That points at prompt shape, provider behavior, or reasoning drift.
  • Never retrieved: the chunk never entered context. That points back at ingest, aliases, or search lanes.

I was already running manual generation checks on questions like the dice question while I tuned this prompt. Shortly before beta I added query logging that stores both chunk lists on every completed call. Part 4: Evaluation is where I write about that logging, the retrieval eval harness, and how I grow golden signals over time. Generation checks stayed manual. I did not build an automated suite for them.

For the dice question specifically, the log row I want to see cites the Presence chunk and the FAQ passage that sets the 3-figures cutoff. If only a tangentially related passage from the core rulebook shows up in cited_chunk_ids, I know which layer failed without guessing.