Structure Before Semantics: Local RAG for Adversarial PDFs

Graph-based retrieval, hierarchical chunking, and two-pass tokenization in TypeScript and Ollama

Why, you might ask, would I decide to build a local RAG (Retrieval-Augmented Generation) pipeline in TypeScript using Ollama? I wish I could say it stemmed from a desire to crush my friends' weak tabletop RPG rules knowledge. Alas, the truth is that our rules knowledge is collectively spotty because we switch game systems too often, and I was tired of asking Gemini during sessions only to have it confidently lie to me. I wanted a project that demonstrated end-to-end ML system design — so I built one that actually solved a problem I had.

Why make it local first? Why use unpdf, Ollama running qwen2.5:3b-instruct for generation, and nomic-embed-text for embeddings? Why struggle to be cheap in both API spending and processing power?

Again, I wish I could say it stemmed from a desire to demonstrate architectural mastery, to prove that I could do more with less. Alas, the truth is far less epic: my primary constraint was a 6-year-old Intel Mac where every ML workload is a bottleneck, and, let's not forget, I'm cheap.

Having said that, when building a RAG pipeline, you can absolutely just slap a massive context window on a problem and throw the smartest, most expensive model at it. That's a perfectly valid strategy. But that's like buying a Porsche when a Honda Civic with the right aftermarket parts will get you there just fine. Plus, the Porsche is still going to hallucinate once in a while.

A real challenge in the AI space right now is cost optimization and constraint-aware design. My aging hardware forced some interesting architectural choices, and working within those limits turned out to be the most interesting part of the project. The full source code is on GitHub.

PDF Ingestion and Tokenization Over Adversarial Formats

The initial plan was simple: unpdf, tokenize, embed. During the planning phase, I had used an AI assistant to scope the work. The model (let's call it Confident-But-Wrong) laid out a beautiful strategy about tokenizers, padding, and wiggle room. It estimated exactly how everything would fit together. It was perfectly, cleanly incorrect.

Immediately, tokenization choked. I asked Confident-But-Wrong for a fix, and it kept suggesting we just widen the buffer — the equivalent of teaching kids guess-and-check math. Instead, I added logging to see where it was failing.

The culprit? Kickstarter backer lists at the end of the PDFs. Proper names are an absolute nightmare for tokenizers: a BPE vocabulary splits unusual names into many small subword pieces, so the token count per word explodes unpredictably.

To fix this, I implemented a two-pass tokenization strategy.

First, as text accumulates word-by-word, a cheap estimate (word count multiplied by three) acts as a tripwire. It's fast and costs almost zero CPU cycles.

Second, once that estimate gets close to the target chunk size, I call gpt-tokenizer, a BPE tokenizer compatible with tiktoken, and multiply its count by 1.5 to approximate Ollama's higher token counts. That gives a much tighter bound on the chunk's size before flushing.

The measured token count then gets stamped directly onto the chunk record in the database. Now, query-time operations (reranking, context budgeting, neighbor expansion) can reason about chunk sizes without ever re-tokenizing a single string.
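
As a rough sketch of that two-pass flow (not the code from the repo): the names, the 0.8 "close enough" ratio, and the injected counter are assumptions for illustration. In a real setup the counter could wrap gpt-tokenizer's `encode`; a word-counting stand-in is used here so the block runs on its own.

```typescript
// Hypothetical names throughout; a sketch of the idea, not the project's code.
type TokenCounter = (text: string) => number;

interface ChunkRecord {
  text: string;
  tokenCount: number; // stamped once at ingest, reused at query time
}

const CHEAP_TOKENS_PER_WORD = 3; // pass 1 multiplier, from the article
const OLLAMA_SCALE = 1.5;        // pass 2 scaling factor, from the article
const TRIPWIRE_RATIO = 0.8;      // assumption: what counts as "close" to the target

function chunkWords(
  words: string[],
  targetTokens: number,
  countTokens: TokenCounter, // e.g. (t) => encode(t).length with gpt-tokenizer
): ChunkRecord[] {
  const chunks: ChunkRecord[] = [];
  let buffer: string[] = [];

  const scaledCount = (ws: string[]): number =>
    Math.ceil(countTokens(ws.join(" ")) * OLLAMA_SCALE);

  const flush = (ws: string[]): void => {
    if (ws.length > 0) {
      chunks.push({ text: ws.join(" "), tokenCount: scaledCount(ws) });
    }
  };

  for (const word of words) {
    buffer.push(word);

    // Pass 1: cheap estimate, no tokenizer call.
    const estimate = buffer.length * CHEAP_TOKENS_PER_WORD;
    if (estimate < targetTokens * TRIPWIRE_RATIO) continue;

    // Pass 2: the tripwire fired, so pay for a real (scaled) count.
    if (buffer.length > 1 && scaledCount(buffer) > targetTokens) {
      const overflow = buffer.pop() as string;
      flush(buffer);
      buffer = [overflow];
    }
  }
  flush(buffer);
  return chunks;
}

// Stand-in counter for demonstration: one token per whitespace-separated word.
const wordCounter: TokenCounter = (text) =>
  text.split(/\s+/).filter(Boolean).length;
```

One caveat with this sketch: on very token-dense text (a backer list, say) a buffer can already be over budget when the tripwire first fires. The stamped `tokenCount` still records the measured size, so downstream budgeting stays honest.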

PDF Heading Detection and Hierarchical Chunking

With ingestion working, I started asking questions. The answers were atrocious.

The system was pulling up chunks that contained the keywords from my query, but it was grabbing lore text from expansion books and examples of play instead of the actual mechanical rules. I squeezed the retrieval pipeline, tuning k-NN (k-nearest neighbors, to find the closest conceptual matches) on the way in, and top-k (keeping only the highest-scoring subset) on the way out. The citations got better, but the answers were still wrong.

The issue was structure. TTRPG books are fiercely hierarchical. A paragraph about "Action Points" means nothing if you don't know it's under the "Combat" heading.

So, I wrote a parser to sniff out heading lines (ALL CAPS, Title Case, single capitalized words, and markdown # headers) and split the page into segments. Every resulting chunk inherited the sectionHeading of its parent. That moved the needle.
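
A minimal sketch of that kind of heading sniffing, assuming simple line-based heuristics. The 80-character cap, the trailing-punctuation rule, the list of minor words, and the function names are all assumptions, not the parser from the repo.

```typescript
// Hypothetical heuristics; thresholds and word lists are assumptions.
const MAX_HEADING_LENGTH = 80;
const MINOR_WORDS = ["a", "an", "and", "of", "the", "to", "in", "on", "for", "or"];

function isHeadingLine(line: string): boolean {
  const text = line.trim();
  if (text.length === 0 || text.length > MAX_HEADING_LENGTH) return false;
  if (/^#{1,6}\s+\S/.test(text)) return true;   // markdown # header
  if (/[.!?,;:]$/.test(text)) return false;      // ends like a sentence
  if (/[A-Za-z]/.test(text) && text === text.toUpperCase()) return true; // ALL CAPS
  const words = text.split(/\s+/);
  if (words.length === 1) return /^[A-Z][a-z]+$/.test(text); // single capitalized word
  // Title Case: every word capitalized, allowing minor words after the first.
  return words.every(
    (w, i) => /^[A-Z]/.test(w) || (i > 0 && MINOR_WORDS.indexOf(w) !== -1),
  );
}

interface Segment {
  sectionHeading: string | null;
  lines: string[];
}

// Splits one page into segments; text before the first heading keeps the
// heading inherited from the previous page, if any.
function splitPageIntoSegments(
  pageText: string,
  inheritedHeading: string | null = null,
): Segment[] {
  const segments: Segment[] = [];
  let current: Segment = { sectionHeading: inheritedHeading, lines: [] };
  for (const rawLine of pageText.split("\n")) {
    const line = rawLine.trim();
    if (isHeadingLine(line)) {
      if (current.lines.length > 0) segments.push(current);
      current = { sectionHeading: line.replace(/^#{1,6}\s+/, ""), lines: [] };
    } else if (line.length > 0) {
      current.lines.push(line);
    }
  }
  if (current.lines.length > 0) segments.push(current);
  return segments;
}
```

Every chunk cut from a segment can then copy that segment's `sectionHeading`, which is what lets a paragraph about "Action Points" carry its "Combat" context along with it.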

Graph-Based RAG and Automated Document Reconciliation

I live in a personal knowledge management (PKM) system, and I rely heavily on tags to understand connections. I wanted the pipeline to do the same: automated document reconciliation with ground truth verification.

I built a pass where consecutive chunks from the same section within the same document are grouped into a zettel, and the LLM tags that group with:

concepts (up to 24 mechanical noun phrases)
definesConcepts (up to 12 concepts this section primarily lays down rules for)
referencesConcepts (up to 12 concepts cited, but defined elsewhere)
seeAlsoTitles (up to 8 other headings it likely cross-references)
confidence (how reliable the LLM thinks its own tagging is)
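
To make the shape of that pass concrete, here is a hedged sketch of the grouping step and the tag record. The field names come from the list above; the type names, the numeric `confidence`, and the idea of clamping the LLM's output to the stated caps are assumptions.

```typescript
// Hypothetical types; only the tag field names and caps come from the article.
interface SectionChunk {
  documentId: string;
  sectionHeading: string | null;
  text: string;
}

interface ZettelTags {
  concepts: string[];
  definesConcepts: string[];
  referencesConcepts: string[];
  seeAlsoTitles: string[];
  confidence: number; // assumption: a score between 0 and 1
}

// Groups consecutive chunks that share a document and a section heading.
function groupIntoZettels(chunks: SectionChunk[]): SectionChunk[][] {
  const groups: SectionChunk[][] = [];
  for (const chunk of chunks) {
    const last = groups[groups.length - 1];
    if (
      last !== undefined &&
      last[0].documentId === chunk.documentId &&
      last[0].sectionHeading === chunk.sectionHeading
    ) {
      last.push(chunk);
    } else {
      groups.push([chunk]);
    }
  }
  return groups;
}

// Enforces the per-field caps in case the model returns more than asked for.
function clampTags(raw: ZettelTags): ZettelTags {
  return {
    concepts: raw.concepts.slice(0, 24),
    definesConcepts: raw.definesConcepts.slice(0, 12),
    referencesConcepts: raw.referencesConcepts.slice(0, 12),
    seeAlsoTitles: raw.seeAlsoTitles.slice(0, 8),
    confidence: Math.min(1, Math.max(0, raw.confidence)),
  };
}
```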

Of course, LLMs lie. To keep it honest, I added a sanityFilterDefines step: if the lowercased concept phrase doesn't appear as a substring in the chunk's synopsis, it gets dropped. No hallucinated definitions allowed.
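
The filter itself is tiny. A sketch, assuming the synopsis is lowercased too (the article only says the concept phrase is) and that blank phrases are dropped:

```typescript
// Sketch of the substring check; the signature is an assumption.
function sanityFilterDefines(definesConcepts: string[], synopsis: string): string[] {
  const haystack = synopsis.toLowerCase();
  return definesConcepts.filter((concept) => {
    const needle = concept.trim().toLowerCase();
    // Keep a "defines" claim only if the phrase literally appears in the synopsis.
    return needle.length > 0 && haystack.indexOf(needle) !== -1;
  });
}
```

It is deliberately blunt: a correct but paraphrased concept gets dropped too, which trades some recall for never storing a definition the text does not support.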

From there, I built a knowledge graph by drawing weighted edges between zettels. Next/previous section edges got a weight of 1.0. If zettel A defines a concept and zettel B references it, they got an edge weighted 0.85. see_also edges (driven by the seeAlsoTitles from the LLM enrichment) got a weaker weight of 0.5.

Why does this matter? At query time, retrieving one highly relevant chunk allows the pipeline to pull in its graph neighbors, letting the system naturally traverse from a vague rule to the core mechanic it depends on without needing another LLM call.
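
A sketch of how those edges and the query-time neighbor expansion might look. The three weights are the ones above; the type names, exact-match concept comparison, case-insensitive title matching, and the undirected traversal are assumptions.

```typescript
type EdgeKind = "sequence" | "defines_references" | "see_also";

interface GraphEdge {
  from: string;
  to: string;
  kind: EdgeKind;
  weight: number;
}

// Hypothetical node shape, carrying only the fields the edges need.
interface ZettelNode {
  id: string;
  title: string;
  definesConcepts: string[];
  referencesConcepts: string[];
  seeAlsoTitles: string[];
}

const EDGE_WEIGHTS: Record<EdgeKind, number> = {
  sequence: 1.0,
  defines_references: 0.85,
  see_also: 0.5,
};

// Assumes `zettels` holds one document's zettels in reading order.
function buildEdges(zettels: ZettelNode[]): GraphEdge[] {
  const edges: GraphEdge[] = [];
  const add = (from: string, to: string, kind: EdgeKind): void => {
    edges.push({ from, to, kind, weight: EDGE_WEIGHTS[kind] });
  };
  zettels.forEach((z, i) => {
    if (i > 0) add(zettels[i - 1].id, z.id, "sequence");
    for (const other of zettels) {
      if (other.id === z.id) continue;
      const citesDefinition = z.referencesConcepts.some(
        (c) => other.definesConcepts.indexOf(c) !== -1,
      );
      if (citesDefinition) add(z.id, other.id, "defines_references");
      const namesTitle = z.seeAlsoTitles.some(
        (t) => t.toLowerCase() === other.title.toLowerCase(),
      );
      if (namesTitle) add(z.id, other.id, "see_also");
    }
  });
  return edges;
}

// Treats edges as undirected and keeps the strongest weight per neighbor.
function expandNeighbors(
  seedId: string,
  edges: GraphEdge[],
  minWeight: number,
  limit: number,
): string[] {
  const best: { id: string; weight: number }[] = [];
  for (const e of edges) {
    if (e.weight < minWeight) continue;
    const neighbor = e.from === seedId ? e.to : e.to === seedId ? e.from : null;
    if (neighbor === null || neighbor === seedId) continue;
    const seen = best.filter((b) => b.id === neighbor)[0];
    if (seen) seen.weight = Math.max(seen.weight, e.weight);
    else best.push({ id: neighbor, weight: e.weight });
  }
  return best
    .sort((a, b) => b.weight - a.weight)
    .slice(0, limit)
    .map((b) => b.id);
}
```

Because expansion is a lookup over precomputed edges, pulling in the zettel that defines a concept costs no extra model call at query time.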

I also added a Map of Content (MOC). This actually came about because I was uploading a lot of sourcebooks and errata. The system needs to know when data is meant to enhance the root rules, be the root rules, or replace the root rules. For every document, the LLM generates a summary of what the whole PDF covers, rolls up the top 40 concepts by frequency, and builds an authoritativeFor list. This makes the MOC a powerful retrieval artifact, and the MOC's scope summary is also embedded as a standalone vector. Every piece of structure I added resulted in more correct answers.
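
The frequency roll-up is the one mechanical part of the MOC, so here is a hedged sketch of it. The 40-concept cap comes from the text above; the `MapOfContent` shape, the case-insensitive counting, and the alphabetical tie-break are assumptions.

```typescript
// Hypothetical record shape for a per-document MOC.
interface MapOfContent {
  documentId: string;
  scopeSummary: string;      // LLM-written summary, also embedded as a vector
  topConcepts: string[];     // rolled up from the zettel tags
  authoritativeFor: string[];
}

// Counts every tagged occurrence of a concept across a document's zettels.
function topConceptsByFrequency(
  zettelConcepts: string[][],
  limit: number = 40,
): string[] {
  const counts = new Map<string, number>();
  for (const concepts of zettelConcepts) {
    for (const concept of concepts) {
      const key = concept.trim().toLowerCase();
      if (key.length === 0) continue;
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  const entries: [string, number][] = [];
  counts.forEach((count, key) => {
    entries.push([key, count]);
  });
  // Most frequent first; ties broken alphabetically for stable output.
  entries.sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
  return entries.slice(0, limit).map((e) => e[0]);
}
```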

Back-of-Book Index Hinting and Citation Pruning

Some questions were still slipping through: the core rule's wording was too vague for retrieval to match, even though the rule was sitting on exactly the page the book's index pointed to.

So, I wrote a quick pass to scan the end of the PDF looking for a traditional back-of-book index. I bounded the scan to the last 22% of the document (with a floor of 14 pages and a ceiling of 95) because processing the whole book looking for an index is a massive waste of cycles, and no TTRPG index I've found falls outside those margins. When it finds one, it dumps the mappings into index-hints.json. At query time, if your term matches an index entry (say, "grappling: p.195"), any chunk on page 195 gets a massive boost before the embedding similarity even kicks in.
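
The scan bounds and the page boost can be sketched as below. The 22% window, the 14 and 95 page limits, and the "grappling: p.195" example are from the text above; the line-parsing regex, the in-memory shape of the hints, and the 0.5 boost value are assumptions, since the article does not show the contents of index-hints.json.

```typescript
// First page (1-based) of the region scanned for a back-of-book index.
function indexScanStartPage(totalPages: number): number {
  const share = Math.ceil((totalPages * 22) / 100);      // last 22% of the book
  const windowPages = Math.min(95, Math.max(14, share)); // floor 14, ceiling 95
  return Math.max(1, totalPages - Math.min(windowPages, totalPages) + 1);
}

// Parses lines such as "grappling: p.195" or "Action Points, 42, 118".
// The accepted formats are an assumption; real indexes vary a lot.
function parseIndexLine(line: string): { term: string; pages: number[] } | null {
  const match = /^(\D+?)[\s.,:]*(?:p\.\s*)?(\d[\d\s,]*)$/.exec(line.trim());
  if (match === null) return null;
  const numbers: string[] = match[2].match(/\d+/g) || [];
  return { term: match[1].trim().toLowerCase(), pages: numbers.map(Number) };
}

// Hypothetical in-memory form of the hints: lowercased term -> page numbers.
type IndexHints = Record<string, number[]>;

// Returns a flat score bonus when the query mentions an indexed term and
// the chunk sits on one of that term's pages.
function indexBoost(
  query: string,
  chunkPage: number,
  hints: IndexHints,
  boost: number = 0.5, // assumption: the article only says "massive"
): number {
  const q = query.toLowerCase();
  const hit = Object.keys(hints).some(
    (term) => q.indexOf(term) !== -1 && hints[term].indexOf(chunkPage) !== -1,
  );
  return hit ? boost : 0;
}
```

For a 300-page book this scans from page 235 onward; a 40-page supplement hits the 14-page floor and starts at page 27, and a 600-page tome hits the 95-page ceiling and starts at page 506.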

I still had one more glaring issue to fix before the system was actually usable at the table. The backend logs showed the system had the right context in the larger grouping of chunks it read, but when it came time to show the top citations to the user, the real answer's chunk kept getting outranked by tangentially related lore.

The fix was twofold:

Citation Pruning (alignCitationsToAnswer): After the LLM spits out an answer, I re-score every cited chunk based on lexical overlap (unigram, bigram, and stacking rule phrase detection) with the final answer text. If a citation scores below 42% of the top match, I drop it. I never add, I only prune.
Snippet Focusing (focusCitationSnippetForDisplay): No one wants to read an 800-token chunk to verify a rule. Even for citations that survive pruning, the system slides a 460-character window over the chunk to find the segment that best aligns with the generated answer, and only shows that slice.
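
Both steps can be sketched with one shared overlap score. The two function names, the 42% cutoff, and the 460-character window are from the descriptions above; the scoring formula (unigram share plus double-weighted bigram share), the 40-character step, and the omission of the rule-phrase detection are simplifying assumptions.

```typescript
interface Citation {
  id: string;
  text: string;
}

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

function ngrams(tokens: string[], n: number): string[] {
  const out: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// Fraction of the answer's n-grams that also occur in the candidate text.
function sharedFraction(answerGrams: string[], candidateGrams: string[]): number {
  if (answerGrams.length === 0) return 0;
  const candidate = new Set(candidateGrams);
  return answerGrams.filter((g) => candidate.has(g)).length / answerGrams.length;
}

function overlapScore(answer: string, candidate: string): number {
  const a = tokenize(answer);
  const c = tokenize(candidate);
  const unigram = sharedFraction(a, c);
  const bigram = sharedFraction(ngrams(a, 2), ngrams(c, 2));
  return unigram + 2 * bigram; // assumption: bigram hits count double
}

// Prune only: drops citations scoring below keepRatio of the best one.
function alignCitationsToAnswer(
  answer: string,
  citations: Citation[],
  keepRatio: number = 0.42,
): Citation[] {
  const scored = citations.map((c) => ({ c, score: overlapScore(answer, c.text) }));
  const top = scored.reduce((max, s) => Math.max(max, s.score), 0);
  return scored.filter((s) => s.score >= keepRatio * top).map((s) => s.c);
}

// Slides a fixed-size character window and returns the best-aligned slice.
function focusCitationSnippetForDisplay(
  answer: string,
  chunkText: string,
  windowChars: number = 460,
  stepChars: number = 40,
): string {
  if (chunkText.length <= windowChars) return chunkText;
  let best = chunkText.slice(0, windowChars);
  let bestScore = overlapScore(answer, best);
  for (let start = stepChars; start < chunkText.length; start += stepChars) {
    const slice = chunkText.slice(start, start + windowChars);
    const score = overlapScore(answer, slice);
    if (score > bestScore) {
      best = slice;
      bestScore = score;
    }
  }
  return best;
}
```

Scoring relative to the top match, rather than against a fixed threshold, means the cutoff adapts to how closely the model paraphrased its sources in any given answer.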

The result surfaces the exact, relevant rule text when the answer is there, and it's remarkably good at saying "I don't know" when it isn't. Even when it doesn't know the answer, it gives decently relevant citations for you to check. It's not perfect, and it still gets things wrong occasionally.

Eventually, I hit a performance wall on local hardware. The 2019 Mac was not keeping up: ingestion took 15 to 20 minutes to chew through a single PDF, and query times were crawling. Even so, running locally let me validate the retrieval architecture before committing to cloud infrastructure costs.

Ultimately, the local prototype did its job. It proved the core thesis: if you take the time to organize the messy reality of your data first, you don't need the biggest model in the world to find the right answer.