When Your AI Summariser Flips "Revenue Up 20%" to "Revenue Down 20%"

By Cael Lee | 9 min read

Last year, a client in financial research nearly gave me a heart attack. They were using a RAG system to auto-generate research summaries, and it confidently reported "revenue declined 20%" when the actual figure was a 20% increase. Think about that for a second—these summaries go to investment firms making million-dollar decisions. One word wrong, and you're looking at lawsuits, not just embarrassment.

We caught it early. Thank god.

But here's the thing that stuck with me: when multi-hop reasoning fails, it's not just about accuracy metrics dropping a few points. It's about systems making claims that are precisely, dangerously wrong.

I've spent the better part of a year wrestling with this problem. What follows is what actually worked, what didn't, and what still keeps me up at night.

First, Let's Understand Where Things Go Wrong

Standard RAG summarisation follows a dead-simple logic: chunk the document, retrieve the most relevant bits via vector search, feed them to an LLM, get a summary. Works fine for single-hop facts like "the company's 2023 revenue was £5 billion." The retrieval hits, the LLM paraphrases, job done.
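To make "flat" retrieval concrete, here is a minimal sketch of the top-k step, assuming chunks have already been embedded. The three-dimensional vectors, the chunk ids, and the `flat_top_k` name are all made up for illustration; a real system would use an embedding model and a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flat_top_k(query_vec, chunks, k=3):
    """chunks: list of (chunk_id, vector). Every chunk is scored on its own,
    with no knowledge of which other chunks it depends on."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy embeddings (hypothetical values, not real model output)
chunks = [
    ("p3-baseline", [0.9, 0.1, 0.0]),
    ("p7-post", [0.8, 0.2, 0.1]),
    ("p12-stats", [0.1, 0.9, 0.2]),
    ("p25-interpretation", [0.0, 0.2, 0.9]),
]
top = flat_top_k([1.0, 0.0, 0.0], chunks, k=2)
print(top)
```

Nothing in this step knows that one chunk only makes sense alongside another, which is where multi-hop questions start to fail.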

Real documents aren't that cooperative.

I ran a test on a 30-page medical research report. One conclusion required information spread across five different sections: baseline measurements on page 3, post-intervention data on page 7, statistical tests on page 12, subgroup analysis on page 18, and clinical interpretation on page 25. Standard RAG? It mashed together data from two different subgroups and generated the exact opposite conclusion.

The numbers: across 100 medical reports, factual accuracy came in at 67.3%. Multi-hop reasoning errors accounted for 58% of all failures. More than half the hallucinations weren't the model making things up—they were the model failing to connect information scattered across the document.

The root cause is painfully obvious once you see it. Traditional RAG retrieval is flat—each chunk treated as an independent unit. But real documents are hierarchical, with cross-references, implicit dependencies, and logical threads that weave through sections. You can't expect an LLM to magically piece these fragments together at generation time.

You just can't.

What I Tried, and What Actually Worked

Approach 1: Document Structure Graphs + Multi-Hop Retrieval

My first attempt borrowed from Microsoft's GraphRAG concept. But honestly? GraphRAG is overkill—the build cost is astronomical, and small teams can't justify it. I built a simplified version that constructs a document structure graph during indexing.

Here's the approach:

  1. Preserve the document hierarchy during parsing—chapters, sections, paragraph numbering, all of it
  2. Identify explicit cross-references between paragraphs: phrases like "as discussed above," "see Table 3," "based on the conclusion in Section 2.1"
  3. Store these reference relationships as graph edges, with paragraphs as nodes

Retrieval then happens in two phases: vector search finds seed nodes, then we traverse the graph to pull in predecessors and successors along the reference edges.
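The two phases can be sketched roughly as follows. This is a simplified illustration, not the production code: it only recognises explicit "Section X.Y" mentions (phrases like "as discussed above" or "see Table 3" would need their own rules), and the function names and sample paragraphs are assumptions.

```python
import re
from collections import defaultdict

# Assumption: explicit references look like "Section 2.1" or "section 3"
SECTION_REF = re.compile(r"\b[Ss]ection\s+(\d+(?:\.\d+)*)")

def build_reference_graph(paragraphs):
    """paragraphs: dict of section id -> text.
    Returns undirected adjacency sets so both predecessors and successors
    are reachable from any node."""
    edges = defaultdict(set)
    for src, text in paragraphs.items():
        for target in SECTION_REF.findall(text):
            if target in paragraphs and target != src:
                edges[src].add(target)
                edges[target].add(src)
    return edges

def expand_seeds(seeds, edges, max_hops=1):
    """Phase two: start from the vector-search seeds and follow reference edges."""
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in edges.get(node, ())} - selected
        selected |= frontier
    return selected

paragraphs = {
    "2.1": "Baseline systolic pressure was recorded before treatment.",
    "3.2": "Post-intervention values fell relative to the baseline in Section 2.1.",
    "4.1": "The difference reported in Section 3.2 was statistically significant.",
    "5.0": "Unrelated appendix material.",
}
graph = build_reference_graph(paragraphs)
one_hop = expand_seeds({"4.1"}, graph, max_hops=1)
two_hops = expand_seeds({"4.1"}, graph, max_hops=2)
print(sorted(one_hop), sorted(two_hops))
```

Capping `max_hops` keeps the expansion from swallowing the whole document when a section is referenced from everywhere.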

The results? On the same 100 medical reports, factual accuracy jumped from 67.3% to 81.5%, and the share of failures caused by multi-hop errors dropped from 58% to 31%.

But there's a glaring weakness. This approach assumes documents have explicit reference structures. When I tested it on internal due diligence reports from a corporate client, the whole thing fell apart. Those reports rarely say "see Section X"—they rely on implicit logical flow. The graph barely got built.

That was frustrating.

Approach 2: Query Decomposition + Iterative Verification

This one's interesting. I stole the idea from how Pieter Levels built Nomad List—don't try to solve everything in one shot. Break it into small, verifiable steps.

The flow works like this:

  1. When a summarisation request comes in, a lightweight LLM decomposes the complex query into atomic sub-queries
  2. Each sub-query retrieves independently and generates its own answer fragment
  3. A verification step checks for logical contradictions between fragments
  4. If contradictions surface, backtrack to retrieve more context and regenerate
  5. Finally, a main LLM stitches the verified fragments into a coherent summary

For example, if a user asks "compare Product A and Product B's market performance in 2023," the system automatically decomposes the request into atomic sub-queries, each of which can be retrieved and answered on its own.

The upside? No dependency on document structure. The downside? The call count explodes. A complex summary might trigger 15-20 LLM calls, pushing latency past 30 seconds. Users won't wait that long.
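The control flow of the decompose, answer, verify, and backtrack loop can be sketched like this. Every model call is passed in as a plain function, so the names `decompose`, `answer`, `contradicts`, and `stitch` are placeholders for whatever LLM or retrieval calls you use, and the stubs at the bottom are toy stand-ins rather than real decomposition output.

```python
from typing import Callable, List

def summarise_with_verification(
    query: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str, int], str],
    contradicts: Callable[[List[str]], bool],
    stitch: Callable[[List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Answer each sub-query, check the fragments against each other, and
    retry with more context until they agree or the round budget runs out."""
    sub_queries = decompose(query)
    for round_no in range(max_rounds):
        # round_no lets the answer step widen its retrieval on each retry
        fragments = [answer(sq, round_no) for sq in sub_queries]
        if not contradicts(fragments):
            return stitch(fragments)
    raise RuntimeError("fragments still contradict after max_rounds")

# Toy stubs: the first round "contradicts", the second does not
calls = []

def fake_answer(sub_query, round_no):
    calls.append((sub_query, round_no))
    return f"{sub_query} answered with context level {round_no}"

result = summarise_with_verification(
    "toy query",
    decompose=lambda q: ["sub-query 1", "sub-query 2"],
    answer=fake_answer,
    contradicts=lambda frags: any("level 0" in f for f in frags),
    stitch=lambda frags: " | ".join(frags),
)
print(result, len(calls))
```

Each retry re-answers every sub-query, so the call count grows as sub-queries times rounds. A cheaper variant would re-answer only the fragments involved in the contradiction.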

A note on cost. I initially used GPT-4 for query decomposition, and it was painful: about $12-15 per 100 decompositions. I later switched to a Qwen-2.5-7B fine-tuned specifically for this task. Cost dropped to roughly 1/20th, and decomposition quality actually improved. Because the fine-tuning data consisted entirely of multi-hop reasoning scenarios, the model learned to recognise query patterns that require cross-paragraph retrieval. Qwen-2.5 launched in September 2024; I started using it around late October, and it's been solid for nearly six months now.

Approach 3: Fact Chain Tracing

This is what I'm using now, and it's the approach I'm most satisfied with.

The core idea is simple: make summaries carry their evidence chains from the moment of generation, rather than verifying after the fact.

Here's the detailed flow:

  1. During retrieval, don't set a top-k limit. Instead, set a similarity threshold and pull in every chunk above it—typically 40-80 chunks
  2. Use a dedicated "evidence linker" model to identify logical connections between information fragments across these chunks, threading them into evidence chains
  3. Each evidence chain corresponds to one factual claim in the summary
  4. When the LLM generates the summary, require it to annotate each factual sentence with the evidence chain ID it depends on
  5. In post-processing, do exact-match verification between annotated evidence chains and the source text
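Steps 1 and 5 are the easiest to show in code. The sketch below assumes a `[chain:ID]` tag format for the annotations, a similarity threshold of 0.35, and chains stored as lists of verbatim quotes. All three are illustrative choices, not the actual format or values used in the system described here, and the evidence linker itself (step 2) is a trained model that is not reproduced.

```python
import re

def retrieve_above_threshold(scored_chunks, threshold=0.35):
    """Step 1: keep every chunk at or above the threshold, with no top-k cut-off.
    scored_chunks: list of (chunk_id, similarity)."""
    return [cid for cid, score in scored_chunks if score >= threshold]

# Assumed annotation format: "... sentence text [chain:c1]."
CHAIN_TAG = re.compile(r"\[chain:(\w+)\]")

def verify_summary(summary_sentences, chains, source_text):
    """Step 5: every sentence must cite a known chain, and every quote in
    that chain must appear verbatim in the source.
    Returns a list of (sentence, reason) for sentences that fail."""
    failures = []
    for sentence in summary_sentences:
        ids = CHAIN_TAG.findall(sentence)
        if not ids:
            failures.append((sentence, "no evidence chain cited"))
            continue
        for chain_id in ids:
            quotes = chains.get(chain_id)
            if quotes is None:
                failures.append((sentence, f"unknown chain {chain_id}"))
            elif not all(q in source_text for q in quotes):
                failures.append((sentence, f"chain {chain_id} not found verbatim in source"))
    return failures

# Toy data for illustration only
kept = retrieve_above_threshold([("a", 0.8), ("b", 0.2), ("c", 0.35)])
source_text = (
    "Low-dose groups reported mild adverse reactions. "
    "Three serious adverse events occurred in the high-dose group."
)
chains = {
    "c1": ["mild adverse reactions"],
    "c2": ["Three serious adverse events", "high-dose group"],
}
sentences = [
    "Low-dose side effects were mild [chain:c1].",
    "The high-dose group had three serious events [chain:c2].",
    "The drug is safe for all patients.",
]
failures = verify_summary(sentences, chains, source_text)
print(kept, failures)
```

The exact-match check only proves that the cited evidence exists in the source. It does not prove that the sentence is a fair reading of that evidence, so it catches fabricated citations rather than bad inferences.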

The key numbers: on my medical report test set, factual accuracy hit 93.7%, with multi-hop error rates below 9%.

But it's not free. The evidence linker needs dedicated training. I used sentence-transformers as the base and fine-tuned on 5,000 manually annotated multi-hop reasoning chains. The annotation cost about £2,200 through a labelling platform—I won't name which one, don't want this to sound like an ad. Training was cheap though: one A100 for two hours, PyTorch 2.1.2, transformers 4.36.2.

Let me give you a real example. A report discussing a drug's side effects had critical information scattered across four locations: the adverse reaction summary table on page 8, serious adverse event descriptions on page 12, dose-dependency analysis on page 15, and benefit-risk assessment on page 20. Standard RAG only pulled from page 8's summary table and concluded "side effects are mild and manageable." The fact chain system correctly linked all four locations, producing: "Low-dose groups showed mild side effects, but the high-dose group experienced three serious adverse events, warranting attention to dose-dependent risk." That's the right answer.

The Stuff That Bites You in Production

The approaches sound great in theory. In practice? Loads of headaches.

First headache: chunk strategy matters more than your model. I spent months tweaking model parameters and doing elaborate prompt engineering. Then I discovered that changing chunk size from 512 tokens to 256 tokens with 50% overlap boosted factual accuracy by six points. Six points! The reason's straightforward: smaller chunks make retrieval more precise and reduce noise during evidence chain construction. I realised this around June 2024, after way too many detours.
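For reference, a sliding-window chunker with those settings looks roughly like this. It is a minimal sketch that works on any pre-tokenised list; which tokenizer produces the tokens is left open, and the `chunk_tokens` name is just for illustration.

```python
def chunk_tokens(tokens, size=256, overlap=0.5):
    """Split a token list into windows of `size` tokens, where each window
    shares `overlap` (a fraction of size) with the previous one."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = list(range(600))  # stand-in for real token ids
chunks = chunk_tokens(tokens, size=256, overlap=0.5)
print(len(chunks), [len(c) for c in chunks])
```

With 50% overlap every token away from the document edges lands in two chunks, so the index roughly doubles in size. That is the price of the extra retrieval precision.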

Second headache: don't skimp on document preprocessing. Lots of long documents come as PDFs, and parsing them produces garbled paragraph ordering with table data completely lost. I now force all documents through a layout analysis model first—I use Marker, an open-source tool, version 0.2.3—to properly extract tables, figure captions, and footnotes. This step's impact on multi-hop reasoning is massive. I had a client whose PDF had headers and footers mixed into the body text, so retrieved chunks were full of "Page X — Confidential" noise. Evidence chain construction completely broke down.
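If header and footer noise still leaks through after layout analysis, a cheap fallback is to drop lines that repeat on most pages. The sketch below is a generic heuristic, not something Marker does, and the `min_fraction` cut-off is an assumed value you would tune per corpus.

```python
import re
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """pages: list of page texts. Drops lines that recur on at least
    min_fraction of pages once digits are masked (so "Page 1" and
    "Page 2" count as the same line)."""
    def key(line):
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each distinct line once per page
        counts.update({key(line) for line in page.splitlines() if line.strip()})

    cutoff = max(2, int(len(pages) * min_fraction))
    noisy = {k for k, n in counts.items() if n >= cutoff}
    return [
        "\n".join(line for line in page.splitlines() if key(line) not in noisy)
        for page in pages
    ]

pages = [
    "Page 1 - Confidential\nBaseline data.",
    "Page 2 - Confidential\nPost-intervention data.",
    "Page 3 - Confidential\nStatistical tests.",
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Masking digits is blunt: a body line that legitimately repeats across pages, such as a recurring table header, would be dropped too, so spot-check the removed lines before trusting it.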

Third headache: build your own evaluation metrics. Generic RAG evaluation frameworks like RAGAS have poor coverage for multi-hop reasoning scenarios. I built my own evaluation set with queries specifically constructed to require 2-hop, 3-hop, and 4-hop reasoning. Each query has a manually written gold answer and a list of must-cover factual points. Now every time I change the system, I run this set. Way more reliable than reading paper benchmarks. The set has about 300 queries now—I've been building it for over a year and keep adding to it.
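A harness for this kind of evaluation set can be very small. The sketch below assumes each item is a dict with `query`, `hops`, and `must_cover` keys and scores coverage by case-insensitive substring match. Both the schema and the matching rule are illustrative simplifications; substring matching in particular misses paraphrases.

```python
from collections import defaultdict

def coverage(answer, must_cover):
    """Fraction of must-cover points that appear (case-insensitively) in the answer."""
    answer_lc = answer.lower()
    hit = sum(1 for point in must_cover if point.lower() in answer_lc)
    return hit / len(must_cover) if must_cover else 1.0

def score_by_hops(eval_set, system):
    """Average coverage per hop count. `system` maps a query string to a summary."""
    totals = defaultdict(list)
    for item in eval_set:
        totals[item["hops"]].append(coverage(system(item["query"]), item["must_cover"]))
    return {hops: sum(v) / len(v) for hops, v in sorted(totals.items())}

# Toy evaluation items and a canned "system" for illustration
eval_set = [
    {"query": "q1", "hops": 2, "must_cover": ["mild", "low-dose"]},
    {"query": "q2", "hops": 3,
     "must_cover": ["three serious adverse events", "high-dose", "dose-dependent"]},
]
canned = {
    "q1": "Low-dose groups showed mild side effects.",
    "q2": "The high-dose group had problems.",
}
scores = score_by_hops(eval_set, lambda q: canned[q])
print(scores)
```

Reporting scores per hop count rather than as one average makes regressions visible: a change that helps 2-hop queries can quietly hurt 4-hop ones.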

What I Still Haven't Solved

I'll be honest: cross-document multi-hop reasoning still stumps me. All my current approaches assume the information lives within a single document. But when a user asks "compare our company's R&D investment trends against competitors over the past three years," answering requires multi-hop reasoning across five annual reports. Every extra document multiplies the number of candidate evidence paths, so the difficulty climbs fast.

I'm currently experimenting with knowledge graphs for cross-document entity alignment, but the results aren't great—factual accuracy hovers around 70%. I've tried Neo4j 5.18 for graph storage and spaCy 3.7 with custom rules for entity alignment, but entity disambiguation keeps tripping me up. "Apple" in one report is a company; in another report, it might actually be the fruit. I suspect this will take quite a while to get right.

Then there's cost. The fact chain tracing approach works brilliantly, but each summary consumes 3-4x the tokens of standard RAG. For scenarios processing thousands of documents daily, that's hard to justify. I'm currently distilling the evidence linker from sentence-transformers down to a smaller model, aiming to keep inference costs within 1.5x. Halfway through the distillation process now, and the results are unstable—sometimes evidence chains break mid-way.

Funny enough, I saw someone on X last week who built something similar using LlamaIndex. Looked promising, but I haven't had time to dig into their implementation yet. If you're working on multi-hop reasoning or have better approaches, drop a comment. I'm in the middle of rebuilding this system and genuinely want to hear different perspectives.

Oh, and someone asked why I don't use LangChain. My take: LangChain has too many abstraction layers, and debugging becomes miserable. I eventually rewrote everything from scratch. That might annoy some people, but there you go.

What's your experience with multi-hop reasoning in RAG? Have you found approaches I haven't tried? Let me know in the comments.

#rag #nlp #aiengineering #machinelearning #llm

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
