GPT-5 Thinking Actually Understands Long Documents — Here's the Evidence

Last Thursday, I asked GPT-4 to review a 47-page project contract. It confidently mashed together the payment terms on page 12 with the breach-of-contract clause on page 38. Different things entirely.

Seriously.

I then ran the same contract through GPT-5's Thinking mode. Not only did it nail every cross-reference — it flagged three risk areas I'd completely missed. One of them would've cost us about £15,000 in penalties.

That's the moment I stopped treating these models as clever parrots and started wondering what's actually changed under the hood.

TL;DR for the Impatient

GPT-4 loses the plot around pages 10-15 of a long document, and cross-references turn into hallucinations
GPT-5 Thinking actually re-reads sections during reasoning, catches contradictions, and explains why something looks wrong
It's 3-5x slower and 4x more expensive — worth it for contracts and code reviews, wasteful for summaries
Tables and diagrams still trip it up (multi-modal models are the fix here)
I've tested this across 30+ documents over six weeks. The difference isn't incremental — it's architectural.

The "Memory Collapse" Problem Nobody Talks About

I've been wrangling models for document review for nearly two years now. Contracts. Technical proposals. Codebase audits. The failure patterns are so predictable I could write a bingo card.

Here's the classic: you drop in a 30-page architecture proposal. The first few pages? Spot-on. References are crisp. Then somewhere around page 15, it starts attributing conclusions from Chapter 3 to Chapter 7. Or it completely ignores a term you defined on page 2. You can see the model improvising.

Wait, I should correct myself. It's not improvising, exactly. It looks more like attention degrading over extremely long sequences. GPT-4 Turbo advertises a 128k-token context window, which sounds impressive until you realise the usable window is far smaller. Cross-paragraph reasoning across 20+ pages? Forget it. Those middle pages might as well not exist.

GPT-5 Thinking takes a different approach. It doesn't try to digest the entire document in one gulp. Instead, it loops back — re-reading sections, comparing claims, checking its own work. More on that mechanism later. Let's look at actual tests first.

Three Real-World Tests (With Receipts)

Test 1: Cross-Reference Hell in a Technical Spec

I prepared a 35-page microservices architecture document with 12 cross-references sprinkled throughout — things like "see Section 3.2 for the security strategy" or "must align with performance thresholds defined in Section 7.1." Some were legitimate. Some I deliberately broke, pointing to sections that don't exist.

Same PDF. Two models. One question: "How many cross-references in this document are incorrect?"

GPT-4's attempt: Found 3 errors. Missed 4. Falsely flagged 2 correct references as wrong. The error descriptions referenced section numbers that didn't match anything in the actual document. Classic hallucination.

GPT-5 Thinking's attempt: Found all 7. Zero false positives. And here's the interesting bit — its output included a self-audit trail: "I noticed page 8 references Section 4.3, but the document's table of contents shows Chapter 4 only contains subsections 4.1 and 4.2. This is therefore flagged as an erroneous reference."

It's not remembering the document. It's actively comparing different parts against each other. That's the difference.
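
For anyone who wants a deterministic baseline to score the models against, the check in Test 1 can be roughly sketched in a few lines. This is a minimal sketch, not how either model works internally. It assumes headings start a line with a dotted number and that references are written as "Section X.Y". The function name `findBrokenReferences` and the sample text are hypothetical.

```javascript
// Hypothetical helper: flag "Section X.Y" references that point at headings
// which do not exist. Assumes headings begin a line with a dotted number,
// so any line that merely starts with a number is also treated as a heading.
function findBrokenReferences(text) {
  const lines = text.split('\n');

  // Pass 1: collect every section number that appears as a heading.
  const defined = new Set();
  const headingPattern = /^(\d+(?:\.\d+)*)\s+\S/;
  for (const line of lines) {
    const match = headingPattern.exec(line.trim());
    if (match) defined.add(match[1]);
  }

  // Pass 2: compare each in-text reference against the collected headings.
  const broken = [];
  const referencePattern = /\bSection\s+(\d+(?:\.\d+)*)/g;
  lines.forEach((line, index) => {
    for (const match of line.matchAll(referencePattern)) {
      if (!defined.has(match[1])) {
        broken.push({ line: index + 1, section: match[1] });
      }
    }
  });
  return broken;
}

// Made-up sample document, mirroring the 4.1 / 4.2 / 4.3 case above.
const sample = [
  '3.2 Security strategy',
  'Tokens rotate daily.',
  '4.1 Gateway',
  'Auth follows Section 3.2.',
  '4.2 Queues',
  'Retry limits are defined in Section 4.3.',
].join('\n');

console.log(findBrokenReferences(sample));
```

On the sample, only the reference to Section 4.3 on line 6 is flagged. A script like this only catches references to sections that don't exist. It can't tell when a reference points at a real section that says something different, which is where the model comparison gets interesting.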

Test 2: Spotting Character Contradictions in Long Dialogue

This one's a bit unusual. I wrote a 3,000-word fictional meeting transcript — six stakeholders debating whether to retire a legacy system. On page 2, Character B says (and I quote): "I absolutely disagree with migration. Full stop." By page 8, the same character says: "I've supported this approach from the beginning."

I asked both models: "Is Character B's position consistent throughout the meeting?"

GPT-4: "Character B maintains a generally open attitude toward migration." Completely. Missed. It.

GPT-5 Thinking: "Character B explicitly states on page 2: 'I absolutely disagree with migration.' However, on page 8 they claim: 'I've supported this approach from the beginning.' These statements directly contradict each other. Possible explanations: 1) the transcript contains errors, or 2) Character B changed their stance without explanation. Recommend verifying against the original recording."

What struck me wasn't just the contradiction detection — it was the actionable next step. In a business context, that's the difference between catching a problem and rubber-stamping nonsense.

Test 3: Tracing a Bug Across 40+ Files

This was our own project. Node.js backend, roughly 40 modules, core business logic spread across 7 files. We'd been chasing a bug for days — under certain conditions, orders were skipping the "confirmed" status and jumping straight to "shipped."

I dumped the entire src directory into both models and asked: "Identify all possible code paths that could bypass the 'confirmed' order status."

GPT-4 found two paths. Both were single-file logic issues. It completely missed the cross-file async call chain — the actual bug.

GPT-5 Thinking took about three minutes. Yes, it's slower. But it output four possible paths, including the real culprit: a webhook callback in `paymentService.js` that directly invoked `updateStatus` from `orderService.js`, bypassing the `statusGuard` middleware entirely. The full call chain: `paymentService.confirmPayment() → webhook.onSuccess() → orderService.updateStatus('shipped')`, with a note: "statusGuard validation not applied at this point."

I stared at my screen. Took a sip of coffee.

This isn't a "language model" anymore. This is doing the job of a static analysis tool.
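
To make the bug class in Test 3 concrete, here is a stripped-down sketch of a guarded and an unguarded status update. It is not our production code. The transition table, the `pending` status, and the function names `guardedUpdateStatus` and `unguardedUpdateStatus` are all assumptions for illustration. Only `statusGuard` and the `confirmed` and `shipped` statuses come from the project described above.

```javascript
// Hypothetical transition table. Only 'confirmed' and 'shipped' appear in
// the article; 'pending' is assumed for the sake of the example.
const ALLOWED_TRANSITIONS = {
  pending: ['confirmed'],
  confirmed: ['shipped'],
  shipped: [],
};

// Stand-in for the statusGuard check: reject transitions not in the table.
function statusGuard(current, next) {
  const allowed = ALLOWED_TRANSITIONS[current] || [];
  if (!allowed.includes(next)) {
    throw new Error(`Illegal transition: ${current} -> ${next}`);
  }
}

// Guarded path: the status is validated before it is written.
function guardedUpdateStatus(order, next) {
  statusGuard(order.status, next);
  order.status = next;
  return order;
}

// Unguarded path: a direct write with no validation, analogous to the
// webhook callback calling updateStatus without passing through the guard.
function unguardedUpdateStatus(order, next) {
  order.status = next;
  return order;
}

// The direct call silently skips 'confirmed'.
const viaWebhook = unguardedUpdateStatus({ id: 1, status: 'pending' }, 'shipped');
console.log(viaWebhook.status);

// The guarded call refuses the same jump.
let rejected = false;
try {
  guardedUpdateStatus({ id: 2, status: 'pending' }, 'shipped');
} catch (error) {
  rejected = true;
}
console.log(rejected);
```

One plausible fix, under these assumptions, is to run the guard inside the update function itself rather than in middleware, so no caller can reach the write without it.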

Why This Actually Works

After observing GPT-5 Thinking's behaviour for a few weeks, I've noticed a few patterns:

It re-reads. The model produces a "thinking process" summary before the final answer. Inside that summary, you'll often see phrases like "let me re-examine page X" or "I need to compare this against the definition provided earlier." This isn't simple retrieval-augmented generation. It feels more like an internalised self-verification loop.

It's allergic to contradictions. Previous models, when faced with conflicting information, would typically pick one interpretation and quietly ignore the other. Or blend them into mush. GPT-5 Thinking shows a kind of alertness — when it detects inconsistency, it flags it rather than smoothing it over.

It trades speed for accuracy. On the same long-document tasks, GPT-5 Thinking runs 3-5x slower than GPT-4. It doesn't rush to spit out an answer. For precision-critical work, this trade-off makes sense. For quick summaries? Overkill.

But It's Not Magic

Let me be clear about the downsides, because I've been burnt by AI hype before:

It's slow. That 35-page spec? GPT-4 took about 20 seconds. GPT-5 Thinking ran for nearly two minutes. If you just need a quick summary, stick with GPT-4.
It's overly cautious. I once asked it to clean up meeting notes. It appended "this conclusion is inferred from the transcript and should be verified" to every single point. Twelve conclusions, twelve disclaimers. Exhausting.
It's still rubbish with tables and diagrams. Complex nested tables or flowcharts? Performance is barely better than GPT-4. From what I understand, this is a limitation across all current LLMs, so you'll want a multi-modal setup for that.
API costs sting. It's roughly 4x the price of GPT-4. If you're processing documents in bulk, the bill adds up fast. I spent about $340 last month just on contract reviews.
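
A quick back-of-envelope helper makes the trade-off easier to reason about before committing a batch. This is a rough sketch. The 4x cost and 3-5x latency multipliers are the figures quoted above, while the per-document baseline cost and the batch size are placeholder inputs you would replace with your own numbers. `estimateThinkingBatch` is a hypothetical name.

```javascript
// Multipliers quoted in the article; every other number is a placeholder.
const COST_MULTIPLIER = 4;
const LATENCY_MULTIPLIER_RANGE = [3, 5];

// Hypothetical estimator: scales a measured baseline by the multipliers.
// Assumes documents are processed one after another, not in parallel.
function estimateThinkingBatch({ documents, baselineCostPerDoc, baselineSecondsPerDoc }) {
  const baselineCost = documents * baselineCostPerDoc;
  const baselineSeconds = documents * baselineSecondsPerDoc;
  return {
    baselineCost,
    thinkingCost: baselineCost * COST_MULTIPLIER,
    baselineSeconds,
    thinkingSecondsRange: LATENCY_MULTIPLIER_RANGE.map(
      (multiplier) => baselineSeconds * multiplier
    ),
  };
}

const estimate = estimateThinkingBatch({
  documents: 100,            // placeholder batch size
  baselineCostPerDoc: 0.5,   // placeholder cost, not a real price
  baselineSecondsPerDoc: 20, // roughly what the 35-page spec took on GPT-4
});
console.log(estimate);
```

With these placeholder inputs, a batch costing 50 on the baseline comes out at 200, and 2,000 seconds of sequential processing stretches to somewhere between 6,000 and 10,000.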

When to Use Which

After six weeks of testing (and production use, on my actual work), here's my heuristic:

Use GPT-5 Thinking for:

Contract review, compliance checks — anything where a missed detail has real consequences
Cross-chapter or cross-file logic consistency verification
Multi-character dialogue analysis with conflicting viewpoints
Codebase-level bug hunting and call chain tracing

GPT-4 is fine for:

Quick summaries, translations, rewrites
Single-turn Q&A without cross-paragraph reasoning
Creative writing where precision isn't critical
Chatbot scenarios where response speed matters (users won't wait 2 minutes)
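
If you route requests in code, the heuristic above boils down to something like the following. Treat it as a sketch of my rule of thumb, not a recommendation engine. The task labels and the returned strings are hypothetical labels for illustration, not real API model identifiers.

```javascript
// Hypothetical task labels for the precision-critical jobs listed above.
const PRECISION_TASKS = new Set([
  'contract-review',
  'compliance-check',
  'consistency-check',
  'dialogue-analysis',
  'bug-hunt',
]);

// Returns a label, not an API model name. Interactive requests always go to
// the faster model because users won't wait a couple of minutes for a reply.
function chooseModel({ task, interactive = false }) {
  if (interactive) return 'gpt-4';
  return PRECISION_TASKS.has(task) ? 'gpt-5-thinking' : 'gpt-4';
}

console.log(chooseModel({ task: 'contract-review' }));
console.log(chooseModel({ task: 'summary' }));
console.log(chooseModel({ task: 'bug-hunt', interactive: true }));
```

Letting the interactive flag win over the task type is a judgment call. If a missed detail in a live session would be costly, you might prefer to queue the request for the slower model instead.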

The Bottom Line

Honestly? I didn't expect much from GPT-5 Thinking. "Long document understanding" has been marketing vapourware for two years. When Claude 3 launched in 2024, it was billed as the king of long-context, and in practice it was just... fine.

But after living with this model for over a month, my view has shifted. It's not solving the problem of "can a model read a whole book?" GPT-4 could already do that. It's solving the problem of "can a model actually understand what it read?"

That's the difference. And it's not subtle.

What's your experience? Have you tested GPT-5 or Claude 3.5 on long documents? Found any surprising wins or spectacular failures? I'm compiling a comparison table of long-context capabilities across models, and your real-world data would be incredibly useful. Drop a comment below.

#GPT5 #AI #LongContextNLP #CodeReview #TechEvaluation #LLMTesting

Cael Lee