GPT-5 Thinking Actually Understands Long Documents — Here's the Evidence

By Cael Lee | 7 min read

Last Thursday, I asked GPT-4 to review a 47-page project contract. It confidently mashed together the payment terms on page 12 with the breach-of-contract clause on page 38. Different things entirely.

Seriously.

I then ran the same contract through GPT-5's Thinking mode. Not only did it nail every cross-reference — it flagged three risk areas I'd completely missed. One of them would've cost us about £15,000 in penalties.

That's the moment I stopped treating these models as clever parrots and started wondering what's actually changed under the hood.

TL;DR for the Impatient

The "Memory Collapse" Problem Nobody Talks About

I've been wrangling models for document review for nearly two years now. Contracts. Technical proposals. Codebase audits. The failure patterns are so predictable I could write a bingo card.

Here's the classic: you drop in a 30-page architecture proposal. The first few pages? Spot-on. References are crisp. Then somewhere around page 15, it starts attributing conclusions from Chapter 3 to Chapter 7. Or it completely ignores a term you defined on page 2. You can see the model improvising.

Wait, I should correct myself. It's not improvising, exactly. The likelier culprit is attention degrading over very long sequences: models tend to recall the start and end of a long context better than the middle. GPT-4 Turbo advertises a 128k context window, which sounds impressive until you realise the usable window is far smaller. Cross-paragraph reasoning across 20+ pages? Forget it. Those middle pages might as well not exist.

GPT-5 Thinking takes a different approach. It doesn't try to digest the entire document in one gulp. Instead, it loops back — re-reading sections, comparing claims, checking its own work. More on that mechanism later. Let's look at actual tests first.

Three Real-World Tests (With Receipts)

Test 1: Cross-Reference Hell in a Technical Spec

I prepared a 35-page microservices architecture document with 12 cross-references sprinkled throughout — things like "see Section 3.2 for the security strategy" or "must align with performance thresholds defined in Section 7.1." Some were legitimate. Some I deliberately broke, pointing to sections that don't exist.

Same PDF. Two models. One question: "How many cross-references in this document are incorrect?"

GPT-4's attempt: Found 3 errors. Missed 4. Falsely flagged 2 correct references as wrong. The error descriptions referenced section numbers that didn't match anything in the actual document. Classic hallucination.

GPT-5 Thinking's attempt: Found all 7. Zero false positives. And here's the interesting bit — its output included a self-audit trail: "I noticed page 8 references Section 4.3, but the document's table of contents shows Chapter 4 only contains subsections 4.1 and 4.2. This is therefore flagged as an erroneous reference."

It's not remembering the document. It's actively comparing different parts against each other. That's the difference.
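For comparison, the mechanical half of this check can be done without a model at all. Here's a minimal sketch of a deterministic baseline I'd use to sanity-check either model's answer. It assumes headings start a line with a dotted number ("4.2 Rate Limits") and references read "Section 4.2"; both patterns and the `findBrokenReferences` name are my own illustration, not anything from the test document.

```javascript
// Sketch: flag cross-references that point at sections the document never defines.
// Assumption: headings begin a line with a dotted number, references say "Section X.Y".

function findBrokenReferences(text) {
  const lines = text.split("\n");

  // Pass 1: collect every section number that appears as a heading.
  const defined = new Set();
  for (const line of lines) {
    const heading = line.match(/^\s*(\d+(?:\.\d+)*)\s+\S/);
    if (heading) defined.add(heading[1]);
  }

  // Pass 2: check each "Section X.Y" mention against the collected headings.
  const broken = [];
  lines.forEach((line, index) => {
    for (const ref of line.matchAll(/\bSection\s+(\d+(?:\.\d+)*)/g)) {
      if (!defined.has(ref[1])) {
        broken.push({ line: index + 1, section: ref[1] });
      }
    }
  });
  return broken;
}

const sample = [
  "4.1 Gateway",
  "Traffic enters here; see Section 4.2 for rate limits.",
  "4.2 Rate Limits",
  "Limits must align with Section 4.3.",
].join("\n");

// Reports the reference to Section 4.3 on line 4, since only 4.1 and 4.2 exist.
console.log(findBrokenReferences(sample));
```

A script like this only catches references to sections that don't exist. It can't tell you whether a reference points at the wrong section that happens to exist, which is where the model's comparison work earns its keep.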

Test 2: Spotting Character Contradictions in Long Dialogue

This one's a bit unusual. I wrote a 3,000-word fictional meeting transcript — six stakeholders debating whether to retire a legacy system. On page 2, Character B says (and I quote): "I absolutely disagree with migration. Full stop." By page 8, the same character says: "I've supported this approach from the beginning."

I asked both models: "Is Character B's position consistent throughout the meeting?"

GPT-4: "Character B maintains a generally open attitude toward migration." Completely. Missed. It.

GPT-5 Thinking: "Character B explicitly states on page 2: 'I absolutely disagree with migration.' However, on page 8 they claim: 'I've supported this approach from the beginning.' These statements directly contradict each other. Possible explanations: 1) the transcript contains errors, or 2) Character B changed their stance without explanation. Recommend verifying against the original recording."

What struck me wasn't just the contradiction detection — it was the actionable next step. In a business context, that's the difference between catching a problem and rubber-stamping nonsense.
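If you want to verify a claim like this by hand, it helps to pull each speaker's lines out of the transcript and read them side by side. A rough sketch follows, assuming one utterance per line in a "Speaker: text" format. That format and the `statementsBySpeaker` helper are hypothetical, and the code only groups statements; judging whether two of them contradict each other is still left to a reader or a model.

```javascript
// Sketch: collect every statement per speaker so they can be compared side by side.
// Assumption: one utterance per line, written as "Speaker: text".

function statementsBySpeaker(transcript) {
  const bySpeaker = new Map();
  transcript.split("\n").forEach((line, index) => {
    const match = line.match(/^([^:]+):\s*(.+)$/);
    if (!match) return; // skip blank lines and anything without a speaker label
    const speaker = match[1].trim();
    if (!bySpeaker.has(speaker)) bySpeaker.set(speaker, []);
    bySpeaker.get(speaker).push({ line: index + 1, text: match[2].trim() });
  });
  return bySpeaker;
}

const transcript = [
  "Character A: Let's decide on the legacy system.",
  "Character B: I absolutely disagree with migration. Full stop.",
  "Character A: Noted.",
  "Character B: I've supported this approach from the beginning.",
].join("\n");

// Print Character B's statements with their line numbers for manual comparison.
for (const { line, text } of statementsBySpeaker(transcript).get("Character B")) {
  console.log(`line ${line}: ${text}`);
}
```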

Test 3: Tracing a Bug Across 40+ Files

This was our own project. Node.js backend, roughly 40 modules, core business logic spread across 7 files. We'd been chasing a bug for days — under certain conditions, orders were skipping the "confirmed" status and jumping straight to "shipped."

I dumped the entire src directory into both models and asked: "Identify all possible code paths that could bypass the 'confirmed' order status."

GPT-4 found two paths. Both were single-file logic issues. It completely missed the cross-file async call chain — the actual bug.

GPT-5 Thinking took about three minutes. Yes, it's slower. But it output four possible paths, including the real culprit: a webhook callback in `paymentService.js` that directly invoked `updateStatus` from `orderService.js`, bypassing the `statusGuard` middleware entirely. The full call chain: `paymentService.confirmPayment() → webhook.onSuccess() → orderService.updateStatus('shipped')`, with a note: "statusGuard validation not applied at this point."

I stared at my screen. Took a sip of coffee.

This isn't a "language model" anymore. This is doing the job of a static analysis tool.
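To make the bug pattern concrete, here's a stripped-down sketch of its shape and one possible fix. This is not our production code: the transition table, the "pending" status, and `updateStatusUnguarded` are hypothetical stand-ins, and only the `updateStatus` and `statusGuard` names come from the real modules. The idea is that a guard living only in middleware protects the HTTP route, not the function, so any internal caller can skip it.

```javascript
// Sketch of the bypass and one way to close it. Statuses and the transition
// table are assumptions for illustration, not the project's actual rules.

const ALLOWED_TRANSITIONS = {
  pending: ["confirmed"],
  confirmed: ["shipped"],
  shipped: [],
};

// Guard logic: is moving from `current` to `next` a legal transition?
function statusGuard(current, next) {
  return (ALLOWED_TRANSITIONS[current] || []).includes(next);
}

// Buggy shape: trusts the caller, so a direct call from a webhook skips the guard.
function updateStatusUnguarded(order, next) {
  order.status = next;
  return order;
}

// Safer shape: the guard runs inside the function every caller must go through.
function updateStatus(order, next) {
  if (!statusGuard(order.status, next)) {
    throw new Error(`Illegal transition: ${order.status} -> ${next}`);
  }
  order.status = next;
  return order;
}

const order = { id: 1, status: "pending" };

// The unguarded version lets the order jump straight to "shipped".
console.log(updateStatusUnguarded({ ...order }, "shipped").status);

// The guarded version rejects the same jump.
try {
  updateStatus({ ...order }, "shipped");
} catch (err) {
  console.log(err.message);
}
```

Moving the check into the function itself is only one option. Routing the webhook through the same guarded entry point would work too.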

Why This Actually Works

After observing GPT-5 Thinking's behaviour for a few weeks, I've noticed a few patterns:

It re-reads. The model produces a "thinking process" summary before the final answer. Inside that summary, you'll often see phrases like "let me re-examine page X" or "I need to compare this against the definition provided earlier." This isn't simple retrieval-augmented generation. It feels more like an internalised self-verification loop.

It's allergic to contradictions. Previous models, when faced with conflicting information, would typically pick one interpretation and quietly ignore the other. Or blend them into mush. GPT-5 Thinking shows a kind of alertness — when it detects inconsistency, it flags it rather than smoothing it over.

It trades speed for accuracy. On the same long-document tasks, GPT-5 Thinking runs 3-5x slower than GPT-4. It doesn't rush to spit out an answer. For precision-critical work, this trade-off makes sense. For quick summaries? Overkill.

But It's Not Magic

Let me be clear about the downsides, because I've been burnt by AI hype before:

When to Use Which

After six weeks of testing (and production use, on my actual work), here's my heuristic:

Use GPT-5 Thinking for:

GPT-4 is fine for:

The Bottom Line

Honestly? I didn't expect much from GPT-5 Thinking. "Long document understanding" has been marketing vapourware for two years. When Claude 3 launched in 2024, it was hailed as the king of long context, and in practice it was just... fine.

But after living with this model for over a month, my view has shifted. It's not solving the problem of "can a model read a whole book?" GPT-4 could already do that. It's solving the problem of "can a model actually understand what it read?"

That's the difference. And it's not subtle.

What's your experience? Have you tested GPT-5 or Claude 3.5 on long documents? Found any surprising wins or spectacular failures? I'm compiling a comparison table of long-context capabilities across models, and your real-world data would be incredibly useful. Drop a comment below.

#GPT5 #AI #LongContextNLP #CodeReview #TechEvaluation #LLMTesting

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
