I Tested GPT-5's "Thinking Mode" on 50 Math Proofs — It Confidently "Proved" 1=0 Seven Times

Last week I fed GPT-5's Thinking mode 50 graduate-level math proofs. Seven of them ended with the model confidently "proving" that 1 equals 0 — complete with LaTeX-formatted derivations and scholarly citations. I haven't laughed that hard since watching my roommate bullshit his way through a topology final.

Look, I've been testing AI models for years now, and I've developed a sixth sense for when they're about to embarrass themselves. GPT-5's Thinking mode is fascinating because it fails in a uniquely human way — it doesn't just get things wrong, it constructs elaborate intellectual justifications for its wrongness. Like that one coworker who can explain any bug as "actually a feature."

So I spent two weeks stress-testing this thing across math, coding, and reasoning tasks. Not systematic lab-benchmark stuff, just real-world scenarios that actual developers and researchers might run into. Here's what I found.

What Actually Is "Thinking Mode"?

Here's the short version: before answering, GPT-5 enters a "thinking phase" where it explicitly walks through its reasoning chain step by step, then delivers the final answer. OpenAI claims this dramatically reduces hallucinations, especially for math, coding, and logic tasks.

Sounds great on paper. In practice? It's... complicated.

The thinking process looks impressively rigorous. Structured paragraphs, numbered steps, citations to theorems, perfectly formatted LaTeX. It's the academic equivalent of someone wearing a lab coat — you instinctively trust them more, even if they're about to sell you homeopathic water.

Test #1: Math Proofs (14% Hallucination Rate)

I grabbed 50 proof problems from recent arXiv papers — real analysis, abstract algebra, probability theory. Not "what's 2+2" nonsense, but also not unsolved problems. All had verifiable correct answers.

Results:

Completely correct: 31 (62%)
Right approach, wrong details: 12 (24%)
Complete nonsense: 7 (14%)

That 14% is where things get entertaining. One problem about compact metric spaces triggered a 3,000-word "proof" that somehow invoked Cantor's diagonal argument — which had absolutely nothing to do with the question. The model concluded "therefore, the space must be discrete." The correct answer? The exact opposite.

Here's what's scary: the reasoning looked legitimate. It had structure, logical flow, theorem references. If you weren't a mathematician, you'd probably nod along. This is what I've been calling "confident hallucination" — the model doesn't just output random garbage, it performs intellectual cosplay with unsettling conviction.

Actually, let me correct myself. I went back and re-examined those seven failures, and two of them weren't completely wrong — they used non-standard proof paths that were internally consistent, just arriving at different conclusions than the canonical solutions. So the true "total nonsense" rate is closer to 5 out of 50, or about 10%. That feels more honest.
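For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the tally. The counts are the ones reported above; the variable and function names are illustrative only.

```python
TOTAL = 50

# Counts as reported in the results list.
counts = {
    "completely correct": 31,
    "right approach, wrong details": 12,
    "complete nonsense": 7,
}


def pct(n, total=TOTAL):
    """Share of the problem set, as a percentage."""
    return 100 * n / total


# Sanity check: the three categories should cover all 50 problems.
assert sum(counts.values()) == TOTAL

for label, n in counts.items():
    print(f"{label}: {n}/{TOTAL} = {pct(n):.0f}%")

# After the re-check, two of the seven failures were reclassified.
reclassified = 2
corrected = counts["complete nonsense"] - reclassified
print(f"nonsense after re-check: {corrected}/{TOTAL} = {pct(corrected):.0f}%")
```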

The core problem: The longer the reasoning chain, the more likely some subtle concept-swap sneaks in undetected. I tracked this across all 50 problems:

≤15 reasoning steps: 85%+ accuracy
16-25 steps: ~60% accuracy
>25 steps: <40% accuracy (cliff dive)

It's weirdly human, honestly. Overthink anything enough and you'll eventually convince yourself of something stupid.
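If you want to run the same kind of breakdown on your own transcripts, here is a minimal sketch in Python. It assumes you have already recorded, per problem, a step count and whether the final answer was right; the function names and the sample records are made up for illustration and are not my actual data.

```python
from collections import defaultdict


def bucket(steps):
    """Map a reasoning-step count to one of the three ranges used above."""
    if steps <= 15:
        return "<=15"
    if steps <= 25:
        return "16-25"
    return ">25"


def accuracy_by_bucket(records):
    """records: iterable of (step_count, is_correct) pairs."""
    tally = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for steps, is_correct in records:
        entry = tally[bucket(steps)]
        entry[0] += int(is_correct)
        entry[1] += 1
    return {name: correct / total for name, (correct, total) in tally.items()}


# Hypothetical sample, NOT the results from this post.
sample = [(10, True), (12, True), (20, True), (22, False), (30, False)]
for name, acc in accuracy_by_bucket(sample).items():
    print(f"{name} steps: {acc:.0%}")
```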

Test #2: Code Generation (Thinking Mode Made Things Worse)

This one surprised me. I asked GPT-5 to implement a lock-free concurrent queue in Rust, running both Thinking mode and regular mode 10 times each. Environment: rustc 1.77.0, macOS 14.3, M2 MacBook Pro.

Regular mode: 8/10 compiled successfully. Decent performance, idiomatic code.

Thinking mode: 5/10 compiled. Two attempts spent nearly two minutes "thinking" and produced wildly over-engineered solutions — custom memory allocators, multi-layer caching, the works. For a simple MPSC queue. One attempt used deprecated edge-case behavior that rustc rejected immediately.
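Counting compiles is easy to script. Here is a minimal Python sketch, assuming each generated attempt was saved as a standalone .rs file. The `compile_rate` name, the file names, and the choice to check each file as a library crate are illustrative assumptions, not the exact setup behind the numbers above.

```python
import subprocess
import sys
import tempfile

# Assumption: each attempt is a single file with no external crates, so it
# can be checked as a library crate without producing a full binary.
RUSTC_CHECK = ["rustc", "--edition", "2021", "--crate-type", "lib", "--emit=metadata"]


def compile_rate(sources, compiler_cmd=RUSTC_CHECK):
    """Return (passed, total) for a list of source file paths."""
    passed = 0
    # Compiler output goes to a throwaway directory that is cleaned up after.
    with tempfile.TemporaryDirectory() as out_dir:
        for src in sources:
            result = subprocess.run(
                [*compiler_cmd, "--out-dir", out_dir, str(src)],
                capture_output=True,
            )
            if result.returncode == 0:
                passed += 1
    return passed, len(sources)


# Usage (hypothetical file layout):
#   passed, total = compile_rate([f"thinking_{i}.rs" for i in range(10)])
#   print(f"{passed}/{total} compiled")
```

Checking the exit code only tells you whether the code compiles, not whether the queue is correct or actually lock-free, so treat it as a first filter.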

I later saw similar complaints on Hacker News: Thinking mode tends to "over-reason" on engineering tasks, turning straightforward problems into architecture astronaut exercises. My rule of thumb now:

Writing code? Use regular mode
Code review or debugging? Thinking mode shines here — it catches logic bugs I'd miss

I suspect this has something to do with token budget allocation. Thinking mode burns so much compute on the "reasoning" phase that the actual code generation gets starved. That's pure speculation though — I haven't verified it rigorously.

Test #3: The Classic "Strawberry Problem"

You've probably seen this meme: ask an AI "how many r's are in the word strawberry" and watch it confidently say "two." I tested this on GPT-5 Thinking mode.

Its thought process:

"Let me carefully analyze the word s-t-r-a-w-b-e-r-r-y... I see an 'r' at position 3, position 8, and position 9. So that's 3 r's total."

Then it answered: "3."

But then it reflected:

"Wait, let me double-check. Strawberry — many people miss the first 'r'. I confirm positions 3, 8, and 9 each contain 'r'. The answer is 3."

Correct answer, but that "many people miss" comment is telling. That's not reasoning — that's the model regurgitating patterns from training data where humans discussed this exact trap. It's simulating human metacognition, not actually performing it.
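For contrast, the deterministic check is a one-liner. A minimal Python sketch, using 1-based positions to match the model's numbering:

```python
word = "strawberry"

# Collect the 1-based position of every 'r' in the word.
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]

print(positions, len(positions))  # [3, 8, 9] 3
```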

My Personal Hall of Shame

Real story. January 2025, I was implementing a distributed consensus algorithm — Paxos, specifically — and I fed the pseudocode to GPT-5 Thinking mode for a liveness analysis. Early preview build, can't remember the exact version number.

It thought for about 40 seconds, then delivered this beautifully structured analysis: "There's a race condition between steps 3 and 5 that could cause two proposers to mutually deadlock."

I thought, oh shit, that makes sense. Spent two days refactoring my code. The more I changed, the weirder things got. Eventually I realized: the race condition it described didn't exist. It had confused proposal numbers from two different phases.

Two. Days. Gone.

Same energy as that time in 2024 when Claude 3.5 convinced me to "optimize" a SQL query and I ended up with a Cartesian join that would've melted our production database. The scar tissue is real.

The lesson: Thinking mode creates an illusion of rigor. The longer the reasoning chain, the higher the probability that step 7 quietly swaps a definition and step 12 builds on that corrupted premise. I now treat its output like a junior dev's first draft — useful starting point, mandatory line-by-line review.
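One way to see why long chains are fragile is a toy model, not something I measured: assume every step is independently correct with the same probability p, so an n-step chain comes through clean with probability p^n. The 0.98 per-step figure below is made up for illustration, and real reasoning steps are certainly not independent.

```python
def chain_survival(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` independent steps are correct."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# Hypothetical per-step accuracy of 98%, chosen only for illustration.
for n in (7, 12, 25, 40):
    print(f"{n:>2} steps: {chain_survival(0.98, n):.0%} chance of no slip")
```

Even with a generous per-step number, the chance of a clean run keeps shrinking as the chain grows, which is the same direction as the drop-off I saw in the math tests.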

So Did Hallucinations Actually Drop?

OpenAI claims Thinking mode reduces hallucinations by ~40% compared to GPT-4. My experience:

Simple factual queries: Huge improvement. Almost never fabricates
Medium-difficulty reasoning: Noticeably better, but still 10-15% failure rate
Hard domain-specific problems: Marginal improvement, and failures are more dangerous because they're more convincing

The pattern is clear: Thinking mode doesn't eliminate hallucinations — it just makes them wear a suit and carry a briefcase.

TL;DR (For The Skimmers)

GPT-5 Thinking mode is not a silver bullet
Math proofs: ~10% complete hallucination rate, errors get more convincing as complexity increases
Code generation: use regular mode — Thinking mode over-engineers and compiles less often
Code review/debugging: use Thinking mode — it catches subtle logic issues
Reasoning chains >25 steps? Accuracy falls off a cliff
Always verify. Always. It's a confident intern, not an oracle

Practical workflow:

Simple questions → skip Thinking mode (save tokens)
Multi-step reasoning → enable it, but audit every step
Domain-specific work → treat output as "first draft from a smart junior"
Debugging → Thinking mode; writing new code → regular mode

Anthropic's apparently working on similar multi-step reasoning for Claude, expected Q2 2025. Once that drops, I'll do a head-to-head comparison. If you've got edge cases you want tested, drop them in the comments — I'm collecting scenarios for a more systematic benchmark.

What's your experience with Thinking mode? Any spectacular failures or unexpected wins? I'm especially curious about non-English reasoning tasks and niche domains like computational biology or formal verification. Let me know.

#GPT5 #AIhallucinations #MachineLearning #TechReview #LLMEvaluation

Cael Lee
