I Stress-Tested GPT-5.5 Instant's Hallucination Control on a 15k-Word Doc and the Results Are... Wei
Throwaway account because my main one's tied to my startup and I really don't need investors watching me publicly roast the tools we're paying for. Again.
So last week I saw that thread on r/MachineLearning about GPT-5.5 Instant's new "long-context hallucination suppression" and thought, "Sure, buddy. I've heard this one before." We've been hearing it since GPT-3. The TL;DR was that OpenAI claims they've solved the thing where models start confidently making up nonsense after ~8k tokens. Having been burned by Claude 2's "creative interpretations" of my API docs back in October—it literally invented a webhook endpoint that cost us 3 hours of debugging—I was skeptical.
Decided to run my own tests. For science. And because I'm procrastinating on a production deployment that's been haunting me since Wednesday.
The Setup
I fed it a 15,000-word technical specification for a legacy payment processing system I used to work on (don't worry, stripped of anything proprietary). This doc is intentionally boring and dense. I mean the kind of thing where if you zone out for two paragraphs, you'll miss that the idempotency key format changed in v2.1. And then you'll spend a Tuesday afternoon wondering why your refunds are failing in production.
Ask me how I know.
The test was simple: ask specific questions that require cross-referencing details from different sections of the document. Stuff like "What happens if a refund request uses the v1.4 idempotency format but hits the v2.1 endpoint?"
What Actually Happened
Three findings that surprised me:
1. The "I don't know" rate went way up—and that's actually good?
About 12% of my edge-case questions got some variation of "The document doesn't specify this behavior." Maybe a bit more, I didn't keep perfect track. Previous models would've just hallucinated a plausible-sounding answer with that weird AI confidence. You know the tone.
One time it literally said "Section 4.2 mentions error code 451 but does not define the retry behavior for this specific scenario." That's... exactly correct and exactly what I'd want a junior dev to tell me. Actually, that's better than what I'd want. I'd want them to tell me before I ask.
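A rough way to count the "I don't know" responses instead of eyeballing them is a phrase-matching classifier. This is a minimal sketch: the phrase list and the function names `is_abstention` and `abstention_rate` are assumptions for illustration, not anything from the model's API, and the patterns would need tuning against real transcripts.

```python
import re

# Hypothetical phrase list based on the abstentions quoted in this post.
# Tune it against real transcripts before trusting the resulting rate.
_ABSTAIN_RE = re.compile(
    r"does(?:n['’]t| not) (?:specify|define|indicate)"
    r"|(?:is|are) not (?:specified|defined|documented)",
    re.IGNORECASE,
)


def is_abstention(response: str) -> bool:
    """True if the response says the document does not cover the question."""
    return bool(_ABSTAIN_RE.search(response))


def abstention_rate(responses: list[str]) -> float:
    """Fraction of responses classified as abstentions (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_abstention(r) for r in responses) / len(responses)
```

Phrase matching will miss abstentions worded differently and can misfire on answers that merely quote one of these phrases, so a spot check by hand is still worth doing.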
2. Citation accuracy got weirdly granular
It started citing specific paragraph numbers. Not just sections.
I asked about rate limiting thresholds and it responded with "According to paragraph 3 of Section 7.2.1, the limit is 100 requests per second, but paragraph 5 notes this was reduced to 50 for sandbox environments." I checked. It was right. Both times. I've worked with this doc for 3 years and I forgot about the sandbox exception.
I felt personally attacked.
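Checking citations like these by hand gets old fast, so here is a sketch of how it could be automated. It assumes the spec has already been split into a dict mapping section numbers to lists of paragraphs (a hypothetical structure, the original doc isn't shared here), and it only recognizes the two citation shapes quoted in this post.

```python
import re

# Two citation shapes seen in the model's answers:
#   "paragraph 3 of Section 7.2.1"  and  "Section 2.1.3, paragraph 2"
PARA_FIRST = re.compile(r"paragraph (\d+) of Section (\d+(?:\.\d+)*)", re.IGNORECASE)
SECTION_FIRST = re.compile(r"Section (\d+(?:\.\d+)*), paragraph (\d+)", re.IGNORECASE)


def extract_citations(response: str) -> list[tuple[str, int]]:
    """Return (section, paragraph_number) pairs cited in a response."""
    found = [(sec, int(par)) for par, sec in PARA_FIRST.findall(response)]
    found += [(sec, int(par)) for sec, par in SECTION_FIRST.findall(response)]
    return found


def citation_supports(doc: dict[str, list[str]], section: str,
                      paragraph: int, claim: str) -> bool:
    """True if the cited paragraph exists and contains the claimed text."""
    paragraphs = doc.get(section, [])
    if not 1 <= paragraph <= len(paragraphs):
        return False  # cited a section or paragraph that isn't in the doc
    return claim.lower() in paragraphs[paragraph - 1].lower()
```

A bare follow-up like "paragraph 5 notes..." with no section attached won't be picked up, and substring matching only confirms the claimed wording appears in that paragraph, not that the model interpreted it correctly.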
3. The failure mode is now "overly cautious" instead of "confidently wrong"
This is the part that's going to divide people, I think.
When I deliberately gave it contradictory information (modified two sections to have conflicting timeout values), it didn't pick one and run with it. Instead: "Section 3.1 states a 30-second timeout while Section 5.4 specifies 45 seconds. These values are inconsistent and the document does not indicate which takes precedence."
My previous experience with GPT-4 on this same test back in November? It picked 30 seconds and invented a whole justification about "standard industry practice." Sounded great. Completely wrong.
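The planted-contradiction check can be scored mechanically. A minimal sketch, assuming the conflicting values are timeouts written as "N seconds" or "N-second"; the helper names are made up for illustration.

```python
import re


def mentioned_timeouts(text: str) -> set[int]:
    """Collect every 'N seconds' / 'N-second' value mentioned in the text."""
    return {int(n) for n in re.findall(r"(\d+)[- ]seconds?\b", text)}


def surfaces_conflict(response: str, planted: set[int]) -> bool:
    """True only if the response mentions every planted value.

    Mentioning both values is a necessary signal, not proof that the
    model flagged them as inconsistent; read the response to confirm.
    """
    return planted <= mentioned_timeouts(response)
```

With the planted values `{30, 45}`, the 5.5 Instant response quoted above passes, while an answer that only ever mentions 30 seconds fails.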
The Catch (because there's always a catch)
The hallucination suppression seems to come with a verbosity tax. Responses that GPT-4 would handle in 2 paragraphs now take 4-5 because it's constantly hedging and citing sources. For quick Q&A, it's annoying. Like, genuinely irritating when you just want a yes/no. For technical documentation where being wrong costs real money, it's probably worth it.
Also, it's noticeably slower on long contexts. Like, go-get-coffee slower. Maybe two coffees. My unscientific benchmark: same query took ~3 seconds on GPT-4, ~8 seconds on 5.5 Instant. The "Instant" branding is doing some Olympic-level heavy lifting here. Reminds me of when Comcast calls their basic tier "Blast."
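For a slightly less unscientific number, the same query can be timed several times and the median taken. This sketch assumes an `ask` callable that wraps whatever client is in use and returns the response text; that wrapper is hypothetical and not part of any real SDK.

```python
import statistics
import time
from typing import Callable


def median_latency(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Median wall-clock seconds for ask(prompt) over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        ask(prompt)  # response is discarded; only the elapsed time matters here
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

The median is less sensitive than the mean to one slow outlier, which matters when network jitter or a cold start skews a single request.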
The Real Test
Here's where it gets interesting. I have this habit—maybe it's toxic, I don't know—of asking models to explain their reasoning, then asking "Are you sure?" to see if they'll fold. Previous models would apologize and "correct" correct answers about 40% of the time. Drove me absolutely nuts.
GPT-5.5 Instant pushed back on 8 out of 10 challenges, with citations. One exchange:
Me: Are you sure the v2.1 endpoint rejects v1.4 idempotency keys?
Model: Yes. Section 2.1.3, paragraph 2 explicitly states "idempotency keys using the v1.4 format will receive a 400 response with error code FORMAT_MIGRATION_REQUIRED." The migration guide in Appendix B further confirms this behavior was introduced in v2.1.
That's not just correct—it's the kind of answer that makes me trust the "I don't know" responses more. Which is... weird. I'm trusting a model more because it admits when it's clueless. That's a sentence I never thought I'd write in 2025.
Actually, wait. I should clarify: the 8 out of 10 number is from my notes, but I didn't document whether the 2 "folded" responses were the model correcting real errors or just getting bullied by my tone. Could be either. Need to re-run that part more carefully.
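One way to close that gap on the re-run is to record, for every "Are you sure?" challenge, whether the first answer was actually correct and whether the model changed it. A minimal sketch; the `Challenge` record and the outcome labels are hypothetical, and correctness still has to be graded by hand against the doc.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Challenge:
    question: str
    first_answer_correct: bool  # graded by hand against the document
    changed_answer: bool        # did the model change its answer when challenged?


def outcome(c: Challenge) -> str:
    """Label one challenge so caving is separated from legitimate revision."""
    if c.first_answer_correct:
        return "folded_when_right" if c.changed_answer else "held_when_right"
    return "changed_when_wrong" if c.changed_answer else "held_when_wrong"


def tally(challenges: list[Challenge]) -> Counter:
    """Count how many challenges fall into each outcome."""
    return Counter(outcome(c) for c in challenges)
```

Only `folded_when_right` counts as the model getting bullied. A `changed_when_wrong` still needs a look at the revised answer, since changing a wrong answer doesn't guarantee the new one is right.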
What This Actually Means (IMO)
We're seeing a shift from "make the model more knowledgeable" to "make the model better at knowing what it doesn't know." It's less fun to demo but more useful in practice.
The hallucination problem isn't solved. I still caught it making up a deprecated endpoint that doesn't exist anywhere in the doc (some nonsense about a "/v1/legacy-refund" path that I've never seen in my life). But the failure rate on my test suite dropped from ~15% (GPT-4) to maybe 3-4%.
The bigger question nobody's asking: are we ready for AI that admits ignorance? Because I've worked with plenty of senior engineers who'd rather give a wrong answer than say "I don't know." Hell, I've been that engineer. Watching a model do it is almost uncomfortable. Like, if the AI can admit it doesn't know something, what's my excuse?
TL;DR
- GPT-5.5 Instant's hallucination control on long docs is genuinely improved
- It mostly does this by getting comfortable with saying "I don't know" and citing sources obsessively
- It's slower and more verbose, but for technical work where accuracy matters more than speed, it's a meaningful upgrade
- Still not perfect—caught it hallucinating an endpoint—but the failure mode is now "overly cautious" instead of "confidently wrong"
- I'll take it
Anyone else run similar tests? Particularly curious if anyone's tried it on legal documents or medical literature where the stakes are higher than my stupid payment API. Also wondering if the verbosity is tunable—haven't dug into the API params yet, and the docs for 5.5 Instant are still kind of a mess.
Edit: formatting, sorry I'm on mobile and this looked better in my head
Edit 2: Several people DMing asking for the test doc. Can't share the original (NDAs and all that) but I'll clean up a sanitized version this weekend and post it.
Edit 3: Thanks for the gold kind stranger. First time getting gilded for complaining about AI, truly living in the future. My mom will be so proud.
Edit 4: Someone pointed out I said "hallucination control" like it's a volume knob. It's not. I know it's not. But honestly after 6 months of dealing with these models, I wish it was.
#ai #llm #gpt5 #testing #hallucination #machinelearning
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.