The Real Reason Your LLM Can't Output Valid JSON (And What Actually Fixes It)
The Real Reason Your LLM Can't Output Valid JSON (And What Actually Fixes It)
Someone asked me yesterday: "You've been writing tech columns for a decade—what's the question you get asked most?"
I didn't even hesitate. "How do I make the bloody model output proper JSON?"
Seriously.
This question was my personal nightmare for about two years. Back when I was generating training data, I'd get woken up at 3 AM by alerts—JSON parsing failures. I'd check the logs and find the model had helpfully prefixed its output with "Sure, here's the JSON result:" or forgotten a closing quote somewhere. json.loads() would just... die.
Picture this: it's three in the morning, you're staring at a single line of red error text, and you're mentally cursing the model into next week.
But looking back now, this whole thing's actually fascinating. It seems like a tiny problem, but it exposes the fundamental contradiction at the heart of LLMs—how does a probabilistic token sampler output strictly structured language?
Why this is genuinely hard
Before we jump to solutions, let's understand what we're actually fighting.
LLMs are autoregressive models. At each step, they can only see what's been generated so far, then predict the probability distribution for the next token. But JSON is a context-free language—its validity depends on global structure. Every { needs a matching }, every array that opens must properly close.
There's a paper that tested this with matched parentheses. GPT-2 XL had an error rate above 95% once sequences exceeded 36 characters. Even Gemma, with nearly 7.8 billion parameters, fell apart at 282 characters.
This isn't the model being stupid. It's an architectural mismatch.
Neural networks are naturally terrible at algorithmic tasks requiring precise counting and pairing. The model has to implicitly maintain something like a stack state, and that's brutally hard for a probabilistic system.
So solutions tend to fall into two camps: either you fix things outside the generation process, or you intervene during it.
Prompt engineering helps—but only so much
When I first started tackling this, I genuinely thought prompts would solve everything. I wrote elaborate instructions: "Output pure JSON only," "No explanatory text," "Do not wrap in Markdown."
Total disaster.
The model would nod along and then cheerfully add "Certainly!" before the JSON anyway. I eventually got smarter about it—instead of just saying what to do, I learned to emphasise what not to do.
Here's the system prompt template I use now:
You are a professional data formatting assistant.
Strictly follow these rules:
1. Output ONLY pure JSON conforming to RFC 8259. No extra text, explanations, or comments.
2. Do NOT use Markdown code blocks. Do NOT add ```json markers.
3. Field names and types must match requirements exactly. Use double quotes for strings. No trailing commas.
4. Return only JSON. Do not answer any other questions.
But rules alone aren't enough. Models obey examples far more reliably than written instructions.
I typically add few-shot examples—two or three input-output pairs. And here's a trick I picked up from a Chinese tech blog that I now can't live without: describe your structure using TypeScript interfaces instead of JSON Schema. Models understand it much better.
At the prompt level, you'll solve maybe 60% of problems. It's cheap and fine for prototypes or low-risk scenarios.
But production? Relying on prompts alone is gambling.
API-native capabilities are a massive step up
Most providers now offer forced JSON output parameters.
OpenAI has responseformat={"type": "jsonobject"}, and DeepSeek has a similar JSON Mode. Under the hood, it's constrained decoding—when predicting the next token, any token that doesn't conform to JSON syntax gets masked out.
I've tested this. GPT-4o and DeepSeek hit legitimate JSON success rates above 99% with this approach.
Here's the catch though: valid JSON doesn't mean correct fields.
The model might give you perfectly valid JSON, but with field names like name_xxx, or string values where you needed integers. JSON Mode guarantees syntax, not schema.
That's why OpenAI later introduced Structured Outputs, where you can pass a full JSON Schema:
response = client.chat.completions.create(
model="gpt-4o",
response_format={
"type": "json_schema",
"json_schema": {
"name": "extraction_result",
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"}
}
}
}
}
)
This is much more solid. Field names, types, required fields—all guaranteed.
The problem? Not every model supports this. If you're self-hosting an open-source model, you don't get these niceties.
When you're on your own
If you're deploying open-source models yourself—vLLM or Llama.cpp—you need the heavy artillery.
The Outlines library and Llama.cpp's Grammars feature can lock down output format at the logits level. It's not "suggesting" the model output JSON—it's forcing it. At every generation step, only tokens that conform to the grammar rules are allowed to appear.
I used Outlines when deploying Qwen-2, and it was rock solid. Bonus discovery: inference actually got faster. Fewer candidate tokens means less computation.
Configuration is genuinely fiddly though. You define your data structure with Pydantic, convert it to JSON Schema, then feed that to Outlines. My first attempt took an entire afternoon.
There's also the Function Calling approach—tricking the model into outputting structured data as "function parameters." It's clever because function calling parameters are naturally JSON-formatted.
But honestly? Smaller models can't even do Function Calling reliably. I've tested some models on SiliconFlow's platform, and their parameter JSON still drops quotation marks.
Incredible.
This is where the last line of defence comes in.
Post-processing as a safety net
No matter what you do upstream, always assume the model output might be invalid.
My current approach: grab the output, try json.loads(), and if it fails, enter the repair pipeline.
The repair strategy has several layers:
- Strip extraneous text before and after (regex match from the first
{to the last}) - Fix common errors: single quotes to double quotes, remove trailing commas
- Use a JSON parser to locate the error position, then regenerate from that point
There's a library called strict-json that does exactly this. It parses until it hits an error, then only retries the portion after the error—no need to regenerate from scratch. Saves money.
If that still doesn't work, full retry. I usually set three attempts, each time feeding the previous error back to the model: "Your previous JSON output had a syntax error at character 42. Please correct it."
Three failures? That data point is probably fundamentally broken. Log it for manual review.
Two things that drove me absolutely mental
Beyond formatting issues, there are two more traps.
Truncation: The JSON is broken because max_tokens was too low—the model got cut off mid-sentence.
Two fixes: either bump up max_tokens, or use streaming with incremental parsing. Don't wait for the entire response before parsing. Use libraries like jstream to process chunks as they arrive. Even if it cuts off at the end, you've salvaged the earlier data.
Hallucination: You ask the model to extract a contract amount, the contract doesn't mention it, but the model confidently invents "£1 million."
This is where your schema needs generous use of Optional and Nullable. In your prompt, explicitly tell the model: if the source text doesn't contain the information, use null. Don't make things up.
This is especially critical with constrained decoding. If you don't give the model an escape hatch—the null option—the constraints force it to pick whatever answer looks most plausible, even if it's wrong.
My decision framework
After all this, here's a quick cheat sheet:
If you're primarily calling OpenAI or Claude APIs, just use the Instructor library. It wraps Pydantic definitions, JSON Mode, and retry logic into one package. Writing code feels like calling a local function. This is what I recommend most right now.
Self-hosting open-source models? Go with Outlines or Llama.cpp Grammars. Locking format at the logits level is the most reliable approach, and you get faster inference as a bonus.
For extremely complex data, try Function Calling combined with Chain of Thought. Let the model think in a thought field first, then output JSON in a data field. Often, incorrect outputs happen because the model hasn't thought things through.
One prompt trick: use TypeScript Interfaces instead of JSON Schema to describe structure, and in Completion mode, manually prepend { to the response.
What I still haven't figured out
Is fine-tuning actually worth it?
In theory, fine-tuning with a few hundred pure JSON examples using LoRA can build "muscle memory" into the model. I've tried it—a fine-tuned Qwen-2 can indeed match GPT-4 on specific tasks, at lower cost.
But the MLOps overhead of maintaining a fine-tuned model far exceeds writing a few lines of post-processing code. Model update? Re-fine-tune. Data distribution shift? Re-fine-tune. New task? You guessed it—re-fine-tune.
So my principle is: if you can solve it with engineering, don't rush into fine-tuning.
This whole strategy might be obsolete next year anyway. The field moves that fast. Six months ago I thought JSON Mode was the silver bullet. Now Structured Outputs exist.
Who knows what next year brings.
Key Takeaways:
- Prompt engineering solves ~60% of JSON problems—good for prototypes, risky for production
- API-native JSON Mode guarantees syntax but not schema compliance
- Structured Outputs (OpenAI) or Outlines/Grammars (self-hosted) lock things down properly
- Always post-process: assume outputs might be invalid and have a repair pipeline ready
- Use
nullgenerously in schemas to prevent hallucination under constraints
What's been your experience wrestling JSON out of LLMs? Found any tricks I haven't mentioned? Drop a comment below—I'm genuinely curious.
ai #programming #llm #python #webdev
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.