I Made 1,000 API Calls to GPT-4 and Realised Function Calling's "Structured Output" Is a Lie

Last Wednesday at 2:47 AM, PagerDuty woke me up.

The 43rd "JSON Parse Failed" alert. I squinted at the monitoring dashboard, and it suddenly hit me — we've been way too optimistic about AI's structured output capabilities. Like, dangerously optimistic.

Here's a number that'll make you uncomfortable: I analysed three months of GPT-4-0613 call logs from our production system. Pure JSON format error rate: 12.7%. And I'm not talking about semantic mistakes or wrong parameter values. I mean the format itself broke — extra commas, missing quotation marks, or my personal favourite, the model randomly inserting // TODO: confirm later in the middle of a JSON object.
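For reference, a format error rate like this can be measured by checking whether each logged raw response parses at all, regardless of whether the values are right. A minimal sketch, assuming the logs are available as a list of raw response strings; the `format_error_rate` helper and the sample strings are illustrative, not the production code:

```python
import json

def format_error_rate(raw_outputs: list[str]) -> float:
    """Return the fraction of outputs that fail strict JSON parsing."""
    if not raw_outputs:
        return 0.0
    failures = 0
    for raw in raw_outputs:
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(raw_outputs)

# Illustrative samples, not real log lines
samples = [
    '{"total": 399}',                         # valid
    '{"total": 399,}',                        # trailing comma
    '{"total": 399 // TODO: confirm later}',  # comment inside JSON
]
print(f"{format_error_rate(samples):.1%}")
```

This only counts syntax failures. A string where a number should be still parses, so type drift has to be measured separately.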

You're probably thinking, "Jordan, did you not enable response_format?"

Ha.

Not only did I enable it, I built three layers of fallback. I put "MUST return valid JSON" in the system prompt three times. In bold. But here's the thing — when a model lacks confidence, it "creatively" violates your structural constraints. The harder you try to box it in, the more it rebels.

The Bug That Woke Me Up at 2:47 AM

The requirement was dead simple. Extract order information from a user conversation and output this schema:


{
 "order_id": "string",
 "items": [{"name": "string", "quantity": "integer"}],
 "total": "number"
}

Completely harmless, right? I followed OpenAI's documentation religiously, configured the functions parameter, wrote painfully detailed descriptions.

First 100 tests: flawless.

First week in production: flawless.

Until this user input:

"I want to return that black one from last time — wait, no, it was more of a dark blue — anyway, the one that was 399"

The model returned:


{
 "order_id": null,
 "items": [{"name": "that black one (or maybe dark blue)", "quantity": 1}],
 "total": "399"
}

See that? The total field type just collapsed. Because the user showed uncertainty describing the colour, the model started "hesitating" on the numeric value too, wrapping it in quotation marks on its own initiative.
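This kind of drift is invisible to a plain JSON parser, because the output above is perfectly valid JSON. A dependency-free sketch of a check that would have caught it, assuming the schema is mirrored as a nested Python structure (`type_mismatches` and `ORDER_SHAPE` are hypothetical names for illustration):

```python
def type_mismatches(value, expected, path=""):
    """Return the paths whose JSON type differs from the expected shape.

    `expected` is a Python type (or tuple of types), a dict of
    field -> expected, or a one-element list describing each item.
    """
    if isinstance(expected, dict):
        if not isinstance(value, dict):
            return [path or "$"]
        problems = []
        for key, sub in expected.items():
            child = f"{path}.{key}" if path else key
            if key not in value:
                problems.append(child)
            else:
                problems += type_mismatches(value[key], sub, child)
        return problems
    if isinstance(expected, list):
        if not isinstance(value, list):
            return [path or "$"]
        problems = []
        for i, item in enumerate(value):
            problems += type_mismatches(item, expected[0], f"{path}[{i}]")
        return problems
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) and expected is not bool:
        return [path or "$"]
    return [] if isinstance(value, expected) else [path or "$"]

ORDER_SHAPE = {
    "order_id": str,
    "items": [{"name": str, "quantity": int}],
    "total": (int, float),
}

bad = {
    "order_id": None,
    "items": [{"name": "that black one (or maybe dark blue)", "quantity": 1}],
    "total": "399",
}
print(type_mismatches(bad, ORDER_SHAPE))  # → ['order_id', 'total']
```

Note that it flags two fields, not one: the null order_id breaks the schema just as much as the quoted total.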

This is Function Calling's first trap: semantic uncertainty infects syntactic structure.

Actually, let me correct myself — "infects" isn't quite right. What's happening is that when models process ambiguous input, they exhibit a kind of "compensatory behaviour" at the structural level. I saw a paper on this at a NeurIPS 2024 workshop — they called it "structural hedging." The basic idea is that models break structure to signal "I'm not sure either." It's... complicated. Let's not go down that rabbit hole.

Why Schema Constraints Fail

A lot of people don't get this. You defined type: "number" — why would the model output a string?

Here's the counterintuitive truth: Function Calling parameter definitions are "strong suggestions" to the model, not hard constraints.

From a token generation perspective, the functions parameter is just part of the context. When the model generates the next token after "total":, it faces two competing forces:

Syntactic pressure: should output a number (from the Schema)
Semantic pressure: user is uncertain, need to preserve ambiguity (from the conversation)

When semantic pressure wins, the model picks a "compromise" — wrapping the number in a string.

I tested an even more absurd case:


# Function parameter definition
"parameters": {
  "type": "object",
  "properties": {
    "user_age": {"type": "integer"}
  }
}

# User input: "I'm around 30-ish"
# Model output: {"user_age": "30-35"}
# ← Completely ignores type constraint AND gives you a range

OpenAI's technical support told me this was "expected behaviour." I was like... what?

From what I understand, this ties back to the model's RLHF alignment strategy. If the training process over-emphasises "understanding user intent," the model learns that comprehension matters more than format compliance.

Three Survival Mechanisms

After that incident, I built a "defensive parsing" strategy. Error rate dropped from 12.7% to under 0.3%.

1. JSON Repair + Type Coercion

Stop expecting models to output perfect JSON. Treat it like a note from your slightly drunk mate at the pub.

I use the json_repair library, but repair alone isn't enough — you need type coercion too:


import json
import re

import json_repair
from pydantic import BaseModel

def safe_parse(llm_output: str, schema: type[BaseModel]) -> BaseModel:
    # Layer 1: Fix malformed JSON
    repaired = json_repair.repair_json(llm_output)

    # Layer 2: Type coercion for integer fields the model returned as strings
    data = json.loads(repaired)
    for field_name, field_info in schema.model_fields.items():
        if field_info.annotation is int and isinstance(data.get(field_name), str):
            # Try extracting the first run of digits from the string
            numbers = re.findall(r'\d+', data[field_name])
            if numbers:
                data[field_name] = int(numbers[0])

    # Layer 3: Schema validation (raises pydantic.ValidationError on failure)
    return schema(**data)

Critical detail: Never use json.loads directly. Repair first, then parse. I've seen models output {'total': 399,} (Python dict syntax), and // This is the total\n"total": 399 (JSON with comments). json_repair handles most of these.
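One caveat with the coercion layer: taking the first run of digits turns "30-35" into 30 without telling anyone. A more conservative variant, sketched here with the standard library only, coerces a string only when it contains exactly one number and otherwise flags the field for review (`coerce_number` and its return convention are assumptions for illustration):

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def coerce_number(raw):
    """Return (value, needs_review). Coerce only unambiguous inputs."""
    if isinstance(raw, bool):
        return None, True
    if isinstance(raw, (int, float)):
        return raw, False
    if isinstance(raw, str):
        matches = NUMBER.findall(raw)
        if len(matches) == 1:
            text = matches[0]
            return (float(text) if "." in text else int(text)), False
    # Ranges, multiple numbers, or no number at all: do not guess
    return None, True

print(coerce_number("399 yuan"))  # → (399, False)
print(coerce_number("30-35"))     # → (None, True)
```

The trade-off is that formatted values such as "1,299" are also flagged rather than coerced, which is the safe direction to be wrong in.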

2. Constrained Sampling

This is the actual nuclear option.

Instead of letting the model generate freely and then fixing it, clamp down during generation.

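I use the outlines library, which masks the model's logits at each decoding step to enforce Schema compliance. That requires direct access to the logits, so it applies to models you run yourself rather than ones you can only reach through a hosted completion API: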


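import outlines

# Assumption: outlines 0.x API with a locally hosted model; the model id is a placeholder
model = outlines.models.transformers("your-model-id")

schema = """
{
  "type": "object",
  "properties": {
    "total": {"type": "number"}
  }
}
"""

generator = outlines.generate.json(model, schema)
result = generator("User said they spent 399 yuan total")
# result["total"] is guaranteed to be a number type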

The principle is brutally simple: when the model needs to generate the total field value, outlines sets the logits of every token that would violate the schema to negative infinity, so those tokens end up with zero probability. The model never even gets the chance to output a quotation mark.
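The masking step can be illustrated with a toy softmax in plain Python. This is a sketch of the idea only, not the library's actual implementation; the four-token vocabulary and the logit values are made up:

```python
import math

def masked_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Softmax over token logits after setting disallowed tokens to -inf."""
    masked = {
        tok: (score if tok in allowed else -math.inf)
        for tok, score in logits.items()
    }
    peak = max(masked.values())
    if peak == -math.inf:
        raise ValueError("no allowed token in the vocabulary")
    # exp(-inf) is exactly 0.0, so masked tokens can never be sampled
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy next-token logits right after `"total":` (values are made up).
# Unconstrained, the quotation mark has the highest score.
logits = {'"': 2.0, "3": 1.5, "9": 0.5, "n": 0.1}
probs = masked_softmax(logits, allowed={"3", "9"})
print(probs['"'])  # → 0.0
```

Even though the quotation mark was the most likely token before masking, it ends up with probability zero, and the remaining mass is renormalised over the digits.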

The cost? Inference is 15-20% slower. Every step requires Schema validation — it's like putting handcuffs on the model. But worth it.

3. Structured Output with "Degradation Strategy"

This one I've been experimenting with recently.

The idea is simple: when the model lacks confidence in a field, let it output "structured data with metadata."


{
  "total": 399,
  "_metadata": {
    "total_confidence": 0.92,
    "total_raw": "399 yuan",
    "needs_human_review": false
  }
}

Three benefits:

Doesn't break the main Schema structure
Preserves the model's "uncertainty" information
Downstream systems decide on human intervention based on confidence scores

I tried this in a customer service system, and human intervention rates dropped by 40%. Because most "ambiguous" scenarios don't actually need human review — the previous system just couldn't distinguish between "genuinely ambiguous" and "slightly fuzzy."
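On the consuming side, the routing decision can be a few lines. A minimal sketch, assuming the `_metadata` layout shown earlier; the `needs_human_review` function name and the 0.8 threshold are placeholders, not values from my system:

```python
def needs_human_review(payload: dict, field: str, threshold: float = 0.8) -> bool:
    """Decide whether a field should be escalated to a human."""
    meta = payload.get("_metadata") or {}
    if meta.get("needs_human_review"):
        return True
    confidence = meta.get(f"{field}_confidence")
    # Missing or non-numeric confidence counts as unknown, so escalate
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return True
    return confidence < threshold

payload = {
    "total": 399,
    "_metadata": {
        "total_confidence": 0.92,
        "total_raw": "399 yuan",
        "needs_human_review": False,
    },
}
print(needs_human_review(payload, "total"))  # → False
```

Self-reported confidence is not guaranteed to be calibrated, so the threshold is worth tuning against a sample of outputs you have checked by hand.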

Some Uncomfortable Truths

Function Calling, from a product perspective, is half-baked.

When OpenAI launched Structured Outputs in August 2024, they reported 100% schema compliance on their own evals. I tested it: gpt-4o-2024-08-06 is definitely better, but it still stumbles in long-context, multi-turn conversation scenarios.

I've seen too many teams (including myself six months ago) treating Function Calling like a database query. The result? 3 AM wake-up calls to fix JSON parsing errors.

Here's the ironic bit: Claude is actually more reliable at structured output. I ran the same test suite, and Claude 3.5 Sonnet's format error rate was only 3.2%. My guess is Anthropic weighted format correctness higher during RLHF.

This isn't about picking sides. The real issue is: the entire industry systematically overestimates LLMs' structured output capabilities.

Scroll through HN or r/MachineLearning — every few days someone posts "Why does my Function Calling always return broken JSON?" And the comments are full of "your prompt isn't good enough." Rubbish.

What Should You Actually Do?

If you're deploying Function Calling in production right now, three things:

Never trust the model's output format, even if it got the last 100 calls right
Constrained sampling is 10x more effective than prompt engineering, but you'll sacrifice some flexibility
Design your Schema with escape hatches — like _metadata fields — to give uncertainty somewhere to go
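The first point translates into a validate-and-retry wrapper around every call. A minimal sketch, assuming a hypothetical `call_model` function that takes a prompt and returns raw text; the stub model below exists only to make the example runnable:

```python
import json
from typing import Callable, Optional

def call_with_retries(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    is_valid: Callable[[dict], bool],
    max_attempts: int = 3,
) -> Optional[dict]:
    """Return the first output that parses and validates, else None."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # broken format: try again
        if isinstance(data, dict) and is_valid(data):
            return data
    return None  # caller should escalate, e.g. to human review

# Stub model that fails twice before behaving (illustration only)
canned = iter(['{"total": "399"}', '{"total": 399,}', '{"total": 399}'])
result = call_with_retries(
    lambda _prompt: next(canned),
    "extract the order",
    lambda d: isinstance(d.get("total"), (int, float)),
)
print(result)  # → {'total': 399}
```

Returning None instead of raising keeps the failure path explicit: the caller has to decide what happens when every attempt is rejected.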

One last question: what's the most absurd Function Calling output you've seen in production?

I once had a model write an apology letter inside the JSON — "Sorry, I'm not sure what this parameter should be, here are some possibilities:" — followed by a bulleted list. I've also seen it return a Markdown table inside an array field.

Drop your stories in the comments. I'll send HackerNoon stickers to the three most ridiculous ones. I grabbed a massive stack at the AI Engineer Summit in SF last year — finally clearing out my inventory.

Related reads:

Why Your Prompt Engineering Is Actually Overfitting
Defensive Programming in the LLM Era: From Try-Catch to Schema-Fix
I Made GPT-4 and Claude 3.5 Review Each Other's Code and Found 47 Bugs

#programming #AI #function-calling #LLM #production-war-stories


Cael Lee
