Structured Outputs Saved My Sanity (and 200 Lines of Try-Catch)
Last Tuesday at 11 PM, I was staring at my screen, wondering if I should just become a farmer. Our team had spent two full days tweaking prompts for a customer service extraction task, and GPT-4 kept doing that thing where it returns "true" instead of true, or wraps JSON in ```json fences, or (my personal favorite) invents entirely new fields we never asked for.
Then I switched to OpenAI's Structured Outputs. Ten minutes later, everything worked.
I genuinely wanted to slap myself.
Look, I know the official docs cover this feature pretty thoroughly. But after banging my head against it in production, I've found there's a gap between what the docs say and what actually happens when you push this thing hard. Let me save you some debug time.
What Problem This Actually Solves
For the uninitiated: until recently, getting GPT-4 to return valid JSON was basically a trust exercise. You'd write "Please return JSON format" in your prompt, cross your fingers, and then wrap your parsing logic in enough try-catch blocks to make a Java developer blush.
I still have trauma from a midnight rollout last year where a model decided to return `"isactive": "true"` (a string) instead of `"isactive": true` (a boolean), and the whole pipeline just... collapsed. Good times.
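For anyone who hasn't had the pleasure, here's a rough Python sketch of the kind of defensive parsing this leads to. The helper name `parse_model_json` and the `bool_fields` parameter are made up for illustration; nothing here comes from a library beyond the standard `json` module.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built indirectly so it can live inside this block


def parse_model_json(raw: str, bool_fields: tuple[str, ...] = ("isactive",)) -> dict:
    """Best-effort cleanup of a model reply that was only *asked* to be JSON."""
    text = raw.strip()
    # Strip a surrounding Markdown fence, with or without a "json" language tag.
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        if text.lower().startswith("json"):
            text = text[4:]
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-len(FENCE)]
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Coerce stringified booleans such as "true" back into real booleans.
    for field in bool_fields:
        value = data.get(field)
        if isinstance(value, str) and value.lower() in ("true", "false"):
            data[field] = value.lower() == "true"
    return data
```

Every branch in there is a failure mode someone hit in production once, which is roughly how you end up with 200 lines of it.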
Structured Outputs changes the game by making valid JSON output a hard constraint rather than a polite suggestion. You give it a JSON Schema, and it guarantees—actually guarantees—the output will match. No missing fields, no type confusion, no creative additions.
The underlying mechanism isn't fully public, but from what I understand (and what makes technical sense), OpenAI injects schema constraints directly into the token sampling phase. Tokens that would violate the schema simply can't be selected. This isn't post-hoc validation—it's prevented at generation time. Which, honestly, is kind of brutal in the best way.
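To make the idea concrete, here's a toy Python sketch of constrained token selection. It is emphatically not OpenAI's implementation (which, as noted, isn't fully public); the vocabulary, scores, and the `constrained_pick` helper are all invented for illustration.

```python
import math


def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """Pick the highest-scoring token among those the schema currently allows."""
    # Disallowed tokens get a score of -inf, so they can never win.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token available")
    return best


# Toy scores: the model "prefers" the quoted string, which would break a boolean field.
logits = {'"true"': 2.1, "true": 1.7, "false": 0.3, "maybe": -0.5}
# For a field declared as boolean, only the bare literals are legal next tokens.
allowed_for_boolean = {"true", "false"}

print(constrained_pick(logits, allowed_for_boolean))  # true
```

The quoted `"true"` has the highest raw score, but it never gets a chance to be chosen. That is the difference between preventing a bad token and detecting it afterwards.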
Schema Constraints: What You Can and Can't Do
The Supported Subset
Here's the thing the docs don't shout about: OpenAI uses a subset of JSON Schema, not the full spec. You need to know where the walls are.
What you get: object, string, number, boolean, array, null, plus enum.
Supported constraints:
- `type` (required)
- `properties` for object fields
- `required` array for mandatory fields
- `items` for array element types
- `enum` for restricted values
- `description` for field guidance
- `additionalProperties: false` to block extra fields
What's missing:
- `oneOf`, `anyOf`, `allOf` (the combinatorial keywords)
- `pattern` regex validation
- `minLength`/`maxLength`, `minimum`/`maximum` numeric constraints
- `$ref` references
- Nested `anyOf` complexity
This limitation hits hard in practice. Say you want the model to return either a cat (with a meow field) or a dog (with a bark field). That's literally what oneOf was designed for—and Structured Outputs doesn't support it.
Wait, I should clarify: it's not completely unsupported. As of August 2024, the gpt-4o-2024-08-06 version relaxed some anyOf scenarios. But strict oneOf mutual exclusion logic? Still a no-go. Don't get your hopes up.
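If the schema can't express the mutual exclusion, one option is to check it yourself after parsing. The sketch below assumes you've flattened both shapes into a single object with optional `meow` and `bark` fields; that layout and the `classify_pet` helper are my own illustration, not something the API provides.

```python
def classify_pet(pet: dict) -> str:
    """Enforce 'exactly one of meow/bark' after parsing, since the schema cannot."""
    has_meow = pet.get("meow") is not None
    has_bark = pet.get("bark") is not None
    # Both present or both absent violates the either/or rule.
    if has_meow == has_bark:
        raise ValueError("expected exactly one of 'meow' or 'bark'")
    return "cat" if has_meow else "dog"


print(classify_pet({"meow": "mrrp", "bark": None}))  # cat
```

It moves one rule out of the schema and back into your own code, which is a step backwards, but a small and well-contained one.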
A Schema That Actually Works
```json
{
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "Customer's full name"
    },
    "age": {
      "type": "number",
      "description": "Age in years"
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "Interest tags"
    }
  },
  "required": ["name", "age"],
  "additionalProperties": false
}
```
That additionalProperties: false is not optional. I'll explain why in a moment, and trust me—this one bit me hard.
Three Traps I Fell Into
Trap #1: Skip `additionalProperties: false` and Prepare to Suffer
My first attempt defined an object with just name and score. No additionalProperties flag. The model occasionally returned {"name": "Alice", "score": 95, "comment": "Excellent work"}.
Wait—shouldn't Structured Outputs only return fields I defined? That's what I thought too. But here's the subtlety: without explicitly banning extra properties, OpenAI's constraint system doesn't hard-block them. It strongly encourages the model to follow your schema, but the real enforcement only applies to the types of fields you've declared.
Once I added additionalProperties: false, any extra field triggered an immediate error. Rock solid.
In my opinion, this should be the default behavior. But it's not, so learn from my mistake.
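Since it's easy to forget the flag on a nested object, a small pre-flight check can help. Here's a minimal Python sketch; `objects_missing_strict_flag` is a hypothetical helper of my own, and it only walks `properties` and `items`, which is enough for simple schemas like the ones in this post.

```python
def objects_missing_strict_flag(schema: dict, path: str = "$") -> list[str]:
    """Return paths of object schemas that don't set additionalProperties to False."""
    problems = []
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            problems.append(path)
        # Recurse into each declared property.
        for name, sub in schema.get("properties", {}).items():
            problems += objects_missing_strict_flag(sub, f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        # Recurse into the array's element schema.
        problems += objects_missing_strict_flag(schema["items"], f"{path}[]")
    return problems


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "score": {"type": "number"},
        "history": {
            "type": "array",
            "items": {"type": "object", "properties": {"note": {"type": "string"}}},
        },
    },
    "required": ["name", "score"],
    "additionalProperties": False,
}

print(objects_missing_strict_flag(schema))  # ['$.history[]']
```

Running something like this in a unit test catches the omission before the model ever sees the schema.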
Trap #2: `required` Behaves Differently in Function Calling vs. Direct Calls
This one cost me an entire afternoon.
I was using the response_format parameter to specify a JSON Schema directly for structured output. My schema had required: ["name", "age"], but sometimes the model returned JSON without age.
Turns out, in non-function-calling mode, required isn't as strict as you'd think. The model tries its best, but if it feels information is insufficient, it might just... omit the field. My logs showed cases where "age": null would've been tolerable, but missing fields entirely? Nightmare.
The fix: In your description fields, explicitly state "return null or empty string if the information isn't available." Don't rely on required for hard enforcement.
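Here's roughly what that looks like in practice, as a Python sketch. The `["number", "null"]` union is standard JSON Schema, but whether your target model accepts it is worth verifying; the `missing_required` helper is hypothetical and just gives you a cheap safety net after parsing.

```python
schema = {
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "Customer's full name. Return an empty string if it isn't available.",
        },
        "age": {
            # Assumption: a nullable type lets the model say "unknown" without dropping the key.
            "type": ["number", "null"],
            "description": "Age in years. Return null if it isn't available.",
        },
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}


def missing_required(payload: dict, schema: dict) -> list[str]:
    """List required keys that are absent (a key set to null still counts as present)."""
    return [key for key in schema.get("required", []) if key not in payload]


print(missing_required({"name": "Alice"}, schema))               # ['age']
print(missing_required({"name": "Alice", "age": None}, schema))  # []
```

The point of the check is to separate "the model said it doesn't know" from "the model silently dropped the field", since only the second one should page anybody.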
One more thing—if you're using function calling, the required constraint is significantly stricter. I suspect the parameter validation layer in function calling mode is just more mature. Go figure.
Trap #3: Array Items Only Support Single Types
I needed the model to return a mixed-type array: [1, "hello", true]. With items, you can only define one type. No anyOf for multiple types.
This won't work:

```json
{
  "type": "array",
  "items": {
    "anyOf": [
      {"type": "string"},
      {"type": "number"}
    ]
  }
}
```
The workaround I landed on: convert mixed arrays into object arrays, where each object uses different fields for different types. It's ugly. It works. But honestly, moments like this make me nostalgic for the days of just using Pydantic for validation.
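A minimal Python sketch of that workaround, for the curious. The field names (`string_value`, `number_value`, `boolean_value`) and both helpers are illustrative choices, not the exact ones from our codebase.

```python
def encode_mixed(values: list) -> list[dict]:
    """Turn [1, "hello", True] into uniform objects with one field per type."""
    encoded = []
    for value in values:
        item = {"string_value": None, "number_value": None, "boolean_value": None}
        # bool must be checked before int/float: in Python, True is also an int.
        if isinstance(value, bool):
            item["boolean_value"] = value
        elif isinstance(value, (int, float)):
            item["number_value"] = value
        elif isinstance(value, str):
            item["string_value"] = value
        else:
            raise TypeError(f"unsupported element type: {type(value).__name__}")
        encoded.append(item)
    return encoded


def decode_mixed(items: list[dict]) -> list:
    """Invert encode_mixed: take the single non-null field from each object."""
    decoded = []
    for item in items:
        present = [v for v in item.values() if v is not None]
        if len(present) != 1:
            raise ValueError("expected exactly one populated field per item")
        decoded.append(present[0])
    return decoded


print(decode_mixed(encode_mixed([1, "hello", True])))  # [1, 'hello', True]
```

On the schema side, each array element then becomes a single object type, so `items` only ever has to describe one shape.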
Real Performance Numbers
We benchmarked 500 customer service conversation extractions using gpt-4o-2024-11-20, comparing three approaches:
| Approach | Format Compliance | Field Accuracy | Avg Latency |
|---|---|---|---|
| Pure prompt guidance | 87% | 91% | 1.2s |
| Prompt + retry on validation failure | 96% | 93% | 2.8s |
| Structured Outputs | 100% | 96% | 1.5s |
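For reference, the "prompt + retry" pattern in the second row generally looks something like the Python sketch below. This illustrates the general approach, not the harness behind the numbers above; `call_model` is a hypothetical stand-in for whatever function sends the prompt and returns the raw reply text.

```python
import json
from typing import Callable


def extract_with_retry(
    call_model: Callable[[str], str],
    prompt: str,
    required_keys: tuple[str, ...],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Call the model until the reply parses as an object with the required keys.

    Returns the parsed object and the number of attempts it took.
    """
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        reply = call_model(prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not an object"
            continue
        missing = [key for key in required_keys if key not in data]
        if missing:
            last_error = f"missing keys: {missing}"
            continue
        return data, attempt
    raise RuntimeError(f"gave up after {max_attempts} attempts: {last_error}")
```

Each failed attempt costs another full round trip to the model, which is consistent with the retry row showing the highest average latency.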
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.