Home / Blog / Structured Outputs Saved My Sanity (and 200 Lines ...

Structured Outputs Saved My Sanity (and 200 Lines of Try-Catch)

By CaelLee | | 7 min read

Structured Outputs Saved My Sanity (and 200 Lines of Try-Catch)

Last Tuesday at 11 PM, I was staring at my screen, wondering if I should just become a farmer. Our team had spent two full days tweaking prompts for a customer service extraction task, and GPT-4 kept doing that thing where it returns "true" instead of true, or wraps JSON in `json fences, or—my personal favorite—invents entirely new fields we never asked for.

Then I switched to OpenAI's Structured Outputs. Ten minutes later, everything worked.

I genuinely wanted to slap myself.

Look, I know the official docs cover this feature pretty thoroughly. But after banging my head against it in production, I've found there's a gap between what the docs say and what actually happens when you push this thing hard. Let me save you some debug time.

What Problem This Actually Solves

For the uninitiated: until recently, getting GPT-4 to return valid JSON was basically a trust exercise. You'd write "Please return JSON format" in your prompt, cross your fingers, and then wrap your parsing logic in enough try-catch blocks to make a Java developer blush.

I still have trauma from a midnight rollout last year where a model decided to return "isactive": "true" instead of isactive: true, and the whole pipeline just... collapsed. Good times.

Structured Outputs changes the game by making valid JSON output a hard constraint rather than a polite suggestion. You give it a JSON Schema, and it guarantees—actually guarantees—the output will match. No missing fields, no type confusion, no creative additions.

The underlying mechanism isn't fully public, but from what I understand (and what makes technical sense), OpenAI injects schema constraints directly into the token sampling phase. Tokens that would violate the schema simply can't be selected. This isn't post-hoc validation—it's prevented at generation time. Which, honestly, is kind of brutal in the best way.

Schema Constraints: What You Can and Can't Do

The Supported Subset

Here's the thing the docs don't shout about: OpenAI uses a subset of JSON Schema, not the full spec. You need to know where the walls are.

What you get: object, string, number, boolean, array, null, plus enum.

Supported constraints:

What's missing:

This limitation hits hard in practice. Say you want the model to return either a cat (with a meow field) or a dog (with a bark field). That's literally what oneOf was designed for—and Structured Outputs doesn't support it.

Wait, I should clarify: it's not completely unsupported. As of August 2024, the gpt-4o-2024-08-06 version relaxed some anyOf scenarios. But strict oneOf mutual exclusion logic? Still a no-go. Don't get your hopes up.

A Schema That Actually Works


{
 "type": "object",
 "properties": {
 "name": {
 "type": "string",
 "description": "Customer's full name"
 },
 "age": {
 "type": "number",
 "description": "Age in years"
 },
 "tags": {
 "type": "array",
 "items": {
 "type": "string"
 },
 "description": "Interest tags"
 }
 },
 "required": ["name", "age"],
 "additionalProperties": false
}

That additionalProperties: false is not optional. I'll explain why in a moment, and trust me—this one bit me hard.

Three Traps I Fell Into

Trap #1: Skip `additionalProperties: false` and Prepare to Suffer

My first attempt defined an object with just name and score. No additionalProperties flag. The model occasionally returned {"name": "Alice", "score": 95, "comment": "Excellent work"}.

Wait—shouldn't Structured Outputs only return fields I defined? That's what I thought too. But here's the subtlety: without explicitly banning extra properties, OpenAI's constraint system doesn't hard-block them. It strongly encourages the model to follow your schema, but the real enforcement only applies to the types of fields you've declared.

Once I added additionalProperties: false, any extra field triggered an immediate error. Rock solid.

In my opinion, this should be the default behavior. But it's not, so learn from my mistake.

Trap #2: `required` Behaves Differently in Function Calling vs. Direct Calls

This one cost me an entire afternoon.

I was using the response_format parameter to specify a JSON Schema directly for structured output. My schema had required: ["name", "age"], but sometimes the model returned JSON without age.

Turns out, in non-function-calling mode, required isn't as strict as you'd think. The model tries its best, but if it feels information is insufficient, it might just... omit the field. My logs showed cases where "age": null would've been tolerable, but missing fields entirely? Nightmare.

The fix: In your description fields, explicitly state "return null or empty string if the information isn't available." Don't rely on required for hard enforcement.

One more thing—if you're using function calling, the required constraint is significantly stricter. I suspect the parameter validation layer in function calling mode is just more mature. Go figure.

Trap #3: Array Items Only Support Single Types

I needed the model to return a mixed-type array: [1, "hello", true]. With items, you can only define one type. No anyOf for multiple types.


// This won't work
{
 "type": "array",
 "items": {
 "anyOf": [
 {"type": "string"},
 {"type": "number"}
 ]
 }
}

The workaround I landed on: convert mixed arrays into object arrays, where each object uses different fields for different types. It's ugly. It works. But honestly, moments like this make me nostalgic for the days of just using Pydantic for validation.

Real Performance Numbers

We benchmarked 500 customer service conversation extractions using gpt-4o-2024-11-20, comparing three approaches:

ApproachFormat ComplianceField AccuracyAvg Latency
Pure prompt guidance87%91%1.2s
Prompt + retry on validation failure96%93%2.8s

That 100% format compliance? Expected—it's a hard constraint. The field accuracy improvement mostly came from type enforcement. The model used to return numbers as strings constantly. Not anymore.

Latency is slightly higher than pure prompting, but dramatically better than the "validate and retry" approach. Plus, we deleted nearly 200 lines of retry logic and try-catch spaghetti. On my M2 MacBook Pro, the refactored code felt almost boring in its reliability. Boring is good.

When to Use It (and When to Walk Away)

Great fit for:

Not ideal for:

There's a gray area I'm still puzzling over: if your schema is deeply nested and complex, does the added latency from Structured Outputs outweigh the cost of retry logic? I suspect the answer depends heavily on your specific use case. We saw accuracy drop 8 percentage points with 5-level nesting, so... test your own scenarios.

Practical Tips

  1. description matters way more than you think. Schema constraints only guarantee types, not content quality. Write specific, detailed descriptions. I've started writing things like "Extract the product name in its original language, preserve brand casing" instead of generic placeholder text.
  1. Don't forget strict: true. Set "strict": true in your response_format for tighter constraints. This only stabilized in SDK versions after October 2024—older versions had bugs, so check your SDK version.
  1. enum is your best friend. If a field has a limited set of possible values, lock it down with enum. Much more reliable than letting the model freestyle.
  1. Debug without additionalProperties first. See what extra fields the model wants to return. Often, those are signals about information your schema is missing.
  1. Keep nesting shallow. The docs say nesting is supported, but my experience says 3 levels max. Beyond that, things get weird. We tried 5-level nesting and accuracy fell off a cliff.

The Bottom Line

Structured Outputs isn't revolutionary technology. It's something more valuable: infrastructure that standardizes what we used to hack together with prompt engineering tricks and prayer. It's the kind of "boring" improvement that actually makes your codebase simpler and your sleep better.

I removed 200 lines of validation and retry logic. The code is simpler. The outputs are predictable. That's a win in my book.

What about you? Have you rolled this into production yet? Hit any weird edge cases I missed? I'm especially curious how people are working around the oneOf limitation—I've been experimenting with multiple enum fields to simulate mutual exclusion, but it feels hacky. Drop your approach in the comments.

OpenAI #StructuredOutputs #JSONSchema #LLM #DeveloperExperience

Structured Outputs100%96%1.5s
C

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.

Ready to get started?

Get your API key and start building with 180+ AI models.

Get API Key Free