I Tested 6 AI Models on Nested JSON Function Calling — The Results Were a Hot Mess
Last Tuesday, I spent an entire afternoon staring at logs, coffee going cold, wondering why my perfectly clear prompt kept dropping order.items[0].discount from the output. The model was supposed to extract a nested JSON structure for an e-commerce order. Instead, it decided that discount field was optional life advice.
That's when I knew — this wasn't going to be simple.
So I did what any slightly-obsessive developer would do: I built a test suite and ran six major models through the wringer on complex nested JSON extraction. Some performed like champs. Others... let's just say I had to clean coffee off my monitor.
TL;DR
- When it comes to deeply nested JSON schemas, model performance varies wildly: on the hardest test group, field-level accuracy ranged from 71.9% to 88.5%
- GPT-4o and Claude 3.5 Sonnet lead the pack, but they both have weird failure modes
- Chinese models have improved dramatically, but 3+ levels of nesting is still their kryptonite
- Real test data + war stories below — this'll save you hours of debugging
Why You Should Care About Nested Function Calling
Honestly, if you're just calling a weather API or looking up stock prices, any model on the market will do fine. Flat parameters, simple structures — you're looking at 95%+ accuracy across the board.
But real business logic? It's never that gentle.
Here's the schema I'm dealing with for my e-commerce project's order creation endpoint:
{
  "type": "object",
  "properties": {
    "order": {
      "type": "object",
      "properties": {
        "customer": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "address": {
              "type": "object",
              "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "geo": {
                  "type": "object",
                  "properties": {
                    "lat": {"type": "number"},
                    "lng": {"type": "number"}
                  }
                }
              }
            }
          }
        },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "sku": {"type": "string"},
              "quantity": {"type": "integer"},
              "discount": {
                "type": "object",
                "properties": {
                  "type": {"type": "string"},
                  "value": {"type": "number"}
                }
              }
            }
          }
        }
      }
    }
  }
}
Four levels deep. Arrays containing objects containing more objects. A user says "order two black t-shirts, use that 20% off coupon from last time," and the model needs to populate this entire structure correctly.
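Get it wrong, and the order blows up. My downstream system is written in Go. json.Unmarshal itself doesn't error on a missing nested level; it quietly leaves that field at its zero value (nil for a pointer or map), and in my case the first bit of code that dereferences it panics. The entire request chain returns a 500. Not great at 3 AM.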
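One cheap way to catch this before the payload ever reaches the downstream service is to check that every nested path it depends on actually exists. Here's a minimal sketch in Python. The `REQUIRED_PATHS` list, the `missing_paths` helper, and the sample SKU are all made up for illustration, not part of my actual pipeline; adjust the paths to whatever your own consumer dereferences.

```python
from typing import Any

# Hypothetical list of paths the downstream consumer reads without nil checks.
REQUIRED_PATHS = [
    "order.customer.name",
    "order.customer.address.geo.lat",
    "order.customer.address.geo.lng",
    "order.items[].sku",
    "order.items[].quantity",
]


def _exists(node: Any, parts: list[str]) -> bool:
    """Walk `parts` down from `node`; a trailing `[]` means 'every array element'."""
    if not parts:
        return node is not None
    head, rest = parts[0], parts[1:]
    if head.endswith("[]"):
        items = node.get(head[:-2]) if isinstance(node, dict) else None
        # A missing or empty array counts as missing; each element must satisfy the rest.
        return isinstance(items, list) and bool(items) and all(_exists(i, rest) for i in items)
    if not isinstance(node, dict) or head not in node:
        return False
    return _exists(node[head], rest)


def missing_paths(payload: Any, paths: list[str]) -> list[str]:
    """Return every required path that is absent (or null) in the payload."""
    return [p for p in paths if not _exists(payload, p.split("."))]


# Example: the model dropped the whole `geo` level ("TS-BLK" is a made-up SKU).
payload = {
    "order": {
        "customer": {"name": "A", "address": {"city": "X"}},
        "items": [{"sku": "TS-BLK", "quantity": 2}],
    }
}
print(missing_paths(payload, REQUIRED_PATHS))
# ['order.customer.address.geo.lat', 'order.customer.address.geo.lng']
```

Rejecting or retrying the call when this list is non-empty turns a 3 AM panic into an ordinary validation error.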
The Test Setup: Keeping It Real
I tested six models, all running their latest versions as of April 2025:
- GPT-4o (OpenAI, gpt-4o-2025-01-29)
- Claude 3.5 Sonnet (Anthropic, claude-3-5-sonnet-20250316)
- Gemini 1.5 Pro (Google, gemini-1.5-pro-20250201)
- Qwen-Max (Alibaba, qwen-max-20250328)
- DeepSeek-V3 (DeepSeek, deepseek-v3-20250315)
- GLM-4-Plus (Zhipu, glm-4-plus-20250210)
One caveat: I used GLM-4-Plus from February 10th, but Zhipu released a minor update in late March that specifically addressed tool calling bugs. I didn't have time to retest that version, so the numbers below only reflect the February release. Just flagging that upfront so I don't mislead anyone.
I designed three test groups with escalating difficulty:
- Basic nesting: 2-level object nesting, no arrays
- Array nesting: Object arrays where each element contains nested objects (that order schema above)
- Deep nesting + missing fields: 4 levels deep, with user inputs deliberately omitting certain fields — testing the model's ability to infer and complete
50 test cases per group, mixed Chinese and English. Because let's be real: in international e-commerce, users switch between "帮我 apply 一个 discount code" ("help me apply a discount code") and "这个 order 走 VIP channel" ("send this order through the VIP channel") constantly. The model needs to handle it.
The metric? Field-level match rate between extracted JSON and expected output. I used Python's deepdiff library with ignore_order=True, so the comparison covers structure and values but not the order of array elements (key order within an object is never treated as a difference anyway).
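To make "field-level match rate" concrete, here is a rough standard-library sketch of the idea. It is not my deepdiff-based harness: the `flatten` and `field_match_rate` names are invented for this example, it compares array elements by index rather than ignoring their order, and it doesn't penalize extra fields the model adds. The sample SKU and discount type are placeholders too.

```python
from typing import Any


def flatten(node: Any, prefix: str = "") -> dict[str, Any]:
    """Map each leaf value to its path, e.g. 'order.items[0].discount.value'."""
    if isinstance(node, dict):
        out: dict[str, Any] = {}
        for key, value in node.items():
            out.update(flatten(value, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(node, list):
        out = {}
        for i, value in enumerate(node):
            out.update(flatten(value, f"{prefix}[{i}]"))
        return out
    return {prefix: node}


def field_match_rate(expected: Any, actual: Any) -> float:
    """Fraction of expected leaf fields that appear in `actual` with an equal value."""
    want, got = flatten(expected), flatten(actual)
    if not want:
        return 1.0
    hits = sum(1 for path, value in want.items() if path in got and got[path] == value)
    return hits / len(want)


# Example: the model kept sku and quantity but dropped the discount object.
expected = {"order": {"items": [
    {"sku": "TS-BLK", "quantity": 2, "discount": {"type": "percent", "value": 20}},
]}}
actual = {"order": {"items": [{"sku": "TS-BLK", "quantity": 2}]}}
print(field_match_rate(expected, actual))  # 0.5 (2 of 4 expected leaf fields matched)
```

Scoring per leaf field rather than per test case means a response that drops one nested object still gets credit for everything it got right.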
The Numbers: Who Delivered and Who Faceplanted
Here's the raw data:
| Model | Basic Nesting | Array Nesting | Deep + Missing | Overall |
|---|---|---|---|---|
| GPT-4o | 98.7% | 94.2% | 88.5% | 93.8% |
| Claude 3.5 Sonnet | 97.8% | 95.1% | 86.3% | 93.1% |
| Gemini 1.5 Pro | 96.4% | 89.7% | 79.2% | 88.4% |
| Qwen-Max | 95.8% | 87.3% | 76.8% | 86.6% |
| DeepSeek-V3 | 94.2% | 85.6% | 74.1% | 84.6% |
| GLM-4-Plus | 93.7% | 83.4% | 71.9% | 83.0% |
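If you want to sanity-check the Overall column: it lines up with the plain average of the three group scores (each group has 50 cases, so weighting wouldn't change anything). A quick sketch that reproduces it from the table, with the `scores` dict and `overall` helper named just for this example:

```python
# Per-group accuracy (%) copied from the table: basic, array, deep + missing.
scores = {
    "GPT-4o": (98.7, 94.2, 88.5),
    "Claude 3.5 Sonnet": (97.8, 95.1, 86.3),
    "Gemini 1.5 Pro": (96.4, 89.7, 79.2),
    "Qwen-Max": (95.8, 87.3, 76.8),
    "DeepSeek-V3": (94.2, 85.6, 74.1),
    "GLM-4-Plus": (93.7, 83.4, 71.9),
}


def overall(groups: tuple[float, ...]) -> float:
    """Unweighted mean of the group scores, rounded to one decimal place."""
    return round(sum(groups) / len(groups), 1)


for model, groups in scores.items():
    print(f"{model}: {overall(groups)}%")
```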
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.