I Tested 6 AI Models on Nested JSON Function Calling — The Results Were a Hot Mess

Last Tuesday, I spent an entire afternoon staring at logs, coffee going cold, wondering why my perfectly clear prompt kept dropping order.items[0].discount from the output. The model was supposed to extract a nested JSON structure for an e-commerce order. Instead, it decided that the discount field was optional life advice.

That's when I knew — this wasn't going to be simple.

So I did what any slightly-obsessive developer would do: I built a test suite and ran six major models through the wringer on complex nested JSON extraction. Some performed like champs. Others... let's just say I had to clean coffee off my monitor.

TL;DR

When it comes to deeply nested JSON schemas, model performance varies wildly
GPT-4o and Claude 3.5 Sonnet lead the pack, but they both have weird failure modes
Chinese models have improved dramatically, but 3+ levels of nesting is still their kryptonite
Real test data + war stories below — this'll save you hours of debugging

Why You Should Care About Nested Function Calling

Honestly, if you're just calling a weather API or looking up stock prices, any model on the market will do fine. Flat parameters, simple structures — you're looking at 95%+ accuracy across the board.

But real business logic? It's never that gentle.

Here's the schema I'm dealing with for my e-commerce project's order creation endpoint:


{
  "type": "object",
  "properties": {
    "order": {
      "type": "object",
      "properties": {
        "customer": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "address": {
              "type": "object",
              "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "geo": {
                  "type": "object",
                  "properties": {
                    "lat": {"type": "number"},
                    "lng": {"type": "number"}
                  }
                }
              }
            }
          }
        },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "sku": {"type": "string"},
              "quantity": {"type": "integer"},
              "discount": {
                "type": "object",
                "properties": {
                  "type": {"type": "string"},
                  "value": {"type": "number"}
                }
              }
            }
          }
        }
      }
    }
  }
}

Four levels deep. Arrays containing objects containing more objects. A user says "order two black t-shirts, use that 20% off coupon from last time," and the model needs to populate this entire structure correctly.

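Get it wrong, and the order blows up. My downstream system is written in Go. json.Unmarshal itself doesn't complain about a missing nested level; it leaves the corresponding struct field at its zero value (nil, for a pointer), and the first piece of code that dereferences it panics. The entire request chain returns a 500. Not great at 3 AM.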

The Test Setup: Keeping It Real

I tested six models, all running their latest versions as of April 2025:

GPT-4o (OpenAI, gpt-4o-2025-01-29)
Claude 3.5 Sonnet (Anthropic, claude-3-5-sonnet-20250316)
Gemini 1.5 Pro (Google, gemini-1.5-pro-20250201)
Qwen-Max (Alibaba, qwen-max-20250328)
DeepSeek-V3 (DeepSeek, deepseek-v3-20250315)
GLM-4-Plus (Zhipu, glm-4-plus-20250210)

Quick caveat: I used the GLM-4-Plus release from February 10th, but Zhipu shipped a minor update in late March that specifically addressed tool calling bugs. I didn't have time to retest that version, so the numbers below only reflect the February release. Just flagging that upfront so I don't mislead anyone.

I designed three test groups with escalating difficulty:

Basic nesting: 2-level object nesting, no arrays
Array nesting: Object arrays where each element contains nested objects (that order schema above)
Deep nesting + missing fields: 4 levels deep, with user inputs deliberately omitting certain fields — testing the model's ability to infer and complete

50 test cases per group, mixed Chinese and English. Because let's be real — in international e-commerce, users switch between "帮我 apply 一个 discount code" and "这个 order 走 VIP channel" constantly. The model needs to handle it.

The metric? Field-level match rate between extracted JSON and expected output. I used Python's deepdiff library with ignore_order=True, so the comparison covers structure and values but ignores the order of array elements (key order inside a JSON object never matters anyway).
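To make the metric concrete, here is a minimal stdlib-only sketch of what a field-level match rate could look like. It is not the deepdiff-based harness behind the numbers in this post, and the scoring formula is an assumption: flatten both documents to leaf paths, then count how many expected leaves show up in the output with the same value. The names flatten and match_rate are made up for illustration, and unlike ignore_order=True this version compares array elements by index.

```python
def flatten(node, prefix=""):
    """Map every leaf value to a path such as 'order.items[0].sku'."""
    if isinstance(node, dict) and node:
        leaves = {}
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            leaves.update(flatten(value, path))
        return leaves
    if isinstance(node, list) and node:
        leaves = {}
        for index, value in enumerate(node):
            leaves.update(flatten(value, f"{prefix}[{index}]"))
        return leaves
    # Scalars, None, and empty containers each count as a single leaf.
    return {prefix: node}


def match_rate(expected, actual):
    """Fraction of expected leaf fields present in actual with equal values."""
    want = flatten(expected)
    got = flatten(actual)
    hits = sum(1 for path, value in want.items() if path in got and got[path] == value)
    return hits / len(want)


expected = {"order": {"items": [{"sku": "latte", "quantity": 2,
                                 "discount": {"type": "membership", "value": None}}]}}
actual = {"order": {"items": [{"sku": "latte", "quantity": 2, "discount": None}]}}

print(f"{match_rate(expected, actual):.1%}")  # 50.0% (sku and quantity match, both discount leaves are lost)
```

Scoring by leaf path means a dropped parent object costs one miss per field it should have contained, which is why deep omissions hurt the score so much.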

The Numbers: Who Delivered and Who Faceplanted

Here's the raw data:

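| Model | Basic Nesting | Array Nesting | Deep + Missing | Overall |
|---|---|---|---|---|
| GPT-4o | 98.7% | 94.2% | 88.5% | 93.8% |
| Claude 3.5 Sonnet | 97.8% | 95.1% | 86.3% | 93.1% |
| Gemini 1.5 Pro | 96.4% | 89.7% | 79.2% | 88.4% |
| Qwen-Max | 95.8% | 87.3% | 76.8% | 86.6% |
| DeepSeek-V3 | 94.2% | 85.6% | 74.1% | 84.6% |
| GLM-4-Plus | 93.7% | 83.4% | 71.9% | 83.0% |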
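The Overall column is not defined in the text, but it matches the unweighted mean of the three group scores, which makes sense given that each group has 50 cases. A quick arithmetic check on the reported figures:

```python
# (basic, array, deep + missing, reported overall), copied from the results.
scores = {
    "GPT-4o": (98.7, 94.2, 88.5, 93.8),
    "Claude 3.5 Sonnet": (97.8, 95.1, 86.3, 93.1),
    "Gemini 1.5 Pro": (96.4, 89.7, 79.2, 88.4),
    "Qwen-Max": (95.8, 87.3, 76.8, 86.6),
    "DeepSeek-V3": (94.2, 85.6, 74.1, 84.6),
    "GLM-4-Plus": (93.7, 83.4, 71.9, 83.0),
}


def overall(basic, array, deep):
    """Unweighted mean of the three group scores, rounded to one decimal."""
    return round((basic + array + deep) / 3, 1)


for model, (basic, array, deep, reported) in scores.items():
    print(f"{model:18} computed={overall(basic, array, deep):.1f} reported={reported:.1f}")
```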

A few things jumped out at me:

GPT-4o Isn't Invincible

It's the strongest on basic nesting, sure. But on deep nesting with missing fields? Accuracy drops to 88.5%. I dug through the failure cases and spotted a pattern — it loves to hallucinate. User didn't specify a discount type? GPT-4o confidently filled in "percentage". Problem is, our system uses an enum for discount types, and "percentage" isn't on the whitelist. Downstream validation rejected it immediately.
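A cheap guard against this failure mode is to check enum-like fields against the whitelist before the payload goes downstream. The sketch below uses a hypothetical whitelist: the post does not list the real enum values, only that "percentage" is not one of them.

```python
# Hypothetical whitelist; the real enum values are not given in the post.
ALLOWED_DISCOUNT_TYPES = {"membership", "coupon"}


def invalid_discount_types(order):
    """Return (item index, value) pairs whose discount.type is off the whitelist."""
    problems = []
    for index, item in enumerate(order.get("items", [])):
        discount = item.get("discount") or {}
        discount_type = discount.get("type")
        if discount_type is None:
            continue  # a missing type is a different problem, not an enum violation
        if not isinstance(discount_type, str) or discount_type not in ALLOWED_DISCOUNT_TYPES:
            problems.append((index, discount_type))
    return problems


order = {"items": [
    {"sku": "tshirt-black", "quantity": 2, "discount": {"type": "percentage", "value": 20}},
    {"sku": "latte", "quantity": 1, "discount": {"type": "membership", "value": None}},
]}

print(invalid_discount_types(order))  # [(0, 'percentage')]
```

Listing the allowed values in the schema's enum keyword and in the function description is also worth trying, though it does not remove the need for the check.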

Claude 3.5 Sonnet Handles Arrays Surprisingly Well

It scored highest on array nesting at 95.1%. I examined the outputs closely — Claude maintains field completeness across array elements remarkably well. You rarely see cases where items[0] has a discount object but items[1] doesn't. In order processing, this matters enormously. You can't have the first item discounted and the second at full price while the system only parses one discount object.
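Consistency across array elements is easy to check mechanically. A minimal sketch, assuming every item is expected to carry sku, quantity, and discount (the property names come from the schema above; treating all three as mandatory is an assumption):

```python
EXPECTED_ITEM_KEYS = ("sku", "quantity", "discount")  # assumed mandatory


def incomplete_items(items):
    """Map item index -> list of expected keys that are absent or null."""
    report = {}
    for index, item in enumerate(items):
        missing = [key for key in EXPECTED_ITEM_KEYS if item.get(key) is None]
        if missing:
            report[index] = missing
    return report


items = [
    {"sku": "tshirt-black", "quantity": 2, "discount": {"type": "coupon", "value": 20}},
    {"sku": "latte", "quantity": 1},
]

print(incomplete_items(items))  # {1: ['discount']}
```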

From what I understand, Anthropic did some targeted RLHF training on structured outputs starting late last year. I don't know the exact details, but the results speak for themselves.

Chinese Models: Solid at 2 Levels, Wheezing at 4

This one's... complicated.

Qwen-Max and DeepSeek-V3 actually perform well on basic nesting: 95.8% and 94.2% are perfectly usable. But push to four levels with missing fields, and accuracy plummets to the mid-70s (76.8% and 74.1%). The typical failure mode? Deep fields just vanish. Not filled incorrectly, but completely absent. The entire order.customer.address.geo object would just... not exist. The logs from my Python test harness were flooded with KeyError: 'geo'.
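Silent omissions like this can be caught before anything indexes into the payload, by walking the schema and listing every declared property that is absent from the output. The following is a stdlib-only sketch; missing_paths is an illustrative name, and it treats every declared property as expected, which is stricter than JSON Schema's own required semantics.

```python
def missing_paths(schema, data, path=""):
    """List paths of schema-declared properties that are absent from data."""
    missing = []
    if schema.get("type") == "object":
        if not isinstance(data, dict):
            # null (or a scalar) where an object should be: report the object itself.
            return [path or "<root>"]
        for name, subschema in schema.get("properties", {}).items():
            child = f"{path}.{name}" if path else name
            if name not in data:
                missing.append(child)
            else:
                missing.extend(missing_paths(subschema, data[name], child))
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, element in enumerate(data):
            missing.extend(missing_paths(schema.get("items", {}), element, f"{path}[{index}]"))
    return missing


address_schema = {
    "type": "object",
    "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "geo": {
            "type": "object",
            "properties": {"lat": {"type": "number"}, "lng": {"type": "number"}},
        },
    },
}

print(missing_paths(address_schema, {"street": "Wangjing SOHO", "city": "Beijing"}))  # ['geo']
```

Running this on every response turns a KeyError deep in the call stack into a readable list of what the model left out.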

War Story: The Bug That Cost Me Three Hours

Let me tell you about my favorite failure.

Testing DeepSeek-V3, I had this input: "Ship to Wangjing SOHO in Beijing's Chaoyang district, two lattes, use membership pricing."

Expected output: items[0].discount.type should be "membership", items[0].discount.value should be null. In our system, membership pricing uses a separate calculation engine — we don't need a specific value at the order level.

DeepSeek returned:


{
  "items": [
    {
      "sku": "latte",
      "quantity": 2,
      "discount": null
    }
  ]
}

It set the entire discount object to null.

Boom.

The downstream order system uses strict struct parsing. The discount field expects an object, gets null, and the first access to it hits a nil pointer dereference. Entire chain returns 500. Order creation fails.

I initially thought my prompt wasn't clear enough. Rewrote it three, four times. Even added: "If a field is optional but its parent object is required, keep the parent object with null fields." Didn't help. Still dropped it.

My theory, and this is just speculation, is that the model confuses "optional fields within an object" with "the entire object is optional." It figures: if I don't know what value to put, the whole discount object can go. Except the schema explicitly defines discount as an object type, and it's in the required fields list (the required lists are not shown in the schema excerpt above).
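When prompting alone doesn't fix this, one workaround is to repair the shape after the fact: wherever the schema declares an object and the model returned null, substitute an object whose declared properties are all null. A minimal sketch of that idea follows; expand_null_objects is a hypothetical helper, not part of any library, and it assumes downstream code tolerates null leaf values.

```python
def expand_null_objects(schema, data):
    """Replace null where the schema expects an object with an object of null fields."""
    schema_type = schema.get("type")
    if schema_type == "object":
        properties = schema.get("properties", {})
        if data is None:
            # Rebuild the expected shape; nested objects are expanded recursively.
            return {name: expand_null_objects(sub, None) for name, sub in properties.items()}
        if isinstance(data, dict):
            return {
                key: expand_null_objects(properties[key], value) if key in properties else value
                for key, value in data.items()
            }
    if schema_type == "array" and isinstance(data, list):
        return [expand_null_objects(schema.get("items", {}), element) for element in data]
    return data


items_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "sku": {"type": "string"},
            "quantity": {"type": "integer"},
            "discount": {
                "type": "object",
                "properties": {"type": {"type": "string"}, "value": {"type": "number"}},
            },
        },
    },
}

returned = [{"sku": "latte", "quantity": 2, "discount": None}]
print(expand_null_objects(items_schema, returned))
# [{'sku': 'latte', 'quantity': 2, 'discount': {'type': None, 'value': None}}]
```

The repaired object still lacks the "membership" type, so this only prevents the crash. It does not recover what the user asked for, and that case should still be logged or retried.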

These are the bugs that kill you in production. There's a running joke in developer circles: "AI helps you write code, and AI also helps you write bugs." Yeah. That.

How to Choose: Don't Just Look at Benchmarks

Based on this round of testing, here's my advice:

Complex business logic (3+ levels + arrays): Go with GPT-4o or Claude 3.5 Sonnet. GPT-4o is the strongest overall, Claude handles arrays better. If budget allows, use both with a fallback mechanism.
Medium complexity (2 levels max + light arrays): Qwen-Max offers great value. Best performer among Chinese models, low latency too. We use Alibaba Cloud internally — latency is around 200ms. GPT-4o routing through US servers often hits 800ms+.
Simple scenarios: Pick whatever's cheapest. GLM-4-Plus works fine and won't break the bank. Zhipu's running a promotion right now — 50% off API calls, though I forget when it ends. Check their website.
Latency-sensitive applications: DeepSeek-V3 is genuinely fast. But for nested scenarios, you absolutely must add post-processing validation. Otherwise, your production incident rate will make you question your career choices.
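The fallback mechanism suggested for complex business logic can be sketched as a small wrapper: try the primary model, validate the result, and only call the next model if validation fails. Everything here is hypothetical scaffolding. The extractor functions stand in for real API clients, and discounts_are_objects stands in for whatever schema check is actually used.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    prompt: str,
    extractors: Sequence[Extractor],
    is_valid: Callable[[dict], bool],
) -> Optional[dict]:
    """Return the first structurally valid result, trying extractors in order."""
    for extract in extractors:
        try:
            result = extract(prompt)
        except Exception:
            # A failed call (timeout, rate limit, ...) counts as a miss.
            continue
        if isinstance(result, dict) and is_valid(result):
            return result
    return None


# Stand-ins for real model clients, for demonstration only.
def primary(prompt):
    return {"items": [{"sku": "latte", "quantity": 2, "discount": None}]}


def secondary(prompt):
    return {"items": [{"sku": "latte", "quantity": 2,
                       "discount": {"type": "membership", "value": None}}]}


def discounts_are_objects(order):
    return all(isinstance(item.get("discount"), dict) for item in order.get("items", []))


result = extract_with_fallback("two lattes, membership pricing",
                               [primary, secondary], discounts_are_objects)
print(result)
```

Here the primary result fails the check, so the wrapper returns the secondary one. Returning None when every model fails leaves the caller to decide between a retry and a user-facing error.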

One more thing — regardless of which model you pick, add a schema validation layer after Function Calling. I wrote a validation middleware using Python's jsonschema library, and it's caught so much "creative interpretation" from models. The approach: validate JSON structure first, fill missing fields with defaults, strip extra fields, then run everything through a business rules engine. Our production incident rate dropped by an order of magnitude after implementing this.
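The first three steps of that pipeline (check structure, fill missing fields, strip extras) can be sketched without any dependencies. This is not the jsonschema-based middleware described above. It is a simplified stdlib-only illustration that understands only type, properties, and items, and using null as the default for a missing scalar is an assumption.

```python
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def conform(schema, data, errors, path="$"):
    """Return data reshaped to the schema; type mismatches are appended to errors."""
    schema_type = schema.get("type")
    if schema_type == "object":
        source = data if isinstance(data, dict) else {}
        # Iterating over declared properties both fills gaps and drops extras.
        return {
            name: conform(sub, source.get(name), errors, f"{path}.{name}")
            for name, sub in schema.get("properties", {}).items()
        }
    if schema_type == "array":
        elements = data if isinstance(data, list) else []
        return [
            conform(schema.get("items", {}), element, errors, f"{path}[{i}]")
            for i, element in enumerate(elements)
        ]
    if data is None:
        return None  # assumed default for a missing scalar
    check = TYPE_CHECKS.get(schema_type)
    if check is not None and not check(data):
        errors.append(f"{path}: expected {schema_type}, got {type(data).__name__}")
    return data


item_schema = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "quantity": {"type": "integer"},
        "discount": {
            "type": "object",
            "properties": {"type": {"type": "string"}, "value": {"type": "number"}},
        },
    },
}

errors = []
raw = {"sku": "latte", "quantity": 2, "discount": {"value": "free", "note": "VIP"}}
clean = conform(item_schema, raw, errors)
print(clean)   # {'sku': 'latte', 'quantity': 2, 'discount': {'type': None, 'value': 'free'}}
print(errors)  # ['$.discount.value: expected number, got str']
```

The undeclared note field is dropped, the missing discount.type is filled with null, and the string "free" is reported rather than silently passed on. Business rules, such as the enum whitelist, would run on the cleaned result afterwards.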

The Bottom Line

Function Calling was mind-blowing when it launched in 2023 — "wait, it can call functions?" By 2025, we're arguing about whether it can handle four levels of nesting without dropping a single field. The progress is real, but we're still far from "throw a schema at it and get perfect output."

I've developed a habit: every time a model gets updated, the first thing I do isn't check benchmarks. I run my own test suite. Benchmarks won't tell you that your order system will explode because of a null value. They won't tell you what it feels like to get paged at 3 AM to fix an AI-generated bug.

What's your experience with Function Calling in production? Ever seen a model "creatively" fill in parameters? My personal favorite: a model once set discount.value to the string "free". The downstream system's type conversion failed spectacularly. Drop your war stories in the comments — I'll buy you a coffee. Metaphorically.

#functioncalling #ai #programming #webdev #machinelearning



Cael Lee
