I Tested 6 AI Models on Nested JSON Function Calling — The Results Were a Hot Mess

By Cael Lee | 8 min read

Last Tuesday, I spent an entire afternoon staring at logs, coffee going cold, wondering why my perfectly clear prompt kept dropping order.items[0].discount from the output. The model was supposed to extract a nested JSON structure for an e-commerce order. Instead, it decided that the discount field was optional life advice.

That's when I knew — this wasn't going to be simple.

So I did what any slightly-obsessive developer would do: I built a test suite and ran six major models through the wringer on complex nested JSON extraction. Some performed like champs. Others... let's just say I had to clean coffee off my monitor.

TL;DR

Why You Should Care About Nested Function Calling

Honestly, if you're just calling a weather API or looking up stock prices, any model on the market will do fine. Flat parameters, simple structures — you're looking at 95%+ accuracy across the board.

But real business logic? It's never that gentle.

Here's the schema I'm dealing with for my e-commerce project's order creation endpoint:


{
  "type": "object",
  "properties": {
    "order": {
      "type": "object",
      "properties": {
        "customer": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "address": {
              "type": "object",
              "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "geo": {
                  "type": "object",
                  "properties": {
                    "lat": {"type": "number"},
                    "lng": {"type": "number"}
                  }
                }
              }
            }
          }
        },
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "sku": {"type": "string"},
              "quantity": {"type": "integer"},
              "discount": {
                "type": "object",
                "properties": {
                  "type": {"type": "string"},
                  "value": {"type": "number"}
                }
              }
            }
          }
        }
      }
    }
  }
}

Four levels deep. Arrays containing objects containing more objects. A user says "order two black t-shirts, use that 20% off coupon from last time," and the model needs to populate this entire structure correctly.

Get it wrong, and the order blows up. My downstream system is written in Go. json.Unmarshal doesn't complain about a missing nested level; it quietly leaves that field nil, and the first dereference panics. The entire request chain returns a 500. Not great at 3 AM.

The Test Setup: Keeping It Real

I tested six models, all running their latest versions as of April 2025: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Qwen-Max, DeepSeek-V3, and GLM-4-Plus.

Quick correction — I used GLM-4-Plus from February 10th, but Zhipu released a minor update in late March that specifically addressed tool calling bugs. I didn't have time to retest that version, so the numbers below only reflect the February release. Just flagging that upfront so I don't mislead anyone.

I designed three test groups with escalating difficulty:

  1. Basic nesting: 2-level object nesting, no arrays
  2. Array nesting: Object arrays where each element contains nested objects (that order schema above)
  3. Deep nesting + missing fields: 4 levels deep, with user inputs deliberately omitting certain fields — testing the model's ability to infer and complete

50 test cases per group, mixed Chinese and English. Because let's be real — in international e-commerce, users switch between "帮我 apply 一个 discount code" and "这个 order 走 VIP channel" constantly. The model needs to handle it.

The metric? Field-level match rate between extracted JSON and expected output. I used Python's deepdiff library with ignore_order=True, so only structure and values are compared and the order of array elements doesn't count as a difference.
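To make "field-level match rate" concrete, here is a minimal stdlib-only sketch. It is not the deepdiff-based harness described above: the names flatten and field_match_rate are hypothetical, arrays are compared by position (unlike ignore_order=True), and extra fields in the model output are not penalized. Those are all simplifying assumptions.

```python
from typing import Any, Dict, Tuple


def flatten(value: Any, path: Tuple = ()) -> Dict[Tuple, Any]:
    """Map every leaf of a JSON-like value to its path, e.g. ('items', 0, 'sku')."""
    if isinstance(value, dict) and value:
        leaves: Dict[Tuple, Any] = {}
        for key, child in value.items():
            leaves.update(flatten(child, path + (key,)))
        return leaves
    if isinstance(value, list) and value:
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, path + (index,)))
        return leaves
    # Scalars, null, and empty containers are treated as leaves.
    return {path: value}


def field_match_rate(expected: Any, actual: Any) -> float:
    """Fraction of expected leaf fields present in `actual` with an equal value."""
    want = flatten(expected)
    got = flatten(actual)
    hits = sum(1 for path, value in want.items() if path in got and got[path] == value)
    return hits / len(want)


expected = {"items": [{"sku": "latte", "quantity": 2,
                       "discount": {"type": "membership", "value": None}}]}
actual = {"items": [{"sku": "latte", "quantity": 2, "discount": None}]}
print(field_match_rate(expected, actual))  # 0.5: both discount leaves are lost
```

Under this definition, a model that collapses a whole nested object to null loses every leaf beneath it, which is why deep structures are punished harder than flat ones.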

The Numbers: Who Delivered and Who Faceplanted

Here's the raw data:

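| Model | Basic Nesting | Array Nesting | Deep + Missing | Overall |
|---|---|---|---|---|
| GPT-4o | 98.7% | 94.2% | 88.5% | 93.8% |
| Claude 3.5 Sonnet | 97.8% | 95.1% | 86.3% | 93.1% |
| Gemini 1.5 Pro | 96.4% | 89.7% | 79.2% | 88.4% |
| Qwen-Max | 95.8% | 87.3% | 76.8% | 86.6% |
| DeepSeek-V3 | 94.2% | 85.6% | 74.1% | 84.6% |
| GLM-4-Plus | 93.7% | 83.4% | 71.9% | 83.0% |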
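The Overall column is consistent with a plain unweighted mean of the three group scores, which makes sense given that every group has 50 cases. That reading is an inference from the numbers, not something stated in the write-up; the sketch below simply recomputes it (the overall helper is a hypothetical name).

```python
from statistics import mean

# Accuracy (%) per test group, copied from the results:
# (basic nesting, array nesting, deep + missing)
GROUP_SCORES = {
    "GPT-4o": (98.7, 94.2, 88.5),
    "Claude 3.5 Sonnet": (97.8, 95.1, 86.3),
    "Gemini 1.5 Pro": (96.4, 89.7, 79.2),
    "Qwen-Max": (95.8, 87.3, 76.8),
    "DeepSeek-V3": (94.2, 85.6, 74.1),
    "GLM-4-Plus": (93.7, 83.4, 71.9),
}


def overall(scores):
    """Unweighted mean of the three group scores, rounded to one decimal."""
    return round(mean(scores), 1)


for model, scores in GROUP_SCORES.items():
    print(f"{model}: {overall(scores)}")
```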

A few things jumped out at me:

GPT-4o Isn't Invincible

It's the strongest on basic nesting, sure. But on deep nesting with missing fields? Accuracy drops to 88.5%. I dug through the failure cases and spotted a pattern — it loves to hallucinate. User didn't specify a discount type? GPT-4o confidently filled in "percentage". Problem is, our system uses an enum for discount types, and "percentage" isn't on the whitelist. Downstream validation rejected it immediately.
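A whitelist check like the one that rejected "percentage" can be sketched in a few lines. The real enum isn't listed in this post, so ALLOWED_DISCOUNT_TYPES below is hypothetical: only "membership" appears in the article, and the other two values are placeholders. The function name is likewise an assumption.

```python
from typing import Any, Dict, List

# Hypothetical whitelist: only "membership" comes from the article,
# the other values are placeholders for illustration.
ALLOWED_DISCOUNT_TYPES = {"membership", "percent_off", "fixed_amount"}


def invalid_discount_types(order: Dict[str, Any]) -> List[str]:
    """Return one message per item whose discount.type is outside the whitelist."""
    problems = []
    for index, item in enumerate(order.get("items") or []):
        discount = item.get("discount") if isinstance(item, dict) else None
        if not isinstance(discount, dict):
            continue  # missing or null discounts are a separate failure mode
        kind = discount.get("type")
        if not isinstance(kind, str) or kind not in ALLOWED_DISCOUNT_TYPES:
            problems.append(f"items[{index}].discount.type: {kind!r} is not allowed")
    return problems


order = {"items": [
    {"sku": "tee-black", "quantity": 2, "discount": {"type": "percentage", "value": 20}},
    {"sku": "latte", "quantity": 1, "discount": {"type": "membership", "value": None}},
]}
print(invalid_discount_types(order))
```

Declaring the allowed values as an enum inside the schema handed to the model may also reduce this kind of guess, though that is untested here.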

Claude 3.5 Sonnet Handles Arrays Surprisingly Well

It scored highest on array nesting at 95.1%. I examined the outputs closely — Claude maintains field completeness across array elements remarkably well. You rarely see cases where items[0] has a discount object but items[1] doesn't. In order processing, this matters enormously. You can't have the first item discounted and the second at full price while the system only parses one discount object.

From what I understand, Anthropic did some targeted RLHF training on structured outputs starting late last year. I don't know the exact details, but the results speak for themselves.

Chinese Models: Solid at 2 Levels, Wheezing at 4

This one's... complicated.

Qwen-Max and DeepSeek-V3 actually perform well on basic nesting — ~95% accuracy is perfectly usable. But push to four levels with missing fields, and accuracy plummets to the 70% range. The typical failure mode? Deep fields just vanish. Not filled incorrectly — completely absent. The entire order.customer.address.geo object would just... not exist. My logs were flooded with KeyError: 'geo'.
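The "fields just vanish" failure can be surfaced before it turns into a KeyError by walking the schema and the output side by side. The sketch below is one way to do that under a stated assumption: every property the schema declares is treated as expected, even though the schema shown earlier has no required list. missing_paths and ADDRESS_SCHEMA are hypothetical names, and the schema is a trimmed copy of the address branch.

```python
from typing import Any, Dict, List


def missing_paths(schema: Dict[str, Any], data: Any, path: str = "") -> List[str]:
    """List every schema-declared property that `data` does not contain."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(data, dict):
            return [path or "<root>"]  # null or wrong type where an object belongs
        missing = []
        for name, sub in schema.get("properties", {}).items():
            child = f"{path}.{name}" if path else name
            if name not in data:
                missing.append(child)
            else:
                missing.extend(missing_paths(sub, data[name], child))
        return missing
    if kind == "array":
        if not isinstance(data, list):
            return [path or "<root>"]
        missing = []
        for index, element in enumerate(data):
            missing.extend(missing_paths(schema.get("items", {}), element, f"{path}[{index}]"))
        return missing
    return []  # scalars: presence was already checked by the parent object


ADDRESS_SCHEMA = {
    "type": "object",
    "properties": {
        "street": {"type": "string"},
        "city": {"type": "string"},
        "geo": {
            "type": "object",
            "properties": {"lat": {"type": "number"}, "lng": {"type": "number"}},
        },
    },
}

output = {"street": "Wangjing SOHO", "city": "Beijing"}
print(missing_paths(ADDRESS_SCHEMA, output, "order.customer.address"))
# ['order.customer.address.geo']
```

Logging these paths per model gives a quick picture of which nesting depth starts losing fields.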

War Story: The Bug That Cost Me Three Hours

Let me tell you about my favorite failure.

Testing DeepSeek-V3, I had this input: "Ship to Wangjing SOHO in Beijing's Chaoyang district, two lattes, use membership pricing."

Expected output: items[0].discount.type should be "membership", items[0].discount.value should be null. In our system, membership pricing uses a separate calculation engine — we don't need a specific value at the order level.

DeepSeek returned:


{
  "items": [
    {
      "sku": "latte",
      "quantity": 2,
      "discount": null
    }
  ]
}

It set the entire discount object to null.

Boom.

The downstream order system uses strict struct parsing. The discount field expects an object, gets null, and the first access hits a nil pointer dereference. Entire chain returns 500. Order creation fails.

I initially thought my prompt wasn't clear enough. Rewrote it three, four times. Even added: "If a field is optional but its parent object is required, keep the parent object with null fields." Didn't help. Still dropped it.

My theory — and this is just speculation — is that the model confuses "optional fields within an object" with "the entire object is optional." It figures: if I don't know what value to put, the whole discount object can go. Except the schema explicitly defines discount as an object type, and it's in the required fields list.

These are the bugs that kill you in production. There's a running joke in developer circles: "AI helps you write code, and AI also helps you write bugs." Yeah. That.

How to Choose: Don't Just Look at Benchmarks

Based on this round of testing, here's my advice:

One more thing — regardless of which model you pick, add a schema validation layer after Function Calling. I wrote a validation middleware using Python's jsonschema library, and it's caught so much "creative interpretation" from models. The approach: validate JSON structure first, fill missing fields with defaults, strip extra fields, then run everything through a business rules engine. Our production incident rate dropped by an order of magnitude after implementing this.
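The "fill missing fields, strip extra fields" steps can be sketched without any third-party dependency. This is not the jsonschema-based middleware described above, just an illustration of the shape-repair idea: normalize is a hypothetical name, defaulting every gap to None is an assumption, and scalar types are not checked here (that is the job of the validation step that runs first).

```python
from typing import Any, Dict


def normalize(schema: Dict[str, Any], data: Any) -> Any:
    """Return a copy of `data` shaped like `schema`: extras stripped, gaps filled with None."""
    kind = schema.get("type")
    if kind == "object":
        # A null or missing object becomes an empty shell, so its children get filled in.
        source = data if isinstance(data, dict) else {}
        return {
            name: normalize(sub, source.get(name))
            for name, sub in schema.get("properties", {}).items()
        }
    if kind == "array":
        elements = data if isinstance(data, list) else []
        return [normalize(schema.get("items", {}), element) for element in elements]
    return data  # scalars pass through unchanged; a missing scalar stays None


ITEM_SCHEMA = {
    "type": "object",
    "properties": {
        "sku": {"type": "string"},
        "quantity": {"type": "integer"},
        "discount": {
            "type": "object",
            "properties": {"type": {"type": "string"}, "value": {"type": "number"}},
        },
    },
}

raw = {"sku": "latte", "quantity": 2, "discount": None, "note": "extra hot"}
print(normalize(ITEM_SCHEMA, raw))
# {'sku': 'latte', 'quantity': 2, 'discount': {'type': None, 'value': None}}
```

Rebuilding the output from the schema, rather than patching the model's JSON in place, means a strict downstream parser always sees the full object tree, with nulls only at the leaves.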

The Bottom Line

Function Calling was mind-blowing when it launched in 2023 — "wait, it can call functions?" By 2025, we're arguing about whether it can handle four levels of nesting without dropping a single field. The progress is real, but we're still far from "throw a schema at it and get perfect output."

I've developed a habit: every time a model gets updated, the first thing I do isn't check benchmarks. I run my own test suite. Benchmarks won't tell you that your order system will explode because of a null value. They won't tell you what it feels like to get paged at 3 AM to fix an AI-generated bug.

What's your experience with Function Calling in production? Ever seen a model "creatively" fill in parameters? My personal favorite: a model once set discount.value to the string "free". The downstream system's type conversion failed spectacularly. Drop your war stories in the comments — I'll buy you a coffee. Metaphorically.

#functioncalling #ai #programming #webdev #machinelearning


Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
