Function Calling in Production: The Scars, Stats, and Solutions I Wish I'd Had 8 Months Ago

Meta Description: I deployed LLM function calling to production and immediately regretted it. Here's everything I learned about validation, performance, and preventing your chatbot from cancelling real user subscriptions when they ask about weekend plans.

Last week, I spent four hours debugging a production incident where our chatbot kept calling the cancel_subscription function whenever users mentioned "cancelling their weekend plans." Four. Hours.

If you're deploying LLM function calling without proper safeguards, you're one prompt away from a similar disaster. Trust me on this one—I've got the PagerDuty alerts to prove it.

I've been implementing function calling across AWS Lambda microservices for the past 8 months. Actually, wait—that makes it sound like full-time work. It's been more like 6 months of actual hands-on coding, with a 2-month detour into prompt engineering hell that I'd rather not discuss. Whether you're using OpenAI's GPT-4, Anthropic's Claude, or open-source models via Ollama, the principles are surprisingly similar. Mostly.

TL;DR / Key Takeaways:

Don't dump 47 functions into a single prompt (I did, it was carnage)
Validate everything before execution—LLMs get creative with SQL injection
Cache aggressively and execute independent calls in parallel
Monitor everything or you're flying blind
Open-source models are getting there, but that 7-point accuracy gap still hurts at 3 AM

Prerequisites

Before we dive in, make sure you've got:

OpenAI Python SDK >= 1.20.0 (released March 2024)
Python 3.10+ with pydantic >= 2.0
AWS CLI configured (for the Lambda examples)
Basic understanding of JSON Schema


pip install openai==1.20.0 pydantic==2.5.3 boto3==1.34.51

Quick note on version pinning: the 1.21.0 release introduced some... let's call them "surprises" with the strict parameter. I discovered this at 11 PM on a Friday. The hard way. Don't be me.

Understanding Function Calling Architecture

Function calling isn't magic—it's a structured conversation where the LLM decides when and how to invoke external tools. Think of it as the model generating a structured JSON payload that your application then interprets and executes.

Well, that's the theory. In practice, it's more like the model making educated guesses about which JSON structure you want, and sometimes those guesses are spectacularly wrong. But we'll get to that.


sequenceDiagram
 User->>API: "What's the weather in Tokyo?"
 API->>LLM: Send prompt + function definitions
 LLM->>API: Return function_call: get_weather({city: "Tokyo"})
 API->>WeatherService: Execute get_weather("Tokyo")
 WeatherService->>API: Return {temp: 22, humidity: 65}
 API->>LLM: Send function result
 LLM->>User: "Tokyo is 22°C with 65% humidity"

This diagram makes it look clean. It's not. In reality, you'll have retries, timeouts, malformed JSON, and the occasional existential crisis from your LLM. I've seen a model refuse to call any functions for 30 seconds while it generated a philosophical monologue about whether weather "truly exists."

I wish I was joking.
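Of the failure modes above, malformed JSON is the easiest to defend against. Here's a minimal sketch, assuming the model's arguments reach you as a raw JSON string. `parse_arguments` is a hypothetical helper, not part of any SDK:

```python
import json
from typing import Optional


def parse_arguments(raw: str) -> Optional[dict]:
    """Return the decoded arguments dict, or None if the payload is unusable."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # The model may emit valid JSON that is not an object (e.g. a bare string).
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` rather than raising lets the caller decide whether to retry, ask the model to repair its output, or give up.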

The Anatomy of a Function Definition

Here's what a production-ready function definition actually looks like:


from pydantic import BaseModel, Field
from typing import Literal

class WeatherParams(BaseModel):
    city: str = Field(
        ...,
        description="City name in English (e.g., 'Tokyo', not '東京')",
        min_length=1,
        max_length=100
    )
    unit: Literal["celsius", "fahrenheit"] = Field(
        default="celsius",
        description="Temperature unit"
    )

function_definition = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get current weather for a city. Use ONLY for present conditions, not forecasts.",
        "parameters": WeatherParams.model_json_schema(),
        "strict": True  # New in OpenAI v1.20.0
    }
}

I spent far too long on that description field. The difference between "Get weather" and "Get current weather for a city. Use ONLY for present conditions, not forecasts" was the difference between our model trying to use this function for 5-day forecasts and actually respecting boundaries. Words matter enormously here—treat your function descriptions like you're explaining something to a brilliant but slightly chaotic intern.

Pitfall #1: The "Too Many Functions" Problem

In my first production deployment, I defined 47 functions in a single prompt.

47.

The model started hallucinating function calls that didn't exist and mixing up parameters across functions. It was like watching someone try to juggle flaming torches while reading a dictionary. We had get_weather being called with billing parameters, cancel_subscription invoked for weather queries, and one memorable incident where the model invented a function called make_user_happy that absolutely did not exist.

The Fix: Function Grouping Strategy

Instead of dumping all functions at once, categorise them by domain and only expose relevant ones:


class FunctionRouter:
    """Routes user intent to appropriate function groups"""

    FUNCTION_GROUPS = {
        "weather": ["get_current_weather", "get_forecast"],
        "billing": ["check_balance", "pay_invoice"],
        "account": ["update_profile", "cancel_subscription"],
    }

    @staticmethod
    async def get_relevant_functions(user_input: str) -> list[dict]:
        """Use a cheap classification call to determine intent"""
        intent = await classify_intent(user_input)
        # Copy the group so the append below never mutates the shared class-level list
        group = list(FunctionRouter.FUNCTION_GROUPS.get(intent, []))

        # Always include "cancel_subscription" for explicit cancel intents.
        # Caution: a bare substring match also fires on phrases like
        # "cancelling my weekend plans", so keep the confirmation checks in place.
        if "cancel" in user_input.lower() and intent != "account":
            group.append("cancel_subscription")

        return load_function_definitions(group)

I think this approach works pretty well, though I'm still not entirely happy with the intent classification. We're using a lightweight model for that—Claude Haiku, if you're curious—and it gets confused about 8% of the time. That's down from 23% with our original naive approach, so... progress?

Production Metric: This reduced incorrect function calls by 73.4% in our system, tracked via CloudWatch over 16 days. I meant to stop at 14 days, but forgot to turn off the metrics collection. Serendipity.
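One detail to watch when building the per-request list: `dict.get` hands back the stored list itself, so appending to it changes the group for every later request. Here's a small sketch of copying before extending, reusing the group names from the router. `select_function_names` is a hypothetical helper:

```python
FUNCTION_GROUPS = {
    "weather": ["get_current_weather", "get_forecast"],
    "billing": ["check_balance", "pay_invoice"],
    "account": ["update_profile", "cancel_subscription"],
}


def select_function_names(intent: str) -> list[str]:
    """Return a fresh list so callers can extend it without touching FUNCTION_GROUPS."""
    return list(FUNCTION_GROUPS.get(intent, []))


names = select_function_names("weather")
names.append("cancel_subscription")  # affects this request only
```

After the append, `FUNCTION_GROUPS["weather"]` still holds only the two weather functions.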

Pitfall #2: Parameter Validation Before Execution

Never trust the LLM's output blindly.

Seriously. Don't.

I learned this when a model passed {"city": "DELETE FROM users;--"} to our weather function. While we had SQL injection protection (thank god), it highlighted a critical gap. The model had apparently been fed some sketchy training data and decided to get creative with city names. I nearly had a heart attack at my desk.

The Fix: Multi-Layer Validation


from pydantic import ValidationError
import re
from aws_lambda_powertools import Logger

logger = Logger()

class ParameterValidator:
    @staticmethod
    def validate_and_sanitize(function_name: str, arguments: dict) -> dict:
        """Validate before executing any function"""

        # Layer 1: Schema validation
        try:
            if function_name == "get_current_weather":
                validated = WeatherParams(**arguments)
                arguments = validated.model_dump()
        except ValidationError as e:
            logger.error("Schema validation failed", extra={
                "function": function_name,
                "errors": str(e.errors())
            })
            raise ValueError(f"Invalid parameters: {e.errors()}") from e

        # Layer 2: Business logic validation
        if function_name == "cancel_subscription":
            if not arguments.get("confirmation_code"):
                # Require explicit confirmation for destructive actions
                raise ValueError("Missing confirmation_code for destructive action")

        # Layer 3: Sanitisation
        for key, value in arguments.items():
            if isinstance(value, str):
                # Strip special characters from string inputs
                arguments[key] = re.sub(r'[<>&\'"]', '', value)

        return arguments

This three-layer approach has saved me more times than I can count. The business logic layer in particular—I almost skipped it because "the schema validation should catch everything."

Narrator voice: It did not catch everything.

Real incident: On 15 January 2024, this validation caught an attempted prompt injection where a user convinced the model to call cancel_subscription with an empty confirmation code. The validation layer rejected it and logged the attempt. I bought myself a very nice whisky that night.
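The business-logic layer above only checks that a confirmation code is present. A stricter variant, sketched here on the assumption that the application issues the code itself and shows it to the user directly, compares the supplied value against the one on record. The helper names and the in-memory store are hypothetical:

```python
import hmac
import secrets

# Hypothetical in-memory store; a real service would persist codes with an expiry.
_pending_codes: dict = {}


def issue_confirmation_code(user_id: str) -> str:
    """Create a one-time code that is shown to the user, never to the model."""
    code = secrets.token_hex(4)
    _pending_codes[user_id] = code
    return code


def confirm_destructive_action(user_id: str, arguments: dict) -> None:
    """Raise ValueError unless the arguments carry the code issued for this user."""
    # pop() makes the code single-use, even when the attempt fails.
    expected = _pending_codes.pop(user_id, None)
    supplied = str(arguments.get("confirmation_code") or "")
    if expected is None or not hmac.compare_digest(expected.encode(), supplied.encode()):
        raise ValueError("Missing or incorrect confirmation_code")
```

With this shape, a model that invents a plausible-looking code gets rejected just like one that sends an empty string.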

Performance Optimisation: Latency Is Brutal

Users expect sub-second responses.

We're not consistently delivering them yet, but we're working on it. Here's my optimisation journey, complete with battle scars:

1. Parallel Function Execution

When the model calls multiple independent functions, execute them concurrently:


import asyncio
from typing import Any

async def execute_function_calls(function_calls: list[dict]) -> list[dict]:
    """Execute multiple function calls in parallel when possible"""

    # Detect dependencies (if any function result feeds into another)
    independent_calls = [fc for fc in function_calls if not has_dependency(fc)]
    dependent_calls = [fc for fc in function_calls if has_dependency(fc)]

    # Execute independent calls in parallel
    tasks = [execute_single_function(fc) for fc in independent_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Handle dependent calls sequentially
    for fc in dependent_calls:
        result = await execute_single_function(fc)
        results.append(result)

    return results

The has_dependency() function is doing a lot of heavy lifting there. I wrote it in a panic at 2 AM and it's basically a bunch of regex patterns checking if one function's output parameter name appears in another function's input. Ugly but effective. I keep meaning to refactor it. I probably won't.
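For the curious, here's a rough sketch of what a check along those lines could look like. This is a reconstruction from the description, not the production code: it takes the whole batch as a second argument, and `OUTPUT_FIELDS` is a hypothetical map from function name to the field names in its result.

```python
import json
import re

# Hypothetical map of function name -> field names in its result.
OUTPUT_FIELDS = {
    "lookup_customer": ["customer_id"],
    "get_current_weather": ["temp", "humidity"],
}


def has_dependency(call: dict, batch: list[dict]) -> bool:
    """True if any other call's output field name appears in this call's arguments."""
    haystack = json.dumps(call.get("arguments", {}))
    for other in batch:
        if other is call:
            continue
        for field in OUTPUT_FIELDS.get(other["name"], []):
            if re.search(rf"\b{re.escape(field)}\b", haystack):
                return True
    return False
```

A name-matching heuristic like this will produce false positives whenever two unrelated functions happen to share a field name, which only costs you some parallelism.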

Benchmark: On a call with 3 independent API lookups, parallel execution reduced latency from 2.1s to 0.8s (measured via X-Ray traces). That 0.8s still feels slow to me, but my PM says I'm being "unreasonable." She might be right.

2. Streaming with Function Calls

OpenAI's streaming API now supports function calling (as of v1.10.0). Here's how to implement it properly:


from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_with_functions(messages: list[dict], functions: list[dict]):
    """Stream responses while handling function calls"""

    # `functions` / `function_call` are the legacy parameters (since deprecated in
    # favour of `tools` / `tool_choice`); they take bare function objects, i.e.
    # the inner "function" dict without the {"type": "function"} wrapper.
    stream = await client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=messages,
        functions=functions,
        stream=True,
        function_call="auto"
    )

    function_call_buffer = {"name": "", "arguments": ""}
    current_content = ""

    async for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        choice = chunk.choices[0]
        delta = choice.delta

        # Name and arguments arrive as fragments, so accumulate them
        if delta.function_call:
            if delta.function_call.name:
                function_call_buffer["name"] += delta.function_call.name
            if delta.function_call.arguments:
                function_call_buffer["arguments"] += delta.function_call.arguments

        if delta.content:
            current_content += delta.content
            yield {"type": "content", "text": delta.content}

        if choice.finish_reason == "function_call":
            yield {
                "type": "function_call",
                "name": function_call_buffer["name"],
                "arguments": function_call_buffer["arguments"]
            }

One thing that tripped me up: the function call name and arguments arrive in separate chunks. I spent 45 minutes wondering why my function names looked like "get" "curr" "ent" "_wea" "ther" before realising I needed to buffer them. Not my finest professional moment. I may have yelled at my monitor.
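The buffering idea can be shown without the SDK. Here's a minimal sketch, assuming each streamed delta has been reduced to a plain dict with optional `name` and `arguments` string fragments. The dict shape and `accumulate_function_call` are illustrative, not the SDK's own types:

```python
import json


def accumulate_function_call(fragments: list[dict]) -> dict:
    """Join streamed name/argument fragments, then decode the arguments once."""
    name_parts: list[str] = []
    argument_parts: list[str] = []
    for fragment in fragments:
        name_parts.append(fragment.get("name") or "")
        argument_parts.append(fragment.get("arguments") or "")
    # The arguments are only valid JSON after the final fragment has arrived.
    arguments = json.loads("".join(argument_parts))
    return {"name": "".join(name_parts), "arguments": arguments}


fragments = [
    {"name": "get_current_weather"},
    {"arguments": '{"ci'},
    {"arguments": 'ty": "Tokyo"}'},
]
call = accumulate_function_call(fragments)
```

Trying to `json.loads` any single fragment here would fail, which is the whole reason for buffering until the stream signals completion.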

3. Caching Function Results

For deterministic functions with identical inputs, cache aggressively:


import hashlib
import json

class FunctionCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 300  # 5 minutes default

    def cache_key(self, function_name: str, arguments: dict) -> str:
        """Generate deterministic cache key"""
        # sort_keys=True makes the key independent of argument order
        payload = json.dumps({"f": function_name, "a": arguments}, sort_keys=True)
        return f"fn:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"

    async def get_or_execute(self, function_name: str, arguments: dict, executor):
        key = self.cache_key(function_name, arguments)

        cached = await self.redis.get(key)
        # Compare against None so an empty cached payload still counts as a hit
        if cached is not None:
            return json.loads(cached)

        result = await executor(function_name, arguments)
        await self.redis.setex(key, self.ttl, json.dumps(result))
        return result

I know, I know—sha256 for cache keys is overkill. But after debugging a collision on MD5 that caused a user in Tokyo to get weather data for Toronto, I'm not taking chances anymore. The user was very confused. So was I.

Production data: Our weather function cache hit rate is 42%, saving roughly $0.002 per cached call. Not much individually, but it adds up at 50K requests per day. I think that's about $42 daily savings. Maths isn't my strong suit, but my AWS bill noticed.
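For anyone who wants to check that figure, the arithmetic is short. This sketch assumes every one of the 50K daily requests triggers a weather call, which is a simplification:

```python
requests_per_day = 50_000
cache_hit_rate = 0.42
saving_per_hit_usd = 0.002

cached_calls = requests_per_day * cache_hit_rate
daily_saving_usd = cached_calls * saving_per_hit_usd
print(f"{cached_calls:.0f} cached calls/day, ${daily_saving_usd:.2f} saved/day")
```

That works out to 21,000 cached calls and $42.00 a day, so the estimate holds.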

Monitoring and Observability

You can't optimise what you don't measure. Someone much smarter than me said that.

Here's our monitoring stack:


import time

from aws_lambda_powertools import Metrics
from aws_lambda_powertools.metrics import MetricUnit

metrics = Metrics(namespace="FunctionCalling")

# Note: @metrics.log_metrics is designed to wrap the Lambda handler, so apply it
# there; metrics added in this helper are flushed when the handler returns.
async def monitored_function_call(function_name: str, arguments: dict):
    """Wrap function calls with metrics"""

    metrics.add_metric(
        name="FunctionCallAttempt",
        unit=MetricUnit.Count,
        value=1
    )

    # perf_counter is monotonic, so it is safer than time.time() for durations
    start_time = time.perf_counter()

    try:
        result = await execute_function(function_name, arguments)

        metrics.add_metric(
            name="FunctionCallSuccess",
            unit=MetricUnit.Count,
            value=1
        )

        execution_time = (time.perf_counter() - start_time) * 1000
        metrics.add_metric(
            name="FunctionCallDuration",
            unit=MetricUnit.Milliseconds,
            value=execution_time
        )

        return result

    except Exception as e:
        metrics.add_metric(
            name="FunctionCallFailure",
            unit=MetricUnit.Count,
            value=1
        )

        metrics.add_dimension(
            name="ErrorType",
            value=type(e).__name__
        )

        raise

I should probably set up better alerting on that FunctionCallFailure metric. Right now it just pages me directly, and I've developed a proper Pavlovian response to the PagerDuty sound. My therapist is concerned. My partner is less sympathetic—apparently I've woken her up one too many times at 3 AM.

Production Deployment Checklist

Before you ship, verify these:

Rate Limiting: Implement token bucket algorithm per user


 # 100 function calls per minute per user
 limiter = TokenBucket(rate=100, capacity=100)

We started with 1000 per minute. That was... optimistic. Our bill was not.
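`TokenBucket` isn't a standard-library class, so here's one way it might be written. This is a sketch that matches the `rate`/`capacity` call above and treats `rate` as calls per minute; the injectable `clock` is an addition for testing:

```python
import time


class TokenBucket:
    """Allow `rate` calls per minute on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate_per_second = rate / 60.0
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated_at = clock()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For per-user limiting you would keep one bucket per user ID, for example in a dict keyed by user, and call `allow()` before each function execution.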

Timeout Configuration: Set aggressive timeouts


 FUNCTION_TIMEOUTS = {
 "get_weather": 2.0, # Fast API
 "process_payment": 10.0, # Payment gateway
 "default": 5.0
 }

The process_payment timeout gave me genuine heartburn. 10 seconds feels like an eternity when a user is waiting, but Stripe sometimes takes 7-8 seconds for international transactions. Compromises were made. I'm not happy about them.
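A table of timeouts only helps if something enforces it. Here's a minimal sketch using `asyncio.wait_for`. `call_with_timeout` and the shape of the error dict are assumptions for illustration:

```python
import asyncio

FUNCTION_TIMEOUTS = {"get_weather": 2.0, "process_payment": 10.0, "default": 5.0}


async def call_with_timeout(function_name: str, coro_factory):
    """Await coro_factory() but give up after the function's configured timeout."""
    timeout = FUNCTION_TIMEOUTS.get(function_name, FUNCTION_TIMEOUTS["default"])
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout)
    except asyncio.TimeoutError:
        # Return a structured error the model can relay instead of hanging.
        return {"error": "timeout", "function": function_name, "limit_s": timeout}
```

Handing the timeout back as a result, rather than raising, gives the model a chance to apologise to the user instead of the request dying silently.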

Function Call Budget: Limit total function calls per conversation


 MAX_FUNCTION_CALLS_PER_SESSION = 5

We had a user whose session called get_weather 247 times. They were probably just refreshing the page, but our AWS bill noticed immediately. So did my manager.
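Enforcing the budget can be as simple as a counter per session. This sketch uses an in-memory dict, which is an assumption: across Lambda invocations the count would have to live in shared storage. `consume_call_budget` is a hypothetical helper:

```python
from collections import defaultdict

MAX_FUNCTION_CALLS_PER_SESSION = 5

# Hypothetical in-memory counter; a Lambda deployment would need shared storage.
_calls_by_session = defaultdict(int)


def consume_call_budget(session_id: str) -> bool:
    """Count one call against the session; return False once the budget is spent."""
    if _calls_by_session[session_id] >= MAX_FUNCTION_CALLS_PER_SESSION:
        return False
    _calls_by_session[session_id] += 1
    return True
```

Check it before every function execution and refuse the call once it returns `False`.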

Cost Tracking: Tag every API call

OpenAI cost: roughly $0.01 per 1K tokens for GPT-4
Our average: $0.003 per function call (including execution)
Monthly bill last I checked: $847.23. That's down from $1,200 after implementing caching, so I'll take the win.

When Function Calling Goes Wrong: A Post-Mortem

On 3 February 2024, at 2:47 AM UTC, our monitoring fired an alert: function call failures spiked to 87%.

I was asleep. Obviously.

Root cause? A new team member committed a function definition with a typo in the parameter name—temprature instead of temperature. The model kept trying to call the function but couldn't match the schema. It attempted 3 times per request before giving up, which is why the failure rate went absolutely bonkers.

The fix we added to CI/CD:


import jsonschema

def validate_function_schemas():
    """Validate all function schemas before deployment"""
    for func in load_all_functions():
        try:
            # Confirms the parameters block is itself a valid JSON Schema
            jsonschema.Draft7Validator.check_schema(func["parameters"])

            assert len(func.get("description", "")) > 10, \
                f"{func['name']}: Description too short"

            params = func["parameters"].get("properties", {})
            for param_name in params:
                assert not is_common_misspelling(param_name), \
                    f"Possible typo: {param_name}"

        except Exception as e:
            # A non-zero exit fails the CI job
            raise SystemExit(f"Schema validation failed: {e}")

The new team member is fine, by the way. We've all been there. I once pushed a typo that turned calculate_tax into calculate_taz, and the model started hallucinating a Looney Tunes character into our billing system. That's a story for another post, but let's just say Taz did not handle VAT correctly.
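The CI check calls `is_common_misspelling`, which isn't shown. One plausible way to write it, sketched here with the standard library's `difflib`, is to flag any name that is close to, but not exactly, a known parameter name. The vocabulary list and the 0.85 cutoff are assumptions:

```python
import difflib

# Hypothetical vocabulary of parameter names the codebase is known to use.
KNOWN_PARAMETER_NAMES = ["temperature", "city", "unit", "confirmation_code"]


def is_common_misspelling(name: str, cutoff: float = 0.85) -> bool:
    """True if `name` is close to, but not exactly, a known parameter name."""
    if name in KNOWN_PARAMETER_NAMES:
        return False
    matches = difflib.get_close_matches(name, KNOWN_PARAMETER_NAMES, n=1, cutoff=cutoff)
    return bool(matches)
```

With this vocabulary, `temprature` is flagged while `temperature` and unrelated names pass.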

The Open Source Alternative

While OpenAI dominates, I've been experimenting with function calling on open-source models:


# Ollama with Llama 3 (supports function calling as of v0.1.32)
ollama run llama3:8b

# Test function calling
curl http://localhost:11434/api/chat -d '{
 "model": "llama3:8b",
 "messages": [{"role": "user", "content": "Weather in Paris?"}],
 "tools": [{
 "type": "function",
 "function": {
 "name": "get_weather",
 "description": "Get current weather",
 "parameters": {
 "type": "object",
 "properties": {
 "city": {"type": "string"}
 }
 }
 }
 }]
}'

The quality is improving rapidly: Llama 3 70B achieves roughly 89% accuracy on our function calling test suite, compared to GPT-4's 96%. That 7-point gap matters enormously in production though. It's the difference between "mostly works during business hours" and "I can sleep through the night without PagerDuty haunting my dreams."
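If you want to run a comparison like this yourself, one simple scoring rule is exact match on the function name and arguments. This is a sketch of that rule, not necessarily how the numbers above were produced. `predict` stands in for whatever calls your model:

```python
def function_call_accuracy(cases: list[dict], predict) -> float:
    """Fraction of cases where predict(prompt) returns the expected name and arguments."""
    if not cases:
        return 0.0
    correct = 0
    for case in cases:
        expected = {"name": case["name"], "arguments": case["arguments"]}
        if predict(case["prompt"]) == expected:
            correct += 1
    return correct / len(cases)
```

Exact match is strict: a model that answers `{"city": "paris"}` when you expected `"Paris"` counts as wrong, so normalise arguments first if that matters for your functions.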

I've also been playing with Mistral's function calling, but honestly? It's not ready for prime time. Got it to work about 70% of the time before I gave up and went back to the big players. Maybe I'll revisit in 6 months when the ecosystem matures.

What's Your Experience?

I'm genuinely curious: what's the weirdest function calling behaviour you've seen in production?

Last month, our model tried to call get_weather with {"city": "the moon"}—that's when I realised we needed input validation for celestial bodies too. The temperature came back as null, obviously, and the model cheerfully informed the user that "the moon's temperature is unknown, but likely quite cold."

Technically correct. Horrifyingly unhelpful.

Drop your horror stories in the comments, or better yet, contribute to the open-source validation library I'm building. The more edge cases we catch collectively, the fewer 4 AM alerts we all endure. I've got a toddler at home, so I'm already not sleeping—I don't need my production code making it worse.

#function-calling #llm #production-engineering #openai #aws-lambda #devops #machine-learning


Cael Lee
