That PagerDuty Call at 2 AM: How LLM Agents Fill Your Server with Zombie Processes
Last Wednesday at 2 AM, my phone buzzed. PagerDuty. The kind of buzz that makes your stomach drop before you even look at the screen.
Server CPU was pegged at 100% across all cores. SSH felt like typing through molasses. top showed load average north of 80, but here's the weird part—iowait was eating 70% of CPU time. My first thought: disk failure. Classic.
Then I ran ps aux | wc -l.
31000.
Linux has a default pid_max of 32768. We were 768 processes away from complete system paralysis—the point where even ssh can't spawn a new session.
Digging deeper, I found 200+ processes in D state (uninterruptible sleep) and over 400 zombies. All of them were orphaned children from our LLM Agent's function calling. On top of that, file descriptors were exhausted: our Agent service runs under systemd with LimitNOFILE=65535, and every single fd slot was consumed.
That night I learned something the hard way: child process management for function calling isn't a nice-to-have. It's a survival requirement.
Here's what I wish someone had told me before deploying LLM Agents to production.
Where the bodies get buried
Function calling in LLM Agents is fundamentally a loop: model decides → execute external code → feed results back. You ask "what's eating my disk space?" and it runs df -h, parses the output, and continues reasoning.
The problem lives in that "execute external code" step.
Most Agent frameworks (LangChain, AutoGPT, every custom implementation I've seen) use subprocess under the hood. Every function call forks a child process, does the work, and exits. If the parent never collects the child's exit status, that child becomes a zombie.
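To make that concrete, here is a minimal sketch that creates a zombie on purpose and then reaps it. It assumes Linux with /proc mounted; proc_state and wait_for_state are helper names I made up for the illustration, not part of any library.

```python
import subprocess
import sys
import time


def proc_state(pid: int) -> str:
    """Return the one-letter state from /proc/<pid>/stat ('' if the PID is gone)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return ""
    # The state is the first field after the parenthesised command name.
    return data[data.rfind(")") + 2]


def wait_for_state(pid: int, wanted: str, deadline: float = 5.0) -> bool:
    """Poll until the process reaches the wanted state or the deadline passes."""
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        if proc_state(pid) == wanted:
            return True
        time.sleep(0.05)
    return False


# The child exits almost immediately, but nobody has called wait() yet.
child = subprocess.Popen([sys.executable, "-c", "pass"])
became_zombie = wait_for_state(child.pid, "Z")

# Reaping: wait() collects the exit status and the kernel drops the entry.
child.wait()
gone_after_wait = proc_state(child.pid) == ""

print(became_zombie, gone_after_wait)
```

The child did nothing wrong here. It is the parent's missing wait() that keeps the process table entry alive.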
Here's the thing about zombies—they don't eat CPU or memory. But they do consume a slot in the process table. And 32768 slots sounds like a lot, right?
Wrong. A high-frequency Agent can fork thousands of times per hour. I once watched an automated testing Agent fill the entire PID space in half a day. The server couldn't spawn new processes. Couldn't even launch ssh. We had to hard reboot.
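If you want to know how close you are to that wall, a rough sketch like the following can help. It assumes Linux: the fourth field of /proc/loadavg is "running/total" scheduling entities, and threads count against pid_max too. The function names and the proc_root parameter are my own, added so the logic can be pointed at a test directory.

```python
from pathlib import Path


def parse_total_tasks(loadavg_text: str) -> int:
    """Extract the total task count from the contents of /proc/loadavg.

    The fourth field looks like 'running/total', e.g. '3/1042'.
    """
    running_total = loadavg_text.split()[3]
    return int(running_total.split("/")[1])


def pid_headroom(proc_root: str = "/proc") -> dict:
    """Report how many task IDs are in use versus kernel.pid_max."""
    root = Path(proc_root)
    pid_max = int((root / "sys/kernel/pid_max").read_text())
    used = parse_total_tasks((root / "loadavg").read_text())
    return {"pid_max": pid_max, "used": used, "free": pid_max - used}
```

Calling pid_headroom() on a live box returns a dict with pid_max, used, and free. Inside a container the numbers may reflect the host rather than your PID namespace, so treat it as an approximation.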
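What's worse is the file descriptors. Strictly speaking, a zombie has already released its own descriptors (the kernel closes them at exit), but the mess around it has not gone anywhere: the parent still holds its end of every stdout/stderr pipe until it closes them, and a child that is hung rather than dead keeps every file and socket it ever opened. If the parent never calls wait() and never cleans up, those fds stay gone until it restarts. In our outage, the root cause was Agent calls to kubectl timing out, leaving TCP sockets tied up by children that never finished exiting.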
Case 1: The `subprocess.run` trap hiding in plain sight
Here's code I've written. You've probably written something similar:
import subprocess

def execute_command(cmd: str) -> dict:
    result = subprocess.run(
        cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=30
    )
    return {
        "stdout": result.stdout,
        "stderr": result.stderr,
        "returncode": result.returncode
    }
Looks harmless. subprocess.run does wait() automatically, right?
Mostly. But when that timeout actually fires—plot twist.
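Wait, I need to correct myself here. On timeout, subprocess.run does kill the child and wait() for it; that is documented behavior, not something that arrived in a recent release. The catch is that with shell=True the SIGKILL only hits the shell, not whatever the shell launched, so the real command can outlive it. And SIGKILL is useless against processes in D state (uninterruptible sleep) anyway.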
That's exactly what happened to us. The Agent called kubectl get pods --all-namespaces against a dead cluster. kubectl hung on the TLS handshake and entered D state. The Agent's 10-second timeout fired, but you can't kill a D-state process. Over one night: 400+ zombies.
D state. This one's nasty.
A process enters D state when waiting on I/O—usually disk or network filesystem operations. It won't respond to any signal, including SIGKILL. You either wait for the I/O to complete (which might be never) or reboot. In our case, a stale NFS mount was the culprit. Server restart was the only way out.
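When you suspect D state, it helps to see which processes are stuck and what they are blocked on. Here is a small sketch, assuming Linux: it reads the State line from /proc/<pid>/status and the kernel wait channel from /proc/<pid>/wchan (which may just read "0" without enough privileges). The function names and the proc_root parameter are hypothetical.

```python
from pathlib import Path


def read_state(status_text: str) -> str:
    """Return the one-letter code from the 'State:' line of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return "?"


def find_uninterruptible(proc_root: str = "/proc") -> list:
    """List (pid, name, wchan) for every process currently in D state."""
    stuck = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            status = (entry / "status").read_text()
            if read_state(status) != "D":
                continue
            # The first line of the status file is "Name:\t<command>".
            name = status.splitlines()[0].split(":", 1)[1].strip()
            # wchan names the kernel function the task is sleeping in.
            wchan = (entry / "wchan").read_text().strip()
        except OSError:
            continue  # the process exited while we were reading it
        stuck.append((int(entry.name), name, wchan))
    return stuck
```

A wait channel that mentions NFS or RPC is a strong hint that a dead mount is involved, though the exact symbol names vary by kernel version.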
Case 2: LangChain's ShellTool meets thread pools
Lots of people use LangChain's ShellTool out of the box. It's just subprocess underneath. But add ThreadPoolExecutor for concurrent calls, and things get interestingly broken.
from langchain.tools import ShellTool
from concurrent.futures import ThreadPoolExecutor

shell_tool = ShellTool()

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(shell_tool.run, f"sleep {i} && echo done")
               for i in range(10)]
Two problems here.
First, forking and reaping from worker threads is fragile. In this setup each worker thread forks its own child and calls wait() on it, and Python's GIL doesn't make fork()/wait() atomic across threads. In certain Python versions, this creates race conditions. This was especially bad before Python 3.11; I think 3.11 fixed some related bugs but not all of them.
Second, signal handling goes sideways. Python signal handlers only execute in the main thread. When a child exits or gets killed on timeout, the kernel sends SIGCHLD to the parent, and the Python-level handler has to wait for the main thread, which might be busy with model inference. Signal handling gets delayed. In that gap, the child sits there as a zombie.
Back in 2024 (feels like yesterday), I saw this play out in a customer service Agent system. Thirty concurrent users, each Agent calling ShellTool. Half a day later: 800 zombies. The PID space wasn't the first thing to break; file descriptors went first. Each unfinished call was still tying up a TCP socket. Our workaround: a periodic job calling waitpid(-1, WNOHANG) every 5 minutes. (It has to run inside the Agent process itself; only a parent can reap its own children.) Band-aid solution, but it kept us alive until we could fix it properly.
Case 3: GPU processes—when zombies get expensive
This one's complicated and honestly, painful to relive.
If your Agent calls CUDA-accelerated tools (like running a small model for inference), the stakes get higher. When a GPU process exits without properly releasing its CUDA context, the VRAM doesn't get freed. A zombie process doesn't consume CPU, sure—but its GPU memory mappings stick around.
# Check what GPU resources zombies are holding
fuser -v /dev/nvidia*
# nvidia-smi showing VRAM usage for PIDs that no longer exist
nvidia-smi | grep -E "[0-9]+MiB"
I worked on a video analysis Agent that called ffmpeg with CUDA acceleration for transcoding. Every timeout turned the ffmpeg child into a zombie, but the 2GB VRAM mapping stayed allocated. Twenty invocations later, the 8GB GPU was full. Everything OOM'd.
The really insidious part? nvidia-smi showed PIDs that didn't exist anymore. The processes were zombies, invisible to the scheduler, but the VRAM was still locked. You can't kill them. You can't signal them. Either reboot or manually mess with GPU resource mappings. We were on A10G cards with driver version 535.129.03, which, from what I understand, had known issues with resource cleanup on abnormal process exit. NVIDIA fixed this in the 550.54.14 driver (released February 2024), but who upgrades GPU drivers in production without a gun to their head?
How to actually fix this
Emergency triage
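# Find zombies and their parents (STAT starts with Z; PPID is column 2)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# If parent is still alive, nudge it to reap
# (only helps if the parent has a SIGCHLD handler that calls wait)
kill -SIGCHLD <parent_pid>
# If parent is dead and zombies have PPID 1 (init)
# init should auto-reap, but sometimes doesn't
# Last resort: restart the parent process; its zombies get re-parented to init and reaped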
These are band-aids. Fine for 2 AM panic mode. Not a strategy.
The proper fix: `Popen` with explicit lifecycle management
Replace subprocess.run with Popen and own the lifecycle:
import subprocess
import signal
import os

def execute_command_safe(cmd: str, timeout: int = 30) -> dict:
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        # Critical: child gets its own session and process group.
        # start_new_session=True is the equivalent that avoids preexec_fn,
        # which the subprocess docs warn about in threaded programs.
        preexec_fn=os.setsid
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        return {
            "stdout": stdout,
            "stderr": stderr,
            "returncode": proc.returncode
        }
    except subprocess.TimeoutExpired:
        # Kill the entire process group, grandchildren too
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        # MUST wait to reap the zombie
        proc.wait()
        return {"error": "timeout", "stdout": "", "stderr": ""}
Three things that matter here:
- preexec_fn=os.setsid gives the child its own process group
- os.killpg nukes the whole tree on timeout, not just the parent
- proc.wait() is non-negotiable: even after killing, you must reap
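One gap worth sketching: an unbounded wait() can itself hang forever if the child is stuck in D state. The helper below is one possible way to bound it, escalating from SIGTERM to SIGKILL and giving up after a deadline. stop_process_group and its grace and kill_wait parameters are names I invented for this example, and it assumes the child was started in its own session so that its PID doubles as its process group ID.

```python
import os
import signal
import subprocess
import sys


def stop_process_group(proc: subprocess.Popen, grace: float = 2.0,
                       kill_wait: float = 5.0) -> bool:
    """Stop proc's process group; return True if proc was reaped.

    Assumes proc was started with start_new_session=True (or
    preexec_fn=os.setsid), so proc.pid is also the process group ID.
    """
    for sig, wait_for in ((signal.SIGTERM, grace), (signal.SIGKILL, kill_wait)):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            pass  # the whole group is already gone
        try:
            proc.wait(timeout=wait_for)
            return True
        except subprocess.TimeoutExpired:
            continue
    # Still not reaped after SIGKILL: most likely stuck in D state.
    return False


if __name__ == "__main__":
    sleeper = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    print(stop_process_group(sleeper), sleeper.returncode)
```

When this returns False, the caller at least gets control back and can log the PID, raise an alert, or stop accepting work, instead of blocking the Agent loop on a process that will never die.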
SIGCHLD handler for long-running services
Register a global reaper:
import signal
import os

def setup_child_reaper():
    def reap_children(signum, frame):
        try:
            while True:
                pid, status = os.waitpid(-1, os.WNOHANG)
                if pid == 0:
                    break
        except ChildProcessError:
            pass

    signal.signal(signal.SIGCHLD, reap_children)

# Call this when your service starts
setup_child_reaper()
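Catch: Python only handles signals in the main thread. This won't work well with multi-threaded Agent services. A second catch: waitpid(-1) reaps every child, including ones a Popen object is still tracking, so that Popen may find its child already gone and report a returncode of 0. Our current Agent runs single-threaded asyncio with this reaper, and we've had exactly zero zombies for six months running.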
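For the single-threaded asyncio setup, a tool call with a timeout might look roughly like the sketch below. This is not our production code; run_tool is a hypothetical name, and it relies on asyncio.create_subprocess_exec passing start_new_session through to the underlying Popen so the child's PID is also its process group ID.

```python
import asyncio
import os
import signal
import sys


async def run_tool(argv: list, timeout: float = 30.0) -> dict:
    """Run one tool call, kill its whole process group on timeout, always reap."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        start_new_session=True,  # child leads its own process group
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        await proc.wait()  # reap, so no zombie is left behind
        return {"error": "timeout", "returncode": proc.returncode}
    return {
        "stdout": stdout.decode(),
        "stderr": stderr.decode(),
        "returncode": proc.returncode,
    }


if __name__ == "__main__":
    print(asyncio.run(run_tool([sys.executable, "-c", "print('ok')"], timeout=10)))
```

Passing an argv list instead of a shell string also sidesteps the shell=True problem, since there is no intermediate shell to absorb the kill.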
Process pools with monitoring for high-frequency calls
If you're forking constantly, stop. Use a pool:
from concurrent.futures import ProcessPoolExecutor
import atexit

executor = ProcessPoolExecutor(max_workers=4)

def cleanup():
    executor.shutdown(wait=True)

atexit.register(cleanup)
Process pools cap the number of child processes. But timeout handling gets trickier. My approach: a separate monitoring coroutine.
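import asyncio
import os
import signal

async def monitor_process(pid: int, timeout: int):
    await asyncio.sleep(timeout)
    try:
        os.kill(pid, signal.SIGKILL)
        # Blocking reap; returns quickly after SIGKILL
        # unless the child is stuck in D state
        os.waitpid(pid, 0)
    except (ProcessLookupError, ChildProcessError):
        # Already gone, or already reaped by someone else
        pass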
Container isolation: the nuclear option
The most thorough fix: run each function call in its own container.
apiVersion: batch/v1
kind: Job
metadata:
  name: agent-function-{{ uuid }}
spec:
  ttlSecondsAfterFinished: 100  # Auto-cleanup
  template:
    spec:
      containers:
      - name: executor
        image: agent-executor:latest
        command: ["python", "-c", "{{ code }}"]
      restartPolicy: Never
Container exits, everything inside it—zombies, leaked fds, orphaned GPU mappings—ceases to exist. Clean resource isolation.
The tradeoff is startup latency. Our testing with containerd and pre-pulled images: ~1.2 seconds cold start, ~300ms warm. For latency-insensitive batch workloads, this is the safest bet.
Don't skip the monitoring
Instrument this:
import subprocess

def check_zombie_count():
    # STAT is column 8 of `ps aux`; zombies show up as Z, Zs, Z+, ...
    result = subprocess.run(
        "ps aux | awk '$8 ~ /^Z/' | wc -l",
        shell=True, capture_output=True, text=True
    )
    count = int(result.stdout.strip())
    if count > 50:
        # send_alert is your own alerting hook (not defined here)
        send_alert(f"Zombie process count: {count}")
    return count
Wire it to Prometheus:
from prometheus_client import Gauge
zombie_gauge = Gauge('agent_zombie_processes', 'Number of zombie processes')
zombie_gauge.set(check_zombie_count())
Our alert threshold is 100. Since deploying the SIGCHLD handler, this alert has never fired. Not once.
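One caveat with a ps-based check: it has to fork a shell, ps, awk, and wc, which is exactly what stops working once the PID space is gone. A fork-free variant can read /proc directly. The sketch below assumes Linux; parse_stat and zombies_by_parent are names I made up, and grouping by parent PID is my addition because it points straight at the process that isn't reaping.

```python
from collections import Counter
from pathlib import Path


def parse_stat(stat_text: str) -> tuple:
    """Return (state, ppid) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')'.
    fields = stat_text[stat_text.rfind(")") + 2:].split()
    return fields[0], int(fields[1])


def zombies_by_parent(proc_root: str = "/proc") -> Counter:
    """Count zombie processes per parent PID without spawning any process."""
    counts = Counter()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            state, ppid = parse_stat((entry / "stat").read_text())
        except (OSError, IndexError, ValueError):
            continue  # process vanished mid-read, or unexpected content
        if state == "Z":
            counts[ppid] += 1
    return counts
```

The total for the gauge is sum(zombies_by_parent().values()), and the key with the highest count is the parent to go look at first.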
The bottom line
This problem is deceptively simple: LLM Agent function calling drags traditional OS process management into the application layer. SREs used to be the only ones who worried about zombies. Now every engineer deploying Agents needs to understand process lifecycle.
Here's what sticks:
- Don't trust subprocess.run defaults: use Popen + explicit wait() for anything with timeouts
- Put child processes in their own process group so you can nuke the whole tree
- File descriptors and GPU resource leaks are often more dangerous than the zombies themselves
- High-frequency scenarios need process pools or container isolation
- Monitor zombie counts and alert before you run out of PIDs
I'm curious—has anyone else had their servers taken down by Agent-induced zombies? What was your root cause? Drop a comment. I have a feeling we're all stepping on the same landmines.
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.