Why I Finally Switched to llama.cpp (And When I Definitely Won't Use It)
Last Tuesday, I sat on a train with a laptop that had no GPU, 8GB of RAM, and a 7B parameter language model running locally. It wasn't fast. But it worked. That moment—after six months of bouncing between inference engines—finally crystallized why llama.cpp exists and where it absolutely falls apart.
Here's the deal: I started my local LLM journey like most people. Downloaded Ollama, typed two commands, and felt like a wizard. M2 Max, 64GB memory, everything humming along. Two weeks in, I hit a wall. Ollama's version lag meant I couldn't test Qwen2.5's new quantization formats. So I went straight to the source—llama.cpp.
The first compilation took maybe 30 seconds. Two commands:
cmake -B build
cmake --build build --config Release -j 12
The binary? Under 50MB. Seriously. Fifty. Megabytes. For comparison, vLLM's Docker image starts at 10GB. I could compile llama.cpp ten times in the time it takes to pull that container. That's not an exaggeration—I actually timed it while waiting for coffee.
The GGUF Epiphany I Almost Missed
My initial reaction to GGUF was annoyance. "Great, another format to convert." HuggingFace uses safetensors, so every model requires a conversion script. Annoying, right?
Wrong. This—and I'm going to call it a design decision rather than a limitation—is exactly the point.
Safetensors needs the safetensors library. That library ties into PyTorch. PyTorch drags in an entire ecosystem. GGUF is one file. Self-contained. Zero external dependencies. You read it yourself, you parse it yourself, you're done. It's the kind of design philosophy that makes you stop and think about how much bloat you've accepted as "normal."
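To show what "parse it yourself" can look like, here's a minimal sketch that reads the fixed-size header at the start of a GGUF file using only the Python standard library. It assumes the v2/v3 layout (4-byte magic, little-endian uint32 version, then two uint64 counts); the function name and the synthetic header bytes are mine, purely for illustration, and a real parser would go on to read the metadata key/value pairs and tensor info that follow.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumes the v2/v3 header layout)."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header for demonstration only (the counts are made up).
# For a real file you would pass the first 24 bytes instead:
#   with open("model.gguf", "rb") as f:
#       header = read_gguf_header(f.read(24))
fake_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(read_gguf_header(fake_header))
```

No third-party imports anywhere in that snippet, which is the whole point of the format.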
The project's lineage explains everything. It grew out of GGML, a tensor library hand-written by Georgi Gerganov in plain C/C++, built from scratch. Traditional ML follows a path: train in PyTorch → run inference in Python → hit problems → optimize within that stack. llama.cpp reverses this: strip away everything, write the inference engine bare-metal, then feed improvements back into the Python ecosystem (that's how llama-cpp-python was born).
Think about what this means: custom matrix-vector implementations. Manual memory layout calculations. Permute operators in C++. Model loading from scratch. Encoding and decoding handled manually. Chinese characters and emojis breaking constantly. This is not normal developer work—this is the kind of project that filters out everyone without borderline obsessive tendencies.
Where It Works (And Where It Absolutely Doesn't)
Here's something that surprised me: single-user performance is nearly indistinguishable from vLLM. I tested Qwen2.5-7B with Q4_K_M quantization on my M2 Max. llama.cpp churned out 30-40 tokens per second. More than enough for my needs.
Then I tried concurrent requests. Three clients hitting it simultaneously. The throughput collapsed. Like, not just degraded—flatlined.
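Why? No continuous batching, at least not the way I was running it. vLLM dynamically packs multiple requests into single batches, squeezing every possible cycle from the GPU. My llama.cpp server processed requests one at a time, like customers queueing for boba tea. To be fair, llama-server can be started with parallel slots (--parallel) and continuous batching (--cont-batching), so settings matter here, but even then it is not built to compete with vLLM at datacenter scale. On H200 GPUs under peak load, vLLM's request throughput was 35x higher, token throughput over 44x higher. Those numbers lived rent-free in my head for an entire week.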
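To make the queueing picture concrete, here's a toy step counter comparing one-at-a-time processing with continuous batching. It is a sketch, not how either engine is implemented: it assumes every request only needs decode steps, and that a forward pass over a batch costs about the same as a pass for a single request (a rough approximation when decoding is memory-bound). The function names are hypothetical.

```python
from collections import deque

def sequential_steps(lengths):
    """One request at a time: every generated token costs its own forward pass."""
    return sum(lengths)

def continuous_batching_steps(lengths, max_batch):
    """Each forward pass advances every active request by one token.
    A finished request frees its slot, and a queued request takes it
    before the next pass (assumes every length is at least 1)."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        # Refill free slots before the next forward pass.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        steps += 1
        # Drop requests that just produced their last token.
        active = [n - 1 for n in active if n > 1]
    return steps

lengths = [5, 2, 8, 3]  # tokens each request still has to generate
print(sequential_steps(lengths))              # 18
print(continuous_batching_steps(lengths, 3))  # 8
```

With three slots, the total is set by the longest request (8 passes) instead of the sum of all of them (18), and the short request's slot gets reused the moment it finishes.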
The single-user experience, though? Nearly identical. That's the fork in the road.
vLLM and SGLang: Standing on Shoulders
vLLM takes the other path entirely. It sits on top of torch, numpy, sentencepiece, and the entire HuggingFace stack. The innovation isn't in rebuilding fundamentals—it's in making the system go faster. PagedAttention splits KV caches into fixed-size blocks so multiple requests share GPU memory without waste. Add Continuous Batching, and you've got an engine that can handle 50 million API calls per day on one-third the GPUs. Stripe's migration to vLLM cut inference costs by 73%. That's not marketing fluff—those are production numbers.
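The fixed-size block idea is easier to see in code. Below is a toy allocator in the spirit of PagedAttention: requests take blocks from a shared pool only as they grow, and hand them back when they finish. The class and method names are hypothetical and this is not vLLM's API; the real thing also manages the actual tensors, copy-on-write sharing, and preemption.

```python
class PagedKVCache:
    """Toy allocator: KV cache memory is a pool of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")    # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")    # 5 tokens need 1 block
print(len(cache.free_blocks))  # 5
cache.release("a")
print(len(cache.free_blocks))  # 7
```

The waste per request is at most one partially filled block, rather than a worst-case-length reservation made up front.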
SGLang goes further. It adds what they call an "inference compiler" with RadixAttention. Multiple requests sharing a common prompt prefix? SGLang caches and reuses the computation. User A asks "Explain quantum computing" and User B asks "Explain quantum computing applications": SGLang identifies "Explain quantum computing" as shared, reuses the KV cache, and only computes "applications." In agent and multi-turn benchmarks it comes out faster than vLLM. The GOAT, as we used to say.
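Here's a small sketch of the prefix-reuse idea using the two prompts from the example. It is deliberately simplified: whitespace-separated words stand in for tokens, and a linear scan over past sequences stands in for the radix tree (there is no eviction either). The names are hypothetical, not SGLang's API.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Toy stand-in for a radix tree: remembers sequences already processed."""

    def __init__(self):
        self.seen = []

    def tokens_to_compute(self, tokens):
        """How many tokens still need a forward pass after reusing the
        longest prefix shared with any earlier sequence."""
        reused = max((shared_prefix_len(tokens, s) for s in self.seen), default=0)
        self.seen.append(list(tokens))
        return len(tokens) - reused

cache = PrefixCache()
user_a = "Explain quantum computing".split()
user_b = "Explain quantum computing applications".split()
print(cache.tokens_to_compute(user_a))  # 3
print(cache.tokens_to_compute(user_b))  # 1
```

User A pays for all three "tokens"; User B only pays for the one that is new.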
But—and this matters if you're betting production traffic on it—SGLang's community is smaller. Large-scale validation is ongoing. I tested the latest version last Wednesday afternoon. Still hit some edge cases. I'm waiting another six months before trusting it with critical workloads.
The llama.cpp Sweet Spot
So when does llama.cpp actually make sense? I didn't figure this out until I needed it.
Edge devices. Disconnected environments. Scenarios where data cannot leave the machine. Think about running document analysis on a laptop with 8GB RAM and no GPU—Q4_0 quantization squeezes that 7B model down to 4GB. It won't win any speed records, but it runs. Actually runs.
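The 4GB figure falls out of simple arithmetic. Q4_0 stores weights in blocks of 32, each block holding one fp16 scale plus 32 packed 4-bit values. The sketch below assumes exactly 7 billion parameters with every tensor stored as Q4_0, which real files don't quite do (some tensors are typically kept at higher precision, and "7B" is a rounded label), so treat the result as a ballpark.

```python
QK4_0 = 32                    # weights per Q4_0 block
BLOCK_BYTES = 2 + QK4_0 // 2  # 2-byte fp16 scale + 16 bytes of packed 4-bit values

def q4_0_size_bytes(n_params):
    """Approximate storage if every weight were Q4_0 (an assumption)."""
    return n_params / QK4_0 * BLOCK_BYTES

bits_per_weight = BLOCK_BYTES * 8 / QK4_0
size_gb = q4_0_size_bytes(7e9) / 1e9

print(bits_per_weight)    # 4.5
print(round(size_gb, 2))  # 3.94
```

So "4-bit" is really 4.5 bits per weight once the scales are counted, and the weights alone take just under 4GB, before the KV cache and runtime buffers.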
Apple Silicon is especially interesting here. The M-series unified memory architecture means CPU and GPU share the same memory pool (400GB/s of bandwidth on my M2 Max), plus the AMX coprocessor's outer product engine handling roughly 1,024 scalar FMAs per cycle. llama.cpp optimizes for this specifically: a Metal backend for the GPU, ARM NEON SDOT kernels on the CPU, and Apple's Accelerate framework, which is the supported route to AMX. On a Mac, it flies.
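Why bandwidth matters so much: during single-user decoding, each new token has to stream roughly the whole set of weights through memory once, so bandwidth divided by model size gives a rough ceiling on tokens per second. The back-of-envelope below uses the 400GB/s figure and an assumed 4.7GB file size for a 7B Q4_K_M model (my estimate, not a measured number); it ignores KV cache reads and compute time entirely.

```python
def decode_tokens_per_sec_upper_bound(bandwidth_gb_s, model_size_gb):
    """Rough ceiling for memory-bound decoding: every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# 400 GB/s is the bandwidth figure from the text; 4.7 GB is an assumed model size.
print(round(decode_tokens_per_sec_upper_bound(400, 4.7)))  # 85
```

The 30-40 tokens per second I saw sits under that ceiling, which is about what you'd expect once real-world overhead is included.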
How I Actually Choose Now
Use llama.cpp when:
- Running on consumer hardware (MacBooks, laptops without GPUs, edge devices)
- Data sovereignty matters—nothing leaves your machine
- You're a solo developer or tiny team just trying things out
- You want to test new models immediately, no waiting for library updates
- Offline deployment is a requirement
Use vLLM when:
- You're serving an API with real users
- You have A100s, H100s, or enterprise GPUs sitting around
- Concurrent request volume is high
- Throughput and cost efficiency actually impact your bottom line
- You need Kubernetes-native deployment
Use SGLang—carefully—when:
- Multi-turn conversations or agent workflows dominate
- Structured output with JSON Schema constraints matters
- Prefix caching will actually save you significant compute
- You're okay watching a project that's still maturing
My actual setup: local development and testing on llama.cpp (lightweight, flexible, runs on anything), production deployment on vLLM (stable, efficient, battle-tested ecosystem). SGLang is in my monitoring phase—when agent scenarios become more common in my work, I'll reconsider.
Next up: trying llama.cpp on a Raspberry Pi with a 1.5B model for basic voice assistant tasks. Probably pointless. Definitely fun. But that's the joy of tinkering, isn't it?
TL;DR (Key Takeaways)
- llama.cpp is absurdly lightweight (50MB binary vs 10GB containers) and runs on anything, but crumbles under concurrent load—throughput can be 35-44x lower than vLLM on enterprise GPUs
- vLLM dominates high-concurrency production with PagedAttention and Continuous Batching; Stripe cut costs 73% after migrating
- Single-user performance is nearly identical between the two on consumer hardware—the difference only emerges at scale
- GGUF format seems annoying but is actually brilliant: zero dependencies, self-contained, clean design
- SGLang is promising for agent/multi-turn workloads but needs more production validation before I'd trust it with critical systems
- Choose based on your deployment context, not abstract "which is better" debates
What's your experience running inference locally? Have you found a setup that works for edge devices without pulling your hair out? Drop a comment—I'm genuinely curious what others are doing in those weird in-between scenarios where neither option feels perfect.
#ai #llm #machinelearning #llama #vllm #opensource #mlops
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.