Cutting Token Consumption by 90%
Generated: 2026-06-23 14:43:43
---
Last month I did something incredibly stupid: I left OpenClaw running for three straight days without checking on it. While I was away, it quietly executed seven or eight scheduled tasks and processed dozens of rounds of dialogue. By the time I noticed, my Claude API bill had racked up almost $200. All because I was too lazy to touch the damn default configuration.
And I’m not afraid to tell you—there are plenty of people in the community who’ve "paid tuition" just like me. One guy had it even worse: he wrote an infinite-loop skill and forgot to turn it off. When he woke up, his account had been charged $800. Later, when I told people that "OpenClaw can cut token consumption by 90%," their first reaction was always an eye roll. Their second reaction? "Alright then, let’s see your bill, smart guy."
Fine. I’ll show it. My monthly API cost now sits steady at around $12. And before optimization? Nearly $140. 90%? At least. In this post I’m going to walk you through exactly how I did it. All practical steps, zero fluff. You ready?
---
1. First, figure out where your money is actually burning
I used to make the mistake of following the herd and tweaking compaction settings—did absolutely nothing for me, because the main cost wasn’t in the dialogue history at all.
See, OpenClaw is basically a "token snake" (a nickname my colleague came up with—fits perfectly). Every task triggers several rounds of API calls, each round carrying a full context. If it’s just a simple chat it’s fine, but once you enable tool calls or attach a few skills, tokens start gushing out like a leaking pipe—you can’t stop it.
I ran a full OpenTelemetry trace (the new telemetry feature added in v2026.4.25—I strongly recommend you try it too) and the token distribution in my use case was a real eye‑opener:
- History accumulation: roughly 25% (lower than I expected, since I’d already enabled compaction earlier)
- Tool invocation results: 35%—this is the real monster. Every file_read or command execution result gets stuffed back into the context. Once or twice is fine, but after dozens of rounds it explodes.
- Tool schema injection: roughly 15% (what is left after the other four shares). I had 30+ tools attached, and the schemas alone cost about 2,000 tokens per request.
- Framework system prompts: 15%, which is actually quite hard to cut.
- Miscellaneous: 10%
When I saw this I was stunned: my biggest enemy wasn’t conversation history at all, it was tool calls and schemas. Truly counter‑intuitive. So my first suggestion is: run an OTEL trace first, don’t optimise blindly by gut feeling. How? A few lines of config:
{
  "telemetry": {
    "enabled": true,
    "exporter": "console",
    "sampleRate": 1.0
  }
}
Then you’ll see something like this in the logs:
[otel] openclaw.context.assembled: 4521ms ← bottleneck right here!
[otel] tools.manifest: 1100ms
[otel] gen_ai.client.token.usage: {input: 12340, output: 456}
You can see exactly which part is eating your budget, crystal clear. No guessing required.
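If you want totals rather than eyeballing individual lines, a few lines of Python can add them up. This is a minimal sketch that assumes the console exporter prints usage lines exactly in the shape shown above; the second usage line in the sample is made up for illustration, and `sum_token_usage` is a hypothetical helper, not part of OpenClaw.

```python
import re

# Matches the usage line format shown in the log excerpt above (assumed stable).
USAGE_RE = re.compile(
    r"gen_ai\.client\.token\.usage:\s*\{input:\s*(\d+),\s*output:\s*(\d+)\}"
)


def sum_token_usage(log_lines):
    """Return (total_input, total_output) across all usage lines."""
    total_in = 0
    total_out = 0
    for line in log_lines:
        match = USAGE_RE.search(line)
        if match:
            total_in += int(match.group(1))
            total_out += int(match.group(2))
    return total_in, total_out


sample = [
    "[otel] openclaw.context.assembled: 4521ms",
    "[otel] tools.manifest: 1100ms",
    "[otel] gen_ai.client.token.usage: {input: 12340, output: 456}",
    # Invented second request, only here to show the summing.
    "[otel] gen_ai.client.token.usage: {input: 9000, output: 300}",
]

print(sum_token_usage(sample))  # (21340, 756)
```

Input tokens dwarf output tokens in this sample, which is the usual pattern when context, not generation, is what you are paying for.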
---
2. Three configuration hacks: Compaction, Heartbeat, Agent splitting
Once you know where the problem is, the rest is just cutting it out. I’ve summed up the three most effective tricks—just follow along, I promise they work.
First hack: Don’t leave compaction on safeguard
OpenClaw’s default compaction mode is called safeguard. Sounds safe, right? It means “try to preserve as much raw context as possible, only compress when things are about to burst.” But the price? Context is always at full load: every request carries tens of thousands of tokens of history, and your wallet is bleeding.
I switched it straight to the mode that is literally named default (confusingly, that is not the mode you get out of the box):
"compaction": {
  "mode": "default"
}
Context usage dropped from frequently above 80% to a stable ~40%. If you’re aiming for extreme savings, you could try even more aggressive modes (like aggressive), but I didn’t, because I was afraid of losing context. But if you haven’t enabled compaction at all… I’m genuinely worried for your wallet 😱
Second hack: More heartbeats aren’t better
OpenClaw has a really convenient heartbeat feature for periodically checking email, calendar, and to‑dos. But don’t set the frequency too high. The default might be every 5 minutes? I’ve never used the default. I changed mine to every 30 minutes, and only check the last message (target: "last"):
"heartbeat": {
  "every": "30m",
  "target": "last"
}
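To see why the interval matters so much, count the runs. The sketch below is just arithmetic: `runs_per_day` is a hypothetical helper, the interval grammar (`m`, `h`, `d` suffixes) is my assumption based on the "30m" value above, and the 5-minute figure is only my guess at the default.

```python
import re

_UNIT_MINUTES = {"m": 1, "h": 60, "d": 1440}


def runs_per_day(every: str) -> float:
    """Convert an interval such as '30m' or '2h' into runs per day."""
    match = re.fullmatch(r"(\d+)([mhd])", every.strip())
    if not match:
        raise ValueError(f"unsupported interval: {every!r}")
    minutes = int(match.group(1)) * _UNIT_MINUTES[match.group(2)]
    if minutes == 0:
        raise ValueError("interval must be greater than zero")
    return 1440 / minutes


for interval in ("5m", "30m"):
    print(interval, runs_per_day(interval))
# 5m 288.0
# 30m 48.0
```

Every one of those runs is a full API call with context attached, so going from 5 to 30 minutes cuts heartbeat calls by a factor of six before you change anything else.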
I also merged all the scattered scheduled checks into one single "Daily Morning Brief" that runs only once a day, at 08:00:
{
  "name": "Daily Morning Brief",
  "schedule": {"kind": "cron", "expr": "0 8 * * *"},
  "sessionTarget": "isolated",
  "payload": {
    "kind": "agentTurn",
    "message": "Check today’s email, calendar, and to‑dos — give me a short summary."
  }
}
See that sessionTarget: "isolated"? That’s the key. Every run starts a fresh session, and previous briefs are never carried over. Saves tokens and brain power—two birds with one stone.
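Here is a rough model of why a fresh session per run matters. It is a sketch, not a measurement: the token sizes are invented, it ignores compaction entirely, and `cumulative_input_tokens` is a hypothetical helper. The point is only the shape of the curve: in a shared session each run drags along every earlier brief.

```python
def cumulative_input_tokens(days, base_tokens, brief_tokens, isolated):
    """Total input tokens over `days` daily runs.

    Shared session: run number n carries the n earlier briefs as history.
    Isolated session: every run starts from base_tokens only.
    """
    total = 0
    for run in range(days):
        carried = 0 if isolated else run * brief_tokens
        total += base_tokens + carried
    return total


# Hypothetical sizes, not measured values.
BASE = 3_000   # prompt plus tool schemas per run
BRIEF = 1_500  # history each finished brief leaves behind

shared = cumulative_input_tokens(30, BASE, BRIEF, isolated=False)
fresh = cumulative_input_tokens(30, BASE, BRIEF, isolated=True)
print(shared, fresh)  # 742500 90000
```

With these made-up numbers the shared session costs roughly eight times more input over a month, and the gap keeps widening the longer the session lives.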
Third hack: Split agents by workload—separate heavy from light
I should have done this ages ago. My workflow roughly splits into two routes: everyday chatting, researching, drafting; and code review, debugging, script execution. The former only needs Sonnet, the latter needs a stronger model.
But before? One agent to rule them all. Every chat carried all tools and skills—even asking about the weather loaded the code-analysis tool schemas. Like using a cannon to shoot a mosquito.
Here’s my current config:
"agents": {
  "defaults": {
    "model": {"primary": "anthropic/claude-sonnet-4-5"}
  },
  "list": [
    {"id": "main", "default": true},
    {"id": "light", "workspace": "~/.openclaw/workspace-light"}
  ]
}
Then bind Telegram group chats and casual conversations to the light agent:
"bindings": [
  {
    "agentId": "light",
    "match": {"channel": "telegram", "peer": {"kind": "group", "id": "your-group-id"}}
  }
]
The light workspace holds only the most essential prompts and three or four commonly used tools. Context consumption instantly halved. The main agent stays reserved for complex tasks, loading all tools only when needed.
Honestly, this change saved me the most money. Most of my time is spent on lightweight conversations—heavy lifting accounts for less than 20%. Isn’t it the same for you?
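If the binding logic feels abstract, this is how I picture it. To be clear, this is my own mental model and not OpenClaw’s actual resolver: `matches` and `resolve_agent` are hypothetical helpers, and the message shape is assumed to mirror the "match" object from the config above. First matching binding wins, otherwise the agent marked as default takes the message.

```python
def matches(pattern, value):
    """True if every key in pattern exists in value with an equal
    (or, for nested dicts, recursively matching) value."""
    for key, expected in pattern.items():
        if key not in value:
            return False
        actual = value[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual != expected:
            return False
    return True


def resolve_agent(message, config):
    """Return the agent id for a message: first matching binding,
    else the agent flagged as default."""
    for binding in config.get("bindings", []):
        if matches(binding["match"], message):
            return binding["agentId"]
    for agent in config["agents"]["list"]:
        if agent.get("default"):
            return agent["id"]
    raise LookupError("no default agent configured")


config = {
    "agents": {"list": [{"id": "main", "default": True}, {"id": "light"}]},
    "bindings": [
        {
            "agentId": "light",
            "match": {
                "channel": "telegram",
                "peer": {"kind": "group", "id": "your-group-id"},
            },
        }
    ],
}

group_msg = {"channel": "telegram", "peer": {"kind": "group", "id": "your-group-id"}}
direct_msg = {"channel": "telegram", "peer": {"kind": "direct", "id": "me"}}
print(resolve_agent(group_msg, config))   # light
print(resolve_agent(direct_msg, config))  # main
```

The useful property is that the cheap path is explicit: anything you have not deliberately routed to the light agent still lands on main, so nothing breaks while you migrate channels one at a time.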
---
3. The trump card: QMD cuts context requirements by 95%
Everything above is about “throttling”, but what really made my token usage drop off a cliff was the QMD feature introduced in OpenClaw 2026.2.2.
QMD stands for “Query‑based Memory with Distribution” (or something similar—the official name is “Semantic Local Search Memory”). It’s the polar opposite of the traditional “stuff all history into the context” approach. Instead, it retrieves only the most relevant memories from local storage when needed, and injects just those few chunks.
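To make the idea concrete, here is a toy version of "retrieve only what is relevant". It is not how QMD is implemented: I’m using plain bag-of-words cosine similarity as a stand-in for real semantic search, and `retrieve` plus the sample memories are invented for illustration.

```python
import math
import re
from collections import Counter


def _vector(text):
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memories, k=5):
    """Return up to k memories most similar to the query, best first.
    Memories with no overlap at all are dropped."""
    query_vec = _vector(query)
    scored = [(_cosine(query_vec, _vector(m)), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for score, memory in scored[:k] if score > 0]


memories = [
    "Deploy script lives in scripts/deploy.sh and needs the STAGING flag",
    "User prefers summaries under five bullet points",
    "The staging database was migrated on Tuesday",
    "Weekly sync with the design team is on Thursdays",
]

for hit in retrieve("How do I run the deploy script on staging?", memories, k=2):
    print(hit)
```

Only the retrieved chunks would go into the prompt; everything else stays on disk. A real system would use embeddings instead of word overlap, but the token math is the same: you pay for k small chunks instead of the whole history.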
I tried it, and the effect was ridiculously dramatic. Seriously, you can’t imagine the impact.
Here’s an example: I had a session that lasted two weeks, accumulating about 50k tokens of dialogue history. Normally, every new question would carry those 50k tokens as context—that’s almost $1 just in input cost. After enabling QMD? The system retrieved only 5 relevant records, totalling less than 2,000 tokens.
50k vs under 2,000: a reduction of at least 96%!!! Isn’t that insane?
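If you want to check the arithmetic against your own sessions, it fits in a few lines. The price per million input tokens below is a placeholder assumption, so swap in whatever your model actually charges; `reduction_pct` and `input_cost_usd` are hypothetical helpers.

```python
def reduction_pct(before_tokens, after_tokens):
    """Percentage of input tokens saved per request."""
    return (before_tokens - after_tokens) * 100 / before_tokens


def input_cost_usd(tokens, usd_per_million):
    """Input cost of one request at a given price per million tokens."""
    return tokens / 1_000_000 * usd_per_million


BEFORE = 50_000   # full history carried on every question
AFTER = 2_000     # upper bound for the retrieved records
PRICE = 15.0      # assumed USD per million input tokens, replace with yours

print(reduction_pct(BEFORE, AFTER))  # 96.0
print(round(input_cost_usd(BEFORE, PRICE), 2))
print(round(input_cost_usd(AFTER, PRICE), 2))
```

Note that the saving applies to every single request in the session, which is why it compounds so quickly on long-running conversations.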
The official claim is “reduce context by 95%+, speed
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.