
Kimi K2 Thinking Model Released and Open-Sourced, Which of the Model's… (English)

By CaelLee | | 6 min read


Generated: 2026-06-22 04:09:51

---

Kimi K2 Thinking: This "Monster" Kept Me Up All Night After Testing!

Guess what? I've been losing sleep over a model recently.

Since last year, I've been quietly watching the folks at Kimi. K2, K2.5, K2.6... I followed them the whole way, and it felt like watching a clumsy apprentice slowly turn into a master craftsman who can stand on their own. To be honest, at first I thought: yeah, another day of hype.

Until K2 Thinking dropped. I got up at 2 AM to test it, and afterward I just sat there for half an hour, motionless.

This isn't a model! This is a "digital worker" with a brain!

---

Timeline: Kimi's "Mad Scientist" Evolution

October 2024: K2 is born, starts with a bang

1T total parameters, 32B activated parameters, MoE architecture, routed experts bumped from DeepSeek-V3's 256 all the way to 384. I read these numbers three times.

Think about it: 15.5 trillion tokens of training, with zero loss spikes throughout. Anyone who's done large model training knows this is like keeping a high-speed train perfectly smooth for the entire journey, without even a bump. Insane.
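To make those numbers concrete, here is a minimal sketch of what "1T total, 32B activated" means in an MoE. The parameter counts and the 384 routed experts come from the text; `TOP_K = 8` and the random router scores are assumptions purely for illustration, not something this post confirms about K2's router.

```python
import random

TOTAL_PARAMS = 1_000_000_000_000   # ~1T total parameters (from the text)
ACTIVE_PARAMS = 32_000_000_000     # ~32B activated per token (from the text)
NUM_ROUTED_EXPERTS = 384           # from the text
TOP_K = 8                          # assumption, for illustration only


def route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Share of the weights that actually participates in one token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Stand-in router scores; a real router computes these from the token's hidden state.
rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_ROUTED_EXPERTS)]
chosen = route(scores, TOP_K)

print(f"active fraction per token: {active_fraction:.1%}")
print(f"experts used for this token: {len(chosen)} of {NUM_ROUTED_EXPERTS}")
```

The takeaway: each token only touches a few percent of the weights, which is why a 1T model can have the per-token compute of a much smaller dense model.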

But what really made my heart skip a beat was their tagline—"Kimi K2: Open Agentic Intelligence."

From day one, they never intended to just make a "chatbot."

Early 2025: K2.5 steps up, can understand videos

Biggest change in K2.5? Unified model.

Text, images, video: all supported in one model. I showed it a screen recording of a super slick interactive asteroid-belt webpage and asked it to reconstruct the page. Guess what? I'd give the result 80 out of 100! The only thing missing was the flexible material bending effect; everything else was spot on.

In Python tests, that classic "pour water between cups" problem? It chased down Claude-Sonnet and beat it. A 400-particle simulation with roughly 80,000 pairwise collision checks per frame, a classic O(n²) problem? I optimized it with spatial grid partitioning, and it ran smooth as butter.
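For anyone wondering where the 80,000 comes from and what spatial grid partitioning buys you, here is a rough sketch. It is my own toy version, not the code from that test: the 100×100 box, the collision radius of 2.0, and the function names are all assumptions for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def brute_force_pairs(points, radius):
    """Test every pair: n*(n-1)/2 distance checks per frame."""
    r2 = radius * radius
    hits = set()
    for i, j in combinations(range(len(points)), 2):
        dx = points[i][0] - points[j][0]
        dy = points[i][1] - points[j][1]
        if dx * dx + dy * dy <= r2:
            hits.add((i, j))
    return hits


def grid_pairs(points, radius):
    """Bucket points into cells of side `radius`, then test only the 3x3 neighbourhood."""
    r2 = radius * radius
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // radius), int(y // radius))].append(idx)
    hits = set()
    for (cx, cy), members in cells.items():
        for ox in (-1, 0, 1):
            for oy in (-1, 0, 1):
                # .get() avoids creating empty cells while we iterate
                for j in cells.get((cx + ox, cy + oy), ()):
                    for i in members:
                        if i < j:  # each pair is tested once, from i's cell
                            dx = points[i][0] - points[j][0]
                            dy = points[i][1] - points[j][1]
                            if dx * dx + dy * dy <= r2:
                                hits.add((i, j))
    return hits


n = 400
pair_checks = n * (n - 1) // 2  # 79,800, i.e. the "roughly 80,000" above

rng = random.Random(42)
points = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n)]
same_result = grid_pairs(points, 2.0) == brute_force_pairs(points, 2.0)
print(pair_checks, same_result)
```

Because two particles within `radius` of each other can never be more than one cell apart, the grid version finds the same collisions while only comparing each particle against its immediate neighbours.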

Honestly, this already made quite a few closed-source models blush.

March 2025: K2.6 explodes, I witnessed the birth of a "Digital Employee"

Same 1T/32B architecture, but on SWE-Bench Pro it hit 58.6% straight up—leaving GPT-5 and Claude Opus in the dust.

What really made my scalp tingle: K2.6 downloaded and deployed Qwen3.5-0.8B locally on a Mac, continuously optimizing inference performance in Zig. After more than 4,000 tool calls, running continuously for over 12 hours, and 14 iterations, throughput went from 15 tokens/s all the way to 193 tokens/s—20% faster than LM Studio!

This isn't just a "model" anymore. This is an "employee" that can write its own code, optimize itself, and run its own experiments.

April 2025: K2 Thinking arrives, the "monster" that thinks and works simultaneously

K2 Thinking is the "thinking version" of K2. What's its main selling point? "Model as Agent": it thinks while using tools and can make 200–300 consecutive tool calls without human intervention. 1T total parameters, 32B activated, 256K context, INT4 quantization.

On "Humanity's Last Exam" (HLE), with search, Python, and web browsing tools allowed, K2 Thinking scored 44.9%, a new SOTA. On BrowseComp, it hit 60.2%, also a new SOTA; the human average is only 29.2%.
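If "thinks while using tools" sounds abstract, the loop behind it looks roughly like this. This is a toy sketch and not Kimi's API: `run_agent`, `toy_model_step`, the action dictionaries, and the `square` tool are all hypothetical names I made up; only the budget of about 300 consecutive tool calls comes from the text.

```python
def run_agent(model_step, tools, task, max_tool_calls=300):
    """Alternate between model decisions and tool execution until an answer appears."""
    history = [("task", task)]
    calls = 0
    while calls < max_tool_calls:
        action = model_step(history)  # hypothetical: the model returns a dict
        if action["type"] == "answer":
            return action["content"], calls
        # Run the requested tool and feed the result back into the context.
        result = tools[action["tool"]](action["input"])
        history.append(("tool_result", result))
        calls += 1
    return None, calls  # budget exhausted without a final answer


def toy_model_step(history):
    """Stand-in for the model: square 1, 2, 3 with a tool, then sum the results."""
    results = [value for kind, value in history if kind == "tool_result"]
    if len(results) < 3:
        return {"type": "tool", "tool": "square", "input": len(results) + 1}
    return {"type": "answer", "content": sum(results)}


tools = {"square": lambda n: n * n}
answer, calls = run_agent(toy_model_step, tools, "sum of squares of 1..3")
print(answer, calls)
```

The point of the budget is that nobody steps in between calls: the model decides on its own when to keep calling tools and when to stop and answer.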

Let that sink in. Really think about it.

---

I tested it, here's the real experience

Reasoning ability: A "genius-type player," but prone to crashing

I threw a set of logic reasoning problems at it. K2 Thinking did indeed slightly edge out Grok 4 on complex reasoning, overall close to GPT-5 Mini.

But here's the problem: It's unstable.

Letter combination puzzles, train ticket booking puzzles, 3D projection puzzles—it only got a perfect score once, while Grok 4 consistently aced them.

Simply put, K2 Thinking is like a "genius-type player"—when it's on form it's mind-blowing, when it's not... it crashes right in front of you.

Context hallucination: Finally cured the "making stuff up" problem

Kimi officially highlighted K2 Thinking's low-hallucination advantage. I tested it on log analysis tasks—it even managed a perfect score occasionally, reaching GPT-5 level. On annual report summarization, although still a step behind Grok 4, it's improved massively over K2.

Overall, its hallucination rate is top tier. You've got to give them a thumbs-up for that.

Computational ability: Now *that's* a true engineering mind

Thanks to the hallucination improvements, K2 Thinking aced direct computation problems across the board—arithmetic, calculus, probability and statistics, all perfect scores, even more stable than Grok 4.

Instruction following: Surprises and faceplants side by side

On simple instructions (like board game simulation) and complex instructions (like code derivation), K2 Thinking scored behind Grok 4.

But one surprise made my eyes light up: a diary organization task that required the output to hit an exact word count. Most models just estimate roughly. K2 Thinking, within its chain of thought, strictly counted the text character by character, and the final count was only 6 characters off from the requirement!

However, if you push the word count requirement high enough, this method falls apart—the token length blows up. But at least it shows it's really trying to follow instructions, not just phoning it in.

Coding ability: Stagnant, but headed in the right direction

K2's coding fundamentals weren't that strong to begin with, but it has the ability to steadily improve in multi-turn environments. K2 Thinking didn't get any reinforcement in coding, so its performance is about the same as K2.

The Kimi team probably thinks: in an agent working mode, "knowing how to fix mistakes" is more important than "never making mistakes."

What do you think? I agree with that logic.

Over 80% of outputs contain NBSP? That's a mystery

More than 80% of output texts have NBSP special characters mixed in, which didn't exist in the K2 base model.

I have a reasonable suspicion: Kimi deliberately added them during the post-training phase as some kind of marker. Also, there's a small probability of English output—a common problem with domestic Chinese models, I'll let it slide.
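If you want to check your own outputs for this, a few lines are enough. A minimal sketch: NBSP is the Unicode non-breaking space U+00A0, and the sample strings below are made up for the demo, not real model outputs.

```python
NBSP = "\u00a0"  # Unicode non-breaking space


def nbsp_rate(outputs):
    """Fraction of outputs that contain at least one NBSP."""
    if not outputs:
        return 0.0
    return sum(NBSP in text for text in outputs) / len(outputs)


def strip_nbsp(text):
    """Replace every NBSP with a regular space."""
    return text.replace(NBSP, " ")


# Made-up samples: two contain an NBSP, two do not.
samples = ["plain text", "has\u00a0nbsp", "also\u00a0has one", "clean again"]
rate = nbsp_rate(samples)
cleaned = [strip_nbsp(s) for s in samples]
print(f"{rate:.0%} of samples contain NBSP")
```

Worth running before you pipe model output into anything that compares strings exactly, since an NBSP looks identical to a space on screen.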

Token control: Brute-force solution, a bit expensive

There's clear evidence of token control in K2 Thinking's reasoning CoT: for example, heavy use of Chinese shorthand, content that's almost not meant to be read by humans.

But overall thinking efficiency is still not high. Many complex problems lean toward brute-force solutions with too many verification rounds, so its token cost is significantly higher than Grok 4's and currently ranks second highest.

In plain English: It solves problems aggressively, but it's expensive.

---

Technical talk: INT4 quantization, why is this so critical?

K2 Thinking uses native INT4 quantization, not the common FP8.

Here's the key issue: the decoding stage of MoE models is almost inevitably memory-bound, so the size of the weights largely determines how fast each token can be generated.

K2's original FP8 weights are about 1TB, right at the limit of what a single multi-GPU machine with high-speed interconnect can hold in memory. After W4A16 quantization (4-bit weights, 16-bit activations), inference latency is significantly better than with W8A8.
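The back-of-the-envelope math is worth doing yourself. A rough sketch: the parameter counts are from the text, while `BANDWIDTH` is a hypothetical memory bandwidth I picked for the demo, and the latency figure is only a batch-size-1 floor that ignores quantization scales, the KV cache, and compute.

```python
def weight_bytes(num_params, bits_per_weight):
    """Raw weight storage in bytes, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8


TOTAL_PARAMS = 1e12    # ~1T total (from the text)
ACTIVE_PARAMS = 32e9   # ~32B activated per token (from the text)
BANDWIDTH = 3e12       # bytes/s, hypothetical accelerator memory bandwidth


def decode_floor_ms(bits_per_weight):
    """Lower bound on per-token latency if every active weight is read once."""
    return weight_bytes(ACTIVE_PARAMS, bits_per_weight) / BANDWIDTH * 1e3


fp8_tb = weight_bytes(TOTAL_PARAMS, 8) / 1e12
int4_tb = weight_bytes(TOTAL_PARAMS, 4) / 1e12

print(f"FP8 weights:  ~{fp8_tb:.1f} TB, decode floor ~{decode_floor_ms(8):.1f} ms/token")
print(f"INT4 weights: ~{int4_tb:.1f} TB, decode floor ~{decode_floor_ms(4):.1f} ms/token")
```

Halving the bits halves both the footprint and the bytes that have to be streamed per token, which is exactly what matters when decoding is memory-bound.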

But why not just use simpler PTQ (post-training quantization)?

The Kimi team found that as generation length increases, PTQ errors accumulate. PTQ also depends on a calibration set; when the MoE is very sparse, some experts receive only a handful of tokens even with large-scale calibration data, so their quantization results come out "distorted."

So they used QAT (quantization-aware training)—higher cost, but the results are genuinely good. All benchmark scores were achieved at INT4 precision.
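To get a feel for what INT4 does to a weight, here is a minimal sketch of symmetric 4-bit quantization and the round-trip error it introduces. This is a generic textbook scheme, not Kimi's actual recipe: the group size of 32, the Gaussian toy weights, and the function names are my assumptions. In QAT, a round trip like this is typically applied during training so the model learns to tolerate the rounding; whether Kimi does it this way is not something the text confirms.

```python
import random


def quantize_int4(weights):
    """Symmetric INT4 for one group: integers in [-8, 7] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 4-bit integers back to approximate float weights."""
    return [v * scale for v in q]


rng = random.Random(0)
group = [rng.gauss(0.0, 0.02) for _ in range(32)]  # toy weights, one group of 32

q, scale = quantize_int4(group)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Each weight lands within half a quantization step of its original value. That error is small per weight, but it is baked in once with PTQ, while QAT gives the model a chance to adapt to it.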



Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
