NVIDIA Stuffs GPT-4 into Minecraft, Plays 15x Faster
Generated: 2026-06-20 20:36:10
---
You’re scrolling through your phone, about to sleep, when a post makes you jolt upright in bed—
NVIDIA stuffed GPT-4 into Minecraft.
Not as a sightseer. It was let loose to play on its own. And guess what?
This thing plays games so fast it leaves every previous AI in the dust. It unlocks wooden tools using over 15 times fewer prompt iterations than the previous fastest AI. Mining diamonds? Child's play. AutoGPT next to it? Still learning to walk.
The night I saw that headline, I scrambled out of bed, booted up my laptop, pulled the paper, and ran the code. Today I have to tell you about it.
---
1. This AI called Voyager—what makes it unstoppable in Minecraft?
The name is poetic: Voyager. But at its core, it's a monster with a brain in the cloud and a body made of code.
How does it play? It doesn't look at the screen. As released with the paper, Voyager used no vision at all: it perceived the world purely as text through Mineflayer, a JavaScript bot API for Minecraft, and wrote code to control the in-game character directly.
The flow is a non-stop loop:
while True:
    code = GPT-4 writes code (task + feedback + error + critique)
    feedback, error = execute code
    if self-verification passes:
        save code into skill library
        break
    else:
        feed environment feedback, error messages, and self-criticism back in → rewrite
See it? Every task starts with GPT-4 generating a chunk of JavaScript that runs inside the game. If it crashes, say with a game error, the error message goes back to GPT-4 for a fix, and the cycle repeats until the code runs.
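Here is a minimal Python sketch of that loop, not Voyager's actual code. The three callables stand in for the GPT-4 call, in-game execution, and the self-verification call. Their names, the toy stubs, and the cap of four rounds are all my assumptions for illustration.

```python
def run_task(task, write_code, execute, verify, skill_library, max_rounds=4):
    """Return (code, rounds_used) on success, or (None, max_rounds) if every round fails."""
    feedback = error = critique = ""
    for round_no in range(1, max_rounds + 1):
        # Stand-in for the GPT-4 code generation call.
        code = write_code(task, feedback, error, critique)
        # Stand-in for running the generated JavaScript inside the game.
        feedback, error = execute(code)
        # Stand-in for the self-verification call.
        ok, critique = verify(task, feedback)
        if not error and ok:
            skill_library[task] = code  # keep the working code for later reuse
            return code, round_no
    return None, max_rounds


# Toy stubs (hypothetical) that mimic the "collect 10 wood" story.
def toy_write_code(task, feedback, error, critique):
    # The first draft forgets to pick up drops; a critique triggers the fix.
    return "chop_trees(); pick_up_drops()" if critique else "chop_trees()"


def toy_execute(code):
    logs = 12 if "pick_up_drops" in code else 0
    return f"oak_log x {logs}", ""  # (environment feedback, execution error)


def toy_verify(task, feedback):
    count = int(feedback.rsplit(" ", 1)[-1])
    if count >= 10:
        return True, ""
    return False, "Logs were cut but never picked up."


library = {}
code, rounds = run_task("collect 10 wood", toy_write_code, toy_execute, toy_verify, library)
print(rounds, code)
```

The point of the design is that the three feedback strings survive from one round to the next, so each rewrite starts from what went wrong instead of from scratch.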
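When I tested it, most tasks were solved in three to four iterations. Take "collect 10 wood." On the first try, it wrote code to chop trees but didn't pick up the fallen logs. On the second try it knew to add a collect call (in Mineflayer, block collection typically comes from the mineflayer-collectblock plugin, as bot.collectBlock.collect).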
Now you get it: this kind of self-correction is what the earlier ReAct and Reflexion approaches lacked here. At best they wrote an action plan and executed it stubbornly. Errors? Start over, rethinking each time.
The killer feature is Voyager's self-verification module—a dedicated GPT-4 to check if the task is complete. Say "collect 10 wood." The verifier looks at the inventory, sees oak_log x 12, and fires off: "Task completed ✓"
If not? It criticizes: "You tried to mine iron ore directly, but you have no stone pickaxe. Recommend crafting a stone pickaxe first."
Think about it—that critique feeds back into GPT-4, and in the next round the code knows to make a stone pickaxe first.
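To make the verifier's job concrete, here is a rule-based stand-in in Python. In Voyager the critic is a GPT-4 call that reads the agent's state. This sketch only compares an inventory against a goal, and the function name and the critique wording are hypothetical.

```python
from typing import Dict, Tuple


def self_verify(goal: Dict[str, int], inventory: Dict[str, int]) -> Tuple[bool, str]:
    """Return (success, critique) by comparing the inventory with the goal counts."""
    missing = {
        item: needed - inventory.get(item, 0)
        for item, needed in goal.items()
        if inventory.get(item, 0) < needed
    }
    if not missing:
        return True, "Task completed"
    shortfall = ", ".join(
        f"{count} more {item}" for item, count in sorted(missing.items())
    )
    return False, f"Task not completed. Still need {shortfall}."


print(self_verify({"oak_log": 10}, {"oak_log": 12}))
print(self_verify({"iron_ore": 1}, {"cobblestone": 5}))
```

A real critic goes further than counting: it can explain why the goal was missed (no stone pickaxe, for instance), which is what makes the critique useful as input to the next round.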
I tried similar tasks with AutoGPT before; it often goes off the rails—collecting wood, then suddenly mining, or stuck in some sub‑goal forever. Voyager doesn't, because its feedback is a continuous closed loop. AutoGPT lacks that.
---
2. 15.3 times faster? That number sounds insane—how is it even calculated?
From the paper: unlocking wooden tools, Voyager used 15.3 times fewer prompt iterations than ReAct.
Note—prompt iterations, not wall‑clock time.
What that means: where ReAct needs roughly fifteen rounds of prompting GPT-4 to get there, Voyager needs about one.
Stone tools? 8.5 times faster. Iron? 6.4 times.
And only Voyager could unlock diamond tools; none of the baselines succeeded even once.
I personally recreated the baselines from the paper, using gpt-4-0314 for generation, text-embedding-ada-002 for embeddings, and a framework built on MineDojo.
When I ran ReAct, I watched it spin its wheels. Asked to make a wooden pickaxe, it first made a crafting table, then couldn't figure out how to craft sticks, so it went back to chopping wood. After all that, its inventory had logs but it didn't know to convert them. Frustrating, right?
Voyager? GPT‑4 directly writes code to call bot.craft, fixes errors, usually succeeds in one round.
But I have to give you a reality check—
This 15.3x is an efficiency gain measured in prompt iterations, not in final game speed.
Fewer prompt iterations do not remove GPT-4's latency, so the actual total time might not differ that much. The paper also admits that, cost-wise, GPT-4 is 15 times pricier than GPT-3.5. And Voyager depends on GPT-4's code generation quality, a leap that GPT-3.5 and the open-source models of the time could not match.
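A quick back-of-the-envelope model shows why fewer prompt iterations need not mean a proportionally shorter run. Every number below is a made-up placeholder, not a measurement from the paper or from my runs.

```python
def wall_clock_seconds(iterations, llm_calls_per_iteration,
                       seconds_per_llm_call, seconds_in_game):
    """Total time if each iteration pays for its LLM calls plus in-game execution."""
    per_iteration = llm_calls_per_iteration * seconds_per_llm_call + seconds_in_game
    return iterations * per_iteration


# Hypothetical baseline: many short iterations, one LLM call each.
baseline = wall_clock_seconds(iterations=46, llm_calls_per_iteration=1,
                              seconds_per_llm_call=10, seconds_in_game=5)

# Hypothetical Voyager-style run: few iterations, but two LLM calls each
# (code generation plus verification) and a longer program to execute.
voyager = wall_clock_seconds(iterations=3, llm_calls_per_iteration=2,
                             seconds_per_llm_call=10, seconds_in_game=60)

iteration_ratio = 46 / 3
time_ratio = baseline / voyager
print(f"{iteration_ratio:.1f}x fewer iterations, {time_ratio:.1f}x less time")
```

With these placeholder inputs the iteration count shrinks far more than the elapsed time does, which is the gap between "15.3x" and how fast a run feels.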
Another number: 63 unique items. In 160 prompt iterations, Voyager discovered 63 unique items—3.3 times more than comparable methods. When I ran it, I saw Voyager's exploration was indeed wider—it upgraded itself from wood to stone to iron, then ran everywhere collecting stuff. AutoGPT? Mined some iron ore and didn't know what to do next.
---
3. No visual input, all code control—how is that even possible?
This is Voyager's most revolutionary aspect.
Traditional game AI follows either pixel‑based reinforcement learning or predefined button scripts.
Voyager is completely different—it interacts with the game through the Mineflayer API. In essence, it's programming its way through the game.
For example, "craft a stone pickaxe." Voyager doesn't just write code calling bot.craft('stone_pickaxe', 1); it must first check the inventory for 2 sticks and 3 cobblestone. If missing, it backtracks: chop trees, turn logs into planks, planks into sticks; then mine stone for cobblestone. All steps written in one JS code chunk, run and done.
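The backtracking described above can be sketched as a small recursive planner. This is an illustration in Python, not the JavaScript Voyager generates: the recipe table is simplified (it leaves out the crafting table and the wooden pickaxe needed to mine stone), the function name is hypothetical, and the output is a list of plain-text steps, not Mineflayer calls.

```python
import math

# output item -> (ingredients per craft, items produced per craft)
RECIPES = {
    "stone_pickaxe": ({"stick": 2, "cobblestone": 3}, 1),
    "stick": ({"oak_planks": 2}, 4),
    "oak_planks": ({"oak_log": 1}, 4),
}
# raw materials -> how they are gathered
RAW = {"oak_log": "chop", "cobblestone": "mine"}


def ensure(item, count, inventory, steps):
    """Append steps until the inventory holds at least `count` of `item`."""
    short = count - inventory.get(item, 0)
    if short <= 0:
        return  # already have enough, nothing to do
    if item in RAW:
        steps.append(f"{RAW[item]} {short} {item}")
        inventory[item] = inventory.get(item, 0) + short
        return
    ingredients, per_craft = RECIPES[item]
    crafts = math.ceil(short / per_craft)
    for ingredient, qty in ingredients.items():
        ensure(ingredient, qty * crafts, inventory, steps)
        # Reserve the ingredient right away so later sub-plans cannot reuse it.
        inventory[ingredient] -= qty * crafts
    steps.append(f"craft {crafts * per_craft} {item}")
    inventory[item] = inventory.get(item, 0) + crafts * per_craft


inventory = {}
steps = []
ensure("stone_pickaxe", 1, inventory, steps)
for step in steps:
    print(step)
```

Starting from an empty inventory, the plan bottoms out in chopping and mining before any crafting happens, which mirrors the order of the steps in the paragraph above.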
I once hit a snag. Its first code wrote bot.craft('acacia_axe'), and the game threw an error: "TypeError: bot.craft is not a function".
Guess what? Mineflayer's craft function doesn't take a simple string name. It expects bot.craft(recipe, count, craftingTable), where recipe is a Recipe object you look up first (for example with bot.recipesFor). GPT-4 knows game mechanics but isn't familiar with API details, so it has to rely on environment feedback to correct itself.
Voyager's iterative prompting mechanism is built exactly for this. It feeds environment feedback (not enough wood), execution errors (wrong function name), and self‑verification critiques (forgot to make a stone pickaxe first)—all as context for the next prompt, letting GPT‑4 fix things step by step.
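As a rough picture of how those three signals could be packed into the next prompt, here is a small Python sketch. The section titles and the function name are my own placeholders. The actual prompt template in Voyager is longer and is not reproduced here.

```python
def build_next_prompt(task, last_code, env_feedback="", exec_error="", critique=""):
    """Assemble the context for the next code-generation round."""
    sections = [
        ("Task", task),
        ("Code from the last round", last_code),
        ("Environment feedback", env_feedback),
        ("Execution error", exec_error),
        ("Critique", critique),
    ]
    # Empty sections are kept and marked "None" so the layout stays stable.
    return "\n\n".join(
        f"{title}:\n{body if body else 'None'}" for title, body in sections
    )


prompt = build_next_prompt(
    task="Mine 1 iron ore",
    last_code="// code from the previous round goes here",
    env_feedback="I cannot mine iron ore without a stone pickaxe",
    critique="Craft a stone pickaxe first.",
)
print(prompt)
```

Keeping all three channels in every prompt means a round that fails for a game reason and a round that fails for an API reason are handled by the same mechanism.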
In my tests, multi-step tasks like "craft an iron pickaxe" almost always failed in the first two rounds but usually succeeded by the third.
One more fun fact: per the paper, Voyager traveled 2.3 times longer distances than the baselines. I suspect it's because it learned to swim, climb mountains, and build bridges, all skills it wrote as code and saved into the skill library. Once saved, when it later encounters a river, it just retrieves the "swim" code from the library without making GPT-4 figure it out from scratch.
Do you see? That's what "lifelong learning" means—not the model parameters changing, but the behavior code accumulating.
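A skill library of this kind can be sketched in a few lines of Python. Voyager indexes skills by an embedding of their description (text-embedding-ada-002). To stay self-contained, this sketch swaps in a bag-of-words vector with cosine similarity, so the retrieval quality is far cruder than the real thing. The class name, method names, and stored code strings are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a real embedding model: a simple word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)  # Counter yields 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SkillLibrary:
    def __init__(self):
        self._skills = []  # list of (vector, description, code)

    def add(self, description, code):
        self._skills.append((embed(description), description, code))

    def retrieve(self, query, top_k=1):
        """Return the top_k (description, code) pairs most similar to the query."""
        q = embed(query)
        ranked = sorted(self._skills, key=lambda s: cosine(q, s[0]), reverse=True)
        return [(desc, code) for _, desc, code in ranked[:top_k]]


library = SkillLibrary()
library.add("swim across a river", "async function swimAcrossRiver(bot) { /* ... */ }")
library.add("craft a stone pickaxe", "async function craftStonePickaxe(bot) { /* ... */ }")

print(library.retrieve("cross the river ahead"))
```

Nothing in the model changes when a skill is added. The library just grows, and later prompts can include the retrieved code as ready-made building blocks.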
---
4. Something this awesome must have flaws, right?
One word: expensive.
Painfully expensive.
The GPT-4 API costs about 15 times more than GPT-3.5, going by the paper's figure. Each Voyager iteration calls GPT-4 to generate code, plus another GPT-4 call for self-verification, plus embedding queries for the skill library (text-embedding-ada-002).
I ran about 20 tasks, and my bill shot to nearly US$100.
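For a rough sense of scale, the figures quoted in this post can be turned into an implied cost per GPT-4 call. This is my own arithmetic under loud assumptions: 3.5 iterations per task (the middle of "three to four"), two GPT-4 calls per iteration, and embedding queries and any other calls ignored.

```python
def implied_cost_per_gpt4_call(total_bill_usd, tasks, iterations_per_task,
                               gpt4_calls_per_iteration=2):
    """Spread a total bill evenly over the estimated number of GPT-4 calls."""
    calls = tasks * iterations_per_task * gpt4_calls_per_iteration
    return total_bill_usd / calls


cost_per_task = 100 / 20
cost_per_call = implied_cost_per_gpt4_call(total_bill_usd=100, tasks=20,
                                           iterations_per_task=3.5)
print(f"about ${cost_per_task:.2f} per task, ${cost_per_call:.2f} per GPT-4 call")
```

Treat the per-call figure as an upper-bound guess only. Harder tasks take more rounds, and prompts that carry code, feedback, and retrieved skills are long, so the real spread per call is uneven.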
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.