Chain‑of‑Thought Techniques That Can Solve Complex Problems: CoT, ToT, GoT, AoT

Generated: 2026-06-20 15:33:02

---

Don’t say I didn’t warn you! To really get to the bottom of this whole chain‑of‑thought business, I locked myself in front of my computer for three solid weeks—battling it out with these techniques till the bitter end.

Let me give you the punchline right up front: all that flashy stuff from 2023—CoT, ToT, GoT, AoT—boils down to one simple trick: trading the model’s own compute for better results. But if you go straight for the fanciest method, chances are you’re just digging a hole for yourself. Think about it—doesn’t that ring true?

I threw math problems, planning tasks, document aggregation, and even a little backtracking game I made up at GPT‑3.5 and GPT‑4. Tested every angle. And what I found is this: the only techniques that can really hold their own in a production environment are probably CoT and its minor variants. The rest? More like academic self‑indulgence—clever tricks that impress in a paper but fall short of “actually useful” by a whole mountain of engineering crap.

CoT – Looks the simplest, hits the hardest!

The biggest selling point of Chain‑of‑Thought (CoT) is just one word: cheap! Stick “let’s think step by step” in your prompt, or throw in an example with intermediate steps, and the model starts reasoning pretty convincingly. And its performance is tightly tied to model size. When I compared GPT‑3.5 and GPT‑4, CoT gave a huge boost to GPT‑4—accuracy jumped from a bare‑bones 40% to over 70%! As for those little 7B and 13B models? Adding CoT made no difference—sometimes it even made things worse, with the model starting to ramble.
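To make those two flavors concrete, here is a minimal sketch that assumes nothing more than plain string templates. The helper names `zero_shot_cot` and `few_shot_cot` and the `Q:`/`A:` layout are illustrative choices, not a required format; only the trigger phrase comes from the text above.

```python
# Hypothetical helpers: names and prompt layout are assumptions for illustration.
ZERO_SHOT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase after the bare question."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: prepend worked (question, reasoning, answer) examples
    so the model imitates the intermediate steps."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


print(zero_shot_cot("What is 17 * 3?"))
```

Either string goes to the model as the prompt; nothing else in the pipeline changes, which is exactly why this one is so cheap.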

But CoT has a hidden trap: one wrong step and everything falls apart! I tested a logic puzzle—three people, some telling the truth, some lying, who stole the thing? The chain CoT produced had a single misstep: “If A tells the truth, then B lies” was reversed, and the whole thing went off the rails from there. The output was smooth as butter—and completely wrong. That’s error propagation in action.

So how do you fix it? Self‑Consistency gives you a brute‑force trick: generate multiple CoT chains and let them vote. I tried it—with 5 chains, accuracy jumped from 58% to 82%; with 10 chains, it hit 86%. The price? API calls multiplied by 5 or 10, and both latency and cost went up. But the upside is you don’t have to change your prompt—just ask more times.
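The voting step is simple enough to sketch in a few lines. This is a rough version under one big assumption: `sample_chain` is a hypothetical callable that runs one CoT chain at a non-zero temperature and returns only the extracted final answer. How you call the model and parse the answer is up to you.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(
    sample_chain: Callable[[str], str],  # hypothetical: one CoT run -> final answer
    prompt: str,
    n_chains: int = 5,
) -> tuple[str, float]:
    """Sample n_chains answers and return (majority answer, share of votes)."""
    answers = [sample_chain(prompt) for _ in range(n_chains)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_chains


# Stand-in for a real model: five canned answers, three of which agree.
canned = iter(["B", "A", "B", "B", "C"])
answer, share = self_consistent_answer(lambda _p: next(canned), "Who stole it?")
print(answer, share)  # B 0.6
```

The vote share doubles as a cheap confidence signal: a low share is a hint that the question deserves a human look.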

So my first conclusion is: if CoT can do the job, don’t mess with anything else! For most business scenarios—customer service Q&A, document extraction, simple reasoning—CoT plus voting can already get you above 85%. Trying to push higher? The marginal returns drop off fast, and both you and your boss will wince at the cost.

ToT – Grand in theory, heartbreaking in practice!

When I first heard about Tree‑of‑Thought (ToT), I was pumped! Finally, a way for the model to explore multiple paths on its own and backtrack when needed. In the paper, the Game of 24 example took GPT‑4 from a single‑digit success rate with CoT up to 74% with ToT! That number had me all in.

Then I fell into the pit. Guess what?

First, implementing ToT needs two agents: a proposal generator and an evaluator. Both are LLMs, but you have to write different prompts for each. Think about it—your task is already complex enough that you need a tree structure, and now you also have to craft a perfect evaluation prompt that makes the model score each intermediate idea (sure/likely/impossible). That’s a whole new prompt‑engineering headache!
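Here is roughly what that two‑agent loop looks like, as a sketch of a breadth‑first variant with a small beam. `propose` and `evaluate` are hypothetical stand‑ins for the two LLM prompts, and the numeric scores attached to sure/likely/impossible are my assumption, not something fixed by the technique.

```python
from typing import Callable, Optional

# Verdict labels from the evaluation prompt; the numeric values are an assumption.
VERDICT_SCORE = {"sure": 2, "likely": 1, "impossible": 0}


def tree_of_thought(
    root: str,
    propose: Callable[[str], list[str]],  # hypothetical LLM call: state -> next states
    evaluate: Callable[[str], str],       # hypothetical LLM call: state -> verdict label
    is_solution: Callable[[str], bool],
    beam_width: int = 3,
    max_depth: int = 4,
) -> Optional[str]:
    """Level-by-level search that keeps the best beam_width states per level."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        scored = [(VERDICT_SCORE.get(evaluate(c), 0), c) for c in candidates]
        # States judged "impossible" (or given an unknown label) are pruned here.
        viable = [(score, c) for score, c in scored if score > 0]
        for _, c in viable:
            if is_solution(c):
                return c
        viable.sort(key=lambda sc: sc[0], reverse=True)
        frontier = [c for _, c in viable[:beam_width]]
        if not frontier:
            return None  # every branch was pruned
    return None  # depth budget exhausted


# Toy stand-ins: build strings from "a"/"b", forbid "bb", look for "aba".
found = tree_of_thought(
    "",
    propose=lambda s: [s + "a", s + "b"],
    evaluate=lambda s: "impossible" if "bb" in s else "likely",
    is_solution=lambda s: s == "aba",
)
print(found)  # aba
```

Notice that every level costs one `propose` call per frontier state plus one `evaluate` call per candidate. That multiplication is where the call count comes from.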

I tried it on travel planning—given a budget, preferences, and time constraints, get the model to plan an itinerary. The CoT version could produce steps sequentially, but often missed conflicts like museum hours clashing on a certain day. I figured ToT would generate multiple plans and self‑evaluate, so it should handle that. And what happened? The proposal agent spat out 6 routes at once, and the evaluation agent scored each one. But the scores were all over the place! A perfectly reasonable route got “impossible” while a nonsensical one got “sure.” Why? Because the evaluation agent only did shallow reasoning and didn’t spot the conflict at all. Ask it to think carefully, and you need more rounds of calls, blowing up the overhead.

I ran 10 travel‑planning cases with GPT‑4. ToT’s accuracy was 35%, while CoT with manual spot‑checking hit 30%. As for API calls, ToT averaged 70 per task (generation plus evaluation), while CoT with checking took only 15. Is paying roughly 4.7x the calls for five extra points of accuracy worth it? My boss’s answer: hell no.
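Running those numbers makes the gap easier to see. The "calls per correct answer" metric below is my own back‑of‑the‑envelope framing, assumed for illustration; the inputs are just the figures from the paragraph above.

```python
def calls_per_correct(calls_per_task: float, accuracy: float) -> float:
    """Average API calls spent for each task that ends up correct."""
    return calls_per_task / accuracy


tot = calls_per_correct(70, 0.35)  # ToT: 70 calls per task at 35% accuracy
cot = calls_per_correct(15, 0.30)  # CoT with checking: 15 calls at 30% accuracy

print(f"raw call ratio:       {70 / 15:.1f}x")
print(f"ToT calls per correct: {tot:.0f}")
print(f"CoT calls per correct: {cot:.0f}")
```

Per task, ToT costs about 4.7 times the calls. Per correct answer, it works out to about 200 calls against 50, so the accuracy bump doesn't come close to paying for itself.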

So my second conclusion is: ToT is only for closed problems with a limited solution space and clear validation, like the Game of 24, Sudoku, or logic puzzles. For open‑ended tasks that require common sense, ToT’s evaluation step becomes another hard problem in itself. Tune it right, and it’s magic; tune it wrong, and it’s just wasted calls.

That said, if you really have a scenario where you must try multiple paths and backtrack when stuck, ToT is still less painful than writing your own backtracking logic by hand. But there’s a catch—your evaluation prompt has to be good enough that the model knows what “good” looks like at every step.

GoT and AoT – One’s too wild, the other’s too picky!

GoT (Graph‑of‑Thought) turns the reasoning structure from a tree into a graph. Nodes can merge, depend on each other, even form loops. It sounds beautiful, because human thinking really isn’t linear—we often mash two lines of thought together or circle back to refine a concept. The paper showed examples like sorting arrays, aggregating documents, counting word frequencies—the idea is to represent dependencies between subtasks as a graph and then have the LLM execute them in topological order.

I tried GoT on a multi‑turn dialogue summarization task: split the conversation into logical chunks, summarize each chunk separately, merge them, then refine the merged result a second time, and backtrack if something’s off. Sounds like a perfect fit for a graph, right? But when I actually wrote the prompt, I had to manually describe which nodes were inputs to which other nodes, when to merge, and what the merge rules were. That’s basically making the LLM follow a directed graph algorithm I designed—so why not just write a script myself and call the API once per node?
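For the record, "write a script myself" can be this small. The sketch below covers only the acyclic part of the pipeline and leans on Python's standard `graphlib` module; `run_node` is a hypothetical wrapper around one API call per node, and the node names are made up for the summarization example.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_graph(
    deps: dict[str, list[str]],                 # node -> nodes whose output it consumes
    run_node: Callable[[str, list[str]], str],  # hypothetical: one LLM call per node
) -> dict[str, str]:
    """Run every node after its dependencies and collect all outputs."""
    outputs: dict[str, str] = {}
    for node in TopologicalSorter(deps).static_order():
        inputs = [outputs[d] for d in deps.get(node, [])]
        outputs[node] = run_node(node, inputs)
    return outputs


# Summarize two chunks, merge the summaries, then refine the merged result.
pipeline = {
    "merge": ["chunk_1", "chunk_2"],
    "refine": ["merge"],
}
# Stub that shows the call nesting instead of hitting a model.
results = run_graph(pipeline, lambda node, inputs: f"{node}({', '.join(inputs)})")
print(results["refine"])  # refine(merge(chunk_1(), chunk_2()))
```

A loop in the graph makes `TopologicalSorter` raise `CycleError`, so "refine again" or "backtrack if something's off" needs an ordinary `while` loop around the relevant nodes rather than an edge in the graph.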

Look at it another way: GoT is valuable if you already have a deep understanding of the solution path for a given domain. You can explicitly encode it as a graph and have the LLM walk through it step by step. But that “if” is a huge condition. In most scenarios, we don’t even have a clear picture of the problem’s structure, so what graph are we supposed to draw? Right now, GoT is more for research—engineering‑wise, it’s terrible value for money.

As for AoT (Algorithm of Thoughts), the idea is solid: bake algorithm examples (like DFS or binary search) into the prompt so the model mimics that algorithmic trajectory. The paper says it works great on GPT‑4. I tested it on a problem about finding the minimum under inequality constraints. I put a worked binary search example in the prompt, and sure enough, the model followed the binary search steps to close in on the solution. The result was noticeably more accurate than CoT.
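One way to produce that kind of in‑prompt example is to run the real algorithm once and log each step as text. The sketch below does this for binary search over integers; the constraint `3 * x >= 20`, the step wording, and the `aot_prompt` layout are all assumptions for illustration, not the prompt I actually used.

```python
from typing import Callable


def binary_search_trace(
    lo: int, hi: int, ok: Callable[[int], bool]
) -> tuple[int, list[str]]:
    """Find the smallest x in [lo, hi] with ok(x) True, logging every step.

    Assumes ok is monotone (False ... False, True ... True) and ok(hi) is True.
    """
    steps = []
    while lo < hi:
        mid = (lo + hi) // 2
        if ok(mid):
            steps.append(f"Try x={mid}: constraint holds, so search [{lo}, {mid}].")
            hi = mid
        else:
            steps.append(f"Try x={mid}: constraint fails, so search [{mid + 1}, {hi}].")
            lo = mid + 1
    steps.append(f"Interval collapsed: the minimum is x={lo}.")
    return lo, steps


def aot_prompt(example_steps: list[str], new_problem: str) -> str:
    """Put the logged trace in front of the new problem (hypothetical layout)."""
    worked = "\n".join(example_steps)
    return (
        f"Worked example (binary search):\n{worked}\n\n"
        f"Now solve the same way:\n{new_problem}\n"
    )


minimum, trace = binary_search_trace(0, 16, lambda x: 3 * x >= 20)
print(aot_prompt(trace, "Find the smallest integer x in [0, 100] with 7 * x >= 150."))
```

Generating the trace from real code keeps the example free of arithmetic slips, which matters because the model will copy whatever pattern it sees, mistakes included.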

But AoT has a fatal flaw: it demands an extremely capable LLM! When I switched from GPT‑4 to GPT‑3.5, AoT fell apart. GPT‑3.5 just couldn’t mimic recursive or backtracking logic; the search steps it generated often skipped crucial checks. That tells me AoT is all about squeezing the potential of big models. If the model isn’t big enough, injecting an algorithm won’t help.

So my third conclusion is: both GoT and AoT are heavily dependent on manual knowledge distillation. You have to deeply understand the solution path, abstract it into a graph or algorithm, and then feed it to the model. If you already know the algorithm, why not just write code and execute it? Oh, “to use the LLM for flexibility.” But flexibility comes at the cost of prompt maintenance and debugging complexity.


Cael Lee
