A Closer Look at the Fudan University AI Agent Survey, Second Update (English)

Generated: 2026-06-21 23:01:28

---

After revising the Nth version of my Agent deep into the night, I downed my third cup of coffee and decided to have a real talk with you about this thing.

Let me tell you a story.

Last year, when I finished writing that survey article about Agents from Fudan and Stanford, ChatGPT was still a novelty. The term "Agent"? Honestly, hardly anyone took it seriously. People in the field would talk about it and say, "That's way too far off."

Guess what?

Less than a year later, the whole damn thing had flipped.

OpenAI made Agent a core battleground. Anthropic rolled out Computer Use—letting AI manipulate a computer screen just like a human. When Manus launched, Zhihu exploded. A ton of people asked, "Is this Agent wave actually real, or is it just hype?"

What hit me the most?

My own project team started converting a few standard RAG applications into Agent mode late last year. If I had to sum up how it felt in one sentence: I fell into so many pitfalls it made me question my entire existence.

So today, I want to talk to you again about that 86-page survey from Fudan.

But I'm not just going to rehash the paper. Tossing it at you wouldn't really help anyway. I want to combine what I've learned from crawling around in those pits over the past year to help you understand three things:

What can an Agent actually do? How does it do it? And where is it most likely to crash?

Trust me, you're going to have to take this ride sooner or later.

---

Reset Your Thinking: Don't Treat the Agent as a "New Species" — It's Just a "Worker with a Brain"

Every time I talk about this topic, someone asks me:

"What's the relationship between an Agent and a large language model? We haven't even figured out LLMs yet, and now here comes Agent?"

I get it. I really do.

See, most people's gut reaction to a new concept is "Oh great, another thing." But there's a passage in the Fudan survey that made me slam the table when I read it.

The AI Agent didn't just pop out of nowhere.

Think about it. Back in the 1950s, when Turing extended the concept of "intelligence" to artificial entities, the seed of the Agent was already planted. During the reinforcement learning era—Q-learning, SARSA, Deep Q-Network, then AlphaGo—they were all essentially Agents.

So why didn't it blow up before?

Simple—the old Agents were limited by the model. Weak generalization, slow training, narrow applications. You'd design a Q-learning algorithm from scratch, tinker with the reward function for ages, and all you'd get out of it was playing a single game.

But after large language models (LLMs) came out?

Everything changed.

LLMs gave the Agent a truly powerful brain—born with language understanding, reasoning, and generalization, without needing to be trained from scratch the old way. So what's the relationship between an LLM and an Agent?

The LLM is the chassis, and the Agent is the driver sitting behind the wheel.

There's a remarkably apt metaphor in the Fudan survey: LLMs are regarded as potential "sparks" for AGI, and LLM-based Agents are how those sparks get put to work. Sparks aren't flames, but they make ignition possible.

Now, let me give you my own more straightforward take—listen to this:

Before, when we called a large model, it was "you ask, I answer."

With Agent mode?

I give you a goal, and you break it down, do the work, and tell me when it's done.

The difference isn't the model itself—it's how you use it.

---

What Does an Agent Actually Look Like? The Framework from Fudan Is Essentially "Brain + Senses + Hands & Feet + Diary"

The Fudan survey proposed a three-module structure: Brain, Perception, and Action.

Later, the industry added Memory, making it four modules.
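
To make the four modules concrete, here is a minimal Python sketch of how they might fit together. This is my own illustration, not code from the Fudan survey: the `Agent` class, its field names, and the lambda stand-ins for a parser, an LLM call, and tools are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Hypothetical skeleton: one field per module named in the text."""
    perceive: Callable[[str], str]            # Perception: raw input -> observation
    think: Callable[[str, List[str]], str]    # Brain: observation + memory -> tool name
    tools: Dict[str, Callable[[str], str]]    # Action: tool name -> callable
    memory: List[str] = field(default_factory=list)  # Memory: running log

    def step(self, raw_input: str) -> str:
        observation = self.perceive(raw_input)
        tool_name = self.think(observation, self.memory)
        result = self.tools[tool_name](observation)
        self.memory.append(f"{tool_name}: {result}")
        return result


# Stand-ins for a real parser, a real LLM call, and real tools.
agent = Agent(
    perceive=lambda raw: raw.strip().lower(),
    think=lambda obs, mem: "search" if "find" in obs else "echo",
    tools={"search": lambda q: f"results for '{q}'", "echo": lambda q: q},
)
```

Calling `agent.step("  Find flights ")` runs one pass of perceive, think, act, remember. In a real system the `think` function would be an LLM call, which is exactly where most of the trouble described below comes from.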

Why do I particularly resonate with this framework?

Because I took a major fall over this.

Let me walk you through each one. I promise: no jargon, all plain talk.

Perception Module: This Is the Agent's "Eyes and Ears"

It receives external information—text, images, audio, JSON from API calls, DOM structure from web pages.

Think about it: can your Agent read a user's uploaded PDF? Can it parse a web table? That determines the types of scenarios it can handle.

I had a project where the Agent was supposed to scrape public tender information. Guess where the first bottleneck was?

As soon as the web page structure changed, the perception module crashed. The hardcoded parsing rules became useless.

See, if you hardcode even one part of this, it breaks.
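
One way to soften this, sketched below under my own assumptions, is to key the parser on the table's structure and header text instead of class names or fixed positions, and to fail loudly when nothing matches. The `TableRows` class and `parse_tenders` function are hypothetical names; the sketch uses only Python's standard `html.parser` and is not what my project actually shipped.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collects the text of every table row, ignoring class names and ids."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_tenders(html: str) -> List[dict]:
    """Map each body row to the header names; raise if no table is found."""
    parser = TableRows()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if len(rows) < 2:
        raise ValueError("no tender table found; page layout may have changed")
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

This still breaks if the site drops the table entirely, but a renamed CSS class no longer takes it down, and the explicit `ValueError` gives the rest of the Agent a signal to act on instead of silently returning garbage.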

Brain Module: This Is the LLM, But Don't Misuse It

The Fudan survey emphasizes that the brain is responsible for reasoning and planning.

For example, when the user says, "Book me a flight to Shanghai next Tuesday," the brain needs to break it down into: confirm date → search flights → filter by budget → present options → execute booking after user confirmation.
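
The breakdown above can be written down as plain data, with the irreversible step gated behind user confirmation. This is only a sketch: `Step`, `FLIGHT_PLAN`, and `run_plan` are names I made up, and the steps do no real work here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    needs_confirmation: bool = False


# The five steps from the flight example, in order.
FLIGHT_PLAN = [
    Step("confirm date"),
    Step("search flights"),
    Step("filter by budget"),
    Step("present options"),
    Step("execute booking", needs_confirmation=True),
]


def run_plan(plan: List[Step], user_confirmed: bool) -> List[str]:
    """Return the names of steps that ran; stop at an unconfirmed gate."""
    done = []
    for step in plan:
        if step.needs_confirmation and not user_confirmed:
            break
        done.append(step.name)
    return done
```

With `user_confirmed=False` the plan stops after "present options", which is the behavior you want before any money moves.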

You want to know what happened when I used an open-source small model for planning?

It told me, dead serious, "Already booked," even though nothing had actually happened.
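
A cheap guard against this kind of false claim is to check the model's reply against the record of tool calls that actually ran. The sketch below assumes a hypothetical tool named `book_flight` and a simple keyword check; a real system would need something sturdier than matching the word "booked".

```python
from typing import List


def verified_reply(model_reply: str, tool_calls: List[str]) -> str:
    """Reject a 'booked' claim unless a booking tool call was actually recorded."""
    claims_booking = "booked" in model_reply.lower()
    if claims_booking and "book_flight" not in tool_calls:
        # The model said it booked, but the log shows no booking call.
        return "I have not completed the booking yet."
    return model_reply
```

The design point is that the tool log, not the model's text, is the source of truth about what happened.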

A large language model without reasoning ability is just confidently spouting nonsense.

Action Module: If You Want It to Actually Do Things, You Have to Give It Tools

Search, calculation, sending emails, manipulating files, calling a code interpreter…

OpenAI's code interpreter is a classic tool. Anthropic's Computer Use is even more interesting—it gives the Agent a "see screen → click mouse → hit keys" interface, letting it operate a computer just like a human.

I had a colleague test it. Guess where it got stuck?

CAPTCHAs. It got stuck three times and needed human intervention in the end.

So tool invocation can never be 100% reliable. You have to accept that during the design phase.
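
Accepting that in the design can be as simple as wrapping every tool call in a bounded retry with a human fallback. Here is a minimal sketch; `call_with_fallback` and its parameters are hypothetical, and the three-attempt default just mirrors the CAPTCHA story above.

```python
from typing import Callable, Optional


def call_with_fallback(
    tool: Callable[[], str],
    max_attempts: int = 3,
    ask_human: Optional[Callable[[], str]] = None,
) -> str:
    """Try the tool a bounded number of times, then hand off to a human."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool()
        except Exception as exc:  # any tool failure counts as one attempt
            last_error = exc
    if ask_human is not None:
        return ask_human()
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error
```

If no human handler is supplied, the wrapper raises instead of pretending the call succeeded, so the failure stays visible to whatever is orchestrating the Agent.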

Memory Module: The Pitfall I Initially Completely Ignored

The first time I built a customer service Agent, once the conversation went beyond ten rounds, it would forget key information the user mentioned at the beginning.

Think about it: a customer service rep that can't even remember what you said at the start—how can you trust it?

Later, I used a two-tier structure: short-term memory (for the current session) and long-term memory (for user preferences and history). Only then did it stabilize.
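
Roughly, the two-tier idea looks like the sketch below: a bounded window for the current session plus a key-value store for facts that must survive it. The class and method names are my own for illustration, and a production version would persist the long-term tier somewhere other than a dict.

```python
from collections import deque
from typing import Deque, Dict, List


class TwoTierMemory:
    def __init__(self, short_term_size: int = 10) -> None:
        # Short-term: only the most recent turns of the current session.
        self.short_term: Deque[str] = deque(maxlen=short_term_size)
        # Long-term: user preferences and history, kept across turns.
        self.long_term: Dict[str, str] = {}

    def add_turn(self, text: str) -> None:
        self.short_term.append(text)

    def remember(self, key: str, value: str) -> None:
        self.long_term[key] = value

    def build_context(self) -> List[str]:
        """Long-term facts first, then the recent turns, ready for a prompt."""
        facts = [f"{k}: {v}" for k, v in self.long_term.items()]
        return facts + list(self.short_term)
```

Old turns fall out of the short-term window automatically, but anything stored with `remember` stays in the context no matter how long the conversation runs.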

An Agent without memory is like a patient with amnesia—every time you talk, it

Cael Lee
