关于 Tool Use 的 Agent 工程师面试题目 (English)
关于 Tool Use 的 Agent 工程师面试题目 (English)
Generated: 2026-06-24 04:50:44
---
Those Interview Questions Don't Screen Candidates—They Screen the Honest Ones
A friend texted me late last night.
He said he was training a group of interviewers from several big tech companies for Agent positions. He flipped through the questions they'd prepared and just burst out laughing.
"What is Function Calling?"
"What's the difference between ReAct and CoT?"
"What problem does the MCP protocol solve?"
— Tell me, are you recruiting or quizzing them on textbook definitions?
He said last year they hired three candidates who could rattle off answers to these questions perfectly. In the first month on the job, every single one of them crashed the production Agent system.
One didn't know how to roll back after a tool call failure, and directly locked the production database into read-only mode.
The other was worse. He let the model loop and call an API over forty times in a row. When the bill came, the CTO's face turned green.
Those interview questions of yours aren't filtering for smart people. They're filtering for honest ones.
---
Level One: Has this person actually written real code?
When I interview, I never ask about concepts.
I just throw out a question: "In that Agent project you worked on recently, what was the longest chain of tool calls? And what was your rollback strategy for each step?"
Most people freeze on the spot.
Anyone who can say "I tried OpenAI's weather query demo" has clearly never run a real project. Believe me, they fall apart the moment they join the team.
Someone who's actually done the work will answer without missing a beat—
"We had to query the database for user info first, then decide which pricing API to call based on the user tier, validate inventory after getting the price, and finally go through an approval workflow. I set a timeout with retries for every step. After three retries it would degrade to a cache fallback."
Hearing an answer like that, I'm already applauding in my head.
Last year I interviewed about forty people. I could count on one hand the ones who went into that level of detail.
Most people's answers stayed at one level: "I used the tools parameter and passed in a JSON Schema."
"What was your success rate?"
"About 80%."
"Give me a failure case."
— They'd start hemming and hawing.
If you haven't been tortured by tool calls in production, you'd never say something like this:
"Sometimes the model messes up the parameter format—like putting 2024/13/01 for a date. I had to add another layer of parameter validation at the tool level."
See, that's how ugly the real world is.
In my own project, I raised tool call success rates from 85% to 97%. It wasn't because the model got smarter—it was three full layers of defense: input validation, output validation, and result fallback.
So, interviewers, could you please stop asking useless questions like "Have you used Function Calling?" Just ask what kind of trouble they've run into. That's way more useful.
---
Level Two: Can you tell the difference between "it works" and "it works well"?
Speaking of which, a lot of people can recite the differences between native Function Calling and ReAct Prompting backward and forward. "Native uses fewer tokens," "ReAct is more flexible."
But if you ask me about my actual experience, I'd tell you something else.
The biggest advantage of native Function Calling isn't token savings.
It's that it separates "talking" from "doing."
Think about it: if the model is thinking "I need to look up the weather" and simultaneously replying to the user "Sure, I'll check that for you"—the output is mixed with chain-of-thought and conversational text. How painful is that to parse downstream?
I had a project that started with a pure prompt-based ReAct.
Every single round, the model would output: "I need to call tool X to get information Y. Let's get started! Okay, now calling:…"
I wrote forty or fifty lines of regex just to parse that nonsense.
After switching to native Function Calling, the code shrank from over a hundred lines to twenty.
The model didn't get smarter—the output structure got cleaner, and the state machine got simpler.
But there are downsides too.
Native Function Calling has limited ability to describe complex tools. I tried giving a tool description over 800 tokens, and the model just ignored it, returning an empty list.
I had to patch it with prompt engineering: put the most commonly used tools in the system prompt, and only put secondary ones in the tools parameter.
Isn't that ironic? You think the standard solution will solve everything, and then you realize you still have to balance between "it works" and "it works well."
Someone told me: "Models are getting stronger every day; these problems will gradually disappear."
I can only say: naive.
The stronger the model, the more people want it to do complex things. Today you think 800 tokens is the ceiling for a tool description; tomorrow someone will stuff a 2000-token description in there. This cycle never stops.
---
Level Three: Do you really understand MCP, or are you just reciting definitions?
MCP is being asked to death in interviews now.
Out of ten candidates, nine can say "standardized protocol, solves the N×M integration problem."
But if I follow up with: "How many tool services have you actually deployed using MCP?"
— Silence.
I tried MCP in my own project earlier this year.
Honestly, it was a pain.
First, its server side isn't well polished. I tried Anthropic's official Python SDK and found the default timeout is 30 seconds. My tool is an internal data analysis that sometimes takes a minute or two for complex queries.
By the time it finished, Claude Desktop had already disconnected.
I had to dig into the code and rewrite the Transport layer myself. A simple timeout configuration took me two days.
But that doesn't mean MCP is bad.
On the contrary, I'm very optimistic about this direction.
The reason is simple: it decouples tools from models.
Before, if you used LangChain's Tool spec, then later switched to LlamaIndex, you'd have to rewrite everything. Now you just implement an MCP Server once, and any client can use it.
My current approach: package all internal tools with MCP, then register them all in a Gateway service.
The front-end Agent doesn't care whether it's Claude, GPT, or an open-source model—as long as it implements an MCP Client, it can call any tool seamlessly.
Once you get this right, you never have to worry about "changing models means rewriting tool integration code."
But is MCP a silver bullet?
Of course not.
Its biggest problem right now is that it doesn't handle access control and audit logging well.
You call a database query tool through MCP—who called it? What data did they query? What results came back? MCP doesn't record this by default. You have to add it manually on the server side.
So if interviewers really want to test MCP, don't ask about protocol definitions.
Ask directly: "When you deployed your MCP Server, how did you log everything? How did you manage access?"
Whoever can answer that clearly is someone who's actually done the work.
---
Level Four: Is your Agent still reliable after ten steps?
This is where I think most Agent interviews fall apart.
A lot of people can give a brilliant explanation of ReAct principles, and can fluently describe the differences between Plan-Then-Act, ReAct with planning, and Tree Planning.
But if you ask: "After ten steps, can your Agent still remember the original task?"
— They go silent.
A huge pitfall I experienced: I had an Agent research three papers on RAG + reinforcement learning.
At first, the Agent ran perfectly: search papers → read abstracts → download full texts → extract core ideas.
By step five, it found an interesting citation in one paper and started following the citation chain to check other papers. Then it felt another paper's experimental data wasn't sufficient, so it went to look at the author's GitHub.
By the time I looked at the log, it was on step fourteen.
In the output, the original task "research three papers" was only one paper done. The rest was analysis of cited papers.
This is what I often call "context drift."
The model didn't get stupid—it's just that at each step, the model is making "the most reasonable next move given the current context." But the most reasonable isn't always the most aligned with the original goal.
How do I solve this?
My current approach has three layers.
First layer: explicitly write a sentence in the System Prompt: "Every three steps, review the original task goal and confirm you haven't drifted." It's cheap but effective. In my experiments, adding this sentence reduced drift from 30% to 8%.
Second layer: add a state machine at the code level. Something like this:
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.