Code Review in Practice with a Large Language Model + Knowledge Base

Generated: 2026-06-21 14:42:54

---

Listen to the story I’m about to tell you.

That day I was staring at two snippets of code, agonizing over them—version A did the job in three lines but read like hieroglyphics, while version B was verbose but instantly understandable. I tossed both into Claude and casually asked, “Which one’s better?”

Guess what? Not only did it break down exactly what was bugging me in my head, but it also spotted an edge case I had completely missed. In that moment, a thought popped into my mind: What if this guy could actually help me carry some of the load?

On my way home, the idea kept buzzing, and the moment I got through the door I fired up my laptop and started building.

You wouldn’t believe it—the first hurdle almost drove me away.

Just throw company code at ChatGPT? Do I look like Samsung to you? After they let ChatGPT in, they reportedly had several data leaks within twenty days. Who’s going to take the fall for that?

To stay safe, you have to sanitize first: rewrite the specific business logic as an abstract description that gives nothing away. I tried it once. Writing that description took me over half an hour. Looking back, it would’ve been faster to just review the code myself.

But what really made me determined to make something happen was another scene entirely.

Every day, our team has to review ten to twenty merge requests. Unit tests and linters already catch the obvious dumb stuff—wrong indentation, unused variables. But the rest—whether the code design is sound, whether the business logic is correct, whether this MR might break something elsewhere—all has to be done manually. When things get busy, you know how it is: you glance at it, hit Approve, and only slap your forehead when something fails in production.

What’s worse, it’s not like the team lacks guidelines. Feishu docs are dozens of pages long, written clearly. But who actually goes back to flip through them during a review? Today Zhang San says do it this way, tomorrow Li Si says do it that way—it all depends on mood and habit.

In short, we needed someone who always has our backs, someone who sticks to the rules every single time.

First version of the code? Don’t overthink it—just get it running.

I didn’t start by applying for GPUs to run a private LLM. First, the approval process is too long. Second, who knows if this thing is even worth it. So I went with a hosted API first: DeepSeek, so cheap you wouldn’t believe it. I deposited ten bucks, and two months later I still haven’t used it all.

I wrote a script, less than twenty lines, something like this:


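import os

from openai import OpenAI  # openai>=1.0 client; DeepSeek's API is OpenAI-compatible

# Key comes from the environment (the variable name here is just an example)
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

code_diff = "Read the MR diff from here"  # placeholder for the real MR diff

review_prompt = f"Please review the following code changes, focusing on logic defects, security risks, and performance issues: {code_diff}"

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": review_prompt}],
)

print(response.choices[0].message.content)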

Boom! My first AI reviewer was born—and it was absurdly dumb.

Everything it flagged was trivial stuff like indentation, spaces, line breaks. Real logic flaws? Not a single one. AI is just like humans: if you don’t tell it what to focus on, it’ll focus on anything. So I added a rules file .ai-review.json:


{
  "priority": ["null_pointer", "sql_injection", "transaction", "sensitive_data"],
  "ignore": ["format", "naming_convention"]
}

Static formatting issues? That’s for ESLint/Pylint. Let AI handle only dynamic logic and security. Now the output was at least passable.
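A minimal sketch of how a rules file like this could be folded into the prompt. The function name build_review_prompt and the wording of the instructions are assumptions for illustration, not the original script; the rules are inlined so the snippet runs without the file on disk.

```python
import json


def build_review_prompt(rules: dict, code_diff: str) -> str:
    """Turn the priority/ignore lists into explicit instructions for the model."""
    focus = ", ".join(rules.get("priority", []))
    skip = ", ".join(rules.get("ignore", []))
    return (
        "Please review the following code changes.\n"
        f"Focus only on: {focus}.\n"
        f"Do not comment on: {skip}.\n\n"
        f"{code_diff}"
    )


# Same content as .ai-review.json, inlined here (in practice: json.load on the file).
rules = json.loads("""
{
  "priority": ["null_pointer", "sql_injection", "transaction", "sensitive_data"],
  "ignore": ["format", "naming_convention"]
}
""")

prompt = build_review_prompt(rules, "- old line\n+ new line")
print(prompt)
```

Spelling out both lists in the prompt tells the model not only what to look for but also what to stay quiet about, which is what cuts the formatting noise.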

The real core: it’s the knowledge base, not the model.

After running the API-only setup for a while, two problems kept nagging at me:

No team context. The AI didn’t know our team’s habits—like that we must use a specific wrapped tool class for network requests, never write HTTP calls directly. All that was clearly written in our Feishu docs, but the AI couldn’t see it. It was like a fresh intern who hadn’t even read the documentation.
Hallucinations galore. It would frequently make up APIs that didn’t exist or recommend libraries we never used.

That’s when it hit me: a knowledge base.

Extract our Feishu specification documents, convert them into vectors using an embedding model, and throw them into a database. For each review, first search for the specification snippets related to the current change, then feed them together with the diff to the LLM.

I used the LangChain stack. For the embedding model I chose BGE-small-zh (good for Chinese and fast), and for the vector database I used the lightweight version of Milvus. The flow was dead simple:

MR triggers → pull the diff
Retrieve relevant documentation snippets from the knowledge base (e.g., “User info endpoints must have permission checks”)
Throw the search results + diff + review instructions at the LLM
Get the comments back and post them to the corresponding code lines via the GitLab API

This step was critical. Without a knowledge base, AI just talks in generalities; with a knowledge base, it knows exactly which pitfalls your team has already stepped into.

Check out the difference: before, AI would give a suggestion like “consider using a distributed lock.” After adding the knowledge base, it became—“Consider using the company’s wrapped RedisLock utility class, see the internal doc link for reference.” Our team’s custom conventions finally came to life.
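The retrieve-then-prompt part of the flow can be sketched roughly as follows. This is not the LangChain/Milvus setup itself: embed is a toy bag-of-words stand-in for a real embedding model, retrieve is a brute-force stand-in for a vector-database search, and all function names plus the sample diff are hypothetical. The spec snippets paraphrase the conventions mentioned in this post.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(diff: str, snippets: list[str], top_k: int = 1) -> list[str]:
    # Stand-in for a vector-database search: rank every snippet against the diff.
    query = embed(diff)
    ranked = sorted(snippets, key=lambda s: cosine(query, embed(s)), reverse=True)
    return ranked[:top_k]


def build_prompt(diff: str, snippets: list[str]) -> str:
    conventions = "\n".join(f"- {s}" for s in snippets)
    return (
        "Team conventions relevant to this change:\n"
        f"{conventions}\n\n"
        "Review the diff below against these conventions:\n"
        f"{diff}"
    )


SPEC_SNIPPETS = [
    "User info endpoints must have permission checks",
    "Network requests must go through the wrapped tool class, never direct HTTP calls",
    "Distributed locks should use the wrapped RedisLock utility class",
]

diff = "+ def get_user_info(user_id):\n+     return db.query(user_id)"
hits = retrieve(diff, SPEC_SNIPPETS)
print(build_prompt(diff, hits))
```

The point of the sketch is the shape of the pipeline: only the snippets that match the change travel with the diff, so the model sees a handful of relevant rules instead of dozens of pages of docs.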

Let AI be the first line of defense, not the gatekeeper.

Integrating it into CI was a must. I added a few lines to .gitlab-ci.yml:


code_review:
  stage: test
  script:
    - python ai_review.py --mr-id $CI_MERGE_REQUEST_IID
  only:
    - merge_requests

Simple enough, but there’s a huge pitfall: never let AI automatically block or allow merges.

I’ve seen people set AI review results as a gating condition—must pass to merge. The result? Tons of false positives, developers furious, cursing in the chat every day.

So my approach is: AI gives suggestions and marks risk levels. LOW is ignored automatically. MEDIUM prompts manual confirmation. HIGH must be manually confirmed before merging.
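One way to picture this triage, as a rough sketch: the Finding class, the action names, and plan_actions are all hypothetical, and the code only decides what to surface to humans. It never approves or blocks a merge by itself.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    message: str
    risk: str  # "LOW", "MEDIUM", or "HIGH"


# Hypothetical action names mirroring the three risk levels described above.
ACTIONS = {
    "LOW": "ignore",
    "MEDIUM": "prompt_manual_confirmation",
    "HIGH": "require_manual_confirmation_before_merge",
}


def plan_actions(findings: list[Finding]) -> list[tuple[str, str]]:
    """Drop LOW findings and pair every remaining one with its follow-up action."""
    return [
        (f.message, ACTIONS[f.risk])
        for f in findings
        if ACTIONS[f.risk] != "ignore"
    ]


findings = [
    Finding("Variable could be renamed", "LOW"),
    Finding("Possible missing null check", "MEDIUM"),
    Finding("SQL built by string concatenation", "HIGH"),
]

for message, action in plan_actions(findings):
    print(f"{action}: {message}")
```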

Another detail: rule‑based scanning must not be discarded.

Null pointers, SQL injection, sensitive data exposure, missing transaction boundaries—these are highly deterministic problems and should be caught by a static rule engine. I set up two layers:

Layer 1: Rule scanning (based on AST or regex), blazing fast, directly marks definite violations as HIGH.
Layer 2: AI analysis for logic defects, context relevance, business reasonableness.

The two complement each other perfectly. Never, ever hand over the rules to AI—it’s expensive, slow, and not necessarily accurate.
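A small sketch of the two layers working together, under some loud assumptions: the two regexes are deliberately crude examples rather than a real rule set, the finding dictionaries are a made-up format, and "drop AI findings on a line the rules already flagged" is just one possible merge policy.

```python
import re

# Layer 1: crude illustrative rules. A real rule engine would use an AST.
RULES = [
    ("sql_injection", re.compile(r'''execute\w*\(\s*["'].*["']\s*\+''')),
    ("sensitive_data", re.compile(r'''(password|secret)\s*=\s*["'][^"']+["']''', re.IGNORECASE)),
]


def rule_scan(added_lines):
    """Mark every rule hit on an added line as a definite HIGH finding."""
    findings = []
    for lineno, line in added_lines:
        for name, pattern in RULES:
            if pattern.search(line):
                findings.append({"line": lineno, "rule": name, "risk": "HIGH", "source": "rules"})
    return findings


def merge_layers(rule_findings, ai_findings):
    """Rule hits win: skip AI findings on lines the rule layer already flagged."""
    flagged = {f["line"] for f in rule_findings}
    return rule_findings + [f for f in ai_findings if f["line"] not in flagged]


added_lines = [
    (10, 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)'),
    (11, 'password = "hunter2"'),
    (12, "total = price * qty"),
]

# Layer 2 output is faked here; in practice it would come from the LLM.
ai_findings = [
    {"line": 10, "rule": "sql_injection", "risk": "HIGH", "source": "ai"},
    {"line": 12, "rule": "business_logic", "risk": "MEDIUM", "source": "ai"},
]

for finding in merge_layers(rule_scan(added_lines), ai_findings):
    print(finding)
```

Running the cheap layer first also means the model can be told which lines are already covered, which keeps the expensive call focused on logic and context.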

The numbers don’t lie—we measured it.

We tested it on a real Java 8 legacy system. Comparing 50 PRs:

Human review average issues found per PR: 3.2 (including formatting and logic)
AI‑assisted (rules + AI + knowledge base): average 7.6 issues per PR

More than double!

But don’t expect a free lunch. AI’s false positive rate was about 30%. The false positives mainly came from:

It thought a variable might be null (but the upstream already checked)
It recommended newer syntax (but the legacy system doesn’t support anything beyond Java 8)
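A quick back-of-the-envelope check on those numbers. This assumes the roughly 30% false positive rate applies to all 7.6 flagged issues and that the human-found issues are all real; neither assumption is stated in the measurements, so treat the result as a rough estimate.

```python
human_per_pr = 3.2          # issues found per PR by human review
assisted_per_pr = 7.6       # issues flagged per PR with rules + AI + knowledge base
false_positive_rate = 0.30  # assumed to apply to everything the assisted setup flags

raw_ratio = assisted_per_pr / human_per_pr
real_per_pr = assisted_per_pr * (1 - false_positive_rate)
adjusted_ratio = real_per_pr / human_per_pr

print(f"raw: {raw_ratio:.2f}x")
print(f"after false positives: {real_per_pr:.2f} issues per PR ({adjusted_ratio:.2f}x)")
```

On paper that is 7.6 / 3.2 = 2.375 before filtering, and 7.6 × 0.7 = 5.32 real issues per PR afterwards, which is still about 1.66 times what human review found alone.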

That’s when the knowledge base really shined again—we also stuffed historical false positives into it as negative examples. After that, when the AI encounters a


Cael Lee
