We Boosted Our LLM Code Review Accuracy by 40% in 3 Months (Here's What Actually Worked)

By Cael Lee | 10 min read

Last quarter, our AI code reviewer was catching 52% of critical bugs. Today it's at 73%, a gain of 21 percentage points, or roughly 40% in relative terms. I'm going to walk you through exactly what we did: the wins, the face-plants, and the stuff I'd absolutely do differently if I could start over.

When we first plugged an LLM into our CI/CD pipeline back in January, I genuinely expected magic. Like, Harry Potter-level magic. What I got instead was a tsunami of false positives and a missed null pointer exception that took down our payment service at 11pm on a Saturday.

That was a fun weekend. By "fun" I mean my CTO texted me "call me" with zero context. You know exactly the vibe I'm talking about.

That's when it hit me: treating an LLM like a plug-and-play senior engineer is a recipe for disaster. It's not a senior engineer. Honestly, it's not even a junior engineer. It's a tool, and like any tool—a table saw, a Kubernetes cluster, your weirdly specific espresso machine—it needs serious calibration.

If you're leading an engineering team and exploring AI-assisted code review, here's what actually moved the needle for us. Some of this might be obvious. Most of it surprised me. Actually, wait—all of it surprised me.

TL;DR for the Skimmers

The Problem: Our LLM code reviewer was missing half the critical bugs and drowning us in false positives.

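What Worked: Narrowing the scope into single-purpose review passes, closing the feedback loop, measuring business impact instead of raw accuracy, and keeping humans in the loop.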

The Big Lesson: LLMs for code review work best when you treat them like a specialized tool, not a replacement for human judgment. Narrow the scope, close the feedback loop, and never fully automate decisions.

1. We Stopped Asking for Everything, Everywhere

Our initial prompt was basically a novel.

"Review this code for bugs, security vulnerabilities, performance issues, style violations, and adherence to best practices." I think we even threw in "and be thorough" at the end. Because that's how prompts work, right? Just add "be thorough" and the magic happens.

Spoiler: it doesn't.

The model skimmed the surface on all five areas and excelled at exactly none of them. Classic garbage-in-garbage-out. We'd get comments like "consider using a more descriptive variable name" while completely missing an unvalidated user input that was basically a welcome mat for SQL injection.

So we pivoted to a multi-pass architecture: three separate passes (security, logic, and performance), each with a hyper-specific job.

Each pass uses its own specialized prompt with few-shot examples pulled from our actual PR history. Not synthetic examples. Not "imagine you have a function that..." textbook nonsense. Real bugs that real engineers on our team shipped to production and then regretted.

Here's roughly what the security pass prompt looks like:


SYSTEM: You are a security-focused code reviewer. Your ONLY job is to identify 
security vulnerabilities in the provided code diff. Ignore all other issues.

FOCUS AREAS:
- SQL injection
- XSS vulnerabilities 
- Broken authentication/authorization
- Sensitive data exposure
- Input validation failures

You will be shown examples of real vulnerabilities we've found in our codebase.
Then you will review the new code diff.

EXAMPLE 1: [Real SQL injection from our reporting module]
EXAMPLE 2: [Real XSS from our user content feature]
... [6 more examples]

Now review this diff and flag ONLY security issues:
[CODE DIFF]

This alone took our bug detection rate from 52% to 64% in about two weeks.
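To make the multi-pass idea concrete, here is a minimal Python sketch, not our production code. The `ReviewPass` class, `build_prompt`, `run_passes`, and the `call_model` callback are hypothetical names for illustration, and the focus areas listed for the logic and performance passes are placeholders I'm assuming. Only the security pass mirrors the prompt above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReviewPass:
    name: str               # short label, e.g. "security"
    target: str             # what the pass is allowed to flag
    focus_areas: list[str]  # bullet list shown to the model
    examples: list[str]     # few-shot examples taken from past PRs


SECURITY = ReviewPass(
    name="security",
    target="security vulnerabilities",
    focus_areas=[
        "SQL injection",
        "XSS vulnerabilities",
        "Broken authentication/authorization",
        "Sensitive data exposure",
        "Input validation failures",
    ],
    examples=[
        "[Real SQL injection from our reporting module]",
        "[Real XSS from our user content feature]",
    ],
)

# Hypothetical placeholders: the post does not list these passes' focus areas.
LOGIC = ReviewPass("logic", "logic bugs", ["Incorrect state transitions"], [])
PERFORMANCE = ReviewPass("performance", "performance problems", ["Queries inside loops"], [])

PASSES = [SECURITY, LOGIC, PERFORMANCE]


def build_prompt(review_pass: ReviewPass, diff: str) -> str:
    """Assemble a single-purpose prompt for one pass over one diff."""
    focus = "\n".join(f"- {area}" for area in review_pass.focus_areas)
    shots = "\n".join(
        f"EXAMPLE {i}: {text}"
        for i, text in enumerate(review_pass.examples, start=1)
    )
    return (
        f"SYSTEM: You are a {review_pass.name}-focused code reviewer. "
        f"Your ONLY job is to identify {review_pass.target} in the provided "
        "code diff. Ignore all other issues.\n\n"
        f"FOCUS AREAS:\n{focus}\n\n"
        f"{shots}\n\n"
        f"Now review this diff and flag ONLY {review_pass.target}:\n{diff}"
    )


def run_passes(
    passes: list[ReviewPass], diff: str, call_model: Callable[[str], str]
) -> dict[str, str]:
    """Run every pass independently and collect the raw responses by pass name."""
    return {p.name: call_model(build_prompt(p, diff)) for p in passes}


if __name__ == "__main__":
    # Stand-in for a real model client so the sketch runs offline.
    results = run_passes(PASSES, "+ query = 'SELECT ' + user_input", lambda prompt: "no findings")
    print(results)
```

Keeping each pass as plain data means adding or dropping a pass is a one-line change, and the model client stays swappable because it is just a function that takes a prompt and returns text.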

The lesson? Precision beats breadth every single time. Andrew Ng talks about this constantly—narrow AI applications consistently outperform general ones in production. I should probably listen to him more instead of learning everything the hard way.

2. We Built a Feedback Loop That Actually Closes

Most teams I've talked to stop at "the model flagged it, a human reviewed it, done."

That's what we did for the first month too. And we stayed flat at 64%. Flatlined. The accuracy line on my dashboard looked like a heart monitor after bad news.

So we built something we call a review registry. It's honestly just a Postgres table—nothing fancy, no vector databases, no blockchain (I can hear the Web3 people getting excited; please stop). But it logs every single false positive and false negative: the code snippet, the model's verdict, the human reviewer's correction, and a category tag.

The critical piece: we built a little Slack bot that lets reviewers submit corrections in about 3 seconds. This matters more than you'd think. If it takes longer than 3 seconds, people won't do it. We learned that the hard way in week one when our carefully designed 90-second form got exactly zero submissions.


-- This is literally it. Told you it wasn't fancy.
CREATE TABLE review_registry (
 id SERIAL PRIMARY KEY,
 pr_number INTEGER,
 code_snippet TEXT,
 model_verdict VARCHAR(50), -- 'bug', 'clean', 'needs_review'
 human_correction VARCHAR(50),
 category VARCHAR(100), -- 'sql_injection', 'null_pointer', 'false_positive'
 reviewer_notes TEXT,
 created_at TIMESTAMP DEFAULT NOW()
);
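As a rough illustration of how a correction could land in that table, here is a small Python sketch. It uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere, which means the column types are adapted (`SERIAL` and `NOW()` become their SQLite equivalents). The helper names `record_correction` and `disagreements_by_category`, plus the PR numbers in the demo, are made up for illustration.

```python
import sqlite3

# SQLite adaptation of the registry table; in Postgres you would keep the
# original DDL and use a Postgres driver instead.
SCHEMA = """
CREATE TABLE review_registry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pr_number INTEGER,
    code_snippet TEXT,
    model_verdict TEXT,
    human_correction TEXT,
    category TEXT,
    reviewer_notes TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
"""


def record_correction(
    conn: sqlite3.Connection,
    pr_number: int,
    code_snippet: str,
    model_verdict: str,
    human_correction: str,
    category: str,
    reviewer_notes: str = "",
) -> int:
    """Insert one reviewer correction and return the new row id."""
    cur = conn.execute(
        "INSERT INTO review_registry "
        "(pr_number, code_snippet, model_verdict, human_correction, category, reviewer_notes) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (pr_number, code_snippet, model_verdict, human_correction, category, reviewer_notes),
    )
    conn.commit()
    return cur.lastrowid


def disagreements_by_category(conn: sqlite3.Connection) -> dict[str, int]:
    """Count rows where the human overruled the model, grouped by category."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM review_registry "
        "WHERE model_verdict <> human_correction "
        "GROUP BY category ORDER BY category"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Hypothetical PR number and snippet, for demonstration only.
    record_correction(conn, 101, "q = 'SELECT ' + user_input", "clean", "bug", "sql_injection")
    print(disagreements_by_category(conn))
```

A Slack bot handler would only need to call something like `record_correction` with the fields it collects, which is part of why the submission can stay so fast.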

Every Friday afternoon, our staff engineer runs a fine-tuning script on this growing dataset. We're not doing full model retraining (we're not OpenAI, and our GPU budget is... let's call it "modest" because "embarrassing" sounds unprofessional). But we use LoRA adapters on top of our base model, and the impact compounds in a way I genuinely didn't expect:

By month three, something wild started happening. The model began catching domain-specific bugs—like incorrect state transitions in our order management system—that generic linters would never find. Stuff that requires actually understanding our business logic, not just pattern matching against common bug signatures.

That's when I knew we were building institutional knowledge, not just consuming AI.

Well... that's complicated. I should say we're starting to build institutional knowledge. It's early. Ask me again in six months and I'll either sound brilliant or very, very foolish.
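For the weekly fine-tuning run, the registry rows first have to become training examples. The sketch below shows one way that export step might look, assuming a simple prompt/completion JSONL layout. That layout, the prompt wording, and the function names are my assumptions for illustration. The real format depends on whatever training stack you use, and the LoRA training itself is not shown.

```python
import io
import json
from typing import Iterable, TextIO


def to_training_example(row: dict) -> dict:
    """Turn one registry row into a prompt/completion pair (assumed format)."""
    return {
        "prompt": f"Review this code and give a verdict.\n\n{row['code_snippet']}",
        # The human correction is treated as the ground-truth label.
        "completion": row["human_correction"],
        "category": row["category"],
    }


def write_jsonl(rows: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line; skip rows no human has corrected yet."""
    written = 0
    for row in rows:
        if not row.get("human_correction"):
            continue
        out.write(json.dumps(to_training_example(row)) + "\n")
        written += 1
    return written


if __name__ == "__main__":
    # Hypothetical rows shaped like the review_registry table.
    sample_rows = [
        {"code_snippet": "q = 'SELECT ' + user_input", "human_correction": "bug", "category": "sql_injection"},
        {"code_snippet": "total = a + b", "human_correction": None, "category": "false_positive"},
    ]
    buffer = io.StringIO()
    count = write_jsonl(sample_rows, buffer)
    print(f"exported {count} example(s)")
    print(buffer.getvalue(), end="")
```

Writing to any text stream instead of a fixed file path keeps the export easy to test and easy to point at a real file on Friday afternoon.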

3. We Measured What Actually Matters (And Ignored the Rest)

Early on, I obsessed over precision and recall like everyone else. Spent hours tweaking thresholds. Made very fancy dashboards with lots of green numbers.

Nobody cared. My CEO certainly didn't. She'd glance at my beautiful Grafana dashboard for exactly 1.4 seconds and ask "but is this actually helping us ship better software faster?"

Ouch. But fair.

So we started tracking three things that actually map to business outcomes:

These numbers got budget. The accuracy percentage? That got polite nods in standup and zero additional headcount.

If you take one thing from this entire post, make it this: connect engineering metrics to customer value. It's a lesson from Marty Cagan's Empowered that I keep having to relearn. Probably will relearn it again next quarter too. Some lessons just don't stick the first time.

4. The Human-in-the-Loop Paradox

Here's the counterintuitive part. Actually, it's more than counterintuitive—it's kind of the opposite of everything I expected when we started this project.

As our LLM got more accurate, we had to increase human oversight.

Let me say that again because it sounds wrong: better AI meant we needed more human involvement, not less.

Why? Because developers started trusting the model too much. Not maliciously—just... automatically. The suggestions looked so reasonable, so well-formatted, so confident. Why would you question something that sounds that authoritative?

We caught two instances in March where a junior engineer rubber-stamped the AI's suggestion without understanding the context. One of them would have introduced a race condition in our inventory system. The kind of bug that doesn't show up in testing but manifests as "why do we have -3 items in stock" at the worst possible moment. Think Black Friday. Think customer support tickets. Think nightmares.

So we now enforce a simple rule: LLM suggestions are "advisory" for junior devs, "confirmatory" for seniors. Every AI-flagged issue must include a human comment explaining the "why" before merge. Not just "fixed" or "done" or "👍" (I see you, GitHub emoji people). An actual explanation of the reasoning.
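The "explain the why before merge" rule can be enforced mechanically. Here is a hedged Python sketch of such a gate. The `FlaggedIssue` type, the list of low-effort replies, and the five-word minimum are all assumptions for illustration. A word count is a crude proxy for "an actual explanation," so treat it as a starting point rather than the rule we run.

```python
from dataclasses import dataclass, field

# Replies that do not count as an explanation (assumed list).
LOW_EFFORT_REPLIES = {"fixed", "done", "ok", "lgtm", "👍"}
MIN_WORDS = 5  # hypothetical threshold for "an actual explanation"


@dataclass
class FlaggedIssue:
    issue_id: str
    human_comments: list[str] = field(default_factory=list)


def is_real_explanation(comment: str) -> bool:
    """Reject empty, one-word, or emoji-only replies."""
    text = comment.strip().lower().rstrip(".!")
    if text in LOW_EFFORT_REPLIES:
        return False
    return len(text.split()) >= MIN_WORDS


def blocking_issues(issues: list[FlaggedIssue]) -> list[str]:
    """Return ids of AI-flagged issues that still lack a human explanation."""
    return [
        issue.issue_id
        for issue in issues
        if not any(is_real_explanation(c) for c in issue.human_comments)
    ]


def can_merge(issues: list[FlaggedIssue]) -> bool:
    """A PR may merge only when every flagged issue has been explained."""
    return not blocking_issues(issues)


if __name__ == "__main__":
    issues = [
        FlaggedIssue("ai-1", ["👍"]),
        FlaggedIssue("ai-2", ["Added a row lock because two workers could decrement stock at once."]),
    ]
    print(blocking_issues(issues))
    print(can_merge(issues))
```

Wired into CI as a required check, a gate like this blocks the merge until someone writes down the reasoning.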

This slowed us down by maybe 8% initially. Some seniors pushed back hard—"I know why this is a bug, why do I need to write a paragraph about it?" I get it. Adding process feels bad. It feels like bureaucracy. But our escape rate dropped further, and honestly? The junior devs are learning faster because they have to articulate the reasoning instead of just clicking "accept suggestion."

Speed without safety is just technical debt with a pretty interface. I think I stole that from a conference talk. If it was yours, tell me and I'll buy you coffee next time you're in London.

What I'd Do Differently

If I could go back to January and give myself advice (besides "don't deploy on Fridays, you absolute maniac"), here's what I'd say:

  1. Start with one pass, not three. We over-engineered early. Security-only was the obvious MVP: it's the highest-stakes category and the one where false negatives hurt most. We should have nailed that before adding logic and performance passes.
  2. Build the feedback loop on day one. We lost a month of training data because we thought "we'll set up the logging infrastructure later." Later meant 30 days of missed learning. Don't be us.
  3. Talk to the skeptics first. Our biggest champion ended up being the senior architect who initially said "LLMs are just fancy autocomplete." I spent 45 minutes showing him real examples from our codebase, and he became the internal evangelist I couldn't be. Find your skeptic and win them over with data, not hype.

What's Next

I'm not sharing this because we've cracked the code. We haven't. Our accuracy is 73%, and I want it at 85% by end of year. That's probably ambitious. No, it's definitely ambitious. But the path is clearer now: narrow the scope, close the feedback loop, measure business impact, and never fully automate judgment.

We're also experimenting with something I'm tentatively calling "context-aware review"—giving the model access to the Jira ticket and design doc behind each PR. Early results are promising but messy. Maybe I'll write about that in a few months. Or maybe I'll write about how it all went terribly wrong. Either way, it'll be interesting.

What does your experience with LLM-based code review look like? Have you found a sweet spot between automation and oversight? I'm genuinely curious, so drop your numbers or horror stories in the comments. Especially the horror stories. Those are always more useful than the success stories, somehow. There's something about failure that teaches better than success ever does.

Last Tuesday I reviewed a PR where our AI caught a subtle authentication bypass that I'm 90% sure I would have missed. That's when I knew this whole experiment was worth it. That's also when I realized I'm becoming the kind of engineer who says things like "our AI caught this" in standup, and I'm not sure how I feel about that yet.

#AIEngineering #CodeReview #LLM #DevOps #EngineeringLeadership #MachineLearning

Cael Lee

Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.
