We Boosted Our LLM Code Review Accuracy by 40% in 3 Months (Here's What Actually Worked)
Last quarter, our AI code reviewer was catching 52% of critical bugs. Today it's at 73%. I'm going to walk you through exactly what we did—the wins, the face-plants, and the stuff I'd absolutely do differently if I could start over.
When we first plugged an LLM into our CI/CD pipeline back in January, I genuinely expected magic. Like, Harry Potter-level magic. What I got instead was a tsunami of false positives and a missed null pointer exception that took down our payment service at 11pm on a Saturday.
That was a fun weekend. By "fun" I mean my CTO texted me "call me" with zero context. You know exactly the vibe I'm talking about.
That's when it hit me: treating an LLM like a plug-and-play senior engineer is a recipe for disaster. It's not a senior engineer. Honestly, it's not even a junior engineer. It's a tool, and like any tool—a table saw, a Kubernetes cluster, your weirdly specific espresso machine—it needs serious calibration.
If you're leading an engineering team and exploring AI-assisted code review, here's what actually moved the needle for us. Some of this might be obvious. Most of it surprised me. Actually, wait—all of it surprised me.
TL;DR for the Skimmers
The Problem: Our LLM code reviewer was missing half the critical bugs and drowning us in false positives.
What Worked:
- Split one giant prompt into three focused passes (security, logic, performance) → detection rate up 12 points (52% to 64%) in 2 weeks
- Built a feedback loop that logs every mistake and fine-tunes weekly → steady compound improvements
- Measured business impact instead of ML metrics → got actual budget approval
- Increased human oversight as the AI got better (counterintuitive but crucial)
The Big Lesson: LLMs for code review work best when you treat them like a specialized tool, not a replacement for human judgment. Narrow the scope, close the feedback loop, and never fully automate decisions.
1. We Stopped Asking for Everything, Everywhere
Our initial prompt was basically a novel.
"Review this code for bugs, security vulnerabilities, performance issues, style violations, and adherence to best practices." I think we even threw in "and be thorough" at the end. Because that's how prompts work, right? Just add "be thorough" and the magic happens.
Spoiler: it doesn't.
The model skimmed the surface on all five areas and excelled at exactly none of them. Classic garbage-in-garbage-out. We'd get comments like "consider using a more descriptive variable name" while completely missing an unvalidated user input that was basically a welcome mat for SQL injection.
So we pivoted to a multi-pass architecture. Three separate passes, each with a hyper-specific job:
- Pass 1: Security vulnerabilities only. OWASP Top 10 focus, nothing else. We included 8 real examples from our own codebase where we'd been burned before: SQL injection in the reporting module, XSS in user-generated content, and a broken access control check that let regular users see admin dashboards. Real scars, real examples.
- Pass 2: Logic bugs and edge cases. Null pointers, off-by-one errors, state machine violations. This one catches the stuff that keeps you up at night. The bugs that don't crash your app—they just corrupt your data silently for three weeks until someone notices.
- Pass 3: Performance anti-patterns. N+1 queries, memory leaks, blocking I/O on the main thread. The things that don't break your app immediately but make it crawl at 3am when traffic spikes and suddenly your pager won't stop screaming.
Each pass uses its own specialized prompt with few-shot examples pulled from our actual PR history. Not synthetic examples. Not "imagine you have a function that..." textbook nonsense. Real bugs that real engineers on our team shipped to production and then regretted.
Here's roughly what the security pass prompt looks like:
SYSTEM: You are a security-focused code reviewer. Your ONLY job is to identify
security vulnerabilities in the provided code diff. Ignore all other issues.
FOCUS AREAS:
- SQL injection
- XSS vulnerabilities
- Broken authentication/authorization
- Sensitive data exposure
- Input validation failures
You will be shown examples of real vulnerabilities we've found in our codebase.
Then you will review the new code diff.
EXAMPLE 1: [Real SQL injection from our reporting module]
EXAMPLE 2: [Real XSS from our user content feature]
... [6 more examples]
Now review this diff and flag ONLY security issues:
[CODE DIFF]
This alone took our bug detection rate from 52% to 64% in about two weeks.
The lesson? Precision beats breadth every single time. Andrew Ng talks about this constantly—narrow AI applications consistently outperform general ones in production. I should probably listen to him more instead of learning everything the hard way.
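For illustration, here's a minimal sketch of how a multi-pass setup like the one above could be wired up. Treat it as an assumption-heavy outline rather than production code: `ReviewPass`, `build_prompt`, `run_passes`, and `call_model` are hypothetical names, and `call_model` stands in for whatever LLM client you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReviewPass:
    """One narrow review job (hypothetical structure)."""
    name: str                # e.g. "security", "logic", "performance"
    focus_areas: List[str]   # the only things this pass may flag
    examples: List[str]      # few-shot examples pulled from real PR history


def build_prompt(review_pass: ReviewPass, diff: str) -> str:
    """Assemble a single-purpose prompt: one job, its focus areas, its examples."""
    lines = [
        f"You are a {review_pass.name}-focused code reviewer. "
        f"Your ONLY job is to identify {review_pass.name} issues "
        "in the provided code diff. Ignore all other issues.",
        "FOCUS AREAS:",
    ]
    lines += [f"- {area}" for area in review_pass.focus_areas]
    for i, example in enumerate(review_pass.examples, start=1):
        lines.append(f"EXAMPLE {i}: {example}")
    lines.append(f"Now review this diff and flag ONLY {review_pass.name} issues:")
    lines.append(diff)
    return "\n".join(lines)


def run_passes(
    diff: str,
    passes: List[ReviewPass],
    call_model: Callable[[str], str],  # placeholder for your LLM client
) -> Dict[str, str]:
    """Run each pass separately and return the raw model output per pass."""
    return {p.name: call_model(build_prompt(p, diff)) for p in passes}
```

Each pass gets its own prompt and its own examples, so a noisy style comment from one pass can't crowd out a security finding from another. The passes are independent, so you could also run them in parallel if latency matters.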
2. We Built a Feedback Loop That Actually Closes
Most teams I've talked to stop at "the model flagged it, a human reviewed it, done."
That's what we did for the first month too. And we stayed flat at 64%. Flatlined. The accuracy line on my dashboard looked like a heart monitor after bad news.
So we built something we call a review registry. It's honestly just a Postgres table—nothing fancy, no vector databases, no blockchain (I can hear the Web3 people getting excited; please stop). But it logs every single false positive and false negative: the code snippet, the model's verdict, the human reviewer's correction, and a category tag.
The critical piece: we built a little Slack bot that lets reviewers submit corrections in about 3 seconds. This matters more than you'd think. If it takes longer than 3 seconds, people won't do it. We learned that the hard way in week one when our carefully designed 90-second form got exactly zero submissions.
-- This is literally it. Told you it wasn't fancy.
CREATE TABLE review_registry (
    id SERIAL PRIMARY KEY,
    pr_number INTEGER,
    code_snippet TEXT,
    model_verdict VARCHAR(50),   -- 'bug', 'clean', 'needs_review'
    human_correction VARCHAR(50),
    category VARCHAR(100),       -- 'sql_injection', 'null_pointer', 'false_positive'
    reviewer_notes TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
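To make the "3 seconds" idea concrete, here's a rough sketch of the kind of helper a bot could call to record one correction. It's a sketch under stated assumptions: it uses an in-memory SQLite database as a stand-in for Postgres so it runs anywhere, and `open_registry` and `log_correction` are hypothetical names. Only the column names and the three verdict values come from the table above.

```python
import sqlite3

# SQLite-flavoured copy of the review_registry table (stand-in for Postgres).
SCHEMA = """
CREATE TABLE IF NOT EXISTS review_registry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pr_number INTEGER,
    code_snippet TEXT,
    model_verdict TEXT,
    human_correction TEXT,
    category TEXT,
    reviewer_notes TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

# Verdict values listed in the schema comment.
VALID_VERDICTS = {"bug", "clean", "needs_review"}


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the registry database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_correction(
    conn: sqlite3.Connection,
    pr_number: int,
    code_snippet: str,
    model_verdict: str,
    human_correction: str,
    category: str,
    reviewer_notes: str = "",
) -> int:
    """Insert one reviewer correction and return the new row id."""
    if model_verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown model verdict: {model_verdict!r}")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO review_registry "
            "(pr_number, code_snippet, model_verdict, human_correction, "
            "category, reviewer_notes) VALUES (?, ?, ?, ?, ?, ?)",
            (pr_number, code_snippet, model_verdict, human_correction,
             category, reviewer_notes),
        )
    return cur.lastrowid
```

The point of keeping the function this small is that the reviewer only has to supply a verdict and a category. Everything else (PR number, snippet) can be filled in by the bot from context.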
Every Friday afternoon, our staff engineer runs a fine-tuning script on this growing dataset. We're not doing full model retraining (we're not OpenAI, and our GPU budget is... let's call it "modest" because "embarrassing" sounds unprofessional). Instead, we train LoRA adapters on top of our base model, and the impact compounds in a way I genuinely didn't expect:
- Month 1: Accuracy improved 5%
- Month 2: Another 4%
- Month 3: Another 3%
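The step between "rows in a table" and "a fine-tuning run" is mostly data shaping. Here's a hedged sketch of what that export could look like. The prompt/completion JSONL layout and the function names (`to_training_example`, `export_jsonl`) are assumptions for illustration, so adapt them to whatever format your fine-tuning tooling expects.

```python
import json
from typing import Dict, Iterable


def to_training_example(row: Dict[str, str]) -> Dict[str, str]:
    """Turn one registry row into a prompt/completion pair (assumed format)."""
    prompt = f"Review this code and give a verdict.\n\n{row['code_snippet']}"
    # The human correction is the label we want the model to learn.
    completion = row["human_correction"]
    notes = row.get("reviewer_notes")
    if notes:
        completion += f"\nReason: {notes}"
    return {
        "prompt": prompt,
        "completion": completion,
        "category": row["category"],
    }


def export_jsonl(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize usable rows as JSON Lines, skipping rows with missing fields."""
    lines = []
    for row in rows:
        if not row.get("code_snippet") or not row.get("human_correction"):
            continue  # nothing to learn from an incomplete correction
        lines.append(json.dumps(to_training_example(row)))
    return "\n".join(lines)
```

Keeping the `category` tag in each record makes it easy to check later whether one bug type is dominating the training set.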
By month three, something wild started happening. The model began catching domain-specific bugs—like incorrect state transitions in our order management system—that generic linters would never find. Stuff that requires actually understanding our business logic, not just pattern matching against common bug signatures.
That's when I knew we were building institutional knowledge, not just consuming AI.
Well... that's complicated. I should say we're starting to build institutional knowledge. It's early. Ask me again in six months and I'll either sound brilliant or very, very foolish.
3. We Measured What Actually Matters (And Ignored the Rest)
Early on, I obsessed over precision and recall like everyone else. Spent hours tweaking thresholds. Made very fancy dashboards with lots of green numbers.
Nobody cared. My CEO certainly didn't. She'd glance at my beautiful Grafana dashboard for exactly 1.4 seconds and ask "but is this actually helping us ship better software faster?"
Ouch. But fair.
So we started tracking three things that actually map to business outcomes:
- Mean Time to Detect (MTTD): How quickly a bug is caught after commit. Dropped from 4.2 hours to 1.1 hours. That's real money saved—and more importantly, it's real sleep saved for the on-call engineer who used to catch these at 2am.
- Reviewer Fatigue Score: We survey senior devs every Wednesday on a 1-5 scale, where higher means less fatigue. Started at 2.8 ("drained, please make it stop, I'm updating my LinkedIn"). Now at 4.1 ("I actually get to think about architecture again instead of hunting for missing null checks"). This one hit home for me personally. I was burning out our best people without realizing it.
- Escape Rate: Bugs reaching production. Down 31% quarter-over-quarter. This is the number that got my CEO to ask "what do you need to scale this?" instead of "why are we spending on AI tools?" Funny how that works.
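If you want to compute numbers like these yourself, the arithmetic is simple. The sketch below is one plausible way to do it, not a description of our dashboards. `BugRecord` and its fields are hypothetical, and treating escape rate as "fraction of known bugs that reached production" is an assumption (you could just as reasonably track a raw count).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass
class BugRecord:
    """One known bug (hypothetical record shape)."""
    committed_at: datetime
    detected_at: datetime
    reached_production: bool


def mean_time_to_detect_hours(bugs: List[BugRecord]) -> float:
    """Average hours between the commit and the moment the bug was caught."""
    if not bugs:
        raise ValueError("need at least one bug record")
    return mean(
        (b.detected_at - b.committed_at).total_seconds() / 3600 for b in bugs
    )


def escape_rate(bugs: List[BugRecord]) -> float:
    """Fraction of known bugs that made it to production."""
    if not bugs:
        raise ValueError("need at least one bug record")
    return sum(b.reached_production for b in bugs) / len(bugs)


def relative_change(before: float, after: float) -> float:
    """Signed relative change, e.g. -0.25 means a 25% drop."""
    return (after - before) / before
```

Plugging in the MTTD figures above, `relative_change(4.2, 1.1)` comes out to about -0.74, so that drop is roughly a 74% reduction.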
These numbers got budget. The accuracy percentage? That got polite nods in standup and zero additional headcount.
If you take one thing from this entire post, make it this: connect engineering metrics to customer value. It's a lesson from Marty Cagan's Empowered that I keep having to relearn. Probably will relearn it again next quarter too. Some lessons just don't stick the first time.
4. The Human-in-the-Loop Paradox
Here's the counterintuitive part. Actually, it's more than counterintuitive—it's kind of the opposite of everything I expected when we started this project.
As our LLM got more accurate, we had to increase human oversight.
Let me say that again because it sounds wrong: better AI meant we needed more human involvement, not less.
Why? Because developers started trusting the model too much. Not maliciously—just... automatically. The suggestions looked so reasonable, so well-formatted, so confident. Why would you question something that sounds that authoritative?
We caught two instances in March where a junior engineer rubber-stamped the AI's suggestion without understanding the context. One of them would have introduced a race condition in our inventory system. The kind of bug that doesn't show up in testing but manifests as "why do we have -3 items in stock" at the worst possible moment. Think Black Friday. Think customer support tickets. Think nightmares.
So we now enforce a simple rule: LLM suggestions are "advisory" for junior devs, "confirmatory" for seniors. Every AI-flagged issue must include a human comment explaining the "why" before merge. Not just "fixed" or "done" or "👍" (I see you, GitHub emoji people). An actual explanation of the reasoning.
This slowed us down by maybe 8% initially. Some seniors pushed back hard—"I know why this is a bug, why do I need to write a paragraph about it?" I get it. Adding process feels bad. It feels like bureaucracy. But our escape rate dropped further, and honestly? The junior devs are learning faster because they have to articulate the reasoning instead of just clicking "accept suggestion."
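A rule like this is easier to keep if a machine checks it. Here's a minimal sketch of a merge gate along those lines. Everything in it is illustrative: `FlaggedIssue`, `LOW_EFFORT`, and the five-word minimum are assumptions, and counting words is a crude proxy for "actually explained the reasoning".

```python
from dataclasses import dataclass, field
from typing import List

# Replies that don't count as an explanation (assumed list).
LOW_EFFORT = {"fixed", "done", "ok", "lgtm", "👍"}
MIN_WORDS = 5  # arbitrary threshold, tune for your team


@dataclass
class FlaggedIssue:
    """One AI-flagged issue plus the human comments attached to it."""
    issue_id: str
    human_comments: List[str] = field(default_factory=list)


def is_real_explanation(comment: str) -> bool:
    """Reject one-word acknowledgements; require a few words of reasoning."""
    text = comment.strip().lower().rstrip(".!")
    if text in LOW_EFFORT:
        return False
    return len(text.split()) >= MIN_WORDS


def unexplained_issues(issues: List[FlaggedIssue]) -> List[str]:
    """Return ids of flagged issues that have no real human explanation."""
    return [
        issue.issue_id
        for issue in issues
        if not any(is_real_explanation(c) for c in issue.human_comments)
    ]


def can_merge(issues: List[FlaggedIssue]) -> bool:
    """Allow the merge only when every flagged issue has been explained."""
    return not unexplained_issues(issues)
```

A check like this won't tell you whether the explanation is correct. It only guarantees that someone had to write one, which is the part that forces people to stop and think.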
Speed without safety is just technical debt with a pretty interface. I think I stole that from a conference talk. If it was yours, tell me and I'll buy you coffee next time you're in London.
What I'd Do Differently
If I could go back to January and give myself advice (besides "don't deploy on Fridays, you absolute maniac"), here's what I'd say:
- Start with one pass, not three. We over-engineered early. Security-only was the obvious MVP—it's the highest-stakes category and the one where false negatives hurt most. We should have nailed that before adding logic and performance passes.
- Build the feedback loop on day one. We lost a month of training data because we thought "we'll set up the logging infrastructure later." Later meant 30 days of missed learning. Don't be us.
- Talk to the skeptics first. Our biggest champion ended up being the senior architect who initially said "LLMs are just fancy autocomplete." I spent 45 minutes showing him real examples from our codebase, and he became the internal evangelist I couldn't be. Find your skeptic and win them over with data, not hype.
What's Next
I'm not sharing this because we've cracked the code. We haven't. Our accuracy is 73%, and I want it at 85% by end of year. That's probably ambitious. No, it's definitely ambitious. But the path is clearer now: narrow the scope, close the feedback loop, measure business impact, and never fully automate judgment.
We're also experimenting with something I'm tentatively calling "context-aware review"—giving the model access to the Jira ticket and design doc behind each PR. Early results are promising but messy. Maybe I'll write about that in a few months. Or maybe I'll write about how it all went terribly wrong. Either way, it'll be interesting.
What does your experience with LLM-based code review look like? Have you found a sweet spot between automation and oversight? I'm genuinely curious, so drop your numbers or horror stories in the comments. Especially the horror stories. Those are always more useful than the success stories, somehow. There's something about failure that teaches better than success ever does.
Last Tuesday I reviewed a PR where our AI caught a subtle authentication bypass that I'm 90% sure I would have missed. That's when I knew this whole experiment was worth it. That's also when I realized I'm becoming the kind of engineer who says things like "our AI caught this" in standup, and I'm not sure how I feel about that yet.
#AIEngineering #CodeReview #LLM #DevOps #EngineeringLeadership #MachineLearning
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.