How to Measure Code Review Quality in 2026 (Not Just Speed)
Every engineering metrics tool measures review speed. Almost none measure review quality. Here's what quality actually means in code reviews and how to track it.
TL;DR: Most teams measure how fast code reviews happen but never whether they’re actually good. Rubber-stamp approvals — reviews with zero substantive comments — account for 30-60% of all reviews on most teams. Measuring comment quality (not just speed) is the key to reviews that catch bugs before production.
Why aren’t fast reviews the same as good reviews?
Because speed and quality are independent variables. A reviewer can glance at a 400-line PR, type “LGTM,” and approve it in 90 seconds. Your review turnaround metric looks incredible. Your code review process is broken.
I’ve seen this firsthand at multiple companies. The team celebrates hitting a 4-hour median review turnaround time. Meanwhile, production incidents keep happening from bugs that were clearly visible in the diff. Nobody’s connecting the dots because no metric captures the difference between a thoughtful review and a rubber stamp.
Here’s a stat that changed how I think about reviews: across the teams I’ve analyzed, 38% of approved pull requests received zero substantive review comments. Not zero comments — zero substantive comments. The approvals came with “LGTM,” a thumbs-up emoji, or literally nothing at all.
That’s not code review. That’s a merge button with extra steps.
What is the rubber-stamp problem?
The rubber-stamp problem is when reviews technically happen but don’t actually review anything. It’s the most common failure mode in code review, and it’s almost completely invisible to standard metrics.
Why does it happen? Three reasons:
Social pressure. Nobody wants to be the person who blocks a teammate’s PR for two days over a naming convention. So reviewers approve quickly and swallow their concerns. The code ships. The tech debt accumulates.
Review fatigue. If your team is drowning in PRs and every engineer has 5+ reviews waiting, quality drops. Reviewers triage by scanning the PR title, checking if tests pass, and approving. They don’t have bandwidth for line-by-line review.
No accountability. Nobody tracks whether reviews are substantive. There’s no feedback loop. A reviewer who rubber-stamps everything and a reviewer who catches critical bugs look exactly the same in your metrics dashboard. Both show fast turnaround times and high review completion rates.
The cost is real. Teams with high rubber-stamp rates have 2-3x higher rework rates — PRs that need follow-up fixes within a week of merging. Those bugs were in the diff. A real review would have caught them.
What makes a review comment “substantive” vs. noise?
This is the core question, and it’s harder to answer than it seems. Let me give you a framework.
Noise comments (not substantive):
- “LGTM”
- “Looks good to me!”
- Emoji reactions (thumbs up, rocket, etc.)
- “nit: spacing” (low-value formatting comments that a linter should catch)
- “+1”
- “Nice work!”
Substantive comments:
- “This query could return null if the user doesn’t exist — should we handle that case?”
- “This function is doing three things. Can we split it so the retry logic is testable independently?”
- “I think there’s a race condition here if two requests hit this endpoint simultaneously.”
- “The approach works, but have you considered using a transaction? If the second write fails, the first one will leave the database in an inconsistent state.”
- “This changes the API response shape — did we check if any clients depend on the old format?”
The difference is clear when you see examples side by side. Substantive comments identify risks, suggest alternatives, ask questions about edge cases, or point out architectural concerns. Noise comments are social signals that add no technical value.
A good review doesn’t need to have 20 substantive comments. Even one or two comments that catch a real issue or improve the design make the review worthwhile. The problem is when a review has zero.
How do you measure code review quality with metrics?
There are four metrics that, together, give you a real picture of review quality.
Comment substantiveness rate
The percentage of review comments that are substantive (as defined above). This is the single most important review quality metric, and it’s the hardest to measure because it requires understanding the content of comments, not just counting them.
A healthy team has a substantiveness rate of 40-60%. That means roughly half of review comments are substantive technical feedback, and half are social/formatting/minor. Below 30%, your reviews are mostly noise. Above 70%, your reviewers might be nitpicking — that’s its own problem.
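A rough way to approximate this yourself is a heuristic classifier. This is a sketch, not a rigorous taxonomy: the noise phrases and the 20-character cutoff below are illustrative assumptions you'd tune against your own review data.

```python
import re

# Illustrative noise patterns -- an assumption, not a standard list.
NOISE_PATTERNS = [
    r"^lgtm\b",
    r"^looks good",
    r"^\+1$",
    r"^nice work",
    r"^nit:",
    r"^[\U0001F300-\U0001FAFF\u2700-\u27BF\s]+$",  # emoji-only comments
]

def is_substantive(comment: str) -> bool:
    text = comment.strip().lower()
    if len(text) < 20:  # very short comments are almost never substantive
        return False
    return not any(re.search(p, text) for p in NOISE_PATTERNS)

def substantiveness_rate(comments: list[str]) -> float:
    """Share of comments classified as substantive; 0.0 for an empty list."""
    if not comments:
        return 0.0
    return sum(is_substantive(c) for c in comments) / len(comments)

comments = [
    "LGTM",
    "+1",
    "This query could return null if the user doesn't exist -- handle that case?",
    "I think there's a race condition if two requests hit this endpoint at once.",
]
print(round(substantiveness_rate(comments), 2))  # 0.5
```

A keyword-and-length heuristic like this will misclassify edge cases, which is exactly why the article calls this metric the hardest to measure, but it's enough to spot a team drowning in "LGTM".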
Comment depth (average comment length)
A rough proxy for substantiveness. Comments under 20 characters are almost never substantive (“LGTM” is 4 characters). Comments over 100 characters are almost always substantive — it’s hard to write a meaningless comment that long.
This isn’t perfect. “I don’t think this is right but I’m not sure why” is about 50 characters of non-substance. But in aggregate, across hundreds of comments, average length correlates strongly with quality.
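The length proxy is trivial to compute. A minimal sketch, using the under-20 and over-100 character thresholds from the rule of thumb above:

```python
def comment_depth_stats(comments):
    """Length-based proxy for review depth: average length plus the share of
    very short (<20 chars) and long (>100 chars) comments. Thresholds follow
    the article's rule of thumb and are worth tuning per team."""
    lengths = [len(c.strip()) for c in comments]
    n = len(lengths) or 1  # avoid division by zero on an empty list
    return {
        "avg_length": sum(lengths) / n,
        "short_share": sum(l < 20 for l in lengths) / n,
        "long_share": sum(l > 100 for l in lengths) / n,
    }

sample = [
    "LGTM",
    "Nice work!",
    "The approach works, but have you considered using a transaction? "
    "If the second write fails, the first one leaves the DB inconsistent.",
]
stats = comment_depth_stats(sample)
print(f"avg {stats['avg_length']:.0f} chars, {stats['short_share']:.0%} under 20 chars")
```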
Change request resolution rate
When a reviewer requests changes, how often does the author actually address them? On healthy teams, this is 90%+. On teams where reviews are performative, authors frequently merge without resolving all change requests — either by dismissing stale reviews or getting approval from a different reviewer.
A low resolution rate means your review process has no teeth. Reviewers stop leaving substantive comments because they know the feedback gets ignored.
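You can compute a version of this from GitHub review events (the shape below matches objects returned by `GET /repos/{owner}/{repo}/pulls/{n}/reviews`). The "resolved" rule here is an assumption for the sketch: a change request counts as addressed only if that same reviewer's latest review on the PR is an approval.

```python
def change_request_resolved(reviews):
    """reviews: list of {"user": login, "state": state} in chronological order.
    True if every reviewer who requested changes later approved."""
    latest_by_reviewer = {}
    requested = set()
    for r in reviews:
        if r["state"] == "CHANGES_REQUESTED":
            requested.add(r["user"])
        latest_by_reviewer[r["user"]] = r["state"]
    return all(latest_by_reviewer[u] == "APPROVED" for u in requested)

def resolution_rate(merged_prs):
    """merged_prs: one review list per merged PR. Rate is computed only over
    PRs that had at least one change request."""
    flagged = [p for p in merged_prs
               if any(r["state"] == "CHANGES_REQUESTED" for r in p)]
    if not flagged:
        return 1.0
    return sum(change_request_resolved(p) for p in flagged) / len(flagged)

prs = [
    # Resolved: the requester later approved.
    [{"user": "ana", "state": "CHANGES_REQUESTED"}, {"user": "ana", "state": "APPROVED"}],
    # Unresolved: merged on a different reviewer's approval.
    [{"user": "ben", "state": "CHANGES_REQUESTED"}, {"user": "cy", "state": "APPROVED"}],
]
print(resolution_rate(prs))  # 0.5
```

The second PR in the sample is exactly the "approval from a different reviewer" escape hatch described above: it merges, but the original change request was never addressed.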
Reviews with zero substantive comments
The rubber-stamp rate. What percentage of approved PRs received no substantive feedback? Track this as a team metric, not an individual one — you want to fix the system, not shame people.
Benchmark: aim to get this below 25%. If more than half your approved PRs have zero substantive comments, your code review process is providing a false sense of security.
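Given any classifier for individual comments, the rubber-stamp rate is one line of aggregation. A sketch, with a deliberately crude length-only stand-in classifier that you'd replace with something smarter:

```python
def rubber_stamp_rate(approved_prs, is_substantive):
    """approved_prs: one list of review comments per approved PR.
    A PR is rubber-stamped if none of its comments is substantive
    (an approval with no comments at all also counts)."""
    if not approved_prs:
        return 0.0
    stamped = sum(
        not any(is_substantive(c) for c in comments) for comments in approved_prs
    )
    return stamped / len(approved_prs)

# Crude stand-in classifier (length-only) -- swap in a real one.
substantive = lambda c: len(c.strip()) >= 20

approved = [
    ["LGTM"],
    [],  # approved with no comments at all
    ["This changes the API response shape -- do any clients depend on the old format?"],
]
print(f"{rubber_stamp_rate(approved, substantive):.0%}")  # 67%
```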
How does MergeScout score code review quality?
MergeScout is an AI-powered engineering metrics dashboard that watches your GitHub repos and delivers executive briefings in seconds. One of the features I’m most proud of is the comment quality scoring system.
Here’s how it works: every review comment across your repos gets analyzed by AI and scored on a 1-100 scale based on:
- Depth of analysis — Does the comment engage with the logic of the change, or just the surface?
- Specificity — Does it point to a concrete issue, or is it vague?
- Actionability — Can the author do something with this feedback?
- Technical substance — Does it address correctness, performance, security, or design?
A comment like “LGTM” scores in the single digits. A comment like “This SQL query builds the WHERE clause with string concatenation — that’s a SQL injection risk, use parameterized queries instead” scores in the 80s or 90s.
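To make the idea concrete, here's a hypothetical sketch of how an LLM-based scorer could be wired up. This is not MergeScout's actual implementation (which isn't public); `call_llm` is a stub standing in for whatever model API you use, and the rubric wording is illustrative.

```python
RUBRIC = (
    "Score this code review comment from 1 to 100 on depth of analysis, "
    "specificity, actionability, and technical substance. "
    "Reply with only the integer.\n\nComment: {comment}"
)

def call_llm(prompt: str) -> str:
    # Stub: replace with a real model call (OpenAI, Anthropic, local, ...).
    return "85"

def parse_score(reply: str) -> int:
    """Clamp the model's reply into the 1-100 range; raises on non-numeric output."""
    return max(1, min(100, int(reply.strip())))

def score_comment(comment: str) -> int:
    return parse_score(call_llm(RUBRIC.format(comment=comment)))

print(score_comment(
    "This builds the WHERE clause by string concatenation -- "
    "SQL injection risk; use parameterized queries."
))  # 85 (stubbed reply)
```

The clamp in `parse_score` matters in practice: models occasionally reply outside the requested range, and you want scores bounded before aggregating them.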
These scores aggregate into per-developer and per-repo quality metrics that you can track over time. You can see which repos get the most thorough reviews and which ones are getting rubber-stamped. You can see trends — is review quality improving after you introduced PR templates last month?
The key insight: you can’t improve what you can’t measure. And until you have a way to measure the content of reviews, not just the speed, you’re optimizing for the wrong thing. Try it free.
What does “good” review quality look like?
Based on the data across teams using MergeScout, here are the benchmarks I’d aim for:
| Metric | Needs Work | Healthy | Excellent |
|---|---|---|---|
| Avg. comment quality score | Below 30 | 40-60 | 60+ |
| Rubber-stamp rate | Over 50% | 20-35% | Below 20% |
| Substantive comment rate | Below 25% | 40-60% | 60%+ |
| Change request resolution | Below 75% | 85-95% | 95%+ |
| Avg. comment length | Under 30 chars | 60-120 chars | 120+ chars |
A few notes on these benchmarks:
You don’t want a 100% substantive comment rate. Social comments (“Nice refactor!” or “Good catch on that edge case”) are healthy for team morale. The goal is a balance — enough substance to catch real issues, enough social signal to keep reviews from feeling adversarial.
Comment quality also varies by PR type. A config change or dependency bump doesn’t need a deep architectural review. A new authentication flow does. Look at the trend over time and across PR types, not any single data point.
And remember: the goal of measuring review quality isn’t to create a leaderboard of “best reviewers.” It’s to identify systemic gaps. If an entire repo is getting rubber-stamped, that’s a process problem. If one team has consistently higher review quality, figure out what they’re doing differently and share it.
How do you improve code review quality once you’re measuring it?
Measurement without action is just surveillance. Here’s what actually moves the needle:
Make quality visible. Share the team’s review quality metrics in retros. Not individual scores — team averages and trends. When people see that 45% of reviews have zero substantive comments, it creates natural motivation to do better.
Pair-review complex PRs. For high-risk changes, do the review synchronously. Sit together (or screen-share) and walk through the code. These reviews are 3x more likely to catch critical issues than async reviews.
Rotate reviewers. When the same two people always review each other’s code, they develop blind spots. Cross-team reviews bring fresh eyes and catch assumptions that insiders miss.
Recognize great reviews. When someone leaves a review comment that catches a real bug or significantly improves the design, call it out in Slack or your team meeting. What gets recognized gets repeated.
Check out our blog for more on building a review culture that actually works.
FAQ
Can you measure code review quality without AI?
Partially. You can track comment length, number of comments per review, and change request resolution rate with basic GitHub API scripts. But classifying comments as substantive vs. noise requires understanding natural language — that’s where AI scoring adds real value. Manual classification doesn’t scale past a handful of PRs.
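As a sketch of the non-AI route: the objects returned by GitHub's `GET /repos/{owner}/{repo}/pulls/{n}/comments` endpoint each carry a `body` field, so the basic stats reduce to a small summary function. The payload below is a hand-made stand-in for a real API response.

```python
def summarize(review_comments):
    """Basic non-AI review stats over GitHub-API-shaped comment objects."""
    bodies = [c["body"].strip() for c in review_comments]
    return {
        "comments": len(bodies),
        "avg_length": sum(map(len, bodies)) / len(bodies) if bodies else 0,
        "under_20_chars": sum(len(b) < 20 for b in bodies),
    }

payload = [
    {"body": "LGTM"},
    {"body": "This changes the API response shape -- check downstream clients."},
]
print(summarize(payload))
```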
Does measuring review quality discourage quick approvals?
Not if you frame it correctly. Quick approvals on low-risk PRs (dependency bumps, config changes) are fine and expected. The metric should account for PR complexity. The goal is to eliminate rubber-stamps on meaningful code changes, not to make every review a 30-minute deep dive.
How does comment quality scoring handle different programming languages?
Good scoring systems are language-agnostic because they evaluate the review comment, not the code itself. A comment about SQL injection risk, race conditions, or error handling is substantive regardless of whether the PR is in Python, Go, or TypeScript. MergeScout’s scoring works across any language in your GitHub repos.
What if our team’s review quality is low — where do we start?
Start with the rubber-stamp rate. If over 40% of your approved PRs have zero substantive comments, that’s your biggest lever. Introduce a lightweight norm: every approval should include at least one specific observation about the code — even if it’s positive (“The error handling here is solid, covers all the failure modes”). That one change shifts the culture from “glance and approve” to “actually read the code.”
Should review quality metrics be tied to performance reviews?
No. Tying quality metrics to performance reviews creates gaming — reviewers will leave long but pointless comments to inflate their scores. Use quality metrics as team-level health indicators and coaching tools. Discuss them in 1:1s as development opportunities, not as evaluation criteria.