AI · reinforcement learning · code review

What I Learned from Grading AI Code Reviews

April 6, 2026 · 8 min read

I recently spent 20+ hours on a task that, at the time, I thought was just a code review exercise. I was given a pull request to the Hono web framework, a TypeScript HTTP router, and asked to do three things: understand the architecture, independently review the diff, and then evaluate 10 AI-generated reviews of the same PR. Grading those AI-generated reviews turned out to be the most interesting work I've done in 2026 so far, because it maps directly onto a problem I care about: building RL environments for coding agents.

An RL environment for a coding agent has three primary components: a prompt, a codebase for the model to work in, and a grader that scores the result. The grader is the whole game. If it's noisy, unfair, or miscalibrated, the training signal is useless because you're rewarding the wrong behavior. The 10 AI reviews I evaluated were, functionally, 10 grading passes over the same submission. And they were broken in ways that perfectly illustrate why grader quality is so hard.
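
To make that concrete, here is a rough sketch of the shape of such an environment. The names and types are mine, invented for illustration; they don't correspond to any particular framework.

```ts
// A minimal sketch of an RL coding environment, with invented names.
// The grader is just a function from the model's submission to a score;
// everything in this post is about the ways that function can go wrong.

interface CodingTask {
  prompt: string;   // the spec the model is asked to implement
  repoPath: string; // the codebase it works inside
}

interface GraderFinding {
  severity: "Critical" | "Major" | "Minor";
  isCorrectnessIssue: boolean;
  description: string;
}

type Grader = (task: CodingTask, submissionDiff: string) => {
  score: number;             // the reward signal the model actually trains on
  findings: GraderFinding[]; // the per-issue labels discussed below
};
```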

Three Ways a Grader Can Lie to You

The PR I reviewed refactored Hono's query parameter parsing (getQueryParam, getQueryParams, and their integration with the HonoRequest class). It was a meaty diff with real bugs, subtle design flaws, and a few red herrings. The 10 AI review passes were supposed to find those issues and classify them by severity (Critical, Major, Minor) and correctness (does the flagged code actually fail to implement a stated requirement, i.e., contain a real flaw?).

They got a lot right. They also failed in three systematic ways that, if these passes were RL graders, would produce a training environment actively hostile to model improvement.

1. Label Noise on the Same Bug

The query(key) function returned null for missing keys instead of undefined, and queries(key) returned null instead of []. Both broke the established HonoRequest API contract — the overload signatures promised undefined and string[] respectively, and downstream callers would crash. These are clearly correctness issues.
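
To show the shape of the problem (this is a simplification, not Hono's actual source), the declared signatures promise undefined and string[], so any caller written against them blows up when null arrives instead:

```ts
// Illustrative only; simplified from what the HonoRequest overloads promise.
interface QueryApi {
  query(key: string): string | undefined; // contract: undefined when missing
  queries(key: string): string[];          // contract: [] when missing
}

// What the PR's implementation effectively returned for a missing key:
//   query("missing")   -> null  (instead of undefined)
//   queries("missing") -> null  (instead of [])

// A downstream caller written against the declared types:
function firstTag(req: QueryApi): string | undefined {
  const tags = req.queries("tag");
  // Throws at runtime when queries() returns null: null has no .length.
  return tags.length > 0 ? tags[0] : undefined;
}
```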

But across 10 passes, the correctness label on these two bugs flipped between true and false with no discernible pattern. Some passes marked them as correctness issues; others didn't. As far as I could tell, this was mostly a description problem: some comments described the behavior as something that "will crash on user-controlled input" and still marked correctness as false in the same breath.

In an RL environment, this is nondeterminism in the grader. The model does the same thing on two runs and gets different scores. The model can't learn from inconsistency, and worse, it learns to distrust the scoring function entirely, optimizing for surface patterns that correlate with high scores rather than genuine correctness.

2. Severity Miscalibration That Rewards Data Loss

The PR introduced a decode-loop overwrite: when encoded and plain-text forms of the same key coexisted, the decoded loop would silently overwrite earlier values. In single-value mode you'd get the wrong result; in multi-value mode, entire arrays would be silently dropped.
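
Here's a reconstruction of the failure shape; this is my own simplification for illustration, not the PR's actual code. Plain keys accumulate normally, then the decode pass clobbers whatever was already collected:

```ts
// Illustrative reconstruction, not the PR's code.
// Query string: "key=a&key=b&k%65y=c" (the last key is "key", percent-encoded)
function parseMulti(qs: string): Record<string, string[]> {
  const out: Record<string, string[]> = {};
  for (const pair of qs.split("&")) {
    const [rawKey, value = ""] = pair.split("=");
    if (!rawKey.includes("%")) {
      (out[rawKey] ??= []).push(value); // plain keys accumulate: ["a", "b"]
    } else {
      // Decoded key overwrites instead of merging: ["a", "b"] silently becomes ["c"].
      out[decodeURIComponent(rawKey)] = [value];
    }
  }
  return out;
}

// parseMulti("key=a&key=b&k%65y=c") -> { key: ["c"] }; two values lost, no error.
```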

Most of the 10 passes called it Major. Some called it Minor. Their own descriptions explained exactly why it was Critical, then assigned a lower severity anyway.

An RL grader with this kind of miscalibration would assign a passing score to a model that introduces silent data loss. A model trained against it would learn that data integrity isn't worth prioritizing, because the scoring never punished the lapse. Severity calibration isn't a pedantic detail; it's the difference between an environment that teaches models to write reliable software and one that teaches them to write software that looks reliable.

3. Spec-Blindness: Penalizing the Model for Following Instructions

At least three of the 10 passes flagged the change to getQueryStringFromURL — which now strips the leading ? from the returned string — as a correctness issue. The task specification explicitly required this change. The grading passes didn't read the spec.
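
For context, the required behavior is small. Something like the following (a simplification, not the actual diff): the helper used to return the query string with its leading ?, and the spec asked for it without.

```ts
// Simplified illustration of the spec-required behavior, not the actual diff.
function getQueryStringFromURL(url: string): string {
  const index = url.indexOf("?");
  // Previously the "?" was included in the return value; the spec requires stripping it.
  return index === -1 ? "" : url.slice(index + 1);
}
```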

This failure mode is the most dangerous of the three, because it's the one that's hardest to detect at scale. The grader isn't noisy or miscalibrated — it's confidently wrong. It would assign zero credit to a model that followed the prompt correctly. In an RL training loop, the model would learn to avoid following certain instructions, because doing so is associated with low scores. You'd be training obedience out of the model by accident.

This maps directly to a distinction that matters in RL environment design: there's a difference between a task where the model fails because it lacks capability, and a task where the model "fails" because the grader didn't understand the spec. The first is a useful training signal. The second is poison.

Convention-Awareness Is the Capability That Matters Most

The places where coding agents fail in interesting ways tend to involve implicit codebase conventions. Documentation tools like Mintlify can surface some of this knowledge, and people who have spent hundreds of hours in a specific codebase carry the rest in their heads, but none of it is explicitly written down anywhere.

I learned this partially by contributing to Dracut, Red Hat's initramfs infrastructure. Over the course of 20+ accepted patches, I absorbed patterns that no documentation captures: how modules structure their install hooks, which shell idioms are preferred over technically equivalent alternatives, how error handling is expected to flow through the boot chain. Violating these 'rules' won't break anything outright; at worst, you get a glance of disappointment from the project maintainer.

This is exactly the kind of failure that makes a strong RL task. A coding agent working on Dracut would likely produce functionally correct patches that violate convention, using local imports where the project uses global ones, or structuring a module hook in a way that works but doesn't match the existing pattern. The code would pass unit tests. It would fail code review. And critically, you can grade convention adherence automatically by diffing the model's style choices against the project's existing patterns. It's hard to game, it requires genuine codebase understanding, and it separates models that can write code from models that can contribute to a project.
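
As a rough sketch of what that automatic grading could look like (an invented heuristic, not a tool I've actually built): count how the existing codebase does something, then check whether the diff's additions match the majority pattern.

```ts
// Hypothetical convention grader: do the diff's added imports match the repo's
// dominant style? The heuristic and thresholds are invented for illustration.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

function countImportStyles(dir: string): { relative: number; bare: number } {
  const counts = { relative: 0, bare: 0 };
  for (const entry of readdirSync(dir, { recursive: true }) as string[]) {
    if (!entry.endsWith(".ts")) continue;
    const src = readFileSync(join(dir, entry), "utf8");
    for (const m of src.matchAll(/from\s+["']([^"']+)["']/g)) {
      if (m[1].startsWith(".")) counts.relative++;
      else counts.bare++;
    }
  }
  return counts;
}

// Score only the lines the diff adds, against the repo's dominant import style.
function conventionScore(repoDir: string, addedLines: string[]): number {
  const repo = countImportStyles(repoDir);
  const dominantIsRelative = repo.relative >= repo.bare;
  const addedImports = addedLines.filter((l) => /^\+\s*import\b/.test(l));
  if (addedImports.length === 0) return 1; // nothing to judge
  const matching = addedImports.filter((l) => {
    const m = l.match(/from\s+["']([^"']+)["']/);
    return m ? m[1].startsWith(".") === dominantIsRelative : true;
  });
  return matching.length / addedImports.length;
}
```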

What I Got Wrong About Coding Agents

Six weeks ago, I would have told you coding agents were mostly useful for mundane tasks like scaffolding files, writing tests, or autocompleting obvious patterns. I was wrong, and not by a little. During the Hono review, I watched the AI passes catch issues I would have missed on a first read: subtle type mismatches buried in overload signatures, fragile ordering dependencies in the encoded flag logic, dead code in return type unions. Their recall on narrow, well-defined issues was legitimately impressive.

Where they actually failed was in judgment that requires holding multiple contexts simultaneously. The null vs. undefined regression is a perfect example — the function was correctly following the task spec by returning null, but the integration layer in HonoRequest still had overload signatures promising undefined. Recognizing this as a bug requires understanding two things at once: what the spec says, and what the public API contract promises. The AI passes that got this wrong weren't failing at pattern-matching — they were failing at reconciling two conflicting sources of truth. That's a fundamentally different kind of failure, and it's the one that matters for RL environment design.

What This Means for Building Environments

If I were designing RL tasks after this experience, three principles would guide my work.

First, grader determinism matters more than grader comprehensiveness. A grader that scores five things consistently is more useful for training than one that scores twenty things with label noise. The correctness-flag inconsistency across the 10 passes I reviewed would produce an environment that's actively harmful to train on, no matter how many issues it checks.

Second, the spec is part of the environment, not just part of the prompt. Ambiguity in the spec should be intentional — designed to test whether the model asks for clarification or makes reasonable assumptions. Accidental ambiguity just tests luck. Three passes penalized the model for correctly stripping the ? prefix because the spec was clear but the graders didn't read it carefully. That's not a model failure. That's an environment failure.

Third, convention-adherence tasks have unusually high signal-to-noise ratio. They resist prompt-gaming because the conventions aren't in the prompt — they're in the codebase. They require the model to do something close to genuine comprehension. And they're auto-gradable. That's a rare combination.

Closing

I didn't set out to study RL environment design. But spending 20+ hours evaluating AI-generated code reviews against a real codebase taught me more about grader failure modes than any paper could. The bottleneck isn't building the task. It's knowing when your grader is lying to you. I walk away with a newfound passion amid the ocean of AI work.


Note: This text was cleaned up using AI. I like to be transparent.
