A Structured Benchmark for AI Paper Review

Lenz Pracher (Technical University of Munich), Andrey Fradkin (Boston University), Benjamin Golub (Refine; Northwestern University), Yann Calvó López (Refine)

Refine, an AI-powered system that harnesses frontier AI models to perform substantive technical diligence on research work, won 90% of head-to-head matches against other LLM review systems on 150 economics preprints.

90.4% of 1,349 head-to-head matches won against nine AI reviewers on 150 economics preprints (95% Wilson CI 88.8%-91.9%).

Summary

AI is increasingly being used to evaluate and verify technical writing and research reports. We created an evaluation to measure the performance of Refine, our AI research reviewer, against other AI-powered feedback systems.

Across 1,349 completed matches, Refine won 1,220 (90.4%; 95% Wilson CI: 88.8%-91.9%). Against standalone LLM referee reports, what most researchers can obtain out of the box, Refine won 710 of 749 matches (94.8%; 95% Wilson CI: 93.0%-96.2%).

The benchmark covers 150 economics preprints. We paired Refine's review against each completed competitor review, paper by paper. Rather than asking a model to pick the better referee report in one shot, each match decomposes both reviews into paper-grounded concerns, removes the concerns they share, verifies the rest against the paper, and has a bias-controlled judge panel compare only the concerns that distinguish the two reviews.

The judges preferred Refine because its residual concern lists contained more supported, paper-grounded issues. The comparison also shows that the unit of evaluation is the model x scaffold configuration, not the base model alone. Single-shot reports were weaker than scaffolded review systems, and the gap varied across papers and subfields.

Review methodology overview

The unit of our methodology is the paper-grounded atomic concern: a single, self-contained issue a review raises — one specific flaw, gap, or recommendation — recorded together with the place in the paper it points to. "Atomic" means it captures exactly one issue, so a paragraph that raises three distinct problems becomes three concerns rather than one; "paper-grounded" means each concern is tied to the spot in the paper it refers to, so it can be checked against the actual text. A free-text review that reads as flowing prose thus becomes a list of discrete, individually checkable items.

Working in these units, our procedure (1) reduces each review to paper-grounded atomic concerns, (2) removes issues both sides raised, (3) filters the remaining criticisms for correctness, and (4) asks a judge panel to compare the residual lists. This decompose-then-verify design follows established practice for evaluating long-form model output, where breaking text into atomic, individually checkable claims has been found to track human judgment more closely than scoring a passage as a whole.
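As a rough illustration of steps (2) and (3), the sketch below reduces two toy reviews to their residual lists. It is not the benchmark's implementation: the `Concern` class, the `issue_id` key, and the `supported` flag are hypothetical names. It also assumes shared issues can be detected by key equality, whereas the actual procedure matches concerns by content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concern:
    """One atomic concern: a single issue plus the paper location it points to."""
    issue_id: str    # hypothetical key; equal keys stand in for "same underlying issue"
    location: str    # the place in the paper the concern refers to
    supported: bool  # hypothetical flag: did the check against the paper pass?


def residual_lists(review_a, review_b):
    """Drop issues raised by both reviews, then keep only supported concerns."""
    shared = {c.issue_id for c in review_a} & {c.issue_id for c in review_b}

    def keep(review):
        return [c for c in review if c.issue_id not in shared and c.supported]

    return keep(review_a), keep(review_b)


# Toy input standing in for the output of step (1); not benchmark data.
review_a = [
    Concern("recursion-mismatch", "Lemma 12", True),
    Concern("missing-independence", "Definition 5", True),
    Concern("unclear-notation", "Section 2", True),
]
review_b = [
    Concern("unclear-notation", "Section 2", True),
    Concern("sample-too-small", "Table 3", False),
]

res_a, res_b = residual_lists(review_a, review_b)
print([c.issue_id for c in res_a])  # ['recursion-mismatch', 'missing-independence']
print([c.issue_id for c in res_b])  # []
```

Only the two residual lists would go to the judge in step (4). The shared notation point earns neither side unique credit, and the unsupported concern is dropped.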

The methodology is designed to address well-known biases in simpler LLM-as-judge methods.¹ Asking a model to simply rank or score reviews yields an assessment that weights a variety of criteria in an unpredictable way. Reducing the comparison to individual concerns that are evaluated separately improves the accuracy and auditability of the LLM judgments. The benchmark also follows the task-level framing used in agent evaluations: it evaluates a complete review system under a specified scoring procedure, rather than treating the base model as the only object being measured.² For the full match-decision procedure, including the pipeline figure, see "How a match is decided" and the appendix material linked there.

Refine wins 90.4% of matches

We ran Refine, our concern-based AI research reviewer, against nine other AI reviewers on 150 economics preprints: 1,349 completed head-to-head matches in all. Refine won 1,220 of them (90.4%; 95% Wilson CI: 88.8%-91.9%). The comparison review won 62 matches (4.6%), and 67 matches (5.0%) were ties.

Refine wins 1,220 of 1,349 matches

Across 1,349 matches, Refine wins 1,220 (90.4%), ties 67 (5.0%), and loses 62 (4.6%).

A Refine win means the averaged judge panel score is above 0.5; ties remain in the denominator. The judges compare the residual concern lists: the paper-grounded criticisms that remain after shared issues and unsupported material have been removed. The 90.4% headline is wins over all 1,349 matches, not only decided matches. The mean panel score across all matches is 0.9207.
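The headline interval can be checked with a few lines of Python. This is a minimal sketch that assumes the reported interval is the standard Wilson score interval without continuity correction. The function name `wilson_interval` is ours.

```python
import math


def wilson_interval(wins, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Wins over all completed matches; ties and losses stay in the denominator.
lo, hi = wilson_interval(1220, 1349)
print(f"{1220 / 1349:.1%} (95% Wilson CI {lo:.1%}-{hi:.1%})")
# 90.4% (95% Wilson CI 88.8%-91.9%)
```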

Panel scores are heavily concentrated at unanimous Refine preferences; the appendix gives the score distribution and binning analysis.

The practical comparison for most users is out-of-box LLM review

The simplest baseline is asking a frontier LLM, in a single pass, to write a referee report. The tournament included five such single-shot LLM reviewers. All used the same out-of-the-box referee prompt and full-PDF setup. One late-added Fable 5 run completed 149 rather than 150 papers, yielding 749 single-shot matches. For the scaffold analysis below, the paired comparisons use the four base-model families for which both single-shot and open-scaffolding versions were available.

Against those single-shot LLM reviews, Refine won 710 of 749 matches (94.8%; 95% Wilson CI: 93.0%-96.2%). This comparison measures Refine against the single-pass referee report produced by the shared prompt and full-PDF setup, without review scaffolding.

Scaffolded review systems are stronger

The tournament also tested four scaffolded LLM configurations. Each uses the same base-model families, but wraps them in an open-source scaffold, Coarse, that drafts, critiques, and revises a report before submission. In this benchmark, that scaffold made the underlying models stronger reviewers.

Refine win rate by comparison system

Each matchup is split into its three outcomes: the comparison system's wins and ties on the left, Refine's wins on the right. Rows are sorted with Refine's strongest matchups on top.

Pooled across the four configurations, Refine won 510 of 600 matches against the scaffolded reviewers, or 85.0% (95% Wilson CI: 81.9%-87.6%). The margin is smaller than against single-shot LLMs. The closer contests come from review systems, not bare models.

Comparison-system win rate against Refine

How often each comparison system's review was preferred over Refine's, by system type (left) and by base-model family for paired single-shot and scaffolded comparisons (right). Lower is better for Refine; ties are excluded.

Among the four base-model families with both single-shot and scaffolded runs, the scaffold effect appears within every family. Refine's win rate falls from 90.7% against the single-shot GPT-5.5 reviewer to 72.0% against the GPT-5.5 scaffolded LLM; from 96.7% to 80.7% for Claude; from 99.3% to 91.3% for Gemini; and from 98.7% to 96.0% for DeepSeek. The GPT-5.5 scaffolded reviewer is the closest configuration in the tournament: Refine wins 108 of 150 matches (72.0%; 95% Wilson CI: 64.3%-78.6%), with the 42 non-wins split evenly between 21 ties and 21 comparison-review wins.

The relevant unit of comparison is the model x scaffold configuration, and those configurations vary substantially. For this task, the practical question is which review system works best for the paper in front of the user. The exact performance of different model choices will depend on the corpus of papers, so the model-specific statistics should be read with caution.

Refine win rate by base-model family

Refine win rate grouped by base-model family for paired single-shot and scaffolded comparisons, with 95% Wilson confidence intervals.

The advantage is broad across economics

The 150 preprints span two sources and four subfields. The 100 NBER working papers are split evenly across macro, econometrics, applied micro, and theory (25 papers each), while the 50 arXiv preprints are all economic theory.

Refine's win rate is similar across subfields. It ranges from 87.1% in applied micro to 92.6% in pooled theory, and the uncertainty intervals overlap.

Refine win rate by subfield

Refine's win rate by economics subfield, with 95% Wilson confidence intervals; theory pools the NBER and arXiv papers (n = 675).

The subfield rates are:

Macro: 200 / 225, 88.9%; SE 2.1 percentage points; 95% CI 84.1%-92.4%.
Econometrics: 200 / 225, 88.9%; SE 2.1 percentage points; 95% CI 84.1%-92.4%.
Applied micro: 195 / 224, 87.1%; SE 2.2 percentage points; 95% CI 82.0%-90.8%.
Theory: 625 / 675, 92.6%; SE 1.0 percentage points; 95% CI 90.4%-94.3%.
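The rates and standard errors in this list can be recomputed from the counts alone. The sketch below assumes the reported SE is the usual binomial standard error, sqrt(p(1 - p) / n). The helper name `binomial_se` is ours.

```python
import math

# (Refine wins, matches) per subfield, copied from the list above.
subfields = {
    "Macro": (200, 225),
    "Econometrics": (200, 225),
    "Applied micro": (195, 224),
    "Theory": (625, 675),
}


def binomial_se(wins, n):
    """Standard error of a sample proportion."""
    p = wins / n
    return math.sqrt(p * (1 - p) / n)


for name, (wins, n) in subfields.items():
    print(f"{name}: {wins / n:.1%}, SE {100 * binomial_se(wins, n):.1f} pp")

# The four subfields partition the tournament.
total_wins = sum(w for w, _ in subfields.values())
total_matches = sum(n for _, n in subfields.values())
print(total_wins, total_matches)  # 1220 1349
```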

With roughly 225 matches in each NBER-only subfield, the estimates show that the advantage holds across the economics subfields in the corpus, but they are too imprecise to rank the subfields finely.

Theory is the one subfield drawn from two sources, which gives a small source check. NBER theory is 202 of 225 (89.8%; SE 2.0 percentage points; 95% CI 85.1%-93.1%). arXiv theory is 423 of 450 (94.0%; SE 1.1 percentage points; 95% CI 91.4%-95.8%). The source split does not suggest a different conclusion for theory papers.

The larger variation appears when subfield is crossed with review-system configuration. Most cells still favor Refine by a wide margin, but a few Coarse-scaffolded LLM configurations narrow the gap in particular fields.

Refine win rate by comparison system and subfield

Refine's win rate for each comparison system (rows) against each subfield (columns). Darker cells mean Refine wins more often; the smaller n in individual cells makes these interactions exploratory.

Two cells are the low end of the matrix: Refine wins 15 of 25 matches against the GPT-5.5 scaffolded LLM in econometrics (60.0%; SE 9.8 percentage points; 95% CI 40.7%-76.6%) and 15 of 25 against the Claude scaffolded LLM in applied micro (60.0%; SE 9.8 percentage points; 95% CI 40.7%-76.6%). Other relatively close scaffolded cells are the same GPT-5.5 system in theory, where Refine wins 52 of 75 matches (69.3%; SE 5.3 percentage points; 95% CI 58.2%-78.6%), and the Claude system in macro, where Refine wins 18 of 25 matches (72.0%; SE 9.0 percentage points; 95% CI 52.4%-85.7%).

Refine's advantage is visible across macro, econometrics, applied micro, and theory, though the sample is too small to rank the subfields finely. Where other systems narrow the gap, it is in specific model x scaffold configurations rather than across a whole subfield.

The result is robust to the order in which the two reviews are presented, to scoring with either judge alone, and to the self-bias filter; the appendix reports those checks in full.

What drives the wins

The match procedure reduces two reviews to the concerns that distinguish them: points that are not shared by both sides and are not out-of-scope positioning comments. That residual list measures what one reviewer found and the other missed after shared and out-of-scope items are removed.

Refine carries more unique concerns than the comparison review at every tier

Across all 1,349 matches, Refine carries more unique residual concerns than the comparison review after shared issues are removed. The figure separates concerns that threaten a main result, local substantive issues, and cosmetic points.

Across 1,349 matches, Refine has 28.1 unique residual concerns per match on average, compared with 14.5 for the comparison review. That is roughly a two-to-one gap, but the important comparison is not raw volume. Restricting to the central and local substantive tiers gives 22.1 versus 11.8. Both gaps are computed after alignment and scope filtering have removed shared concerns, so they measure paper-grounded catches unique to one review, not review length.

The gap also appears in the highest-stakes tier. Refine averages 1.76 unique central-result concerns per match, versus 1.03 for the comparison review. For more local but still substantive concerns, the averages are 20.3 versus 10.8. Cosmetic residuals are reported separately in the figure. Together, the central and local substantive tiers are 22.1 concerns per match for Refine versus 11.8 for the comparison review.

Refine win rate by residual concern surplus

Refine's win rate by residual concern surplus. Error bars are 95% Wilson intervals, and n is shown for each bin.

Refine's win rate rises with its residual concern surplus, but residual count is not the whole comparison. When Refine has at least five more unique residual concerns than the comparison review, which happens in 912 of 1,349 matches, it wins 96.1% of the time. Even when Refine has at least three fewer residual concerns, it wins 69.8% of those 189 matches. Across the full tournament, Refine's residual surplus is positive in 79.2% of matches, with a mean surplus of 13.6 concerns and a median surplus of 11.
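A binned win rate of this kind can be computed as in the sketch below. It is an illustration only. The text names just the two outer bins (at least five more, at least three fewer), so the middle bin is an assumption, and the input rows are made up rather than taken from the benchmark.

```python
from collections import defaultdict


def surplus_bin(surplus):
    """Hypothetical bin edges; only the two outer bins are named in the text."""
    if surplus >= 5:
        return ">= +5"
    if surplus <= -3:
        return "<= -3"
    return "-2 to +4"


def win_rate_by_bin(matches):
    """matches: iterable of (refine_residuals, comparison_residuals, refine_won)."""
    tally = defaultdict(lambda: [0, 0])  # bin -> [Refine wins, matches]
    for refine_n, other_n, refine_won in matches:
        cell = tally[surplus_bin(refine_n - other_n)]
        cell[0] += int(refine_won)
        cell[1] += 1
    return {b: (wins, n, wins / n) for b, (wins, n) in tally.items()}


# Made-up matches for illustration only; not benchmark data.
toy = [(30, 12, True), (25, 20, True), (10, 9, False), (8, 12, True), (7, 11, False)]
rates = win_rate_by_bin(toy)
print(rates)
```

Ties and losses both count as non-wins here, which matches how the headline rate keeps ties in the denominator.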

The judge rationales point to the same mechanism. In Refine wins, judges' one-line reasons mention verified or grounded catches in 59.9% of decisions, precision or concreteness in 46.9%, and false positives or unsupported claims from the comparison review in 21.4%. In matches Refine did not win, the corresponding shares are 47.8%, 27.4%, and 12.9%. The theme that moves the other way is a single central catch by the comparison review: it appears in 38.2% of the reasons in matches Refine did not win and in 24.5% of the reasons in Refine wins.

The judge rationales are consistent with a specific mechanism: Refine usually brought more unique, supported, paper-grounded concerns to the judge, especially concerns that would require a real correction if right.

Case study: a formal-theory paper

One example makes the aggregate pattern concrete. In the tournament, Refine reviewed The Design and Composition of Structural Causal Decision Processes (arXiv 2605.02681), a dense formal-theory preprint built around definitions, lemmas, and recursive decision-process claims.

Refine won all nine matches on this paper, with a 1.00 panel score each time. Its review contained 43 atomic concerns: 19 concerns threatening a main result, 22 local substantive concerns, and 2 cosmetic concerns. All 43 were supported by the paper text. After alignment removed shared points and out-of-scope material, Refine still carried between 19 and 37 unique residual concerns into the nine comparisons.

The strongest concerns were formal rather than stylistic. Examples from Refine's supported concern list are:

Paper location	Refine concern	Why it matters
Lemma 12 / finite-horizon recursion	The recursion uses the same-period value function where Definition 10 uses the next-period value function.	If correct, the dynamic recursion does not match the paper's stated finite-horizon definition.
Definition 5 / model composition	The product-distribution step requires disjoint shocks and independence assumptions that are not stated.	The composition result depends on conditions not specified in the formal definition.
SCDP expressiveness relative to POMDPs	The paper claims that SCDPs are strictly more expressive than POMDPs, but the comparison appears to conflate POMDP environments with a standard solution method.	This is a broad technical claim: either the comparison class needs to be defined precisely, or the claim should be reframed as a representational advantage.
Definition 14 / SCDP base object	The definition switches from SCDMs back to SCIMs.	If intentional, SCDPs do not formally inherit the open roots and decision constraints that motivate the construction.

These are issues an author would need to resolve, not requests for more exposition.

Central-result residual concerns on the case-study paper

Concerns threatening a main result that survived to the judge, per match. Refine brought 8 to 14 in every match; comparison systems brought 0 to 4. The four systems with zero in this tier were single-shot LLM reviewers built on DeepSeek, Gemini, GPT-5.5, and Claude Fable 5.

The residual counts show why the judges had a clear comparison. Across the nine comparison systems, Refine brought 8 to 14 unique central-result concerns to the judge; the other systems brought 0 to 4. On total residual concerns, Refine's range was 19 to 37, while the other systems' was 0 to 20. The judges cited Refine's supported formal concerns about definitions, recursions, and examples even when the order was flipped.

How a match is decided

Each match runs the same eight stages. The purpose is to compare the paper-grounded substance that distinguishes the two reviews, rather than asking a model to choose between two free-text referee reports in one shot. The scoring procedure measures author-facing usefulness through verifiable, paper-grounded concerns that would help an author revise or assess a draft; it gives less credit to broad literature positioning, fit, taste, or claims requiring outside facts.

This procedure is deliberately more elaborate than simply prompting for a comparison of the two reviews: such comparisons are prone to a number of distortions, including favoring a review for being longer, or for containing more criticisms, even ones that are not valid.¹ Reducing each review to clear, auditable concerns before any judgment keeps the result grounded in the reviews themselves. The full model-facing prompts are collected in the appendix.

Unique concerns after filtering

Average concerns per match surviving each stage. Refine goes from 38.5 extracted concerns to 35.7 after filtering to 28.1 unique residual concerns; comparison reviews go from 26.8 to 22.2 to 14.5.

The stages are:

Extract. Convert each free-text review into atomic concerns and record the paper feature, if any, that each concern points to.
Classify. Label each concern by where it can be evaluated, how much it matters, whether the author can act on it, and whether outside factual knowledge is required.
Anchor-check. Verify whether the concern points to something that exists in the paper, without deciding whether the critique is correct.
Align. Match concerns the two reviews share, by content rather than wording, so neither side gets unique credit for a point both made.
Harmonize. Equalize the significance of matched concerns when two reviewers raised the same issue with different urgency.
Diff. Remove high-confidence overlaps and out-of-scope positioning comments, leaving each side's residual list.
Rank. Order the residual concerns within priority buckets so the judge sees the most useful catches first.
Judge. Ask the flip-averaged, self-bias-filtered judge panel to compare the supported, substantive residual concerns.

A match is counted as a Refine win when the averaged judge panel score is above 0.5. The panel is flip-averaged: each judge scores the two reviews in both orderings and the scores are averaged, so position preference is not allowed to decide the match. Ties are excluded from the win numerator but remain in the denominator.
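The decision rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's code. It assumes each judge's score is already expressed as a preference for Refine in [0, 1] for both orderings, and that a tie is a panel score of exactly 0.5. The function names, judge names, and family labels are hypothetical.

```python
def flip_averaged_score(refine_first, refine_second):
    """Average one judge's preference for Refine over both presentation orders."""
    return (refine_first + refine_second) / 2


def decide_match(judge_scores, judge_families, comparison_family):
    """judge_scores maps judge -> (score with Refine first, score with Refine second)."""
    # Self-bias filter: drop judges from the comparison system's model family.
    eligible = [j for j in judge_scores if judge_families[j] != comparison_family]
    if not eligible:
        raise ValueError("no eligible judge left after the self-bias filter")
    panel = sum(flip_averaged_score(*judge_scores[j]) for j in eligible) / len(eligible)
    if panel > 0.5:
        outcome = "refine_win"
    elif panel == 0.5:  # assumption: a tie is a panel score of exactly 0.5
        outcome = "tie"
    else:
        outcome = "comparison_win"
    return panel, outcome


# Hypothetical judges and scores, for illustration only.
scores = {"judge_a": (1.0, 0.5), "judge_b": (1.0, 1.0)}
families = {"judge_a": "gpt", "judge_b": "gemini"}

print(decide_match(scores, families, "claude"))  # (0.875, 'refine_win')
print(decide_match(scores, families, "gpt"))     # (1.0, 'refine_win'), single-judge match
```

In the first call, judge_a prefers Refine only when Refine is shown first, so its flip-averaged score is 0.75 rather than 1.0. In the second call the filter leaves one judge, which is how a match becomes single-judge.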

This design bases the comparison on unique, supported criticism of the paper.


Scope and limitations

Scope

The benchmark evaluates AI referee reports on economics preprints. It is not a general model leaderboard, and it does not rank base models independently of their review scaffolds. The measured object is the model x scaffold configuration applied to a specific author-facing task: producing useful, paper-grounded feedback on a draft.

Cautions and limitations

The measurement stages are model-assisted. Extraction, classification, alignment, and other stages use LLMs, which can make mistakes on these tasks. Our sense from human review of LLM judgments is that they are largely accurate in the domain of these papers. The appendix makes those stages auditable, and the deterministic support checks and flip-averaged judging reduce noise.
Some matches are single-judge by design. The self-bias filter removes same-family judges for GPT and Gemini comparison systems. The filtered matches run at 88.3% Refine wins, below the two-judge subset, so the filter does not inflate the headline, but it does widen uncertainty on those matches.

¹ The LLM-as-judge methodology literature documents these effects directly: position, verbosity, and self-preference biases in pairwise judging (Zheng et al. 2023); the pairwise protocol amplifying length and style biases relative to scoring each review on its own (Jeong et al. 2024); and the value of decomposing long-form text into atomic, source-anchored claims before evaluation (Min et al. 2023; Jacovi et al. 2025). Our adjudication follows the corresponding best practice: a diverse, flip-averaged judge panel with same-family judges filtered out (Verga et al. 2024).
² Anthropic's engineering note on agent evaluations makes a similar methodological distinction between the task, trials, graders, and the agent harness or scaffold; it also notes that evaluating an agent means evaluating the harness and the model together. See Anthropic, "Demystifying evals for AI agents," Jan. 9, 2026.