Manuscript red-team reference · companion to Paper Review

LLM self-judge bias

Using the same model family to evaluate itself.

What it is

LLM-as-judge evaluation, in which one language model scores another model's outputs, is now common. When the judge comes from the same model family as the system under test, self-preference bias is well documented: for example, GPT-4 systematically rates GPT-4 outputs higher than human raters do. The bias is largest on stylistic dimensions (fluency, coherence) and weakest on verifiable correctness.
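One way to make the bias concrete is to compare how far the judge's scores sit above human scores for outputs from its own family versus outputs from other families. The sketch below is illustrative only: the record layout (`family`, `judge_score`, `human_score`) and the function name `self_preference_gap` are assumptions, and it presumes judge and human scores are on the same scale.

```python
from statistics import mean


def self_preference_gap(records, judge_family):
    """Estimate how much more the judge inflates its own family's outputs.

    records: list of dicts with hypothetical keys 'family', 'judge_score'
    and 'human_score' (both scores assumed to share one scale).
    Returns mean(judge - human) on own-family outputs minus the same
    quantity on other-family outputs. A positive value suggests
    self-preference; zero suggests none on this sample.
    """
    def mean_offset(rows):
        return mean(r["judge_score"] - r["human_score"] for r in rows)

    own = [r for r in records if r["family"] == judge_family]
    other = [r for r in records if r["family"] != judge_family]
    if not own or not other:
        raise ValueError("need outputs from both the judge's family and another family")
    return mean_offset(own) - mean_offset(other)
```

Subtracting the other-family offset matters: a judge that is simply more generous than humans across the board would otherwise look self-preferring.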

Why a reviewer cares

Reviewers ask: Which model is the judge? Is it from the same family as the system under test? Do the results hold up under a different judge? Have you compared judge ratings with human ratings on a subset?

How to fix it

Use a judge from a different model family than the system under test. Where possible, anchor LLM-judge scores against human ratings on a subset. For stylistic dimensions, prefer reference-based metrics or paired comparisons over absolute scoring. Disclose the judge model and its known biases.
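Anchoring against human ratings can be as simple as reporting a rank correlation between judge and human scores on the human-rated subset. The following is a minimal sketch using only the standard library. The helper names (`average_ranks`, `pearson`, `spearman`) are illustrative, and Spearman correlation is one reasonable choice of agreement statistic, not the only one.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(judge_scores, human_scores):
    """Spearman rank correlation between judge and human scores."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

For example, `spearman(judge_scores, human_scores)` on the human-rated subset yields a single number to report next to the judge model's name. Computing it separately for each scored dimension shows whether agreement drops on the stylistic ones.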
