LLM self-judge bias
Using the same model family to evaluate itself.
What it is
LLM-as-judge evaluations are now common: one language model scores another language model's outputs. When the judge comes from the same model family as the system under test, self-preference bias is well documented: GPT-4 as a judge, for example, systematically rates GPT-4 outputs higher than human raters do. The bias is largest on stylistic dimensions (fluency, coherence) and weakest on verifiable correctness.
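One way to make the bias concrete is to measure how far the judge's scores sit above human scores, separately for outputs from the judge's own family and for outputs from other families. The sketch below is illustrative only: the record layout, the family labels, the scores, and the function names are all hypothetical, and it assumes judge and human ratings share the same scale.

```python
from statistics import mean

# Hypothetical records: (author_family, judge_score, human_score).
# Both scores are assumed to be on the same 1-5 scale.
records = [
    ("judge_family", 4.5, 4.0),
    ("judge_family", 4.0, 3.5),
    ("other_family", 3.5, 3.5),
    ("other_family", 4.0, 4.0),
]


def mean_inflation(rows):
    """Average of (judge score - human score) over the given rows."""
    if not rows:
        raise ValueError("no rows to average")
    return mean(judge - human for _, judge, human in rows)


def self_preference_gap(rows, judge_family):
    """Judge-vs-human inflation on the judge's own family minus the
    inflation on every other family. A positive value suggests the
    judge favors its own family beyond what human raters support."""
    own = [r for r in rows if r[0] == judge_family]
    others = [r for r in rows if r[0] != judge_family]
    return mean_inflation(own) - mean_inflation(others)


gap = self_preference_gap(records, "judge_family")
print(f"self-preference gap: {gap:+.2f}")
```

A gap near zero on a reasonably sized sample is what a reviewer would hope to see; a handful of items, as in this toy data, is far too few to conclude anything.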
Why a reviewer cares
Reviewers ask: which model is the judge? Is it from the same family as the system under test? Is the evaluation robust to swapping in a different judge? Have you compared judge ratings with human ratings on a subset?
How to fix it
Use a judge from a different model family than the system under test. Where possible, anchor LLM-judge scores against human ratings on a subset. For stylistic dimensions, prefer reference-based metrics or paired comparisons over absolute scoring. Disclose the judge model and its known biases.
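Anchoring judge scores against human ratings on a subset can be sketched as follows. This is a minimal illustration, not a prescribed procedure: it assumes paired judge and human scores for the same items, reports a Spearman rank correlation (does the judge order items the way humans do?) together with the mean score offset (does the judge run high or low?), and uses made-up scores and hypothetical function names. The rank correlation is written out by hand so the example needs only the standard library.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def anchor_report(judge_scores, human_scores):
    """Rank agreement and mean offset between judge and human ratings."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("scores must be paired item by item")
    offsets = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "spearman": spearman(judge_scores, human_scores),
        "mean_offset": mean(offsets),
    }


# Made-up scores for five items rated by both the judge and humans.
judge_scores = [4, 5, 3, 4, 2]
human_scores = [3, 4, 3, 4, 1]
report = anchor_report(judge_scores, human_scores)
print(report)
```

Reporting both numbers matters because they fail independently: a judge can rank items much as humans do while still inflating every score, and a judge with no average offset can still disagree with humans about which outputs are better.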
This is one of ~15 canonical methodology explainers Paper Review's red-team report links to.