Manuscript red-team reference · companion to Paper Review

test-train contamination

When the evaluation set leaked into training.

What it is

If even a small fraction of your test set ends up in training, whether directly, through near-duplicates, or via a pretraining corpus that included the benchmark, your evaluation no longer measures generalisation: part of the score reflects memorisation. For language-model evaluations, contamination is especially insidious because the training corpus is too large to inspect by hand and the test set is usually public.

Why a reviewer cares

Reviewers check: was the train/test split clean? For pretrained models, does the pretraining cutoff predate the test set's release? Did you de-duplicate against the eval set? Are there near-duplicates (paraphrases, translations, format variants)?
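
As an illustration of the near-duplicate question, here is a minimal sketch that scores lexical overlap between test items and training documents using Jaccard similarity over word shingles. The function names, the shingle size k=3 and the 0.5 threshold are assumptions chosen for the example, not recommended settings. A lexical check like this can catch format variants and light rewording, but it will not catch translations or heavy paraphrases.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of text, after lowercasing and stripping punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < k:
        # Very short texts: fall back to a single shingle holding all tokens.
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(test_items, train_docs, k=3, threshold=0.5):
    """Return (test_index, train_index, score) for every pair at or above threshold."""
    train_sets = [shingles(doc, k) for doc in train_docs]
    hits = []
    for i, item in enumerate(test_items):
        item_set = shingles(item, k)
        for j, train_set in enumerate(train_sets):
            score = jaccard(item_set, train_set)
            if score >= threshold:
                hits.append((i, j, round(score, 3)))
    return hits


if __name__ == "__main__":
    test_items = ["Q: What is the boiling point of water at sea level?"]
    train_docs = [
        "what is the boiling point of water at sea level",
        "The mitochondria is the powerhouse of the cell",
    ]
    print(near_duplicates(test_items, train_docs))
```

The pairwise loop is quadratic, so it only suits small samples. For a full training corpus, an approximate method such as MinHash with locality-sensitive hashing is a common substitute.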

How to fix it

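Build a contamination test into your evaluation: search the training corpus for verbatim substrings or long n-grams from the test set, after normalising case, punctuation and whitespace so that format variants still match. For LLM evals, prefer benchmarks released after your model's pretraining cutoff. Where contamination is possible, report a contamination-robustness check that masks or removes the flagged items and gives the score on the remaining clean subset alongside the full-set score.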
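
The check described above can be sketched as follows, assuming the training corpus fits in memory as a list of strings and that per-item correctness is already available as a list of booleans. All function names and the n=8 window are hypothetical choices for illustration. The sketch flags any test item that shares at least one normalised word n-gram with the training data, then reports accuracy on the full set and on the unflagged subset.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace so format variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text, n):
    """Return the set of word n-grams of the normalised text."""
    tokens = normalize(text).split()
    # Texts shorter than n tokens yield no n-grams and are therefore never flagged.
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs, n):
    """Collect every n-gram that occurs anywhere in the training documents."""
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def flag_contaminated(test_items, train_index, n):
    """Return indices of test items sharing at least one n-gram with training."""
    return [i for i, item in enumerate(test_items) if ngrams(item, n) & train_index]


def accuracy(correct, indices):
    """Mean of correct[i] over indices; NaN when indices is empty."""
    indices = list(indices)
    if not indices:
        return float("nan")
    return sum(correct[i] for i in indices) / len(indices)


def robustness_report(test_items, correct, train_docs, n=8):
    """Compare accuracy on all test items with accuracy on unflagged items only."""
    index = build_train_index(train_docs, n)
    flagged = set(flag_contaminated(test_items, index, n))
    clean = [i for i in range(len(test_items)) if i not in flagged]
    return {
        "n_total": len(test_items),
        "n_flagged": len(flagged),
        "acc_all": accuracy(correct, range(len(test_items))),
        "acc_clean": accuracy(correct, clean),
    }
```

A large gap between the two accuracies suggests the headline number depends on the flagged items. A small gap is reassuring but not conclusive, since exact n-gram matching misses paraphrases and translations.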

This is one of ~15 canonical methodology explainers that Paper Review's red-team report links to.