test-train contamination
When the evaluation set leaks into training.
What it is
If even a small fraction of your test set ends up in training, whether directly, through near-duplicates, or via a pretraining corpus that included the benchmark, your evaluation no longer measures generalisation: a high score may reflect memorised items rather than learned ability. For language-model evaluations, contamination is especially insidious because the training corpus is too large to inspect by hand and the test set is public, so it can be scraped into that corpus unnoticed.
Why a reviewer cares
Reviewers check: was the train/test split clean? For pretrained models, does the pretraining cutoff predate the test set's release? Did you de-duplicate against the eval set? Are there near-duplicates (paraphrases, translations, format variants)?
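The near-duplicate question can be partly automated. Below is a minimal sketch, assuming the training documents and test items fit in memory as plain strings; the function names, the 5-character shingle size, and the 0.8 threshold are all illustrative choices, not values from this explainer. It catches format variants (case, punctuation, whitespace) but not paraphrases or translations, which would need a semantic method such as embedding similarity.

```python
import re


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def shingles(text: str, k: int = 5) -> set[str]:
    """Set of character k-grams of the normalised text (k is an assumption)."""
    norm = normalise(text)
    if len(norm) < k:
        return {norm} if norm else set()
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(train_docs, test_items, threshold=0.8):
    """Return (test_index, train_index, similarity) for pairs at or above threshold.

    Brute-force O(len(train) * len(test)); a large corpus would need
    something like MinHash/LSH instead. The threshold is illustrative.
    """
    train_sets = [shingles(doc) for doc in train_docs]
    hits = []
    for i, item in enumerate(test_items):
        item_set = shingles(item)
        for j, train_set in enumerate(train_sets):
            sim = jaccard(item_set, train_set)
            if sim >= threshold:
                hits.append((i, j, sim))
    return hits


if __name__ == "__main__":
    train = ["What is the capital of France?  Paris."]
    test = ["what is the capital of france: paris", "How many moons does Mars have?"]
    for test_idx, train_idx, sim in near_duplicates(train, test):
        print(f"test item {test_idx} ~ train doc {train_idx} (similarity {sim:.2f})")
```

Any pair this flags still deserves a manual look: a high shingle overlap can also come from shared boilerplate rather than a leaked item.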
How to fix it
Build a contamination test into your evaluation: search the training corpus for verbatim substrings or long n-grams from the test set, and flag every test item that matches. For LLM evals, prefer benchmarks released AFTER your model's pretraining cutoff. Where contamination is possible, report a contamination-robustness check: re-score with the contaminated items masked or removed, and show that result alongside the full-set number.
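As a rough illustration of both steps, the sketch below flags test items that share a word n-gram with the training corpus, then reports accuracy on the full set and on the clean subset. It is a sketch under stated assumptions, not a standard tool: the helper names are hypothetical, the default window of 8 words is an illustrative choice, tokenisation is a plain whitespace split, and the corpus is assumed small enough to index in memory.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Word n-grams of lowercased text; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(train_docs, n: int = 8) -> set[tuple[str, ...]]:
    """Union of all n-grams in the training corpus (n=8 is an assumption)."""
    index = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def contaminated_flags(test_items, index, n: int = 8) -> list[bool]:
    """True for each test item sharing at least one n-gram with the index.

    Items shorter than n words are never flagged by this check.
    """
    return [not ngrams(item, n).isdisjoint(index) for item in test_items]


def robustness_report(correct, flags) -> dict:
    """Accuracy on all items, the clean subset, and the contaminated subset."""
    def accuracy(values):
        return sum(values) / len(values) if values else None

    clean = [c for c, f in zip(correct, flags) if not f]
    dirty = [c for c, f in zip(correct, flags) if f]
    return {
        "all": accuracy(correct),
        "clean": accuracy(clean),
        "contaminated": accuracy(dirty),
        "n_contaminated": sum(flags),
    }


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    test = ["A quick brown fox appeared", "completely unrelated sentence here now"]
    index = build_index(train, n=3)  # small n only because the toy data is tiny
    flags = contaminated_flags(test, index, n=3)
    # `correct` stands in for per-item model results from your own eval harness.
    print(robustness_report(correct=[True, False], flags=flags))
```

A large gap between the "all" and "clean" numbers suggests the headline score leans on leaked items; a small gap is reassuring only to the extent that the overlap check would actually catch the kinds of leakage you expect.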
This is one of ~15 canonical methodology explainers Paper Review's red-team report links to.