Evaluating data contamination in LLMs
Introduction
In recent years, large language models (LLMs) such as GPT and Llama have driven steady gains on popular benchmarks. However, one issue has become increasingly important: data contamination, the overlap between a model's training data and the evaluation benchmarks used to assess its performance. This