OpenAI’s BrowseComp: Redefining How We Benchmark Web-Browsing Agents
As language models become increasingly agentic, browsing the internet, reasoning across sources, and acting on user instructions, our methods of evaluating their capabilities must evolve too. OpenAI’s BrowseComp introduces a fresh benchmark for this paradigm, offering a challenging, realistic, and carefully curated evaluation framework.