Pushing the Frontier of AI for Science

Discover benchmarks and track model performance across scientific domains

Leaderboards

[Leaderboard table: Rank | Model | Org | Spearman's Correlation | Mean Absolute Error (MAE) | Price (per 1M input tokens) | Distribution Plot | Date]

About ICLR2026-1K Benchmark

This benchmark evaluates AI models' ability to assess scientific papers, using submissions to the International Conference on Learning Representations (ICLR) 2026. It consists of a random sample of 1,000 papers that received human reviews during the conference's actual review process.

Each model is tasked with generating review scores for these papers. We then compare the AI-generated scores against the actual human reviewer scores to measure how well AI models can assess scientific quality and provide meaningful feedback. Importantly, all models in this benchmark were released before the ICLR 2026 reviews were made publicly available, ensuring there is no possibility of data leakage or contamination.
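To make the setup concrete, here is a minimal, hypothetical sketch of how a review score might be elicited from a model through an OpenAI-style chat API. The prompt wording, the model name, and the 1-10 rating scale are illustrative assumptions, not the benchmark's actual protocol.

    # Hypothetical sketch: eliciting an overall review score from an LLM.
    # The prompt, model name, and 1-10 scale are assumptions for
    # illustration, not the benchmark's actual setup.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def score_paper(paper_text: str) -> float:
        """Ask the model for a single overall rating of one paper."""
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": ("You are an expert ICLR reviewer. Read the paper "
                             "and reply with only an overall rating from 1 to 10.")},
                {"role": "user", "content": paper_text},
            ],
        )
        return float(response.choices[0].message.content.strip())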

Evaluation Metrics:

  • Mean Absolute Error (MAE): Measures the average absolute difference between AI-generated and human scores (lower is better)
  • Spearman's Correlation: Measures how well the AI's ranking of papers matches human reviewers' rankings (higher is better); see the sketch after this list
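Both metrics can be computed in a few lines. The sketch below assumes paired per-paper score arrays (one AI-generated, one the mean of the human reviews); the numbers are made up for illustration.

    # Sketch of the two leaderboard metrics on made-up example scores.
    import numpy as np
    from scipy.stats import spearmanr

    ai_scores = np.array([6.0, 4.5, 7.0, 3.0, 8.0])     # hypothetical model ratings
    human_scores = np.array([5.5, 4.0, 8.0, 3.5, 6.0])  # hypothetical human means

    # Mean Absolute Error: average |AI - human| gap per paper (lower is better)
    mae = np.mean(np.abs(ai_scores - human_scores))

    # Spearman's correlation: rank agreement between the two lists (higher is better)
    rho, _ = spearmanr(ai_scores, human_scores)

    print(f"MAE = {mae:.2f}, Spearman's rho = {rho:.2f}")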

AI Reviewer Challenge

We're launching the first dynamic AI reviewer challenge to continuously evaluate AI's ability to review scientific papers. The challenge will run multiple times per year, creating fresh, real-world evaluation sets that push the boundaries of AI for science. We partner with conference organizers and run our own review processes to ensure that none of the selected papers appear in LLMs' training data and that all reviews are written by human experts. Holding regular challenges throughout the year keeps our benchmarks current and prevents overfitting, so models are always evaluated on genuinely new, unseen scientific content.

Join the Challenge

We are looking for reviewers, participants, collaborators, and funders to help with the challenge. Please fill out the form below to submit your interest.

Team

Tom Cohen
MIT
tomco@mit.edu

Jiaxin Pei
Stanford & UT-Austin
pedropei@stanford.edu

Chenglei Si
Stanford

Jing Yang
PaperCopilot

Affiliated Institutions
