LLM Benchmarks

Review benchmark coverage and open task-suite pages to compare public model results within a consistent evaluation setup.

Coverage

Benchscope tracks public results for task suites including MMLU, GSM8K, IFEval, MATH, and MuSR, along with related benchmarks.

Comparable Results

Benchmark pages keep runs anchored to a specific task, version, sample size, prompt mode, and scoring method so results can be interpreted in context.
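As a rough illustration of why this anchoring matters, the sketch below groups runs by their full evaluation setup, so that scores are only set side by side when every setting matches. The class names, field names, and `group_comparable` helper are hypothetical and chosen for this example; they are not Benchscope's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSetup:
    """Hypothetical record of the settings a run is anchored to."""
    task: str
    version: str
    sample_size: int
    prompt_mode: str
    scoring_method: str


@dataclass(frozen=True)
class Run:
    """One reported result for a model under a given setup (assumed shape)."""
    model: str
    setup: EvalSetup
    score: float


def group_comparable(runs):
    """Group runs by identical setup; only runs in the same group share context."""
    groups = {}
    for run in runs:
        # A frozen dataclass is hashable, so the whole setup serves as the key.
        groups.setdefault(run.setup, []).append(run)
    return groups
```

Under this sketch, two runs on the same task that differ only in prompt mode (say, a few-shot versus a zero-shot prompt) land in separate groups, which reflects the idea that their scores should not be read as a direct comparison.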
