LLM Benchmark Results by Provider

Each benchmark page groups comparable public evaluation runs for one task suite, anchored to a fixed prompt, scoring method, and sample size. Use these pages to compare how providers and endpoints perform on the same task under the same conditions.

How to Use Benchmark Pages

Open a benchmark page to see accuracy scores and latency across all providers that have run it publicly. Filter by provider, model family, or prompt mode. Canonical-prompt runs give the fairest cross-provider comparisons. Check the methodology for how each benchmark is scored.

Flagship Benchmarks

MMLU tests breadth of knowledge across 57 subjects and has the most public runs on Benchscope. MATH tests competition-level mathematical problem solving and shows the widest spread of scores across providers. These two benchmarks are the best starting points for endpoint comparison.

What Makes Results Comparable

Benchscope anchors each run to a specific task, version, sample size, prompt mode, and scoring method. Runs that differ in prompt mode or sample size should not be compared directly without accounting for those differences. The methodology page explains each of these factors.
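As a rough illustration of the anchoring idea, the sketch below checks whether two runs share the same five anchor values before they are compared. The Run record, its field names, and the sample values ("v1", "canonical", "custom", "exact_match", the provider names) are hypothetical and chosen only for this example; they are not taken from a Benchscope API or data export.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical record: the fields mirror the five anchoring factors named in
# the text, plus a provider label. This is not a Benchscope data format.
@dataclass(frozen=True)
class Run:
    provider: str
    task: str
    version: str
    sample_size: int
    prompt_mode: str
    scoring_method: str


# The factors a run is anchored to, as listed in the paragraph above.
ANCHOR_FIELDS = ("task", "version", "sample_size", "prompt_mode", "scoring_method")


def mismatched_anchors(a: Run, b: Run) -> List[str]:
    """Return the anchor fields on which two runs differ.

    An empty list means the runs share every anchor and can be compared
    directly; any entry names a factor that needs care first.
    """
    return [field for field in ANCHOR_FIELDS if getattr(a, field) != getattr(b, field)]


# Illustrative values only (assumptions, not real runs).
run_a = Run("provider-a", "MMLU", "v1", 14042, "canonical", "exact_match")
run_b = Run("provider-b", "MMLU", "v1", 14042, "custom", "exact_match")

print(mismatched_anchors(run_a, run_b))
```

Returning the list of differing fields, rather than a plain yes or no, makes it clear which factor to read up on in the methodology before drawing a conclusion.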

Current Benchmark Coverage

MMLU: Massive Multitask Language Understanding; 14,042 examples; 23 public runs.
GSM8K: Grade School Math 8K; 1,319 examples; 18 public runs.
MuSR: Multi-Step Soft Reasoning; 756 examples; 15 public runs.
IFEval: Instruction Following Evaluation; 541 examples; 14 public runs.
MATH: Competition Mathematics; 5,000 examples; 6 public runs.
GAIA (Text-Only): Curated text-only subset of GAIA real-world reasoning tasks; 18 examples; 3 public runs.
Garak: Refusal rate on Garak's static harmful prompt datasets; 953 examples; 0 public runs.
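The example counts above vary by almost three orders of magnitude, which affects how much weight a small score difference can bear. The sketch below uses a generic statistical rule of thumb, not a Benchscope scoring method: it treats each example as an independent pass/fail trial and computes the worst-case standard error of a measured proportion (such as accuracy or refusal rate). The function name is an assumption made for this example.

```python
import math

# Example counts copied from the coverage list above.
EXAMPLES = {
    "MMLU": 14042,
    "GSM8K": 1319,
    "MuSR": 756,
    "IFEval": 541,
    "MATH": 5000,
    "GAIA (Text-Only)": 18,
    "Garak": 953,
}


def worst_case_stderr(n: int) -> float:
    """Upper bound on the standard error of a proportion measured on n
    independent pass/fail examples.

    The binomial standard error is sqrt(p * (1 - p) / n), which is largest
    at p = 0.5, giving 0.5 / sqrt(n). Assumes independent examples.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of examples")
    return 0.5 / math.sqrt(n)


# Print benchmarks from smallest to largest sample size.
for name, n in sorted(EXAMPLES.items(), key=lambda item: item[1]):
    points = 100 * worst_case_stderr(n)
    print(f"{name}: n={n}, worst-case standard error ~ {points:.1f} points")
```

Under these assumptions, the bound for an 18-example benchmark is roughly 12 percentage points, while for 14,042 examples it is under half a point, so small gaps between providers on the smaller suites deserve extra caution.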

Browse by Benchmark

Read the methodology to understand how runs are defined, scored, and compared across benchmarks.

Editorial Comparisons
