Public LLM Eval Runs

Browse shared Benchscope runs and compare provider-hosted model endpoints by score, latency, throughput, sample size, prompt mode, and run status.

What Runs Include

Each run records the endpoint, model family, hosting provider, benchmark, prompt mode, lifecycle state, score, latency metrics, sample count, and per-example outputs.
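
As an illustration, one run record could be modeled as a plain object like the sketch below. The field names are hypothetical assumptions for this example, not Benchscope's actual schema.

```javascript
// Hypothetical shape of a single Benchscope run record.
// All field names here are illustrative, not Benchscope's real schema.
const exampleRun = {
  endpoint: "provider-x/model-y-chat", // provider-hosted endpoint id
  modelFamily: "model-y",
  provider: "provider-x",
  benchmark: "example-benchmark",
  promptMode: "canonical", // e.g. "canonical" or "custom"
  status: "complete", // lifecycle state, e.g. "running", "partial", "complete"
  score: 0.87,
  latency: { meanMs: 420, p95Ms: 910 }, // latency metrics
  sampleCount: 500,
  examples: [
    // per-example outputs
    { input: "…", output: "…", correct: true },
  ],
};

console.log(exampleRun.status); // "complete"
```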

How To Compare Runs

Use canonical-prompt runs for the cleanest comparisons. Partial runs and custom-prompt runs remain useful, but Benchscope flags them because sample selection and prompt wording can change outcomes.
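
The comparison guidance above can be sketched programmatically: filter to complete, canonical-prompt runs before ranking by score. This is a minimal sketch assuming hypothetical field names (`promptMode`, `status`, `score`), not Benchscope's API.

```javascript
// Sketch: keep only complete canonical-prompt runs for head-to-head
// comparison, then rank them by score, highest first.
// Field names are illustrative assumptions, not Benchscope's schema.
function comparableRuns(runs) {
  return runs
    .filter((r) => r.promptMode === "canonical" && r.status === "complete")
    .sort((a, b) => b.score - a.score);
}

const runs = [
  { endpoint: "a", promptMode: "canonical", status: "complete", score: 0.81 },
  { endpoint: "b", promptMode: "custom", status: "complete", score: 0.9 },
  { endpoint: "c", promptMode: "canonical", status: "partial", score: 0.85 },
  { endpoint: "d", promptMode: "canonical", status: "complete", score: 0.88 },
];

console.log(comparableRuns(runs).map((r) => r.endpoint)); // ["d", "a"]
```

Flagged runs (custom-prompt or partial) are dropped here rather than silently mixed in, mirroring the caution above about sample selection and prompt wording.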

Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above to reach the main public sections.