Public LLM Eval Runs
Browse shared Benchscope runs and compare provider-hosted model endpoints by score, latency, throughput, sample size, prompt mode, and run status.
What Runs Include
Each run records the endpoint, model family, hosting provider, benchmark, prompt mode, lifecycle state, score, latency metrics, sample count, and per-example outputs.
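As an illustration, a run record carrying these fields might look like the following. This is a hypothetical sketch: the field names, values, and URL are placeholders, not Benchscope's actual schema or export format.

```python
# Hypothetical sketch of a Benchscope run record.
# All field names and values are illustrative placeholders.
run = {
    "endpoint": "https://api.example.com/v1/chat",  # provider-hosted endpoint (placeholder URL)
    "model_family": "example-model",
    "provider": "ExampleHost",
    "benchmark": "example-benchmark",
    "prompt_mode": "canonical",                     # canonical or custom prompt wording
    "status": "complete",                           # lifecycle state, e.g. complete or partial
    "score": 0.87,                                  # benchmark score
    "latency_ms": {"p50": 420, "p95": 1100},        # latency metrics
    "sample_count": 500,                            # number of evaluated examples
    "examples": [],                                 # per-example outputs
}
```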
How To Compare Runs
Use canonical-prompt runs for the cleanest comparisons. Partial runs and custom-prompt runs remain useful, but Benchscope flags them because both sample selection and prompt wording can change outcomes.
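The guidance above can be sketched as a simple filter: keep only complete, canonical-prompt runs before ranking by score. This is a toy sketch with assumed field names (`status`, `prompt_mode`, `score`), not Benchscope's actual data format.

```python
def comparable_runs(runs):
    """Keep only complete runs that used the canonical prompt,
    since partial or custom-prompt runs may not be directly comparable."""
    return [
        r for r in runs
        if r["status"] == "complete" and r["prompt_mode"] == "canonical"
    ]

# Toy data: one run fails the status filter, one fails the prompt-mode filter.
runs = [
    {"model_family": "a", "status": "complete", "prompt_mode": "canonical", "score": 0.81},
    {"model_family": "b", "status": "partial",  "prompt_mode": "canonical", "score": 0.90},
    {"model_family": "c", "status": "complete", "prompt_mode": "custom",    "score": 0.85},
]

# Rank the surviving runs by score, highest first.
ranked = sorted(comparable_runs(runs), key=lambda r: r["score"], reverse=True)
# Only model family "a" survives the filter in this toy data set.
```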
Benchscope is a JavaScript application. If the interactive interface does not load, enable JavaScript in your browser or use the links above to reach the main public sections.