Public LLM Eval Runs

Browse shared Benchscope runs and compare provider-hosted model endpoints by score, latency, throughput, sample size, prompt mode, and run status.

What Runs Include

Each run records the endpoint, model family, hosting provider, benchmark, prompt mode, lifecycle state, score, latency metrics, sample count, and per-example outputs.
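
As an illustration, one run record could be modeled as a plain object like the sketch below. The field names are hypothetical assumptions for this example, not Benchscope's actual schema.

```javascript
// Hypothetical shape of a single Benchscope run record.
// All field names here are illustrative, not Benchscope's real schema.
const exampleRun = {
  endpoint: "provider-x/model-y-chat", // provider-hosted endpoint id
  modelFamily: "model-y",
  provider: "provider-x",
  benchmark: "example-benchmark",
  promptMode: "canonical", // e.g. "canonical" or "custom"
  status: "complete", // lifecycle state, e.g. "running", "partial", "complete"
  score: 0.87,
  latency: { meanMs: 420, p95Ms: 910 }, // latency metrics
  sampleCount: 500,
  examples: [
    // per-example outputs
    { input: "…", output: "…", correct: true },
  ],
};

console.log(exampleRun.status); // "complete"
```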

How To Compare Runs

Use canonical-prompt runs for the cleanest comparisons. Partial runs and custom-prompt runs remain useful, but Benchscope flags them because sample selection and prompt wording can change outcomes.
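
The comparison guidance above can be sketched programmatically: filter to complete, canonical-prompt runs before ranking by score. This is a minimal sketch assuming hypothetical field names (`promptMode`, `status`, `score`), not Benchscope's API.

```javascript
// Sketch: keep only complete canonical-prompt runs for head-to-head
// comparison, then rank them by score, highest first.
// Field names are illustrative assumptions, not Benchscope's schema.
function comparableRuns(runs) {
  return runs
    .filter((r) => r.promptMode === "canonical" && r.status === "complete")
    .sort((a, b) => b.score - a.score);
}

const runs = [
  { endpoint: "a", promptMode: "canonical", status: "complete", score: 0.81 },
  { endpoint: "b", promptMode: "custom", status: "complete", score: 0.9 },
  { endpoint: "c", promptMode: "canonical", status: "partial", score: 0.85 },
  { endpoint: "d", promptMode: "canonical", status: "complete", score: 0.88 },
];

console.log(comparableRuns(runs).map((r) => r.endpoint)); // ["d", "a"]
```

Flagged runs (custom-prompt or partial) are dropped here rather than silently mixed in, mirroring the caution above about sample selection and prompt wording.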

Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above to reach the main public sections.