Groq Benchmark Results
Groq is an inference provider known for high-throughput, low-latency serving on its custom LPU hardware. Benchscope records public evaluation runs across five model families hosted on Groq, covering GAIA_TEXT, MUSR, GSM8K, and MMLU.
About Groq Endpoints
Groq's LPU architecture prioritizes inference speed. Benchmark scores from Groq endpoints reflect Groq's specific serving configuration and hardware, not model capability in isolation. Scores for a model family on Groq can differ from scores for the same model hosted elsewhere because of quantization, infrastructure, or serving optimizations. Use canonical-prompt runs for the cleanest cross-provider comparisons.
Hosted Model Families
Model families with public evaluation runs on Groq: Llama 3.3 70B, Qwen3 32B, Llama 3.1 8B, GPT-OSS 20B, GPT-OSS 120B.
Recent Groq Runs
- Groq / Llama 3.3 70B on GAIA_TEXT: completed; 20.0%; 1015 ms p50 latency; 10 samples.
- Groq / Qwen3 32B on GAIA_TEXT: partial; 33.3%; 8102 ms p50 latency; 10 samples.
- Groq / Qwen3 32B on MUSR: completed; 30.0%; 1672 ms p50 latency; 20 samples.
- Groq / Llama 3.1 8B on GSM8K: completed; 74.1%; 422 ms p50 latency; 27 samples.
- Groq / Qwen3 32B on GSM8K: completed; 89.0%; 4318 ms p50 latency; 100 samples.
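Several of the runs above use small sample counts, so the reported percentages carry wide uncertainty. The sketch below estimates a 95% Wilson score interval for each completed run. It assumes, as an illustration and not as a documented Benchscope method, that the listed sample count is the number of independently scored pass/fail items. The partial run is left out because its scored-item count is not stated. The function name `wilson_interval` and the `runs` list are hypothetical.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion.

    accuracy: observed fraction correct, between 0 and 1
    n: number of scored samples (assumed independent pass/fail items)
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Completed runs copied from the list above: (model, benchmark, accuracy, samples)
runs = [
    ("Llama 3.3 70B", "GAIA_TEXT", 0.200, 10),
    ("Qwen3 32B", "MUSR", 0.300, 20),
    ("Llama 3.1 8B", "GSM8K", 0.741, 27),
    ("Qwen3 32B", "GSM8K", 0.890, 100),
]

for model, bench, acc, n in runs:
    low, high = wilson_interval(acc, n)
    print(f"{model} on {bench}: {acc:.1%} (95% CI {low:.1%} to {high:.1%}, n={n})")
```

Under these assumptions, the interval for a 10-sample run spans tens of percentage points, so small differences between such runs should not be read as a ranking.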
Related
- MMLU benchmark results across all providers
- MATH benchmark results across all providers
- GSM8K benchmark results across all providers
- All model families on Benchscope
- Best LLM endpoint for MMLU
- Best LLM endpoint for MATH
- Best LLM endpoint for GSM8K
- Llama 3.3 70B on Groq vs Together AI
- How benchmark results are defined and compared