Tests factual accuracy on simple questions from parametric knowledge, emphasizing calibration — knowing when the model doesn't know the answer.
Why it matters: Even GPT-4o scores below 40%, making the benchmark surprisingly challenging. It tests honesty and factual reliability, not just knowledge breadth.
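The calibration idea above can be sketched in code. In SimpleQA-style grading, each response is marked correct, incorrect, or not attempted; headline accuracy counts only correct answers, while correct-given-attempted rewards models that decline rather than guess. This is an illustrative sketch, not the official grader: the label strings and the `simpleqa_metrics` helper are assumptions.

```python
# Hypothetical sketch of SimpleQA-style scoring. Grade labels and the
# helper name are illustrative assumptions, not the official grader API.

def simpleqa_metrics(grades):
    """Compute headline accuracy and correct-given-attempted from grades."""
    n = len(grades)
    correct = grades.count("correct")
    attempted = correct + grades.count("incorrect")  # "not_attempted" excluded
    return {
        "accuracy": correct / n if n else 0.0,
        "correct_given_attempted": correct / attempted if attempted else 0.0,
    }

# A well-calibrated model says "not attempted" when unsure instead of
# guessing wrong, which raises correct-given-attempted.
grades = ["correct", "incorrect", "not_attempted", "correct"]
metrics = simpleqa_metrics(grades)
print(metrics)  # accuracy 0.5, correct_given_attempted ≈ 0.667
```

Declining to answer still lowers headline accuracy, so a model cannot score well by abstaining everywhere; the two metrics together separate knowledge from calibration.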
Top Model
38.2%
GPT-4o
Average Score
38.2%
Across 1 model
Models Tested
1
Metric: accuracy
Human Baseline
-
Score Range: 0%–100%
SimpleQA Scores - Top 1 Model
Ranked by SimpleQA score (%)
All models with a reported SimpleQA score, ranked by highest accuracy.
SimpleQA is a standardized evaluation of short-form factual accuracy: models answer simple fact-seeking questions from parametric knowledge alone. It provides comparable scores across different models, helping developers choose the right model for their needs.
GPT-4o currently holds the top score on the SimpleQA benchmark. See our full rankings table above for the complete leaderboard with 1 model.
We update benchmark data from multiple sources including HuggingFace open-source model leaderboards and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While SimpleQA is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.