AI models ranked by instruction-following accuracy using the IFEval benchmark.
- Top model: GPT-5.4 (score 93.5)
- Average score: 85.2 across all ranked models
- Models ranked: 39 with benchmark data
Top Instruction-Following Models by Weighted Score
All 39 ranked models, ordered by weighted score
| # | Model | Developer | Score |
|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 93.5 |
| 2 | GPT-5.2 | OpenAI | 93.0 |
| 3 | Claude Opus 4.6 | Anthropic | 92.8 |
| 4 | GPT-5.1 | OpenAI | 92.5 |
| 5 | Claude 3.7 Sonnet | Anthropic | 92.3 |
| 6 | Llama 3.3 70B Instruct | Meta | 92.1 |
| 7 | GPT-5 | OpenAI | 92.0 |
| 8 | o3 | OpenAI | 92.0 |
| 9 | Claude Opus 4.5 | Anthropic | 91.5 |
| 10 | Gemini 3 Flash Preview | Google | 91.0 |
| 11 | Claude Sonnet 4.6 | Anthropic | 91.0 |
| 12 | Grok 4 | xAI | 91.0 |
| 13 | Claude Sonnet 4 | Anthropic | 90.8 |
| 14 | Claude Opus 4 | Anthropic | 90.5 |
| 15 | Claude Sonnet 4.5 | Anthropic | 90.2 |
| 16 | o3 Mini | OpenAI | 90.2 |
| 17 | o4 Mini | OpenAI | 90.0 |
| 18 | DeepSeek V3 0324 | DeepSeek | 89.0 |
| 19 | GPT-4.1 | OpenAI | 88.2 |
| 20 | Grok 3 | xAI | 88.0 |
| 21 | Llama 4 Maverick | Meta | 88.0 |
| 22 | Gemini 2.5 Pro | Google | 87.2 |
| 23 | DeepSeek V3 | DeepSeek | 87.1 |
| 24 | Mistral Large | Mistral AI | 86.5 |
| 25 | o1 | OpenAI | 86.5 |
| 26 | R1 0528 | DeepSeek | 85.5 |
| 27 | Gemini 2.5 Flash | Google | 85.5 |
| 28 | GPT-4o | OpenAI | 84.3 |
| 29 | Claude Haiku 4.5 | Anthropic | 84.0 |
| 30 | Llama 3.1 70B Instruct | Meta | 83.6 |
| 31 | R1 | DeepSeek | 83.3 |
| 32 | Gemini 2.0 Flash | Google | 82.0 |
| 33 | GPT-4o-mini | OpenAI | 80.4 |
| 34 | Phi 4 | Microsoft | 80.1 |
| 35 | Command R7B (12-2024) | Cohere | 77.1 |
| 36 | Qwen2.5 7B Instruct | Alibaba | 75.9 |
| 37 | Llama 3.1 8B Instruct | Meta | 72.1 |
| 38 | Llama 3.2 3B Instruct | Meta | 68.5 |
| 39 | Llama 3 8B Instruct | Meta | 24.0 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
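To make the re-normalization concrete, here is a minimal Python sketch of the scoring rule described above. The benchmark names, weights, and scores are illustrative assumptions, not the site's actual configuration.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually has.

    Benchmarks missing from `scores` are skipped, and the remaining
    weights are re-normalized so they still sum to 1.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        raise ValueError("model has no scored benchmarks")
    return sum(scores[name] * w for name, w in available.items()) / total_weight


# Hypothetical weights and scores for one model (gpqa is missing, so the
# ifeval and mmlu weights are re-normalized from 0.8 back up to 1.0).
weights = {"ifeval": 0.6, "mmlu": 0.2, "gpqa": 0.2}
scores = {"ifeval": 93.5, "mmlu": 91.0}

print(round(weighted_score(scores, weights), 1))  # 92.9
```

In this example the missing benchmark's weight is dropped rather than counted as zero, so a model is not penalized simply for lacking a result; that is the behavior the note above describes.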
Based on our benchmark analysis, GPT-5.4 by OpenAI is currently the #1 ranked model for instruction following, with a weighted score of 93.5/100.
Models are ranked using a weighted average of IFEval benchmark scores, normalized to a 0-100 scale.
We currently rank 39 models with relevant benchmark data for instruction-following tasks.