AI models ranked by instruction-following accuracy using the IFEval benchmark.
- Top model: GPT-5.4 (score 93.5)
- Average score: 85.2 across all ranked models
- Models ranked: 39 with benchmark data
Top Instruction-Following Models by Weighted Score
All 39 ranked models, ordered by weighted score
| # | Model | Developer | Score |
|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 93.5 |
| 2 | GPT-5.2 | OpenAI | 93.0 |
| 3 | Claude Opus 4.6 | Anthropic | 92.8 |
| 4 | GPT-5.1 | OpenAI | 92.5 |
| 5 | Claude 3.7 Sonnet | Anthropic | 92.3 |
| 6 | Llama 3.3 70B Instruct | Meta | 92.1 |
| 7 | GPT-5 | OpenAI | 92.0 |
| 8 | o3 | OpenAI | 92.0 |
| 9 | Claude Opus 4.5 | Anthropic | 91.5 |
| 10 | Gemini 3 Flash Preview | Google | 91.0 |
| 11 | Claude Sonnet 4.6 | Anthropic | 91.0 |
| 12 | Grok 4 | xAI | 91.0 |
| 13 | Claude Sonnet 4 | Anthropic | 90.8 |
| 14 | Claude Opus 4 | Anthropic | 90.5 |
| 15 | Claude Sonnet 4.5 | Anthropic | 90.2 |
| 16 | o3 Mini | OpenAI | 90.2 |
| 17 | o4 Mini | OpenAI | 90.0 |
| 18 | DeepSeek V3 0324 | DeepSeek | 89.0 |
| 19 | GPT-4.1 | OpenAI | 88.2 |
| 20 | Grok 3 | xAI | 88.0 |
| 21 | Llama 4 Maverick | Meta | 88.0 |
| 22 | Gemini 2.5 Pro | Google | 87.2 |
| 23 | DeepSeek V3 | DeepSeek | 87.1 |
| 24 | Mistral Large | Mistral AI | 86.5 |
| 25 | o1 | OpenAI | 86.5 |
| 26 | R1 0528 | DeepSeek | 85.5 |
| 27 | Gemini 2.5 Flash | Google | 85.5 |
| 28 | GPT-4o | OpenAI | 84.3 |
| 29 | Claude Haiku 4.5 | Anthropic | 84.0 |
| 30 | Llama 3.1 70B Instruct | Meta | 83.6 |
| 31 | R1 | DeepSeek | 83.3 |
| 32 | Gemini 2.0 Flash | Google | 82.0 |
| 33 | GPT-4o-mini | OpenAI | 80.4 |
| 34 | Phi 4 | Microsoft | 80.1 |
| 35 | Command R7B (12-2024) | Cohere | 77.1 |
| 36 | Qwen2.5 7B Instruct | Alibaba | 75.9 |
| 37 | Llama 3.1 8B Instruct | Meta | 72.1 |
| 38 | Llama 3.2 3B Instruct | Meta | 68.5 |
| 39 | Llama 3 8B Instruct | Meta | 24.0 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
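To make the re-normalization concrete, here is a minimal Python sketch of the scoring rule described above. The benchmark names, weights, and scores are illustrative assumptions, not the site's actual configuration.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks a model actually has.

    Benchmarks missing from `scores` are skipped, and the remaining
    weights are re-normalized so they still sum to 1.
    """
    available = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(available.values())
    if total_weight == 0:
        raise ValueError("model has no scored benchmarks")
    return sum(scores[name] * w for name, w in available.items()) / total_weight


# Hypothetical weights and scores for one model (gpqa is missing, so the
# ifeval and mmlu weights are re-normalized from 0.8 back up to 1.0).
weights = {"ifeval": 0.6, "mmlu": 0.2, "gpqa": 0.2}
scores = {"ifeval": 93.5, "mmlu": 91.0}

print(round(weighted_score(scores, weights), 1))  # 92.9
```

In this example the missing benchmark's weight is dropped rather than counted as zero, so a model is not penalized simply for lacking a result; that is the behavior the note above describes.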
Based on our benchmark analysis, GPT-5.4 by OpenAI is currently the #1 ranked model for instruction following, with a weighted score of 93.5/100.
Models are ranked using a weighted average of IFEval benchmark scores, normalized to a 0-100 scale.
We currently rank 39 models with relevant benchmark data for instruction-following tasks.