Language Model by OpenAI
GPT-4.1 pro achieves 54.6% on SWE-bench Verified, a 21.4 percentage point jump over GPT-4o\'s 33.2%, marking OpenAI\'s pivot toward agent-optimized models that can actually write and modify code in production environments. The model hits 90.2% on MMLU (versus GPT-4o\'s 85.7%) while maintaining a 1M token context window at $2/$8 per million tokens input/output, making it 60% cheaper than GPT-4o for input processing. Released April 2025 with a planned ChatGPT retirement in February 2026 (API continues), GPT-4.1 pro specifically targets software engineering workflows with its 52.9% Aider Polyglot Diff score that more than doubles GPT-4o\'s 25%.
| Benchmark | GPT-4.1 pro | Comparison |
|---|---|---|
| SWE-bench Verified | 54.6% | 33.2% |
| MMLU | 90.2% | 85.7% |
| IFEval | 87.4% | 81% |
| GPQA Diamond | 66.3% | 54% |
| Video-MME (long, no subtitles) | 72% | 65.3% |
| Scale MultiChallenge | 38.3% | 27.8% |
| Aider Polyglot Diff | 52.9% | 25% |
| MMMU | null | - |
GPT-4.1 family launched in OpenAI API
GPT-4.1 and GPT-4.1 mini released in ChatGPT for Plus/Pro/Team users
Scheduled retirement from ChatGPT (API remains available)
GPT-4.1 pro is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
The 54.6% SWE-bench Verified score means GPT-4.1 pro successfully resolves over half of real GitHub issues that require understanding codebases, writing patches, and handling edge cases - compared to just 33.2% for GPT-4o. This 64% relative improvement shows up most dramatically in the Aider Polyglot Diff benchmark at 52.9% versus 25%, where the model must generate correct unified diffs across multiple programming languages. For context, human developers without prior knowledge of these codebases typically score around 70-80% on similar tasks.
GPT-4.1 pro costs $2 per million input tokens and $8 per million output tokens, compared to GPT-4o's $5/$15 pricing structure. For a typical coding assistant workload (80/20 input/output ratio), processing 10 million tokens costs $24 with GPT-4.1 pro versus $55 with GPT-4o - a 56% reduction. The 1M token context window means you can fit entire codebases (roughly 700k lines of Python) in a single prompt without chunking strategies.
OpenAI is repositioning GPT-4.1 pro as an agent-first model optimized for programmatic use rather than conversational interfaces, with the February 13, 2026 ChatGPT retirement reflecting this strategy. The model's architecture specifically optimizes for instruction following (87.4% IFEval versus 81% for GPT-4o) and structured output generation, making it better suited for API-driven workflows. This follows OpenAI's pattern of maintaining specialized models in API while streamlining ChatGPT to fewer, more general-purpose options.
While OpenAI hasn't disclosed parameter counts, the performance gains suggest targeted architectural changes: enhanced attention mechanisms for the 1M token context (maintaining needle-in-haystack accuracy), specialized coding tokens or embeddings (explaining the 2.1x improvement in diff generation), and likely a modified training objective emphasizing instruction adherence. The 38.3% Scale MultiChallenge score (versus 27.8% for GPT-4o) indicates improved multi-step reasoning, potentially through better chain-of-thought mechanisms or intermediate supervision during training.
GPT-4.1 pro scores 72% on Video-MME long-form without subtitles, a 6.7 percentage point improvement over GPT-4o's 65.3%, suggesting better temporal reasoning and visual understanding across extended sequences. The model accepts image inputs through the same unified API but processes them differently than GPT-4o, with reported improvements in OCR accuracy and diagram understanding. However, it lacks GPT-4o's native audio capabilities and requires separate transcription for speech-based inputs.