Detect when AI models may be declining. This tracker monitors rank movements (position changes on the leaderboard) and flags models that have dropped significantly over 24 hours or 7 days. A higher degradation risk score (more "degradation points") means more warning signs.
Models at Risk: 27
Declining (7d): 2
Fragile: 1
Sustained Decline: 25
27 models showing signs of degradation, ranked by risk score. Higher risk scores indicate more concerning performance trends.
272 models with no decline and a stable ranking state. These models are performing consistently.
| # | Model | Provider | Score | 24h | 7d | State |
|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Anthropic | 96.6 | 0 | 0 | stable |
| 2 | Claude Opus 4.7 (Fast) | Anthropic | 94.7 | 0 | 0 | stable |
| 3 | Claude Opus 4.7 | Anthropic | 94.7 | 0 | 0 | stable |
| 4 | Claude Opus 4.8 (Fast) | Anthropic | 94.2 | 0 | 0 | stable |
| 5 | Claude Opus 4.8 | Anthropic | 94.2 | 0 | 0 | stable |
| 6 | GPT-5.5 | OpenAI | 92.2 | 0 | 0 | stable |
| 7 | Gemini 3.1 Pro Preview Custom Tools | Google | 91.7 | 0 | 0 | stable |
| 8 | Gemini 3.1 Pro Preview | Google | 91.7 | 0 | 0 | stable |
| 9 | GPT-5.4 Pro | OpenAI | 91.5 | 0 | 0 | stable |
| 10 | GPT-5.4 | OpenAI | 91.5 | 0 | 0 | stable |
| 11 | GPT-5.5 Pro | OpenAI | 90.3 | 0 | 0 | stable |
| 12 | GPT-5.2-Codex | OpenAI | 90.1 | 0 | 0 | stable |
| 13 | GPT-5.2 Pro | OpenAI | 90.1 | 0 | 0 | stable |
| 14 | GPT-5.2 | OpenAI | 90.1 | 0 | 0 | stable |
| 15 | Claude Opus 4.6 (Fast) | Anthropic | 90.0 | 0 | 0 | stable |
| 16 | Claude Opus 4.6 | Anthropic | 90.0 | 0 | 0 | stable |
| 17 | Grok 4.20 | xAI | 88.3 | 0 | 0 | stable |
| 18 | GPT-5.3-Codex | OpenAI | 88.2 | 0 | 0 | stable |
| 19 | GPT-5 Pro | OpenAI | 88.2 | 0 | 0 | stable |
| 20 | GPT-5 Codex | OpenAI | 88.2 | 0 | 0 | stable |
Showing top 20 of 272 stable models.
Our degradation detection system uses multiple signals to identify models that may be declining in quality or reliability.
Declining (7d): Models that have fallen more than two positions over 7 days (a 7-day rank change below -2). A drop of this size over a week suggests the model may be losing ground to competitors or experiencing performance issues.
Fragile: Models classified as "fragile" by our scoring system. These models have inconsistent performance metrics or borderline scores that could shift significantly with small changes in evaluation data.
Sustained Decline: Models declining on both the 24-hour and 7-day timeframes. Losing rank on both the short and the medium-term window indicates a persistent downward trend rather than a temporary fluctuation.
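The three signals above can be sketched as a small classifier. This is a minimal illustration, not the tracker's actual code: the `ModelSnapshot` type, its field names, the flag names, and the sign convention (a negative rank change means positions lost) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelSnapshot:
    """Hypothetical record for one leaderboard entry."""
    name: str
    change_24h: int  # assumed convention: negative = positions lost in 24 hours
    change_7d: int   # assumed convention: negative = positions lost in 7 days
    state: str       # e.g. "stable" or "fragile"


def degradation_flags(model: ModelSnapshot) -> list[str]:
    """Return the warning signals that apply to a model (flag names are illustrative)."""
    flags = []
    # Declining (7d): 7-day rank change worse than -2 positions.
    if model.change_7d < -2:
        flags.append("declining_7d")
    # Fragile: taken directly from the scoring system's state label.
    if model.state == "fragile":
        flags.append("fragile")
    # Sustained decline: losing rank on both the 24-hour and 7-day windows.
    if model.change_24h < 0 and model.change_7d < 0:
        flags.append("sustained_decline")
    return flags


if __name__ == "__main__":
    example = ModelSnapshot("example-model", change_24h=-1, change_7d=-3, state="stable")
    print(degradation_flags(example))
```

Under these assumptions, a model with rank changes of 0 on both windows and a stable state gets no flags, which matches the stable list above.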
The degradation risk score combines multiple signals: 7-day rank decline weighted 2x, 24-hour rank decline weighted 1x, plus 5 bonus points for fragile state. Higher scores indicate greater risk of meaningful performance degradation.
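As a rough sketch of how the weights described here might combine, the function below computes a score from the two rank changes and the state. The function name, the sign convention (negative change means positions lost), and the choice to count rank gains as zero decline are assumptions for illustration; the tracker's exact formula is not published on this page.

```python
def degradation_risk_score(change_24h: int, change_7d: int, state: str) -> int:
    """Illustrative risk score: 2x the 7-day decline, 1x the 24-hour decline,
    plus 5 points for a fragile state."""
    # Assumption: only lost positions count; a rank gain contributes nothing.
    decline_7d = max(0, -change_7d)
    decline_24h = max(0, -change_24h)
    fragile_bonus = 5 if state == "fragile" else 0
    return 2 * decline_7d + 1 * decline_24h + fragile_bonus


if __name__ == "__main__":
    # A fragile model down 3 positions over 7 days and 1 position over 24 hours:
    # 2 * 3 + 1 * 1 + 5 = 12
    print(degradation_risk_score(change_24h=-1, change_7d=-3, state="fragile"))
```

Weighting the 7-day window twice as heavily means a week-long slide outweighs a one-day dip of the same size.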
The tracker uses a multi-signal approach: it monitors 7-day rank decline (weighted 2x), 24-hour rank drops (weighted 1x), and fragile state classification (+5 points). Models are scored on a degradation risk scale where higher values indicate more warning signs of performance decline.
A fragile state indicates that a model has inconsistent performance metrics or borderline scores that could shift significantly with small changes in evaluation data. Fragile models are at higher risk of further ranking drops and warrant closer monitoring.
Models can recover. An apparent decline may be temporary, caused by API issues, benchmark fluctuations, or scoring recalibrations. Models that show sustained decline over multiple weeks are more concerning than those with short-term dips. The tracker monitors both 24-hour and 7-day windows to help distinguish temporary noise from real trends.