Gemma-4-31B-IT-NVFP4: an NVIDIA NVFP4 Quantization of Google DeepMind's Gemma 4
NVIDIA's Gemma-4-31B-IT-NVFP4 delivers 89.2% on AIME 2026 mathematics (68.4 percentage points above Gemma 3 27B's 20.8%) while maintaining 84.05% GPQA Diamond accuracy after 4-bit quantization versus 84.3% at full precision. The model achieves a 2150 Codeforces ELO rating (up from Gemma 3's 110, a nearly 20-fold jump) through a hybrid attention architecture that interleaves local sliding-window and global attention layers at a 5:1 ratio, with Proportional RoPE for 256K-token context handling. At $0.14/$0.40 per million tokens (input/output), it operates at a small fraction of the cost of comparable frontier models while supporting multimodal inputs, including videos up to 60 seconds at 1 fps and variable-resolution images costing 70 to 1120 tokens each.
| Benchmark | Gemma-4-31B-IT-NVFP4 | Comparison |
|---|---|---|
| GPQA Diamond | 84.3% | 85.8% |
| AIME 2026 (Math) | 89.2% | 20.8% |
| LiveCodeBench v6 | 80% | 29.1% |
| MMLU Pro | 85.2% | 82.6% |
| Codeforces ELO | 2150 | 110 |
| τ2-bench Agentic (Retail) | 86.4% | 6.6% |
| BigBench Extra Hard | 74.4% | 19.3% |
| LMArena/Chatbot Arena ELO | 1452 | 1403 |
| MMMU Pro (Vision) | 76.9% | 49.7% |
| MMMLU (Multilingual) | 88.4% | - |
| Artificial Analysis Intelligence Index v4.0 | 39 | 15 |
Gemma 4 family released by Google DeepMind under Apache 2.0
NVFP4 quantized version released by NVIDIA
Gemma-4-31B-IT-NVFP4 has been released but is not yet served by our tracked API providers. Once it appears on one, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
NVFP4 4-bit quantization reduces GPQA Diamond performance by just 0.25 percentage points (84.3% to 84.05%) while enabling deployment on edge devices with a 75% memory reduction relative to 16-bit weights. The quantized model remains competitive with larger models, scoring within two points of Qwen 3.5 27B on GPQA Diamond (84.05% vs 85.8%) and beating the median of similar-size open models by 24 points on the Artificial Analysis Intelligence Index v4.0 (39 vs 15). NVIDIA's implementation preserves critical weights through mixed-precision strategies, keeping attention heads at higher precision while aggressively quantizing feedforward layers.
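The mechanics of blockwise low-bit quantization can be sketched in a few lines. The snippet below is a minimal illustration, not NVIDIA's actual NVFP4 kernel: it assumes the standard FP4 (E2M1) value grid, a 16-element block size, and a plain floating-point per-block scale, and simply rounds each weight to the nearest representable value.

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes; a sketch in the spirit of NVFP4,
# not NVIDIA's implementation (which packs scales differently).
FP4_GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                     0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(weights: np.ndarray, block_size: int = 16):
    """Quantize a 1-D weight vector block-by-block to the FP4 grid."""
    w = weights.reshape(-1, block_size)
    # One scale per block so the largest weight maps onto the grid max.
    scales = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID.max()
    scales[scales == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    # Round each scaled weight to the nearest grid value (index 0..14).
    idx = np.abs(w[..., None] / scales[..., None] - FP4_GRID).argmin(-1)
    return idx.astype(np.uint8), scales

def dequantize_fp4(idx: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from indices and block scales."""
    return (FP4_GRID[idx] * scales).reshape(-1)

np.random.seed(0)
w = np.random.randn(64).astype(np.float32)
q, s = quantize_fp4(w)
w_hat = dequantize_fp4(q, s)
err = np.abs(w - w_hat).max()  # bounded by (max |w| per block) / 6
```

Because the widest gap in the E2M1 grid (between 4 and 6) is 2, the worst-case rounding error in any block is one scale unit, i.e. at most a sixth of that block's largest weight.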
The model achieves its 2150 Codeforces ELO through three key innovations: hybrid attention with a 5:1 ratio of local sliding-window to global attention layers, which cuts computational cost while preserving long-range dependencies; Per-Layer Embeddings (PLE), which adapt representations dynamically during training; and Proportional RoPE (p-RoPE), which maintains positional-encoding accuracy across the full 256K context window. This architecture enables 80% accuracy on LiveCodeBench v6 (50.9 points above Gemma 3 27B's 29.1%) as well as native function calling with structured JSON output for complex programming tasks.
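To see why the 5:1 interleave matters at long context, here is a rough sketch of the KV-cache budget it implies. The layer count (48), window size (4096 tokens), and the exact placement of global layers are illustrative assumptions; only the 5:1 ratio and the 256K context come from the text.

```python
WINDOW = 4096      # assumed sliding-window size for local layers
NUM_LAYERS = 48    # assumed depth for a ~31B model (illustrative)

def layer_types(num_layers: int, ratio: int = 5) -> list[str]:
    """5:1 interleave: every (ratio+1)-th layer is global, rest local."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "local"
            for i in range(num_layers)]

def kv_tokens_cached(context_len: int, types: list[str]) -> int:
    """Total KV-cache entries: local layers cap at the window size."""
    return sum(min(context_len, WINDOW) if t == "local" else context_len
               for t in types)

types = layer_types(NUM_LAYERS)
hybrid = kv_tokens_cached(256_000, types)
full_global = kv_tokens_cached(256_000, ["global"] * NUM_LAYERS)
savings = 1 - hybrid / full_global  # > 80% fewer cached KV entries
```

Under these assumptions, 40 of 48 layers cache only the last 4096 tokens, so the KV cache at 256K context shrinks by more than 80% compared with all-global attention.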
At $0.14 per million input tokens and $0.40 per million output tokens, Gemma-4-31B-IT-NVFP4 costs roughly 97% less than GPT-4o ($5/$15) and over 99% less than Claude Opus ($15/$75) while delivering 85.2% MMLU Pro accuracy. For a typical workload processing 100M input tokens and generating 25M output tokens monthly, costs come to $24 versus $875 for GPT-4o or $3,375 for Claude Opus. NVFP4 quantization further reduces deployment costs by enabling single-GPU inference on hardware with 8-16GB of VRAM, versus the 64-128GB typically required for 30B+ parameter models at full precision.
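The workload arithmetic is easy to verify with a small helper using the per-million-token prices quoted above:

```python
# Prices in USD per million tokens (input, output), as quoted in the text.
PRICES = {
    "Gemma-4-31B-IT-NVFP4": (0.14, 0.40),
    "GPT-4o": (5.00, 15.00),
    "Claude Opus": (15.00, 75.00),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Cost for input_m million input and output_m million output tokens."""
    cin, cout = PRICES[model]
    return input_m * cin + output_m * cout

gemma = monthly_cost("Gemma-4-31B-IT-NVFP4", 100, 25)  # ≈ $24/month
gpt4o = monthly_cost("GPT-4o", 100, 25)                # $875/month
opus = monthly_cost("Claude Opus", 100, 25)            # $3,375/month
```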
Video processing is limited to 60 seconds at 1 fps (60 frames maximum), making the model unsuitable for real-time video analysis or longer content; each frame costs 70-1120 tokens depending on resolution settings. Image understanding scores 76.9% on MMMU Pro Vision (27.2 points above Gemma 3) but falls short of specialized vision models such as GPT-4V (88.4%) or Claude 3.5 Vision (91.2%). The model processes images with configurable token budgets from 70 (low-res thumbnails) to 1120 (high-detail analysis), requiring careful trade-offs between accuracy and inference cost for vision-heavy workloads.
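The frame and token limits translate into a simple budget calculation. The function below is a back-of-envelope sketch using only the limits stated above (60 frames maximum, 70 or 1120 tokens per frame); the "low"/"high" setting names are illustrative, not the model's actual API parameters.

```python
# Per-frame token budgets from the text; the setting names are assumed.
TOKENS_PER_FRAME = {"low": 70, "high": 1120}
MAX_FRAMES = 60  # 60 seconds at 1 fps

def video_token_budget(seconds: int, detail: str = "low") -> int:
    """Input tokens consumed by a clip, capped at the 60-frame limit."""
    frames = min(seconds, MAX_FRAMES)
    return frames * TOKENS_PER_FRAME[detail]

low = video_token_budget(60, "low")    # 60 frames * 70 tokens = 4200
high = video_token_budget(60, "high")  # 60 frames * 1120 tokens = 67200
```

Even at the high-detail setting, a full 60-second clip stays near 67K input tokens, well inside the 256K context window, but a 16x cost difference versus the low setting makes the resolution choice the dominant knob for vision-heavy workloads.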
The <|think|> token enables chain-of-thought reasoning that improves complex problem-solving by 15-30% over non-thinking mode, contributing to scores like 74.4% on BigBench Extra Hard (vs 19.3% for Gemma 3) and 89.2% on AIME 2026 mathematics (vs 20.8%). When activated, the model generates intermediate reasoning steps before the final answer, consuming additional output tokens (typically 2-5x more) but providing transparency into its decision-making. Thinking can be toggled per request, letting developers choose between faster direct answers at $0.40 per million output tokens and more accurate reasoning at an effective $0.80-2.00 per million answer tokens once thinking overhead is included.
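The effective-price range follows directly from the 2-5x thinking-token multiplier, since thinking tokens are billed as ordinary output. A quick sketch (the multiplier range comes from the text; billing thinking tokens at the output rate is an assumption consistent with the figures given):

```python
OUTPUT_PRICE = 0.40  # USD per million output tokens, from the text

def effective_output_price(thinking_multiplier: float) -> float:
    """Price per million *answer* tokens once thinking overhead is billed.

    A multiplier of 1.0 means thinking is off; 2-5x is the typical
    overhead range quoted for <|think|> mode.
    """
    return OUTPUT_PRICE * thinking_multiplier

lo = effective_output_price(2.0)  # ≈ $0.80 per million answer tokens
hi = effective_output_price(5.0)  # ≈ $2.00 per million answer tokens
```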