Gemma 4 is Google DeepMind's fourth-generation open-weight model family. The 31B variant launched on April 2, 2026 with Apache 2.0 licensing, a 256K-token context window, and native multimodal capabilities.

Is Gemma 4 open source?

Gemma 4 is released as open weights under the Apache 2.0 license. You can download, fine-tune, distill, and deploy it commercially without usage fees, subject to the license's attribution and notice requirements.

How does Gemma 4 compare to Llama 4?

Both are open-weight models targeting developers. Gemma 4 31B is a dense transformer focused on efficiency and deployability, while Llama 4 Maverick uses a mixture-of-experts (MoE) architecture to reach a higher total parameter count. Check our leaderboard for live score comparisons.

What sizes does Gemma 4 come in?

Gemma 4 31B is the first variant released. Google DeepMind has indicated that additional sizes will follow, continuing the multi-size strategy of previous Gemma generations.

Gemma 4 Review: Google DeepMind's Open-Weight Challenger

Google DeepMind released Gemma 4 31B on April 2, 2026. It is the fourth generation of their open-weight model family and the most capable Gemma to date. At 30.7 billion parameters with Apache 2.0 licensing, it sits at an interesting intersection: powerful enough to compete with mid-tier proprietary models, small enough to self-host on a single GPU, and open enough to fine-tune for any use case. It currently ranks #45 out of 317 coding models with a composite score of 80/100.

Rank: #45 of 317 models
Score: 80/100
Parameters: 30.7B (dense)
Context: 256K tokens
Max Output: 131K tokens
Input Price: $0.14 /M tokens
Output Price: $0.40 /M tokens
Languages: 140+ supported

Architecture: Dense 30.7B with Configurable Thinking

Gemma 4 31B is a dense transformer - all 30.7 billion parameters are active during every forward pass. This is a deliberate choice that contrasts with the MoE trend (see our MoE report). Dense models trade raw parameter efficiency for predictability: consistent latency, simpler deployment, and more uniform behavior across tasks.
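As a rough illustration of what "all parameters active" means for compute, the sketch below applies the common rule of thumb of about 2 FLOPs per active parameter per token (it ignores attention cost, so treat it as an order-of-magnitude estimate). The Gemma 4 31B figure comes from the text above; the MoE entry and its sizes, along with the names ModelShape and forward_flops_per_token, are hypothetical and chosen only for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    name: str
    total_params: float   # parameters that must be held in memory
    active_params: float  # parameters used to process each token


def forward_flops_per_token(model: ModelShape) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * model.active_params


# Gemma 4 31B figures come from the article; the MoE entry is hypothetical.
gemma_dense = ModelShape("Gemma 4 31B (dense)", 30.7e9, 30.7e9)
example_moe = ModelShape("hypothetical MoE", 120e9, 12e9)

for m in (gemma_dense, example_moe):
    share = m.active_params / m.total_params
    gflops = forward_flops_per_token(m) / 1e9
    print(f"{m.name}: {share:.0%} of weights active, ~{gflops:.1f} GFLOPs per token")
```

In a dense model the memory footprint and the per-token compute both scale with the same number, which is one reason its latency is easier to predict than that of a model that routes each token to a subset of its weights.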

The standout architectural feature is a configurable thinking (reasoning) mode. It works like chain-of-thought prompting but is built into the model, letting Gemma 4 spend more compute on harder problems. When reasoning mode is enabled, the model produces intermediate thinking steps before its final answer, which improves accuracy on complex coding, math, and logic tasks at the cost of higher latency and token usage.

This is similar to what models like o3 and DeepSeek R1 do, but Gemma 4 makes it configurable rather than always-on - you choose when to pay the reasoning tax.
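To put a number on the "reasoning tax", here is a minimal cost sketch using the listed prices of $0.14 per million input tokens and $0.40 per million output tokens. It assumes thinking tokens are billed at the output rate, which the article does not confirm; the function name request_cost and the token counts are hypothetical.

```python
INPUT_PRICE_PER_M = 0.14   # USD per million input tokens (from the article)
OUTPUT_PRICE_PER_M = 0.40  # USD per million output tokens (from the article)


def request_cost(input_tokens: int, answer_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption: thinking tokens are billed at the output rate.
    """
    output_tokens = answer_tokens + thinking_tokens
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: 2,000 prompt tokens and a 500-token answer,
# plus 3,000 thinking tokens when reasoning mode is on.
plain = request_cost(2_000, 500)
reasoning = request_cost(2_000, 500, thinking_tokens=3_000)

print(f"without thinking: ${plain:.6f}")
print(f"with thinking:    ${reasoning:.6f}")
```

Under these made-up token counts the reasoning request costs about 3.5 times as much as the plain one, so it makes sense to enable thinking only for prompts where the extra accuracy matters.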

Why 31B? The Self-Hosting Sweet Spot

The 30.7B parameter count is not arbitrary. Here is the hardware reality:

FP16 (16-bit, unquantized): ~62 GB VRAM - Single A100 80GB; 2x RTX 4090 (48 GB combined) needs partial CPU offloading - Maximum quality, highest memory

INT8 quantized: ~31 GB VRAM - Single A100 40GB or 2x RTX 4090 (a single 24 GB card is too small) - Minimal quality loss, practical for most

INT4/GPTQ quantized: ~16 GB VRAM - RTX 4090 or M2 Ultra Mac; a 16 GB RTX 4080 leaves no headroom for the KV cache - Some quality degradation, consumer hardware viable

GGUF Q4_K_M (llama.cpp): ~18 GB VRAM - RTX 3090 / M1 Pro Mac with 32 GB - CPU offloading possible, slowest but most accessible
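The VRAM figures above follow from simple arithmetic: parameter count times bytes per weight. The sketch below reproduces the weights-only part of that estimate for the uniform-precision formats; the helper name weight_memory_gb is hypothetical, and the calculation deliberately leaves out the KV cache, activations, and runtime overhead.

```python
PARAMS = 30.7e9  # parameter count from the article

# Bits stored per weight for each uniform-precision format.
BITS_PER_WEIGHT = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for fmt, bits in BITS_PER_WEIGHT.items():
    print(f"{fmt}: ~{weight_memory_gb(PARAMS, bits):.1f} GB for weights")
```

Real usage sits above these numbers once the KV cache and framework overhead are included, so a card whose memory only just matches the weight size is unlikely to be enough in practice. GGUF Q4_K_M mixes quantization levels and averages somewhat more than 4 bits per weight, which is consistent with its ~18 GB figure being higher than the plain INT4 estimate.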

This makes Gemma 4 31B one of the largest dense open-weight models that can realistically run on consumer hardware. Larger models (70B+) typically require multi-GPU setups or expensive cloud instances, while smaller models (7-13B) give up noticeable quality. The 31B size aims to maximize the quality-to-deployability ratio. For self-hosting guidance, see our self-hosted AI models guide and best local LLM for coding rankings.