We track 157 open-source models that you can download and run on your own hardware. This guide covers everything from picking the right model to deploying it in production, with no per-token costs.
Full Privacy
No data leaves your servers. Critical for healthcare, legal, and financial data.
No Per-Token Costs
Fixed infrastructure cost regardless of volume. Cheaper at 10M+ tokens/day.
Full Control
Fine-tune for your use case. No rate limits. No content policies. No vendor lock-in.
The top open-source model families and their best use cases.
Llama 4 (Meta)
Best for: General-purpose, coding, chat
Sizes: Scout (109B MoE, 17B active), Maverick (400B MoE, 17B active)
DeepSeek V3.1
Best for: Coding, math, reasoning
Sizes: 685B MoE (37B active)
Qwen 3 (Alibaba)
Best for: Multilingual, coding
Sizes: 8B, 14B, 32B, 235B MoE (22B active)
Mistral (Mistral AI)
Best for: Multilingual, function calling
Sizes: Small 3 (24B), Large (123B)
Gemma 3 (Google)
Best for: Compact, efficient
Sizes: 1B, 4B, 12B, 27B
VRAM is the bottleneck. Rule of thumb: at FP16 (2 bytes per weight), a model needs roughly 2 GB of VRAM per billion parameters; at Q4 quantization, roughly 0.5 GB per billion, plus headroom for the KV cache.
| Setup | GPU | VRAM |
|---|---|---|
| Consumer | RTX 4090 | 24 GB |
| Prosumer | 2x RTX 4090 | 48 GB |
| Cloud Basic | 1x A100 80GB | 80 GB |
| Cloud Pro | 2x A100 80GB | 160 GB |
| Enterprise | 8x H100 80GB | 640 GB |
Use our VRAM Calculator to check if your GPU can run a specific model.
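The rule of thumb above reduces to a one-line estimate. A minimal sketch (the function name is ours, not any library's API; it counts weights only, and real usage adds KV cache and activation overhead on top):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only VRAM estimate; budget extra for KV cache."""
    return params_billion * bits_per_weight / 8

# FP16 is 16 bits/weight -> the "2x parameter count" rule;
# Q4 is 4 bits/weight -> the "0.5x" rule.
print(estimate_vram_gb(70, 16))  # 140.0 GB -> needs multi-GPU
print(estimate_vram_gb(70, 4))   # 35.0 GB -> fits 2x RTX 4090
```

Comparing the output against the table: a 70B model at Q4 fits the Prosumer tier, but at FP16 it needs at least two A100 80GB cards.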
Quantization reduces model size by lowering numerical precision. The sweet spot for most users is Q4_K_M: fits large models on consumer GPUs with minimal quality loss.
| Level | Quality loss | Size reduction | When to use |
|---|---|---|---|
| FP16 (no quantization) | None | 0% | When quality is critical and you have the VRAM |
| Q8_0 (8-bit) | Negligible (<1%) | ~50% | Best quality with significant VRAM savings |
| Q5_K_M (5-bit) | Minimal (1-2%) | ~69% | Good balance for most use cases |
| Q4_K_M (4-bit) | Small (2-5%) | ~75% | Sweet spot: fits large models on consumer GPUs |
| Q2_K (2-bit) | Noticeable (5-15%) | ~87% | When you absolutely must fit on tiny hardware |
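The size-reduction figures above follow directly from bits per weight: reduction vs a 16-bit FP16 baseline is roughly 1 − bits/16. A quick sanity check (K-quant files carry some per-block metadata, so real sizes run slightly higher than this idealized math):

```python
# Idealized size reduction vs FP16 from bits per weight alone.
for name, bits in [("Q8_0", 8), ("Q5_K_M", 5), ("Q4_K_M", 4), ("Q2_K", 2)]:
    print(f"{name}: {(1 - bits / 16) * 100:.1f}% smaller than FP16")
```

The printed values line up with the ~50% / ~69% / ~75% / ~87% figures above.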
The framework you choose determines throughput, ease of use, and production readiness.
vLLM — high-throughput serving engine built around PagedAttention. The industry standard for production deployments.
Ollama — dead-simple local LLM runner. Download and run models with a single command.
Text Generation Inference (TGI) — Hugging Face's production-grade inference server with built-in quantization support.
llama.cpp — lightweight C++ inference engine. Runs models on CPU with optional GPU acceleration.
Browse our complete index of 157 open-source models, check VRAM requirements, and compare performance against paid alternatives.
Yes, with quantization. Using Ollama or llama.cpp, you can run models up to 7B parameters at full precision, or roughly 13B at Q4 quantization, on a laptop with 16GB RAM and a modern GPU. For larger models, you need a desktop GPU like the RTX 4090 (24GB VRAM).
If you already own a GPU, the cost is just electricity. Cloud GPU costs range from $1,500/month (single A100) to $25,000+/month (8x H100 cluster). The breakeven vs API access typically happens at 10-50M tokens per day, depending on the model.
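The breakeven figure can be sanity-checked with simple arithmetic. A sketch, where the $2 per million tokens API price is an illustrative assumption rather than any provider's quoted rate:

```python
def breakeven_tokens_per_day(monthly_gpu_cost: float,
                             api_price_per_million: float) -> float:
    """Daily token volume at which a fixed GPU bill matches
    pay-per-token API spend, assuming a 30-day month."""
    monthly_tokens = monthly_gpu_cost / api_price_per_million * 1_000_000
    return monthly_tokens / 30

# Assumed for illustration: $1,500/mo single A100, $2.00 per 1M API tokens.
print(f"{breakeven_tokens_per_day(1500, 2.00):,.0f} tokens/day")  # 25,000,000
```

At those assumed prices, a single A100 breaks even at 25M tokens/day, inside the 10–50M range quoted above; cheaper API pricing pushes the breakeven higher.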
Llama 4 Maverick and DeepSeek V3.1 are the closest. They match or exceed proprietary models on many benchmarks but still trail on the hardest reasoning tasks. For most production use cases, the gap is small enough that self-hosting makes economic sense.
GGUF (llama.cpp format) supports CPU + GPU mixed inference and is great for consumer hardware. GPTQ is GPU-only quantization optimized for throughput. For production on NVIDIA GPUs, use GPTQ or AWQ with vLLM. For local development or CPU inference, use GGUF with Ollama or llama.cpp.
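That guidance condenses into a small decision helper. A toy sketch only: the category strings and recommendations below are this example's own invention, not any tool's API:

```python
def pick_quant_format(target: str) -> str:
    """Map a deployment target to the format guidance above (toy helper)."""
    if target == "nvidia-production":
        return "GPTQ or AWQ, served with vLLM"
    if target in ("laptop", "cpu"):
        return "GGUF, run with Ollama or llama.cpp"
    return "GGUF for portability; benchmark before committing"

print(pick_quant_format("laptop"))  # GGUF, run with Ollama or llama.cpp
```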