We track 157 open-source models that you can download and run on your own hardware. This guide covers everything from picking the right model to deploying it in production, with no per-token costs.
Full Privacy
No data leaves your servers. Critical for healthcare, legal, and financial data.
No Per-Token Costs
Fixed infrastructure cost regardless of volume. Cheaper at 10M+ tokens/day.
Full Control
Fine-tune for your use case. No rate limits. No content policies. No vendor lock-in.
The top open-source model families and their best use cases.
Llama 4 (Meta)
Best for: General-purpose, coding, chat
Sizes: Scout (109B MoE, 17B active), Maverick (400B MoE, 17B active)
DeepSeek V3.1
Best for: Coding, math, reasoning
Sizes: 685B MoE (37B active)
Qwen 3 (Alibaba)
Best for: Multilingual, coding
Sizes: 8B, 14B, 32B, 235B MoE (22B active)
Mistral (Mistral AI)
Best for: Multilingual, function calling
Sizes: Small 3 (24B), Large (123B)
Gemma 3 (Google)
Best for: Compact, efficient
Sizes: 1B, 4B, 12B, 27B
VRAM is the bottleneck. Rule of thumb: at FP16 (2 bytes per weight), a model needs roughly 2 GB of VRAM per billion parameters; at Q4 quantization, roughly 0.5 GB per billion, plus headroom for the KV cache.
| Setup | GPU | VRAM |
|---|---|---|
| Consumer | RTX 4090 | 24 GB |
| Prosumer | 2x RTX 4090 | 48 GB |
| Cloud Basic | 1x A100 80GB | 80 GB |
| Cloud Pro | 2x A100 80GB | 160 GB |
| Enterprise | 8x H100 80GB | 640 GB |
Use our VRAM Calculator to check if your GPU can run a specific model.
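The rule of thumb above reduces to a one-line estimate. A minimal sketch (the function name is ours, not any library's API; it counts weights only, and real usage adds KV cache and activation overhead on top):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only VRAM estimate; budget extra for KV cache."""
    return params_billion * bits_per_weight / 8

# FP16 is 16 bits/weight -> the "2x parameter count" rule;
# Q4 is 4 bits/weight -> the "0.5x" rule.
print(estimate_vram_gb(70, 16))  # 140.0 GB -> needs multi-GPU
print(estimate_vram_gb(70, 4))   # 35.0 GB -> fits 2x RTX 4090
```

Comparing the output against the table: a 70B model at Q4 fits the Prosumer tier, but at FP16 it needs at least two A100 80GB cards.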
Quantization reduces model size by lowering numerical precision. The sweet spot for most users is Q4_K_M: fits large models on consumer GPUs with minimal quality loss.
| Level | Quality loss | Size reduction | When to use |
|---|---|---|---|
| FP16 (no quantization) | None | 0% | When quality is critical and you have the VRAM |
| Q8_0 (8-bit) | Negligible (<1%) | ~50% | Best quality with significant VRAM savings |
| Q5_K_M (5-bit) | Minimal (1-2%) | ~69% | Good balance for most use cases |
| Q4_K_M (4-bit) | Small (2-5%) | ~75% | Sweet spot: fits large models on consumer GPUs |
| Q2_K (2-bit) | Noticeable (5-15%) | ~87% | When you absolutely must fit on tiny hardware |
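The size-reduction figures above follow directly from bits per weight: reduction vs a 16-bit FP16 baseline is roughly 1 − bits/16. A quick sanity check (K-quant files carry some per-block metadata, so real sizes run slightly higher than this idealized math):

```python
# Idealized size reduction vs FP16 from bits per weight alone.
for name, bits in [("Q8_0", 8), ("Q5_K_M", 5), ("Q4_K_M", 4), ("Q2_K", 2)]:
    print(f"{name}: {(1 - bits / 16) * 100:.1f}% smaller than FP16")
```

The printed values line up with the ~50% / ~69% / ~75% / ~87% figures above.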
The framework you choose determines throughput, ease of use, and production readiness.
vLLM — high-throughput serving engine built around PagedAttention. The industry standard for production deployments.
Ollama — dead-simple local LLM runner. Download and run models with a single command.
Text Generation Inference (TGI) — Hugging Face's production-grade inference server with built-in quantization support.
llama.cpp — lightweight C++ inference engine. Runs models on CPU with optional GPU acceleration.
Browse our complete index of 157 open-source models, check VRAM requirements, and compare performance against paid alternatives.
Yes, with quantization. Using Ollama or llama.cpp, you can run models up to 7B parameters at full precision, or roughly 13B at Q4 quantization, on a laptop with 16GB RAM and a modern GPU. For larger models, you need a desktop GPU like the RTX 4090 (24GB VRAM).
If you already own a GPU, the cost is just electricity. Cloud GPU costs range from $1,500/month (single A100) to $25,000+/month (8x H100 cluster). The breakeven vs API access typically happens at 10-50M tokens per day, depending on the model.
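The breakeven figure can be sanity-checked with simple arithmetic. A sketch, where the $2 per million tokens API price is an illustrative assumption rather than any provider's quoted rate:

```python
def breakeven_tokens_per_day(monthly_gpu_cost: float,
                             api_price_per_million: float) -> float:
    """Daily token volume at which a fixed GPU bill matches
    pay-per-token API spend, assuming a 30-day month."""
    monthly_tokens = monthly_gpu_cost / api_price_per_million * 1_000_000
    return monthly_tokens / 30

# Assumed for illustration: $1,500/mo single A100, $2.00 per 1M API tokens.
print(f"{breakeven_tokens_per_day(1500, 2.00):,.0f} tokens/day")  # 25,000,000
```

At those assumed prices, a single A100 breaks even at 25M tokens/day, inside the 10–50M range quoted above; cheaper API pricing pushes the breakeven higher.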
Llama 4 Maverick and DeepSeek V3.1 are the closest. They match or exceed proprietary models on many benchmarks but still trail on the hardest reasoning tasks. For most production use cases, the gap is small enough that self-hosting makes economic sense.
GGUF (llama.cpp format) supports CPU + GPU mixed inference and is great for consumer hardware. GPTQ is GPU-only quantization optimized for throughput. For production on NVIDIA GPUs, use GPTQ or AWQ with vLLM. For local development or CPU inference, use GGUF with Ollama or llama.cpp.
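That guidance condenses into a small decision helper. A toy sketch only: the category strings and recommendations below are this example's own invention, not any tool's API:

```python
def pick_quant_format(target: str) -> str:
    """Map a deployment target to the format guidance above (toy helper)."""
    if target == "nvidia-production":
        return "GPTQ or AWQ, served with vLLM"
    if target in ("laptop", "cpu"):
        return "GGUF, run with Ollama or llama.cpp"
    return "GGUF for portability; benchmark before committing"

print(pick_quant_format("laptop"))  # GGUF, run with Ollama or llama.cpp
```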