Embedding by Alibaba
Qwen3-Embedding-4B achieves 96.5% of its 8B sibling's performance at half the model size, scoring 79.36 on multilingual retrieval (MMTEB-R) and 80.86 on code embedding tasks at just $0.01 per million tokens. The model implements Matryoshka Representation Learning to scale embedding dimensions dynamically from 32 to 2560 without retraining, while its instruction-aware architecture delivers 1-5% gains over baseline embeddings by conditioning on task-specific context. Built on a 36-layer dense transformer with a 32K context window and trained through a multi-stage pipeline that includes synthetic data (training concluded in June 2025), it supports more than 100 natural and programming languages.
| Benchmark | Qwen3-Embedding-4B | Qwen3-Embedding-8B |
|---|---|---|
| MTEB Multilingual (MMTEB) | 69.45 | 70.58 |
| MTEB-Code | 80.86 | 81.08 |
| CMTEB-R (Chinese Retrieval) | 60.86 | 61.69 |
| MMTEB-R (Multilingual Retrieval) | 79.36 | 80.89 |
| MTEB-R (English Retrieval) | 72.33 | 74.00 |
| MLDR (Multilingual Long Document Retrieval) | 57.15 | 57.65 |
- Qwen3-Embedding series release announcement
- Technical paper published (arXiv:2506.05176)
- Open-sourced under Apache 2.0 license
Qwen3-Embedding-4B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-Embedding-4B scores 69.45 on MTEB Multilingual benchmarks, trailing its 8B variant by just 1.13 percentage points while using half the parameters. On MMTEB-R (Multilingual Retrieval), it achieves 79.36 versus the 8B's 80.89, a 1.53 point difference that translates to 98.1% relative performance. The model maintains this efficiency across Chinese tasks (60.86 vs 61.69 on CMTEB-R) and code retrieval (80.86 vs 81.08 on MTEB-Code), making it ideal for resource-constrained multilingual deployments.
Qwen3-Embedding-4B costs $0.01 per million tokens, making it 13x cheaper than OpenAI's text-embedding-3-large at $0.13 per million tokens. For a typical RAG system processing 100 million tokens daily, this translates to $1 versus $13 per day, or $365 versus $4,745 annually. The 4B model also runs locally without API latency, eliminating network overhead that typically adds 50-200ms per embedding request.
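To make the comparison concrete, here is a back-of-the-envelope calculator using the prices and daily volume quoted above (static figures, not live pricing):

```python
# Back-of-the-envelope embedding cost comparison. Prices are the figures
# quoted above, not fetched from any live pricing API.
PRICE_PER_MTOK = {
    "Qwen3-Embedding-4B": 0.01,        # USD per million tokens
    "text-embedding-3-large": 0.13,    # USD per million tokens
}

def daily_cost(model: str, tokens_per_day: int) -> float:
    """Cost in USD for one day of embedding traffic."""
    return tokens_per_day / 1_000_000 * PRICE_PER_MTOK[model]

for model in PRICE_PER_MTOK:
    per_day = daily_cost(model, tokens_per_day=100_000_000)
    print(f"{model}: ${per_day:.2f}/day, ${per_day * 365:,.0f}/year")
```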
MRL allows Qwen3-Embedding-4B to generate embeddings at any dimension between 32 and 2560 without retraining, by organizing information hierarchically within the vector. You can start with 2560-dimensional embeddings for maximum accuracy (72.33 on MTEB-R), then compress to 256 dimensions for 10x faster similarity searches with minimal performance loss. This flexibility enables dynamic optimization where you use full dimensions for indexing but smaller dimensions for real-time queries, achieving sub-millisecond retrieval on million-scale datasets.
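A minimal sketch of MRL-style truncation, assuming the Hugging Face checkpoint `Qwen/Qwen3-Embedding-4B` loaded through sentence-transformers; the key point is that you slice the leading dimensions and re-normalize rather than retrain:

```python
# Sketch: Matryoshka-style dimension truncation. Assumes the Hugging Face
# checkpoint "Qwen/Qwen3-Embedding-4B" and the sentence-transformers library;
# adjust the model ID and loading code to your setup.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")

docs = ["MRL packs the most important information into the leading dimensions."]
full = model.encode(docs)                     # shape: (1, 2560) full-size vectors

def truncate(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize for cosine similarity."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

compact = truncate(full, 256)                 # ~10x smaller vectors for fast ANN search
print(full.shape, compact.shape)              # (1, 2560) (1, 256)
```

A common pattern is to index documents at full dimension once, store a truncated copy for the first-pass ANN search, and re-rank the top candidates with the full vectors.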
While Qwen3-Embedding-4B supports 32K tokens (approximately 24,000 words), its MLDR benchmark score of 57.15 reveals degraded performance on long documents compared to shorter text retrieval (72.33 on MTEB-R). The model uses EOS token pooling which can lose nuanced information in lengthy documents. For documents exceeding 20K tokens, consider chunking with 2K token overlaps or using the model's instruction-aware feature to focus on specific sections, which can improve retrieval accuracy by 3-5%.
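One way to implement that chunking strategy is sketched below; the tokenizer is assumed to come from the same checkpoint, and the 8K-token window with a 2K-token overlap is illustrative rather than a tuned recommendation:

```python
# Sketch: sliding-window chunking for long documents. The tokenizer is assumed
# to load from the "Qwen/Qwen3-Embedding-4B" checkpoint; the 8K window and 2K
# overlap are illustrative values, not tuned recommendations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-4B")

def chunk_document(text: str, window: int = 8_000, overlap: int = 2_000) -> list[str]:
    """Split `text` into chunks of at most `window` tokens, overlapping by `overlap`."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = window - overlap
    chunks = []
    for start in range(0, len(ids), step):
        piece = ids[start : start + window]
        chunks.append(tokenizer.decode(piece))
        if start + window >= len(ids):
            break
    return chunks
```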
Instruction-aware embedding lets you prepend task-specific prompts to guide the model's attention, delivering 1-5% performance improvements over generic embeddings. For code search, prepending "Find functions that:" before queries improved retrieval precision from 80.86 to approximately 84.9 in internal benchmarks. The 36-layer architecture with causal attention specifically optimizes for these instructional contexts, making it particularly effective for domain-specific retrieval where traditional embeddings fail to capture intent.
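A sketch of instruction-aware querying is shown below; the `Instruct: ... \nQuery: ...` template mirrors the format documented for the Qwen embedding series (verify the exact wording against the model card), and the code-search task description is illustrative:

```python
# Sketch: instruction-aware retrieval. The "Instruct: ...\nQuery: ..." template
# follows the format documented for the Qwen embedding series; check the model
# card for the exact prompt. The task description below is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")

task = "Given a natural-language description, retrieve matching code functions"
queries = [f"Instruct: {task}\nQuery: parse an ISO-8601 timestamp"]
documents = [
    "def parse_iso8601(s): ...",
    "def fibonacci(n): ...",
]

# Following the usual Qwen3-Embedding usage pattern, only queries carry the
# instruction; documents are embedded as-is.
query_emb = model.encode(queries)
doc_emb = model.encode(documents)
print(model.similarity(query_emb, doc_emb))   # higher score for the matching function
```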