Analyze 325 AI models by their input and output modalities. See which models accept text, images, audio, or video, and what they produce. Discover vision-capable models, image generators, and true multimodal AI.
| Metric | Value |
|---|---|
| Total Models | 325 |
| Input Modalities | 5 |
| Output Modalities | 4 |
| True Multimodal | 8 |
The types of data that models can accept as input.
| Modality | Model Count | % of Total |
|---|---|---|
| text | 324 | 100% |
| image | 149 | 46% |
| file | 59 | 18% |
| video | 27 | 8% |
| audio | 17 | 5% |
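The "% of Total" column appears to be each count divided by the 325 tracked models, rounded to a whole percent. The sketch below reproduces the column under that assumption; the names `TOTAL_MODELS`, `INPUT_COUNTS`, and `percent_of_total` are illustrative and not part of the tracker. It also shows why text input is listed as 100% even though one model does not accept text: 324 of 325 is about 99.7%.

```python
# Counts copied from the input-modality table; 325 is the stated total.
TOTAL_MODELS = 325
INPUT_COUNTS = {"text": 324, "image": 149, "file": 59, "video": 27, "audio": 17}


def percent_of_total(count, total=TOTAL_MODELS):
    """Share of all models as a whole-number percentage.

    Assumption: the page rounds to the nearest integer.
    """
    return round(100 * count / total)


input_percentages = {name: percent_of_total(n) for name, n in INPUT_COUNTS.items()}
print(input_percentages)
```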
The types of data that models can produce as output.
| Modality | Model Count | % of Total |
|---|---|---|
| text | 306 | 94% |
| image | 14 | 4% |
| video | 10 | 3% |
| audio | 3 | 1% |
The most common input-to-output modality combinations across all models.
| Input Modalities | Output Modalities | Count | % of Total |
|---|---|---|---|
| text | text | 164 | 50% |
| image, text | text | 61 | 19% |
| file, image, text | text | 42 | 13% |
| image, text, video | text | 14 | 4% |
| audio, file, image, text, video | text | 11 | 3% |
| image, text | video | 8 | 2% |
| text | image | 5 | 2% |
| image, text | image | 4 | 1% |
| image, text | image, text | 3 | 1% |
| audio, text | audio, text | 3 | 1% |
| file, image, text | image, text | 2 | 1% |
| file, text | text | 2 | 1% |
| audio, image, text, video | text | 1 | 0% |
| file, image, text, video | text | 1 | 0% |
| audio, text | text | 1 | 0% |
| audio, file, image, text | text | 1 | 0% |
| image | video | 1 | 0% |
| text | video | 1 | 0% |
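The combination counts can be cross-checked against the per-modality tables above. The sketch below is illustrative only: it assumes each combination label splits into the individual modalities shown (for example, image plus text), and the names `COMBINATIONS` and `modality_totals` are hypothetical rather than anything the tracker exposes.

```python
from collections import Counter

# (input modalities, output modalities, model count), copied from the table above.
COMBINATIONS = [
    ({"text"}, {"text"}, 164),
    ({"image", "text"}, {"text"}, 61),
    ({"file", "image", "text"}, {"text"}, 42),
    ({"image", "text", "video"}, {"text"}, 14),
    ({"audio", "file", "image", "text", "video"}, {"text"}, 11),
    ({"image", "text"}, {"video"}, 8),
    ({"text"}, {"image"}, 5),
    ({"image", "text"}, {"image"}, 4),
    ({"image", "text"}, {"image", "text"}, 3),
    ({"audio", "text"}, {"audio", "text"}, 3),
    ({"file", "image", "text"}, {"image", "text"}, 2),
    ({"file", "text"}, {"text"}, 2),
    ({"audio", "image", "text", "video"}, {"text"}, 1),
    ({"file", "image", "text", "video"}, {"text"}, 1),
    ({"audio", "text"}, {"text"}, 1),
    ({"audio", "file", "image", "text"}, {"text"}, 1),
    ({"image"}, {"video"}, 1),
    ({"text"}, {"video"}, 1),
]


def modality_totals(combinations):
    """Sum model counts per modality, separately for inputs and outputs."""
    inputs, outputs = Counter(), Counter()
    for in_mods, out_mods, count in combinations:
        for modality in in_mods:
            inputs[modality] += count
        for modality in out_mods:
            outputs[modality] += count
    return inputs, outputs


total_models = sum(count for _, _, count in COMBINATIONS)
input_totals, output_totals = modality_totals(COMBINATIONS)

print(total_models)           # 325
print(input_totals["image"])  # 149
print(output_totals["text"])  # 306
```

Under that reading, the rows add up to the 325 tracked models and reproduce every count in the input and output tables, including the single image-only model that explains why text input is 324 rather than 325.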
Top 20 models with the most diverse modality support, ranked by total unique modalities.
| Model | Input Modalities | Output Modalities | Total Unique |
|---|---|---|---|
| Gemini 2.0 Flash | audio, file, image, text, video | text | 5 |
| Gemini 2.0 Flash Lite | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Flash | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Flash Lite | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Flash Lite Preview 09-2025 | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Pro | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Pro Preview 05-06 | audio, file, image, text, video | text | 5 |
| Gemini 3 Flash Preview | audio, file, image, text, video | text | 5 |
| Gemini 3.1 Flash Lite Preview | audio, file, image, text, video | text | 5 |
| Gemini 3.1 Pro Preview | audio, file, image, text, video | text | 5 |
| Gemini 3.1 Pro Preview Custom Tools | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Pro Preview 06-05 | audio, file, image, text | text | 4 |
| MiMo-V2-Omni | audio, image, text, video | text | 4 |
| Nova 2 Lite | file, image, text, video | text | 4 |
| Claude 3.5 Sonnet | file, image, text | text | 3 |
| Claude 3.7 Sonnet | file, image, text | text | 3 |
| Claude 3.7 Sonnet (thinking) | file, image, text | text | 3 |
| Claude Opus 4 | file, image, text | text | 3 |
| Claude Opus 4.1 | file, image, text | text | 3 |
| Claude Opus 4.5 | file, image, text | text | 3 |
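The "Total Unique" column is consistent with counting the distinct modalities across a model's inputs and outputs combined, so text appearing on both sides counts once. A minimal sketch of that ranking follows, using three rows from the table. The `total_unique` helper and the alphabetical tie-break are assumptions for illustration, not the tracker's documented method.

```python
def total_unique(inputs, outputs):
    """Number of distinct modalities across inputs and outputs combined."""
    return len(set(inputs) | set(outputs))


# Three rows taken from the table above: (model, inputs, outputs).
models = [
    ("Claude Opus 4.5", ["file", "image", "text"], ["text"]),
    ("Nova 2 Lite", ["file", "image", "text", "video"], ["text"]),
    ("Gemini 2.5 Pro", ["audio", "file", "image", "text", "video"], ["text"]),
]

# Highest count first; ties broken by model name (assumed tie-break).
ranked = sorted(models, key=lambda m: (-total_unique(m[1], m[2]), m[0]))

for name, inputs, outputs in ranked:
    print(name, total_unique(inputs, outputs))
# Gemini 2.5 Pro 5
# Nova 2 Lite 4
# Claude Opus 4.5 3
```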
149 models that accept images as input, sorted alphabetically.
14 models that can produce images as output.
Modalities are the types of data an AI model can process as input or generate as output. Common input modalities include text, images, and audio; this tracker also counts files and video. The output modalities tracked here are text, image, video, and audio. Code and structured data such as JSON are forms of text output.
A multimodal AI model can process or generate more than one type of data, for example by accepting both text and images as input. Models like GPT-4o, Claude 3.5, and Gemini 2.0 are multimodal, supporting text, image, and sometimes audio inputs.
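The definition above can be expressed as two simple checks, one for inputs and one for outputs. The sketch below is illustrative: the function names are hypothetical, and the page does not state how it defines "True Multimodal". One reading that fits the data is "produces more than one output modality", since the three rows of the combinations table with multiple outputs sum to 8, the headline figure.

```python
def accepts_multiple_inputs(inputs):
    """True if the model takes more than one kind of input."""
    return len(set(inputs)) > 1


def produces_multiple_outputs(outputs):
    """True if the model generates more than one kind of output."""
    return len(set(outputs)) > 1


# Rows from the combinations table with more than one output modality:
# (inputs, outputs, model count).
MULTI_OUTPUT_ROWS = [
    (("image", "text"), ("image", "text"), 3),
    (("audio", "text"), ("audio", "text"), 3),
    (("file", "image", "text"), ("image", "text"), 2),
]

multi_output_models = sum(
    count
    for _, outputs, count in MULTI_OUTPUT_ROWS
    if produces_multiple_outputs(outputs)
)

print(accepts_multiple_inputs(["text"]))           # False
print(accepts_multiple_inputs(["image", "text"]))  # True
print(multi_output_models)                         # 8
```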
Our tracker shows that vision (image) input is now common: 149 of the 325 tracked models (46%) accept images. Most top-tier models from OpenAI, Anthropic, Google, and Meta now accept image inputs alongside text, making vision a near-standard capability among flagship models.