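An analysis of 325 AI models by input and output modality. See which models accept text, image, audio, or video input and what output they produce. Find vision-capable models, image generators, and truly multimodal AI.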
Total models: 325
Input modalities: 5
Output modalities: 4
Truly multimodal: 8
Types of data that models can accept as input.
| Modality | Models | Share |
|---|---|---|
| text | 324 | 100% |
| image | 149 | 46% |
| file | 59 | 18% |
| video | 27 | 8% |
| audio | 17 | 5% |
Types of data that models can produce as output.
| Modality | Models | Share |
|---|---|---|
| text | 306 | 94% |
| image | 14 | 4% |
| video | 10 | 3% |
| audio | 3 | 1% |
The most common combinations of input and output modalities across all models.
| Input modalities | Output modalities | Models | Share |
|---|---|---|---|
| text | text | 164 | 50% |
| image, text | text | 61 | 19% |
| file, image, text | text | 42 | 13% |
| image, text, video | text | 14 | 4% |
| audio, file, image, text, video | text | 11 | 3% |
| image, text | video | 8 | 2% |
| text | image | 5 | 2% |
| image, text | image | 4 | 1% |
| image, text | image, text | 3 | 1% |
| audio, text | audio, text | 3 | 1% |
| file, image, text | image, text | 2 | 1% |
| file, text | text | 2 | 1% |
| audio, image, text, video | text | 1 | 0% |
| file, image, text, video | text | 1 | 0% |
| audio, text | text | 1 | 0% |
| audio, file, image, text | text | 1 | 0% |
| image | video | 1 | 0% |
| text | video | 1 | 0% |
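The per-modality totals in the input and output tables can be recovered from the combination table by adding each row's model count to every modality that row lists. The sketch below does this in Python. The `COMBINATIONS` list is transcribed by hand from the table above, and the names `tally` and `multi_output` are illustrative, not part of the tracker. The last figure counts models with more than one output modality; it comes to 8, the same as the "truly multimodal" number, although the page does not state how that number is defined, so treat the match as an observation rather than a definition.

```python
from collections import Counter

# (input modalities, output modalities, number of models), transcribed from
# the combination table above. Modalities are space-separated.
COMBINATIONS = [
    ("text", "text", 164),
    ("image text", "text", 61),
    ("file image text", "text", 42),
    ("image text video", "text", 14),
    ("audio file image text video", "text", 11),
    ("image text", "video", 8),
    ("text", "image", 5),
    ("image text", "image", 4),
    ("image text", "image text", 3),
    ("audio text", "audio text", 3),
    ("file image text", "image text", 2),
    ("file text", "text", 2),
    ("audio image text video", "text", 1),
    ("file image text video", "text", 1),
    ("audio text", "text", 1),
    ("audio file image text", "text", 1),
    ("image", "video", 1),
    ("text", "video", 1),
]


def tally(combinations):
    """Return (total models, input-modality counts, output-modality counts)."""
    total = 0
    inputs, outputs = Counter(), Counter()
    for in_mods, out_mods, count in combinations:
        total += count
        # A model contributes once to every modality it lists.
        for modality in in_mods.split():
            inputs[modality] += count
        for modality in out_mods.split():
            outputs[modality] += count
    return total, inputs, outputs


total, inputs, outputs = tally(COMBINATIONS)

# Assumption: "more than one output modality" is our own grouping, not a
# definition taken from the tracker.
multi_output = sum(c for _, out, c in COMBINATIONS if len(out.split()) > 1)

print("models:", total)
for label, counts in (("input", inputs), ("output", outputs)):
    for modality, count in counts.most_common():
        print(f"{label:6} {modality:5} {count:4} {round(100 * count / total)}%")
print("more than one output modality:", multi_output)
```

Because a model is counted once per modality it supports, the per-modality counts add up to more than 325, and the shares do not sum to 100%.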
The top 20 models with the most diverse modality support, ranked by total number of unique modalities.
| Model | Input modalities | Output modalities | Total unique |
|---|---|---|---|
| Gemini 2.0 Flash | audio, file, image, text, video | text | 5 |
| Gemini 2.0 Flash Lite | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Flash | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Flash Lite | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Flash Lite Preview 09-2025 | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Pro | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Pro Preview 05-06 | audio, file, image, text, video | text | 5 |
| Gemini 3 Flash Preview | audio, file, image, text, video | text | 5 |
| Gemini 3.1 Flash Lite Preview | audio, file, image, text, video | text | 5 |
| Gemini 3.1 Pro Preview | audio, file, image, text, video | text | 5 |
| Gemini 3.1 Pro Preview Custom Tools | audio, file, image, text, video | text | 5 |
| Gemini 2.5 Pro Preview 06-05 | audio, file, image, text | text | 4 |
| MiMo-V2-Omni | audio, image, text, video | text | 4 |
| Nova 2 Lite | file, image, text, video | text | 4 |
| Claude 3.5 Sonnet | file, image, text | text | 3 |
| Claude 3.7 Sonnet | file, image, text | text | 3 |
| Claude 3.7 Sonnet (thinking) | file, image, text | text | 3 |
| Claude Opus 4 | file, image, text | text | 3 |
| Claude Opus 4.1 | file, image, text | text | 3 |
| Claude Opus 4.5 | file, image, text | text | 3 |
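The page does not spell out how "total unique" is computed, but every row above is consistent with taking the size of the union of a model's input and output modalities, so that a modality appearing on both sides (such as text) is counted once. A minimal Python sketch under that assumption follows. The three sample rows are copied from the table; the function name `total_unique` and the alphabetical tie-break are illustrative guesses, not the tracker's code.

```python
def total_unique(input_modalities, output_modalities):
    """Count distinct modalities across input and output (assumed definition)."""
    return len(set(input_modalities) | set(output_modalities))


# Three rows copied from the table above: (input modalities, output modalities).
MODELS = {
    "Claude Opus 4.5": ({"file", "image", "text"}, {"text"}),
    "Gemini 2.5 Pro": ({"audio", "file", "image", "text", "video"}, {"text"}),
    "MiMo-V2-Omni": ({"audio", "image", "text", "video"}, {"text"}),
}

# Rank by descending unique-modality count; break ties by name (an assumption).
ranked = sorted(MODELS, key=lambda name: (-total_unique(*MODELS[name]), name))

for name in ranked:
    print(name, total_unique(*MODELS[name]))
```

Under this definition a text-only model scores 1, and a model that accepts text and images and returns both still scores 2.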
The 149 models that accept image input, listed alphabetically.
The 14 models that can generate image output.
Modalities refer to the types of data an AI model can process as input or generate as output. Common input modalities include text, images, and audio. Output modalities include text generation, image creation, code, and structured data like JSON.
A multimodal AI model can process or generate more than one type of data, for example by accepting both text and images as input. Models like GPT-4o, Claude 3.5, and Gemini 2.0 are multimodal, supporting text, image, and sometimes audio inputs.
Our tracker shows that a growing share of new AI models support vision (image) input: 149 of the 325 tracked models (46%) accept images. Most top-tier models from OpenAI, Anthropic, Google, and Meta now accept image inputs alongside text, making vision a near-standard capability.