Vision Capabilities
Enable LLMs to see and understand images with vision-capable models.Overview
Vision-capable LLMs can analyze images and answer questions about visual content. This enables:- Image description - Generate detailed descriptions of images
- Visual Q&A - Answer questions about image content
- OCR/Text extraction - Read text from images
- Object detection - Identify objects and entities
- Scene understanding - Understand context and relationships
- Chart analysis - Interpret graphs and visualizations
Vision-Capable Models
| Provider | Model | Strengths | Max Image Size | Languages |
|---|---|---|---|---|
| OpenAI | gpt-4o | Fast, accurate, multi-image | 20 MB | 50+ |
| OpenAI | gpt-4-turbo | High quality analysis | 20 MB | 50+ |
| Anthropic | claude-3-5-sonnet-20241022 | Excellent reasoning, documents | 5 MB | 100+ |
| Anthropic | claude-3-opus-20240229 | Superior accuracy | 5 MB | 100+ |
| gemini-1.5-pro | Long context, large files | 20 MB | 100+ | |
| gemini-1.5-flash | Fast, cost-effective | 20 MB | 100+ | |
| Mistral | pixtral-12b | Efficient, European | 10 MB | 50+ |
Basic Image Analysis
Simple Image Description
Visual Question Answering
Advanced Vision Use Cases
OCR and Text Extraction
Extract text from images with high accuracy:Object and Entity Detection
Identify objects, brands, and entities:Chart and Graph Analysis
Interpret data visualizations:Screenshot Analysis
Debug UI issues or analyze interfaces:Logo and Brand Detection
Identify brands and logos:Multi-Image Analysis
Compare and analyze multiple images:Before/After Comparison
Multi-Image Context
Analyze related images together:Provider Comparison
OpenAI (GPT-4o, GPT-4-turbo)
Strengths:- Fast processing
- Excellent general-purpose vision
- Strong multi-image capabilities
- Reliable OCR
- Good detail detection
- Real-time applications
- Multi-image analysis
- General image understanding
- Screenshot analysis
Anthropic (Claude 3 Family)
Strengths:- Superior reasoning about images
- Excellent document analysis
- Strong at complex visual tasks
- Detailed, thoughtful responses
- Multi-language support
- Document processing
- Complex reasoning tasks
- Detailed analysis
- Academic/research content
Google (Gemini 1.5)
Strengths:- Extremely long context (up to 2GB)
- Fast processing (Flash variant)
- Strong multilingual capabilities
- Excellent for large documents
- Cost-effective (Flash)
- Large document processing
- Multi-page PDFs
- Video frame analysis
- High-volume applications
Mistral (Pixtral)
Strengths:- European data residency
- Efficient processing
- Good price/performance
- Privacy-focused
- European compliance needs
- Cost-sensitive applications
- Privacy requirements
Image Input Formats
HTTP(S) URLs
Simplest method for accessible images:Base64 Data URLs
For inline or private images:Uploaded File UUIDs
For reusable images:Best Practices
Prompting for Vision
Be specific about what you want:Image Quality Tips
Optimize resolution:- Use high-quality images (min 1024px on longest side)
- Avoid excessive compression
- Ensure text is legible
- Well-lit images work best
- Avoid glare and shadows
- Ensure good contrast
- Center subjects of interest
- Avoid clutter when possible
- Crop to relevant content
Temperature Settings
Adjust temperature based on task:Cost Optimization
Choose appropriate models:- Use
gemini-1.5-flashfor high-volume tasks - Reserve
claude-3-opusfor complex analysis - Use
gpt-4ofor balanced performance
- Resize images to minimum needed resolution
- Compress without losing critical details
- Use URLs instead of base64 when possible
Error Handling
Common Vision Errors
Unsupported image format:Handling Vision Errors
Supported Image Formats
| Format | Extension | OpenAI | Anthropic | Mistral | |
|---|---|---|---|---|---|
| JPEG | .jpg, .jpeg | ✓ | ✓ | ✓ | ✓ |
| PNG | .png | ✓ | ✓ | ✓ | ✓ |
| WebP | .webp | ✓ | ✓ | ✓ | ✓ |
| GIF | .gif | ✓ | ✓ | ✓ | - |
Next Steps
- Working with Media Files - Complete media guide
- File Attachments - Handle documents and PDFs
- Chat Completions - Core LLM features
- Streaming Responses - Handle SSE streams
- Upload Files - File management