Skip to main content

Vision Capabilities

Enable LLMs to see and understand images with vision-capable models.

Overview

Vision-capable LLMs can analyze images and answer questions about visual content. This enables:
  • Image description - Generate detailed descriptions of images
  • Visual Q&A - Answer questions about image content
  • OCR/Text extraction - Read text from images
  • Object detection - Identify objects and entities
  • Scene understanding - Understand context and relationships
  • Chart analysis - Interpret graphs and visualizations
Eden AI V3 provides vision capabilities through multiple providers, each with unique strengths.

Vision-Capable Models

ProviderModelStrengthsMax Image SizeLanguages
OpenAIgpt-4oFast, accurate, multi-image20 MB50+
OpenAIgpt-4-turboHigh quality analysis20 MB50+
Anthropicclaude-3-5-sonnet-20241022Excellent reasoning, documents5 MB100+
Anthropicclaude-3-opus-20240229Superior accuracy5 MB100+
Googlegemini-1.5-proLong context, large files20 MB100+
Googlegemini-1.5-flashFast, cost-effective20 MB100+
Mistralpixtral-12bEfficient, European10 MB50+

Basic Image Analysis

Simple Image Description

Visual Question Answering

Advanced Vision Use Cases

OCR and Text Extraction

Extract text from images with high accuracy:

Object and Entity Detection

Identify objects, brands, and entities:

Chart and Graph Analysis

Interpret data visualizations:

Screenshot Analysis

Debug UI issues or analyze interfaces:

Logo and Brand Detection

Identify brands and logos:

Multi-Image Analysis

Compare and analyze multiple images:

Before/After Comparison

Multi-Image Context

Analyze related images together:

Provider Comparison

OpenAI (GPT-4o, GPT-4-turbo)

Strengths:
  • Fast processing
  • Excellent general-purpose vision
  • Strong multi-image capabilities
  • Reliable OCR
  • Good detail detection
Best for:
  • Real-time applications
  • Multi-image analysis
  • General image understanding
  • Screenshot analysis
Example:
"model": "openai/gpt-4o"

Anthropic (Claude 3 Family)

Strengths:
  • Superior reasoning about images
  • Excellent document analysis
  • Strong at complex visual tasks
  • Detailed, thoughtful responses
  • Multi-language support
Best for:
  • Document processing
  • Complex reasoning tasks
  • Detailed analysis
  • Academic/research content
Example:
"model": "anthropic/claude-3-5-sonnet-20241022"

Google (Gemini 1.5)

Strengths:
  • Extremely long context (up to 2GB)
  • Fast processing (Flash variant)
  • Strong multilingual capabilities
  • Excellent for large documents
  • Cost-effective (Flash)
Best for:
  • Large document processing
  • Multi-page PDFs
  • Video frame analysis
  • High-volume applications
Example:
"model": "google/gemini-1.5-flash"

Mistral (Pixtral)

Strengths:
  • European data residency
  • Efficient processing
  • Good price/performance
  • Privacy-focused
Best for:
  • European compliance needs
  • Cost-sensitive applications
  • Privacy requirements
Example:
"model": "mistral/pixtral-12b"

Image Input Formats

HTTP(S) URLs

Simplest method for accessible images:
{
    "type": "image_url",
    "image_url": {
        "url": "https://example.com/image.jpg"
    }
}

Base64 Data URLs

For inline or private images:
import base64

with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode('utf-8')

{
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{image_data}"
    }
}

Uploaded File UUIDs

For reusable images:
# Upload first
upload_response = requests.post(
    "https://api.edenai.run/v3/upload",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    files={"file": open("image.jpg", "rb")}
)
file_id = upload_response.json()["file_id"]

# Use in vision request
{
    "type": "file",
    "file": {"file_id": file_id}
}

Best Practices

Prompting for Vision

Be specific about what you want:
# Vague
"What's in this image?"

# Specific
"List all furniture items visible in this room photo, including their approximate positions and colors."
Request structured output:
"Extract the following from this business card and format as JSON:
- name
- title
- company
- email
- phone"
Provide context:
"This is a medical X-ray of a chest. Identify any abnormalities or concerning features."

Image Quality Tips

Optimize resolution:
  • Use high-quality images (min 1024px on longest side)
  • Avoid excessive compression
  • Ensure text is legible
Proper lighting:
  • Well-lit images work best
  • Avoid glare and shadows
  • Ensure good contrast
Clear framing:
  • Center subjects of interest
  • Avoid clutter when possible
  • Crop to relevant content

Temperature Settings

Adjust temperature based on task:
# Factual tasks (OCR, counting, detection)
"temperature": 0.1

# General description
"temperature": 0.5

# Creative interpretation
"temperature": 0.8

Cost Optimization

Choose appropriate models:
  • Use gemini-1.5-flash for high-volume tasks
  • Reserve claude-3-opus for complex analysis
  • Use gpt-4o for balanced performance
Image size optimization:
  • Resize images to minimum needed resolution
  • Compress without losing critical details
  • Use URLs instead of base64 when possible

Error Handling

Common Vision Errors

Unsupported image format:
{
  "error": {
    "code": "unsupported_format",
    "message": "Image format .bmp is not supported"
  }
}
Image too large:
{
  "error": {
    "code": "image_too_large",
    "message": "Image size exceeds 20 MB limit for this provider"
  }
}
Invalid image data:
{
  "error": {
    "code": "invalid_image",
    "message": "Unable to process image data"
  }
}

Handling Vision Errors

Supported Image Formats

FormatExtensionOpenAIAnthropicGoogleMistral
JPEG.jpg, .jpeg
PNG.png
WebP.webp
GIF.gif-

Next Steps