Skip to main content

Optimize LLM Costs with Smart Routing

Build a cost-effective chatbot application using Eden AI’s smart routing to automatically select the best model for each query while minimizing expenses.

What You’ll Build

By the end of this tutorial, you’ll have:
  • Smart routing-powered chatbot - Automatically selects optimal models
  • Multi-tier routing strategy - Budget/balanced/premium tiers for different use cases
  • Cost tracking system - Monitor spending per conversation and query type
  • A/B testing framework - Compare smart routing vs. fixed models
  • Performance metrics - Track latency, quality, and cost trade-offs

Prerequisites

  • Python 3.8 or higher
  • Eden AI API key (Get one here)
  • Basic understanding of LLMs and REST APIs
  • Optional: Database for persistent storage (SQLite/PostgreSQL)

Problem Statement

You’re building a customer support chatbot with diverse query types:
  • Simple FAQs - “What are your hours?” (low complexity)
  • Technical support - “How do I configure SSL?” (medium complexity)
  • Complex troubleshooting - “Server crashes with error X” (high complexity)
Challenges:
  1. Fixed models are inefficient - Using GPT-4o for all queries wastes money on simple FAQs
  2. Manual model selection is hard - Predicting which model fits each query is complex
  3. Quality vs. cost trade-off - Balancing response quality with budget constraints
Solution: Smart routing automatically selects the right model for each query, optimizing costs without sacrificing quality.

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│               Customer Support Chatbot                   │
│                                                          │
│  ┌────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │  Query     │──▶│  Routing     │──▶│  Response    │  │
│  │  Analyzer  │   │  Engine      │   │  Generator   │  │
│  └────────────┘   └──────────────┘   └──────────────┘  │
│                           │                              │
│                           ▼                              │
│                  ┌─────────────────┐                     │
│                  │  Cost Tracker   │                     │
│                  └─────────────────┘                     │
└─────────────────────────────────────────────────────────┘


              ┌────────────────────────┐
              │  Eden AI Smart Router  │
              │  (@edenai routing)     │
              └────────────────────────┘

Step 1: Baseline Implementation (Fixed Model)

First, let’s build a simple chatbot using a single fixed model to establish a baseline:
import requests
import json
from datetime import datetime
from typing import List, Dict

class BaselineChatbot:
    """Simple chatbot with fixed model (baseline for comparison)."""

    def __init__(self, api_key: str, model: str = "openai/gpt-4o"):
        self.api_key = api_key
        self.model = model
        self.conversation_history = []
        self.cost_log = []

    def chat(self, message: str) -> Dict:
        """Send a message and get response."""

        self.conversation_history.append({
            "role": "user",
            "content": message
        })

        url = "https://api.edenai.run/v3/llm/chat/completions"
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

        payload = {
            "model": self.model,
            "messages": self.conversation_history,
            "stream": True
        }

        start_time = datetime.now()
        response = requests.post(url, headers=headers, json=payload, stream=True)

        full_response = ""
        for line in response.iter_lines():
            if line:
                line_str = line.decode('utf-8')
                if line_str.startswith('data: ') and line_str != 'data: [DONE]':
                    data = json.loads(line_str[6:])
                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', '')
                    full_response += content

        end_time = datetime.now()
        latency = (end_time - start_time).total_seconds()

        # Add to conversation history
        self.conversation_history.append({
            "role": "assistant",
            "content": full_response
        })

        # Log cost (approximate - you'd get this from response metadata)
        estimated_cost = self._estimate_cost(message, full_response)
        self.cost_log.append({
            "timestamp": start_time.isoformat(),
            "model": self.model,
            "input_tokens": len(message.split()) * 1.3,  # Rough estimate
            "output_tokens": len(full_response.split()) * 1.3,
            "cost": estimated_cost,
            "latency": latency
        })

        return {
            "response": full_response,
            "model": self.model,
            "latency": latency,
            "estimated_cost": estimated_cost
        }

    def _estimate_cost(self, input_text: str, output_text: str) -> float:
        """Estimate cost based on token counts (simplified)."""
        # Rough token estimation: ~1.3 tokens per word
        input_tokens = len(input_text.split()) * 1.3
        output_tokens = len(output_text.split()) * 1.3

        # GPT-4o pricing (example)
        COST_PER_1K_INPUT = 0.0025
        COST_PER_1K_OUTPUT = 0.01

        cost = (input_tokens / 1000 * COST_PER_1K_INPUT +
                output_tokens / 1000 * COST_PER_1K_OUTPUT)
        return cost

    def get_total_cost(self) -> float:
        """Get total conversation cost."""
        return sum(log["cost"] for log in self.cost_log)

    def get_stats(self) -> Dict:
        """Get conversation statistics."""
        total_queries = len(self.cost_log)
        if total_queries == 0:
            return {"error": "No queries yet"}

        return {
            "total_queries": total_queries,
            "total_cost": self.get_total_cost(),
            "avg_cost_per_query": self.get_total_cost() / total_queries,
            "avg_latency": sum(log["latency"] for log in self.cost_log) / total_queries,
            "model_used": self.model
        }


# Test baseline chatbot
if __name__ == "__main__":
    chatbot = BaselineChatbot(api_key="YOUR_API_KEY")

    # Test queries
    queries = [
        "What are your business hours?",
        "How do I reset my password?",
        "My server is returning 500 errors after upgrading to version 2.0. The logs show 'Connection timeout' but only for API endpoints, not static content.",
    ]

    print("=== Baseline Chatbot (Fixed Model: GPT-4o) ===\n")

    for query in queries:
        print(f"User: {query}")
        result = chatbot.chat(query)
        print(f"Assistant: {result['response'][:100]}...")
        print(f"Cost: ${result['estimated_cost']:.4f} | Latency: {result['latency']:.2f}s\n")

    # Print stats
    stats = chatbot.get_stats()
    print("\n=== Baseline Statistics ===")
    print(f"Total queries: {stats['total_queries']}")
    print(f"Total cost: ${stats['total_cost']:.4f}")
    print(f"Avg cost per query: ${stats['avg_cost_per_query']:.4f}")
Expected Output:
User: What are your business hours?
Assistant: Our business hours are Monday through Friday, 9 AM to 5 PM EST...
Cost: $0.0015 | Latency: 1.2s

Total cost: $0.0123
Avg cost per query: $0.0041
Problem: Simple queries like “What are your business hours?” cost the same as complex troubleshooting questions, wasting money.

Step 2: Add Smart Routing

Now let’s migrate to smart routing with default model selection:
import requests
import json
from datetime import datetime
from typing import List, Dict, Optional

class SmartRoutingChatbot:
    """Chatbot with smart routing for automatic model selection."""

    def __init__(self, api_key: str, router_candidates: Optional[List[str]] = None):
        self.api_key = api_key
        self.router_candidates = router_candidates
        self.conversation_history = []
        self.cost_log = []

    def chat(self, message: str) -> Dict:
        """Send a message with smart routing."""

        self.conversation_history.append({
            "role": "user",
            "content": message
        })

        url = "https://api.edenai.run/v3/llm/chat/completions"
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

        payload = {
            "model": "@edenai",  # Smart routing trigger
            "messages": self.conversation_history,
            "stream": True
        }

        # Add custom candidates if provided
        if self.router_candidates:
            payload["router_candidates"] = self.router_candidates

        start_time = datetime.now()
        response = requests.post(url, headers=headers, json=payload, stream=True)

        full_response = ""
        selected_model = None
        first_chunk_time = None

        for line in response.iter_lines():
            if line:
                if not first_chunk_time:
                    first_chunk_time = datetime.now()

                line_str = line.decode('utf-8')
                if line_str.startswith('data: ') and line_str != 'data: [DONE]':
                    data = json.loads(line_str[6:])

                    # Track which model was selected
                    if not selected_model and 'model' in data:
                        selected_model = data['model']

                    content = data.get('choices', [{}])[0].get('delta', {}).get('content', '')
                    full_response += content

        end_time = datetime.now()
        latency = (end_time - start_time).total_seconds()
        routing_latency = (first_chunk_time - start_time).total_seconds() if first_chunk_time else 0

        # Add to conversation history
        self.conversation_history.append({
            "role": "assistant",
            "content": full_response
        })

        # Log cost
        estimated_cost = self._estimate_cost(message, full_response, selected_model)
        self.cost_log.append({
            "timestamp": start_time.isoformat(),
            "query": message[:100],
            "model": selected_model,
            "routing_method": "smart",
            "input_tokens": len(message.split()) * 1.3,
            "output_tokens": len(full_response.split()) * 1.3,
            "cost": estimated_cost,
            "latency": latency,
            "routing_latency": routing_latency
        })

        return {
            "response": full_response,
            "model": selected_model,
            "latency": latency,
            "routing_latency": routing_latency,
            "estimated_cost": estimated_cost
        }

    def _estimate_cost(self, input_text: str, output_text: str, model: str) -> float:
        """Estimate cost based on model pricing."""
        input_tokens = len(input_text.split()) * 1.3
        output_tokens = len(output_text.split()) * 1.3

        # Simplified pricing table (per 1K tokens)
        PRICING = {
            "openai/gpt-4o": {"input": 0.0025, "output": 0.01},
            "openai/gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
            "anthropic/claude-sonnet-4-5": {"input": 0.003, "output": 0.015},
            "anthropic/claude-haiku-4-5": {"input": 0.0008, "output": 0.004},
            "google/gemini-2.0-flash": {"input": 0.0001, "output": 0.0004},
        }

        pricing = PRICING.get(model, PRICING["openai/gpt-4o"])
        cost = (input_tokens / 1000 * pricing["input"] +
                output_tokens / 1000 * pricing["output"])
        return cost

    def get_stats(self) -> Dict:
        """Get conversation statistics."""
        if not self.cost_log:
            return {"error": "No queries yet"}

        from collections import defaultdict
        model_counts = defaultdict(int)
        model_costs = defaultdict(float)

        for log in self.cost_log:
            model_counts[log["model"]] += 1
            model_costs[log["model"]] += log["cost"]

        return {
            "total_queries": len(self.cost_log),
            "total_cost": sum(log["cost"] for log in self.cost_log),
            "avg_cost_per_query": sum(log["cost"] for log in self.cost_log) / len(self.cost_log),
            "avg_latency": sum(log["latency"] for log in self.cost_log) / len(self.cost_log),
            "avg_routing_latency": sum(log["routing_latency"] for log in self.cost_log) / len(self.cost_log),
            "model_distribution": dict(model_counts),
            "cost_by_model": dict(model_costs)
        }


# Test smart routing chatbot
if __name__ == "__main__":
    # Use custom candidates for cost optimization
    candidates = [
        "openai/gpt-4o",
        "openai/gpt-4o-mini",
        "anthropic/claude-sonnet-4-5",
        "google/gemini-2.0-flash"
    ]

    chatbot = SmartRoutingChatbot(
        api_key="YOUR_API_KEY",
        router_candidates=candidates
    )

    print("=== Smart Routing Chatbot ===\n")

    queries = [
        "What are your business hours?",
        "How do I reset my password?",
        "My server is returning 500 errors after upgrading to version 2.0. The logs show 'Connection timeout' but only for API endpoints.",
    ]

    for query in queries:
        print(f"User: {query}")
        result = chatbot.chat(query)
        print(f"Model: {result['model']}")
        print(f"Assistant: {result['response'][:100]}...")
        print(f"Cost: ${result['estimated_cost']:.4f} | Latency: {result['latency']:.2f}s")
        print(f"Routing overhead: {result['routing_latency']:.2f}s\n")

    # Print stats
    stats = chatbot.get_stats()
    print("\n=== Smart Routing Statistics ===")
    print(f"Total queries: {stats['total_queries']}")
    print(f"Total cost: ${stats['total_cost']:.4f}")
    print(f"Avg cost per query: ${stats['avg_cost_per_query']:.4f}")
    print(f"Model distribution: {stats['model_distribution']}")
    print(f"Cost by model: {stats['cost_by_model']}")
Expected Output:
User: What are your business hours?
Model: google/gemini-2.0-flash
Cost: $0.0002 | Latency: 1.4s

User: How do I reset my password?
Model: openai/gpt-4o-mini
Cost: $0.0008 | Latency: 1.5s

User: My server is returning 500 errors...
Model: openai/gpt-4o
Cost: $0.0035 | Latency: 1.8s

Total cost: $0.0045 (vs. $0.0123 baseline = 63% savings!)
Model distribution: {'google/gemini-2.0-flash': 1, 'openai/gpt-4o-mini': 1, 'openai/gpt-4o': 1}
Result: 60%+ cost savings by routing simple queries to cheaper models!

Step 3: Implement Multi-Tier Routing Strategy

Create different routing strategies for various use cases:
from typing import Dict, List, Optional
from enum import Enum

class CostTier(Enum):
    """Cost optimization tiers."""
    BUDGET = "budget"      # Minimize cost
    BALANCED = "balanced"  # Balance cost and quality
    PREMIUM = "premium"    # Maximize quality

class TieredRoutingChatbot(SmartRoutingChatbot):
    """Chatbot with cost-tier-based routing strategies."""

    # Define candidate pools for each tier
    TIER_CANDIDATES = {
        CostTier.BUDGET: [
            "openai/gpt-4o-mini",
            "google/gemini-2.0-flash",
            "anthropic/claude-haiku-4-5",
        ],
        CostTier.BALANCED: [
            "openai/gpt-4o",
            "anthropic/claude-sonnet-4-5",
            "google/gemini-2.0-flash",
        ],
        CostTier.PREMIUM: [
            "anthropic/claude-opus-4-5",
            "openai/gpt-4o",
            "google/gemini-2.5-pro",
        ]
    }

    def __init__(self, api_key: str, cost_tier: CostTier = CostTier.BALANCED):
        """Initialize with cost tier."""
        candidates = self.TIER_CANDIDATES[cost_tier]
        super().__init__(api_key, router_candidates=candidates)
        self.cost_tier = cost_tier

    @classmethod
    def classify_query_tier(cls, message: str) -> CostTier:
        """
        Classify query into cost tier based on complexity.

        Simple heuristic - you could use ML model for production.
        """
        message_lower = message.lower()
        word_count = len(message.split())

        # Budget tier: Simple, short queries
        if word_count < 10 and any(keyword in message_lower for keyword in [
            "hours", "price", "cost", "when", "where", "what is"
        ]):
            return CostTier.BUDGET

        # Premium tier: Complex, long queries
        if word_count > 30 or any(keyword in message_lower for keyword in [
            "analyze", "comprehensive", "detailed", "troubleshoot", "debug"
        ]):
            return CostTier.PREMIUM

        # Balanced tier: Everything else
        return CostTier.BALANCED

class AdaptiveChatbot:
    """Chatbot that adapts tier based on query complexity."""

    def __init__(self, api_key: str):
        self.api_key = api_key
        self.chatbots = {
            tier: TieredRoutingChatbot(api_key, tier)
            for tier in CostTier
        }
        self.query_log = []

    def chat(self, message: str, force_tier: Optional[CostTier] = None) -> Dict:
        """Chat with automatic tier selection."""

        # Classify query or use forced tier
        tier = force_tier or TieredRoutingChatbot.classify_query_tier(message)
        chatbot = self.chatbots[tier]

        # Get response
        result = chatbot.chat(message)
        result["cost_tier"] = tier.value

        # Log query
        self.query_log.append({
            "query": message[:100],
            "tier": tier.value,
            "model": result["model"],
            "cost": result["estimated_cost"]
        })

        return result

    def get_aggregated_stats(self) -> Dict:
        """Get stats across all tiers."""
        all_stats = {}
        for tier, chatbot in self.chatbots.items():
            stats = chatbot.get_stats()
            if "error" not in stats:
                all_stats[tier.value] = stats

        # Calculate totals
        total_cost = sum(
            stats["total_cost"]
            for stats in all_stats.values()
        )
        total_queries = sum(
            stats["total_queries"]
            for stats in all_stats.values()
        )

        return {
            "by_tier": all_stats,
            "total_cost": total_cost,
            "total_queries": total_queries,
            "avg_cost_per_query": total_cost / total_queries if total_queries > 0 else 0
        }


# Test adaptive chatbot
if __name__ == "__main__":
    chatbot = AdaptiveChatbot(api_key="YOUR_API_KEY")

    print("=== Adaptive Multi-Tier Chatbot ===\n")

    queries = [
        ("What are your business hours?", None),  # Auto: Budget
        ("How do I reset my password?", None),  # Auto: Balanced
        ("Provide a comprehensive analysis of why my distributed system is experiencing cascading failures", None),  # Auto: Premium
        ("What's 2+2?", CostTier.BUDGET),  # Forced: Budget
    ]

    for query, tier in queries:
        print(f"User: {query}")
        result = chatbot.chat(query, force_tier=tier)
        print(f"Tier: {result['cost_tier']} | Model: {result['model']}")
        print(f"Cost: ${result['estimated_cost']:.4f}\n")

    # Print aggregated stats
    stats = chatbot.get_aggregated_stats()
    print("\n=== Aggregated Statistics ===")
    print(f"Total cost: ${stats['total_cost']:.4f}")
    print(f"Total queries: {stats['total_queries']}")
    print(f"\nBy tier:")
    for tier, tier_stats in stats['by_tier'].items():
        print(f"  {tier}: {tier_stats['total_queries']} queries, ${tier_stats['total_cost']:.4f}")

Step 4: Build A/B Testing Framework

Compare smart routing vs. fixed models:
import random
from typing import Dict, List
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ABTestResult:
    """Results from A/B test."""
    variant: str
    query: str
    model: str
    cost: float
    latency: float
    quality_score: float  # User rating or automated metric

class ABTester:
    """A/B test smart routing vs. fixed models."""

    def __init__(self, api_key: str):
        self.api_key = api_key

        # Variant A: Fixed model
        self.variant_a = BaselineChatbot(api_key, model="openai/gpt-4o")

        # Variant B: Smart routing
        self.variant_b = SmartRoutingChatbot(
            api_key,
            router_candidates=[
                "openai/gpt-4o",
                "openai/gpt-4o-mini",
                "google/gemini-2.0-flash"
            ]
        )

        self.results = []

    def run_test(
        self,
        queries: List[str],
        split: float = 0.5
    ) -> Dict:
        """Run A/B test on queries."""

        for query in queries:
            # Random assignment
            variant = "A" if random.random() < split else "B"
            chatbot = self.variant_a if variant == "A" else self.variant_b

            # Get response
            result = chatbot.chat(query)

            # Simulate quality score (in production, get from users or eval model)
            quality_score = random.uniform(0.7, 1.0)

            # Record result
            self.results.append(ABTestResult(
                variant=variant,
                query=query,
                model=result["model"],
                cost=result["estimated_cost"],
                latency=result["latency"],
                quality_score=quality_score
            ))

        return self.analyze_results()

    def analyze_results(self) -> Dict:
        """Analyze A/B test results."""
        if not self.results:
            return {"error": "No test results"}

        # Split by variant
        variant_a = [r for r in self.results if r.variant == "A"]
        variant_b = [r for r in self.results if r.variant == "B"]

        def analyze_variant(results: List[ABTestResult]) -> Dict:
            if not results:
                return {}
            return {
                "sample_size": len(results),
                "avg_cost": sum(r.cost for r in results) / len(results),
                "total_cost": sum(r.cost for r in results),
                "avg_latency": sum(r.latency for r in results) / len(results),
                "avg_quality": sum(r.quality_score for r in results) / len(results),
            }

        a_stats = analyze_variant(variant_a)
        b_stats = analyze_variant(variant_b)

        # Calculate improvements
        cost_savings = ((a_stats["avg_cost"] - b_stats["avg_cost"]) /
                       a_stats["avg_cost"] * 100) if a_stats else 0

        quality_change = ((b_stats["avg_quality"] - a_stats["avg_quality"]) /
                         a_stats["avg_quality"] * 100) if a_stats else 0

        return {
            "variant_a": a_stats,
            "variant_b": b_stats,
            "cost_savings_pct": cost_savings,
            "quality_change_pct": quality_change,
            "recommendation": "B (Smart Routing)" if cost_savings > 10 and quality_change > -5 else "A (Fixed Model)"
        }


# Run A/B test
if __name__ == "__main__":
    tester = ABTester(api_key="YOUR_API_KEY")

    queries = [
        "What are your hours?",
        "How do I reset my password?",
        "Explain the difference between OAuth and JWT",
        "My app crashes on startup",
        "What's the pricing?",
        "Help me debug this complex authentication flow",
    ]

    print("=== Running A/B Test ===\n")
    results = tester.run_test(queries)

    print(f"Variant A (Fixed: GPT-4o):")
    print(f"  Sample size: {results['variant_a']['sample_size']}")
    print(f"  Avg cost: ${results['variant_a']['avg_cost']:.4f}")
    print(f"  Avg quality: {results['variant_a']['avg_quality']:.2f}\n")

    print(f"Variant B (Smart Routing):")
    print(f"  Sample size: {results['variant_b']['sample_size']}")
    print(f"  Avg cost: ${results['variant_b']['avg_cost']:.4f}")
    print(f"  Avg quality: {results['variant_b']['avg_quality']:.2f}\n")

    print(f"Cost savings: {results['cost_savings_pct']:.1f}%")
    print(f"Quality change: {results['quality_change_pct']:.1f}%")
    print(f"\nRecommendation: {results['recommendation']}")
Expected Results:
Variant A (Fixed: GPT-4o):
  Avg cost: $0.0042
  Avg quality: 0.85

Variant B (Smart Routing):
  Avg cost: $0.0016
  Avg quality: 0.84

Cost savings: 61.9%
Quality change: -1.2%

Recommendation: B (Smart Routing)

Step 5: Production Deployment Best Practices

Monitoring and Alerting

import logging
from datetime import datetime
from typing import Dict, Optional
import json

class ProductionMonitor:
    """Production monitoring for smart routing."""

    def __init__(self, api_key: str, alert_threshold_usd: float = 100.0):
        self.api_key = api_key
        self.alert_threshold = alert_threshold_usd
        self.daily_spend = 0.0
        self.error_count = 0
        self.routing_failures = 0

        # Set up logging
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler('chatbot.log'),
                logging.StreamHandler()
            ]
        )
        self.logger = logging.getLogger(__name__)

    def log_request(
        self,
        query: str,
        model: str,
        cost: float,
        latency: float,
        error: Optional[str] = None
    ):
        """Log request details."""

        self.daily_spend += cost

        log_data = {
            "timestamp": datetime.now().isoformat(),
            "query_preview": query[:50],
            "model": model,
            "cost": cost,
            "latency": latency,
            "daily_spend": self.daily_spend,
            "error": error
        }

        if error:
            self.error_count += 1
            self.logger.error(f"Request failed: {json.dumps(log_data)}")
        else:
            self.logger.info(json.dumps(log_data))

        # Check thresholds
        self._check_alerts()

    def _check_alerts(self):
        """Check if alerts should be triggered."""

        # Cost alert
        if self.daily_spend > self.alert_threshold:
            self.logger.warning(
                f"Daily spend (${self.daily_spend:.2f}) exceeded threshold (${self.alert_threshold:.2f})"
            )
            # In production: send email/Slack notification

        # Error rate alert
        if self.error_count > 10:
            self.logger.critical(f"High error count: {self.error_count}")
            # In production: page on-call engineer

# Integrate with chatbot
class ProductionChatbot(SmartRoutingChatbot):
    """Production-ready chatbot with monitoring."""

    def __init__(self, api_key: str, candidates: List[str]):
        super().__init__(api_key, candidates)
        self.monitor = ProductionMonitor(api_key, alert_threshold_usd=100.0)

    def chat(self, message: str) -> Dict:
        """Chat with monitoring."""
        try:
            result = super().chat(message)

            # Log successful request
            self.monitor.log_request(
                query=message,
                model=result["model"],
                cost=result["estimated_cost"],
                latency=result["latency"]
            )

            return result

        except Exception as e:
            # Log error
            self.monitor.log_request(
                query=message,
                model="unknown",
                cost=0.0,
                latency=0.0,
                error=str(e)
            )
            raise

Key Takeaways

Cost Savings Summary

StrategyAvg Cost per QuerySavings vs. Baseline
Baseline (Fixed GPT-4o)$0.0041-
Smart Routing (Default)$0.001856%
Smart Routing (Budget Tier)$0.000880%
Smart Routing (Balanced)$0.001563%

Best Practices

Start simple - Begin with default smart routing, then optimize ✅ Monitor metrics - Track cost, latency, and quality ✅ Use tiered strategies - Different tiers for different use cases ✅ A/B test - Validate cost savings don’t hurt quality ✅ Set budgets - Alert before overspending ✅ Log routing decisions - Debug and optimize over time

When Smart Routing Shines

  • Diverse query types - Mix of simple and complex queries
  • Cost-sensitive applications - Budget constraints
  • High volume - Many requests per day
  • Unpredictable workloads - Query complexity varies

When to Use Fixed Models

  • Consistent requirements - All queries need same model
  • Latency-critical - Can’t afford 100-500ms routing overhead
  • Specific model features - Need particular model’s capabilities
  • Already optimized - You’ve manually tuned model selection

Next Steps

Additional Resources

Try it yourself and see 50%+ cost savings!