ADR-001: Multi-Provider LLM Strategy

Status: ✅ Accepted
Date: 2024-09-15 (Q3 2024)
Decision Makers: CTO, ML Engineering Lead
Consulted: Backend Team, Finance Team
Informed: Engineering Team, Product Team


Context

MachineAvatars is an AI-powered conversational platform that heavily relies on Large Language Models (LLMs) for generating responses. The initial implementation used a single LLM provider (OpenAI GPT-4), but we faced several challenges:

Challenges with Single Provider:

  1. Vendor Lock-in Risk: Complete dependence on OpenAI's availability and pricing
  2. Cost Escalation: GPT-4 costs ($0.03/1K input tokens, $0.06/1K output tokens) were becoming prohibitive at scale
  3. Rate Limiting: Hit rate limits during peak usage, causing service degradation
  4. Feature Limitations: No access to multimodal capabilities or real-time information
  5. Regional Availability: Limited availability in certain regions

Business Requirements:

  • Support 10,000+ conversations/day
  • Maintain response quality while reducing costs
  • Ensure 99.5% uptime
  • Support diverse use cases (simple FAQ, complex reasoning, code generation)
  • Future-proof against vendor changes

Decision

We adopted a multi-provider LLM strategy with 10 models across 5 providers:

Implementation Architecture

# LLM Model Service (Port 8016)
AVAILABLE_MODELS = {
    # Azure OpenAI (Primary)
    "openai-4": "gpt-4-0613",
    "openai-35": "gpt-35-turbo-16k-0613",
    "openai-4o-mini": "gpt-4o-mini-2024-07-18",
    "gpto1-mini": "o1-mini-2024-09-12",

    # Azure ML Endpoints (Cost-effective)
    "llama": "Llama 3.3 70B Instruct",
    "deepseek": "DeepSeek R1",
    "ministral": "Ministral 3B",
    "phi": "Phi-3 Small 8K Instruct",

    # External APIs (Specialized)
    "Gemini Flash-2.5": "gemini-2.0-flash",
    "Claude sonnet 4": "claude-3-5-sonnet-20241022",
    "Grok-3": "grok-3"
}

Routing Strategy

User-Configurable (Current):

  • Users select preferred model per chatbot
  • Default: GPT-3.5 Turbo 16K (cost-effective)

Planned (Q1 2025):

  • Automatic routing based on query complexity
  • Fallback chain for reliability (see the sketch below)
  • Cost optimization via intelligent routing
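
The planned fallback chain could be wired on top of the existing /call-model endpoint. A minimal sketch, assuming a hypothetical call_llm_service() client and an illustrative chain ordering; neither is the shipped implementation:

# Hypothetical sketch of the planned fallback chain (not yet implemented)
import requests

FALLBACK_CHAIN = ["openai-35", "openai-4o-mini", "llama"]  # illustrative ordering

def call_llm_service(model_name: str, messages: list[dict], timeout: int = 30) -> dict:
    """POST to the unified /call-model endpoint of the LLM Model Service (port 8016)."""
    response = requests.post(
        f"http://llm-model-service:8016/call-model/{model_name}",
        json={"messages": messages},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()

def call_with_fallback(messages: list[dict]) -> dict:
    """Try each model in the chain until one succeeds."""
    last_error = None
    for model_name in FALLBACK_CHAIN:
        try:
            return call_llm_service(model_name, messages)
        except Exception as exc:  # rate limits, timeouts, 5xx responses, etc.
            last_error = exc
    raise RuntimeError("All models in the fallback chain failed") from last_error

In practice the chain ordering would be driven by the cost-optimization routing rules planned above.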

Alternatives Considered

Alternative 1: Single Provider (OpenAI Only)

Evaluated: OpenAI GPT-4 Turbo as sole model

Pros:

  • Simplest implementation
  • Consistent API
  • High quality responses
  • Strong documentation

Cons:

  • Vendor lock-in: Complete dependence
  • Cost: $0.03-0.06/1K tokens (expensive at scale)
  • Rate limits: 10K requests/min ceiling
  • No price negotiation leverage
  • Single point of failure

Why Rejected: Too risky for production scale.


Alternative 2: Self-Hosted Open Source Models

Evaluated: Host Llama 3, Mistral, etc. on our infrastructure

Pros:

  • No per-token costs (fixed infrastructure cost)
  • Complete control
  • Data privacy (no third-party API calls)
  • No rate limits

Cons:

  • High upfront cost: $50K-100K GPU infrastructure
  • DevOps complexity: Model deployment, monitoring, scaling
  • Quality gap: Open models lag behind proprietary models (GPT-4, Claude)
  • Maintenance burden: Model updates, fine-tuning
  • Time to market: 3-6 months setup time

Why Rejected: Too complex and expensive for current stage. Revisit at 100K+ users/day.


Alternative 3: Custom Fine-Tuned Model

Evaluated: Fine-tune GPT-3.5 or Llama on our domain

Pros:

  • Optimized for MachineAvatars use cases
  • Potentially better quality
  • Lower cost per request

Cons:

  • Data requirements: Need 10K+ high-quality examples
  • Time: 2-3 months to collect data, train, evaluate
  • Maintenance: Retrain as product evolves
  • Still vendor-dependent (OpenAI or Azure ML)

Why Rejected: Too early. We don't have sufficient training data yet. Roadmap for Q3 2025.


Decision Rationale

Why Multi-Provider?

1. Cost Optimization (60% reduction)

| Scenario | Single-Provider Cost | Multi-Provider Cost | Savings |
|---|---|---|---|
| Simple FAQ (70% of queries) | GPT-4: $0.045/query | GPT-3.5: $0.0015/query | 97% |
| Complex reasoning (20%) | GPT-4: $0.045/query | GPT-4: $0.045/query | 0% |
| Code generation (10%) | GPT-4: $0.045/query | Claude: $0.009/query | 80% |
| Weighted Average | $0.0315/query | $0.0126/query | 60% |

Monthly savings at 300K queries: $5,670


2. Risk Mitigation

graph TD
    A[User Query] --> B{Primary Model Available?}
    B -->|Yes| C[GPT-3.5 / GPT-4]
    B -->|No| D{Fallback Available?}
    D -->|Yes| E[Llama / Gemini]
    D -->|No| F[Error Message]

    C --> G[Response]
    E --> G

    style C fill:#4CAF50
    style E fill:#FFC107
    style F fill:#F44336

Uptime improvement:

  • Single provider: 99.0% (OpenAI SLA)
  • Multi-provider with fallback: 99.8% (compound availability; a rough estimate is sketched below)
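
For reference, the 99.8% figure is consistent with treating the primary and fallback paths as roughly independent; the fallback's effective coverage below is an assumed number, not a measurement:

# Back-of-the-envelope compound availability (assumed inputs, illustration only)
primary_availability = 0.99   # single-provider SLA quoted above
fallback_coverage = 0.80      # assumed fraction of primary outages the fallback absorbs

combined = 1 - (1 - primary_availability) * (1 - fallback_coverage)
print(f"Combined availability: {combined:.1%}")  # -> 99.8%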

3. Feature Access

| Feature | Provider | Use Case |
|---|---|---|
| Long context (128K) | GPT-4o Mini, Grok-3 | Document analysis |
| Multimodal | Gemini 2.0 Flash | Future: Image/video understanding |
| Code generation | Claude 3.5 Sonnet | Developer tools |
| Real-time info | Grok-3 | Current events queries |
| Reasoning | o1-mini, DeepSeek R1 | Complex problem-solving |

4. Negotiation Leverage

With multiple providers, we can:

  • Negotiate better rates
  • Switch if pricing becomes unfavorable
  • Leverage competition for better SLAs

Consequences

Positive Consequences

  • Cost Reduction: 60% lower LLM costs ($5,670/month savings)
  • Improved Reliability: 99.8% uptime vs. 99.0%
  • Feature Diversity: Access to multimodal, long-context, and real-time capabilities
  • Flexibility: Can route queries to the optimal model
  • Future-Proof: Easy to add new providers (e.g., Anthropic Claude Opus)
  • Negotiation Power: Avoid vendor lock-in

Negative Consequences

  • Implementation Complexity: 10 model integrations vs. 1
  • Testing Burden: Must test all models
  • API Inconsistency: Different response formats require normalization
  • Monitoring Complexity: Track metrics across 10 endpoints
  • Token Counting: Different tokenizers (GPT vs. Claude vs. Gemini)

Mitigation Strategies

For Complexity:

  • Unified abstraction layer in llm-model-service (Port 8016)
  • Standardized response format (see the sketch below)
  • Single API endpoint for all models
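
As an illustration of what the standardized response format can look like, a minimal normalization sketch; the NormalizedResponse fields and the OpenAI-style payload mapping are assumptions, not the service's actual schema:

# Illustrative normalized response envelope (field names are assumed)
from dataclasses import dataclass

@dataclass
class NormalizedResponse:
    model: str              # key from AVAILABLE_MODELS
    content: str            # generated text
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

def normalize_openai(model: str, raw: dict, latency_ms: float) -> NormalizedResponse:
    """Map an OpenAI-style chat completion payload to the shared envelope."""
    usage = raw.get("usage", {})
    return NormalizedResponse(
        model=model,
        content=raw["choices"][0]["message"]["content"],
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        latency_ms=latency_ms,
    )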

For Monitoring:

  • Centralized logging (DataDog, Loki)
  • Cost tracking per model (sketched below)
  • Performance dashboards
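
Per-model cost tracking can be as simple as pricing each call from a static table; the prices below mirror the performance table later in this document, and the logger name is illustrative:

# Illustrative per-model cost tracking (prices per 1K tokens; logger name assumed)
import logging

PRICE_PER_1K_TOKENS = {
    "openai-4": 0.045,
    "openai-35": 0.0015,
    "llama": 0.0003,
}

logger = logging.getLogger("llm-cost")

def record_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Log and return the estimated cost of a single LLM call."""
    total_tokens = prompt_tokens + completion_tokens
    cost = (total_tokens / 1000) * PRICE_PER_1K_TOKENS.get(model, 0.0)
    logger.info("model=%s tokens=%d cost_usd=%.6f", model, total_tokens, cost)
    return cost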

Implementation Details

Service Architecture

Frontend → Gateway (8000) → LLM Model Service (8016) → {
    Azure OpenAI (GPT-4, GPT-3.5, GPT-4o-mini, o1-mini)
    Azure ML (Llama, DeepSeek, Ministral, Phi)
    Google (Gemini)
    Anthropic (Claude)
    xAI (Grok)
}

Code Example

# llm-model-service/src/main.py
from fastapi import FastAPI

# MessageRequest and the provider-specific call_* helpers are defined
# elsewhere in this module and are omitted here.
app = FastAPI()

@app.post("/call-model/{model_name}")
def call_model(model_name: str, request: MessageRequest):
    """
    Unified endpoint for all LLM models: dispatches the request to the
    provider client that matches the model key from AVAILABLE_MODELS.
    """
    # Convert Pydantic message objects to plain dicts for the provider SDKs
    messages = [message.dict() for message in request.messages]

    if model_name == "openai-4":
        return call_openai_gpt4(messages)
    elif model_name == "openai-35":
        return call_openai_gpt35(messages)
    elif model_name == "llama":
        return call_llama(messages)
    elif model_name == "Gemini Flash-2.5":
        return call_gemini(messages)
    elif model_name == "Claude sonnet 4":
        return call_claude(messages)
    # ... etc

Configuration

All models use:

  • Temperature: 0.7 (consistent across all)
  • Retry Logic: 3 attempts with exponential backoff
  • Timeout: 30 seconds
  • Error Handling: Fallback to GPT-3.5 on failure (retry and fallback are sketched below)
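
A minimal sketch of that retry-and-fallback policy, reusing the hypothetical call_llm_service() client from the fallback sketch earlier in this document; delays and helper names are assumptions:

# Sketch of the retry/backoff policy described above (helper names assumed)
import time

def call_with_retry(model_name: str, messages: list[dict],
                    attempts: int = 3, base_delay: float = 1.0) -> dict:
    """Retry a model up to `attempts` times with exponential backoff,
    then fall back to GPT-3.5 as the final error-handling step."""
    for attempt in range(attempts):
        try:
            return call_llm_service(model_name, messages, timeout=30)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s between attempts
    # Final fallback per the error-handling policy above
    return call_llm_service("openai-35", messages, timeout=30)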

Compliance & Security

Data Privacy:

  • All API calls encrypted (HTTPS)
  • No training on our data (contractual guarantees with all providers)
  • User data anonymized before sending to LLMs

GDPR/DPDPA:

  • Data Processing Agreements (DPAs) signed with:
      • Microsoft (Azure OpenAI)
      • Google (Vertex AI/Gemini)
      • Anthropic (Claude)
      • xAI (Grok)

IP Ownership:

  • We own all user interactions
  • Providers have no claim to generated content
  • Clear terms in all vendor agreements

Performance Metrics

| Model | p50 Latency | p95 Latency | Cost/1K tokens | Quality (internal eval) |
|---|---|---|---|---|
| GPT-4-0613 | 2.5s | 4.5s | $0.045 | 95/100 |
| GPT-3.5 Turbo | 0.8s | 1.5s | $0.0015 | 85/100 |
| GPT-4o Mini | 1.2s | 2.8s | $0.0004 | 90/100 |
| Llama 3.3 70B | 1.5s | 3.0s | $0.0003 | 88/100 |
| Claude 3.5 Sonnet | 2.0s | 3.5s | $0.009 | 96/100 |
| Gemini 2.0 Flash | 0.9s | 1.8s | $0.00005 | 89/100 |

Migration Path

If we need to change strategy:

Scenario 1: Return to Single Provider

If multi-provider proves too complex:

  1. Migrate all queries to GPT-4o Mini (best cost/quality)
  2. Keep abstraction layer for future flexibility
  3. Decommission alternative providers
  4. Estimated time: 2 weeks

Scenario 2: Shift to Self-Hosted

If costs are still too high at massive scale:

  1. Deploy Llama 3.3 70B on Azure GPU VMs
  2. Route 80% of queries to self-hosted
  3. Keep GPT-4 for complex 20%
  4. Estimated time: 3 months

Review Schedule

Next Review: 2025-03-31 (Q1 2025 end)

Review Criteria:

  • Cost per query < $0.015 target
  • Uptime > 99.5%
  • User satisfaction > 4.0/5
  • New providers evaluated (e.g., Anthropic Opus, Cohere)

Related ADRs

  • ADR-003: Vector Database (Milvus) - Related AI infrastructure
  • ADR-009: Embedding Model Selection (planned)
  • ADR-010: TTS Provider Selection (planned)

Evidence & Benchmarks

Internal Benchmarks (Sept 2024):

  • Tested all 10 models on 500 sample queries
  • Measured latency, quality (human eval), cost
  • GPT-3.5 rated 4.2/5 (sufficient for 70% of queries)
  • GPT-4 rated 4.8/5 (needed for complex 30%)
  • Llama 3.3 rated 4.1/5 (good fallback)

Last Updated: 2025-12-26
Review Date: 2025-03-31
Status: Active and performing well


"Diversification isn't just for investments—it's critical for AI infrastructure."