ADR-001: Multi-Provider LLM Strategy¶
Status: ✅ Accepted
Date: 2024-09-15 (Q3 2024)
Decision Makers: CTO, ML Engineering Lead
Consulted: Backend Team, Finance Team
Informed: Engineering Team, Product Team
Context¶
MachineAvatars is an AI-powered conversational platform that heavily relies on Large Language Models (LLMs) for generating responses. The initial implementation used a single LLM provider (OpenAI GPT-4), but we faced several challenges:
Challenges with Single Provider:
- Vendor Lock-in Risk: Complete dependence on OpenAI's availability and pricing
- Cost Escalation: GPT-4 costs ($0.03/1K input, $0.06/1K output) were becoming prohibitive at scale
- Rate Limiting: Hit rate limits during peak usage, causing service degradation
- Feature Limitations: No access to multimodal capabilities or real-time information
- Regional Availability: Limited availability in certain regions
Business Requirements:
- Support 10,000+ conversations/day
- Maintain response quality while reducing costs
- Ensure 99.5% uptime
- Support diverse use cases (simple FAQ, complex reasoning, code generation)
- Future-proof against vendor changes
Decision¶
We adopted a multi-provider LLM strategy with 11 models across 5 providers:
Implementation Architecture¶
```python
# LLM Model Service (Port 8016)
AVAILABLE_MODELS = {
    # Azure OpenAI (Primary)
    "openai-4": "gpt-4-0613",
    "openai-35": "gpt-35-turbo-16k-0613",
    "openai-4o-mini": "gpt-4o-mini-2024-07-18",
    "gpto1-mini": "o1-mini-2024-09-12",
    # Azure ML Endpoints (Cost-effective)
    "llama": "Llama 3.3 70B Instruct",
    "deepseek": "DeepSeek R1",
    "ministral": "Ministral 3B",
    "phi": "Phi-3 Small 8K Instruct",
    # External APIs (Specialized)
    "Gemini Flash-2.5": "gemini-2.0-flash",
    "Claude sonnet 4": "claude-3-5-sonnet-20241022",
    "Grok-3": "grok-3",
}
```
Routing Strategy¶
User-Configurable (Current):
- Users select preferred model per chatbot
- Default: GPT-3.5 Turbo 16K (cost-effective)
Planned (Q1 2025):
- Automatic routing based on query complexity
- Fallback chain for reliability
- Cost optimization via intelligent routing
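A minimal sketch of what this planned routing could look like. The thresholds and the `call_model_endpoint` helper are illustrative assumptions, not shipped code:

```python
# Illustrative sketch of the planned Q1 2025 routing.
# Thresholds and call_model_endpoint are assumptions, not the current service.
FALLBACK_CHAIN = ["openai-35", "llama", "Gemini Flash-2.5"]

def route_query(complexity_score: float) -> str:
    """Map an estimated query complexity (0-1) to a model key."""
    if complexity_score > 0.8:
        return "openai-4"        # complex reasoning (~20% of traffic)
    if complexity_score > 0.5:
        return "openai-4o-mini"  # mid-tier
    return "openai-35"           # simple FAQ (~70% of traffic)

def call_with_fallback(messages: list[dict], primary: str) -> dict:
    """Try the primary model, then walk the fallback chain for reliability."""
    last_error = None
    for model in [primary, *FALLBACK_CHAIN]:
        try:
            return call_model_endpoint(model, messages)  # assumed helper
        except Exception as exc:  # provider outage or rate limit
            last_error = exc
    raise RuntimeError("All providers unavailable") from last_error
```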
Alternatives Considered¶
Alternative 1: Single Provider (OpenAI Only)¶
Evaluated: OpenAI GPT-4 Turbo as sole model
Pros:
- Simplest implementation
- Consistent API
- High quality responses
- Strong documentation
Cons:
- ❌ Vendor lock-in: Complete dependence
- ❌ Cost: $0.03-0.06/1K tokens (expensive at scale)
- ❌ Rate limits: 10K requests/min ceiling
- ❌ No price negotiation leverage
- ❌ Single point of failure
Why Rejected: Too risky for production scale
Alternative 2: Self-Hosted Open Source Models¶
Evaluated: Host Llama 3, Mistral, etc. on our infrastructure
Pros:
- No per-token costs (fixed infrastructure cost)
- Complete control
- Data privacy (no third-party API calls)
- No rate limits
Cons:
- ❌ High upfront cost: $50K-100K GPU infrastructure
- ❌ DevOps complexity: Model deployment, monitoring, scaling
- ❌ Quality gap: Open models lag proprietary (GPT-4, Claude)
- ❌ Maintenance burden: Model updates, fine-tuning
- ❌ Time to market: 3-6 months setup time
Why Rejected: Too complex and expensive for current stage. Revisit at 100K+ users/day.
Alternative 3: Custom Fine-Tuned Model¶
Evaluated: Fine-tune GPT-3.5 or Llama on our domain
Pros:
- Optimized for MachineAvatars use cases
- Potentially better quality
- Lower cost per request
Cons:
- ❌ Data requirements: Need 10K+ high-quality examples
- ❌ Time: 2-3 months to collect data, train, evaluate
- ❌ Maintenance: Retrain as product evolves
- ❌ Still vendor-dependent (OpenAI or Azure ML)
Why Rejected: Too early. We don't have sufficient training data yet. Roadmap for Q3 2025.
Decision Rationale¶
Why Multi-Provider?¶
1. Cost Optimization (~76% reduction)
| Scenario | Single Provider Cost | Multi-Provider Cost | Savings |
|---|---|---|---|
| Simple FAQ (70% of queries) | GPT-4: $0.045/query | GPT-3.5: $0.0015/query | 97% |
| Complex reasoning (20%) | GPT-4: $0.045/query | GPT-4: $0.045/query | 0% |
| Code generation (10%) | GPT-4: $0.045/query | Claude: $0.009/query | 80% |
| Weighted Average | $0.045/query | $0.011/query | 76% |
Monthly savings at 300K queries: ~$10,200
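The weighted averages follow directly from the traffic mix in the table:

```python
# Weighted cost per query, computed from the table above.
single = 0.70 * 0.045  + 0.20 * 0.045 + 0.10 * 0.045   # = $0.045
multi  = 0.70 * 0.0015 + 0.20 * 0.045 + 0.10 * 0.009   # = $0.01095
savings = 1 - multi / single                            # ≈ 0.76 (76%)
monthly = (single - multi) * 300_000                    # ≈ $10,215/month
```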
2. Risk Mitigation
```mermaid
graph TD
    A[User Query] --> B{Primary Model Available?}
    B -->|Yes| C[GPT-3.5 / GPT-4]
    B -->|No| D{Fallback Available?}
    D -->|Yes| E[Llama / Gemini]
    D -->|No| F[Error Message]
    C --> G[Response]
    E --> G
    style C fill:#4CAF50
    style E fill:#FFC107
    style F fill:#F44336
```
Uptime improvement:
- Single provider: 99.0% (OpenAI SLA)
- Multi-provider with fallback: 99.8% (compound availability)
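As a rough sanity check, if two providers each offer 99.0% availability and fail independently, the combined availability is

$$A = 1 - (1 - 0.99)^2 = 99.99\%$$

The 99.8% figure quoted above is deliberately more conservative, leaving room for partially correlated outages and failover switching time.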
3. Feature Access
| Feature | Provider | Use Case |
|---|---|---|
| Long context (128K) | GPT-4o Mini, Grok-3 | Document analysis |
| Multimodal | Gemini 2.0 Flash | Future: Image/video understanding |
| Code generation | Claude 3.5 Sonnet | Developer tools |
| Real-time info | Grok-3 | Current events queries |
| Reasoning | o1-mini, DeepSeek R1 | Complex problem-solving |
4. Negotiation Leverage
With multiple providers, we can:
- Negotiate better rates
- Switch if pricing becomes unfavorable
- Leverage competition for better SLAs
Consequences¶
Positive Consequences¶
✅ Cost Reduction: ~76% lower LLM costs (~$10,200/month savings)
✅ Improved Reliability: 99.8% uptime vs. 99.0%
✅ Feature Diversity: Access to multimodal, long-context, real-time capabilities
✅ Flexibility: Can route queries to optimal model
✅ Future-Proof: Easy to add new providers (e.g., Anthropic Claude Opus)
✅ Negotiation Power: Avoid vendor lock-in
Negative Consequences¶
❌ Implementation Complexity: 11 model integrations vs. 1
❌ Testing Burden: Must test all models
❌ API Inconsistency: Different response formats require normalization
❌ Monitoring Complexity: Track metrics across 11 endpoints
❌ Token Counting: Different tokenizers (GPT vs. Claude vs. Gemini)
Mitigation Strategies¶
For Complexity:
- Unified abstraction layer in `llm-model-service` (Port 8016)
- Standardized response format
- Single API endpoint for all models
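One possible shape for that standardized format (field names here are illustrative assumptions, not the service's actual contract):

```python
# Hypothetical normalized response model -- field names are assumptions.
from pydantic import BaseModel

class LLMResponse(BaseModel):
    model: str              # key from AVAILABLE_MODELS
    content: str            # generated text, normalized across providers
    prompt_tokens: int      # counted with the provider's own tokenizer
    completion_tokens: int
    latency_ms: float
```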
For Monitoring:
- Centralized logging (DataDog, Loki)
- Cost tracking per model
- Performance dashboards
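Per-model cost tracking can be a simple lookup keyed on the pricing table. A sketch, where the blended per-1K rates and logger wiring are assumptions:

```python
import logging

logger = logging.getLogger("llm_cost")

# Blended $/1K-token rates from the performance table; real tracking would
# split input and output rates per provider.
COST_PER_1K = {"openai-4": 0.045, "openai-35": 0.0015, "llama": 0.0003}

def record_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Log the estimated cost of one call for the per-model dashboard."""
    cost = (prompt_tokens + completion_tokens) / 1000 * COST_PER_1K[model]
    logger.info("model=%s cost_usd=%.6f", model, cost)
    return cost
```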
Implementation Details¶
Service Architecture¶
```
Frontend → Gateway (8000) → LLM Model Service (8016) → {
    Azure OpenAI (GPT-4, GPT-3.5, GPT-4o-mini, o1-mini)
    Azure ML (Llama, DeepSeek, Ministral, Phi)
    Google (Gemini)
    Anthropic (Claude)
    xAI (Grok)
}
```
Code Example¶
```python
# llm-model-service/src/main.py
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Request schema assumed from usage; the real service defines its own models.
class Message(BaseModel):
    role: str
    content: str

class MessageRequest(BaseModel):
    messages: list[Message]

@app.post("/call-model/{model_name}")
def call_model(model_name: str, request: MessageRequest):
    """Unified endpoint for all LLM models."""
    messages = [message.dict() for message in request.messages]
    # Provider-specific helpers (call_openai_gpt4, etc.) live elsewhere
    # in the service.
    if model_name == "openai-4":
        return call_openai_gpt4(messages)
    elif model_name == "openai-35":
        return call_openai_gpt35(messages)
    elif model_name == "llama":
        return call_llama(messages)
    elif model_name == "Gemini Flash-2.5":
        return call_gemini(messages)
    elif model_name == "Claude sonnet 4":
        return call_claude(messages)
    # ... etc
```
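Calling the unified endpoint then looks the same regardless of provider. The payload shape below is inferred from `MessageRequest` and may differ from the actual schema:

```python
import requests

# Example request against the unified endpoint.
resp = requests.post(
    "http://localhost:8016/call-model/openai-35",
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)
print(resp.json())
```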
Configuration¶
All models use:
- Temperature: 0.7 (consistent across all)
- Retry Logic: 3 attempts with exponential backoff
- Timeout: 30 seconds
- Error Handling: Fallback to GPT-3.5 on failure
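A minimal sketch of the retry behavior described above; the real service may use a library such as tenacity, so this is illustrative only:

```python
import time

def call_with_retry(fn, *args, attempts: int = 3, base_delay: float = 1.0):
    """Retry a provider call with exponential backoff: 1s, 2s, 4s."""
    for i in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if i == attempts - 1:
                raise  # caller then falls back to GPT-3.5
            time.sleep(base_delay * 2 ** i)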
Compliance & Security¶
Data Privacy:
- All API calls encrypted (HTTPS)
- No training on our data (contractual guarantees with all providers)
- User data anonymized before sending to LLMs
GDPR/DPDPA:
- Data Processing Agreements (DPAs) signed with:
- Microsoft (Azure OpenAI)
- Google (Vertex AI/Gemini)
- Anthropic (Claude)
- xAI (Grok)
IP Ownership:
- We own all user interactions
- Providers have no claim to generated content
- Clear terms in all vendor agreements
Performance Metrics¶
| Model | p50 Latency | p95 Latency | Cost/1K tokens | Quality (internal eval) |
|---|---|---|---|---|
| GPT-4-0613 | 2.5s | 4.5s | $0.045 | 95/100 |
| GPT-3.5 Turbo | 0.8s | 1.5s | $0.0015 | 85/100 |
| GPT-4o Mini | 1.2s | 2.8s | $0.0004 | 90/100 |
| Llama 3.3 70B | 1.5s | 3.0s | $0.0003 | 88/100 |
| Claude 3.5 Sonnet | 2.0s | 3.5s | $0.009 | 96/100 |
| Gemini 2.0 Flash | 0.9s | 1.8s | $0.00005 | 89/100 |
Migration Path¶
If we need to change strategy:
Scenario 1: Return to Single Provider¶
If multi-provider proves too complex:
- Migrate all queries to GPT-4o Mini (best cost/quality)
- Keep abstraction layer for future flexibility
- Decommission alternative providers
- Estimated time: 2 weeks
Scenario 2: Shift to Self-Hosted¶
If costs still too high at massive scale:
- Deploy Llama 3.3 70B on Azure GPU VMs
- Route 80% of queries to self-hosted
- Keep GPT-4 for complex 20%
- Estimated time: 3 months
Review Schedule¶
Next Review: 2025-03-31 (Q1 2025 end)
Review Criteria:
- Cost per query < $0.015 target
- Uptime > 99.5%
- User satisfaction > 4.0/5
- New providers evaluated (e.g., Anthropic Opus, Cohere)
Related ADRs¶
- ADR-003: Vector Database (Milvus) - Related AI infrastructure
- ADR-009: Embedding Model Selection (planned)
- ADR-010: TTS Provider Selection (planned)
Evidence & Benchmarks¶
Internal Benchmarks (Sept 2024):
- Tested all 11 models on 500 sample queries
- Measured latency, quality (human eval), cost
- GPT-3.5 rated 4.2/5 (sufficient for 70% of queries)
- GPT-4 rated 4.8/5 (needed for complex 30%)
- Llama 3.3 rated 4.1/5 (good fallback)
Last Updated: 2025-12-26
Review Date: 2025-03-31
Status: Active and performing well
"Diversification isn't just for investments—it's critical for AI infrastructure."