ADR-001: Multi-Provider LLM Strategy¶
Status: ✅ Accepted
Date: 2024-09-15 (Q3 2024)
Decision Makers: CTO, ML Engineering Lead
Consulted: Backend Team, Finance Team
Informed: Engineering Team, Product Team
Context¶
MachineAvatars is an AI-powered conversational platform that heavily relies on Large Language Models (LLMs) for generating responses. The initial implementation used a single LLM provider (OpenAI GPT-4), but we faced several challenges:
Challenges with Single Provider:
- Vendor Lock-in Risk: Complete dependence on OpenAI's availability and pricing
- Cost Escalation: GPT-4 costs ($0.03/1K input, $0.06/1K output) were becoming prohibitive at scale
- Rate Limiting: Hit rate limits during peak usage, causing service degradation
- Feature Limitations: No access to multimodal capabilities or real-time information
- Regional Availability: Limited availability in certain regions
Business Requirements:
- Support 10,000+ conversations/day
- Maintain response quality while reducing costs
- Ensure 99.5% uptime
- Support diverse use cases (simple FAQ, complex reasoning, code generation)
- Future-proof against vendor changes
Decision¶
We adopted a multi-provider LLM strategy with 11 models across 5 providers:
Implementation Architecture¶
```python
# LLM Model Service (Port 8016)
AVAILABLE_MODELS = {
    # Azure OpenAI (Primary)
    "openai-4": "gpt-4-0613",
    "openai-35": "gpt-35-turbo-16k-0613",
    "openai-4o-mini": "gpt-4o-mini-2024-07-18",
    "gpto1-mini": "o1-mini-2024-09-12",
    # Azure ML Endpoints (Cost-effective)
    "llama": "Llama 3.3 70B Instruct",
    "deepseek": "DeepSeek R1",
    "ministral": "Ministral 3B",
    "phi": "Phi-3 Small 8K Instruct",
    # External APIs (Specialized)
    "Gemini Flash-2.5": "gemini-2.0-flash",
    "Claude sonnet 4": "claude-3-5-sonnet-20241022",
    "Grok-3": "grok-3",
}
```
Routing Strategy¶
User-Configurable (Current):
- Users select preferred model per chatbot
- Default: GPT-3.5 Turbo 16K (cost-effective)
Planned (Q1 2025):
- Automatic routing based on query complexity
- Fallback chain for reliability
- Cost optimization via intelligent routing
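A minimal sketch of what this planned routing could look like. The thresholds and the `call_model_endpoint` helper are illustrative assumptions, not shipped code:

```python
# Illustrative sketch of the planned Q1 2025 routing.
# Thresholds and call_model_endpoint are assumptions, not the current service.
FALLBACK_CHAIN = ["openai-35", "llama", "Gemini Flash-2.5"]

def route_query(complexity_score: float) -> str:
    """Map an estimated query complexity (0-1) to a model key."""
    if complexity_score > 0.8:
        return "openai-4"        # complex reasoning (~20% of traffic)
    if complexity_score > 0.5:
        return "openai-4o-mini"  # mid-tier
    return "openai-35"           # simple FAQ (~70% of traffic)

def call_with_fallback(messages: list[dict], primary: str) -> dict:
    """Try the primary model, then walk the fallback chain for reliability."""
    last_error = None
    for model in [primary, *FALLBACK_CHAIN]:
        try:
            return call_model_endpoint(model, messages)  # assumed helper
        except Exception as exc:  # provider outage or rate limit
            last_error = exc
    raise RuntimeError("All providers unavailable") from last_error
```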
Alternatives Considered¶
Alternative 1: Single Provider (OpenAI Only)¶
Evaluated: OpenAI GPT-4 Turbo as sole model
Pros:
- Simplest implementation
- Consistent API
- High quality responses
- Strong documentation
Cons:
- ❌ Vendor lock-in: Complete dependence
- ❌ Cost: $0.03-0.06/1K tokens (expensive at scale)
- ❌ Rate limits: 10K requests/min ceiling
- ❌ No price negotiation leverage
- ❌ Single point of failure
Why Rejected: Too risky for production scale
Alternative 2: Self-Hosted Open Source Models¶
Evaluated: Host Llama 3, Mistral, etc. on our infrastructure
Pros:
- No per-token costs (fixed infrastructure cost)
- Complete control
- Data privacy (no third-party API calls)
- No rate limits
Cons:
- ❌ High upfront cost: $50K-100K GPU infrastructure
- ❌ DevOps complexity: Model deployment, monitoring, scaling
- ❌ Quality gap: Open models lag proprietary (GPT-4, Claude)
- ❌ Maintenance burden: Model updates, fine-tuning
- ❌ Time to market: 3-6 months setup time
Why Rejected: Too complex and expensive for current stage. Revisit at 100K+ users/day.
Alternative 3: Custom Fine-Tuned Model¶
Evaluated: Fine-tune GPT-3.5 or Llama on our domain
Pros:
- Optimized for MachineAvatars use cases
- Potentially better quality
- Lower cost per request
Cons:
- ❌ Data requirements: Need 10K+ high-quality examples
- ❌ Time: 2-3 months to collect data, train, evaluate
- ❌ Maintenance: Retrain as product evolves
- ❌ Still vendor-dependent (OpenAI or Azure ML)
Why Rejected: Too early. We don't have sufficient training data yet. Roadmap for Q3 2025.
Decision Rationale¶
Why Multi-Provider?¶
1. Cost Optimization (~76% reduction)
| Scenario | Single Provider Cost | Multi-Provider Cost | Savings |
|---|---|---|---|
| Simple FAQ (70% of queries) | GPT-4: $0.045/query | GPT-3.5: $0.0015/query | 97% |
| Complex reasoning (20%) | GPT-4: $0.045/query | GPT-4: $0.045/query | 0% |
| Code generation (10%) | GPT-4: $0.045/query | Claude: $0.009/query | 80% |
| Weighted Average | $0.045/query | $0.011/query | 76% |
Monthly savings at 300K queries: ~$10,200
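The weighted averages follow directly from the traffic mix in the table:

```python
# Weighted cost per query, computed from the table above.
single = 0.70 * 0.045  + 0.20 * 0.045 + 0.10 * 0.045   # = $0.045
multi  = 0.70 * 0.0015 + 0.20 * 0.045 + 0.10 * 0.009   # = $0.01095
savings = 1 - multi / single                            # ≈ 0.76 (76%)
monthly = (single - multi) * 300_000                    # ≈ $10,215/month
```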
2. Risk Mitigation
```mermaid
graph TD
    A[User Query] --> B{Primary Model Available?}
    B -->|Yes| C[GPT-3.5 / GPT-4]
    B -->|No| D{Fallback Available?}
    D -->|Yes| E[Llama / Gemini]
    D -->|No| F[Error Message]
    C --> G[Response]
    E --> G
    style C fill:#4CAF50
    style E fill:#FFC107
    style F fill:#F44336
```
Uptime improvement:
- Single provider: 99.0% (OpenAI SLA)
- Multi-provider with fallback: 99.8% (compound availability)
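As a rough sanity check, if two providers each offer 99.0% availability and fail independently, the combined availability is

$$A = 1 - (1 - 0.99)^2 = 99.99\%$$

The 99.8% figure quoted above is deliberately more conservative, leaving room for partially correlated outages and failover switching time.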
3. Feature Access
| Feature | Provider | Use Case |
|---|---|---|
| Long context (128K) | GPT-4o Mini, Grok-3 | Document analysis |
| Multimodal | Gemini 2.0 Flash | Future: Image/video understanding |
| Code generation | Claude 3.5 Sonnet | Developer tools |
| Real-time info | Grok-3 | Current events queries |
| Reasoning | o1-mini, DeepSeek R1 | Complex problem-solving |
4. Negotiation Leverage
With multiple providers, we can:
- Negotiate better rates
- Switch if pricing becomes unfavorable
- Leverage competition for better SLAs
Consequences¶
Positive Consequences¶
✅ Cost Reduction: ~76% lower LLM costs (~$10,200/month savings)
✅ Improved Reliability: 99.8% uptime vs. 99.0%
✅ Feature Diversity: Access to multimodal, long-context, real-time capabilities
✅ Flexibility: Can route queries to optimal model
✅ Future-Proof: Easy to add new providers (e.g., Anthropic Claude Opus)
✅ Negotiation Power: Avoid vendor lock-in
Negative Consequences¶
❌ Implementation Complexity: 11 model integrations vs. 1
❌ Testing Burden: Must test all models
❌ API Inconsistency: Different response formats require normalization
❌ Monitoring Complexity: Track metrics across 11 endpoints
❌ Token Counting: Different tokenizers (GPT vs. Claude vs. Gemini)
Mitigation Strategies¶
For Complexity:
- Unified abstraction layer in `llm-model-service` (Port 8016)
- Standardized response format
- Single API endpoint for all models
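One possible shape for that standardized format (field names here are illustrative assumptions, not the service's actual contract):

```python
# Hypothetical normalized response model -- field names are assumptions.
from pydantic import BaseModel

class LLMResponse(BaseModel):
    model: str              # key from AVAILABLE_MODELS
    content: str            # generated text, normalized across providers
    prompt_tokens: int      # counted with the provider's own tokenizer
    completion_tokens: int
    latency_ms: float
```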
For Monitoring:
- Centralized logging (DataDog, Loki)
- Cost tracking per model
- Performance dashboards
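Per-model cost tracking can be a simple lookup keyed on the pricing table. A sketch, where the blended per-1K rates and logger wiring are assumptions:

```python
import logging

logger = logging.getLogger("llm_cost")

# Blended $/1K-token rates from the performance table; real tracking would
# split input and output rates per provider.
COST_PER_1K = {"openai-4": 0.045, "openai-35": 0.0015, "llama": 0.0003}

def record_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Log the estimated cost of one call for the per-model dashboard."""
    cost = (prompt_tokens + completion_tokens) / 1000 * COST_PER_1K[model]
    logger.info("model=%s cost_usd=%.6f", model, cost)
    return cost
```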
Implementation Details¶
Service Architecture¶
```
Frontend → Gateway (8000) → LLM Model Service (8016) → {
    Azure OpenAI (GPT-4, GPT-3.5, GPT-4o-mini, o1-mini)
    Azure ML (Llama, DeepSeek, Ministral, Phi)
    Google (Gemini)
    Anthropic (Claude)
    xAI (Grok)
}
```
Code Example¶
```python
# llm-model-service/src/main.py
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Request schema assumed from usage; the real service defines its own models.
class Message(BaseModel):
    role: str
    content: str

class MessageRequest(BaseModel):
    messages: list[Message]

@app.post("/call-model/{model_name}")
def call_model(model_name: str, request: MessageRequest):
    """Unified endpoint for all LLM models."""
    messages = [message.dict() for message in request.messages]
    # Provider-specific helpers (call_openai_gpt4, etc.) live elsewhere
    # in the service.
    if model_name == "openai-4":
        return call_openai_gpt4(messages)
    elif model_name == "openai-35":
        return call_openai_gpt35(messages)
    elif model_name == "llama":
        return call_llama(messages)
    elif model_name == "Gemini Flash-2.5":
        return call_gemini(messages)
    elif model_name == "Claude sonnet 4":
        return call_claude(messages)
    # ... etc
```
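Calling the unified endpoint then looks the same regardless of provider. The payload shape below is inferred from `MessageRequest` and may differ from the actual schema:

```python
import requests

# Example request against the unified endpoint.
resp = requests.post(
    "http://localhost:8016/call-model/openai-35",
    json={"messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)
print(resp.json())
```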
Configuration¶
All models use:
- Temperature: 0.7 (consistent across all)
- Retry Logic: 3 attempts with exponential backoff
- Timeout: 30 seconds
- Error Handling: Fallback to GPT-3.5 on failure
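A minimal sketch of the retry behavior described above; the real service may use a library such as tenacity, so this is illustrative only:

```python
import time

def call_with_retry(fn, *args, attempts: int = 3, base_delay: float = 1.0):
    """Retry a provider call with exponential backoff: 1s, 2s, 4s."""
    for i in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if i == attempts - 1:
                raise  # caller then falls back to GPT-3.5
            time.sleep(base_delay * 2 ** i)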
Compliance & Security¶
Data Privacy:
- All API calls encrypted (HTTPS)
- No training on our data (contractual guarantees with all providers)
- User data anonymized before sending to LLMs
GDPR/DPDPA:
- Data Processing Agreements (DPAs) signed with:
- Microsoft (Azure OpenAI)
- Google (Vertex AI/Gemini)
- Anthropic (Claude)
- xAI (Grok)
IP Ownership:
- We own all user interactions
- Providers have no claim to generated content
- Clear terms in all vendor agreements
Performance Metrics¶
| Model | p50 Latency | p95 Latency | Cost/1K tokens | Quality (internal eval) |
|---|---|---|---|---|
| GPT-4-0613 | 2.5s | 4.5s | $0.045 | 95/100 |
| GPT-3.5 Turbo | 0.8s | 1.5s | $0.0015 | 85/100 |
| GPT-4o Mini | 1.2s | 2.8s | $0.0004 | 90/100 |
| Llama 3.3 70B | 1.5s | 3.0s | $0.0003 | 88/100 |
| Claude 3.5 Sonnet | 2.0s | 3.5s | $0.009 | 96/100 |
| Gemini 2.0 Flash | 0.9s | 1.8s | $0.00005 | 89/100 |
Migration Path¶
If we need to change strategy:
Scenario 1: Return to Single Provider¶
If multi-provider proves too complex:
- Migrate all queries to GPT-4o Mini (best cost/quality)
- Keep abstraction layer for future flexibility
- Decommission alternative providers
- Estimated time: 2 weeks
Scenario 2: Shift to Self-Hosted¶
If costs still too high at massive scale:
- Deploy Llama 3.3 70B on Azure GPU VMs
- Route 80% of queries to self-hosted
- Keep GPT-4 for complex 20%
- Estimated time: 3 months
Review Schedule¶
Next Review: 2025-03-31 (Q1 2025 end)
Review Criteria:
- Cost per query < $0.015 target
- Uptime > 99.5%
- User satisfaction > 4.0/5
- New providers evaluated (e.g., Anthropic Opus, Cohere)
Related ADRs¶
- ADR-003: Vector Database (Milvus) - Related AI infrastructure
- ADR-009: Embedding Model Selection (planned)
- ADR-010: TTS Provider Selection (planned)
Evidence & Benchmarks¶
Internal Benchmarks (Sept 2024):
- Tested all 11 models on 500 sample queries
- Measured latency, quality (human eval), cost
- GPT-3.5 rated 4.2/5 (sufficient for 70% of queries)
- GPT-4 rated 4.8/5 (needed for complex 30%)
- Llama 3.3 rated 4.1/5 (good fallback)
Last Updated: 2025-12-26
Review Date: 2025-03-31
Status: Active and performing well
"Diversification isn't just for investments—it's critical for AI infrastructure."