Model Selection & Routing Strategy¶
Purpose: Complete guide to LLM model selection, routing logic, and parameter tuning
Audience: Backend Engineers, ML Engineers, DevOps
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0
Model Roster¶
Complete Model Inventory¶
MachineAvatars supports 11 LLM models across 5 providers:
Azure OpenAI Models¶
| Model ID | Deployment Name | Context | Temp | Use Case |
|---|---|---|---|---|
| openai-4 | gpt-4-0613 | 8K | 0.7 | Complex reasoning, high-quality responses |
| openai-35 | gpt-35-turbo-16k-0613 | 16K | 0.7 | Fast, cost-effective general queries |
| openai-4o-mini | gpt-4o-mini-2024-07-18 | 128K | 0.7 | Long context, balanced cost/performance |
| gpto1-mini | o1-mini-2024-09-12 | 128K | 0.7 | Reasoning-optimized tasks |
Azure ML Endpoint Models¶
| Model ID | Deployment | Context | Temp | Use Case |
|---|---|---|---|---|
| llama | Llama 3.3 70B Instruct | 8K | 0.7 | Open-source alternative, cost-saving |
| deepseek | DeepSeek R1 | 32K | N/A | Chain-of-thought reasoning |
| ministral | Ministral 3B | 8K | N/A | Lightweight, fast responses |
| phi | Phi-3 Small 8K Instruct | 8K | N/A | Edge deployment experiments |
External API Models¶
| Model ID | Provider | Model | Context | Use Case |
|---|---|---|---|---|
| Gemini Flash-2.5 | Google | gemini-2.0-flash | 1M | Multimodal, long context |
| Claude sonnet 4 | Anthropic | claude-3-5-sonnet-20241022 | 200K | Complex analysis, code generation |
| Grok-3 | xAI | grok-3 | 128K | Real-time information, web-connected |
Model Parameters¶
Temperature: 0.7 (Universal)¶
ALL models use temperature = 0.7
# Every model call (client is the provider SDK client; deployment_name selects the model)
response = client.chat.completions.create(
model=deployment_name,
messages=messages,
temperature=0.7 # Consistent across all models
)
Why 0.7?
Temperature controls randomness (0.0 = deterministic, 1.0 = creative):
- 0.0: Completely deterministic (same input → same output)
- 0.3: Low creativity (factual Q&A, data extraction)
- 0.7: Balanced (conversational, helpful, not repetitive) ← Our choice
- 1.0: High creativity (storytelling, brainstorming)
Rationale for 0.7:
- ✅ Conversational variety (not robotic)
- ✅ Still grounded in context
- ✅ Good for chatbot use case
- ✅ Avoids the extra hallucination risk that comes with higher temperatures
Alternative considered: 0.3 (more factual, but too robotic for user experience)
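As a quick illustration of the trade-off, the sketch below (assuming an already-configured Azure OpenAI `client`; the prompt and repeat count are arbitrary) sends the same message at temperature 0.0 and 0.7 and counts distinct outputs:
# Illustrative only: at 0.0 repeated calls usually return identical text; at 0.7 they typically vary.
prompt = [{"role": "user", "content": "Suggest a greeting for a returning customer."}]

for temp in (0.0, 0.7):
    outputs = set()
    for _ in range(3):
        response = client.chat.completions.create(
            model="gpt-35-turbo-16k-0613",
            messages=prompt,
            temperature=temp
        )
        outputs.add(response.choices[0].message.content)
    print(f"temperature={temp}: {len(outputs)} distinct responses")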
Model Selection Logic¶
Default Routing¶
Code Location: response-3d-chatbot-service/src/main.py
# Check chatbot-level model preference
chatbot_config = get_chatbot_config(user_id, project_id)
preferred_model = chatbot_config.get("preferred_model")
if preferred_model:
return call_model(preferred_model, messages)
else:
# Default: Use GPT-3.5 Turbo 16K
return call_openai_35(messages)
Default Model: GPT-3.5 Turbo 16K (fast, cost-effective)
Per-Chatbot Model Configuration¶
Users can set preferred model per chatbot:
# chatbot_selection collection
{
"user_id": "User-123",
"project_id": "Project-456",
"preferred_model": "openai-4", # User choice
"chatbot_purpose": "Custom-Agent",
...
}
Supported Values:
openai-4, openai-35, openai-4o-mini, llama, deepseek, ministral, phi, Gemini Flash-2.5, Claude sonnet 4, Grok-3
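A minimal validation sketch, assuming a helper like the one below (the SUPPORTED_MODELS set and resolve_model name are illustrative, not confirmed code), keeps typos in preferred_model from reaching the routing layer:
SUPPORTED_MODELS = {
    "openai-4", "openai-35", "openai-4o-mini", "llama", "deepseek",
    "ministral", "phi", "Gemini Flash-2.5", "Claude sonnet 4", "Grok-3"
}

def resolve_model(chatbot_config: dict) -> str:
    """Return the stored preference if it is a supported model ID, else the default."""
    preferred = chatbot_config.get("preferred_model")
    if preferred in SUPPORTED_MODELS:
        return preferred
    return "openai-35"  # Default model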
Model Comparison¶
Performance Characteristics¶
| Model | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| GPT-4-0613 | 🐌 Slow (2.5s) | ⭐⭐⭐⭐⭐ | 💰💰💰 High | Complex reasoning, critical accuracy |
| GPT-3.5 Turbo 16K | ⚡ Fast (0.8s) | ⭐⭐⭐ | 💰 Low | General queries, high volume |
| GPT-4o Mini | 🚀 Medium (1.2s) | ⭐⭐⭐⭐ | 💰💰 Medium | Long context, balanced needs |
| o1-mini | 🐌 Slow (3.0s) | ⭐⭐⭐⭐⭐ | 💰💰💰💰 | Reasoning tasks, problem-solving |
| Llama 3.3 70B | 🚀 Medium (1.5s) | ⭐⭐⭐⭐ | 💰 Low | Cost-conscious, no vendor lock-in |
| DeepSeek R1 | 🐌 Slow (2.8s) | ⭐⭐⭐⭐ | 💰💰 | Chain-of-thought reasoning |
| Ministral 3B | ⚡⚡ Very Fast (0.5s) | ⭐⭐ | 💰 | Simple queries, speed critical |
| Phi-3 Small | ⚡⚡ Very Fast (0.6s) | ⭐⭐ | 💰 | Edge deployment, experiments |
| Gemini 2.0 Flash | ⚡ Fast (0.9s) | ⭐⭐⭐⭐ | 💰💰 | Multimodal, long context |
| Claude 3.5 Sonnet | 🚀 Medium (2.0s) | ⭐⭐⭐⭐⭐ | 💰💰💰 | Complex analysis, code |
| Grok-3 | 🚀 Medium (1.8s) | ⭐⭐⭐⭐ | 💰💰💰 | Real-time information |
Cost Comparison (per 1K tokens)¶
| Model | Input Cost | Output Cost | Total (avg) |
|---|---|---|---|
| GPT-4-0613 | $0.03 | $0.06 | $0.045 |
| GPT-3.5 Turbo 16K | $0.001 | $0.002 | $0.0015 |
| GPT-4o Mini | $0.00015 | $0.0006 | $0.0004 ⭐ Cheapest OpenAI |
| o1-mini | $0.003 | $0.012 | $0.0075 |
| Llama 3.3 70B | $0.0003 | $0.0003 | $0.0003 ⭐ Best value |
| Gemini 2.0 Flash | $0.00001875 | $0.000075 | $0.00005 ⭐ Cheapest overall |
| Claude 3.5 Sonnet | $0.003 | $0.015 | $0.009 |
Cost Optimization Strategy:
- Default to GPT-3.5 or Gemini (cheap)
- Escalate to GPT-4 if query complexity detected
- Use Llama for cost-sensitive customers
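A hedged sketch of this policy as code (the tier name and is_complex flag are illustrative assumptions, not an existing interface):
def pick_cost_optimized_model(customer_tier: str, is_complex: bool) -> str:
    """Map the cost-optimization rules above onto model IDs."""
    if customer_tier == "cost-sensitive":
        return "llama"      # Lowest-cost open-source option
    if is_complex:
        return "openai-4"   # Escalate complex queries to GPT-4
    return "openai-35"      # Cheap default (GPT-3.5)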
Routing Strategies¶
Strategy 1: User-Specified (Current)¶
Implementation: User selects model in chatbot settings
Pros:
- ✅ User control
- ✅ Simple implementation
- ✅ Predictable costs
Cons:
- ❌ Users may not know which model to choose
- ❌ Over-paying if always using GPT-4
Code:
if chatbot_config.get("preferred_model"):
model = chatbot_config["preferred_model"]
else:
model = "openai-35" # Default
Strategy 2: Query Complexity-Based (Planned)¶
Implementation: Detect query complexity and route automatically
Heuristics:
def detect_complexity(query: str) -> str:
"""
Route based on query characteristics.
"""
query_lower = query.lower()
word_count = len(query.split())
# Simple FAQ → Fast, cheap model
if word_count < 10:
return "openai-35"
# Code-related → Claude
if any(keyword in query_lower for keyword in ["code", "function", "debug", "API", "syntax"]):
return "Claude sonnet 4"
# Math/reasoning → o1-mini or DeepSeek
if any(keyword in query_lower for keyword in ["calculate", "solve", "problem", "equation"]):
return "openai-gpto1mini"
# Long context (> 500 words) → GPT-4o Mini or Gemini
if word_count > 500:
return "openai-4o-mini"
# Complex reasoning → GPT-4
if any(keyword in query_lower for keyword in ["explain", "analyze", "compare", "why"]):
return "openai-4"
# Default → GPT-3.5
return "openai-35"
Pros:
- ✅ Automatic cost optimization
- ✅ Better user experience (right model for task)
- ✅ Easier for users (no configuration)
Cons:
- ❌ Heuristics may be wrong
- ❌ More complex implementation
Status: Roadmap Q1 2025
Strategy 3: Fallback Chain (Error Handling)¶
Implementation: If primary model fails, try alternatives
def call_with_fallback(messages, primary="openai-4"):
"""
Try primary model, fallback to GPT-3.5 if it fails.
"""
try:
return call_model(primary, messages)
except Exception as e:
logger.warning(f"{primary} failed: {e}, falling back to GPT-3.5")
return call_openai_35(messages)
Fallback Order:
- GPT-4 → GPT-3.5 (if rate limit or error)
- Claude → GPT-4 (if API down)
- Gemini → GPT-3.5 (if quota exceeded)
Status: Implemented in some services
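The pairs above could also be expressed as a configurable chain; the sketch below generalizes call_with_fallback (the FALLBACK_CHAINS mapping and the Claude chain's final GPT-3.5 step are illustrative assumptions):
FALLBACK_CHAINS = {
    "openai-4": ["openai-35"],                      # GPT-4 → GPT-3.5
    "Claude sonnet 4": ["openai-4", "openai-35"],   # Claude → GPT-4 (→ GPT-3.5)
    "Gemini Flash-2.5": ["openai-35"]               # Gemini → GPT-3.5
}

def call_with_fallback_chain(messages, primary: str):
    """Try the primary model, then each configured fallback in order."""
    last_error = None
    for model_id in [primary] + FALLBACK_CHAINS.get(primary, []):
        try:
            return call_model(model_id, messages)
        except Exception as e:
            logger.warning(f"{model_id} failed: {e}, trying next model")
            last_error = e
    raise last_error  # Every model in the chain failed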
Model-Specific Implementation¶
Azure OpenAI Models¶
import os

from openai import AzureOpenAI

AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]

# GPT-4-0613 (azure_endpoint is the resource base URL; the deployment name is passed as `model`)
client_gpt4 = AzureOpenAI(
    azure_endpoint="https://machineagentopenai.openai.azure.com/",
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-02-15-preview"
)
def call_openai_4(messages):
response = client_gpt4.chat.completions.create(
model="gpt-4-0613",
messages=messages,
temperature=0.7
)
return response.choices[0].message.content
Azure ML Endpoints¶
import requests
# Llama 3.3 70B
LLAMA_ENDPOINT = "https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/chat/completions"
LLAMA_API_KEY = "..."
def call_llama(messages):
payload = {
"messages": messages,
"temperature": 0.7,
"max_tokens": 50 # Limit output length
}
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {LLAMA_API_KEY}"
}
response = requests.post(LLAMA_ENDPOINT, json=payload, headers=headers)
response.raise_for_status()
return response.json()["choices"][0]["message"]["content"]
External APIs¶
# Gemini 2.0 Flash
GEMINI_ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=..."
def call_gemini(messages):
    # Transform to Gemini format (roles must be "user" or "model"; "assistant" maps to "model")
    gemini_contents = [{
        "role": "model" if msg["role"] == "assistant" else "user",
        "parts": [{"text": msg["content"]}]
    } for msg in messages]
payload = {
"contents": gemini_contents,
"generationConfig": {
"temperature": 0.7,
"maxOutputTokens": 1000
}
}
response = requests.post(GEMINI_ENDPOINT, json=payload)
response.raise_for_status()
return response.json()["candidates"][0]["content"]["parts"][0]["text"]
Token Management¶
Max Token Limits¶
Different models have different output limits:
MAX_TOKENS = {
"openai-4": 4096,
"openai-35": 4096,
"openai-4o-mini": 16384,
"llama": 8192,
"deepseek": 4096,
"ministral": 2048, # Limited due to Azure ML constraints
"phi": 2048,
"Gemini Flash-2.5": 8192,
"Claude sonnet 4": 4096,
"Grok-3": 4096
}
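A short example of how this table can be applied when building request parameters (the helper name is illustrative):
def build_generation_params(model_id: str, requested_tokens: int) -> dict:
    """Clamp the requested output length to the model's documented limit."""
    limit = MAX_TOKENS.get(model_id, 4096)  # Conservative default for unknown models
    return {"temperature": 0.7, "max_tokens": min(requested_tokens, limit)}

params = build_generation_params("ministral", 4000)  # → {"temperature": 0.7, "max_tokens": 2048}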
Token Counting¶
Code Location: response-3d-chatbot-service/src/main.py
import tiktoken
# Initialize tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base") # For OpenAI models
def count_tokens(text: str) -> int:
"""Count tokens in text."""
return len(tokenizer.encode(text))
# Track usage
input_tokens = count_tokens(input_prompt)
output_tokens = count_tokens(output_response)
total_tokens = input_tokens + output_tokens
logger.info(f"Token usage: {input_tokens} in, {output_tokens} out, {total_tokens} total")
Chat History Management¶
Conversation truncation to fit context window:
def truncate_history(chat_history, max_tokens=6000):
"""
Keep most recent messages that fit in context window.
"""
truncated = []
total_tokens = 0
# Iterate from most recent
for message in reversed(chat_history):
msg_tokens = count_tokens(message['content'])
if total_tokens + msg_tokens > max_tokens:
break
truncated.insert(0, message)
total_tokens += msg_tokens
return truncated
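For example, the truncated history is typically combined with a system prompt and the new user message before the model call (the prompt text and variable names here are illustrative):
system_prompt = {"role": "system", "content": "You are a helpful avatar assistant."}

recent_history = truncate_history(chat_history, max_tokens=6000)
messages = [system_prompt] + recent_history + [{"role": "user", "content": user_query}]
response_text = call_model(preferred_model, messages)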
Error Handling¶
Common Errors¶
| Error | Cause | Solution |
|---|---|---|
| Rate Limit | Too many requests | Exponential backoff, retry |
| Timeout | Model taking too long | Increase timeout, use faster model |
| Invalid API Key | Wrong credentials | Check environment variables |
| Model Not Found | Wrong deployment name | Verify deployment names |
| Content Filter | Unsafe content detected | Sanitize input, handle gracefully |
Retry Logic¶
import time
def call_with_retry(model_func, messages, max_retries=3):
"""
Retry with exponential backoff.
"""
for attempt in range(max_retries):
try:
return model_func(messages)
except Exception as e:
if attempt == max_retries - 1:
raise # Give up
            wait_time = 2 ** attempt  # 1s, 2s, ... (doubles each attempt)
logger.warning(f"Attempt {attempt + 1} failed: {e}, retrying in {wait_time}s")
time.sleep(wait_time)
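Intended usage is to wrap any of the per-model helpers, for example:
# Retry GPT-4 up to 3 times with exponential backoff before giving up.
answer = call_with_retry(call_openai_4, messages, max_retries=3)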
Monitoring & Logging¶
Per-Request Logging¶
logger.info("Calling LLM", extra={
"model": model_name,
"user_id": user_id,
"project_id": project_id,
"input_tokens": input_tokens,
"message_count": len(messages)
})
# After response
logger.info("LLM response received", extra={
"model": model_name,
"output_tokens": output_tokens,
"total_tokens": total_tokens,
"latency_ms": latency_ms,
"cost_usd": calculated_cost
})
Cost Tracking¶
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""
Calculate cost in USD.
"""
COST_TABLE = {
"openai-4": {"input": 0.03, "output": 0.06},
"openai-35": {"input": 0.001, "output": 0.002},
"openai-4o-mini": {"input": 0.00015, "output": 0.0006},
# ... other models
}
rates = COST_TABLE.get(model, {"input": 0, "output": 0})
cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
return cost
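A small end-to-end sketch tying the token counter, cost table, and logging together (variable names are illustrative):
input_tokens = count_tokens(input_prompt)
output_text = call_openai_35(messages)
output_tokens = count_tokens(output_text)

cost = calculate_cost("openai-35", input_tokens, output_tokens)
logger.info(f"openai-35 cost ${cost:.6f} ({input_tokens} in / {output_tokens} out tokens)")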
Future Enhancements¶
Q1 2025:
- Automatic complexity-based routing
- A/B testing framework (compare models on same queries)
- Fine-tuning GPT-3.5 on domain data
Q2 2025:
- Multi-model ensembling (combine responses from multiple models)
- Streaming responses (real-time output)
- Model performance dashboard
Related Documentation¶
Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead
"The right model for every conversation."