Model Selection & Routing Strategy¶
Purpose: Complete guide to LLM model selection, routing logic, and parameter tuning
Audience: Backend Engineers, ML Engineers, DevOps
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0
Model Roster¶
Complete Model Inventory¶
MachineAvatars supports 11 LLM models across 5 providers:
Azure OpenAI Models¶
| Model ID | Deployment Name | Context | Temp | Use Case |
|---|---|---|---|---|
| openai-4 | gpt-4-0613 | 8K | 0.7 | Complex reasoning, high-quality responses |
| openai-35 | gpt-35-turbo-16k-0613 | 16K | 0.7 | Fast, cost-effective general queries |
| openai-4o-mini | gpt-4o-mini-2024-07-18 | 128K | 0.7 | Long context, balanced cost/performance |
| gpto1-mini | o1-mini-2024-09-12 | 128K | 0.7 | Reasoning-optimized tasks |
Azure ML Endpoint Models¶
| Model ID | Deployment | Context | Temp | Use Case |
|---|---|---|---|---|
| llama | Llama 3.3 70B Instruct | 8K | 0.7 | Open-source alternative, cost-saving |
| deepseek | DeepSeek R1 | 32K | N/A | Chain-of-thought reasoning |
| ministral | Ministral 3B | 8K | N/A | Lightweight, fast responses |
| phi | Phi-3 Small 8K Instruct | 8K | N/A | Edge deployment experiments |
External API Models¶
| Model ID | Provider | Model | Context | Use Case |
|---|---|---|---|---|
| Gemini Flash-2.5 | Google | gemini-2.0-flash | 1M | Multimodal, long context |
| Claude sonnet 4 | Anthropic | claude-3-5-sonnet-20241022 | 200K | Complex analysis, code generation |
| Grok-3 | xAI | grok-3 | 128K | Real-time information, web-connected |
Model Parameters¶
Temperature: 0.7 (Universal)¶
ALL models use temperature = 0.7
# Every model call (client is the provider SDK client; deployment_name selects the model)
response = client.chat.completions.create(
model=deployment_name,
messages=messages,
temperature=0.7 # Consistent across all models
)
Why 0.7?
Temperature controls randomness (0.0 = deterministic, 1.0 = creative):
- 0.0: Completely deterministic (same input → same output)
- 0.3: Low creativity (factual Q&A, data extraction)
- 0.7: Balanced (conversational, helpful, not repetitive) ← Our choice
- 1.0: High creativity (storytelling, brainstorming)
Rationale for 0.7:
- ✅ Conversational variety (not robotic)
- ✅ Still grounded in context
- ✅ Good for chatbot use case
- ✅ Avoids the extra hallucination risk that comes with higher temperatures
Alternative considered: 0.3 (more factual, but too robotic for user experience)
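As a quick illustration of the trade-off, the sketch below (assuming an already-configured Azure OpenAI `client`; the prompt and repeat count are arbitrary) sends the same message at temperature 0.0 and 0.7 and counts distinct outputs:
# Illustrative only: at 0.0 repeated calls usually return identical text; at 0.7 they typically vary.
prompt = [{"role": "user", "content": "Suggest a greeting for a returning customer."}]

for temp in (0.0, 0.7):
    outputs = set()
    for _ in range(3):
        response = client.chat.completions.create(
            model="gpt-35-turbo-16k-0613",
            messages=prompt,
            temperature=temp
        )
        outputs.add(response.choices[0].message.content)
    print(f"temperature={temp}: {len(outputs)} distinct responses")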
Model Selection Logic¶
Default Routing¶
Code Location: response-3d-chatbot-service/src/main.py
# Check chatbot-level model preference
chatbot_config = get_chatbot_config(user_id, project_id)
preferred_model = chatbot_config.get("preferred_model")
if preferred_model:
return call_model(preferred_model, messages)
else:
# Default: Use GPT-3.5 Turbo 16K
return call_openai_35(messages)
Default Model: GPT-3.5 Turbo 16K (fast, cost-effective)
Per-Chatbot Model Configuration¶
Users can set preferred model per chatbot:
# chatbot_selection collection
{
"user_id": "User-123",
"project_id": "Project-456",
"preferred_model": "openai-4", # User choice
"chatbot_purpose": "Custom-Agent",
...
}
Supported Values:
openai-4, openai-35, openai-4o-mini, llama, deepseek, ministral, phi, Gemini Flash-2.5, Claude sonnet 4, Grok-3
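A minimal validation sketch, assuming a helper like the one below (the SUPPORTED_MODELS set and resolve_model name are illustrative, not confirmed code), keeps typos in preferred_model from reaching the routing layer:
SUPPORTED_MODELS = {
    "openai-4", "openai-35", "openai-4o-mini", "llama", "deepseek",
    "ministral", "phi", "Gemini Flash-2.5", "Claude sonnet 4", "Grok-3"
}

def resolve_model(chatbot_config: dict) -> str:
    """Return the stored preference if it is a supported model ID, else the default."""
    preferred = chatbot_config.get("preferred_model")
    if preferred in SUPPORTED_MODELS:
        return preferred
    return "openai-35"  # Default model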
Model Comparison¶
Performance Characteristics¶
| Model | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| GPT-4-0613 | 🐌 Slow (2.5s) | ⭐⭐⭐⭐⭐ | 💰💰💰 High | Complex reasoning, critical accuracy |
| GPT-3.5 Turbo 16K | ⚡ Fast (0.8s) | ⭐⭐⭐ | 💰 Low | General queries, high volume |
| GPT-4o Mini | 🚀 Medium (1.2s) | ⭐⭐⭐⭐ | 💰💰 Medium | Long context, balanced needs |
| o1-mini | 🐌 Slow (3.0s) | ⭐⭐⭐⭐⭐ | 💰💰💰💰 | Reasoning tasks, problem-solving |
| Llama 3.3 70B | 🚀 Medium (1.5s) | ⭐⭐⭐⭐ | 💰 Low | Cost-conscious, no vendor lock-in |
| DeepSeek R1 | 🐌 Slow (2.8s) | ⭐⭐⭐⭐ | 💰💰 | Chain-of-thought reasoning |
| Ministral 3B | ⚡⚡ Very Fast (0.5s) | ⭐⭐ | 💰 | Simple queries, speed critical |
| Phi-3 Small | ⚡⚡ Very Fast (0.6s) | ⭐⭐ | 💰 | Edge deployment, experiments |
| Gemini 2.0 Flash | ⚡ Fast (0.9s) | ⭐⭐⭐⭐ | 💰💰 | Multimodal, long context |
| Claude 3.5 Sonnet | 🚀 Medium (2.0s) | ⭐⭐⭐⭐⭐ | 💰💰💰 | Complex analysis, code |
| Grok-3 | 🚀 Medium (1.8s) | ⭐⭐⭐⭐ | 💰💰💰 | Real-time information |
Cost Comparison (per 1K tokens)¶
| Model | Input Cost | Output Cost | Total (avg) |
|---|---|---|---|
| GPT-4-0613 | $0.03 | $0.06 | $0.045 |
| GPT-3.5 Turbo 16K | $0.001 | $0.002 | $0.0015 |
| GPT-4o Mini | $0.00015 | $0.0006 | $0.0004 ⭐ Cheapest OpenAI |
| o1-mini | $0.003 | $0.012 | $0.0075 |
| Llama 3.3 70B | $0.0003 | $0.0003 | $0.0003 ⭐ Best value |
| Gemini 2.0 Flash | $0.00001875 | $0.000075 | $0.00005 ⭐ Cheapest overall |
| Claude 3.5 Sonnet | $0.003 | $0.015 | $0.009 |
Cost Optimization Strategy:
- Default to GPT-3.5 or Gemini (cheap)
- Escalate to GPT-4 if query complexity detected
- Use Llama for cost-sensitive customers
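A hedged sketch of this policy as code (the tier name and is_complex flag are illustrative assumptions, not an existing interface):
def pick_cost_optimized_model(customer_tier: str, is_complex: bool) -> str:
    """Map the cost-optimization rules above onto model IDs."""
    if customer_tier == "cost-sensitive":
        return "llama"      # Lowest-cost open-source option
    if is_complex:
        return "openai-4"   # Escalate complex queries to GPT-4
    return "openai-35"      # Cheap default (GPT-3.5)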
Routing Strategies¶
Strategy 1: User-Specified (Current)¶
Implementation: User selects model in chatbot settings
Pros:
- ✅ User control
- ✅ Simple implementation
- ✅ Predictable costs
Cons:
- ❌ Users may not know which model to choose
- ❌ Over-paying if always using GPT-4
Code:
if chatbot_config.get("preferred_model"):
model = chatbot_config["preferred_model"]
else:
model = "openai-35" # Default
Strategy 2: Query Complexity-Based (Planned)¶
Implementation: Detect query complexity and route automatically
Heuristics:
def detect_complexity(query: str) -> str:
"""
Route based on query characteristics.
"""
query_lower = query.lower()
word_count = len(query.split())
# Simple FAQ → Fast, cheap model
if word_count < 10:
return "openai-35"
# Code-related → Claude
if any(keyword in query_lower for keyword in ["code", "function", "debug", "API", "syntax"]):
return "Claude sonnet 4"
# Math/reasoning → o1-mini or DeepSeek
if any(keyword in query_lower for keyword in ["calculate", "solve", "problem", "equation"]):
return "openai-gpto1mini"
# Long context (> 500 words) → GPT-4o Mini or Gemini
if word_count > 500:
return "openai-4o-mini"
# Complex reasoning → GPT-4
if any(keyword in query_lower for keyword in ["explain", "analyze", "compare", "why"]):
return "openai-4"
# Default → GPT-3.5
return "openai-35"
Pros:
- ✅ Automatic cost optimization
- ✅ Better user experience (right model for task)
- ✅ Easier for users (no configuration)
Cons:
- ❌ Heuristics may be wrong
- ❌ More complex implementation
Status: Roadmap Q1 2025
Strategy 3: Fallback Chain (Error Handling)¶
Implementation: If primary model fails, try alternatives
def call_with_fallback(messages, primary="openai-4"):
"""
Try primary model, fallback to GPT-3.5 if it fails.
"""
try:
return call_model(primary, messages)
except Exception as e:
logger.warning(f"{primary} failed: {e}, falling back to GPT-3.5")
return call_openai_35(messages)
Fallback Order:
- GPT-4 → GPT-3.5 (if rate limit or error)
- Claude → GPT-4 (if API down)
- Gemini → GPT-3.5 (if quota exceeded)
Status: Implemented in some services
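The pairs above could also be expressed as a configurable chain; the sketch below generalizes call_with_fallback (the FALLBACK_CHAINS mapping and the Claude chain's final GPT-3.5 step are illustrative assumptions):
FALLBACK_CHAINS = {
    "openai-4": ["openai-35"],                      # GPT-4 → GPT-3.5
    "Claude sonnet 4": ["openai-4", "openai-35"],   # Claude → GPT-4 (→ GPT-3.5)
    "Gemini Flash-2.5": ["openai-35"]               # Gemini → GPT-3.5
}

def call_with_fallback_chain(messages, primary: str):
    """Try the primary model, then each configured fallback in order."""
    last_error = None
    for model_id in [primary] + FALLBACK_CHAINS.get(primary, []):
        try:
            return call_model(model_id, messages)
        except Exception as e:
            logger.warning(f"{model_id} failed: {e}, trying next model")
            last_error = e
    raise last_error  # Every model in the chain failed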
Model-Specific Implementation¶
Azure OpenAI Models¶
import os

from openai import AzureOpenAI

AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]

# GPT-4-0613 (azure_endpoint is the resource base URL; the deployment name is passed as `model`)
client_gpt4 = AzureOpenAI(
    azure_endpoint="https://machineagentopenai.openai.azure.com/",
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-02-15-preview"
)
def call_openai_4(messages):
response = client_gpt4.chat.completions.create(
model="gpt-4-0613",
messages=messages,
temperature=0.7
)
return response.choices[0].message.content
Azure ML Endpoints¶
import requests
# Llama 3.3 70B
LLAMA_ENDPOINT = "https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/chat/completions"
LLAMA_API_KEY = "..."
def call_llama(messages):
payload = {
"messages": messages,
"temperature": 0.7,
"max_tokens": 50 # Limit output length
}
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {LLAMA_API_KEY}"
}
response = requests.post(LLAMA_ENDPOINT, json=payload, headers=headers)
response.raise_for_status()
return response.json()["choices"][0]["message"]["content"]
External APIs¶
# Gemini 2.0 Flash
GEMINI_ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=..."
def call_gemini(messages):
    # Transform to Gemini format (roles must be "user" or "model"; "assistant" maps to "model")
    gemini_contents = [{
        "role": "model" if msg["role"] == "assistant" else "user",
        "parts": [{"text": msg["content"]}]
    } for msg in messages]
payload = {
"contents": gemini_contents,
"generationConfig": {
"temperature": 0.7,
"maxOutputTokens": 1000
}
}
response = requests.post(GEMINI_ENDPOINT, json=payload)
response.raise_for_status()
return response.json()["candidates"][0]["content"]["parts"][0]["text"]
Token Management¶
Max Token Limits¶
Different models have different output limits:
MAX_TOKENS = {
"openai-4": 4096,
"openai-35": 4096,
"openai-4o-mini": 16384,
"llama": 8192,
"deepseek": 4096,
"ministral": 2048, # Limited due to Azure ML constraints
"phi": 2048,
"Gemini Flash-2.5": 8192,
"Claude sonnet 4": 4096,
"Grok-3": 4096
}
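A short example of how this table can be applied when building request parameters (the helper name is illustrative):
def build_generation_params(model_id: str, requested_tokens: int) -> dict:
    """Clamp the requested output length to the model's documented limit."""
    limit = MAX_TOKENS.get(model_id, 4096)  # Conservative default for unknown models
    return {"temperature": 0.7, "max_tokens": min(requested_tokens, limit)}

params = build_generation_params("ministral", 4000)  # → {"temperature": 0.7, "max_tokens": 2048}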
Token Counting¶
Code Location: response-3d-chatbot-service/src/main.py
import tiktoken
# Initialize tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base") # For OpenAI models
def count_tokens(text: str) -> int:
"""Count tokens in text."""
return len(tokenizer.encode(text))
# Track usage
input_tokens = count_tokens(input_prompt)
output_tokens = count_tokens(output_response)
total_tokens = input_tokens + output_tokens
logger.info(f"Token usage: {input_tokens} in, {output_tokens} out, {total_tokens} total")
Chat History Management¶
Conversation truncation to fit context window:
def truncate_history(chat_history, max_tokens=6000):
"""
Keep most recent messages that fit in context window.
"""
truncated = []
total_tokens = 0
# Iterate from most recent
for message in reversed(chat_history):
msg_tokens = count_tokens(message['content'])
if total_tokens + msg_tokens > max_tokens:
break
truncated.insert(0, message)
total_tokens += msg_tokens
return truncated
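For example, the truncated history is typically combined with a system prompt and the new user message before the model call (the prompt text and variable names here are illustrative):
system_prompt = {"role": "system", "content": "You are a helpful avatar assistant."}

recent_history = truncate_history(chat_history, max_tokens=6000)
messages = [system_prompt] + recent_history + [{"role": "user", "content": user_query}]
response_text = call_model(preferred_model, messages)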
Error Handling¶
Common Errors¶
| Error | Cause | Solution |
|---|---|---|
| Rate Limit | Too many requests | Exponential backoff, retry |
| Timeout | Model taking too long | Increase timeout, use faster model |
| Invalid API Key | Wrong credentials | Check environment variables |
| Model Not Found | Wrong deployment name | Verify deployment names |
| Content Filter | Unsafe content detected | Sanitize input, handle gracefully |
Retry Logic¶
import time
def call_with_retry(model_func, messages, max_retries=3):
"""
Retry with exponential backoff.
"""
for attempt in range(max_retries):
try:
return model_func(messages)
except Exception as e:
if attempt == max_retries - 1:
raise # Give up
            wait_time = 2 ** attempt  # 1s, 2s, ... (doubles each attempt)
logger.warning(f"Attempt {attempt + 1} failed: {e}, retrying in {wait_time}s")
time.sleep(wait_time)
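Intended usage is to wrap any of the per-model helpers, for example:
# Retry GPT-4 up to 3 times with exponential backoff before giving up.
answer = call_with_retry(call_openai_4, messages, max_retries=3)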
Monitoring & Logging¶
Per-Request Logging¶
logger.info("Calling LLM", extra={
"model": model_name,
"user_id": user_id,
"project_id": project_id,
"input_tokens": input_tokens,
"message_count": len(messages)
})
# After response
logger.info("LLM response received", extra={
"model": model_name,
"output_tokens": output_tokens,
"total_tokens": total_tokens,
"latency_ms": latency_ms,
"cost_usd": calculated_cost
})
Cost Tracking¶
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
"""
Calculate cost in USD.
"""
COST_TABLE = {
"openai-4": {"input": 0.03, "output": 0.06},
"openai-35": {"input": 0.001, "output": 0.002},
"openai-4o-mini": {"input": 0.00015, "output": 0.0006},
# ... other models
}
rates = COST_TABLE.get(model, {"input": 0, "output": 0})
cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
return cost
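A small end-to-end sketch tying the token counter, cost table, and logging together (variable names are illustrative):
input_tokens = count_tokens(input_prompt)
output_text = call_openai_35(messages)
output_tokens = count_tokens(output_text)

cost = calculate_cost("openai-35", input_tokens, output_tokens)
logger.info(f"openai-35 cost ${cost:.6f} ({input_tokens} in / {output_tokens} out tokens)")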
Future Enhancements¶
Q1 2025:
- Automatic complexity-based routing
- A/B testing framework (compare models on same queries)
- Fine-tuning GPT-3.5 on domain data
Q2 2025:
- Multi-model ensembling (combine responses from multiple models)
- Streaming responses (real-time output)
- Model performance dashboard
Related Documentation¶
Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead
"The right model for every conversation."