Model Selection & Routing Strategy

Purpose: Complete guide to LLM model selection, routing logic, and parameter tuning
Audience: Backend Engineers, ML Engineers, DevOps
Owner: ML Engineering Lead Last Updated: 2025-12-26
Version: 1.0


Model Roster

Complete Model Inventory

MachineAvatars supports 11 LLM models across 5 providers:

Azure OpenAI Models

| Model ID | Deployment Name | Context | Temp | Use Case |
| --- | --- | --- | --- | --- |
| openai-4 | gpt-4-0613 | 8K | 0.7 | Complex reasoning, high-quality responses |
| openai-35 | gpt-35-turbo-16k-0613 | 16K | 0.7 | Fast, cost-effective general queries |
| openai-4o-mini | gpt-4o-mini-2024-07-18 | 128K | 0.7 | Long context, balanced cost/performance |
| gpto1-mini | o1-mini-2024-09-12 | 128K | 0.7 | Reasoning-optimized tasks |

Azure ML Endpoint Models

| Model ID | Deployment | Context | Temp | Use Case |
| --- | --- | --- | --- | --- |
| llama | Llama 3.3 70B Instruct | 8K | 0.7 | Open-source alternative, cost-saving |
| deepseek | DeepSeek R1 | 32K | N/A | Chain-of-thought reasoning |
| ministral | Ministral 3B | 8K | N/A | Lightweight, fast responses |
| phi | Phi-3 Small 8K Instruct | 8K | N/A | Edge deployment experiments |

External API Models

| Model ID | Provider | Model | Context | Use Case |
| --- | --- | --- | --- | --- |
| Gemini Flash-2.5 | Google | gemini-2.0-flash | 1M | Multimodal, long context |
| Claude sonnet 4 | Anthropic | claude-3-5-sonnet-20241022 | 200K | Complex analysis, code generation |
| Grok-3 | xAI | grok-3 | 128K | Real-time information, web-connected |

Model Parameters

Temperature: 0.7 (Universal)

All models that expose a temperature parameter use temperature = 0.7 (models listed with Temp N/A in the roster are called with provider defaults)

# Every model call
response = model.chat.completions.create(
    model=deployment_name,
    messages=messages,
    temperature=0.7  # Consistent across all models
)

Why 0.7?

Temperature controls randomness (0.0 = deterministic, 1.0 = creative):

  • 0.0: Completely deterministic (same input → same output)
  • 0.3: Low creativity (factual Q&A, data extraction)
  • 0.7: Balanced (conversational, helpful, not repetitive) ← Our choice
  • 1.0: High creativity (storytelling, brainstorming)

Rationale for 0.7:

  • ✅ Conversational variety (not robotic)
  • ✅ Still grounded in context
  • ✅ Good for chatbot use case
  • ✅ Avoids the extra hallucination risk that comes with higher temperatures

Alternative considered: 0.3 (more factual, but too robotic for user experience)


Model Selection Logic

Default Routing

Code Location: response-3d-chatbot-service/src/main.py

# Check chatbot-level model preference
chatbot_config = get_chatbot_config(user_id, project_id)
preferred_model = chatbot_config.get("preferred_model")

if preferred_model:
    return call_model(preferred_model, messages)
else:
    # Default: Use GPT-3.5 Turbo 16K
    return call_openai_35(messages)

Default Model: GPT-3.5 Turbo 16K (fast, cost-effective)

Per-Chatbot Model Configuration

Users can set preferred model per chatbot:

# chatbot_selection collection
{
    "user_id": "User-123",
    "project_id": "Project-456",
    "preferred_model": "openai-4",  # User choice
    "chatbot_purpose": "Custom-Agent",
    ...
}
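A minimal sketch of writing this preference, assuming chatbot_selection is a MongoDB collection accessed with pymongo (the connection string and database name below are illustrative, not production values):

from pymongo import MongoClient

# Hypothetical connection details, for illustration only
client = MongoClient("mongodb://localhost:27017")
db = client["machineavatars"]

def set_preferred_model(user_id: str, project_id: str, model_id: str) -> None:
    """Upsert the per-chatbot model preference in chatbot_selection."""
    db["chatbot_selection"].update_one(
        {"user_id": user_id, "project_id": project_id},
        {"$set": {"preferred_model": model_id}},
        upsert=True,
    )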

Supported Values:

  • openai-4
  • openai-35
  • openai-4o-mini
  • llama
  • deepseek
  • ministral
  • phi
  • Gemini Flash-2.5
  • Claude sonnet 4
  • Grok-3
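Because these IDs are stored as free-form strings, it is worth validating the stored value before routing. A small sketch (the helper name is illustrative; the set mirrors the list above):

# Mirrors the supported values above
SUPPORTED_MODELS = {
    "openai-4", "openai-35", "openai-4o-mini",
    "llama", "deepseek", "ministral", "phi",
    "Gemini Flash-2.5", "Claude sonnet 4", "Grok-3",
}

def resolve_model(preferred_model):
    """Fall back to the default when the stored preference is missing or unknown."""
    if preferred_model in SUPPORTED_MODELS:
        return preferred_model
    return "openai-35"  # Default (see Default Routing above)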

Model Comparison

Performance Characteristics

| Model | Speed | Quality | Cost | Best For |
| --- | --- | --- | --- | --- |
| GPT-4-0613 | 🐌 Slow (2.5s) | ⭐⭐⭐⭐⭐ | 💰💰💰 | Complex reasoning, critical accuracy |
| GPT-3.5 Turbo 16K | ⚡ Fast (0.8s) | ⭐⭐⭐ | 💰 | General queries, high volume |
| GPT-4o Mini | 🚀 Medium (1.2s) | ⭐⭐⭐⭐ | 💰💰 | Long context, balanced needs |
| o1-mini | 🐌 Slow (3.0s) | ⭐⭐⭐⭐⭐ | 💰💰💰💰 | Reasoning tasks, problem-solving |
| Llama 3.3 70B | 🚀 Medium (1.5s) | ⭐⭐⭐⭐ | 💰 | Cost-conscious, no vendor lock-in |
| DeepSeek R1 | 🐌 Slow (2.8s) | ⭐⭐⭐⭐ | 💰💰 | Chain-of-thought reasoning |
| Ministral 3B | ⚡⚡ Very Fast (0.5s) | ⭐⭐ | 💰 | Simple queries, speed critical |
| Phi-3 Small | ⚡⚡ Very Fast (0.6s) | ⭐⭐ | 💰 | Edge deployment, experiments |
| Gemini 2.0 Flash | ⚡ Fast (0.9s) | ⭐⭐⭐⭐ | 💰💰 | Multimodal, long context |
| Claude 3.5 Sonnet | 🚀 Medium (2.0s) | ⭐⭐⭐⭐⭐ | 💰💰💰 | Complex analysis, code |
| Grok-3 | 🚀 Medium (1.8s) | ⭐⭐⭐⭐ | 💰💰💰 | Real-time information |

Cost Comparison (per 1K tokens)

| Model | Input Cost | Output Cost | Total (avg) | Notes |
| --- | --- | --- | --- | --- |
| GPT-4-0613 | $0.03 | $0.06 | $0.045 | |
| GPT-3.5 Turbo 16K | $0.001 | $0.002 | $0.0015 | Default model |
| GPT-4o Mini | $0.00015 | $0.0006 | $0.0004 | ⭐ Cheapest OpenAI |
| o1-mini | $0.003 | $0.012 | $0.0075 | |
| Llama 3.3 70B | $0.0003 | $0.0003 | $0.0003 | ⭐ Best value |
| Gemini 2.0 Flash | $0.00001875 | $0.000075 | $0.00005 | ⭐ Cheapest overall |
| Claude 3.5 Sonnet | $0.003 | $0.015 | $0.009 | |

Cost Optimization Strategy:

  1. Default to GPT-3.5 or Gemini (cheap)
  2. Escalate to GPT-4 if query complexity detected
  3. Use Llama for cost-sensitive customers

Routing Strategies

Strategy 1: User-Specified (Current)

Implementation: User selects model in chatbot settings

Pros:

  • ✅ User control
  • ✅ Simple implementation
  • ✅ Predictable costs

Cons:

  • ❌ Users may not know which model to choose
  • ❌ Over-paying if always using GPT-4

Code:

if chatbot_config.get("preferred_model"):
    model = chatbot_config["preferred_model"]
else:
    model = "openai-35"  # Default

Strategy 2: Query Complexity-Based (Planned)

Implementation: Detect query complexity and route automatically

Heuristics:

def detect_complexity(query: str) -> str:
    """
    Route based on query characteristics.
    """
    query_lower = query.lower()
    word_count = len(query.split())

    # Simple FAQ → Fast, cheap model
    if word_count < 10:
        return "openai-35"

    # Code-related → Claude
    if any(keyword in query_lower for keyword in ["code", "function", "debug", "api", "syntax"]):  # lowercase keywords so they match query_lower
        return "Claude sonnet 4"

    # Math/reasoning → o1-mini or DeepSeek
    if any(keyword in query_lower for keyword in ["calculate", "solve", "problem", "equation"]):
        return "openai-gpto1mini"

    # Long context (> 500 words) → GPT-4o Mini or Gemini
    if word_count > 500:
        return "openai-4o-mini"

    # Complex reasoning → GPT-4
    if any(keyword in query_lower for keyword in ["explain", "analyze", "compare", "why"]):
        return "openai-4"

    # Default → GPT-3.5
    return "openai-35"

Pros:

  • ✅ Automatic cost optimization
  • ✅ Better user experience (right model for task)
  • ✅ Easier for users (no configuration)

Cons:

  • ❌ Heuristics may be wrong
  • ❌ More complex implementation

Status: Roadmap Q1 2025


Strategy 3: Fallback Chain (Error Handling)

Implementation: If primary model fails, try alternatives

def call_with_fallback(messages, primary="openai-4"):
    """
    Try primary model, fallback to GPT-3.5 if it fails.
    """
    try:
        return call_model(primary, messages)
    except Exception as e:
        logger.warning(f"{primary} failed: {e}, falling back to GPT-3.5")
        return call_openai_35(messages)

Fallback Order:

  1. GPT-4 → GPT-3.5 (if rate limit or error)
  2. Claude → GPT-4 (if API down)
  3. Gemini → GPT-3.5 (if quota exceeded)
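A sketch generalizing call_with_fallback to this ordered chain, reusing the generic call_model helper referenced earlier (the FALLBACK_CHAIN mapping shape is illustrative):

# Encodes the fallback order above
FALLBACK_CHAIN = {
    "openai-4": ["openai-35"],
    "Claude sonnet 4": ["openai-4"],
    "Gemini Flash-2.5": ["openai-35"],
}

def call_with_chain(messages, primary):
    """Try the primary model, then each configured fallback in order."""
    last_error = None
    for model in [primary] + FALLBACK_CHAIN.get(primary, ["openai-35"]):
        try:
            return call_model(model, messages)
        except Exception as e:
            logger.warning(f"{model} failed: {e}, trying next in chain")
            last_error = e
    raise last_error  # every model in the chain failed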

Status: Implemented in some services


Model-Specific Implementation

Azure OpenAI Models

from openai import AzureOpenAI

# GPT-4-0613
client_gpt4 = AzureOpenAI(
    azure_endpoint="https://machineagentopenai.openai.azure.com",  # base resource URL; the SDK builds the deployment path itself
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-02-15-preview"
)

def call_openai_4(messages):
    response = client_gpt4.chat.completions.create(
        model="gpt-4-0613",
        messages=messages,
        temperature=0.7
    )
    return response.choices[0].message.content

Azure ML Endpoints

import requests

# Llama 3.3 70B
LLAMA_ENDPOINT = "https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/chat/completions"
LLAMA_API_KEY = "..."

def call_llama(messages):
    payload = {
        "messages": messages,
        "temperature": 0.7,
        "max_tokens": 50  # Limit output length
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {LLAMA_API_KEY}"
    }
    response = requests.post(LLAMA_ENDPOINT, json=payload, headers=headers)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

External APIs

# Gemini 2.0 Flash
GEMINI_ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=..."

def call_gemini(messages):
    # Transform to Gemini format (Gemini expects roles "user" and "model", not "assistant")
    gemini_contents = [{
        "role": "model" if msg["role"] == "assistant" else "user",
        "parts": [{"text": msg["content"]}]
    } for msg in messages]

    payload = {
        "contents": gemini_contents,
        "generationConfig": {
            "temperature": 0.7,
            "maxOutputTokens": 1000
        }
    }

    response = requests.post(GEMINI_ENDPOINT, json=payload)
    response.raise_for_status()
    return response.json()["candidates"][0]["content"]["parts"][0]["text"]

Token Management

Max Token Limits

Different models have different output limits:

MAX_TOKENS = {
    "openai-4": 4096,
    "openai-35": 4096,
    "openai-4o-mini": 16384,
    "llama": 8192,
    "deepseek": 4096,
    "ministral": 2048,  # Limited due to Azure ML constraints
    "phi": 2048,
    "Gemini Flash-2.5": 8192,
    "Claude sonnet 4": 4096,
    "Grok-3": 4096
}
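A typical use is clamping the requested output length to the model's cap before each call. A one-line sketch (the helper name and default are illustrative):

def resolve_max_tokens(model_id, requested=1000):
    """Never request more output tokens than the model supports."""
    return min(requested, MAX_TOKENS.get(model_id, 2048))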

Token Counting

Code Location: response-3d-chatbot-service/src/main.py

import tiktoken

# Initialize tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")  # For OpenAI models

def count_tokens(text: str) -> int:
    """Count tokens in text."""
    return len(tokenizer.encode(text))

# Track usage
input_tokens = count_tokens(input_prompt)
output_tokens = count_tokens(output_response)
total_tokens = input_tokens + output_tokens

logger.info(f"Token usage: {input_tokens} in, {output_tokens} out, {total_tokens} total")

Chat History Management

Conversation truncation to fit context window:

def truncate_history(chat_history, max_tokens=6000):
    """
    Keep most recent messages that fit in context window.
    """
    truncated = []
    total_tokens = 0

    # Iterate from most recent
    for message in reversed(chat_history):
        msg_tokens = count_tokens(message['content'])
        if total_tokens + msg_tokens > max_tokens:
            break
        truncated.insert(0, message)
        total_tokens += msg_tokens

    return truncated

Error Handling

Common Errors

| Error | Cause | Solution |
| --- | --- | --- |
| Rate Limit | Too many requests | Exponential backoff, retry |
| Timeout | Model taking too long | Increase timeout, use faster model |
| Invalid API Key | Wrong credentials | Check environment variables |
| Model Not Found | Wrong deployment name | Verify deployment names |
| Content Filter | Unsafe content detected | Sanitize input, handle gracefully |
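For the Azure OpenAI callers, the first three rows map onto concrete exception classes in the openai v1 SDK (RateLimitError, APITimeoutError, AuthenticationError). A sketch of per-error handling wired to the table (the wrapper name is illustrative; call_with_retry is defined under Retry Logic below):

import openai

def call_openai_4_safe(messages):
    try:
        return call_openai_4(messages)
    except openai.RateLimitError:
        # Too many requests → exponential backoff (see Retry Logic below)
        return call_with_retry(call_openai_4, messages)
    except openai.APITimeoutError:
        # Model taking too long → fall back to a faster model
        return call_openai_35(messages)
    except openai.AuthenticationError:
        # Wrong credentials → fail fast; check environment variables
        raise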

Retry Logic

import time

def call_with_retry(model_func, messages, max_retries=3):
    """
    Retry with exponential backoff.
    """
    for attempt in range(max_retries):
        try:
            return model_func(messages)
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # Give up

            wait_time = 2 ** attempt  # 1s, 2s, 4s
            logger.warning(f"Attempt {attempt + 1} failed: {e}, retrying in {wait_time}s")
            time.sleep(wait_time)

Monitoring & Logging

Per-Request Logging

logger.info("Calling LLM", extra={
    "model": model_name,
    "user_id": user_id,
    "project_id": project_id,
    "input_tokens": input_tokens,
    "message_count": len(messages)
})

# After response
logger.info("LLM response received", extra={
    "model": model_name,
    "output_tokens": output_tokens,
    "total_tokens": total_tokens,
    "latency_ms": latency_ms,
    "cost_usd": calculated_cost
})
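The latency_ms and cost_usd fields imply timing each call. A sketch of a wrapper that produces both log records, reusing count_tokens from above and calculate_cost from the next subsection (the wrapper name is illustrative):

import time

def call_and_log(model_name, messages, model_func):
    """Time one model call and emit both log records shown above."""
    input_tokens = sum(count_tokens(m["content"]) for m in messages)
    logger.info("Calling LLM", extra={
        "model": model_name,
        "input_tokens": input_tokens,
        "message_count": len(messages)
    })

    start = time.perf_counter()
    output = model_func(messages)
    latency_ms = round((time.perf_counter() - start) * 1000)

    output_tokens = count_tokens(output)
    logger.info("LLM response received", extra={
        "model": model_name,
        "output_tokens": output_tokens,
        "total_tokens": input_tokens + output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": calculate_cost(model_name, input_tokens, output_tokens)
    })
    return output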

Cost Tracking

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """
    Calculate cost in USD.
    """
    COST_TABLE = {
        "openai-4": {"input": 0.03, "output": 0.06},
        "openai-35": {"input": 0.001, "output": 0.002},
        "openai-4o-mini": {"input": 0.00015, "output": 0.0006},
        # ... other models
    }

    rates = COST_TABLE.get(model, {"input": 0, "output": 0})
    cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
    return cost
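Worked example: a GPT-3.5 request with 1,200 input tokens and 300 output tokens costs (1,200 / 1,000 × $0.001) + (300 / 1,000 × $0.002) = $0.0012 + $0.0006 = $0.0018.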

Future Enhancements

Q1 2025:

  • Automatic complexity-based routing
  • A/B testing framework (compare models on same queries)
  • Fine-tuning GPT-3.5 on domain data

Q2 2025:

  • Multi-model ensembling (combine responses from multiple models)
  • Streaming responses (real-time output)
  • Model performance dashboard


Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead


"The right model for every conversation."