
AI/ML Architecture Overview

Purpose: Complete documentation of MachineAvatars AI/ML implementation, model strategy, and RAG architecture
Audience: ML Engineers, Technical Auditors, Investors, CTO
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0
Status: Active


Executive Summary

MachineAvatars is a multi-LLM AI platform with advanced RAG capabilities, serving 3D, text, and voice chatbots backed by state-of-the-art language models and vector search infrastructure.

Key AI Capabilities:

  • Multi-Provider LLM Strategy - 10 models across Azure OpenAI, Anthropic (Claude), Google (Gemini), and xAI (Grok)
  • Production RAG - Partition-based Milvus vector search with 384-dim embeddings
  • Voice Synthesis - Azure Neural TTS with 400+ voices across 100+ languages
  • Multi-Tenancy - Isolated vector partitions per chatbot for security & performance
  • Intelligent Chunking - 1000-character segments with 200-character overlap

AI/ML Technology Stack

Language Models (LLMs)

Primary Models (Azure OpenAI):

| Model | Deployment Name | Use Case | Context Window | Cost Profile |
|---|---|---|---|---|
| GPT-4-0613 | gpt-4-0613 | Complex reasoning, high-quality responses | 8K tokens | High |
| GPT-3.5 Turbo 16K | gpt-35-turbo-16k-0613 | Cost-effective, fast responses | 16K tokens | Low |
| GPT-4o Mini | gpt-4o-mini-2024-07-18 | Balanced cost/performance | 128K tokens | Medium |

Additional Models (Multi-Cloud Strategy):

| Provider | Model | Endpoint | Primary Use |
|---|---|---|---|
| Azure | Llama 3.3 70B Instruct | Azure ML Endpoint | Open-source alternative |
| Azure | DeepSeek R1 | Azure ML Endpoint | Reasoning tasks |
| Azure | Ministral 3B | Azure ML Endpoint | Lightweight responses |
| Azure | Phi-3 Small 8K | Azure ML Endpoint | Edge deployment experiments |
| Google | Gemini 2.0 Flash | GCP Vertex AI | Multimodal capabilities |
| Anthropic | Claude 3.5 Sonnet | Anthropic API | Complex analysis |
| xAI | Grok-3 | Azure Endpoint | Real-time information |

Why Multi-Provider?

  • Risk Mitigation: No single provider dependence
  • Cost Optimization: Use cheapest model for each task
  • Performance Optimization: Route to fastest model when latency-critical
  • Feature Access: Unique capabilities per provider (e.g., Gemini multimodal)

Vector Database: Milvus

Configuration:

# Milvus Embeddings Service - Partition-Based Architecture
Host: localhost (Production: Azure Container Apps)
Port: 19530
Collection: "embeddings"
Dimensions: 384  # Compact embeddings for faster search
Metric: L2 Distance (Euclidean)
Index Type: IVF_FLAT
Index Parameters: {"nlist": 128}
Search Parameters: {"nprobe": 10}

Architecture Pattern: Partition-Based Multi-Tenancy

Each chatbot gets its own partition within the embeddings collection:

embeddings/
├── User_123_Project_1/    ← Partition
│   ├── chunk_1 (384-dim vector)
│   ├── chunk_2 (384-dim vector)
│   └── ...
├── User_123_Project_2/    ← Partition
│   ├── chunk_1 (384-dim vector)
│   └── ...
└── User_456_Project_3/    ← Partition
   └── ...

Why Partition-Based?

  • Performance: Search only the relevant partition (10-100x faster)
  • Isolation: User data never cross-contaminates
  • Deletion: Drop an entire partition instantly instead of running slow scalar deletes (see the sketch after this list)
  • Scalability: Add partitions without affecting existing data
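
The isolation and instant-deletion points above map directly onto pymilvus partition operations. A minimal sketch, assuming the collection and partition names shown earlier (the production embeddings service wraps these calls behind its own API):

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("embeddings")
collection.load()

partition_name = "User_123_Project_1"   # illustrative tenant partition
query_embedding = [0.0] * 384           # placeholder 384-dim query vector

# Search scans ONLY this tenant's partition
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    partition_names=[partition_name],
)

# Deleting a chatbot's knowledge base is a single partition drop,
# not a slow scalar delete (release before dropping)
collection.release()
collection.drop_partition(partition_name)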

Schema:

| Field | Type | Description |
|---|---|---|
| id | INT64 | Auto-generated primary key |
| document_id | VARCHAR(100) | MongoDB document reference |
| user_id | VARCHAR(100) | User identifier |
| project_id | VARCHAR(100) | Chatbot/project ID (partition key) |
| chunk_index | INT32 | Chunk position in document |
| text | VARCHAR(2000) | Original text chunk |
| embedding | FLOAT_VECTOR(384) | Embedding vector |
| data_type | VARCHAR(50) | pdf, text, qa, org, url |
| source_url | VARCHAR(500) | Source URL for website crawls |
| created_at | VARCHAR(100) | ISO timestamp |
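
This schema and the IVF_FLAT index from the configuration above could be declared with pymilvus roughly as follows; the field names mirror the table, while the exact creation code in the service may differ (a sketch, not the production module):

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("document_id", DataType.VARCHAR, max_length=100),
    FieldSchema("user_id", DataType.VARCHAR, max_length=100),
    FieldSchema("project_id", DataType.VARCHAR, max_length=100),
    FieldSchema("chunk_index", DataType.INT32),
    FieldSchema("text", DataType.VARCHAR, max_length=2000),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),
    FieldSchema("data_type", DataType.VARCHAR, max_length=50),
    FieldSchema("source_url", DataType.VARCHAR, max_length=500),
    FieldSchema("created_at", DataType.VARCHAR, max_length=100),
]

collection = Collection(
    "embeddings",
    CollectionSchema(fields, description="Chatbot knowledge-base embeddings"),
)

# IVF_FLAT index with L2 distance, matching the configuration above
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128},
    },
)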

Text-to-Speech (TTS): Azure Neural TTS

Provider: Azure Cognitive Services
API Version: Latest
Voices: 400+ neural voices across 100+ languages

Popular Voices:

| Voice ID | Language | Gender | Style Support |
|---|---|---|---|
| en-US-JennyNeural | English (US) | Female | Conversational, assistant, chat |
| en-US-GuyNeural | English (US) | Male | Professional, newscast |
| en-IN-NeerjaNeural | English (India) | Female | Friendly |
| hi-IN-SwaraNeural | Hindi (India) | Female | Multi-style |

Audio Format: WAV (16kHz, 16-bit, mono)

Implementation (sketched after the list below):

  • Asynchronous TTS generation
  • Auto-retry logic (3 attempts)
  • Temporary file management
  • Voice customization per chatbot
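
A simplified sketch of this flow with the Azure Speech SDK, assuming illustrative values for the key, region, and retry count (the production service adds temp-file cleanup and per-chatbot voice selection):

import tempfile
import azure.cognitiveservices.speech as speechsdk

def synthesize_speech(text: str, voice: str, key: str, region: str) -> str:
    """Generate a 16kHz mono WAV for `text` and return the temp file path (sketch)."""
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice  # e.g. "en-US-JennyNeural"
    # Match the audio format above: WAV, 16kHz, 16-bit, mono
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm
    )

    out_path = tempfile.mktemp(suffix=".wav")
    audio_config = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )

    for attempt in range(3):  # auto-retry logic: up to 3 attempts
        result = synthesizer.speak_text_async(text).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            return out_path
    raise RuntimeError(f"TTS failed for voice {voice} after 3 attempts")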

RAG (Retrieval-Augmented Generation) Architecture

Overview

flowchart TB
    A[User Query] --> B[Query Preprocessing]
    B --> C[Generate Embedding<br/>384-dim]
    C --> D[Vector Search in Milvus]
    D --> E[Top-K Chunks<br/>typically K=5]
    E --> F[Rerank & Filter]
    F --> G[Context Assembly]
    G --> H[Prompt Construction]
    H --> I[LLM Generation]
    I --> J[Response to User]

    K[(Knowledge Base<br/>PDF, Text, QA, URLs)] --> L[Chunking Service]
    L --> M[Embedding Service]
    M --> D

    style D fill:#FFF3E0
    style I fill:#F3E5F5
    style J fill:#E8F5E9

Data Ingestion Pipeline

Step 1: Document Processing

# Supported formats
- PDF files (text extraction)
- Plain text files
- Website URLs (crawled content)
- Q&A pairs (direct embeddings)
- Organization data

Step 2: Text Chunking

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200,
               max_chunks: int = 50):
    """
    Split text into overlapping chunks.

    Chunk Size: 1000 characters
    Overlap: 200 characters (20% overlap for context preservation)
    Max Chunks per Document: 50 (to prevent memory issues)
    """
    chunks = []
    step = chunk_size - overlap  # 800-character stride between chunk starts
    for start in range(0, len(text), step):
        if len(chunks) >= max_chunks:
            break  # Enforce the per-document chunk cap
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append({
            "chunk_index": len(chunks),
            "content": chunk,
            "start_pos": start,
            "end_pos": end,
            "length": len(chunk)
        })
    return chunks

Why 1000/200?

  • 1000 chars ≈ 250 tokens (well within embedding model limits)
  • 200-char overlap preserves sentence context across boundaries
  • Balance between granularity and search relevance

Step 3: Embedding Generation

# Embedding Model: Default (likely sentence-transformers or Azure OpenAI embeddings)
# Dimensions: 384
# Each chunk → 384-dimensional vector

for chunk in chunks:
    embedding = embedder.embed([chunk['content']])[0]
    chunk['embedding'] = [float(x) for x in embedding]  # 384 floats
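
As the comment notes, the embedder is not pinned to a specific model here. A 384-dimensional encoder such as sentence-transformers' all-MiniLM-L6-v2 would fit the schema; the sketch below assumes that model and continues from the `chunks` list produced in Step 2:

# Assumption: a local sentence-transformers model supplies the 384-dim embeddings.
# all-MiniLM-L6-v2 outputs exactly 384 dimensions; the production embedder may differ.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [chunk["content"] for chunk in chunks]
vectors = model.encode(texts)  # shape: (num_chunks, 384)

for chunk, vector in zip(chunks, vectors):
    chunk["embedding"] = [float(x) for x in vector]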

Step 4: Milvus Insertion

# Insert into partition-based Milvus
milvus_embeddings.insert_embeddings(
    collection_name="embeddings",
    embeddings_data=[{
        "document_id": doc_id,
        "user_id": user_id,
        "project_id": chatbot_id,  # Becomes partition name
        "chunk_index": chunk['chunk_index'],
        "text": chunk['content'][:2000],
        "embedding": chunk['embedding'],  # 384-dim vector
        "data_type": "pdf",  # or text, qa, url, org
        "source_url": source_url or ""
    }]
)
# Auto-creates the partition if it doesn't exist
# Partition naming: User_{user_id}_Project_{project_id}
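
The auto-create behaviour noted in the comments might look roughly like this inside the embeddings service (`get_or_create_partition` is a hypothetical helper name):

from pymilvus import Collection

def get_or_create_partition(collection: Collection, user_id: str, project_id: str) -> str:
    """Return the tenant's partition name, creating the partition if it is missing."""
    partition_name = f"User_{user_id}_Project_{project_id}"
    if not collection.has_partition(partition_name):
        collection.create_partition(partition_name)
    return partition_name

# Usage (hypothetical): route the insert to the tenant's partition
# partition = get_or_create_partition(collection, user_id, chatbot_id)
# collection.insert(rows, partition_name=partition)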

Query-Time RAG Flow

Step 1: User Query → Embedding

user_query = "How do I reset my password?"
query_embedding = embedder.embed([user_query])[0]  # 384-dim vector

Step 2: Vector Search (Partition-Scoped)

search_results = milvus_embeddings.search_embeddings(
    collection_name="embeddings",
    query_vector=query_embedding,
    user_id=user_id,
    project_id=chatbot_id,  # Searches ONLY this partition!
    top_k=5,  # Return top 5 most similar chunks
    milvus_ids=None  # Optional: search within specific docs
)

# Returns:
# [
#   {
#     "milvus_id": 12345,
#     "document_id": "doc_abc",
#     "chunk_index": 3,
#     "text": "To reset your password, go to Settings...",
#     "distance": 0.23,  # L2 distance (lower = more similar)
#     "score": 0.81  # Similarity score (1 / (1 + distance))
#   },
#   ...
# ]

Why Top-K = 5?

  • Balances context richness vs. token costs
  • 5 chunks × ~250 tokens = ~1250 tokens of context
  • Leaves room for conversation history + system prompts

Step 3: Context Assembly

# Build context from top results
context_chunks = [result['text'] for result in search_results[:5]]
context = "\n\n---\n\n".join(context_chunks)

# Usually includes:
# - Chunk text
# - Source metadata (document name, page number)
# - Relevance score (for LLM confidence assessment)

Step 4: Prompt Construction

system_prompt = f"""
You are a helpful assistant. Answer the user's question based ONLY on the following context.
If the answer is not in the context, say "I don't have enough information to answer that."

Context:
{context}
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_query}
]

Step 5: LLM Generation

# Route to appropriate model
response = llm_service.call_model(
    model_name="openai-35",  # or gpt-4, depending on complexity
    messages=messages
)

final_answer = response['response']

Model Selection Strategy

Decision Matrix

We use a routing strategy based on query characteristics:

| Scenario | Model Choice | Rationale |
|---|---|---|
| Simple FAQ | GPT-3.5 Turbo 16K | Fast (< 1s), low cost ($0.0015/1K tokens) |
| Complex reasoning | GPT-4-0613 | Higher quality, better at multi-step logic |
| Long context | GPT-4o Mini | 128K context window, cost-effective |
| Real-time data | Grok-3 | Internet-connected |
| Multimodal | Gemini 2.0 Flash | Can process images/video |
| Code generation | Claude 3.5 Sonnet | Superior code understanding |
| Cost-sensitive | Llama 3.3 70B | Open-source, Azure ML hosting |

Routing Logic (Simplified)

def select_model(query: str, chatbot_config: dict):
    # Check chatbot-level model preference
    if chatbot_config.get("preferred_model"):
        return chatbot_config["preferred_model"]

    # Default routing
    query_length = len(query.split())

    if query_length < 20:
        return "openai-35"  # Simple query → fast, cheap
    elif "code" in query.lower() or "function" in query.lower():
        return "Claude sonnet 4"  # Code task → Claude
    elif query_length > 500:
        return "openai-4o-mini"  # Long context → GPT-4o mini
    else:
        return "openai-4"  # Complex query → GPT-4

Performance Characteristics

Embedding Search Performance

| Operation | Latency (p50) | Latency (p95) | Notes |
|---|---|---|---|
| Single partition search | 15ms | 35ms | Typical query with K=5 |
| Full collection search | 450ms | 800ms | Searches all partitions (avoid!) |
| Embedding generation | 50ms | 100ms | 384-dim vector from text |
| Chunk insertion | 20ms | 40ms | Per-batch insert into Milvus |

Optimization: Partition-based search is 30x faster than full collection search.


LLM Response Times

| Model | Latency (p50) | Latency (p95) | Throughput |
|---|---|---|---|
| GPT-3.5 Turbo 16K | 800ms | 1.5s | ~30 tokens/sec |
| GPT-4-0613 | 2.5s | 4.5s | ~15 tokens/sec |
| GPT-4o Mini | 1.2s | 2.8s | ~25 tokens/sec |
| Claude 3.5 Sonnet | 2.0s | 3.5s | ~20 tokens/sec |
| Gemini 2.0 Flash | 900ms | 1.8s | ~28 tokens/sec |

Note: Latencies measured for ~200-token responses. Actual times vary with response length.


TTS Performance

| Metric | Value | Notes |
|---|---|---|
| TTS Latency | 300-800ms | For 1-2 sentence responses |
| Audio Format | WAV, 16kHz | ~320KB per 20 seconds |
| Retry Logic | 3 attempts | With exponential backoff |
| Concurrent TTS | Async | Non-blocking generation |

Cost Analysis

Per-Request Cost Breakdown

Scenario: User asks a question with RAG context

| Component | Cost | Calculation |
|---|---|---|
| Embedding (query) | < $0.0001 | ~100 tokens × $0.0001/1K tokens |
| Milvus Search | $0 | Self-hosted; compute cost absorbed |
| LLM (GPT-3.5) | $0.006 | Input: ~1,500 tokens, Output: ~500 tokens |
| TTS (if voice) | $0.016 | ~1,000 characters × $16/1M chars |
| Total (text) | ~$0.006 | |
| Total (voice) | ~$0.022 | |

Monthly Cost Projections:

| Usage | Requests/Month | Text Cost | Voice Cost |
|---|---|---|---|
| Light | 10,000 | $60 | $220 |
| Medium | 100,000 | $600 | $2,200 |
| Heavy | 1,000,000 | $6,000 | $22,000 |
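
These projections are straight multiplication of request volume by the per-request totals above. A rough estimator, using the table's assumed rates rather than billing data:

# Assumed per-request costs from the tables above (USD)
TEXT_COST_PER_REQUEST = 0.006    # query embedding + GPT-3.5 call
VOICE_COST_PER_REQUEST = 0.022   # text cost + neural TTS

def monthly_cost(requests_per_month: int, voice: bool = False) -> float:
    """Rough monthly spend for a given request volume."""
    per_request = VOICE_COST_PER_REQUEST if voice else TEXT_COST_PER_REQUEST
    return requests_per_month * per_request

print(monthly_cost(100_000))              # ~600.0  (Medium, text)
print(monthly_cost(100_000, voice=True))  # ~2200.0 (Medium, voice)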

Cost Optimization Strategies:

  1. Route simple queries to GPT-3.5 instead of GPT-4 (4x cheaper)
  2. Cache frequent Q&A pairs (bypass LLM entirely)
  3. Compress embeddings to 256-dim if latency allows
  4. Use Llama for cost-sensitive customers (Azure ML hosting)
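
A minimal sketch of strategy 2, caching answers in Redis keyed by a hash of the normalized question (the key prefix and 24-hour TTL are illustrative choices, not the production scheme):

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def _cache_key(question: str) -> str:
    """Normalize the question and hash it into a cache key."""
    return "qa:" + hashlib.sha256(question.strip().lower().encode()).hexdigest()

def cached_answer(question: str) -> str | None:
    """Return a cached answer for this question, or None on a cache miss."""
    return cache.get(_cache_key(question))

def store_answer(question: str, answer: str) -> None:
    """Cache the generated answer for 24 hours, bypassing the LLM on repeats."""
    cache.setex(_cache_key(question), 24 * 3600, answer)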

Data Architecture

Data Flow

flowchart LR
    A[User Uploads PDF] --> B[Data Crawling Service]
    B --> C[Text Extraction]
    C --> D[Chunking<br/>1000/200]
    D --> E[Embedding Generation<br/>384-dim]
    E --> F[(Milvus<br/>Vector DB)]

    G[User Query] --> H[Query Embedding<br/>384-dim]
    H --> F
    F --> I[Top-K Results]
    I --> J[LLM Service]
    J --> K[Response]

    L[(MongoDB<br/>Metadata)] --> B
    E --> L

    style F fill:#FFF3E0
    style J fill:#F3E5F5

Data Storage Distribution

| Data Type | Database | Size Estimate | Retention |
|---|---|---|---|
| Vectors (embeddings) | Milvus | ~1.5KB per chunk | Until document deleted |
| Document metadata | MongoDB | ~2KB per document | Permanent |
| Chunk metadata | MongoDB | ~500B per chunk | Permanent |
| Chat history | MongoDB | ~1KB per message | 90 days |
| Audio files (TTS) | Azure Blob | ~15KB per response | 7 days |

Security & Privacy

PII Handling in RAG

Critical Rule: User PII is NEVER embedded or sent to LLMs.

Implementation:

import re

def sanitize_for_llm(text: str) -> str:
    """
    Remove PII before embedding/LLM processing:
    - Email addresses
    - Phone numbers
    - Credit card numbers
    - Government IDs
    """
    # Email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Phone numbers (10 digits with optional separators)
    text = re.sub(r'\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b', '[PHONE]', text)
    # Credit card numbers (16 digits, plain or in groups of 4)
    text = re.sub(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b', '[CARD]', text)
    return text

Embedding Anonymization:

  • User queries are embedded but NOT stored in Milvus
  • Only knowledge base content (PDFs, FAQs) is embedded and stored
  • All embeddings tagged with user_id and project_id for access control

Data Isolation:

  • Each chatbot's vectors in separate Milvus partition
  • MongoDB enforces user_id filtering on all queries (see the sketch after this list)
  • No cross-tenant data leakage possible
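
A sketch of the MongoDB-side guard; the `documents` collection and database name are assumptions, but the pattern is the same for every query path:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["machineavatars"]  # illustrative database name

def get_documents(user_id: str, project_id: str) -> list[dict]:
    """All reads are scoped by user_id + project_id; there is no unscoped query path."""
    return list(db.documents.find({"user_id": user_id, "project_id": project_id}))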

Monitoring & Observability

Key Metrics

| Metric | Monitoring Tool | Alert Threshold |
|---|---|---|
| LLM Latency | DataDog | p95 > 5s |
| Embedding Search Latency | DataDog | p95 > 100ms |
| Milvus Connection Health | Loki Logs | Connection failures |
| LLM Error Rate | DataDog | > 1% |
| TTS Error Rate | DataDog | > 0.5% |
| Token Usage | OpenAI Dashboard | Daily spike > 50% |

Logging

# Structured logging for every LLM call
logger.info("Calling GPT-4 API...", extra={
    "model": "gpt-4-0613",
    "user_id": user_id,
    "project_id": project_id,
    "input_tokens": estimated_input_tokens,
    "timestamp": datetime.utcnow().isoformat()
})

logger.info("✓ GPT-4 API call successful", extra={
    "latency_ms": latency,
    "output_tokens": response_tokens,
    "total_cost_usd": calculated_cost
})

Future Enhancements

Roadmap

Q1 2025:

  • Fine-tuned embedding model (domain-specific)
  • Hybrid search (vector + BM25 keyword search)
  • Response caching layer (Redis)

Q2 2025:

  • Multi-modal RAG (images, tables from PDFs)
  • Streaming LLM responses
  • Advanced reranking (cross-encoder)

Q3 2025:

  • Custom LLM fine-tuning on customer data
  • Agentic workflows (multi-step reasoning)
  • Evaluation framework (human feedback loop)


Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead
Review Cycle: Quarterly or per major model update


"Multi-model brilliance: The right LLM for every conversation."