AI/ML Architecture Overview¶
Purpose: Complete documentation of MachineAvatars AI/ML implementation, model strategy, and RAG architecture
Audience: ML Engineers, Technical Auditors, Investors, CTO
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0
Status: Active
Executive Summary¶
MachineAvatars is a multi-LLM AI platform with advanced RAG capabilities, serving 3D, text, and voice chatbots backed by state-of-the-art language models and vector search infrastructure.
Key AI Capabilities:
- ✅ Multi-Provider LLM Strategy - 10 models across Azure OpenAI, Azure ML, Anthropic, Google, and xAI
- ✅ Production RAG - Partition-based Milvus vector search with 384-dim embeddings
- ✅ Voice Synthesis - Azure Neural TTS with 400+ voices across 100+ languages
- ✅ Multi-Tenancy - Isolated vector partitions per chatbot for security & performance
- ✅ Intelligent Chunking - 1000-character segments with 200-character overlap
AI/ML Technology Stack¶
Language Models (LLMs)¶
Primary Models (Azure OpenAI):
| Model | Deployment Name | Use Case | Context Window | Cost Profile |
|---|---|---|---|---|
| GPT-4-0613 | gpt-4-0613 | Complex reasoning, high-quality responses | 8K tokens | High |
| GPT-3.5 Turbo 16K | gpt-35-turbo-16k-0613 | Cost-effective, fast responses | 16K tokens | Low |
| GPT-4o Mini | gpt-4o-mini-2024-07-18 | Balanced cost/performance | 128K tokens | Medium |
Additional Models (Multi-Cloud Strategy):
| Provider | Model | Endpoint | Primary Use |
|---|---|---|---|
| Azure | Llama 3.3 70B Instruct | Azure ML Endpoint | Open-source alternative |
| Azure | DeepSeek R1 | Azure ML Endpoint | Reasoning tasks |
| Azure | Ministral 3B | Azure ML Endpoint | Lightweight responses |
| Azure | Phi-3 Small 8K | Azure ML Endpoint | Edge deployment experiments |
| Google | Gemini 2.0 Flash | GCP Vertex AI | Multimodal capabilities |
| Anthropic | Claude 3.5 Sonnet | Anthropic API | Complex analysis |
| xAI | Grok-3 | Azure Endpoint | Real-time information |
Why Multi-Provider?
- Risk Mitigation: No single provider dependence
- Cost Optimization: Use cheapest model for each task
- Performance Optimization: Route to fastest model when latency-critical
- Feature Access: Unique capabilities per provider (e.g., Gemini multimodal)
Vector Database: Milvus¶
Configuration:
# Milvus Embeddings Service - Partition-Based Architecture
Host: localhost (Production: Azure Container Apps)
Port: 19530
Collection: "embeddings"
Dimensions: 384 # Compact embeddings for faster search
Metric: L2 Distance (Euclidean)
Index Type: IVF_FLAT
Index Parameters: {"nlist": 128}
Search Parameters: {"nprobe": 10}
Architecture Pattern: Partition-Based Multi-Tenancy
Each chatbot gets its own partition within the embeddings collection:
embeddings/
├── User_123_Project_1/ ← Partition
│ ├── chunk_1 (384-dim vector)
│ ├── chunk_2 (384-dim vector)
│ └── ...
├── User_123_Project_2/ ← Partition
│ ├── chunk_1 (384-dim vector)
│ └── ...
└── User_456_Project_3/ ← Partition
└── ...
Why Partition-Based?
- ✅ Performance: Search only relevant partition (10-100x faster)
- ✅ Isolation: User data never cross-contaminates
- ✅ Deletion: Drop entire partition instantly vs. slow scalar deletes
- ✅ Scalability: Add partitions without affecting existing data
Schema:
| Field | Type | Description |
|---|---|---|
| id | INT64 | Auto-generated primary key |
| document_id | VARCHAR(100) | MongoDB document reference |
| user_id | VARCHAR(100) | User identifier |
| project_id | VARCHAR(100) | Chatbot/project ID (partition key) |
| chunk_index | INT32 | Chunk position in document |
| text | VARCHAR(2000) | Original text chunk |
| embedding | FLOAT_VECTOR(384) | Embedding vector |
| data_type | VARCHAR(50) | pdf, text, qa, org, url |
| source_url | VARCHAR(500) | Source URL for website crawls |
| created_at | VARCHAR(100) | ISO timestamp |
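For concreteness, the configuration block and schema above can be expressed as a minimal pymilvus sketch. This is an illustration of the stated settings (384-dim FLOAT_VECTOR, IVF_FLAT with nlist=128, one partition per chatbot), not the production provisioning code; the partition name uses placeholder IDs.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
connections.connect(host="localhost", port="19530")
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="document_id", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="user_id", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="project_id", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="chunk_index", dtype=DataType.INT32),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
    FieldSchema(name="data_type", dtype=DataType.VARCHAR, max_length=50),
    FieldSchema(name="source_url", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="created_at", dtype=DataType.VARCHAR, max_length=100),
]
collection = Collection("embeddings", CollectionSchema(fields))
# IVF_FLAT index with L2 metric, matching the configuration block above
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
# One partition per chatbot, e.g. User_123_Project_1 (placeholder IDs)
partition_name = "User_123_Project_1"
if not collection.has_partition(partition_name):
    collection.create_partition(partition_name)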
Text-to-Speech (TTS): Azure Neural TTS¶
Provider: Azure Cognitive Services
API Version: Latest
Voices: 400+ neural voices across 100+ languages
Popular Voices:
| Voice ID | Language | Gender | Style Support |
|---|---|---|---|
| en-US-JennyNeural | English (US) | Female | Conversational, assistant, chat |
| en-US-GuyNeural | English (US) | Male | Professional, newscast |
| en-IN-NeerjaNeural | English (India) | Female | Friendly |
| hi-IN-SwaraNeural | Hindi (India) | Female | Multi-style |
Audio Format: WAV (16kHz, 16-bit, mono)
Implementation (see the sketch after this list):
- Asynchronous TTS generation
- Auto-retry logic (3 attempts)
- Temporary file management
- Voice customization per chatbot
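A minimal sketch of this behaviour with the Azure Speech SDK is shown below. It assumes the azure-cognitiveservices-speech package; the environment-variable names and the simple backoff loop are illustrative, not taken from the production service.
import os
import time
import azure.cognitiveservices.speech as speechsdk
def synthesize(text: str, voice: str = "en-US-JennyNeural",
               output_path: str = "response.wav", attempts: int = 3) -> bool:
    # Credential env-var names below are placeholders, not the production names.
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"],
    )
    speech_config.speech_synthesis_voice_name = voice
    # WAV, 16 kHz, 16-bit, mono -- the audio format stated above
    speech_config.set_speech_synthesis_output_format(
        speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm
    )
    audio_config = speechsdk.audio.AudioOutputConfig(filename=output_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )
    for attempt in range(attempts):
        result = synthesizer.speak_text_async(text).get()
        if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
            return True
        time.sleep(2 ** attempt)  # simple exponential backoff between retries
    return False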
RAG (Retrieval-Augmented Generation) Architecture¶
Overview¶
flowchart TB
A[User Query] --> B[Query Preprocessing]
B --> C[Generate Embedding<br/>384-dim]
C --> D[Vector Search in Milvus]
D --> E[Top-K Chunks<br/>typically K=5]
E --> F[Rerank & Filter]
F --> G[Context Assembly]
G --> H[Prompt Construction]
H --> I[LLM Generation]
I --> J[Response to User]
K[(Knowledge Base<br/>PDF, Text, QA, URLs)] --> L[Chunking Service]
L --> M[Embedding Service]
M --> D
style D fill:#FFF3E0
style I fill:#F3E5F5
style J fill:#E8F5E9
Data Ingestion Pipeline¶
Step 1: Document Processing
# Supported formats
- PDF files (text extraction)
- Plain text files
- Website URLs (crawled content)
- Q&A pairs (direct embeddings)
- Organization data
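The extraction library is not named in this document; for the PDF path, a minimal sketch assuming pypdf looks like this (URLs and the other formats follow the same extract-then-chunk pattern):
from pypdf import PdfReader
def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page; pages with no extractable text yield ''."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)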
Step 2: Text Chunking
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200, max_chunks: int = 50):
    """
    Split text into overlapping chunks.
    Chunk Size: 1000 characters
    Overlap: 200 characters (20% overlap for context preservation)
    Max Chunks per Document: 50 (to prevent memory issues)
    """
    chunks = []
    for start in range(0, len(text), chunk_size - overlap):
        if len(chunks) >= max_chunks:
            break  # enforce the per-document chunk cap
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        chunks.append({
            "chunk_index": len(chunks),
            "content": chunk,
            "start_pos": start,
            "end_pos": end,
            "length": len(chunk)
        })
    return chunks
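A quick check of the overlap arithmetic with the function above (the sample string is illustrative):
sample = "x" * 2500
chunks = chunk_text(sample)
print(len(chunks))                                   # 4 chunks, starting at 0, 800, 1600, 2400
print(chunks[1]["start_pos"], chunks[1]["end_pos"])  # 800 1800 -> 200 characters shared with chunk 0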
Why 1000/200?
- 1000 chars ≈ 250 tokens (well within embedding model limits)
- 200-char overlap preserves sentence context across boundaries
- Balance between granularity and search relevance
Step 3: Embedding Generation
# Embedding Model: Default (likely sentence-transformers or Azure OpenAI embeddings)
# Dimensions: 384
# Each chunk → 384-dimensional vector
for chunk in chunks:
embedding = embedder.embed([chunk['content']])[0]
chunk['embedding'] = [float(x) for x in embedding] # 384 floats
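The exact embedding model is not pinned here; 384 dimensions is consistent with compact sentence-transformers models such as all-MiniLM-L6-v2, which is used below purely to illustrate the embedder interface.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")         # illustrative choice, not confirmed
vectors = model.encode([chunk["content"] for chunk in chunks])
assert vectors.shape[1] == 384                          # one 384-dim vector per chunk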
Step 4: Milvus Insertion
# Insert into partition-based Milvus
milvus_embeddings.insert_embeddings(
collection_name="embeddings",
embeddings_data=[{
"document_id": doc_id,
"user_id": user_id,
"project_id": chatbot_id, # Becomes partition name
"chunk_index": chunk['chunk_index'],
"text": chunk['content'][:2000],
"embedding": chunk['embedding'], # 384-dim vector
"data_type": "pdf", # or text, qa, url, org
"source_url": source_url or ""
}]
)
# Auto-creates the partition if it doesn't exist
# Partition naming: User_{user_id}_Project_{project_id}
Query-Time RAG Flow¶
Step 1: User Query → Embedding
user_query = "How do I reset my password?"
query_embedding = embedder.embed([user_query])[0] # 384-dim vector
Step 2: Vector Search (Partition-Scoped)
search_results = milvus_embeddings.search_embeddings(
collection_name="embeddings",
query_vector=query_embedding,
user_id=user_id,
project_id=chatbot_id, # Searches ONLY this partition!
top_k=5, # Return top 5 most similar chunks
milvus_ids=None # Optional: search within specific docs
)
# Returns:
# [
# {
# "milvus_id": 12345,
# "document_id": "doc_abc",
# "chunk_index": 3,
# "text": "To reset your password, go to Settings...",
# "distance": 0.23, # L2 distance (lower = more similar)
# "score": 0.81 # Similarity score (1 / (1 + distance))
# },
# ...
# ]
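Under the hood, a partition-scoped pymilvus search with the configured parameters (L2 metric, nprobe=10, top_k=5) looks roughly like the sketch below; the wrapper's exact internals are not shown in this document, so treat this as an illustration.
from pymilvus import Collection
collection = Collection("embeddings")
collection.load()                                   # partitions must be loaded before searching
partition = f"User_{user_id}_Project_{chatbot_id}"  # same naming scheme as ingestion
results = collection.search(
    data=[query_embedding],                         # 384-dim query vector from Step 1
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,                                        # top_k
    partition_names=[partition],                    # search ONLY this tenant's partition
    output_fields=["document_id", "chunk_index", "text"],
)
for hit in results[0]:
    similarity = 1 / (1 + hit.distance)             # same score transform as shown above
    print(hit.entity.get("text"), round(hit.distance, 2), round(similarity, 2))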
Why Top-K = 5?
- Balances context richness vs. token costs
- 5 chunks × ~250 tokens = ~1250 tokens of context
- Leaves room for conversation history + system prompts
Step 3: Context Assembly
# Build context from top results
context_chunks = [result['text'] for result in search_results[:5]]
context = "\n\n---\n\n".join(context_chunks)
# Usually includes:
# - Chunk text
# - Source metadata (document name, page number)
# - Relevance score (for LLM confidence assessment)
Step 4: Prompt Construction
system_prompt = f"""
You are a helpful assistant. Answer the user's question based ONLY on the following context.
If the answer is not in the context, say "I don't have enough information to answer that."
Context:
{context}
"""
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_query}
]
Step 5: LLM Generation
# Route to appropriate model
response = llm_service.call_model(
model_name="openai-35", # or gpt-4, depending on complexity
messages=messages
)
final_answer = response['response']
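Internally, a call like this typically maps onto the Azure OpenAI chat completions API. The sketch below assumes the openai Python SDK (v1+); the environment-variable names and API version are illustrative, since the wrapper's implementation is not shown here.
import os
from openai import AzureOpenAI
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],         # placeholder env-var names
    api_version="2024-02-01",                           # illustrative API version
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)
completion = client.chat.completions.create(
    model="gpt-35-turbo-16k-0613",  # Azure deployment name from the table above
    messages=messages,              # system + user messages built in Step 4
)
final_answer = completion.choices[0].message.content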
Model Selection Strategy¶
Decision Matrix¶
We use a routing strategy based on query characteristics:
| Scenario | Model Choice | Rationale |
|---|---|---|
| Simple FAQ | GPT-3.5 Turbo 16K | Fast (< 1s), low cost ($0.0015/1K tokens) |
| Complex reasoning | GPT-4-0613 | Higher quality, better at multi-step logic |
| Long context | GPT-4o Mini | 128K context window, cost-effective |
| Real-time data | Grok-3 | Internet-connected |
| Multimodal | Gemini 2.0 Flash | Can process images/video |
| Code generation | Claude 3.5 Sonnet | Superior code understanding |
| Cost-sensitive | Llama 3.3 70B | Open-source, Azure ML hosting |
Routing Logic (Simplified)¶
def select_model(query: str, chatbot_config: dict):
# Check chatbot-level model preference
if chatbot_config.get("preferred_model"):
return chatbot_config["preferred_model"]
# Default routing
query_length = len(query.split())
if query_length < 20:
return "openai-35" # Simple query → fast, cheap
elif "code" in query.lower() or "function" in query.lower():
return "Claude sonnet 4" # Code task → Claude
elif query_length > 500:
return "openai-4o-mini" # Long context → GPT-4o mini
else:
return "openai-4" # Complex query → GPT-4
Performance Characteristics¶
Embedding Search Performance¶
| Operation | Latency (p50) | Latency (p95) | Notes |
|---|---|---|---|
| Single partition search | 15ms | 35ms | Typical query with K=5 |
| Full collection search | 450ms | 800ms | Searches all partitions (avoid!) |
| Embedding generation | 50ms | 100ms | 384-dim vector from text |
| Chunk insertion | 20ms | 40ms | Per-batch insert into Milvus |
Optimization: Partition-based search is 30x faster than full collection search.
LLM Response Times¶
| Model | Latency (p50) | Latency (p95) | Throughput |
|---|---|---|---|
| GPT-3.5 Turbo 16K | 800ms | 1.5s | ~30 tokens/sec |
| GPT-4-0613 | 2.5s | 4.5s | ~15 tokens/sec |
| GPT-4o Mini | 1.2s | 2.8s | ~25 tokens/sec |
| Claude 3.5 Sonnet | 2.0s | 3.5s | ~20 tokens/sec |
| Gemini 2.0 Flash | 900ms | 1.8s | ~28 tokens/sec |
Note: Latencies measured for ~200-token responses. Actual times vary with response length.
TTS Performance¶
| Metric | Value | Notes |
|---|---|---|
| TTS Latency | 300-800ms | For 1-2 sentence responses |
| Audio Format | WAV, 16kHz | ~320KB per 20 seconds |
| Retry Logic | 3 attempts | With exponential backoff |
| Concurrent TTS | Async | Non-blocking generation |
Cost Analysis¶
Per-Request Cost Breakdown¶
Scenario: User asks a question with RAG context
| Component | Cost | Calculation |
|---|---|---|
| Embedding (query) | ~$0.00001 | ~100 tokens × $0.0001/1K tokens |
| Milvus Search | $0 | Self-hosted; compute cost absorbed |
| LLM (GPT-3.5) | ~$0.006 | Input: ~1500 tokens, Output: ~500 tokens |
| TTS (if voice) | ~$0.016 | ~1,000 characters × $16/1M chars |
| Total (text) | ~$0.006 | |
| Total (voice) | ~$0.022 | |
Monthly Cost Projections:
| Usage | Requests/Month | Text Cost | Voice Cost |
|---|---|---|---|
| Light | 10,000 | $60 | $220 |
| Medium | 100,000 | $600 | $2,200 |
| Heavy | 1,000,000 | $6,000 | $22,000 |
Cost Optimization Strategies:
- Route simple queries to GPT-3.5 instead of GPT-4 (4x cheaper)
- Cache frequent Q&A pairs to bypass the LLM entirely (see the sketch after this list)
- Compress embeddings to 256-dim if latency allows
- Use Llama for cost-sensitive customers (Azure ML hosting)
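As a hedged illustration of the caching strategy above, the sketch below uses Redis keyed on a hash of the normalized query; the key scheme, TTL, and the answer_with_rag() helper are hypothetical, not taken from the codebase.
import hashlib
import redis
cache = redis.Redis(host="localhost", port=6379, db=0)
def cached_answer(chatbot_id: str, query: str, ttl_seconds: int = 3600) -> str:
    key = f"qa:{chatbot_id}:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()                        # cache hit: no embedding, search, or LLM cost
    answer = answer_with_rag(chatbot_id, query)    # hypothetical call into the full RAG pipeline
    cache.set(key, answer, ex=ttl_seconds)         # expire stale answers after the TTL
    return answer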
Data Architecture¶
Data Flow¶
flowchart LR
A[User Uploads PDF] --> B[Data Crawling Service]
B --> C[Text Extraction]
C --> D[Chunking<br/>1000/200]
D --> E[Embedding Generation<br/>384-dim]
E --> F[(Milvus<br/>Vector DB)]
G[User Query] --> H[Query Embedding<br/>384-dim]
H --> F
F --> I[Top-K Results]
I --> J[LLM Service]
J --> K[Response]
L[(MongoDB<br/>Metadata)] --> B
E --> L
style F fill:#FFF3E0
style J fill:#F3E5F5
Data Storage Distribution¶
| Data Type | Database | Size Estimate | Retention |
|---|---|---|---|
| Vectors (embeddings) | Milvus | ~1.5KB per chunk | Until document deleted |
| Document metadata | MongoDB | ~2KB per document | Permanent |
| Chunk metadata | MongoDB | ~500B per chunk | Permanent |
| Chat history | MongoDB | ~1KB per message | 90 days |
| Audio files (TTS) | Azure Blob | ~15KB per response | 7 days |
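The retention windows above (90-day chat history, 7-day audio) are commonly enforced with TTL mechanisms; for MongoDB, a TTL index like the pymongo sketch below would do it. Collection and field names are assumptions, not the production schema, and Azure Blob retention would be handled separately via a lifecycle-management policy.
from datetime import datetime, timezone
from pymongo import MongoClient
db = MongoClient("mongodb://localhost:27017")["machineavatars"]    # placeholder URI / db name
# Documents are removed automatically ~90 days after their created_at timestamp.
db.chat_history.create_index("created_at", expireAfterSeconds=90 * 24 * 3600)
db.chat_history.insert_one({
    "user_id": "123",
    "message": "Hello!",
    "created_at": datetime.now(timezone.utc),   # TTL indexes require a BSON date field
})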
Security & Privacy¶
PII Handling in RAG¶
Critical Rule: User PII is NEVER embedded or sent to LLMs.
Implementation:
import re
def sanitize_for_llm(text: str) -> str:
    """
    Remove PII before embedding/LLM processing:
    - Email addresses
    - Phone numbers
    - Credit card numbers
    - Government IDs (patterns omitted from this excerpt)
    """
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    text = re.sub(r'\b\d{16}\b', '[CARD]', text)
    return text
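A quick usage check of the sanitizer above (the sample contact details are made up):
sanitize_for_llm("Contact me at jane.doe@example.com or 555-123-4567.")
# -> "Contact me at [EMAIL] or [PHONE]."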
Embedding Anonymization:
- User queries are embedded but NOT stored in Milvus
- Only knowledge base content (PDFs, FAQs) is embedded and stored
- All embeddings tagged with user_id and project_id for access control
Data Isolation:
- Each chatbot's vectors in separate Milvus partition
- MongoDB enforces user_id filtering on all queries
- No cross-tenant data leakage possible
Monitoring & Observability¶
Key Metrics¶
| Metric | Monitoring Tool | Alert Threshold |
|---|---|---|
| LLM Latency | DataDog | p95 > 5s |
| Embedding Search Latency | DataDog | p95 > 100ms |
| Milvus Connection Health | Loki Logs | Connection failures |
| LLM Error Rate | DataDog | > 1% |
| TTS Error Rate | DataDog | > 0.5% |
| Token Usage | OpenAI Dashboard | Daily spike > 50% |
Logging¶
# Structured logging for every LLM call
logger.info("Calling GPT-4 API...", extra={
"model": "gpt-4-0613",
"user_id": user_id,
"project_id": project_id,
"input_tokens": estimated_input_tokens,
"timestamp": datetime.utcnow().isoformat()
})
logger.info("✓ GPT-4 API call successful", extra={
"latency_ms": latency,
"output_tokens": response_tokens,
"total_cost_usd": calculated_cost
})
Future Enhancements¶
Roadmap¶
Q1 2025:
- Fine-tuned embedding model (domain-specific)
- Hybrid search (vector + BM25 keyword search)
- Response caching layer (Redis)
Q2 2025:
- Multi-modal RAG (images, tables from PDFs)
- Streaming LLM responses
- Advanced reranking (cross-encoder)
Q3 2025:
- Custom LLM fine-tuning on customer data
- Agentic workflows (multi-step reasoning)
- Evaluation framework (human feedback loop)
Related Documentation¶
Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead
Review Cycle: Quarterly or per major model update
"Multi-model brilliance: The right LLM for every conversation."