Monitoring & Observability¶
Section: 3-product-architecture
Document: Complete Observability Stack
Audience: DevOps, SRE, Platform Engineers
Last Updated: 2025-12-30
🎯 Overview¶
Comprehensive monitoring and observability strategy for MachineAvatars platform, covering logging, metrics, tracing, and alerting.
Observability Pillars:
- 📝 Logging: Centralized logs (Loki)
- 📊 Metrics: System and application metrics
- 🔍 Tracing: Distributed request tracing (planned)
- 🚨 Alerting: Proactive incident detection
📝 Logging¶
Loki Stack¶
Architecture: Promtail tails container log files and pushes them to Loki; Grafana queries Loki for search and dashboards.
Components:
- Loki: Log aggregation database
- Promtail: Log shipper on each container
- Grafana: Visualization and querying
Configuration:
loki:
  ingestion_rate_mb: 50
  retention_period: 90d
  chunk_target_size: 1536000

promtail:
  positions:
    filename: /tmp/positions.yaml
  scrape_configs:
    - job_name: containers
      static_configs:
        - targets:
            - localhost
          labels:
            job: containerlogs
            __path__: /var/log/containers/*.log
Log Levels¶
Standard levels across all services:
| Level | Usage | Example |
|---|---|---|
| DEBUG | Development only | Variable values, detailed flow |
| INFO | Normal operations | "User logged in", "Chatbot created" |
| WARN | Potential issues | "High latency detected", "Retry attempt 2/3" |
| ERROR | Errors requiring attention | "Failed to connect to MongoDB", "API call failed" |
| FATAL | System crashes | "Out of memory", "Cannot start service" |
Structured Logging¶
JSON format:
{
  "timestamp": "2025-12-30T10:30:15.123Z",
  "level": "INFO",
  "service": "response-3d",
  "user_id": "User-123456",
  "project_id": "User-123456_Project_1",
  "trace_id": "abc123xyz",
  "message": "Generated chatbot response",
  "duration_ms": 1234,
  "model": "gpt-4-turbo",
  "tokens": 450
}
Benefits:
- Easy to query (LogQL)
- Automatic field extraction
- JSON parsing in Grafana
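The JSON format above can be produced with a small logging formatter. The sketch below is illustrative; the `JsonFormatter` class and the chosen field set are ours, not an existing library API:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line matching the schema above."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "response-3d",
            "message": record.getMessage(),
        }
        # Optional context fields are passed through logging's `extra` dict.
        for key in ("user_id", "project_id", "trace_id", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


logger = logging.getLogger("response-3d")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Generated chatbot response",
            extra={"user_id": "User-123456", "duration_ms": 1234})
```

Because every record is one JSON object per line, Promtail can ship it unmodified and Grafana's LogQL `| json` stage extracts the fields automatically.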
Log Retention¶
| Environment | Retention | Storage |
|---|---|---|
| Production | 90 days | Azure Blob (Cool tier) |
| Staging | 30 days | Local storage |
| Development | 7 days | Local storage |
Compliance logs: 1 year (GDPR, audit requirements)
📊 Metrics¶
System Metrics (Infrastructure)¶
Collected by Azure Monitor:
Compute:
- CPU utilization (%)
- Memory usage (%)
- Disk I/O (IOPS, throughput)
- Network I/O (bytes in/out)
Per Service:
- Active replicas
- Restart count
- Container health status
Databases:
- Cosmos DB:
  - RU/s consumption
  - Request latency (p50, p95, p99)
  - Throttled requests
  - Storage size
- Milvus:
  - Query latency
  - Insert throughput
  - Memory usage
  - Collection count
Application Metrics (RED Method)¶
Rate, Errors, Duration for all endpoints:
Rate (Requests): requests per second, per service and endpoint
Errors: failed requests per second, and as a percentage of total traffic
Duration: request latency distribution (p50, p95, p99)
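Computing the three RED signals from a window of request records can be sketched in plain Python (the `red_summary` function and record shape are illustrative, not an existing API):

```python
from statistics import quantiles


def red_summary(requests, window_seconds):
    """Compute the three RED signals over a window of request records.

    Each record is a dict with `status` (HTTP code) and `duration_ms`.
    """
    if not requests:
        return {"rate_rps": 0.0, "error_pct": 0.0, "p95_ms": 0.0}
    total = len(requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    durations = sorted(r["duration_ms"] for r in requests)
    # quantiles(n=100) returns 99 cut points; index 94 is the p95.
    p95 = quantiles(durations, n=100)[94] if total >= 2 else durations[0]
    return {
        "rate_rps": total / window_seconds,
        "error_pct": 100.0 * errors / total,
        "p95_ms": p95,
    }
```

In production these signals would come from pre-aggregated counters and histograms rather than raw records, but the definitions are the same.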
Business Metrics¶
User Engagement:
- Daily active users (DAU)
- Monthly active users (MAU)
- New signups per day
- Chatbots created per day
Chatbot Usage:
- Total conversations per day
- Conversations per chatbot
- Average response time
- User satisfaction (thumbs up/down)
Revenue:
- New subscriptions (Free → Pro)
- Monthly recurring revenue (MRR)
- Churn rate
AI/ML:
- LLM API calls per model
- Total tokens consumed
- Cost per conversation
- RAG hit rate (%)
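The AI/ML metrics above reduce to simple aggregations. The sketch below shows one way to derive them; the per-1K-token prices are placeholders, not real provider pricing, and all names are ours:

```python
# Placeholder prices per 1K tokens -- NOT real provider pricing.
PRICE_PER_1K_TOKENS = {"gpt-4-turbo": 0.01, "gpt-3.5-turbo": 0.001}


def llm_cost_usd(calls):
    """Sum token spend over (model, tokens) call records."""
    return sum(tokens / 1000 * PRICE_PER_1K_TOKENS[model]
               for model, tokens in calls)


def cost_per_conversation(calls, conversations):
    """Average LLM spend per conversation over a period."""
    return llm_cost_usd(calls) / conversations if conversations else 0.0


def rag_hit_rate(grounded, total):
    """Percentage of responses grounded in at least one retrieved chunk."""
    return 100.0 * grounded / total if total else 0.0
```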
Metrics Storage¶
Prometheus (Planned Q1 2025):
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "container-apps"
    static_configs:
      - targets:
          - "response-3d:8011"
          - "auth-service:8001"
          # ... all services
Current: Azure Monitor Metrics
Retention:
- 1 minute resolution: 30 days
- 5 minute resolution: 90 days
- 1 hour resolution: 2 years
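Moving from 1-minute to 5-minute (or 1-hour) resolution is a downsampling step. This is an illustrative sketch of averaging-based downsampling, not Azure Monitor's actual aggregation algorithm:

```python
def downsample(points, factor):
    """Average consecutive groups of `factor` samples.

    E.g. factor=5 turns 1-minute samples into 5-minute averages,
    shrinking storage while keeping the trend.
    """
    return [sum(points[i:i + factor]) / len(points[i:i + factor])
            for i in range(0, len(points), factor)]
```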
🔍 Distributed Tracing (Planned)¶
OpenTelemetry Integration¶
Architecture:
Service A → Service B → Service C
↓ ↓ ↓
Trace Context Propagation
↓ ↓ ↓
OpenTelemetry Collector
↓
Jaeger/Tempo
↓
Grafana
Trace Example:
Trace ID: abc123xyz
├─ Span: Gateway (100ms)
│ └─ Span: Auth Service (20ms)
├─ Span: Response 3D Service (1200ms)
│ ├─ Span: Milvus Search (35ms)
│ ├─ Span: LLM API Call (900ms)
│ └─ Span: Azure TTS (250ms)
└─ Span: Save to MongoDB (15ms)
Benefits:
- End-to-end request visibility
- Performance bottleneck identification
- Error correlation across services
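A span tree like the example above can be inspected programmatically once traces are collected. One useful check is how much of a parent span's time its direct children fail to account for (the function and record shape are ours, for illustration):

```python
def unaccounted_ms(span):
    """Time inside a span not covered by its direct children.

    `span` is {"name": str, "duration_ms": int, "children": [...]}.
    A large unaccounted share points at work that is not yet instrumented.
    """
    return span["duration_ms"] - sum(c["duration_ms"] for c in span["children"])
```

Applied to the Response 3D span above (1200 ms with children 35 + 900 + 250 ms), 15 ms remains unaccounted, which is small enough to ignore.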
🚨 Alerting¶
Alert Categories¶
Infrastructure Alerts:
| Alert | Threshold | Severity | Channel |
|---|---|---|---|
| High CPU | >80% for 5min | Warning | |
| Critical CPU | >95% for 2min | Critical | Email + Slack |
| High Memory | >90% for 5min | Warning | |
| Out of Memory | >98% | Critical | Email + Slack + PagerDuty |
| Disk Full | >85% | Warning | |
| Container Crash | Restart >3 in 10min | Critical | Slack + PagerDuty |
Application Alerts:
| Alert | Threshold | Severity | Channel |
|---|---|---|---|
| High Error Rate | >5% errors | Warning | Email + Slack |
| Critical Errors | >20% errors | Critical | Slack + PagerDuty |
| High Latency | p95 >3s | Warning | |
| Critical Latency | p95 >10s | Critical | Slack |
| No Requests | 0 req for 5min | Warning | |
| Database Down | Connection failed | Critical | Slack + PagerDuty |
Business Alerts:
| Alert | Threshold | Severity | Channel |
|---|---|---|---|
| No Signups | 0 signups for 1 hour | Warning | |
| Payment Failures | >10% failed payments | Critical | Email + Slack |
| High Churn | >5% monthly churn | Warning | Email (monthly) |
Alert Routing¶
Channels:
- Email: All alerts
- Slack (#alerts): Warning and Critical
- PagerDuty: Critical only (24/7 on-call)
On-Call Schedule:
Week Schedule (rotating):
- Primary: DevOps Engineer
- Secondary: Backend Lead
- Escalation: CTO
Escalation Path:
1. Primary (immediate)
2. Secondary (after 15 min)
3. Escalation (after 30 min)
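The escalation path above is easy to encode as data; this is a minimal sketch (names and structure are ours):

```python
# Minutes an alert may stay unacknowledged before each role is paged,
# mirroring the escalation path above.
ESCALATION = [(0, "primary"), (15, "secondary"), (30, "escalation")]


def paged_roles(minutes_unacknowledged):
    """Return every role that should have been paged by now."""
    return [role for after, role in ESCALATION
            if minutes_unacknowledged >= after]
```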
📊 Dashboards¶
1. Infrastructure Dashboard¶
Metrics:
- Container Apps: CPU, Memory, Replicas
- Databases: RU/s, Latency, Storage
- Network: Bandwidth, Errors
- Cost: Daily spend by service
Grafana Panels:
┌────────────────────────────────────────┐
│ CPU Utilization (All Services) │
│ [Line Chart - Time Series] │
└────────────────────────────────────────┘
┌────────────────────────────────────────┐
│ Memory Usage (All Services) │
│ [Line Chart - Stack] │
└────────────────────────────────────────┘
┌──────────────┬─────────────────────────┐
│ Active │ Cosmos DB RU/s │
│ Replicas │ Consumption │
│ [Gauge] │ [Line Chart] │
└──────────────┴─────────────────────────┘
2. Application Dashboard¶
Metrics:
- Request rate (req/sec)
- Error rate (%)
- Response latency (p50, p95, p99)
- LLM API calls
- Token usage
Key Panels:
┌────────────────────────────────────────┐
│ Request Rate (per service) │
│ [Line Chart - Multi-series] │
└────────────────────────────────────────┘
┌────────────────────────────────────────┐
│ Error Rate % │
│ [Line Chart - Red threshold at 5%] │
└────────────────────────────────────────┘
┌──────────────┬─────────────────────────┐
│ p95 Latency │ LLM Token Usage │
│ [Line Chart] │ [Bar Chart - by model] │
└──────────────┴─────────────────────────┘
3. Business Dashboard¶
Metrics:
- Daily/Monthly active users
- New signups
- Conversations per day
- Revenue (MRR)
- Top chatbots by usage
Executive View:
┌──────────┬──────────┬──────────┬──────────┐
│ DAU │ MAU │ New │ MRR │
│ 1,234 │ 5,678 │ Signups │ $12,345 │
│ [+5%] │ [+12%] │ 45 │ [+8%] │
└──────────┴──────────┴──────────┴──────────┘
┌────────────────────────────────────────┐
│ Conversations per Day │
│ [Line Chart - 30 day trend] │
└────────────────────────────────────────┘
✅ Health Checks¶
Container Health Probes¶
Liveness Probe:
livenessProbe:
  httpGet:
    path: /health
    port: 8011
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
Response:
{
  "status": "healthy",
  "timestamp": "2025-12-30T10:30:00Z",
  "service": "response-3d",
  "version": "v1.2.3"
}
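These probe settings imply a detection window worth knowing. Assuming Kubernetes-style probe semantics, a rough upper bound can be sketched as (the function is ours, for illustration):

```python
def worst_case_detection_s(period_s, timeout_s, failure_threshold):
    """Rough upper bound on time to mark a hung container unhealthy:
    `failure_threshold` consecutive probes must fail, probes fire every
    `period_s`, and the last failing probe may take up to `timeout_s`."""
    return failure_threshold * period_s + timeout_s
```

With the liveness values above (period 10 s, timeout 5 s, threshold 3), a hung container is restarted within roughly 35 seconds of failing.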
Readiness Probe:
readinessProbe:
  httpGet:
    path: /ready
    port: 8011
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
Response:
{
  "ready": true,
  "dependencies": {
    "mongodb": "connected",
    "milvus": "connected",
    "blob_storage": "connected"
  }
}
Dependency Health Checks¶
MongoDB:
def check_mongodb():
    # `client` is the service's shared pymongo MongoClient instance.
    try:
        client.admin.command('ping')
        return "healthy"
    except Exception as e:
        return f"unhealthy: {str(e)}"
Milvus:
def check_milvus():
    # `milvus` is the service's shared Milvus connection handle.
    try:
        milvus.list_collections()
        return "healthy"
    except Exception as e:
        return f"unhealthy: {str(e)}"
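The individual dependency checks can be aggregated into the `/ready` payload shown earlier. A minimal sketch, assuming each check returns "healthy" or "unhealthy: <reason>" as above (the `readiness` function is ours):

```python
def readiness(checks):
    """Run dependency checks and build a /ready payload.

    `checks` maps dependency name -> zero-arg function returning
    "healthy" or "unhealthy: <reason>".
    """
    deps = {name: ("connected" if check() == "healthy" else "disconnected")
            for name, check in checks.items()}
    return {
        "ready": all(state == "connected" for state in deps.values()),
        "dependencies": deps,
    }
```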
🔧 Incident Management¶
Incident Response Process¶
1. Detection (Auto or Manual)
- Alert fired
- User report
- Monitoring dashboard anomaly
2. Triage (< 5 minutes)
- Assess severity (P0-P3)
- Identify affected services
- Assign on-call engineer
3. Investigation (< 15 minutes for P0)
- Check logs, metrics, traces
- Identify root cause
- Determine fix approach
4. Mitigation (< 30 minutes for P0)
- Apply hotfix
- Rollback deployment
- Scale resources
- Enable circuit breaker
5. Resolution
- Verify fix in production
- Monitor for 1 hour
- Update status page
6. Post-Mortem (within 48 hours)
- Document timeline
- Root cause analysis
- Action items to prevent recurrence
- Blameless culture
Incident Severity Levels¶
| Level | Definition | Response Time | Escalation |
|---|---|---|---|
| P0 - Critical | Complete outage, data loss | Immediate | CTO + entire team |
| P1 - High | Major feature down | <15 min | On-call + backend lead |
| P2 - Medium | Degraded performance | <1 hour | On-call engineer |
| P3 - Low | Minor issue, workaround exists | <24 hours | Next business day |
📈 Performance Monitoring¶
Service Level Indicators (SLIs)¶
API Availability: percentage of requests answered with a non-5xx response
API Latency: time to serve a request, tracked at p50/p95/p99
Error Rate: percentage of requests that return a 5xx response
Service Level Objectives (SLOs)¶
| Service | Availability SLO | Latency SLO (p95) | Error Rate SLO |
|---|---|---|---|
| Gateway | 99.95% | <100ms | <0.5% |
| Auth | 99.9% | <200ms | <1% |
| Response 3D | 99.5% | <3s | <2% |
| Response Text | 99.7% | <1s | <1.5% |
| LLM Service | 99.0% | <2s | <5% (external dependency) |
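Each availability SLO translates into an error budget, the downtime you may spend per window before the SLO is breached. A quick sketch of the arithmetic (function name is ours):

```python
def error_budget_minutes(availability_slo, days=30):
    """Allowed downtime per rolling window for an availability SLO."""
    return (1.0 - availability_slo) * days * 24 * 60
```

For example, the Auth service's 99.9% SLO allows roughly 43 minutes of downtime per 30 days, while the Gateway's 99.95% allows only about 22 minutes.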
Performance Budgets¶
Response Time Budgets:
- Gateway routing: 50ms
- Database query: 100ms
- Milvus search: 50ms
- LLM API call: 2000ms
- TTS generation: 500ms
- Total budget: 3000ms
If budget exceeded:
- Investigate bottleneck
- Optimize queries
- Add caching
- Scale resources
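Checking measured timings against the budgets above can be automated; this is a sketch with our own names, not an existing tool:

```python
# Component budgets (ms) from the list above.
BUDGETS = {
    "gateway_routing": 50,
    "database_query": 100,
    "milvus_search": 50,
    "llm_api_call": 2000,
    "tts_generation": 500,
}
TOTAL_BUDGET_MS = 3000


def budget_check(measured_ms):
    """Return components over their budget, and whether the total holds."""
    over = {name: ms for name, ms in measured_ms.items()
            if ms > BUDGETS.get(name, 0)}
    total_ok = sum(measured_ms.values()) <= TOTAL_BUDGET_MS
    return over, total_ok
```

Note that the component budgets sum to 2,700 ms, leaving 300 ms of headroom under the 3,000 ms total.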
🔗 Related Documentation¶
- Infrastructure
"Observability is about asking questions you didn't know you had." 📊🔍