
Monitoring & Observability

Section: 3-product-architecture
Document: Complete Observability Stack
Audience: DevOps, SRE, Platform Engineers
Last Updated: 2025-12-30


🎯 Overview

Comprehensive monitoring and observability strategy for the MachineAvatars platform, covering logging, metrics, tracing, and alerting.

Observability Pillars:

  • 📝 Logging: Centralized logs (Loki)
  • 📊 Metrics: System and application metrics
  • 🔍 Tracing: Distributed request tracing (planned)
  • 🚨 Alerting: Proactive incident detection

📝 Logging

Loki Stack

Architecture:

Container Apps → Promtail → Loki → Grafana

Components:

  • Loki: Log aggregation database
  • Promtail: Log shipper on each container
  • Grafana: Visualization and querying

Configuration:

loki:
  ingestion_rate_mb: 50
  retention_period: 90d
  chunk_target_size: 1536000

promtail:
  positions_file: /tmp/positions.yaml
  scrape_configs:
    - job_name: containers
      static_configs:
        - targets:
            - localhost
          labels:
            job: containerlogs
            __path__: /var/log/containers/*log

Log Levels

Standard levels across all services:

| Level | Usage | Example |
|-------|-------|---------|
| DEBUG | Development only | Variable values, detailed flow |
| INFO | Normal operations | "User logged in", "Chatbot created" |
| WARN | Potential issues | "High latency detected", "Retry attempt 2/3" |
| ERROR | Errors requiring attention | "Failed to connect to MongoDB", "API call failed" |
| FATAL | System crashes | "Out of memory", "Cannot start service" |

Structured Logging

JSON format:

{
  "timestamp": "2025-12-30T10:30:15.123Z",
  "level": "INFO",
  "service": "response-3d",
  "user_id": "User-123456",
  "project_id": "User-123456_Project_1",
  "trace_id": "abc123xyz",
  "message": "Generated chatbot response",
  "duration_ms": 1234,
  "model": "gpt-4-turbo",
  "tokens": 450
}

Benefits:

  • Easy to query (LogQL)
  • Automatic field extraction
  • JSON parsing in Grafana
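
A service can emit this shape with a small stdlib `logging` formatter. This is a minimal sketch; the field names mirror the example above and are illustrative, not a fixed schema:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Structured fields we pick up when passed via `extra=` (illustrative set)
    EXTRA_FIELDS = ("user_id", "project_id", "trace_id", "duration_ms", "model", "tokens")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

# Demo: log one structured line into an in-memory stream
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("response-3d")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Generated chatbot response",
            extra={"service": "response-3d", "user_id": "User-123456", "duration_ms": 1234})
entry = json.loads(stream.getvalue())
```

Because every record is one JSON object per line, Loki's LogQL `json` parser can extract fields like `duration_ms` automatically.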

Log Retention

| Environment | Retention | Storage |
|-------------|-----------|---------|
| Production | 90 days | Azure Blob (Cool tier) |
| Staging | 30 days | Local storage |
| Development | 7 days | Local storage |

Compliance logs: 1 year (GDPR, audit requirements)


📊 Metrics

System Metrics (Infrastructure)

Collected by Azure Monitor:

Compute:

  • CPU utilization (%)
  • Memory usage (%)
  • Disk I/O (IOPS, throughput)
  • Network I/O (bytes in/out)

Per Service:

  • Active replicas
  • Restart count
  • Container health status

Databases:

  • Cosmos DB:
      • RU/s consumption
      • Request latency (p50, p95, p99)
      • Throttled requests
      • Storage size
  • Milvus:
      • Query latency
      • Insert throughput
      • Memory usage
      • Collection count

Application Metrics (RED Method)

Rate, Errors, Duration for all endpoints:

Rate (Requests):

http_requests_total{service="response-3d", endpoint="/get-response-3d", method="POST"}

Errors:

http_requests_failed_total{service="response-3d", status_code="500"}

Duration:

http_request_duration_seconds{service="response-3d", quantile="0.95"}
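
The same bookkeeping can be sketched in-process with only the stdlib; in production these series would live in `prometheus_client` Counters and Histograms under the metric names above. The class and method names here are our own, illustrative ones:

```python
from collections import defaultdict

class RedTracker:
    """Minimal in-process RED (Rate, Errors, Duration) bookkeeping."""

    def __init__(self):
        self.requests = defaultdict(int)    # (service, endpoint) -> request count
        self.errors = defaultdict(int)      # (service, status_code) -> error count
        self.durations = defaultdict(list)  # (service, endpoint) -> latency samples (s)

    def observe(self, service, endpoint, status_code, seconds):
        key = (service, endpoint)
        self.requests[key] += 1
        self.durations[key].append(seconds)
        if status_code >= 500:
            self.errors[(service, status_code)] += 1

    def p95(self, service, endpoint):
        samples = sorted(self.durations[(service, endpoint)])
        if not samples:
            return 0.0
        # nearest-rank 95th percentile
        return samples[max(0, round(0.95 * len(samples)) - 1)]

# Demo: three requests, one of them a server error
red = RedTracker()
red.observe("response-3d", "/get-response-3d", 200, 0.8)
red.observe("response-3d", "/get-response-3d", 200, 1.1)
red.observe("response-3d", "/get-response-3d", 500, 2.9)
```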

Business Metrics

User Engagement:

  • Daily active users (DAU)
  • Monthly active users (MAU)
  • New signups per day
  • Chatbots created per day

Chatbot Usage:

  • Total conversations per day
  • Conversations per chatbot
  • Average response time
  • User satisfaction (thumbs up/down)

Revenue:

  • New subscriptions (Free → Pro)
  • Monthly recurring revenue (MRR)
  • Churn rate

AI/ML:

  • LLM API calls per model
  • Total tokens consumed
  • Cost per conversation
  • RAG hit rate (%)
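
Cost per conversation is simple arithmetic over the token metrics above. In this sketch the per-1K-token rate is a parameter, not an actual vendor price, and the function name is ours:

```python
def cost_per_conversation(total_tokens, usd_per_1k_tokens, conversations):
    """Average LLM spend per conversation over a reporting window."""
    if conversations == 0:
        return 0.0
    return (total_tokens / 1000) * usd_per_1k_tokens / conversations
```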

Metrics Storage

Prometheus (Planned Q1 2025):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "container-apps"
    static_configs:
      - targets:
          - "response-3d:8011"
          - "auth-service:8001"
          # ... all services

Current: Azure Monitor Metrics

Retention:

  • 1 minute resolution: 30 days
  • 5 minute resolution: 90 days
  • 1 hour resolution: 2 years

🔍 Distributed Tracing (Planned)

OpenTelemetry Integration

Architecture:

Service A → Service B → Service C
     ↓           ↓           ↓
  Trace Context Propagation
     ↓           ↓           ↓
    OpenTelemetry Collector
         Jaeger/Tempo
           Grafana

Trace Example:

Trace ID: abc123xyz
├─ Span: Gateway (100ms)
│  └─ Span: Auth Service (20ms)
├─ Span: Response 3D Service (1200ms)
│  ├─ Span: Milvus Search (35ms)
│  ├─ Span: LLM API Call (900ms)
│  └─ Span: Azure TTS (250ms)
└─ Span: Save to MongoDB (15ms)

Benefits:

  • End-to-end request visibility
  • Performance bottleneck identification
  • Error correlation across services
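
A toy, stdlib-only sketch of what trace-context propagation means: child spans inherit the parent's trace ID via a context variable. OpenTelemetry automates all of this (including propagation across service boundaries); the span names follow the trace example above:

```python
import contextvars
import time
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)

class Span:
    """Toy span recorder illustrating parent/child trace structure."""

    def __init__(self, name):
        self.name = name
        self.children = []
        parent = _current_span.get()
        # Child spans inherit the trace id; a root span starts a new trace
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        if parent is not None:
            parent.children.append(self)

    def __enter__(self):
        self._token = _current_span.set(self)
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.perf_counter() - self._start) * 1000
        _current_span.reset(self._token)
        return False

# Demo: part of the request tree from the trace example above
with Span("gateway") as root:
    with Span("auth-service"):
        pass
    with Span("response-3d") as r3d:
        with Span("milvus-search"):
            pass
        with Span("llm-api-call"):
            pass
```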

🚨 Alerting

Alert Categories

Infrastructure Alerts:

| Alert | Threshold | Severity | Channel |
|-------|-----------|----------|---------|
| High CPU | >80% for 5 min | Warning | Email |
| Critical CPU | >95% for 2 min | Critical | Email + Slack |
| High Memory | >90% for 5 min | Warning | Email |
| Out of Memory | >98% | Critical | Email + Slack + PagerDuty |
| Disk Full | >85% | Warning | Email |
| Container Crash | >3 restarts in 10 min | Critical | Slack + PagerDuty |

Application Alerts:

| Alert | Threshold | Severity | Channel |
|-------|-----------|----------|---------|
| High Error Rate | >5% errors | Warning | Email + Slack |
| Critical Errors | >20% errors | Critical | Slack + PagerDuty |
| High Latency | p95 >3s | Warning | Email |
| Critical Latency | p95 >10s | Critical | Slack |
| No Requests | 0 req for 5 min | Warning | Email |
| Database Down | Connection failed | Critical | Slack + PagerDuty |
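
As a sketch, the error-rate and latency thresholds above map onto severities like this (function names are illustrative, not existing alerting code):

```python
def classify_error_rate(error_pct):
    """Map an observed error-rate percentage onto the severities above."""
    if error_pct > 20:
        return "critical"
    if error_pct > 5:
        return "warning"
    return "ok"

def classify_latency(p95_seconds):
    """Map an observed p95 latency (seconds) onto the severities above."""
    if p95_seconds > 10:
        return "critical"
    if p95_seconds > 3:
        return "warning"
    return "ok"
```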

Business Alerts:

| Alert | Threshold | Severity | Channel |
|-------|-----------|----------|---------|
| No Signups | 0 signups for 1 hour | Warning | Email |
| Payment Failures | >10% failed payments | Critical | Email + Slack |
| High Churn | >5% monthly churn | Warning | Email (monthly) |

Alert Routing

Channels:

  • Email: All alerts
  • Slack (#alerts): Warning and Critical
  • PagerDuty: Critical only (24/7 on-call)

On-Call Schedule:

Week Schedule (rotating):
- Primary: DevOps Engineer
- Secondary: Backend Lead
- Escalation: CTO

Escalation Path:
1. Primary (immediate)
2. Secondary (after 15 min)
3. Escalation (after 30 min)

📊 Dashboards

1. Infrastructure Dashboard

Metrics:

  • Container Apps: CPU, Memory, Replicas
  • Databases: RU/s, Latency, Storage
  • Network: Bandwidth, Errors
  • Cost: Daily spend by service

Grafana Panels:

┌────────────────────────────────────────┐
│ CPU Utilization (All Services)         │
│ [Line Chart - Time Series]             │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│ Memory Usage (All Services)            │
│ [Line Chart - Stack]                   │
└────────────────────────────────────────┘

┌──────────────┬─────────────────────────┐
│ Active       │ Cosmos DB RU/s          │
│ Replicas     │ Consumption             │
│ [Gauge]      │ [Line Chart]            │
└──────────────┴─────────────────────────┘

2. Application Dashboard

Metrics:

  • Request rate (req/sec)
  • Error rate (%)
  • Response latency (p50, p95, p99)
  • LLM API calls
  • Token usage

Key Panels:

┌────────────────────────────────────────┐
│ Request Rate (per service)             │
│ [Line Chart - Multi-series]            │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│ Error Rate %                           │
│ [Line Chart - Red threshold at 5%]     │
└────────────────────────────────────────┘

┌──────────────┬─────────────────────────┐
│ p95 Latency  │ LLM Token Usage         │
│ [Line Chart] │ [Bar Chart - by model]  │
└──────────────┴─────────────────────────┘

3. Business Dashboard

Metrics:

  • Daily/Monthly active users
  • New signups
  • Conversations per day
  • Revenue (MRR)
  • Top chatbots by usage

Executive View:

┌──────────┬──────────┬──────────┬──────────┐
│ DAU      │ MAU      │ New      │ MRR      │
│ 1,234    │ 5,678    │ Signups  │ $12,345  │
│ [+5%]    │ [+12%]   │ 45       │ [+8%]    │
└──────────┴──────────┴──────────┴──────────┘

┌────────────────────────────────────────┐
│ Conversations per Day                  │
│ [Line Chart - 30 day trend]            │
└────────────────────────────────────────┘

✅ Health Checks

Container Health Probes

Liveness Probe:

livenessProbe:
  httpGet:
    path: /health
    port: 8011
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-30T10:30:00Z",
  "service": "response-3d",
  "version": "v1.2.3"
}

Readiness Probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 8011
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Response:

{
  "ready": true,
  "dependencies": {
    "mongodb": "connected",
    "milvus": "connected",
    "blob_storage": "connected"
  }
}

Dependency Health Checks

MongoDB:

def check_mongodb():
    # `client` is the shared pymongo MongoClient created at service startup
    try:
        client.admin.command('ping')  # cheap round-trip to the server
        return "healthy"
    except Exception as e:
        return f"unhealthy: {str(e)}"

Milvus:

def check_milvus():
    # `milvus` is the connected pymilvus client created at service startup
    try:
        milvus.list_collections()  # lightweight call that exercises the connection
        return "healthy"
    except Exception as e:
        return f"unhealthy: {str(e)}"
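
These per-dependency checks can be folded into the /ready payload shown earlier. This aggregator and its stubbed checks are a sketch, not the services' actual wiring:

```python
def readiness_payload(checks):
    """Aggregate dependency checks into the /ready response shape.

    `checks` maps a dependency name to a zero-arg callable returning
    "healthy" or "unhealthy: <reason>".
    """
    dependencies = {}
    for name, check in checks.items():
        status = check()
        dependencies[name] = "connected" if status == "healthy" else status
    return {
        "ready": all(v == "connected" for v in dependencies.values()),
        "dependencies": dependencies,
    }

# Demo with stubbed checks: one healthy dependency, one failing
payload = readiness_payload({
    "mongodb": lambda: "healthy",
    "milvus": lambda: "unhealthy: connection refused",
})
```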

🔧 Incident Management

Incident Response Process

1. Detection (Auto or Manual)

  • Alert fired
  • User report
  • Monitoring dashboard anomaly

2. Triage (< 5 minutes)

  • Assess severity (P0-P3)
  • Identify affected services
  • Assign on-call engineer

3. Investigation (< 15 minutes for P0)

  • Check logs, metrics, traces
  • Identify root cause
  • Determine fix approach

4. Mitigation (< 30 minutes for P0)

  • Apply hotfix
  • Rollback deployment
  • Scale resources
  • Enable circuit breaker

5. Resolution

  • Verify fix in production
  • Monitor for 1 hour
  • Update status page

6. Post-Mortem (within 48 hours)

  • Document timeline
  • Root cause analysis
  • Action items to prevent recurrence
  • Blameless culture

Incident Severity Levels

| Level | Definition | Response Time | Escalation |
|-------|------------|---------------|------------|
| P0 - Critical | Complete outage, data loss | Immediate | CTO + entire team |
| P1 - High | Major feature down | <15 min | On-call + backend lead |
| P2 - Medium | Degraded performance | <1 hour | On-call engineer |
| P3 - Low | Minor issue, workaround exists | <24 hours | Next business day |

📈 Performance Monitoring

Service Level Indicators (SLIs)

API Availability:

SLI = (Successful Requests / Total Requests) × 100
Target: 99.9% (3 nines)

API Latency:

SLI = % of requests < 3 seconds (p95)
Target: 95% of requests

Error Rate:

SLI = (Failed Requests / Total Requests) × 100
Target: < 1%
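
A quick, illustrative calculation of the availability SLI and the remaining error budget (function names are ours, not an existing library): a 99.9% SLO over 1,000,000 requests allows 1,000 failures.

```python
def availability_sli(successful, total):
    """Availability SLI as a percentage."""
    return 100.0 * successful / total

def error_budget_remaining(slo_pct, successful, total):
    """Fraction of the SLO's error budget still unspent (0.0 means exhausted)."""
    allowed_failures = total * (1 - slo_pct / 100.0)
    failures = total - successful
    return max(0.0, 1.0 - failures / allowed_failures)
```

For example, 500 failures against a 99.9% SLO over a million requests leaves half the error budget.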

Service Level Objectives (SLOs)

| Service | Availability SLO | Latency SLO (p95) | Error Rate SLO |
|---------|------------------|-------------------|----------------|
| Gateway | 99.95% | <100ms | <0.5% |
| Auth | 99.9% | <200ms | <1% |
| Response 3D | 99.5% | <3s | <2% |
| Response Text | 99.7% | <1s | <1.5% |
| LLM Service | 99.0% | <2s | <5% (external dependency) |

Performance Budgets

Response Time Budgets:

  • Gateway routing: 50ms
  • Database query: 100ms
  • Milvus search: 50ms
  • LLM API call: 2000ms
  • TTS generation: 500ms
  • Total budget: 3000ms

If the budget is exceeded:

  • Investigate bottleneck
  • Optimize queries
  • Add caching
  • Scale resources
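
The per-stage budgets above can be encoded as a simple lookup to flag which stage blew its allocation; `over_budget` and the stage keys are hypothetical helpers, not existing tooling:

```python
# Per-stage latency budgets from the table above (milliseconds)
BUDGETS_MS = {
    "gateway_routing": 50,
    "database_query": 100,
    "milvus_search": 50,
    "llm_api_call": 2000,
    "tts_generation": 500,
}
TOTAL_BUDGET_MS = 3000

def over_budget(observed_ms):
    """Return the stages whose observed latency exceeds its individual budget."""
    return {stage: ms for stage, ms in observed_ms.items()
            if ms > BUDGETS_MS.get(stage, float("inf"))}

# Demo: one stage (the LLM call) overruns its 2000 ms budget
overruns = over_budget({"milvus_search": 35, "llm_api_call": 2400, "tts_generation": 250})
```

Note the per-stage budgets sum to 2700 ms, leaving 300 ms of headroom inside the 3000 ms total.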


"Observability is about asking questions you didn't know you had." 📊🔍