
Monitoring & Observability

Section: 3-product-architecture
Document: Complete Observability Stack
Audience: DevOps, SRE, Platform Engineers
Last Updated: 2025-12-30


🎯 Overview

Comprehensive monitoring and observability strategy for the MachineAvatars platform, covering logging, metrics, tracing, and alerting.

Observability Pillars:

  • 📝 Logging: Centralized logs (Loki)
  • 📊 Metrics: System and application metrics
  • 🔍 Tracing: Distributed request tracing (planned)
  • 🚨 Alerting: Proactive incident detection

📝 Logging

Loki Stack

Architecture:

Container Apps → Promtail → Loki → Grafana

Components:

  • Loki: Log aggregation database
  • Promtail: Log shipper on each container
  • Grafana: Visualization and querying

Configuration:

loki:
  ingestion_rate_mb: 50
  retention_period: 90d
  chunk_target_size: 1536000

promtail:
  positions_file: /tmp/positions.yaml
  scrape_configs:
    - job_name: containers
      static_configs:
        - targets:
            - localhost
          labels:
            job: containerlogs
            __path__: /var/log/containers/*log

Log Levels

Standard levels across all services:

| Level | Usage | Example |
|-------|-------|---------|
| DEBUG | Development only | Variable values, detailed flow |
| INFO | Normal operations | "User logged in", "Chatbot created" |
| WARN | Potential issues | "High latency detected", "Retry attempt 2/3" |
| ERROR | Errors requiring attention | "Failed to connect to MongoDB", "API call failed" |
| FATAL | System crashes | "Out of memory", "Cannot start service" |

Structured Logging

JSON format:

{
  "timestamp": "2025-12-30T10:30:15.123Z",
  "level": "INFO",
  "service": "response-3d",
  "user_id": "User-123456",
  "project_id": "User-123456_Project_1",
  "trace_id": "abc123xyz",
  "message": "Generated chatbot response",
  "duration_ms": 1234,
  "model": "gpt-4-turbo",
  "tokens": 450
}

Benefits:

  • Easy to query (LogQL)
  • Automatic field extraction
  • JSON parsing in Grafana
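
A service can emit this shape with a small stdlib `logging` formatter. This is a minimal sketch; the field names mirror the example above and are illustrative, not a fixed schema:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Structured fields we pick up when passed via `extra=` (illustrative set)
    EXTRA_FIELDS = ("user_id", "project_id", "trace_id", "duration_ms", "model", "tokens")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

# Demo: log one structured line into an in-memory stream
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("response-3d")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Generated chatbot response",
            extra={"service": "response-3d", "user_id": "User-123456", "duration_ms": 1234})
entry = json.loads(stream.getvalue())
```

Because every record is one JSON object per line, Loki's LogQL `json` parser can extract fields like `duration_ms` automatically.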

Log Retention

| Environment | Retention | Storage |
|-------------|-----------|---------|
| Production | 90 days | Azure Blob (Cool tier) |
| Staging | 30 days | Local storage |
| Development | 7 days | Local storage |

Compliance logs: 1 year (GDPR, audit requirements)


📊 Metrics

System Metrics (Infrastructure)

Collected by Azure Monitor:

Compute:

  • CPU utilization (%)
  • Memory usage (%)
  • Disk I/O (IOPS, throughput)
  • Network I/O (bytes in/out)

Per Service:

  • Active replicas
  • Restart count
  • Container health status

Databases:

  • Cosmos DB:
      • RU/s consumption
      • Request latency (p50, p95, p99)
      • Throttled requests
      • Storage size
  • Milvus:
      • Query latency
      • Insert throughput
      • Memory usage
      • Collection count

Application Metrics (RED Method)

Rate, Errors, Duration for all endpoints:

Rate (Requests):

http_requests_total{service="response-3d", endpoint="/get-response-3d", method="POST"}

Errors:

http_requests_failed_total{service="response-3d", status_code="500"}

Duration:

http_request_duration_seconds{service="response-3d", quantile="0.95"}
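
The same bookkeeping can be sketched in-process with only the stdlib; in production these series would live in `prometheus_client` Counters and Histograms under the metric names above. The class and method names here are our own, illustrative ones:

```python
from collections import defaultdict

class RedTracker:
    """Minimal in-process RED (Rate, Errors, Duration) bookkeeping."""

    def __init__(self):
        self.requests = defaultdict(int)    # (service, endpoint) -> request count
        self.errors = defaultdict(int)      # (service, status_code) -> error count
        self.durations = defaultdict(list)  # (service, endpoint) -> latency samples (s)

    def observe(self, service, endpoint, status_code, seconds):
        key = (service, endpoint)
        self.requests[key] += 1
        self.durations[key].append(seconds)
        if status_code >= 500:
            self.errors[(service, status_code)] += 1

    def p95(self, service, endpoint):
        samples = sorted(self.durations[(service, endpoint)])
        if not samples:
            return 0.0
        # nearest-rank 95th percentile
        return samples[max(0, round(0.95 * len(samples)) - 1)]

# Demo: three requests, one of them a server error
red = RedTracker()
red.observe("response-3d", "/get-response-3d", 200, 0.8)
red.observe("response-3d", "/get-response-3d", 200, 1.1)
red.observe("response-3d", "/get-response-3d", 500, 2.9)
```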

Business Metrics

User Engagement:

  • Daily active users (DAU)
  • Monthly active users (MAU)
  • New signups per day
  • Chatbots created per day

Chatbot Usage:

  • Total conversations per day
  • Conversations per chatbot
  • Average response time
  • User satisfaction (thumbs up/down)

Revenue:

  • New subscriptions (Free → Pro)
  • Monthly recurring revenue (MRR)
  • Churn rate

AI/ML:

  • LLM API calls per model
  • Total tokens consumed
  • Cost per conversation
  • RAG hit rate (%)
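
Cost per conversation is simple arithmetic over the token metrics above. In this sketch the per-1K-token rate is a parameter, not an actual vendor price, and the function name is ours:

```python
def cost_per_conversation(total_tokens, usd_per_1k_tokens, conversations):
    """Average LLM spend per conversation over a reporting window."""
    if conversations == 0:
        return 0.0
    return (total_tokens / 1000) * usd_per_1k_tokens / conversations
```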

Metrics Storage

Prometheus (Planned Q1 2025):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "container-apps"
    static_configs:
      - targets:
          - "response-3d:8011"
          - "auth-service:8001"
          # ... all services

Current: Azure Monitor Metrics

Retention:

  • 1 minute resolution: 30 days
  • 5 minute resolution: 90 days
  • 1 hour resolution: 2 years

🔍 Distributed Tracing (Planned)

OpenTelemetry Integration

Architecture:

Service A → Service B → Service C
     ↓           ↓           ↓
  Trace Context Propagation
     ↓           ↓           ↓
    OpenTelemetry Collector
         Jaeger/Tempo
           Grafana

Trace Example:

Trace ID: abc123xyz
├─ Span: Gateway (100ms)
│  └─ Span: Auth Service (20ms)
├─ Span: Response 3D Service (1200ms)
│  ├─ Span: Milvus Search (35ms)
│  ├─ Span: LLM API Call (900ms)
│  └─ Span: Azure TTS (250ms)
└─ Span: Save to MongoDB (15ms)

Benefits:

  • End-to-end request visibility
  • Performance bottleneck identification
  • Error correlation across services
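
A toy, stdlib-only sketch of what trace-context propagation means: child spans inherit the parent's trace ID via a context variable. OpenTelemetry automates all of this (including propagation across service boundaries); the span names follow the trace example above:

```python
import contextvars
import time
import uuid

_current_span = contextvars.ContextVar("current_span", default=None)

class Span:
    """Toy span recorder illustrating parent/child trace structure."""

    def __init__(self, name):
        self.name = name
        self.children = []
        parent = _current_span.get()
        # Child spans inherit the trace id; a root span starts a new trace
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        if parent is not None:
            parent.children.append(self)

    def __enter__(self):
        self._token = _current_span.set(self)
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.duration_ms = (time.perf_counter() - self._start) * 1000
        _current_span.reset(self._token)
        return False

# Demo: part of the request tree from the trace example above
with Span("gateway") as root:
    with Span("auth-service"):
        pass
    with Span("response-3d") as r3d:
        with Span("milvus-search"):
            pass
        with Span("llm-api-call"):
            pass
```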

🚨 Alerting

Alert Categories

Infrastructure Alerts:

| Alert | Threshold | Severity | Channel |
|-------|-----------|----------|---------|
| High CPU | >80% for 5 min | Warning | Email |
| Critical CPU | >95% for 2 min | Critical | Email + Slack |
| High Memory | >90% for 5 min | Warning | Email |
| Out of Memory | >98% | Critical | Email + Slack + PagerDuty |
| Disk Full | >85% | Warning | Email |
| Container Crash | >3 restarts in 10 min | Critical | Slack + PagerDuty |

Application Alerts:

| Alert | Threshold | Severity | Channel |
|-------|-----------|----------|---------|
| High Error Rate | >5% errors | Warning | Email + Slack |
| Critical Errors | >20% errors | Critical | Slack + PagerDuty |
| High Latency | p95 >3s | Warning | Email |
| Critical Latency | p95 >10s | Critical | Slack |
| No Requests | 0 req for 5 min | Warning | Email |
| Database Down | Connection failed | Critical | Slack + PagerDuty |
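
As a sketch, the error-rate and latency thresholds above map onto severities like this (function names are illustrative, not existing alerting code):

```python
def classify_error_rate(error_pct):
    """Map an observed error-rate percentage onto the severities above."""
    if error_pct > 20:
        return "critical"
    if error_pct > 5:
        return "warning"
    return "ok"

def classify_latency(p95_seconds):
    """Map an observed p95 latency (seconds) onto the severities above."""
    if p95_seconds > 10:
        return "critical"
    if p95_seconds > 3:
        return "warning"
    return "ok"
```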

Business Alerts:

| Alert | Threshold | Severity | Channel |
|-------|-----------|----------|---------|
| No Signups | 0 signups for 1 hour | Warning | Email |
| Payment Failures | >10% failed payments | Critical | Email + Slack |
| High Churn | >5% monthly churn | Warning | Email (monthly) |

Alert Routing

Channels:

  • Email: All alerts
  • Slack (#alerts): Warning and Critical
  • PagerDuty: Critical only (24/7 on-call)

On-Call Schedule:

Week Schedule (rotating):
- Primary: DevOps Engineer
- Secondary: Backend Lead
- Escalation: CTO

Escalation Path:
1. Primary (immediate)
2. Secondary (after 15 min)
3. Escalation (after 30 min)

📊 Dashboards

1. Infrastructure Dashboard

Metrics:

  • Container Apps: CPU, Memory, Replicas
  • Databases: RU/s, Latency, Storage
  • Network: Bandwidth, Errors
  • Cost: Daily spend by service

Grafana Panels:

┌────────────────────────────────────────┐
│ CPU Utilization (All Services)         │
│ [Line Chart - Time Series]             │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│ Memory Usage (All Services)            │
│ [Line Chart - Stack]                   │
└────────────────────────────────────────┘

┌──────────────┬─────────────────────────┐
│ Active       │ Cosmos DB RU/s          │
│ Replicas     │ Consumption             │
│ [Gauge]      │ [Line Chart]            │
└──────────────┴─────────────────────────┘

2. Application Dashboard

Metrics:

  • Request rate (req/sec)
  • Error rate (%)
  • Response latency (p50, p95, p99)
  • LLM API calls
  • Token usage

Key Panels:

┌────────────────────────────────────────┐
│ Request Rate (per service)             │
│ [Line Chart - Multi-series]            │
└────────────────────────────────────────┘

┌────────────────────────────────────────┐
│ Error Rate %                           │
│ [Line Chart - Red threshold at 5%]     │
└────────────────────────────────────────┘

┌──────────────┬─────────────────────────┐
│ p95 Latency  │ LLM Token Usage         │
│ [Line Chart] │ [Bar Chart - by model]  │
└──────────────┴─────────────────────────┘

3. Business Dashboard

Metrics:

  • Daily/Monthly active users
  • New signups
  • Conversations per day
  • Revenue (MRR)
  • Top chatbots by usage

Executive View:

┌──────────┬──────────┬──────────┬──────────┐
│ DAU      │ MAU      │ New      │ MRR      │
│ 1,234    │ 5,678    │ Signups  │ $12,345  │
│ [+5%]    │ [+12%]   │ 45       │ [+8%]    │
└──────────┴──────────┴──────────┴──────────┘

┌────────────────────────────────────────┐
│ Conversations per Day                  │
│ [Line Chart - 30 day trend]            │
└────────────────────────────────────────┘

✅ Health Checks

Container Health Probes

Liveness Probe:

livenessProbe:
  httpGet:
    path: /health
    port: 8011
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Response:

{
  "status": "healthy",
  "timestamp": "2025-12-30T10:30:00Z",
  "service": "response-3d",
  "version": "v1.2.3"
}

Readiness Probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 8011
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

Response:

{
  "ready": true,
  "dependencies": {
    "mongodb": "connected",
    "milvus": "connected",
    "blob_storage": "connected"
  }
}

Dependency Health Checks

MongoDB:

def check_mongodb():
    # `client` is the shared pymongo MongoClient created at service startup
    try:
        client.admin.command('ping')  # cheap round-trip to the server
        return "healthy"
    except Exception as e:
        return f"unhealthy: {str(e)}"

Milvus:

def check_milvus():
    # `milvus` is the connected pymilvus client created at service startup
    try:
        milvus.list_collections()  # lightweight call that exercises the connection
        return "healthy"
    except Exception as e:
        return f"unhealthy: {str(e)}"
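
These per-dependency checks can be folded into the /ready payload shown earlier. This aggregator and its stubbed checks are a sketch, not the services' actual wiring:

```python
def readiness_payload(checks):
    """Aggregate dependency checks into the /ready response shape.

    `checks` maps a dependency name to a zero-arg callable returning
    "healthy" or "unhealthy: <reason>".
    """
    dependencies = {}
    for name, check in checks.items():
        status = check()
        dependencies[name] = "connected" if status == "healthy" else status
    return {
        "ready": all(v == "connected" for v in dependencies.values()),
        "dependencies": dependencies,
    }

# Demo with stubbed checks: one healthy dependency, one failing
payload = readiness_payload({
    "mongodb": lambda: "healthy",
    "milvus": lambda: "unhealthy: connection refused",
})
```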

🔧 Incident Management

Incident Response Process

1. Detection (Auto or Manual)

  • Alert fired
  • User report
  • Monitoring dashboard anomaly

2. Triage (< 5 minutes)

  • Assess severity (P0-P3)
  • Identify affected services
  • Assign on-call engineer

3. Investigation (< 15 minutes for P0)

  • Check logs, metrics, traces
  • Identify root cause
  • Determine fix approach

4. Mitigation (< 30 minutes for P0)

  • Apply hotfix
  • Rollback deployment
  • Scale resources
  • Enable circuit breaker

5. Resolution

  • Verify fix in production
  • Monitor for 1 hour
  • Update status page

6. Post-Mortem (within 48 hours)

  • Document timeline
  • Root cause analysis
  • Action items to prevent recurrence
  • Blameless culture

Incident Severity Levels

| Level | Definition | Response Time | Escalation |
|-------|------------|---------------|------------|
| P0 - Critical | Complete outage, data loss | Immediate | CTO + entire team |
| P1 - High | Major feature down | <15 min | On-call + backend lead |
| P2 - Medium | Degraded performance | <1 hour | On-call engineer |
| P3 - Low | Minor issue, workaround exists | <24 hours | Next business day |

📈 Performance Monitoring

Service Level Indicators (SLIs)

API Availability:

SLI = (Successful Requests / Total Requests) × 100
Target: 99.9% (3 nines)

API Latency:

SLI = % of requests < 3 seconds (p95)
Target: 95% of requests

Error Rate:

SLI = (Failed Requests / Total Requests) × 100
Target: < 1%
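
A quick, illustrative calculation of the availability SLI and the remaining error budget (function names are ours, not an existing library): a 99.9% SLO over 1,000,000 requests allows 1,000 failures.

```python
def availability_sli(successful, total):
    """Availability SLI as a percentage."""
    return 100.0 * successful / total

def error_budget_remaining(slo_pct, successful, total):
    """Fraction of the SLO's error budget still unspent (0.0 means exhausted)."""
    allowed_failures = total * (1 - slo_pct / 100.0)
    failures = total - successful
    return max(0.0, 1.0 - failures / allowed_failures)
```

For example, 500 failures against a 99.9% SLO over a million requests leaves half the error budget.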

Service Level Objectives (SLOs)

| Service | Availability SLO | Latency SLO (p95) | Error Rate SLO |
|---------|------------------|-------------------|----------------|
| Gateway | 99.95% | <100ms | <0.5% |
| Auth | 99.9% | <200ms | <1% |
| Response 3D | 99.5% | <3s | <2% |
| Response Text | 99.7% | <1s | <1.5% |
| LLM Service | 99.0% | <2s | <5% (external dependency) |

Performance Budgets

Response Time Budgets:

  • Gateway routing: 50ms
  • Database query: 100ms
  • Milvus search: 50ms
  • LLM API call: 2000ms
  • TTS generation: 500ms
  • Total budget: 3000ms

If the budget is exceeded:

  • Investigate bottleneck
  • Optimize queries
  • Add caching
  • Scale resources
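
The per-stage budgets above can be encoded as a simple lookup to flag which stage blew its allocation; `over_budget` and the stage keys are hypothetical helpers, not existing tooling:

```python
# Per-stage latency budgets from the table above (milliseconds)
BUDGETS_MS = {
    "gateway_routing": 50,
    "database_query": 100,
    "milvus_search": 50,
    "llm_api_call": 2000,
    "tts_generation": 500,
}
TOTAL_BUDGET_MS = 3000

def over_budget(observed_ms):
    """Return the stages whose observed latency exceeds its individual budget."""
    return {stage: ms for stage, ms in observed_ms.items()
            if ms > BUDGETS_MS.get(stage, float("inf"))}

# Demo: one stage (the LLM call) overruns its 2000 ms budget
overruns = over_budget({"milvus_search": 35, "llm_api_call": 2400, "tts_generation": 250})
```

Note the per-stage budgets sum to 2700 ms, leaving 300 ms of headroom inside the 3000 ms total.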


"Observability is about asking questions you didn't know you had." 📊🔍