Deployment & Operations¶

Section: 8-deployment-operations
Document: Complete DevOps Guide
Audience: DevOps Engineers, SREs, Developers
Last Updated: 2025-12-30

🎯 Overview¶

Deployment and operations guide for the MachineAvatars platform. It covers the platform's 23 microservices, the CI/CD pipeline, deployment strategies, and operational procedures.

Deployment Philosophy:

🤖 Automation First - Minimize manual interventions
🔒 Safety - Zero-downtime deployments, easy rollbacks
👀 Observability - Monitor everything, alert proactively
📖 Documentation - Runbooks for all procedures

🚀 Quick Links¶

Emergency Procedures¶

Service Down → Operational Runbooks
Database Issues → Database Operations
Disaster Recovery → Operational Runbooks

Common Operations¶

Scale Service → Operational Runbooks
Rotate Secrets → Operational Runbooks
Database Migration → Database Operations

🌍 Environments¶

Development (Local)¶

Purpose: Local development and testing
Services: Docker Compose (all 23 services)
Databases: MongoDB + Milvus (local containers)
Secrets: .env file (not committed)

QA/Staging¶

Purpose: Pre-production testing
Infrastructure: Azure (dedicated resource group)
Services: Azure Container Apps (1-2 replicas)
Databases: Dedicated Cosmos DB + shared Milvus

Production¶

Purpose: Live customer traffic
Infrastructure: Azure (multi-region)
Services: Azure Container Apps (2-10 replicas with auto-scaling)
Databases: Multi-region Cosmos DB + dedicated Milvus
High Availability: Yes (99.9% SLA)
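
As a rough illustration, the three environments above could be captured in a small lookup table that deployment scripts consult before scaling a service. This is only a sketch: the `Environment` class, its field names, and the `clamp_replicas` helper are hypothetical, and the single-replica range for local development is an assumption (the source gives replica ranges only for staging and production).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Environment:
    """Hypothetical description of one deployment environment."""
    name: str
    hosting: str
    min_replicas: int
    max_replicas: int
    databases: Tuple[str, ...]


# Replica ranges for staging and production mirror the text above.
# The 1-1 range for development is an assumption, not from the source.
ENVIRONMENTS = {
    "development": Environment(
        "development", "Docker Compose", 1, 1, ("MongoDB", "Milvus")
    ),
    "staging": Environment(
        "staging", "Azure Container Apps", 1, 2,
        ("Cosmos DB (dedicated)", "Milvus (shared)"),
    ),
    "production": Environment(
        "production", "Azure Container Apps", 2, 10,
        ("Cosmos DB (multi-region)", "Milvus (dedicated)"),
    ),
}


def clamp_replicas(env_name: str, requested: int) -> int:
    """Keep a requested replica count inside the environment's allowed range."""
    env = ENVIRONMENTS[env_name]
    return max(env.min_replicas, min(requested, env.max_replicas))
```

For example, `clamp_replicas("production", 50)` returns the production ceiling of 10 instead of passing an out-of-range value to a scaling command.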

🔄 Deployment Pipeline¶

Build → Test → Deploy Flow¶

graph LR
    A[Code Push] --> B[GitHub Actions]
    B --> C{Tests Pass?}
    C -->|No| D[Block Deployment]
    C -->|Yes| E[Build Docker Image]
    E --> F[Security Scan]
    F --> G{Vulnerabilities?}
    G -->|Critical| D
    G -->|None/Low| H[Push to ACR]
    H --> I{Target Environment}
    I -->|Staging| J[Auto-Deploy Staging]
    I -->|Production| K[Manual Approval]
    K --> L[Deploy Production]
    L --> M[Health Checks]
    M --> N{Healthy?}
    N -->|No| O[Auto-Rollback]
    N -->|Yes| P[Deployment Complete]

    style C fill:#FFF3E0
    style G fill:#FFCDD2
    style K fill:#C5E1A5
    style N fill:#FFF3E0

Pipeline Components:

Automated Testing - Unit, integration, security scans
Docker Build - Multi-stage builds for optimization
Registry Push - Azure Container Registry
Deployment - Azure Container Apps with health checks
Monitoring - Real-time alerts during deployment
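
The decision points in the flow diagram can be summarized as a small pure function, which may help when reasoning about what each gate blocks. This is a minimal sketch rather than the actual GitHub Actions workflow; the `Outcome` enum, the `pipeline_outcome` function, and its parameters are hypothetical names chosen for illustration.

```python
from enum import Enum


class Outcome(Enum):
    BLOCKED = "blocked"
    AWAITING_APPROVAL = "awaiting_approval"
    DEPLOYED = "deployed"
    ROLLED_BACK = "rolled_back"


def pipeline_outcome(
    tests_pass: bool,
    critical_vulnerabilities: int,
    target: str,
    approved: bool = False,
    healthy: bool = True,
) -> Outcome:
    """Mirror the gates in the diagram: tests, security scan, approval, health."""
    # Failing tests or any critical vulnerability block the deployment.
    if not tests_pass or critical_vulnerabilities > 0:
        return Outcome.BLOCKED
    # Staging deploys automatically once the image is pushed.
    if target == "staging":
        return Outcome.DEPLOYED
    if target == "production":
        # Production waits for a manual approval first.
        if not approved:
            return Outcome.AWAITING_APPROVAL
        # After deploying, failed health checks trigger an automatic rollback.
        return Outcome.DEPLOYED if healthy else Outcome.ROLLED_BACK
    raise ValueError(f"unknown target environment: {target!r}")
```

Keeping the gates in one function like this makes the ordering explicit: a release that fails the security scan never reaches the approval step, regardless of target.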

📊 Key Metrics (DORA)¶

Deployment Frequency¶

Current: 3-5 deploys/week
Target: Daily deployments
Elite: Multiple deploys/day

Lead Time for Changes¶

Current: 2-4 hours (code commit → production)
Target: <1 hour
Elite: <1 hour

Mean Time to Recovery (MTTR)¶

Current: 30 minutes
Target: <15 minutes
Elite: <1 hour

Change Failure Rate¶

Current: 5% (1 in 20 deployments needs rollback)
Target: <5%
Elite: 0-15%
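
The four metrics above can be derived from a simple log of deployments. The sketch below shows one way to compute them; the `Deployment` record and `dora_summary` function are hypothetical, and recovery time is measured from deployment to restoration as a simplifying assumption (a real incident tracker would measure from when the incident began).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                     # needed a rollback or hotfix
    restored_at: Optional[datetime] = None   # when service was restored


def dora_summary(deployments: List[Deployment], period_days: int) -> dict:
    """Compute the four DORA metrics over a reporting period."""
    count = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    recovery_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_week": count * 7 / period_days,
        "mean_lead_time_hours": mean(lead_hours) if lead_hours else None,
        "change_failure_rate": len(failures) / count if count else None,
        "mttr_minutes": mean(recovery_minutes) if recovery_minutes else None,
    }
```

With this definition, the current figure of one rollback in 20 deployments corresponds to a `change_failure_rate` of 0.05.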

📁 Documentation Structure¶

Part 1: Setup & Configuration¶

Database Operations - Backup, restore, migrations
Operational Runbooks - Emergency procedures, scaling, DR

🛠️ Technology Stack¶

CI/CD¶

GitHub Actions - CI/CD orchestration
Docker - Containerization
Azure Container Registry - Image storage
Trivy/Snyk - Security scanning

Deployment¶

Azure Container Apps - Container hosting
Azure CLI - Deployment automation
Helm (planned) - Kubernetes deployments

Monitoring¶

Loki - Log aggregation
Grafana - Dashboards
Azure Monitor - Infrastructure metrics
PagerDuty - On-call alerting

🚨 On-Call & Incident Response¶

On-Call Schedule¶

Primary: DevOps Engineer (rotating weekly)
Secondary: Backend Lead
Escalation: CTO

Incident Severity¶

Severity	Response Time	Escalation	Examples
P0	Immediate	All hands	Complete outage, data loss
P1	<15 min	Primary + Secondary	Major feature down, DB issues
P2	<1 hour	Primary only	Degraded performance
P3	Next business day	Backlog	Minor bugs, improvements
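
For alert routing or tooling, the severity table could be encoded as data. The following is a sketch under stated assumptions: `SeverityPolicy`, `POLICIES`, and `response_deadline` are hypothetical names, "Immediate" is modeled as a zero-length window, and "Next business day" is left as `None` because the source does not define business hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """Hypothetical encoding of one row of the severity table."""
    response_within: Optional[timedelta]  # None means "next business day"
    escalation: str


POLICIES = {
    "P0": SeverityPolicy(timedelta(0), "All hands"),
    "P1": SeverityPolicy(timedelta(minutes=15), "Primary + Secondary"),
    "P2": SeverityPolicy(timedelta(hours=1), "Primary only"),
    "P3": SeverityPolicy(None, "Backlog"),
}


def response_deadline(severity: str, alerted_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for backlog items."""
    window = POLICIES[severity].response_within
    return None if window is None else alerted_at + window
```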

Incident Checklist¶

Acknowledge - Respond to alert within 5 minutes
Assess - Determine severity and impact
Communicate - Update status page, notify stakeholders
Mitigate - Apply fix or rollback
Resolve - Verify fix, monitor for 1 hour
Post-Mortem - Document incident, create action items

📈 Deployment Best Practices¶

Pre-Deployment¶

✅ All tests passing (unit, integration, E2E)
✅ Security scans passed (no critical vulnerabilities)
✅ Database migrations reviewed and tested
✅ Rollback plan documented
✅ On-call engineer notified
✅ Deployment window scheduled (avoid peak hours)

During Deployment¶

👀 Monitor error rates in real-time
👀 Watch response latency (p95, p99)
👀 Check health endpoints
👀 Monitor resource usage (CPU, memory)
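
To make the latency and error-rate checks concrete, here is a minimal sketch that computes p95 and p99 with the nearest-rank method and combines them into a single health verdict. The function names are hypothetical, and the threshold defaults (500 ms, 1000 ms, 1% errors) are illustrative assumptions, not values from this guide.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def deployment_looks_healthy(
    latencies_ms: Sequence[float],
    errors: int,
    requests: int,
    p95_limit_ms: float = 500.0,     # assumed threshold
    p99_limit_ms: float = 1000.0,    # assumed threshold
    error_rate_limit: float = 0.01,  # assumed threshold
) -> bool:
    """Return True when tail latency and error rate are within the given limits."""
    error_rate = errors / requests if requests else 0.0
    return (
        percentile(latencies_ms, 95) <= p95_limit_ms
        and percentile(latencies_ms, 99) <= p99_limit_ms
        and error_rate <= error_rate_limit
    )
```

Tail percentiles are used rather than the mean because a small share of very slow requests can leave the average almost unchanged while still degrading the experience for real users.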

Post-Deployment¶

✅ Run smoke tests
✅ Verify key user flows
✅ Monitor for 1 hour minimum
✅ Check alert channels (ensure no incidents)
✅ Mark deployment complete in tracking system

📞 Support & Escalation¶

DevOps Team:

Slack: #devops-support
Email: devops@machineavatars.com
On-Call: PagerDuty (24/7)

Escalation Path:

On-call DevOps Engineer
Backend Engineering Lead
CTO

Progress: Section 8 - 1/8 files complete (12.5%)

"Deploy early, deploy often, deploy safely." 🚀✅