Deployment & Operations
Section: 8-deployment-operations
Document: Complete DevOps Guide
Audience: DevOps Engineers, SREs, Developers
Last Updated: 2025-12-30
Overview
This guide covers deployment and operations for the MachineAvatars platform: its 23 microservices, CI/CD pipelines, deployment strategies, and operational procedures.
Deployment Philosophy:
- Automation First - Minimize manual interventions
- Safety - Zero-downtime deployments, easy rollbacks
- Observability - Monitor everything, alert proactively
- Documentation - Runbooks for all procedures
Quick Links
Emergency Procedures
Common Operations
- Scale Service → Operational Runbooks
- Rotate Secrets → Operational Runbooks
- Database Migration → Database Operations
Environments
Development (Local)
- Purpose: Local development and testing
- Services: Docker Compose (all 23 services)
- Databases: MongoDB + Milvus (local containers)
- Secrets: .env file (not committed)
QA/Staging
- Purpose: Pre-production testing
- Infrastructure: Azure (dedicated resource group)
- Services: Azure Container Apps (1-2 replicas)
- Databases: Dedicated Cosmos DB + shared Milvus
Production
- Purpose: Live customer traffic
- Infrastructure: Azure (multi-region)
- Services: Azure Container Apps (2-10 replicas with auto-scaling)
- Databases: Multi-region Cosmos DB + dedicated Milvus
- High Availability: Yes (99.9% SLA)
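The replica ranges above can be captured as a small lookup that keeps any requested replica count inside the bounds for its environment. This is only a sketch: `ReplicaRange`, `REPLICA_RANGES`, and `replicas_for` are hypothetical names, and the actual scaling rules live in the Azure Container Apps configuration, which this guide does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaRange:
    """Inclusive lower and upper bound on replicas for one environment."""
    minimum: int
    maximum: int

    def clamp(self, desired: int) -> int:
        # Keep the requested count inside [minimum, maximum].
        return max(self.minimum, min(self.maximum, desired))


# Ranges as stated in this guide. Local development runs under
# Docker Compose and is intentionally not modeled here.
REPLICA_RANGES = {
    "staging": ReplicaRange(minimum=1, maximum=2),
    "production": ReplicaRange(minimum=2, maximum=10),
}


def replicas_for(environment: str, desired: int) -> int:
    """Return the replica count to run, bounded by the environment's range."""
    return REPLICA_RANGES[environment].clamp(desired)
```

For example, `replicas_for("production", 25)` is capped at the production maximum, and `replicas_for("production", 0)` is raised to the production minimum.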
Deployment Pipeline
Build → Test → Deploy Flow
graph LR
A[Code Push] --> B[GitHub Actions]
B --> C{Tests Pass?}
C -->|No| D[Block Deployment]
C -->|Yes| E[Build Docker Image]
E --> F[Security Scan]
F --> G{Vulnerabilities?}
G -->|Critical| D
G -->|None/Low| H[Push to ACR]
H --> I{Target Environment}
I -->|Staging| J[Auto-Deploy Staging]
I -->|Production| K[Manual Approval]
K --> L[Deploy Production]
L --> M[Health Checks]
M --> N{Healthy?}
N -->|No| O[Auto-Rollback]
N -->|Yes| P[Deployment Complete]
style C fill:#FFF3E0
style G fill:#FFCDD2
style K fill:#C5E1A5
style N fill:#FFF3E0
Pipeline Components:
- Automated Testing - Unit, integration, security scans
- Docker Build - Multi-stage builds for optimization
- Registry Push - Azure Container Registry
- Deployment - Azure Container Apps with health checks
- Monitoring - Real-time alerts during deployment
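The decision points in the flow above can be sketched as a single function. This is a minimal illustration, not the pipeline's actual implementation: `PipelineRun`, `Outcome`, and `evaluate` are hypothetical names, only critical vulnerabilities block a build (the diagram does not say how medium or high findings are treated), and health checks are assumed to apply to staging as well as production.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    BLOCKED = "blocked"
    AWAITING_APPROVAL = "awaiting_approval"
    ROLLED_BACK = "rolled_back"
    COMPLETE = "complete"


@dataclass
class PipelineRun:
    tests_passed: bool
    critical_vulnerabilities: int  # count reported by the image scan
    target: str                    # "staging" or "production"
    approved: bool = False         # manual approval, production only
    healthy: bool = True           # result of post-deploy health checks


def evaluate(run: PipelineRun) -> Outcome:
    """Walk the gates in the same order as the flow diagram."""
    if not run.tests_passed:
        return Outcome.BLOCKED
    if run.critical_vulnerabilities > 0:
        return Outcome.BLOCKED
    # Staging deploys automatically; production waits for a human.
    if run.target == "production" and not run.approved:
        return Outcome.AWAITING_APPROVAL
    # Assumption: a failed health check triggers rollback in any environment.
    if not run.healthy:
        return Outcome.ROLLED_BACK
    return Outcome.COMPLETE
```

Ordering the checks this way means a failing test or a critical finding stops the run before an image is pushed or an approval is requested.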
Key Metrics (DORA)
Deployment Frequency
- Current: 3-5 deploys/week
- Target: Daily deployments
- Elite: Multiple deploys/day
Lead Time for Changes
- Current: 2-4 hours (code commit → production)
- Target: <1 hour
- Elite: <1 hour
Mean Time to Recovery (MTTR)
- Current: 30 minutes
- Target: <15 minutes
- Elite: <1 hour
Change Failure Rate
- Current: 5% (1 in 20 deployments needs rollback)
- Target: <5%
- Elite: 0-15%
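As a worked illustration of how three of these figures could be derived from deployment records, the sketch below computes change failure rate, mean lead time, and deployment frequency. The `Deployment` record and the function names are hypothetical; the guide does not say where the platform's numbers actually come from. With 1 rollback in 20 deployments the change failure rate is 1/20 = 5%, matching the current figure above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass
class Deployment:
    committed_at: datetime     # when the change was committed
    deployed_at: datetime      # when it reached production
    rolled_back: bool = False  # True if the deployment needed a rollback


def change_failure_rate(deploys: Sequence[Deployment]) -> float:
    """Fraction of deployments that needed a rollback (0.0 if there were none)."""
    if not deploys:
        return 0.0
    return sum(d.rolled_back for d in deploys) / len(deploys)


def mean_lead_time(deploys: Sequence[Deployment]) -> timedelta:
    """Average time from code commit to production."""
    if not deploys:
        raise ValueError("need at least one deployment")
    seconds = mean(
        (d.deployed_at - d.committed_at).total_seconds() for d in deploys
    )
    return timedelta(seconds=seconds)


def deploys_per_week(deploys: Sequence[Deployment], weeks: float) -> float:
    """Deployment frequency over an observation window of `weeks` weeks."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    return len(deploys) / weeks
```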
Documentation Structure
Part 1: Setup & Configuration
- Database Operations - Backup, restore, migrations
- Operational Runbooks - Emergency procedures, scaling, DR
Technology Stack
CI/CD
- GitHub Actions - CI/CD orchestration
- Docker - Containerization
- Azure Container Registry - Image storage
- Trivy/Snyk - Security scanning
Deployment
- Azure Container Apps - Container hosting
- Azure CLI - Deployment automation
- Helm (planned) - Kubernetes deployments
Monitoring
- Loki - Log aggregation
- Grafana - Dashboards
- Azure Monitor - Infrastructure metrics
- PagerDuty - On-call alerting
On-Call & Incident Response
On-Call Schedule
- Primary: DevOps Engineer (rotating weekly)
- Secondary: Backend Lead
- Escalation: CTO
Incident Severity
| Severity | Response Time | Escalation | Examples |
|---|---|---|---|
| P0 | Immediate | All hands | Complete outage, data loss |
| P1 | <15 min | Primary + Secondary | Major feature down, DB issues |
| P2 | <1 hour | Primary only | Degraded performance |
| P3 | Next business day | Backlog | Minor bugs, improvements |
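The severity table can also be encoded as data so that tooling can check whether a response is overdue. The sketch below is one possible encoding, with hypothetical names throughout. It treats "Immediate" as a zero-length window and leaves P3 without a clock-based deadline, since computing the next business day depends on a calendar the guide does not define.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows from the severity table.
# Assumption: "Immediate" is modeled as a zero-length window.
RESPONSE_TARGETS = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": None,  # next business day; no clock-based deadline modeled here
}

# Escalation column, copied verbatim from the table.
ESCALATION = {
    "P0": "All hands",
    "P1": "Primary + Secondary",
    "P2": "Primary only",
    "P3": "Backlog",
}


def response_deadline(severity: str, alerted_at: datetime) -> Optional[datetime]:
    """Latest acceptable response time, or None when no deadline is modeled."""
    target = RESPONSE_TARGETS[severity]
    return None if target is None else alerted_at + target


def is_overdue(severity: str, alerted_at: datetime, now: datetime) -> bool:
    """True once `now` has passed the response deadline for this severity."""
    deadline = response_deadline(severity, alerted_at)
    return deadline is not None and now > deadline
```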
Incident Checklist
- Acknowledge - Respond to alert within 5 minutes
- Assess - Determine severity and impact
- Communicate - Update status page, notify stakeholders
- Mitigate - Apply fix or rollback
- Resolve - Verify fix, monitor for 1 hour
- Post-Mortem - Document incident, create action items
Deployment Best Practices
Pre-Deployment
- All tests passing (unit, integration, E2E)
- Security scans passed (no critical vulnerabilities)
- Database migrations reviewed and tested
- Rollback plan documented
- On-call engineer notified
- Deployment window scheduled (avoid peak hours)
During Deployment
- Monitor error rates in real-time
- Watch response latency (p95, p99)
- Check health endpoints
- Monitor resource usage (CPU, memory)
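The error-rate and latency checks above can be combined into a simple rollback signal. The following is a sketch only: the thresholds (`max_error_rate`, `max_p99_ms`) are placeholder values rather than the platform's real alert limits, the function names are hypothetical, and server errors are assumed to mean HTTP status 500 or above.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def latency_percentiles(samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (p95, p99) for a window of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    cuts = quantiles(samples_ms, n=100)  # 99 cut points: cuts[k-1] is the k-th percentile
    return cuts[94], cuts[98]


def should_roll_back(
    status_codes: Sequence[int],
    latencies_ms: Sequence[float],
    max_error_rate: float = 0.01,  # placeholder threshold (1%)
    max_p99_ms: float = 2000.0,    # placeholder threshold
) -> bool:
    """True if the observed window breaches either threshold."""
    if not status_codes:
        raise ValueError("need at least one request in the window")
    # Assumption: any 5xx response counts as an error.
    errors = sum(1 for code in status_codes if code >= 500)
    error_rate = errors / len(status_codes)
    _, p99 = latency_percentiles(latencies_ms)
    return error_rate > max_error_rate or p99 > max_p99_ms
```

Evaluating a sliding window of recent requests, rather than a single request, keeps one slow or failed call from triggering a rollback on its own.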
Post-Deployment
- Run smoke tests
- Verify key user flows
- Monitor for 1 hour minimum
- Check alert channels (ensure no incidents)
- Mark deployment complete in tracking system
Related Documentation
Architecture:
Security:
Data:
Support & Escalation
DevOps Team:
- Slack: #devops-support
- Email: devops@machineavatars.com
- On-Call: PagerDuty (24/7)
Escalation Path:
- On-call DevOps Engineer
- Backend Engineering Lead
- CTO
Progress: Section 8 - 1/8 files complete (12.5%)
"Deploy early, deploy often, deploy safely."