Operational Runbooks¶
Section: 8-deployment-operations
Document: Step-by-Step Operational Procedures
Audience: DevOps Engineers, SREs, On-Call Engineers
Last Updated: 2025-12-30
🎯 Purpose¶
Complete operational runbooks for common tasks, emergency procedures, and disaster recovery for all 23 microservices.
🚨 Emergency Procedures¶
Complete Service Outage¶
Symptoms: All services down, no user access
Immediate Actions (< 5 minutes):
# 1. Check Azure status
az account list-locations --query "[].{Name:name, Status:metadata.status}"
# 2. Check Container Apps environment
az containerapp env list --resource-group machineagents-prod-rg
# 3. Check if it's a planned outage
# Contact Azure support if regional issue
# 4. If not Azure issue, check recent deployments
az containerapp revision list --name gateway-service-prod \
--resource-group machineagents-prod-rg
# 5. Rollback if a recent deployment caused the outage:
#    activate the previous revision and shift all traffic to it
az containerapp revision activate --name gateway-service-prod \
--resource-group machineagents-prod-rg \
--revision gateway-service-prod--<previous-revision>
az containerapp ingress traffic set --name gateway-service-prod \
--resource-group machineagents-prod-rg \
--revision-weight gateway-service-prod--<previous-revision>=100
Follow-up (< 30 minutes):
- Update status page
- Notify all stakeholders
- Create incident post-mortem
- Implement prevention measures
Single Service Down¶
Example: response-3d-service not responding
Step 1: Diagnose (< 2 minutes)
# Check service health
az containerapp show --name response-3d-prod \
--resource-group machineagents-prod-rg \
--query "{Status:properties.runningStatus, Replicas:properties.template.scale}"
# Check logs (last 30 minutes)
az containerapp logs show --name response-3d-prod \
--resource-group machineagents-prod-rg \
--tail 100
# Alternative: Loki query
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
--data-urlencode 'query={service="response-3d"}' \
--data-urlencode "start=$(date -u -d '30 minutes ago' +%s)" | jq
Step 2: Quick Restart (< 1 minute)
# Restart all replicas
az containerapp revision restart --name response-3d-prod \
--resource-group machineagents-prod-rg \
--revision response-3d-prod--<active-revision>
# Monitor restart
watch -n 2 'az containerapp replica list --name response-3d-prod \
--resource-group machineagents-prod-rg | grep -c "Running"'
Step 3: If Restart Fails - Rollback (< 3 minutes)
# List available revisions
az containerapp revision list --name response-3d-prod \
--resource-group machineagents-prod-rg \
--query "[].{Name:name, Created:properties.createdTime, Active:properties.active}"
# Activate previous revision
az containerapp revision activate --name response-3d-prod \
--resource-group machineagents-prod-rg \
--revision response-3d-prod--<previous-revision-id>
# Set traffic to 100% on previous revision
az containerapp ingress traffic set --name response-3d-prod \
--resource-group machineagents-prod-rg \
--revision-weight response-3d-prod--<previous>=100
Step 4: Monitor Recovery (15 minutes)
# Watch error rate
# Grafana dashboard: Application Metrics
# Expected: Error rate < 1% within 5 minutes
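If the Grafana dashboard is unavailable, here is a minimal CLI sketch of the same check (assuming Prometheus is reachable at http://prometheus:9090 and requests are exported as http_requests_total with service and status labels; adjust to the metrics your services actually emit):
# Hypothetical 5-minute error-rate query via the Prometheus HTTP API
curl -G -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="response-3d",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="response-3d"}[5m]))' \
  | jq '.data.result'
A sustained value above 0.01 (1%) after the rollback means the issue is not resolved and needs escalation.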
Database Connection Issues¶
Symptoms: Services logging MongoDB connection errors
Step 1: Check Cosmos DB Status
# Check Cosmos DB account
az cosmosdb show --name machineagents-cosmosdb-prod \
--resource-group machineagents-data-rg \
--query "{Status:provisioningState, ReadLocations:readLocations, WriteLocations:writeLocations}"
# Check if throttled
az monitor metrics list --resource /subscriptions/{sub-id}/resourceGroups/machineagents-data-rg/providers/Microsoft.DocumentDB/databaseAccounts/machineagents-cosmosdb-prod \
--metric "TotalRequestUnits" \
--start-time $(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
Step 2: Check Connection String
# Verify connection string in Key Vault
az keyvault secret show --vault-name machineagents-keyvault-prod \
--name mongo-connection-string \
--query "value" -o tsv
# Test connection from local machine
mongosh "mongodb://..." --eval "db.adminCommand('ping')"
Step 3: Check NSG Rules
# Verify databases subnet NSG allows MongoDB port
az network nsg rule list --nsg-name databases-nsg \
--resource-group machineagents-network-rg \
--query "[?destinationPortRange=='27017'].{Name:name, Access:access, Priority:priority}"
Step 4: Increase RU/s if Throttled
# Check current RU/s
az cosmosdb mongodb collection throughput show \
--account-name machineagents-cosmosdb-prod \
--database-name Machine_agent_prod \
--name chatbot_history \
--resource-group machineagents-data-rg
# Increase if needed
az cosmosdb mongodb collection throughput update \
--account-name machineagents-cosmosdb-prod \
--database-name Machine_agent_prod \
--name chatbot_history \
--resource-group machineagents-data-rg \
--throughput 6000
⚙️ Service Operations¶
Scale Service Up (Traffic Spike)¶
When: Anticipated high traffic (product launch, marketing campaign)
Example: Scale response-3d-service
# Current configuration
az containerapp show --name response-3d-prod \
--resource-group machineagents-prod-rg \
--query "properties.template.scale"
# Scale up
az containerapp update --name response-3d-prod \
--resource-group machineagents-prod-rg \
--min-replicas 5 \
--max-replicas 20 \
--cpu 1.5 \
--memory 3Gi
# Verify scaling
watch -n 5 'az containerapp replica list --name response-3d-prod \
--resource-group machineagents-prod-rg --query "length(@)"'
Expected: Replica count reaches the new minimum (5) within 2 minutes
Scale Service Down (Cost Optimization)¶
When: Low traffic period (nights, weekends)
# Scale down (non-peak hours)
az containerapp update --name response-3d-prod \
--resource-group machineagents-prod-rg \
--min-replicas 2 \
--max-replicas 10
# Verify no performance degradation
# Monitor: Response latency should stay < 3s p95
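For a CLI spot check instead of the dashboard, a sketch of the p95 latency query (assuming Prometheus at http://prometheus:9090 and a http_request_duration_seconds_bucket histogram; substitute the metric your services actually expose):
# Hypothetical p95 latency query; the threshold in this runbook is 3s
curl -G -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="response-3d"}[5m])) by (le))' \
  | jq '.data.result'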
Update Environment Variable¶
Example: Update log level for debugging
# View current env vars
az containerapp show --name response-3d-prod \
--resource-group machineagents-prod-rg \
--query "properties.template.containers[0].env"
# Update env var (creates new revision)
az containerapp update --name response-3d-prod \
--resource-group machineagents-prod-rg \
--set-env-vars "LOG_LEVEL=DEBUG"
# Restart to apply
az containerapp revision restart --name response-3d-prod \
--resource-group machineagents-prod-rg
# IMPORTANT: Revert after debugging
az containerapp update --name response-3d-prod \
--resource-group machineagents-prod-rg \
--set-env-vars "LOG_LEVEL=INFO"
💾 Database Operations Runbooks¶
MongoDB Backup (Manual)¶
When: Before major migration or schema change
# Export a single collection (example: users)
mongoexport --uri="$MONGO_URI" \
--db=Machine_agent_prod \
--collection=users_multichatbot_v2 \
--out=users_backup_$(date +%Y%m%d).json
# Backup all collections
for collection in users_multichatbot_v2 chatbot_selections chatbot_history files files_secondary system_prompts_user projectid_creation organisation_data trash_collection_name; do
mongoexport --uri="$MONGO_URI" \
--db=Machine_agent_prod \
--collection=$collection \
--out=${collection}_backup_$(date +%Y%m%d).json
done
# Upload to Azure Blob Storage
az storage blob upload-batch \
--destination backups/manual/$(date +%Y%m%d) \
--source . \
--pattern "*_backup_*.json" \
--account-name qablobmachineagents
MongoDB Point-in-Time Restore¶
When: Data corruption, accidental deletion
# 1. Identify restore point (UTC)
# Example: Restore to 2025-12-30 10:00:00 UTC
# 2. Initiate restore via Azure Portal
# Cosmos DB → Backups → Point-in-time Restore
# Or via CLI:
az cosmosdb mongodb database restore \
--account-name machineagents-cosmosdb-prod \
--resource-group machineagents-data-rg \
--database-name Machine_agent_prod \
--restore-timestamp "2025-12-30T10:00:00Z" \
--target-database-name Machine_agent_prod_restored
# 3. Verify restored data
mongosh "$MONGO_URI" --eval "use Machine_agent_prod_restored; db.users_multichatbot_v2.countDocuments()"
# 4. Switch applications to restored database
# Update connection string in Key Vault
az keyvault secret set --vault-name machineagents-keyvault-prod \
--name mongo-connection-string \
--value "mongodb://...Machine_agent_prod_restored..."
# 5. Restart all services to pick up new connection string
for service in gateway auth user create-chatbot selection data-crawling response-3d response-text response-voice; do
az containerapp revision restart --name ${service}-service-prod \
--resource-group machineagents-prod-rg
done
Milvus Backup¶
When: Weekly automated backup
# 1. Stop writes (maintenance window)
# Set chatbots to read-only mode via feature flag
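# Example (assumption): if the flag is exposed as an env var named READ_ONLY_MODE,
# it can be toggled per service like this; adjust to your actual feature-flag mechanism
az containerapp update --name response-3d-prod \
  --resource-group machineagents-prod-rg \
  --set-env-vars "READ_ONLY_MODE=true"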
# 2. Create Milvus snapshot
docker exec milvus-standalone /bin/bash -c "
cd /var/lib/milvus &&
tar -czf /tmp/milvus_backup_$(date +%Y%m%d).tar.gz db/ wal/ &&
echo 'Backup complete'
"
# 3. Copy snapshot to host
docker cp milvus-standalone:/tmp/milvus_backup_$(date +%Y%m%d).tar.gz ./
# 4. Upload to Azure Blob
az storage blob upload \
--container-name milvus-backups \
--file milvus_backup_$(date +%Y%m%d).tar.gz \
--name $(date +%Y%m%d)/milvus_backup.tar.gz \
--account-name qablobmachineagents
# 5. Resume writes
# Disable read-only mode
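# Example (assumption): reverse the hypothetical READ_ONLY_MODE flag from step 1
az containerapp update --name response-3d-prod \
  --resource-group machineagents-prod-rg \
  --set-env-vars "READ_ONLY_MODE=false"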
🔐 Security Operations¶
Rotate OpenAI API Key¶
When: Every 90 days or on compromise
# 1. Generate new key in OpenAI dashboard
# https://platform.openai.com/api-keys
# 2. Update Key Vault
az keyvault secret set --vault-name machineagents-keyvault-prod \
--name openai-api-key \
--value "sk-proj-NEW_KEY_HERE"
# 3. Restart services that use OpenAI
for service in response-3d response-text response-voice llm-model-service; do
echo "Restarting ${service}..."
az containerapp revision restart --name ${service}-service-prod \
--resource-group machineagents-prod-rg
done
# 4. Verify services are using new key
# Check logs for successful OpenAI API calls
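# Example (sketch): tail recent logs and grep for OpenAI activity; the grep pattern is an assumption
az containerapp logs show --name response-3d-service-prod \
  --resource-group machineagents-prod-rg \
  --tail 200 | grep -i "openai"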
# 5. Wait 24 hours, then deactivate old key in OpenAI dashboard
Rotate JWT Secret¶
When: Every 180 days or on suspected compromise
⚠️ WARNING: This will invalidate ALL user sessions!
# 1. Generate new JWT secret
NEW_SECRET=$(openssl rand -base64 64 | tr -d '\n')
# 2. Update Key Vault
az keyvault secret set --vault-name machineagents-keyvault-prod \
--name jwt-secret \
--value "$NEW_SECURET"
# 3. Restart auth service
az containerapp revision restart --name auth-service-prod \
--resource-group machineagents-prod-rg
# 4. Notify users (email)
# Subject: "Please log in again"
# Body: "For security, all sessions have been invalidated..."
# 5. Monitor login rate (expect spike)
# Grafana dashboard: User Authentication Metrics
Rotate MongoDB Connection String¶
When: Credential compromise
# 1. Generate new connection string in Azure Portal
# Cosmos DB → Keys → Regenerate Primary Key
# 2. Update Key Vault with new connection string
az keyvault secret set --vault-name machineagents-keyvault-prod \
--name mongo-connection-string \
--value "mongodb://NEW_CONNECTION_STRING"
# 3. Restart ALL services that use MongoDB
for service in gateway auth user create-chatbot selection data-crawling \
3d-state text-state voice-state system-prompts chatbot-maintenance \
response-3d response-text response-voice chat-history analytics \
payment notification feature admin; do
echo "Restarting ${service}..."
az containerapp revision restart --name ${service}-service-prod \
--resource-group machineagents-prod-rg &
done
# Wait for all restarts
wait
# 4. Verify all services connected successfully
# Check logs for "MongoDB connected" messages
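# Example (sketch): spot-check a few services; adjust the grep pattern to your actual log message
for service in gateway auth user; do
  echo "--- ${service} ---"
  az containerapp logs show --name ${service}-service-prod \
    --resource-group machineagents-prod-rg \
    --tail 50 | grep -i "mongodb connected"
done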
🌍 Disaster Recovery¶
Complete Region Failure (Primary: East US)¶
Scenario: Azure East US region completely unavailable
RTO: 4 hours
RPO: 1 hour
Phase 1: Assessment (0-15 minutes)¶
# 1. Confirm region failure (not just service issue)
az account list-locations --query "[?name=='eastus'].metadata.status"
# 2. Check secondary region status
az account list-locations --query "[?name=='southeastasia'].metadata.status"
# 3. Verify Cosmos DB auto-failover
az cosmosdb show --name machineagents-cosmosdb-prod \
--resource-group machineagents-data-rg \
--query "{ReadLocations:readLocations, WriteLocations:writeLocations}"
# If automatic failover hasn't occurred:
az cosmosdb failover-priority-change --name machineagents-cosmosdb-prod \
--resource-group machineagents-data-rg \
--failover-policies southeastasia=0 eastus=1
Phase 2: Deploy to Secondary Region (15-120 minutes)¶
# 1. Deploy all 23 services to Southeast Asia
# Use existing Docker images from ACR (geo-replicated)
for service in gateway auth user create-chatbot selection data-crawling \
3d-state text-state voice-state system-prompts chatbot-maintenance \
response-3d response-text response-voice chat-history analytics \
llm-model-service embedding-service tts-service payment \
notification feature admin; do
echo "Deploying ${service} to Southeast Asia..."
az containerapp create \
--name ${service}-service-dr \
--resource-group machineagents-dr-rg \
--environment machineagents-dr-env \
--image machineagentsacr.azurecr.io/${service}:latest \
--target-port 80$(echo ${service} | cut -d'-' -f1 | wc -c) \
--ingress external \
--min-replicas 2 \
--max-replicas 10 \
--secrets \
mongo-uri="$MONGO_URI_DR" \
openai-key="$OPENAI_KEY" &
done
wait
Phase 3: Update DNS (120-150 minutes)¶
# 1. Update Azure Front Door / Traffic Manager
az network traffic-manager endpoint update \
--name eastus-endpoint \
--profile-name machineagents-tm \
--resource-group machineagents-network-rg \
--type azureEndpoints \
--endpoint-status Disabled
az network traffic-manager endpoint update \
--name southeastasia-endpoint \
--profile-name machineagents-tm \
--resource-group machineagents-network-rg \
--type azureEndpoints \
--endpoint-status Enabled \
--priority 1
# 2. Update frontend API endpoint (if not using Traffic Manager)
# Update environment variable in Vercel/Azure Static Web Apps
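# Example (assumption): if the frontend runs on Azure Static Web Apps and reads a setting
# named API_BASE_URL, it can be repointed like this; app and setting names are placeholders
az staticwebapp appsettings set \
  --name machineagents-frontend \
  --setting-names "API_BASE_URL=https://gateway-service-dr.southeastasia.azurecontainerapps.io"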
Phase 4: Verification (150-240 minutes)¶
# 1. Health check all services
for service in gateway auth user create-chatbot selection; do
curl https://${service}-service-dr.southeastasia.azurecontainerapps.io/health
done
# 2. Run smoke tests
# - User signup
# - User login
# - Create chatbot
# - Send message to chatbot
# - All should succeed
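# Example smoke-test sketch (endpoint paths are assumptions; use your actual API routes)
GATEWAY="https://gateway-service-dr.southeastasia.azurecontainerapps.io"
curl -s -o /dev/null -w "signup: %{http_code}\n" -X POST "$GATEWAY/api/auth/signup" \
  -H "Content-Type: application/json" \
  -d '{"email":"dr-test@example.com","password":"<test-password>"}'
curl -s -o /dev/null -w "login:  %{http_code}\n" -X POST "$GATEWAY/api/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"email":"dr-test@example.com","password":"<test-password>"}'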
# 3. Monitor error rates (target: < 5%)
# Grafana dashboard: DR Environment Metrics
# 4. Update status page
# "Services restored in secondary region. Primary region recovery in progress."
Phase 5: Return to Primary (When Available)¶
# Once East US is back online:
# 1. Sync data from Southeast Asia to East US
# (Cosmos DB handles automatically with multi-region writes)
# 2. Redeploy services to East US
# 3. Switch DNS back to East US
# 4. Shut down DR environment
# 5. Post-mortem + lessons learned
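# Sketch of step 3, mirroring Phase 3 with the priorities reversed:
az network traffic-manager endpoint update \
  --name eastus-endpoint \
  --profile-name machineagents-tm \
  --resource-group machineagents-network-rg \
  --type azureEndpoints \
  --endpoint-status Enabled \
  --priority 1
az network traffic-manager endpoint update \
  --name southeastasia-endpoint \
  --profile-name machineagents-tm \
  --resource-group machineagents-network-rg \
  --type azureEndpoints \
  --endpoint-status Enabled \
  --priority 2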
Data Loss / Corruption¶
Scenario: Accidental deletion or data corruption
For MongoDB (Last 30 days)¶
# Use Point-in-Time Restore (see Database Operations above)
# Cosmos DB provides continuous backup for 30 days
For Milvus (Last 7 days)¶
# 1. Download backup from Azure Blob
az storage blob download \
--container-name milvus-backups \
--name $(date -d "7 days ago" +%Y%m%d)/milvus_backup.tar.gz \
--file milvus_restore.tar.gz \
--account-name qablobmachineagents
# 2. Stop Milvus
docker stop milvus-standalone
# 3. Clear current data using a throwaway helper container that reuses the stopped
#    container's mounts (docker exec cannot run against a stopped container;
#    assumes /var/lib/milvus is a mounted volume, the default for the standalone setup)
docker run --rm --volumes-from milvus-standalone alpine \
  rm -rf /var/lib/milvus/db /var/lib/milvus/wal
# 4. Restore backup
docker run --rm --volumes-from milvus-standalone -v "$(pwd)":/backup alpine \
  tar -xzf /backup/milvus_restore.tar.gz -C /var/lib/milvus
# 5. Restart Milvus
docker start milvus-standalone
# 6. Verify collections
curl http://localhost:9091/api/v1/collections
📊 Monitoring & Health Checks¶
Daily Health Check Routine¶
#!/bin/bash
# daily-health-check.sh
echo "=== Daily Health Check $(date) ==="
# 1. Check all services are running
echo "1. Service Status:"
for service in gateway auth user create-chatbot selection data-crawling \
3d-state text-state voice-state system-prompts chatbot-maintenance \
response-3d response-text response-voice chat-history analytics \
llm-model-service embedding-service tts-service payment \
notification feature admin; do
status=$(az containerapp show --name ${service}-service-prod \
--resource-group machineagents-prod-rg \
--query "properties.runningStatus" -o tsv)
echo " ${service}: ${status}"
done
# 2. Check replica count
echo "2. Replica Counts:"
for service in response-3d response-text response-voice; do
count=$(az containerapp replica list --name ${service}-service-prod \
--resource-group machineagents-prod-rg --query "length(@)" -o tsv)
echo " ${service}: ${count} replicas"
done
# 3. Check database connections
echo "3. Database Status:"
mongosh "$MONGO_URI" --quiet --eval "db.adminCommand('ping')" && echo " MongoDB: OK"
curl -s http://milvus:9091/healthz | grep -q "OK" && echo " Milvus: OK"
# 4. Check error rates (last 24h)
echo "4. Error Rates (last 24h):"
# Query from monitoring system
# Expected: < 1%
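# Example (sketch): count ERROR-level log lines over 24h via Loki's instant-query API
# (the "ERROR" filter and stream selector are assumptions; match your log format)
errors=$(curl -G -s "http://loki:3100/loki/api/v1/query" \
  --data-urlencode 'query=sum(count_over_time({service=~".+"} |= "ERROR" [24h]))' \
  | jq -r '.data.result[0].value[1] // "0"')
echo "  ERROR log lines (24h): ${errors}"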
# 5. Check disk usage
echo "5. Disk Usage:"
df -h | grep -E '(Filesystem|milvus|mongo)'
# 6. Check SSL certificate expiry
echo "6. SSL Certificates:"
echo | openssl s_client -servername machineavatars.com -connect machineavatars.com:443 2>/dev/null | \
openssl x509 -noout -dates | grep notAfter
echo "=== Health Check Complete ==="
📝 Post-Incident Checklist¶
After ANY incident (P0-P3):
- Incident timeline documented
- Root cause identified
- Immediate fix applied
- Monitoring added to prevent recurrence
- Runbook updated (if applicable)
- Post-mortem meeting scheduled (P0/P1 only)
- Action items created in project tracker
- Stakeholders notified
- Blameless culture maintained
🔗 Related Documentation¶
"Practice makes perfect - run DR drills quarterly." 🚀✅