Data Operations - Governance, Backup, Recovery & Migration¶
Section: 4-data-architecture-governance
Document: Combined Operations Guide
Coverage: Data governance, backup/recovery procedures, migration strategies
Audience: DBAs, DevOps, data engineers
🎯 Overview¶
This document consolidates operational procedures for data management, including governance policies, backup and disaster recovery, and migration strategies.
Part 1: Data Governance¶
📋 Data Retention Policies¶
Conversation History Retention¶
| Plan | Retention Period | Auto-Delete | Export Available |
|---|---|---|---|
| Free | 7 days | ✅ Yes | ✅ Yes |
| Pro | 30 days | ✅ Yes | ✅ Yes |
| Business | 90 days | ✅ Yes | ✅ Yes |
| Premium | Unlimited | ❌ No | ✅ Yes |
| Enterprise | Unlimited | ❌ No | ✅ Yes |
Implementation:
// Scheduled job (daily 2 AM)
db.chatbot_history.deleteMany({
  timestamp: {
    $lt: new Date(Date.now() - retention_days * 24 * 60 * 60 * 1000),
  },
});
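The retention_days value is not defined in the snippet above; below is a minimal sketch of how the per-plan retention table could drive the scheduled job (Python/pymongo; the plan field on chatbot_history and the database name are assumptions for illustration):
# Hypothetical mapping from the retention table; None = unlimited (no auto-delete)
from datetime import datetime, timedelta
from pymongo import MongoClient

RETENTION_DAYS = {"free": 7, "pro": 30, "business": 90, "premium": None, "enterprise": None}

def purge_expired_history(db):
    """Delete conversation history older than each plan's retention window."""
    for plan, days in RETENTION_DAYS.items():
        if days is None:
            continue  # unlimited plans are never auto-deleted
        cutoff = datetime.utcnow() - timedelta(days=days)
        db.chatbot_history.delete_many({"plan": plan, "timestamp": {"$lt": cutoff}})

if __name__ == "__main__":
    purge_expired_history(MongoClient()["Machine_agent_dev"])  # database name is an assumption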
Audit Logs Retention¶
| Plan | Retention | Searchable |
|---|---|---|
| Free | 7 days | ❌ |
| Pro | 30 days | ✅ |
| Business | 90 days | ✅ |
| Premium/Enterprise | Unlimited | ✅ |
Trash Retention¶
All Plans: 7 days for soft-deleted chatbots
Auto-Purge (planned):
// Daily cleanup
db.trash_collection_name.deleteMany({
  deleted_at: { $lt: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) },
});
🔒 Data Privacy & Security¶
PII Classification¶
Highly Sensitive (Requires Encryption + Masking):
- Passwords (⚠️ currently stored in plain text!)
- API keys, secrets
- Payment information (handled by Razorpay)
Sensitive (Requires Encryption):
- Email addresses
- User names
- IP addresses
- Conversation transcripts
Internal (Encrypted at Rest):
- Chatbot configurations
- System prompts
- Analytics data
Public (No special handling):
- Public URLs
- Product descriptions
GDPR/DPDPA Compliance¶
Right to Access:
- Data export API (/v1/gdpr/data-export): JSON format, includes all collections
- Available in all plans
Right to Erasure:
- Account deletion API (/v1/gdpr/delete-account): cascading delete across all collections
- Permanent deletion (no recovery)
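A minimal sketch of the cascading delete, assuming pymongo; the collection list is illustrative, not exhaustive:
# Hedged sketch: every collection that stores per-user data must appear in this list
from pymongo import MongoClient

USER_DATA_COLLECTIONS = ["chatbot_history", "audit_logs", "trash_collection_name"]  # placeholders

def delete_account(db, user_id):
    """Permanently remove a user's data across collections (no recovery)."""
    for name in USER_DATA_COLLECTIONS:
        db[name].delete_many({"user_id": user_id})
    db.users_multichatbot_v2.delete_one({"_id": user_id})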
Data Minimization:
- Only collect necessary data
- No tracking cookies without consent
- Analytics opt-out available
Part 2: Backup & Recovery¶
💾 Backup Strategy¶
MongoDB (Cosmos DB)¶
Automated Backups:
- Frequency: Daily (automatic)
- Time: 2:00 AM UTC
- Retention: 35 days (configurable up to 365 days)
- Type: Full database backup
- Encryption: AES-256
- Location: Azure geo-redundant storage
Point-in-Time Restore (PITR):
- Restore to any point within retention window
- Granularity: 1-minute intervals
- RTO: 1-4 hours
Verification:
# Azure CLI - List backups
az cosmosdb mongodb database backup list \
--account-name machineavatars-mongodb \
--resource-group machineavatars-rg
Milvus Vector Database¶
Nightly Snapshots:
- Frequency: Daily
- Time: 3:00 AM UTC
- Retention: 7 days
- Type: Azure Blob storage snapshot
- Size: ~6GB compressed
Backup Script:
#!/bin/bash
# Milvus backup script
DATE=$(date +%Y-%m-%d)
BACKUP_NAME="milvus-backup-${DATE}"
SEVEN_DAYS_AGO=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)

# Create Azure Blob snapshot
az storage blob snapshot \
  --account-name qablobmachineagents \
  --container-name milvus-data \
  --name collections/ \
  --snapshot-name ${BACKUP_NAME}

# Cleanup old snapshots (>7 days)
az storage blob list \
  --account-name qablobmachineagents \
  --container-name milvus-data \
  --query "[?properties.createdOn<'${SEVEN_DAYS_AGO}'].name" \
  --output tsv \
  | xargs -I {} az storage blob delete \
      --account-name qablobmachineagents \
      --container-name milvus-data \
      --name {}
🔄 Disaster Recovery¶
RTO & RPO Targets¶
| Component | RTO (Recovery Time) | RPO (Recovery Point) | DR Strategy |
|---|---|---|---|
| MongoDB | 4 hours | 1 hour | PITR from backup |
| Milvus | 2 hours | 24 hours | Restore from snapshot |
| Application | 1 hour | N/A | Blue-green deployment |
| Overall System | 4 hours | 1 hour | Coordinated recovery |
RTO: Maximum time to restore service
RPO: Maximum acceptable data loss
DR Scenarios & Procedures¶
Scenario 1: MongoDB Data Corruption¶
Detection: Application errors, data inconsistencies
Impact: High - chatbots cannot function
Recovery Steps:
- Identify corruption time (from logs/monitoring)
- Initiate PITR:
- Verify restored data (spot checks, row counts; see the sketch after these steps)
- Switch traffic to restored database (update connection string)
- Monitor for 24 hours
- Delete corrupted database after verification
Timeline: 2-4 hours
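A minimal verification sketch for step 3, assuming pymongo and access to both the corrupted and the restored database (environment variables, database name, and collection list are placeholders):
# Spot-check: compare document counts between the original and restored databases
import os
from pymongo import MongoClient

def compare_counts(source_uri, restored_uri, db_name, collections):
    src = MongoClient(source_uri)[db_name]
    dst = MongoClient(restored_uri)[db_name]
    for name in collections:
        s = src[name].estimated_document_count()
        d = dst[name].estimated_document_count()
        print(f"{name}: source={s} restored={d} -> {'OK' if d >= s else 'CHECK'}")

if __name__ == "__main__":
    compare_counts(
        os.environ["SOURCE_MONGO_URI"],    # assumed environment variables
        os.environ["RESTORED_MONGO_URI"],
        "Machine_agent_prod",              # database name is an assumption
        ["users_multichatbot_v2", "chatbot_history"],
    )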
Scenario 2: Milvus Vector Loss¶
Detection: Empty search results, collection not found
Impact: Medium - chatbots respond without context
Recovery Steps:
- Stop Milvus container
- Restore from snapshot:
- Restart Milvus container
- Rebuild indexes (see the sketch after these steps)
- Verify search functionality (test queries)
Timeline: 1-2 hours
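A minimal sketch for step 4, assuming pymilvus; the collection name, vector field, and index parameters are placeholders that must match your deployment:
# Recreate the vector index and reload the collection after the snapshot restore
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # Milvus endpoint is an assumption
collection = Collection("chatbot_embeddings")  # placeholder collection name
collection.create_index(
    field_name="embedding",  # placeholder vector field name
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
)
collection.load()  # bring the collection back into memory so searches work again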
Scenario 3: Complete Azure Region Failure¶
Detection: All services down, Azure status page
Impact: Critical - complete outage
Current: ⚠️ No DR site configured
Planned: Q2 2025 - Secondary region (East US)
Recovery Steps (Manual):
- Provision new resources in secondary region
- Restore MongoDB from geo-redundant backup (~2 hours)
- Deploy application to new region (~30 min)
- Restore Milvus from backup (~1 hour)
- Update DNS to point to new region (~30 min)
- Test end-to-end functionality
Timeline: 4-6 hours (first incident)
Backup Testing¶
Monthly DR Drills:
- Restore test database from backup
- Verify data integrity
- Measure RTO/RPO
- Document lessons learned
Test Checklist:
- MongoDB PITR restore successful
- Milvus snapshot restore successful
- Connection strings updated
- Application connects to restored DBs
- Search functionality works
- No data loss detected
- RTO/RPO targets met
Part 3: Data Migration¶
🔄 Schema Migration Procedures¶
MongoDB Schema Migrations¶
Migration Script Template:
# migrations/001_add_password_hash_field.py
from pymongo import MongoClient
import bcrypt
from datetime import datetime

def migrate_up(db):
    """Add password_hash field, hash existing passwords."""
    users = db.users_multichatbot_v2
    # Add password_hash field to all users
    for user in users.find({"password_hash": {"$exists": False}}):
        plain_password = user.get("password")
        if plain_password:
            # Hash password with bcrypt
            password_hash = bcrypt.hashpw(
                plain_password.encode('utf-8'),
                bcrypt.gensalt(rounds=12)
            )
            users.update_one(
                {"_id": user["_id"]},
                {
                    "$set": {
                        "password_hash": password_hash.decode('utf-8'),
                        "migrated_at": datetime.utcnow()
                    }
                }
            )
            print(f"Migrated user: {user['email']}")
    print("Migration completed successfully!")

def migrate_down(db):
    """Rollback: Remove password_hash field."""
    db.users_multichatbot_v2.update_many(
        {},
        {"$unset": {"password_hash": "", "migrated_at": ""}}
    )
Execution:
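A minimal runner sketch, assuming a MONGO_URI environment variable; importlib is used only because the migration file name starts with a digit:
# Hedged sketch: load and run the migration against the target database
import os
import importlib.util
from pymongo import MongoClient

spec = importlib.util.spec_from_file_location(
    "migration_001", "migrations/001_add_password_hash_field.py"
)
migration = importlib.util.module_from_spec(spec)
spec.loader.exec_module(migration)

client = MongoClient(os.environ["MONGO_URI"])     # connection string is an assumption
migration.migrate_up(client["Machine_agent_qa"])  # database name is an assumption
# To roll back: migration.migrate_down(client["Machine_agent_qa"])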
Zero-Downtime Migration Strategy¶
Pattern: Dual-write, gradual cutover
Example: Migrating to password_hash field
Phase 1: Dual-Write (Week 1)
# Update both old and new fields
def signup(email, password, name):
    password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())
    users_collection.insert_one({
        "email": email,
        "password": password,  # OLD field (for backward compat)
        "password_hash": password_hash.decode(),  # NEW field
        "name": name
    })
Phase 2: Migrate Existing (Week 2)
# Run migration script to hash all existing passwords
python migrations/001_add_password_hash_field.py
Phase 3: Read from New Field (Week 3)
# Update login to use password_hash
def login(email, password):
    user = users_collection.find_one({"email": email})
    if not user:
        return False
    # Check new field first, fallback to old
    if user.get("password_hash"):
        if bcrypt.checkpw(password.encode(), user["password_hash"].encode()):
            return True
    elif user.get("password") == password:  # Fallback
        return True
    return False
Phase 4: Stop Writing Old Field (Week 4)
# Only write password_hash
def signup(email, password, name):
    password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())
    users_collection.insert_one({
        "email": email,
        "password_hash": password_hash.decode(),  # Only new field
        "name": name
    })
Phase 5: Remove Old Field (Week 5)
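Once no code path reads the old field, a minimal cleanup sketch (same users collection as above; connection details are assumptions):
# Drop the plain-text password field for good
from pymongo import MongoClient

users_collection = MongoClient()["Machine_agent_dev"]["users_multichatbot_v2"]
users_collection.update_many({}, {"$unset": {"password": ""}})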
Index Migration¶
Adding New Index:
# Create index in background (non-blocking)
db.chatbot_history.create_index(
    [("user_id", 1), ("timestamp", -1)],
    background=True,  # Don't block operations
    name="user_timestamp_idx"
)
Monitoring Progress:
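A minimal monitoring sketch, assuming pymongo and admin privileges; in-progress index builds appear in the server's currentOp output:
# Hedged sketch: list in-progress operations and pick out index builds
from pymongo import MongoClient

client = MongoClient()  # connection details are an assumption
for op in client.admin.command("currentOp")["inprog"]:
    msg = op.get("msg", "")
    if "Index Build" in msg:
        print(op.get("ns"), msg, op.get("progress"))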
Data Migration Between Environments¶
Dev → QA → Prod Pipeline:
#!/bin/bash
# migrate_data.sh
SOURCE_ENV="dev"
TARGET_ENV="qa"
# 1. Export from source
mongodump \
--uri="$SOURCE_MONGO_URI" \
--db="Machine_agent_${SOURCE_ENV}" \
--out="./backup_${SOURCE_ENV}"
# 2. Sanitize data (mask PII for QA)
python scripts/sanitize_backup.py \
--input="./backup_${SOURCE_ENV}" \
--output="./backup_${SOURCE_ENV}_sanitized"
# 3. Import to target
mongorestore \
--uri="$TARGET_MONGO_URI" \
--db="Machine_agent_${TARGET_ENV}" \
--dir="./backup_${SOURCE_ENV}_sanitized/Machine_agent_${SOURCE_ENV}"
# 4. Verify counts
python scripts/verify_migration.py \
--source="$SOURCE_MONGO_URI" \
--target="$TARGET_MONGO_URI"
📚 Related Documentation¶
Data Architecture:
- Database Schema - Collection schemas
- Vector Store - Milvus backups
- Index - Architecture overview
Security:
- Encryption - Backup encryption
- Compliance - GDPR retention
"Backups are insurance. DR is execution." 💾✅