Data Operations - Governance, Backup, Recovery & Migration

Section: 4-data-architecture-governance
Document: Combined Operations Guide
Coverage: Data governance, backup/recovery procedures, migration strategies
Audience: DBAs, DevOps, data engineers


🎯 Overview

This document consolidates operational procedures for data management including governance policies, backup/disaster recovery, and migration strategies.


Part 1: Data Governance

📋 Data Retention Policies

Conversation History Retention

Plan       | Retention Period | Auto-Delete | Export Available
Free       | 7 days           | ✅ Yes      | ✅ Yes
Pro        | 30 days          | ✅ Yes      | ✅ Yes
Business   | 90 days          | ✅ Yes      | ✅ Yes
Premium    | Unlimited        | ❌ No       | ✅ Yes
Enterprise | Unlimited        | ❌ No       | ✅ Yes

Implementation:

// Scheduled job (daily 2 AM)
db.chatbot_history.deleteMany({
  timestamp: {
    $lt: new Date(Date.now() - retention_days * 24 * 60 * 60 * 1000),
  },
});
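
To make the per-plan cutoff explicit, here is a minimal Python sketch of the same job with PyMongo. The PLAN_RETENTION_DAYS mapping mirrors the table above; the connection string, database name, plan field, and function names are illustrative assumptions, not the production job.

# retention_job.py - hedged sketch of the daily cleanup (assumed names, not the production job)
from datetime import datetime, timedelta
from pymongo import MongoClient

# Retention windows from the plan table above; None = unlimited (no auto-delete)
PLAN_RETENTION_DAYS = {"free": 7, "pro": 30, "business": 90, "premium": None, "enterprise": None}

def purge_expired_history(db, user):
    """Delete conversation history older than the user's plan allows."""
    retention_days = PLAN_RETENTION_DAYS.get(user.get("plan", "free"))
    if retention_days is None:
        return 0  # Unlimited plans: nothing to purge

    cutoff = datetime.utcnow() - timedelta(days=retention_days)
    result = db.chatbot_history.delete_many(
        {"user_id": user["_id"], "timestamp": {"$lt": cutoff}}
    )
    return result.deleted_count

if __name__ == "__main__":
    db = MongoClient("mongodb://localhost:27017")["Machine_agent_dev"]  # assumed URI / DB name
    for u in db.users_multichatbot_v2.find({}, {"plan": 1}):
        purge_expired_history(db, u)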

Audit Logs Retention

Plan               | Retention | Searchable
Free               | 7 days    |
Pro                | 30 days   |
Business           | 90 days   |
Premium/Enterprise | Unlimited |

Trash Retention

All Plans: 7 days for soft-deleted chatbots

Auto-Purge (planned):

// Daily cleanup
db.trash_collection_name.deleteMany({
  deleted_at: { $lt: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) },
});

🔒 Data Privacy & Security

PII Classification

Highly Sensitive (Requires Encryption + Masking):

  • Passwords (⚠️ currently plain text!)
  • API keys, secrets
  • Payment information (handled by Razorpay)

Sensitive (Requires Encryption):

  • Email addresses
  • User names
  • IP addresses
  • Conversation transcripts

Internal (Encrypted at Rest):

  • Chatbot configurations
  • System prompts
  • Analytics data

Public (No special handling):

  • Public URLs
  • Product descriptions
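
Where this classification is enforced in code (logging, data exports), a small helper along these lines could apply it consistently. The field lists and function names below are illustrative assumptions, not an existing module.

# pii.py - illustrative sketch; the classes mirror the lists above (field/function names assumed)
HIGHLY_SENSITIVE = {"password", "password_hash", "api_key", "secret"}   # never emitted
SENSITIVE = {"email", "name", "ip_address", "transcript"}               # masked when emitted

def mask_value(value):
    """Keep a short prefix for debugging, mask the rest."""
    return value[:2] + "***" if isinstance(value, str) and len(value) > 2 else "***"

def redact_document(doc: dict) -> dict:
    """Return a copy safe for logs/exports: drop highly sensitive fields, mask sensitive ones."""
    safe = {}
    for key, value in doc.items():
        if key in HIGHLY_SENSITIVE:
            continue                        # passwords, keys, secrets are dropped entirely
        elif key in SENSITIVE:
            safe[key] = mask_value(value)   # e.g. "al***" instead of alice@example.com
        else:
            safe[key] = value               # internal/public fields pass through unchanged
    return safe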

GDPR/DPDPA Compliance

Right to Access:

  • Data export API (/v1/gdpr/data-export)
  • JSON format, includes all collections
  • Available in all plans

Right to Erasure:

  • Account deletion API (/v1/gdpr/delete-account)
  • Cascading delete across all collections
  • Permanent deletion (no recovery)
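
As a rough sketch only, the cascading delete behind /v1/gdpr/delete-account could look like the PyMongo snippet below. The collection list is an assumption drawn from collections named elsewhere in this guide, and the helper is not the actual endpoint implementation.

# gdpr_delete.py - hedged sketch of a cascading, permanent account delete
USER_SCOPED_COLLECTIONS = [
    "chatbot_history",
    "trash_collection_name",
    # ...extend with every collection keyed by user_id in the deployment
]

def delete_account(db, user_id):
    """Permanently remove every document owned by the user. There is no recovery path."""
    deleted = {}
    for name in USER_SCOPED_COLLECTIONS:
        deleted[name] = db[name].delete_many({"user_id": user_id}).deleted_count
    # Finally remove the account record itself
    db.users_multichatbot_v2.delete_one({"_id": user_id})
    return deleted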

Data Minimization:

  • Only collect necessary data
  • No tracking cookies without consent
  • Analytics opt-out available

Part 2: Backup & Recovery

💾 Backup Strategy

MongoDB (Cosmos DB)

Automated Backups:

  • Frequency: Daily (automatic)
  • Time: 2:00 AM UTC
  • Retention: 35 days (configurable up to 365 days)
  • Type: Full database backup
  • Encryption: AES-256
  • Location: Azure geo-redundant storage

Point-in-Time Restore (PITR):

  • Restore to any point within retention window
  • Granularity: 1-minute intervals
  • RTO: 1-4 hours

Verification:

# Azure CLI - List backups
az cosmosdb mongodb database backup list \
  --account-name machineavatars-mongodb \
  --resource-group machineavatars-rg

Milvus Vector Database

Nightly Snapshots:

  • Frequency: Daily (nightly)
  • Time: 3:00 AM UTC
  • Retention: 7 days
  • Type: Azure Blob storage snapshot
  • Size: ~6GB compressed

Backup Script:

#!/bin/bash
# Milvus backup script

DATE=$(date +%Y-%m-%d)
BACKUP_NAME="milvus-backup-${DATE}"
SEVEN_DAYS_AGO=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)

# Create Azure Blob snapshot (snapshots are addressed by their creation timestamp;
# the backup name is attached as metadata for traceability)
az storage blob snapshot \
  --account-name qablobmachineagents \
  --container-name milvus-data \
  --name collections/ \
  --metadata backup_name=${BACKUP_NAME}

# Cleanup old snapshots (>7 days): list snapshots and delete any older than the cutoff
az storage blob list \
  --account-name qablobmachineagents \
  --container-name milvus-data \
  --include s \
  --query "[?snapshot < '${SEVEN_DAYS_AGO}'].[name, snapshot]" \
  --output tsv \
  | while read -r NAME SNAPSHOT; do
      az storage blob delete \
        --account-name qablobmachineagents \
        --container-name milvus-data \
        --name "${NAME}" \
        --snapshot "${SNAPSHOT}"
    done

🔄 Disaster Recovery

RTO & RPO Targets

Component      | RTO (Recovery Time) | RPO (Recovery Point) | DR Strategy
MongoDB        | 4 hours             | 1 hour               | PITR from backup
Milvus         | 2 hours             | 24 hours             | Restore from snapshot
Application    | 1 hour              | N/A                  | Blue-green deployment
Overall System | 4 hours             | 1 hour               | Coordinated recovery

RTO: Maximum time to restore service
RPO: Maximum acceptable data loss


DR Scenarios & Procedures

Scenario 1: MongoDB Data Corruption

Detection: Application errors, data inconsistencies
Impact: High - chatbots cannot function

Recovery Steps:

  1. Identify corruption time (from logs/monitoring)
  2. Initiate PITR:
    az cosmosdb mongodb database restore \
      --account-name machineavatars-mongodb \
      --resource-group machineavatars-rg \
      --restore-timestamp "2025-01-15T10:30:00Z" \
      --target-database-name Machine_agent_dev_restored
    
  3. Verify restored data (spot checks, row counts; see the sketch below)
  4. Switch traffic to restored database (update connection string)
  5. Monitor for 24 hours
  6. Delete corrupted database after verification

Timeline: 2-4 hours
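
For step 3, a sketch of what the verification could look like: comparing collection counts and spot-checking the newest restored conversation. The script and collection list are assumptions; database names follow the restore command above.

# verify_restore.py - sketch of post-PITR verification (assumed helper, not an existing tool)
from pymongo import MongoClient

COLLECTIONS = ["users_multichatbot_v2", "chatbot_history"]  # extend per deployment

def compare(source_uri, restored_uri):
    src = MongoClient(source_uri)["Machine_agent_dev"]
    dst = MongoClient(restored_uri)["Machine_agent_dev_restored"]
    for name in COLLECTIONS:
        print(f"{name}: source={src[name].estimated_document_count()} "
              f"restored={dst[name].estimated_document_count()}")
    # Spot check: the newest restored conversation should sit just before the restore timestamp
    latest = dst.chatbot_history.find_one(sort=[("timestamp", -1)])
    print("Newest restored conversation:", latest["timestamp"] if latest else "NONE")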


Scenario 2: Milvus Vector Loss

Detection: Empty search results, collection not found
Impact: Medium - chatbots respond without context

Recovery Steps:

  1. Stop Milvus container
  2. Restore from snapshot:
    az storage blob copy start \
      --source-snapshot milvus-backup-2025-01-14 \
      --destination-container milvus-data
    
  3. Restart Milvus container
  4. Rebuild indexes (expanded in the pymilvus sketch below):
    for collection in collections:
        collection.create_index("embedding", index_params)
    
  5. Verify search functionality (test queries)

Timeline: 1-2 hours
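
Expanding the pseudocode in step 4, a pymilvus sketch could look like this. The host, index parameters, and vector field name are assumptions to match to the deployment.

# rebuild_indexes.py - sketch of step 4 with pymilvus (host/index params/field name assumed)
from pymilvus import Collection, connections, utility

connections.connect(alias="default", host="localhost", port="19530")  # assumed Milvus endpoint

index_params = {
    "index_type": "IVF_FLAT",            # assumed; match whatever the collections used originally
    "metric_type": "L2",
    "params": {"nlist": 1024},
}

for name in utility.list_collections():
    collection = Collection(name)
    collection.create_index(field_name="embedding", index_params=index_params)
    collection.load()                    # reload into memory so searches work again
    print(f"Rebuilt index and loaded: {name}")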


Scenario 3: Complete Azure Region Failure

Detection: All services down, Azure status page
Impact: Critical - complete outage

Current: ⚠️ No DR site configured
Planned: Q2 2025 - Secondary region (East US)

Recovery Steps (Manual):

  1. Provision new resources in secondary region
  2. Restore MongoDB from geo-redundant backup (~2 hours)
  3. Deploy application to new region (~30 min)
  4. Restore Milvus from backup (~1 hour)
  5. Update DNS to point to new region (~30 min)
  6. Test end-to-end functionality

Timeline: 4-6 hours (first incident)


Backup Testing

Monthly DR Drills:

  • Restore test database from backup
  • Verify data integrity
  • Measure RTO/RPO
  • Document lessons learned
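
To make the RTO/RPO measurement repeatable across drills, something like the sketch below could time the restore and compare the newest restored record against the simulated incident time. Names and structure are assumptions, not an existing drill script.

# dr_drill.py - rough sketch for recording drill results (assumed names and structure)
import time
from datetime import datetime
from pymongo import MongoClient

def run_drill(restored_uri, incident_time: datetime):
    start = time.monotonic()

    # ...trigger the PITR / snapshot restore here, then connect to the restored database...
    db = MongoClient(restored_uri)["Machine_agent_dev_restored"]
    db.command("ping")                     # service reachable again -> stop the RTO clock

    rto_min = (time.monotonic() - start) / 60
    latest = db.chatbot_history.find_one(sort=[("timestamp", -1)])
    rpo_min = (incident_time - latest["timestamp"]).total_seconds() / 60 if latest else None
    print(f"RTO: {rto_min:.1f} min (target 240) | RPO: {rpo_min} min (target 60)")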

Test Checklist:

  • MongoDB PITR restore successful
  • Milvus snapshot restore successful
  • Connection strings updated
  • Application connects to restored DBs
  • Search functionality works
  • No data loss detected
  • RTO/RPO targets met

Part 3: Data Migration

🔄 Schema Migration Procedures

MongoDB Schema Migrations

Migration Script Template:

# migrations/001_add_password_hash_field.py
from pymongo import MongoClient
import bcrypt
from datetime import datetime

def migrate_up(db):
    """Add password_hash field, hash existing passwords."""
    users = db.users_multichatbot_v2

    # Add password_hash field to all users
    for user in users.find({"password_hash": {"$exists": False}}):
        plain_password = user.get("password")

        if plain_password:
            # Hash password with bcrypt
            password_hash = bcrypt.hashpw(
                plain_password.encode('utf-8'),
                bcrypt.gensalt(rounds=12)
            )

            users.update_one(
                {"_id": user["_id"]},
                {
                    "$set": {
                        "password_hash": password_hash.decode('utf-8'),
                        "migrated_at": datetime.utcnow()
                    }
                }
            )

            print(f"Migrated user: {user['email']}")

    print("Migration completed successfully!")

def migrate_down(db):
    """Rollback: Remove password_hash field."""
    db.users_multichatbot_v2.update_many(
        {},
        {"$unset": {"password_hash": "", "migrated_at": ""}}
    )

Execution:

python run_migration.py migrations/001_add_password_hash_field.py
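
run_migration.py is referenced above but not shown in this guide; a minimal sketch of such a runner, assuming each migration module exposes migrate_up/migrate_down as in the template, could be:

# run_migration.py - minimal sketch of a runner (assumed; pairs with the template above)
import importlib.util
import sys
from pymongo import MongoClient

def load_migration(path):
    spec = importlib.util.spec_from_file_location("migration", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

if __name__ == "__main__":
    path = sys.argv[1]                                   # e.g. migrations/001_add_password_hash_field.py
    direction = sys.argv[2] if len(sys.argv) > 2 else "up"
    db = MongoClient("mongodb://localhost:27017")["Machine_agent_dev"]   # assumed URI / DB name
    migration = load_migration(path)
    (migration.migrate_up if direction == "up" else migration.migrate_down)(db)

Under this sketch, rollback would be the same command with a trailing "down" argument.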

Zero-Downtime Migration Strategy

Pattern: Dual-write, gradual cutover

Example: Migrating to password_hash field

Phase 1: Dual-Write (Week 1)

# Update both old and new fields
def signup(email, password, name):
    password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())

    users_collection.insert_one({
        "email": email,
        "password": password,  # OLD field (for backward compat)
        "password_hash": password_hash.decode(),  # NEW field
        "name": name
    })

Phase 2: Migrate Existing (Week 2)

# Run migration script to hash all existing passwords
python run_migration.py migrations/001_add_password_hash_field.py

Phase 3: Read from New Field (Week 3)

# Update login to use password_hash
def login(email, password):
    user = users_collection.find_one({"email": email})

    # Check new field first, fallback to old
    if user.get("password_hash"):
        if bcrypt.checkpw(password.encode(), user["password_hash"].encode()):
            return True
    elif user.get("password") == password:  # Fallback
        return True

    return False

Phase 4: Stop Writing Old Field (Week 4)

# Only write password_hash
def signup(email, password, name):
    password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())

    users_collection.insert_one({
        "email": email,
        "password_hash": password_hash.decode(),  # Only new field
        "name": name
    })

Phase 5: Remove Old Field (Week 5)

# Drop old password field
db.users_multichatbot_v2.update_many(
    {},
    {"$unset": {"password": ""}}
)

Index Migration

Adding New Index:

# Create index in background (non-blocking)
db.chatbot_history.create_index(
    [("user_id", 1), ("timestamp", -1)],
    background=True,  # Don't block operations
    name="user_timestamp_idx"
)

Monitoring Progress:

db.currentOp({
  "command.createIndexes": { $exists: true },
});

Data Migration Between Environments

Dev → QA → Prod Pipeline:

#!/bin/bash
# migrate_data.sh

SOURCE_ENV="dev"
TARGET_ENV="qa"

# 1. Export from source
mongodump \
  --uri="$SOURCE_MONGO_URI" \
  --db="Machine_agent_${SOURCE_ENV}" \
  --out="./backup_${SOURCE_ENV}"

# 2. Sanitize data (mask PII for QA)
python scripts/sanitize_backup.py \
  --input="./backup_${SOURCE_ENV}" \
  --output="./backup_${SOURCE_ENV}_sanitized"

# 3. Import to target
mongorestore \
  --uri="$TARGET_MONGO_URI" \
  --db="Machine_agent_${TARGET_ENV}" \
  --dir="./backup_${SOURCE_ENV}_sanitized/Machine_agent_${SOURCE_ENV}"

# 4. Verify counts
python scripts/verify_migration.py \
  --source="$SOURCE_MONGO_URI" \
  --target="$TARGET_MONGO_URI"
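
scripts/sanitize_backup.py (step 2) is referenced but not shown; below is a hedged sketch of the PII-masking pass over mongodump's BSON output, using the bson helpers bundled with PyMongo. The masked field list mirrors the PII classification in Part 1; everything else is an assumption.

# sanitize_backup.py - sketch of PII masking over a mongodump directory (assumed behaviour)
import argparse
import os
import shutil

import bson  # bundled with pymongo

MASKED_FIELDS = {"email", "name", "password", "password_hash"}  # from the PII classification

def sanitize_file(src_path, dst_path):
    """Rewrite one .bson dump file with sensitive fields replaced."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for doc in bson.decode_file_iter(src):
            for field in MASKED_FIELDS & doc.keys():
                doc[field] = "REDACTED"
            dst.write(bson.encode(doc))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input")
    parser.add_argument("--output")
    args = parser.parse_args()

    shutil.copytree(args.input, args.output, dirs_exist_ok=True)  # keep metadata/index files
    for root, _, files in os.walk(args.output):
        for filename in files:
            if filename.endswith(".bson"):
                path = os.path.join(root, filename)
                sanitize_file(path, path + ".tmp")
                os.replace(path + ".tmp", path)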

Progress: Section 4 - 4-6/8 combined (75%)

"Backups are insurance. DR is execution." 💾✅