Data Operations - Governance, Backup, Recovery & Migration

Section: 4-data-architecture-governance
Document: Combined Operations Guide
Coverage: Data governance, backup/recovery procedures, migration strategies
Audience: DBAs, DevOps, data engineers


🎯 Overview

This document consolidates operational procedures for data management including governance policies, backup/disaster recovery, and migration strategies.


Part 1: Data Governance

📋 Data Retention Policies

Conversation History Retention

Plan       | Retention Period | Auto-Delete | Export Available
Free       | 7 days           | ✅ Yes      | ✅ Yes
Pro        | 30 days          | ✅ Yes      | ✅ Yes
Business   | 90 days          | ✅ Yes      | ✅ Yes
Premium    | Unlimited        | ❌ No       | ✅ Yes
Enterprise | Unlimited        | ❌ No       | ✅ Yes

Implementation:

// Scheduled job (daily 2 AM)
db.chatbot_history.deleteMany({
  timestamp: {
    $lt: new Date(Date.now() - retention_days * 24 * 60 * 60 * 1000),
  },
});
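
To make the per-plan cutoff explicit, here is a minimal Python sketch of the same job with PyMongo. The PLAN_RETENTION_DAYS mapping mirrors the table above; the connection string, database name, plan field, and function names are illustrative assumptions, not the production job.

# retention_job.py - hedged sketch of the daily cleanup (assumed names, not the production job)
from datetime import datetime, timedelta
from pymongo import MongoClient

# Retention windows from the plan table above; None = unlimited (no auto-delete)
PLAN_RETENTION_DAYS = {"free": 7, "pro": 30, "business": 90, "premium": None, "enterprise": None}

def purge_expired_history(db, user):
    """Delete conversation history older than the user's plan allows."""
    retention_days = PLAN_RETENTION_DAYS.get(user.get("plan", "free"))
    if retention_days is None:
        return 0  # Unlimited plans: nothing to purge

    cutoff = datetime.utcnow() - timedelta(days=retention_days)
    result = db.chatbot_history.delete_many(
        {"user_id": user["_id"], "timestamp": {"$lt": cutoff}}
    )
    return result.deleted_count

if __name__ == "__main__":
    db = MongoClient("mongodb://localhost:27017")["Machine_agent_dev"]  # assumed URI / DB name
    for u in db.users_multichatbot_v2.find({}, {"plan": 1}):
        purge_expired_history(db, u)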

Audit Logs Retention

Plan               | Retention | Searchable
Free               | 7 days    |
Pro                | 30 days   |
Business           | 90 days   |
Premium/Enterprise | Unlimited |

Trash Retention

All Plans: 7 days for soft-deleted chatbots

Auto-Purge (planned):

// Daily cleanup
db.trash_collection_name.deleteMany({
  deleted_at: { $lt: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) },
});

🔒 Data Privacy & Security

PII Classification

Highly Sensitive (Requires Encryption + Masking):

  • Passwords (⚠️ currently plain text!)
  • API keys, secrets
  • Payment information (handled by Razorpay)

Sensitive (Requires Encryption):

  • Email addresses
  • User names
  • IP addresses
  • Conversation transcripts

Internal (Encrypted at Rest):

  • Chatbot configurations
  • System prompts
  • Analytics data

Public (No special handling):

  • Public URLs
  • Product descriptions
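
Where this classification is enforced in code (logging, data exports), a small helper along these lines could apply it consistently. The field lists and function names below are illustrative assumptions, not an existing module.

# pii.py - illustrative sketch; the classes mirror the lists above (field/function names assumed)
HIGHLY_SENSITIVE = {"password", "password_hash", "api_key", "secret"}   # never emitted
SENSITIVE = {"email", "name", "ip_address", "transcript"}               # masked when emitted

def mask_value(value):
    """Keep a short prefix for debugging, mask the rest."""
    return value[:2] + "***" if isinstance(value, str) and len(value) > 2 else "***"

def redact_document(doc: dict) -> dict:
    """Return a copy safe for logs/exports: drop highly sensitive fields, mask sensitive ones."""
    safe = {}
    for key, value in doc.items():
        if key in HIGHLY_SENSITIVE:
            continue                        # passwords, keys, secrets are dropped entirely
        elif key in SENSITIVE:
            safe[key] = mask_value(value)   # e.g. "al***" instead of alice@example.com
        else:
            safe[key] = value               # internal/public fields pass through unchanged
    return safe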

GDPR/DPDPA Compliance

Right to Access:

  • Data export API (/v1/gdpr/data-export)
  • JSON format, includes all collections
  • Available in all plans

Right to Erasure:

  • Account deletion API (/v1/gdpr/delete-account)
  • Cascading delete across all collections
  • Permanent deletion (no recovery)
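
As a rough sketch only, the cascading delete behind /v1/gdpr/delete-account could look like the PyMongo snippet below. The collection list is an assumption drawn from collections named elsewhere in this guide, and the helper is not the actual endpoint implementation.

# gdpr_delete.py - hedged sketch of a cascading, permanent account delete
USER_SCOPED_COLLECTIONS = [
    "chatbot_history",
    "trash_collection_name",
    # ...extend with every collection keyed by user_id in the deployment
]

def delete_account(db, user_id):
    """Permanently remove every document owned by the user. There is no recovery path."""
    deleted = {}
    for name in USER_SCOPED_COLLECTIONS:
        deleted[name] = db[name].delete_many({"user_id": user_id}).deleted_count
    # Finally remove the account record itself
    db.users_multichatbot_v2.delete_one({"_id": user_id})
    return deleted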

Data Minimization:

  • Only collect necessary data
  • No tracking cookies without consent
  • Analytics opt-out available

Part 2: Backup & Recovery

💾 Backup Strategy

MongoDB (Cosmos DB)

Automated Backups:

  • Frequency: Daily (automatic)
  • Time: 2:00 AM UTC
  • Retention: 35 days (configurable up to 365 days)
  • Type: Full database backup
  • Encryption: AES-256
  • Location: Azure geo-redundant storage

Point-in-Time Restore (PITR):

  • Restore to any point within retention window
  • Granularity: 1-minute intervals
  • RTO: 1-4 hours

Verification:

# Azure CLI - List backups
az cosmosdb mongodb database backup list \
  --account-name machineavatars-mongodb \
  --resource-group machineavatars-rg

Milvus Vector Database

Nightly Snapshots:

  • Frequency: Daily (nightly)
  • Time: 3:00 AM UTC
  • Retention: 7 days
  • Type: Azure Blob storage snapshot
  • Size: ~6GB compressed

Backup Script:

#!/bin/bash
# Milvus backup script

DATE=$(date +%Y-%m-%d)
BACKUP_NAME="milvus-backup-${DATE}"
SEVEN_DAYS_AGO=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)

# Create Azure Blob snapshot (snapshots are addressed by their creation timestamp;
# the backup name is attached as metadata for traceability)
az storage blob snapshot \
  --account-name qablobmachineagents \
  --container-name milvus-data \
  --name collections/ \
  --metadata backup_name=${BACKUP_NAME}

# Cleanup old snapshots (>7 days): list snapshots and delete any older than the cutoff
az storage blob list \
  --account-name qablobmachineagents \
  --container-name milvus-data \
  --include s \
  --query "[?snapshot < '${SEVEN_DAYS_AGO}'].[name, snapshot]" \
  --output tsv \
  | while read -r NAME SNAPSHOT; do
      az storage blob delete \
        --account-name qablobmachineagents \
        --container-name milvus-data \
        --name "${NAME}" \
        --snapshot "${SNAPSHOT}"
    done

🔄 Disaster Recovery

RTO & RPO Targets

Component      | RTO (Recovery Time) | RPO (Recovery Point) | DR Strategy
MongoDB        | 4 hours             | 1 hour               | PITR from backup
Milvus         | 2 hours             | 24 hours             | Restore from snapshot
Application    | 1 hour              | N/A                  | Blue-green deployment
Overall System | 4 hours             | 1 hour               | Coordinated recovery

RTO: Maximum time to restore service
RPO: Maximum acceptable data loss


DR Scenarios & Procedures

Scenario 1: MongoDB Data Corruption

Detection: Application errors, data inconsistencies
Impact: High - chatbots cannot function

Recovery Steps:

  1. Identify corruption time (from logs/monitoring)
  2. Initiate PITR:
    az cosmosdb mongodb database restore \
      --account-name machineavatars-mongodb \
      --resource-group machineavatars-rg \
      --restore-timestamp "2025-01-15T10:30:00Z" \
      --target-database-name Machine_agent_dev_restored
    
  3. Verify restored data (spot checks, row counts; see the sketch below)
  4. Switch traffic to restored database (update connection string)
  5. Monitor for 24 hours
  6. Delete corrupted database after verification

Timeline: 2-4 hours
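
For step 3, a sketch of what the verification could look like: comparing collection counts and spot-checking the newest restored conversation. The script and collection list are assumptions; database names follow the restore command above.

# verify_restore.py - sketch of post-PITR verification (assumed helper, not an existing tool)
from pymongo import MongoClient

COLLECTIONS = ["users_multichatbot_v2", "chatbot_history"]  # extend per deployment

def compare(source_uri, restored_uri):
    src = MongoClient(source_uri)["Machine_agent_dev"]
    dst = MongoClient(restored_uri)["Machine_agent_dev_restored"]
    for name in COLLECTIONS:
        print(f"{name}: source={src[name].estimated_document_count()} "
              f"restored={dst[name].estimated_document_count()}")
    # Spot check: the newest restored conversation should sit just before the restore timestamp
    latest = dst.chatbot_history.find_one(sort=[("timestamp", -1)])
    print("Newest restored conversation:", latest["timestamp"] if latest else "NONE")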


Scenario 2: Milvus Vector Loss

Detection: Empty search results, collection not found
Impact: Medium - chatbots respond without context

Recovery Steps:

  1. Stop Milvus container
  2. Restore from snapshot:
    az storage blob copy start \
      --source-snapshot milvus-backup-2025-01-14 \
      --destination-container milvus-data
    
  3. Restart Milvus container
  4. Rebuild indexes (expanded in the pymilvus sketch below):
    for collection in collections:
        collection.create_index("embedding", index_params)
    
  5. Verify search functionality (test queries)

Timeline: 1-2 hours
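
Expanding the pseudocode in step 4, a pymilvus sketch could look like this. The host, index parameters, and vector field name are assumptions to match to the deployment.

# rebuild_indexes.py - sketch of step 4 with pymilvus (host/index params/field name assumed)
from pymilvus import Collection, connections, utility

connections.connect(alias="default", host="localhost", port="19530")  # assumed Milvus endpoint

index_params = {
    "index_type": "IVF_FLAT",            # assumed; match whatever the collections used originally
    "metric_type": "L2",
    "params": {"nlist": 1024},
}

for name in utility.list_collections():
    collection = Collection(name)
    collection.create_index(field_name="embedding", index_params=index_params)
    collection.load()                    # reload into memory so searches work again
    print(f"Rebuilt index and loaded: {name}")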


Scenario 3: Complete Azure Region Failure

Detection: All services down, Azure status page
Impact: Critical - complete outage

Current: ⚠️ No DR site configured
Planned: Q2 2025 - Secondary region (East US)

Recovery Steps (Manual):

  1. Provision new resources in secondary region
  2. Restore MongoDB from geo-redundant backup (~2 hours)
  3. Deploy application to new region (~30 min)
  4. Restore Milvus from backup (~1 hour)
  5. Update DNS to point to new region (~30 min)
  6. Test end-to-end functionality

Timeline: 4-6 hours (first incident)


Backup Testing

Monthly DR Drills:

  • Restore test database from backup
  • Verify data integrity
  • Measure RTO/RPO
  • Document lessons learned
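
To make the RTO/RPO measurement repeatable across drills, something like the sketch below could time the restore and compare the newest restored record against the simulated incident time. Names and structure are assumptions, not an existing drill script.

# dr_drill.py - rough sketch for recording drill results (assumed names and structure)
import time
from datetime import datetime
from pymongo import MongoClient

def run_drill(restored_uri, incident_time: datetime):
    start = time.monotonic()

    # ...trigger the PITR / snapshot restore here, then connect to the restored database...
    db = MongoClient(restored_uri)["Machine_agent_dev_restored"]
    db.command("ping")                     # service reachable again -> stop the RTO clock

    rto_min = (time.monotonic() - start) / 60
    latest = db.chatbot_history.find_one(sort=[("timestamp", -1)])
    rpo_min = (incident_time - latest["timestamp"]).total_seconds() / 60 if latest else None
    print(f"RTO: {rto_min:.1f} min (target 240) | RPO: {rpo_min} min (target 60)")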

Test Checklist:

  • MongoDB PITR restore successful
  • Milvus snapshot restore successful
  • Connection strings updated
  • Application connects to restored DBs
  • Search functionality works
  • No data loss detected
  • RTO/RPO targets met

Part 3: Data Migration

🔄 Schema Migration Procedures

MongoDB Schema Migrations

Migration Script Template:

# migrations/001_add_password_hash_field.py
from pymongo import MongoClient
import bcrypt
from datetime import datetime

def migrate_up(db):
    """Add password_hash field, hash existing passwords."""
    users = db.users_multichatbot_v2

    # Add password_hash field to all users
    for user in users.find({"password_hash": {"$exists": False}}):
        plain_password = user.get("password")

        if plain_password:
            # Hash password with bcrypt
            password_hash = bcrypt.hashpw(
                plain_password.encode('utf-8'),
                bcrypt.gensalt(rounds=12)
            )

            users.update_one(
                {"_id": user["_id"]},
                {
                    "$set": {
                        "password_hash": password_hash.decode('utf-8'),
                        "migrated_at": datetime.utcnow()
                    }
                }
            )

            print(f"Migrated user: {user['email']}")

    print("Migration completed successfully!")

def migrate_down(db):
    """Rollback: Remove password_hash field."""
    db.users_multichatbot_v2.update_many(
        {},
        {"$unset": {"password_hash": "", "migrated_at": ""}}
    )

Execution:

python run_migration.py migrations/001_add_password_hash_field.py
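
run_migration.py is referenced above but not shown in this guide; a minimal sketch of such a runner, assuming each migration module exposes migrate_up/migrate_down as in the template, could be:

# run_migration.py - minimal sketch of a runner (assumed; pairs with the template above)
import importlib.util
import sys
from pymongo import MongoClient

def load_migration(path):
    spec = importlib.util.spec_from_file_location("migration", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

if __name__ == "__main__":
    path = sys.argv[1]                                   # e.g. migrations/001_add_password_hash_field.py
    direction = sys.argv[2] if len(sys.argv) > 2 else "up"
    db = MongoClient("mongodb://localhost:27017")["Machine_agent_dev"]   # assumed URI / DB name
    migration = load_migration(path)
    (migration.migrate_up if direction == "up" else migration.migrate_down)(db)

Under this sketch, rollback would be the same command with a trailing "down" argument.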

Zero-Downtime Migration Strategy

Pattern: Dual-write, gradual cutover

Example: Migrating to password_hash field

Phase 1: Dual-Write (Week 1)

# Update both old and new fields
def signup(email, password, name):
    password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())

    users_collection.insert_one({
        "email": email,
        "password": password,  # OLD field (for backward compat)
        "password_hash": password_hash.decode(),  # NEW field
        "name": name
    })

Phase 2: Migrate Existing (Week 2)

# Run migration script to hash all existing passwords
python run_migration.py migrations/001_add_password_hash_field.py

Phase 3: Read from New Field (Week 3)

# Update login to use password_hash
def login(email, password):
    user = users_collection.find_one({"email": email})

    # Check new field first, fallback to old
    if user.get("password_hash"):
        if bcrypt.checkpw(password.encode(), user["password_hash"].encode()):
            return True
    elif user.get("password") == password:  # Fallback
        return True

    return False

Phase 4: Stop Writing Old Field (Week 4)

# Only write password_hash
def signup(email, password, name):
    password_hash = bcrypt.hashpw(password.encode(), bcrypt.gensalt())

    users_collection.insert_one({
        "email": email,
        "password_hash": password_hash.decode(),  # Only new field
        "name": name
    })

Phase 5: Remove Old Field (Week 5)

# Drop old password field
db.users_multichatbot_v2.update_many(
    {},
    {"$unset": {"password": ""}}
)

Index Migration

Adding New Index:

# Create index in background (non-blocking)
db.chatbot_history.create_index(
    [("user_id", 1), ("timestamp", -1)],
    background=True,  # Don't block operations
    name="user_timestamp_idx"
)

Monitoring Progress:

db.currentOp({
  "command.createIndexes": { $exists: true },
});

Data Migration Between Environments

Dev → QA → Prod Pipeline:

#!/bin/bash
# migrate_data.sh

SOURCE_ENV="dev"
TARGET_ENV="qa"

# 1. Export from source
mongodump \
  --uri="$SOURCE_MONGO_URI" \
  --db="Machine_agent_${SOURCE_ENV}" \
  --out="./backup_${SOURCE_ENV}"

# 2. Sanitize data (mask PII for QA)
python scripts/sanitize_backup.py \
  --input="./backup_${SOURCE_ENV}" \
  --output="./backup_${SOURCE_ENV}_sanitized"

# 3. Import to target
mongorestore \
  --uri="$TARGET_MONGO_URI" \
  --db="Machine_agent_${TARGET_ENV}" \
  --dir="./backup_${SOURCE_ENV}_sanitized/Machine_agent_${SOURCE_ENV}"

# 4. Verify counts
python scripts/verify_migration.py \
  --source="$SOURCE_MONGO_URI" \
  --target="$TARGET_MONGO_URI"
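
scripts/sanitize_backup.py (step 2) is referenced but not shown; below is a hedged sketch of the PII-masking pass over mongodump's BSON output, using the bson helpers bundled with PyMongo. The masked field list mirrors the PII classification in Part 1; everything else is an assumption.

# sanitize_backup.py - sketch of PII masking over a mongodump directory (assumed behaviour)
import argparse
import os
import shutil

import bson  # bundled with pymongo

MASKED_FIELDS = {"email", "name", "password", "password_hash"}  # from the PII classification

def sanitize_file(src_path, dst_path):
    """Rewrite one .bson dump file with sensitive fields replaced."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for doc in bson.decode_file_iter(src):
            for field in MASKED_FIELDS & doc.keys():
                doc[field] = "REDACTED"
            dst.write(bson.encode(doc))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input")
    parser.add_argument("--output")
    args = parser.parse_args()

    shutil.copytree(args.input, args.output, dirs_exist_ok=True)  # keep metadata/index files
    for root, _, files in os.walk(args.output):
        for filename in files:
            if filename.endswith(".bson"):
                path = os.path.join(root, filename)
                sanitize_file(path, path + ".tmp")
                os.replace(path + ".tmp", path)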

Progress: Section 4 - 4-6/8 combined (75%)

"Backups are insurance. DR is execution." 💾✅