ADR-002: Azure Cosmos DB for MongoDB API

Status: ✅ Accepted
Date: 2024-06-20 (Q2 2024)
Decision Makers: CTO, Backend Lead, DevOps Lead
Consulted: Finance Team, Solution Architect
Informed: Engineering Team


Context

MachineAvatars requires a scalable, globally distributed database to store:

  • User accounts and profiles
  • Chatbot configurations
  • Conversation history (chat messages)
  • Document metadata (PDFs, files uploaded by users)
  • System prompts and guardrails
  • Payment and subscription data

Requirements:

  1. Scale: Support 100K+ users, millions of documents
  2. Performance: < 100ms query latency for 95% of requests
  3. Global Distribution: Low latency for users in India, US, Europe
  4. Flexible Schema: JSON documents with evolving structure
  5. High Availability: 99.95%+ uptime SLA
  6. Azure Integration: Leverage existing Azure infrastructure
  7. Developer Experience: Familiar MongoDB syntax

Constraints:

  • Budget: < $2,000/month for database at current scale
  • Team expertise: Strong MongoDB experience (3+ years)
  • Timeline: Need production-ready in 4 weeks

Decision

We selected Azure Cosmos DB with MongoDB API (v4.2 compatibility).

Configuration

# Production Configuration
Account Type: Azure Cosmos DB for MongoDB (RU-based)
Region: Primary - East US, Secondary - Southeast Asia
Consistency: Session (default)
Throughput Model: Autoscale (400-4000 RU/s per collection)
Backup: Continuous (Point-in-Time Restore enabled)

Collections Structure

machineagents-db/
├── users                    # User accounts (100K docs)
├── chatbot_selection        # Chatbot configs (500K docs)
├── chatbot_history          # Conversation history (5M+ docs, TTL 90 days)
├── files                    # Document metadata with chunks (1M docs)
├── files_secondary          # Document metadata without chunks (for queries)
├── system_prompts_default   # Default AI prompts
├── system_prompts_user      # User-customized prompts
├── guardrails               # Safety guardrails
├── projectid_creation       # Project metadata
├── selection_history        # User selections
├── organisation_data        # Organization info
└── generate_greetings       # Chatbot greetings

Alternatives Considered

Alternative 1: MongoDB Atlas

Evaluated: MongoDB Atlas M10 cluster (dedicated)

Pros:

  • ✅ Native MongoDB (100% compatibility)
  • ✅ Excellent tooling (Atlas UI, Charts, Realm)
  • ✅ Strong community and documentation
  • ✅ Team familiarity (zero learning curve)
  • ✅ Built-in search (Atlas Search)

Cons:

  • Cost: $0.17/hour = $123/month (M10) → $600/month (M30 for scale)
  • Azure integration: Requires VPC peering (complexity)
  • Vendor lock-in: Different cloud provider
  • Data transfer costs: Egress charges Azure → MongoDB Atlas
  • Compliance: Data leaves Azure environment

Cost Comparison (at scale):

  • MongoDB Atlas M30: ~$600/month
  • Cosmos DB (4000 RU/s): ~$400/month
  • Savings: $200/month

Why Rejected: Higher cost, worse Azure integration, data residency concerns


Alternative 2: Azure SQL Database

Evaluated: Azure SQL Database (S3 tier)

Pros:

  • ✅ ACID transactions
  • ✅ Strong typing (schema enforcement)
  • ✅ Excellent BI/reporting tools
  • ✅ Azure native integration
  • ✅ Lower cost (~$75/month for S3)

Cons:

  • Rigid schema: Poor fit for evolving chatbot configs
  • JSON handling: Clunky compared to native document store
  • Developer experience: Team has no SQL expertise
  • Migration risk: Existing prototype uses MongoDB
  • Embedding storage: Not optimized for large JSON arrays (embeddings)

Why Rejected: Schema rigidity and poor developer experience. Would require full rewrite of existing code.


Alternative 3: Self-Hosted MongoDB on Azure VMs

Evaluated: MongoDB Community Edition on Azure VMs (D4s_v3)

Pros:

  • ✅ Full control
  • ✅ No vendor lock-in (can migrate anywhere)
  • ✅ Lowest cost (~$150/month for VM + storage)
  • ✅ 100% MongoDB compatibility

Cons:

  • DevOps burden: Manage backups, scaling, monitoring
  • High availability: Must configure replica sets manually
  • Global distribution: Complex multi-region setup
  • Security: SSL/TLS, firewall rules, patching
  • Time to production: 6-8 weeks vs. 2 weeks

Estimated effort:

  • Initial setup: 80 hours
  • Ongoing maintenance: 20 hours/month
  • Opportunity cost: >$5,000 in engineering time

Why Rejected: Too much operational overhead for current team size (5 engineers). Revisit at 500K+ users.


Alternative 4: DynamoDB (AWS)

Evaluated: Migrate to AWS DynamoDB (since we use some AWS services)

Pros:

  • ✅ Serverless (pay-per-request)
  • ✅ Excellent scalability
  • ✅ Low latency (single-digit ms)
  • ✅ AWS native

Cons:

  • Multi-cloud complexity: Split between Azure (AI) and AWS (DB)
  • Learning curve: Different paradigm (key-value + document)
  • Query limitations: No complex aggregations
  • Data transfer costs: Azure → AWS egress expensive
  • Migration effort: Complete rewrite required

Why Rejected: Multi-cloud complexity, expensive data transfer, learning curve


Decision Rationale

Why Cosmos DB for MongoDB API?

1. Azure Ecosystem Alignment

graph LR
    A[Azure OpenAI] --> B[Cosmos DB]
    C[Azure Blob Storage] --> B
    D[Azure Container Apps] --> B
    E[Azure Functions] --> B

    style B fill:#4CAF50

  • All services in one cloud = simplified networking
  • No cross-cloud data transfer costs
  • Unified billing and monitoring
  • Azure AD integration for access control

2. MongoDB Compatibility = Zero Rewrite

Existing MongoDB code works as-is:

# Only the connection string changes
import os

from pymongo import MongoClient

client = MongoClient(os.environ["COSMOS_CONNECTION_STRING"])
db = client["machineagents-db"]

# Same queries as against native MongoDB
users = db.users.find({"email": "user@example.com"})
db.chatbot_history.insert_one(chat_entry)  # chat_entry: message dict built by the caller

Migration: 2 weeks (connection string change + testing)
vs. SQL migration: 12 weeks (schema design + rewrite + testing)


3. Global Distribution (Built-in)

Primary: East US (India → 250ms)
Secondary: Southeast Asia (India → 50ms)
Automatic replication (no config needed)

MongoDB Atlas equivalent:

  • Requires M40+ cluster ($1,000+/month)
  • Manual replica set configuration
  • Higher complexity

4. Autoscaling (Cost Optimization)

# Autoscale 400-4000 RU/s
Idle (night): 400 RU/s → $32/month
Peak (day): 4000 RU/s → $320/month
Average: ~1500 RU/s → $120/month

vs. Fixed provisioning (MongoDB Atlas):

  • Must provision for peak always
  • Pay $600/month even at night
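
The autoscale figures above imply a flat rate of about $0.08 per RU/s per month ($32 for 400 RU/s). The sketch below reproduces them under that assumption. The rate is inferred from this ADR's own numbers rather than taken from Azure's price list, and `estimate_monthly_cost` is a hypothetical helper. Azure bills autoscale hourly on the highest RU/s reached in each hour, so pricing a flat average is a simplification.

```python
# Rate inferred from this ADR's figures ($32/month at 400 RU/s).
# This is an assumption for illustration, not an official Azure price.
USD_PER_RU_PER_MONTH = 32 / 400


def estimate_monthly_cost(avg_ru_per_sec: float) -> float:
    """Approximate monthly cost in USD for a given average RU/s."""
    return round(avg_ru_per_sec * USD_PER_RU_PER_MONTH, 2)


if __name__ == "__main__":
    for label, ru in [("idle", 400), ("average", 1500), ("peak", 4000)]:
        print(f"{label}: {ru} RU/s -> ${estimate_monthly_cost(ru):.2f}/month")
```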

5. Enterprise Features (Included)

  • Continuous backup: Point-in-time restore (30 days)
  • Encryption: At rest + in transit (automatic)
  • Compliance: SOC 2, ISO 27001, HIPAA, GDPR
  • Private endpoints: Azure VNet integration
  • Analytics: Azure Synapse Link

MongoDB Atlas equivalent: Requires M10+ ($123+/month)


Consequences

Positive Consequences

  • Fast Time to Market: 2 weeks to production (vs. 12 weeks for SQL)
  • Cost Efficiency: $120/month average (vs. $600 for Atlas M30)
  • Zero Learning Curve: Team already knows MongoDB
  • Global Performance: < 50ms latency in Asia, < 30ms in US
  • High Availability: 99.99% SLA (4 nines)
  • Automatic Backups: Continuous, no configuration
  • Azure Integration: Seamless with OpenAI, Blob Storage

Negative Consequences

  • MongoDB Compatibility: Not 100% (v4.2, missing some features)
  • Vendor Lock-in: Azure-specific (harder to migrate to AWS)
  • Cost at Massive Scale: RU-based pricing expensive beyond 100K RU/s
  • Limited Tooling: No Atlas Search, Atlas Charts
  • Aggregation Limits: Some complex pipelines may hit limits

Mitigation Strategies

For Compatibility:

  • Tested all queries in compatibility mode
  • Avoided unsupported features (e.g., $lookup depth > 3)
  • Documented workarounds in codebase

For Vendor Lock-in:

  • Use standard MongoDB drivers (PyMongo)
  • Abstract database access behind service layer
  • Keep migration path to self-hosted MongoDB open
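
One way to read "abstract database access behind service layer" is a small storage interface that callers depend on instead of PyMongo. The sketch below is illustrative only. `UserStore`, `MongoUserStore`, `InMemoryUserStore`, and `is_registered` are hypothetical names, not classes from the MachineAvatars codebase.

```python
from typing import Any, Optional, Protocol


class UserStore(Protocol):
    """Storage-agnostic interface the rest of the code depends on (hypothetical)."""

    def get_by_email(self, email: str) -> Optional[dict[str, Any]]: ...

    def add(self, user: dict[str, Any]) -> None: ...


class MongoUserStore:
    """Backed by any PyMongo-compatible collection (Cosmos DB or native MongoDB)."""

    def __init__(self, collection: Any) -> None:
        self._collection = collection

    def get_by_email(self, email: str) -> Optional[dict[str, Any]]:
        return self._collection.find_one({"email": email})

    def add(self, user: dict[str, Any]) -> None:
        self._collection.insert_one(user)


class InMemoryUserStore:
    """Stand-in for tests; callers cannot tell it apart from the Mongo version."""

    def __init__(self) -> None:
        self._users: dict[str, dict[str, Any]] = {}

    def get_by_email(self, email: str) -> Optional[dict[str, Any]]:
        return self._users.get(email)

    def add(self, user: dict[str, Any]) -> None:
        self._users[user["email"]] = user


def is_registered(store: UserStore, email: str) -> bool:
    """Business logic written against the interface, not the driver."""
    return store.get_by_email(email) is not None
```

With this shape, moving to self-hosted MongoDB only changes the collection passed to `MongoUserStore`. A move to another database means writing one new class rather than touching every caller.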

For Cost at Scale:

  • Monitor RU consumption weekly
  • Optimize queries (add indexes)
  • If costs > $2K/month, migrate to self-hosted MongoDB
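
To see which queries drive RU consumption, the RU-based MongoDB API offers a custom `getLastRequestStatistics` command that reports the charge of the previous request. The sketch below assumes `db` is a PyMongo `Database` and that the measured operation and the statistics command run on the same connection, which a connection pool does not guarantee. `ru_charge_of` is a hypothetical helper.

```python
from typing import Any, Callable


def ru_charge_of(db: Any, operation: Callable[[], Any]) -> float:
    """Run `operation`, then ask Cosmos DB what the last request cost in RUs.

    Assumes `db` is a PyMongo Database on the RU-based MongoDB API and that
    `operation` issues exactly one request (cursors must be fully consumed).
    """
    operation()
    stats = db.command("getLastRequestStatistics")
    return float(stats["RequestCharge"])
```

A call might look like `ru_charge_of(db, lambda: list(db.users.find({"email": "user@example.com"})))`. The `list(...)` forces the lazy cursor to execute before the charge is read.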

Performance Benchmarks

Operation                    | Latency (p50) | Latency (p95) | RU Cost
-----------------------------|---------------|---------------|--------
User login (find)            | 8ms           | 15ms          | 1 RU
Insert chat message          | 12ms          | 25ms          | 5 RU
Query chat history (10 msgs) | 20ms          | 40ms          | 3 RU
Chatbot config lookup        | 10ms          | 18ms          | 1 RU
Aggregate (complex)          | 150ms         | 300ms         | 50 RU

Average: 25ms p50, 50ms p95 ✅ Meets the < 100ms p95 requirement. Only complex aggregations (150ms p50, 300ms p95) exceed the target.
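
The benchmark harness is not described in this ADR. As a minimal sketch of how p50 and p95 figures like these can be derived from raw latency samples, Python's standard library is enough. `latency_percentiles` is a hypothetical helper, and the "inclusive" interpolation method is one choice among several.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99,
    # so index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def meets_target(samples_ms: list[float], target_p95_ms: float = 100.0) -> bool:
    """Check the requirement from this ADR: p95 latency under 100ms."""
    _, p95 = latency_percentiles(samples_ms)
    return p95 < target_p95_ms
```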


Implementation Details

Connection Configuration

# .env
COSMOS_CONNECTION_STRING=mongodb://machineagents:<key>@machineagents.mongo.cosmos.azure.com:10255/?ssl=true&replicaSet=globaldb&retrywrites=false&maxIdleTimeMS=120000

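# shared/database/db_manager.py
import os

from pymongo import MongoClient

# Set in the environment (see .env above); keeps the account key out of source control
COSMOS_CONNECTION_STRING = os.environ["COSMOS_CONNECTION_STRING"]

client = MongoClient(
    COSMOS_CONNECTION_STRING,
    connectTimeoutMS=5000,          # fail fast if the endpoint is unreachable
    socketTimeoutMS=10000,          # cap the wait on a single operation
    serverSelectionTimeoutMS=5000,  # cap the wait for a usable server
)

db = client["machineagents-db"]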
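
The connection string sets `retrywrites=false`, so the driver does not retry writes on its own. When a request exceeds the provisioned RU/s, Cosmos DB's MongoDB API rejects it with error code 16500. A common pattern, sketched here under assumptions, is to retry with exponential backoff in application code. `with_throttle_retry` is a hypothetical helper, and it detects throttling through the `code` attribute that `pymongo.errors.OperationFailure` exposes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error code Cosmos DB's MongoDB API reports for rate-limited requests.
COSMOS_THROTTLED_CODE = 16500


def with_throttle_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff when it is throttled."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            throttled = getattr(exc, "code", None) == COSMOS_THROTTLED_CODE
            # Re-raise anything that is not throttling, or the final failure.
            if not throttled or attempt == max_attempts - 1:
                raise
            sleep(base_delay_s * 2**attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

Usage might look like `with_throttle_retry(lambda: db.chatbot_history.insert_one(chat_entry))`. The attempt count and delays are illustrative defaults, not tuned values.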

Indexing Strategy

# Critical indexes
db.users.create_index([("email", 1)], unique=True)
db.chatbot_selection.create_index([("user_id", 1), ("project_id", 1)])
db.chatbot_history.create_index([("session_id", 1), ("timestamp", -1)])

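# TTL index (auto-delete old chats after 90 days = 7,776,000 seconds).
# The RU-based MongoDB API only supports TTL indexes on the system _ts
# (last-modified) field, not on custom fields such as "datetime".
db.chatbot_history.create_index([("_ts", 1)], expireAfterSeconds=7776000)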
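
The `expireAfterSeconds` value is the 90-day retention period expressed in seconds. Deriving it from a `timedelta` rather than hard-coding the number makes the intent easier to audit. The constant names below are hypothetical.

```python
from datetime import timedelta

CHAT_RETENTION = timedelta(days=90)
CHAT_TTL_SECONDS = int(CHAT_RETENTION.total_seconds())  # 90 * 24 * 60 * 60

print(CHAT_TTL_SECONDS)  # 7776000
```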

Monitoring

# Datadog integration
- Track RU consumption
- Alert if > 3500 RU/s (approaching limit)
- Monitor query latency
- Track connection pool utilization
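
The 3500 RU/s alert rule above can be expressed as a small check over recent throughput samples. This is a sketch, not the team's Datadog monitor definition. `should_alert` is a hypothetical helper, and requiring several consecutive breaches is an assumption added here to avoid alerting on a single spike.

```python
from typing import Sequence

RU_ALERT_THRESHOLD = 3500  # from the monitoring rule above; autoscale max is 4000 RU/s


def should_alert(ru_samples: Sequence[float], consecutive: int = 3) -> bool:
    """True when the last `consecutive` samples all exceed the threshold.

    The ADR only specifies the threshold; the consecutive-breach window
    is an assumption made for this example.
    """
    if consecutive < 1 or len(ru_samples) < consecutive:
        return False
    return all(ru > RU_ALERT_THRESHOLD for ru in ru_samples[-consecutive:])
```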

Compliance & Security

Data Encryption:

  • At rest: AES-256 (automatic)
  • In transit: TLS 1.2+ (enforced)

Access Control:

  • Azure AD integration (production)
  • Connection string rotation every 90 days
  • IP whitelisting (only Azure services)

GDPR/DPDPA:

  • Data residency: Southeast Asia region (Singapore) serves users in India; the primary region is East US
  • Right to delete: Automated via API
  • Data export: MongoDB export tools

Backup & Recovery:

  • Continuous backup (30-day window)
  • Point-in-time restore tested monthly
  • RTO: < 2 hours, RPO: < 5 minutes

Migration Path

Scenario 1: Migrate to Self-Hosted MongoDB

If costs become prohibitive (> $2K/month):

Process:

  1. Provision Azure VMs (D8s_v3 × 3 for replica set)
  2. Configure MongoDB 6.0 replica set
  3. Use mongodump + mongorestore for migration
  4. Cutover with 5-minute downtime window
  5. Decommission Cosmos DB
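
Before the cutover in step 4, it is worth confirming that `mongorestore` copied every document. The sketch below compares per-collection counts between the two deployments. `count_mismatches` is a hypothetical helper, and both arguments are assumed to be PyMongo `Database` objects (anything that supports `db[name].count_documents({})`). Matching counts are a sanity check, not proof that the contents are identical.

```python
from typing import Any


def count_mismatches(
    source: Any, target: Any, collections: list[str]
) -> dict[str, tuple[int, int]]:
    """Return {collection: (source_count, target_count)} where counts differ."""
    mismatches: dict[str, tuple[int, int]] = {}
    for name in collections:
        src = source[name].count_documents({})
        dst = target[name].count_documents({})
        if src != dst:
            mismatches[name] = (src, dst)
    return mismatches
```

An empty result for collections such as `users`, `chatbot_selection`, and `files` would be one precondition for scheduling the downtime window. `chatbot_history` may drift between dump and restore because the TTL index keeps deleting expired chats.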

Estimated Cost:

  • VMs: $400/month
  • Storage: $100/month
  • Total: $500/month (vs. $2K Cosmos)

Estimated Time: 4 weeks


Scenario 2: Migrate to PostgreSQL

If document model becomes limiting:

Process:

  1. Design relational schema (users, chatbots, messages)
  2. Write migration scripts
  3. Dual-write period (both DBs)
  4. Validate data consistency
  5. Cutover to PostgreSQL

Estimated Time: 12 weeks (significant effort)


Review Schedule

Next Review: 2025-06-30 (1 year after implementation)

Review Criteria:

  • Monthly cost < $500 average
  • Query latency p95 < 100ms
  • Uptime > 99.9%
  • No compatibility issues encountered
  • Team satisfaction > 4.0/5

Triggers for Re-evaluation:

  • Monthly cost > $1,000
  • Frequent compatibility issues
  • New Azure database service launched
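
The quantitative review criteria can be checked mechanically at each review. The sketch below encodes the three numeric thresholds from the list above. `ReviewMetrics` and `failed_criteria` are hypothetical names. Compatibility issues and team satisfaction are left out because they need human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewMetrics:
    monthly_cost_usd: float
    p95_latency_ms: float
    uptime_pct: float


def failed_criteria(m: ReviewMetrics) -> list[str]:
    """Return the names of the quantitative review criteria that are not met."""
    checks = {
        "monthly cost < $500": m.monthly_cost_usd < 500,
        "p95 latency < 100ms": m.p95_latency_ms < 100,
        "uptime > 99.9%": m.uptime_pct > 99.9,
    }
    return [name for name, ok in checks.items() if not ok]
```

Feeding in the six-month production figures reported in this ADR ($145/month, 62ms p95, 99.98% uptime) returns an empty list, meaning all three criteria are met.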

Related ADRs

  • ADR-003: Vector Database (Milvus) - Separate DB for embeddings
  • ADR-004: Microservices Architecture - DB access patterns
  • ADR-006: Azure Cloud Provider (planned)

Evidence & Data

Internal Testing (June 2024):

  • Load test: 1,000 concurrent users
  • Result: p95 latency 45ms, no errors
  • Cost: 2,500 avg RU/s = $200/month
  • ✅ Meets all requirements

Production Metrics (6 months):

  • Uptime: 99.98% (exceeds the 99.95% requirement)
  • Avg Latency: 28ms p50, 62ms p95
  • Avg Cost: $145/month
  • Incidents: 0 (zero database outages)


Last Updated: 2025-12-26
Review Date: 2025-06-30
Status: Active and performing excellently


"The best database is the one your team already knows."