ADR-002: Azure Cosmos DB for MongoDB API
Status: ✅ Accepted
Date: 2024-06-20 (Q2 2024)
Decision Makers: CTO, Backend Lead, DevOps Lead
Consulted: Finance Team, Solution Architect
Informed: Engineering Team
Context
MachineAvatars requires a scalable, globally distributed database to store:
- User accounts and profiles
- Chatbot configurations
- Conversation history (chat messages)
- Document metadata (PDFs, files uploaded by users)
- System prompts and guardrails
- Payment and subscription data
Requirements:
- Scale: Support 100K+ users, millions of documents
- Performance: < 100ms query latency for 95% of requests
- Global Distribution: Low latency for users in India, US, Europe
- Flexible Schema: JSON documents with evolving structure
- High Availability: 99.95%+ uptime SLA
- Azure Integration: Leverage existing Azure infrastructure
- Developer Experience: Familiar MongoDB syntax
Constraints:
- Budget: < $2,000/month for database at current scale
- Team expertise: Strong MongoDB experience (3+ years)
- Timeline: Must be production-ready within 4 weeks
Decision
We selected Azure Cosmos DB with MongoDB API (v4.2 compatibility).
Configuration
# Production Configuration
Account Type: Azure Cosmos DB for MongoDB (RU-based)
Region: Primary - East US, Secondary - Southeast Asia
Consistency: Session (default)
Throughput Model: Autoscale (400-4000 RU/s per collection)
Backup: Continuous (Point-in-Time Restore enabled)
Collections Structure
machineagents-db/
├── users # User accounts (100K docs)
├── chatbot_selection # Chatbot configs (500K docs)
├── chatbot_history # Conversation history (5M+ docs, TTL 90 days)
├── files # Document metadata with chunks (1M docs)
├── files_secondary # Document metadata without chunks (for queries)
├── system_prompts_default # Default AI prompts
├── system_prompts_user # User-customized prompts
├── guardrails # Safety guardrails
├── projectid_creation # Project metadata
├── selection_history # User selections
├── organisation_data # Organization info
└── generate_greetings # Chatbot greetings
Alternatives Considered
Alternative 1: MongoDB Atlas
Evaluated: MongoDB Atlas M10 cluster (dedicated)
Pros:
- ✅ Native MongoDB (100% compatibility)
- ✅ Excellent tooling (Atlas UI, Charts, Realm)
- ✅ Strong community and documentation
- ✅ Team familiarity (zero learning curve)
- ✅ Built-in search (Atlas Search)
Cons:
- ❌ Cost: $0.17/hour = $123/month (M10) → $600/month (M30 for scale)
- ❌ Azure integration: Requires VPC peering (complexity)
- ❌ Vendor lock-in: Different cloud provider
- ❌ Data transfer costs: Egress charges Azure → MongoDB Atlas
- ❌ Compliance: Data leaves Azure environment
Cost Comparison (at scale):
- MongoDB Atlas M30: ~$600/month
- Cosmos DB (4000 RU/s): ~$400/month
- Savings: $200/month
Why Rejected: Higher cost, worse Azure integration, data residency concerns
Alternative 2: Azure SQL Database
Evaluated: Azure SQL Database (S3 tier)
Pros:
- ✅ ACID transactions
- ✅ Strong typing (schema enforcement)
- ✅ Excellent BI/reporting tools
- ✅ Azure native integration
- ✅ Lower cost (~$75/month for S3)
Cons:
- ❌ Rigid schema: Poor fit for evolving chatbot configs
- ❌ JSON handling: Clunky compared to native document store
- ❌ Developer experience: Team has no SQL expertise
- ❌ Migration risk: Existing prototype uses MongoDB
- ❌ Embedding storage: Not optimized for large JSON arrays (embeddings)
Why Rejected: Schema rigidity and poor developer experience. Would require full rewrite of existing code.
Alternative 3: Self-Hosted MongoDB on Azure VMs
Evaluated: MongoDB Community Edition on Azure VMs (D4s_v3)
Pros:
- ✅ Full control
- ✅ No vendor lock-in (can migrate anywhere)
- ✅ Lowest cost (~$150/month for VM + storage)
- ✅ 100% MongoDB compatibility
Cons:
- ❌ DevOps burden: Manage backups, scaling, monitoring
- ❌ High availability: Must configure replica sets manually
- ❌ Global distribution: Complex multi-region setup
- ❌ Security: SSL/TLS, firewall rules, patching
- ❌ Time to production: 6-8 weeks vs. 2 weeks
Estimated effort:
- Initial setup: 80 hours
- Ongoing maintenance: 20 hours/month
- Opportunity cost: >$5,000 in engineering time
Why Rejected: Too much operational overhead for current team size (5 engineers). Revisit at 500K+ users.
Alternative 4: DynamoDB (AWS)
Evaluated: Migrate to AWS DynamoDB (since we use some AWS services)
Pros:
- ✅ Serverless (pay-per-request)
- ✅ Excellent scalability
- ✅ Low latency (single-digit ms)
- ✅ AWS native
Cons:
- ❌ Multi-cloud complexity: Split between Azure (AI) and AWS (DB)
- ❌ Learning curve: Different paradigm (key-value + document)
- ❌ Query limitations: No complex aggregations
- ❌ Data transfer costs: Azure → AWS egress expensive
- ❌ Migration effort: Complete rewrite required
Why Rejected: Multi-cloud complexity, expensive data transfer, learning curve
Decision Rationale
Why Cosmos DB for MongoDB API?
1. Azure Ecosystem Alignment
graph LR
A[Azure OpenAI] --> B[Cosmos DB]
C[Azure Blob Storage] --> B
D[Azure Container Apps] --> B
E[Azure Functions] --> B
style B fill:#4CAF50
- All services in one cloud = simplified networking
- No cross-cloud data transfer costs
- Unified billing and monitoring
- Azure AD integration for access control
2. MongoDB Compatibility = Zero Rewrite
Existing MongoDB code works as-is:
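# No query changes needed; only the connection string differs
import os
from pymongo import MongoClient
client = MongoClient(os.environ["COSMOS_CONNECTION_STRING"])
db = client["machineagents-db"]
# Same queries as against native MongoDB
users = db.users.find({"email": "user@example.com"})
chat_entry = {"session_id": "example-session", "message": "hello"}  # illustrative document
db.chatbot_history.insert_one(chat_entry)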
Migration: 2 weeks (connection string change + testing)
vs. SQL migration: 12 weeks (schema design + rewrite + testing)
3. Global Distribution (Built-in)
Primary: East US (India → 250ms)
Secondary: Southeast Asia (India → 50ms)
Automatic replication (no config needed)
MongoDB Atlas equivalent:
- Requires M40+ cluster ($1,000+/month)
- Manual replica set configuration
- Higher complexity
4. Autoscaling (Cost Optimization)
# Autoscale 400-4000 RU/s
Idle (night): 400 RU/s → $32/month
Peak (day): 4000 RU/s → $320/month
Average: ~1500 RU/s → $120/month
vs. Fixed provisioning (MongoDB Atlas):
- Must provision for peak always
- Pay $600/month even at night
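The autoscale figures above are consistent with a flat rate of about $0.08 per RU/s per month. That rate is back-derived from the numbers in this ADR (400 RU/s → $32), not taken from an Azure price sheet, so treat the following as a rough sketch of the arithmetic rather than a billing calculator:

```python
# Assumed flat rate back-derived from this ADR's figures (400 RU/s -> $32/month),
# not an official Azure price: 8 cents per RU/s per month.
CENTS_PER_RU_MONTH = 8


def monthly_cost_usd(avg_ru_per_s: int) -> float:
    """Estimated monthly cost in USD for a given average RU/s."""
    return avg_ru_per_s * CENTS_PER_RU_MONTH / 100


autoscale_avg = monthly_cost_usd(1500)  # average load under autoscale
fixed_peak = monthly_cost_usd(4000)     # provisioning for peak around the clock
print(f"autoscale: ${autoscale_avg:.0f}/month, fixed at peak: ${fixed_peak:.0f}/month")
```

Under this assumption the gap between paying for the average and paying for the peak is what autoscale recovers.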
5. Enterprise Features (Included)
- ✅ Continuous backup: Point-in-time restore (30 days)
- ✅ Encryption: At rest + in transit (automatic)
- ✅ Compliance: SOC 2, ISO 27001, HIPAA, GDPR
- ✅ Private endpoints: Azure VNet integration
- ✅ Analytics: Azure Synapse Link
MongoDB Atlas equivalent: Requires M10+ ($123+/month)
Consequences
Positive Consequences
✅ Fast Time to Market: 2 weeks to production (vs. 12 weeks for SQL)
✅ Cost Efficiency: $120/month average (vs. $600 for Atlas M30)
✅ Zero Learning Curve: Team already knows MongoDB
✅ Global Performance: < 50ms latency in Asia, < 30ms in US
✅ High Availability: 99.99% SLA (4 nines)
✅ Automatic Backups: Continuous, no configuration
✅ Azure Integration: Seamless with OpenAI, Blob Storage
Negative Consequences
❌ MongoDB Compatibility: Not 100% (v4.2, missing some features)
❌ Vendor Lock-in: Azure-specific (harder to migrate to AWS)
❌ Cost at Massive Scale: RU-based pricing expensive beyond 100K RU/s
❌ Limited Tooling: No Atlas Search, Atlas Charts
❌ Aggregation Limits: Some complex pipelines may hit limits
Mitigation Strategies
For Compatibility:
- Tested all queries in compatibility mode
- Avoided unsupported features (e.g., `$lookup` depth > 3)
- Documented workarounds in codebase
For Vendor Lock-in:
- Use standard MongoDB drivers (PyMongo)
- Abstract database access behind service layer
- Keep migration path to self-hosted MongoDB open
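One way to keep database access behind a service layer is a small storage interface that callers depend on instead of the driver. The sketch below is illustrative only: `UserStore`, `MongoUserStore`, `InMemoryUserStore`, and `register` are hypothetical names, not classes from the codebase.

```python
from typing import Optional, Protocol


class UserStore(Protocol):
    """Hypothetical storage interface; callers depend on this, not on a driver."""

    def get_by_email(self, email: str) -> Optional[dict]: ...
    def add(self, user: dict) -> None: ...


class MongoUserStore:
    """Backed by any PyMongo-compatible collection (Cosmos DB or self-hosted)."""

    def __init__(self, collection) -> None:
        self._users = collection

    def get_by_email(self, email: str) -> Optional[dict]:
        return self._users.find_one({"email": email})

    def add(self, user: dict) -> None:
        self._users.insert_one(user)


class InMemoryUserStore:
    """Stand-in backend; shows that callers never touch the driver."""

    def __init__(self) -> None:
        self._by_email = {}

    def get_by_email(self, email: str) -> Optional[dict]:
        return self._by_email.get(email)

    def add(self, user: dict) -> None:
        self._by_email[user["email"]] = user


def register(store: UserStore, email: str) -> dict:
    """Service-layer function: identical for every backend."""
    existing = store.get_by_email(email)
    if existing is not None:
        return existing
    user = {"email": email}
    store.add(user)
    return user
```

With this shape, moving off Cosmos DB means writing one new store class; `register` and its callers stay unchanged.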
For Cost at Scale:
- Monitor RU consumption weekly
- Optimize queries (add indexes)
- If costs > $2K/month, migrate to self-hosted MongoDB
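RU consumption can also be sampled per query during development. Cosmos DB's MongoDB API exposes a `getLastRequestStatistics` extension command that reports the charge of the previous operation. A minimal sketch follows, assuming `db` is a PyMongo `Database` connected to Cosmos DB; the helper names and any budget values are hypothetical.

```python
def last_request_charge(db) -> float:
    """Return the RU charge of the most recent operation on this connection.

    `db` is expected to be a pymongo Database connected to Cosmos DB.
    getLastRequestStatistics is a Cosmos DB extension command, so this
    will fail against a stock MongoDB server.
    """
    stats = db.command("getLastRequestStatistics")
    return float(stats["RequestCharge"])


def exceeds_budget(charge_ru: float, budget_ru: float) -> bool:
    """Flag an operation whose RU charge is above its per-query budget."""
    return charge_ru > budget_ru
```

Calling `last_request_charge(db)` right after a query and comparing it to a per-query budget gives an early signal that an index is missing.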
Performance Benchmarks
| Operation | Latency (p50) | Latency (p95) | RU Cost |
|---|---|---|---|
| User login (find) | 8ms | 15ms | 1 RU |
| Insert chat message | 12ms | 25ms | 5 RU |
| Query chat history (10 msgs) | 20ms | 40ms | 3 RU |
| Chatbot config lookup | 10ms | 18ms | 1 RU |
| Aggregate (complex) | 150ms | 300ms | 50 RU |
Average: 25ms p50, 50ms p95 ✅ Meets < 100ms requirement; the complex aggregate (300ms p95) is the only operation above the target
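A per-operation check against the 100ms target makes the outliers explicit. The sketch below copies the values from the table; the variable names are illustrative.

```python
P95_TARGET_MS = 100  # from the requirement: < 100ms for 95% of requests

# (p50 ms, p95 ms) per operation, copied from the benchmark table
benchmarks = {
    "User login (find)": (8, 15),
    "Insert chat message": (12, 25),
    "Query chat history (10 msgs)": (20, 40),
    "Chatbot config lookup": (10, 18),
    "Aggregate (complex)": (150, 300),
}

# Operations whose p95 latency does not meet the target
over_target = [op for op, (_, p95) in benchmarks.items() if p95 >= P95_TARGET_MS]
print(over_target)
```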
Implementation Details
Connection Configuration
# .env
COSMOS_CONNECTION_STRING=mongodb://machineagents:<key>@machineagents.mongo.cosmos.azure.com:10255/?ssl=true&replicaSet=globaldb&retrywrites=false&maxIdleTimeMS=120000
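# shared/database/db_manager.py
import os
from pymongo import MongoClient
# Read the connection string from the environment (.env above)
COSMOS_CONNECTION_STRING = os.environ["COSMOS_CONNECTION_STRING"]
client = MongoClient(
    COSMOS_CONNECTION_STRING,
    connectTimeoutMS=5000,          # fail fast if the endpoint is unreachable
    socketTimeoutMS=10000,
    serverSelectionTimeoutMS=5000,
)
db = client["machineagents-db"]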
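The connection string carries options that the RU-based MongoDB API depends on, notably TLS and `retrywrites=false`. A startup check along these lines can catch a misconfigured environment early. `check_cosmos_uri` is a hypothetical helper, and the rules it enforces are assumptions drawn from the connection string shown above.

```python
from urllib.parse import parse_qs, urlsplit


def check_cosmos_uri(uri: str) -> list:
    """Return a list of problems found in a Cosmos DB (RU-based) MongoDB URI."""
    parts = urlsplit(uri)
    # MongoDB URI option names are case-insensitive, so normalise the keys.
    options = {k.lower(): v[-1].lower() for k, v in parse_qs(parts.query).items()}
    problems = []
    if parts.scheme != "mongodb":
        problems.append("scheme must be mongodb://")
    if options.get("ssl", options.get("tls")) != "true":
        problems.append("TLS must be enabled (ssl=true)")
    if options.get("retrywrites") != "false":
        problems.append("retrywrites must be false")
    return problems
```

An empty list means the URI passed all three checks; anything else can be logged or raised before the client is created.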
Indexing Strategy
# Critical indexes
db.users.create_index([("email", 1)], unique=True)
db.chatbot_selection.create_index([("user_id", 1), ("project_id", 1)])
db.chatbot_history.create_index([("session_id", 1), ("timestamp", -1)])
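# TTL index (auto-delete old chats after 90 days = 7,776,000 seconds)
# On the RU-based MongoDB API, TTL indexes are supported only on the system field _ts
db.chatbot_history.create_index([("_ts", 1)], expireAfterSeconds=7776000)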
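The `expireAfterSeconds` value is the 90-day retention window expressed in seconds. Deriving it instead of hard-coding it keeps the intent visible; `RETENTION_DAYS` is an illustrative name.

```python
from datetime import timedelta

RETENTION_DAYS = 90  # retention window stated for chatbot_history
ttl_seconds = int(timedelta(days=RETENTION_DAYS).total_seconds())
print(ttl_seconds)  # 7776000
```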
Monitoring
# DataDog integration
- Track RU consumption
- Alert if > 3500 RU/s (approaching limit)
- Monitor query latency
- Track connection pool utilization
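The 3500 RU/s alert corresponds to 87.5% of the 4000 RU/s autoscale ceiling. A minimal sketch of that check, independent of any monitoring vendor, could look like this (`ru_status` is a hypothetical helper):

```python
MAX_RU_PER_S = 4000    # autoscale ceiling from the production configuration
ALERT_RU_PER_S = 3500  # alert threshold from the monitoring list


def ru_status(current_ru_per_s: float) -> tuple:
    """Return (utilization of the autoscale ceiling, whether to alert)."""
    utilization = current_ru_per_s / MAX_RU_PER_S
    return utilization, current_ru_per_s > ALERT_RU_PER_S
```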
Compliance & Security
Data Encryption:
- At rest: AES-256 (automatic)
- In transit: TLS 1.2+ (enforced)
Access Control:
- Azure AD integration (production)
- Connection string rotation every 90 days
- IP whitelisting (only Azure services)
GDPR/DPDPA:
- Data residency: Southeast Asia region (Singapore) serves users in India; the primary replica remains in East US
- Right to delete: Automated via API
- Data export: MongoDB export tools
Backup & Recovery:
- Continuous backup (30-day window)
- Point-in-time restore tested monthly
- RTO: < 2 hours, RPO: < 5 minutes
Migration Path
Scenario 1: Migrate to Self-Hosted MongoDB
If costs become prohibitive (> $2K/month):
Process:
- Provision Azure VMs (D8s_v3 × 3 for replica set)
- Configure MongoDB 6.0 replica set
- Use `mongodump` + `mongorestore` for migration
- Cutover with 5-minute downtime window
- Decommission Cosmos DB
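The dump-and-restore step could be scripted roughly as follows. This sketch only builds the command lines and does not run them; the URIs and the `dump` directory are placeholders, and `migration_commands` is a hypothetical helper.

```python
import shlex


def migration_commands(source_uri: str, target_uri: str, dump_dir: str = "dump"):
    """Build (not run) the dump and restore commands for the migration."""
    dump = ["mongodump", f"--uri={source_uri}", f"--out={dump_dir}"]
    restore = ["mongorestore", f"--uri={target_uri}", dump_dir]
    return dump, restore


dump, restore = migration_commands("mongodb://source.example/", "mongodb://target.example/")
print(shlex.join(dump))
print(shlex.join(restore))
```

Passing the lists to `subprocess.run(..., check=True)` would execute them; keeping construction separate makes the commands easy to review before the cutover window.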
Estimated Cost:
- VMs: $400/month
- Storage: $100/month
- Total: $500/month (vs. $2K Cosmos)
Estimated Time: 4 weeks
Scenario 2: Migrate to PostgreSQL
If document model becomes limiting:
Process:
- Design relational schema (users, chatbots, messages)
- Write migration scripts
- Dual-write period (both DBs)
- Validate data consistency
- Cutover to PostgreSQL
Estimated Time: 12 weeks (significant effort)
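The dual-write period in the process above can be sketched as a thin wrapper that writes to the current store first and mirrors to the new one. Both stores here are hypothetical objects exposing `save(record)`, and treating mirror failures as non-fatal is an assumption, not a decision recorded in this ADR.

```python
import logging

log = logging.getLogger("dual_write")


class DualWriter:
    """Write to the current store first, then mirror to the new one.

    The primary stays authoritative, so a mirror failure is logged and
    recorded for later reconciliation rather than raised.
    """

    def __init__(self, primary, mirror) -> None:
        self.primary = primary
        self.mirror = mirror
        self.mirror_failures = []

    def save(self, record: dict) -> None:
        self.primary.save(record)  # a failure here propagates to the caller
        try:
            self.mirror.save(record)
        except Exception as exc:   # keep serving traffic; reconcile later
            log.warning("mirror write failed: %s", exc)
            self.mirror_failures.append(record)
```

The `mirror_failures` list feeds the consistency validation step: those records need to be replayed before cutover.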
Review Schedule
Next Review: 2025-06-30 (1 year after implementation)
Review Criteria:
- Monthly cost < $500 average
- Query latency p95 < 100ms
- Uptime > 99.9%
- No compatibility issues encountered
- Team satisfaction > 4.0/5
Triggers for Re-evaluation:
- Monthly cost > $1,000
- Frequent compatibility issues
- New Azure database service launched
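The measurable criteria and the cost trigger are simple enough to evaluate automatically. The sketch below covers only the numeric ones (cost, p95 latency, uptime); the function names are hypothetical.

```python
def review(avg_monthly_cost_usd: float, p95_ms: float, uptime_pct: float) -> dict:
    """Evaluate the measurable review criteria; True means the criterion is met."""
    return {
        "cost < $500": avg_monthly_cost_usd < 500,
        "p95 < 100ms": p95_ms < 100,
        "uptime > 99.9%": uptime_pct > 99.9,
    }


def needs_reevaluation(monthly_cost_usd: float) -> bool:
    """Cost trigger for re-evaluation: monthly cost above $1,000."""
    return monthly_cost_usd > 1000
```

For example, `review(145, 62, 99.98)` uses the production metrics reported under Evidence & Data and marks all three criteria as met.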
Related ADRs
- ADR-003: Vector Database (Milvus) - Separate DB for embeddings
- ADR-004: Microservices Architecture - DB access patterns
- ADR-006: Azure Cloud Provider (planned)
Evidence & Data
Internal Testing (June 2024):
- Load test: 1,000 concurrent users
- Result: p95 latency 45ms, no errors
- Cost: 2,500 avg RU/s = $200/month
- ✅ Meets all requirements
Production Metrics (6 months):
- Uptime: 99.98% (exceeds the 99.95% requirement)
- Avg Latency: 28ms p50, 62ms p95
- Avg Cost: $145/month
- Incidents: 0 (zero database outages)
Last Updated: 2025-12-26
Review Date: 2025-06-30
Status: Active and performing excellently
"The best database is the one your team already knows."