Data Architecture & Governance
Section: 4-data-architecture-governance
Status: Comprehensive Data Architecture Documentation
Last Updated: 2025-12-30
Audience: Backend developers, database administrators, data engineers
Overview
MachineAvatars uses a hybrid data architecture combining MongoDB (Cosmos DB) for operational data and Milvus for vector embeddings, enabling both traditional CRUD operations and AI-powered semantic search through Retrieval-Augmented Generation (RAG).
Data Storage Technologies:
- MongoDB (Cosmos DB) - Primary database for all operational data
- Milvus - Vector database for semantic search
- GridFS - Large file storage (MongoDB)
- Azure Blob Storage - Milvus persistence layer
Data Architecture Diagram
graph TB
subgraph "Application Layer"
FE[Frontend<br/>Next.js]
GW[Gateway<br/>Port 8000/9000]
end
subgraph "Backend Services (23)"
AUTH[Auth Service]
USER[User Service]
CREATE[Create Chatbot]
MAINTAIN[Chatbot Maintenance]
RESP[Response Services<br/>3D/Text/Voice]
CRAWL[Data Crawling]
HISTORY[Chat History]
end
subgraph "Data Storage"
MONGO[(MongoDB<br/>Cosmos DB<br/>9 Collections)]
MILVUS[(Milvus<br/>Vector DB<br/>Embeddings)]
GRIDFS[(GridFS<br/>File Storage)]
end
subgraph "External Services"
OPENAI[Azure OpenAI<br/>Embeddings]
end
FE --> GW
GW --> AUTH
GW --> USER
GW --> CREATE
GW --> MAINTAIN
GW --> RESP
AUTH --> MONGO
USER --> MONGO
CREATE --> MONGO
MAINTAIN --> MONGO
HISTORY --> MONGO
CRAWL --> OPENAI
OPENAI --> MILVUS
RESP --> MILVUS
RESP --> MONGO
CREATE --> GRIDFS
RESP --> GRIDFS
style MONGO fill:#E3F2FD
style MILVUS fill:#FFF3E0
style GRIDFS fill:#C8E6C9
Database Summary
MongoDB (Cosmos DB)
Provider: Azure Cosmos DB for MongoDB API
Region: Central India
Tier: Provisioned throughput
Encryption: AES-256 (automatic)
Collections:
| Collection | Purpose | Documents | Avg Size | Key Indexes |
|---|---|---|---|---|
| users_multichatbot_v2 | User accounts | ~10K | 2KB | email, user_id |
| chatbot_selections | Chatbot configs | ~5K | 5KB | project_id, user_id |
| chatbot_history | Conversations | ~500K | 1KB | project_id, session_id |
| files | Uploaded files metadata | ~2K | 500B | user_id, project_id |
| files_secondary | Additional files | ~1K | 500B | user_id |
| system_prompts_user | Custom prompts | ~3K | 2KB | user_id, project_id |
| projectid_creation | Project metadata | ~5K | 1KB | user_id, project_id |
| organisation_data | Enterprise orgs | ~50 | 3KB | organization_id |
| trash_collection_name | Soft deletes | ~500 | 5KB | user_id |
Total Storage: ~2GB (operational data)
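As a rough sanity check on the table above, the per-collection estimates can be multiplied out. This is a minimal sketch that assumes decimal units (1 KB = 1000 bytes) and treats the table's approximate counts as exact; both are simplifications, not measured values.

```python
# Approximate (document count, average document size in bytes) per collection,
# copied from the table above. Assumption: decimal units, 1 KB = 1000 bytes.
COLLECTIONS = {
    "users_multichatbot_v2": (10_000, 2_000),
    "chatbot_selections": (5_000, 5_000),
    "chatbot_history": (500_000, 1_000),
    "files": (2_000, 500),
    "files_secondary": (1_000, 500),
    "system_prompts_user": (3_000, 2_000),
    "projectid_creation": (5_000, 1_000),
    "organisation_data": (50, 3_000),
    "trash_collection_name": (500, 5_000),
}


def raw_document_bytes(collections):
    """Sum count * average size over all collections."""
    return sum(count * avg_size for count, avg_size in collections.values())


total = raw_document_bytes(COLLECTIONS)
largest = max(COLLECTIONS, key=lambda name: COLLECTIONS[name][0] * COLLECTIONS[name][1])
print(f"Raw documents: {total / 1e9:.2f} GB, largest collection: {largest}")
```

By this estimate the raw documents come to roughly 0.56 GB, dominated by chatbot_history. The gap to the ~2GB total presumably reflects indexes, GridFS content, and storage overhead, though this page does not break that down.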
Milvus Vector Database
Deployment: Azure Container Instance
Storage Backend: Azure Blob Storage
Vector Dimensions: 1536 (OpenAI text-embedding-ada-002)
Collections:
| Collection | Vectors | Index Type | Distance Metric | Purpose |
|---|---|---|---|---|
| chatbotvectors | Variable | IVF_FLAT | Cosine | RAG context retrieval |
Total Vectors: ~1M across all projects
Storage: ~6GB (embeddings + metadata)
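The storage figure can be cross-checked against the vector count and dimension. The sketch below assumes each component is stored as a 4-byte float32, which is an assumption about the deployment rather than something stated here.

```python
DIMENSIONS = 1536        # text-embedding-ada-002 output size, as listed above
BYTES_PER_FLOAT32 = 4    # assumption: vectors are stored as float32


def raw_vector_bytes(num_vectors, dims=DIMENSIONS, width=BYTES_PER_FLOAT32):
    """Raw size of the embeddings alone, excluding metadata and index overhead."""
    return num_vectors * dims * width


current = raw_vector_bytes(1_000_000)
print(f"{current / 1e9:.2f} GB for 1M vectors")
```

One million vectors come to about 6.14 GB of raw floats, which lines up with the ~6GB figure. IVF_FLAT stores vectors uncompressed, so the index itself adds relatively little on top of the raw data.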
Data Flow Overview
5 Major Data Flows
1. User Registration Flow
2. Chatbot Creation Flow
3. Data Ingestion Flow
User Upload → Backend → Processing → Azure OpenAI → Embeddings → Milvus
Website URL → Crawling → Chunking → Embeddings → Milvus
Q&A Pairs → Direct → Milvus
4. Chatbot Response Flow (RAG)
User Question → Embedding → Milvus Search → Top-K Context → LLM → Response
↓
MongoDB (chatbot_history)
5. Analytics Flow
Conversations (MongoDB) → Aggregation Pipeline → Dashboard
→ Real-time WebSocket → Live Updates
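The ingestion and response flows (3 and 4) can be illustrated end to end with a toy, in-memory version. This is only a sketch: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the Azure OpenAI embedding call, a plain list stands in for Milvus, and the chunk size and 64-dimension vectors are arbitrary choices for illustration.

```python
import hashlib
import math

DIM = 64  # toy dimension; the real embeddings are 1536-dimensional


def chunk_text(text, max_words=8):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """Hypothetical stand-in for the embedding service: hashed bag-of-words."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.md5(word.strip(".,?!").encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, store, k=2):
    """Return the k stored chunks most similar to the query text."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Ingestion: chunk -> embed -> store (the list stands in for Milvus).
document = (
    "Milvus stores embeddings for semantic search. "
    "MongoDB stores users, chatbot configurations and conversation history. "
    "GridFS stores uploaded files."
)
store = [(chunk, toy_embed(chunk)) for chunk in chunk_text(document)]

# Response: embed the question -> similarity search -> context for the LLM prompt.
context = top_k("Where are embeddings stored?", store, k=1)
print(context)
```

In the real pipeline the retrieved chunks would be placed into the LLM prompt, and the resulting exchange written to chatbot_history.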
Database Design Principles
1. Denormalization
- User information duplicated in multiple collections for performance
- No foreign key constraints (MongoDB philosophy)
- Application-level referential integrity
2. Document-Oriented
- Nested structures for related data
- Arrays for one-to-many relationships
- Rich documents vs. many tables
3. Flexible Schema
- Dynamic fields for customization
- Easy evolution without migrations
- JSON-like structure
4. Scalability
- Horizontal scaling via partitioning
- Index optimization for query performance
- Sharding strategy (planned for >1M users)
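Because there are no foreign key constraints, referential integrity has to be checked in application code. A minimal sketch of such a check, using in-memory lists in place of the real collections, might look like the following; the helper name `find_orphans` and the sample documents are hypothetical, while the `user_id` and `project_id` field names come from the index columns listed earlier.

```python
def find_orphans(children, parents, key="user_id"):
    """Return child documents whose `key` has no matching parent document."""
    parent_ids = {doc[key] for doc in parents}
    return [doc for doc in children if doc.get(key) not in parent_ids]


# Sample data standing in for users_multichatbot_v2 and chatbot_selections.
users = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u2", "email": "b@example.com"},
]
chatbots = [
    {"project_id": "p1", "user_id": "u1"},
    {"project_id": "p2", "user_id": "u3"},  # u3 has no user document
]

orphans = find_orphans(chatbots, users)
print(orphans)
```

A check like this could run as a periodic job or before a delete, since the database itself will not reject a dangling reference.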
Key Design Decisions
Why MongoDB (Cosmos DB)?
Advantages:
- ✅ Flexible schema for rapid iteration
- ✅ Azure-managed (encryption, backups, scaling)
- ✅ Global distribution capability
- ✅ Compatibility with MongoDB drivers
- ✅ Strong consistency options
Trade-offs:
- ❌ Higher cost than MongoDB Atlas
- ❌ Limited to MongoDB 4.0 API (as of 2025)
- ⚠️ Need to manage indexes carefully
Why Milvus for Vectors?
Advantages:
- ✅ Purpose-built for vector similarity search
- ✅ High performance (<40ms search latency)
- ✅ Horizontal scalability
- ✅ Multiple index types (IVF, HNSW)
- ✅ Open-source (no vendor lock-in)
Alternatives Considered:
- Pinecone: ❌ Cost prohibitive at scale
- Weaviate: ⚠️ Less mature than Milvus
- PostgreSQL pgvector: ❌ Not optimized for scale
- Qdrant: ⚠️ Newer, less proven
Why GridFS for Files?
Advantages:
- ✅ Integrated with MongoDB
- ✅ Automatic chunking (255 KB chunks by default), so files can exceed the 16MB BSON document limit
- ✅ Simplifies backup/restore
- ✅ Metadata stored with files
Alternatives:
- Azure Blob Storage: Considered, but adds complexity
- File links only: Security risk (external URLs)
Data Growth Projections
Current (Q4 2024):
- Users: ~10,000
- Chatbots: ~5,000
- Conversations: ~500,000
- Vectors: ~1M
Projected (Q4 2025):
- Users: ~100,000 (10x growth)
- Chatbots: ~50,000 (10x)
- Conversations: ~10M (20x)
- Vectors: ~20M (20x)
Scalability Plan:
- MongoDB sharding at 1M users
- Milvus clustering at 10M vectors
- Archive old conversations (>1 year) to cold storage
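The scalability thresholds can be compared directly against the Q4 2025 projections. The sketch below is illustrative only: the dictionary layout and the `triggered_actions` helper are hypothetical, and treating a threshold as crossed when the projection is greater than or equal to it is an assumption.

```python
# Projected Q4 2025 figures and scaling thresholds, taken from the lists above.
PROJECTED = {"users": 100_000, "vectors": 20_000_000}
THRESHOLDS = {
    "users": ("MongoDB sharding", 1_000_000),
    "vectors": ("Milvus clustering", 10_000_000),
}


def triggered_actions(projected, thresholds):
    """Return the scaling actions whose threshold the projection reaches."""
    return [
        action
        for metric, (action, limit) in thresholds.items()
        if projected.get(metric, 0) >= limit  # assumption: >= counts as crossed
    ]


actions = triggered_actions(PROJECTED, THRESHOLDS)
print(actions)  # ['Milvus clustering']
```

By these numbers the vector count passes the Milvus clustering threshold during the projection window, while the user count stays an order of magnitude below the sharding threshold.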
Data Security
Encryption:
- At Rest: AES-256 (Azure-managed keys)
- In Transit: TLS 1.3 (all connections)
- Backups: AES-256 encrypted
Access Control:
- Authentication: Azure managed identity
- Authorization: RBAC (database level)
- Audit: All database operations logged
Data Privacy:
- PII Handling: Encrypted, GDPR-compliant deletion
- Conversation Data: User-owned, exportable
- Embeddings: No raw text stored in vectors
Details: See Security Architecture
Related Files in This Section
- Database Schema - Complete MongoDB schema (9 collections)
- Vector Store - Milvus architecture and RAG pipeline
- Retention, quality, privacy policies
- Backup strategy and DR procedures
- Migration scripts and procedures
- Data Dictionary - Complete field definitions
Related Documentation
Backend Services:
- Response 3D Service - Uses all collections
- User Service - User data management
- Chatbot Maintenance - CRUD operations
Features:
- Data Training - RAG pipeline, embedding strategy
- Analytics & Reporting - Data aggregation
Security:
- Encryption - Database encryption details
- Backup policies - GDPR retention
Progress: Section 4 - 12.5% of files complete
"Data is the new oil. Architecture is the refinery."