Data Architecture & Governance

Section: 4-data-architecture-governance
Status: Comprehensive Data Architecture Documentation
Last Updated: 2025-12-30
Audience: Backend developers, database administrators, data engineers

🎯 Overview

MachineAvatars uses a hybrid data architecture combining MongoDB (Cosmos DB) for operational data and Milvus for vector embeddings, enabling both traditional CRUD operations and AI-powered semantic search through Retrieval-Augmented Generation (RAG).

Data Storage Technologies:

MongoDB (Cosmos DB) - Primary database for all operational data
Milvus - Vector database for semantic search
GridFS - Large file storage (MongoDB)
Azure Blob Storage - Milvus persistence layer

🏗️ Data Architecture Diagram

graph TB
    subgraph "Application Layer"
        FE[Frontend<br/>Next.js]
        GW[Gateway<br/>Port 8000/9000]
    end

    subgraph "Backend Services (23)"
        AUTH[Auth Service]
        USER[User Service]
        CREATE[Create Chatbot]
        MAINTAIN[Chatbot Maintenance]
        RESP[Response Services<br/>3D/Text/Voice]
        CRAWL[Data Crawling]
        HISTORY[Chat History]
    end

    subgraph "Data Storage"
        MONGO[(MongoDB<br/>Cosmos DB<br/>9 Collections)]
        MILVUS[(Milvus<br/>Vector DB<br/>Embeddings)]
        GRIDFS[(GridFS<br/>File Storage)]
    end

    subgraph "External Services"
        OPENAI[Azure OpenAI<br/>Embeddings]
    end

    FE --> GW
    GW --> AUTH
    GW --> USER
    GW --> CREATE
    GW --> MAINTAIN
    GW --> RESP

    AUTH --> MONGO
    USER --> MONGO
    CREATE --> MONGO
    MAINTAIN --> MONGO
    HISTORY --> MONGO

    CRAWL --> OPENAI
    OPENAI --> MILVUS

    RESP --> MILVUS
    RESP --> MONGO

    CREATE --> GRIDFS
    RESP --> GRIDFS

    style MONGO fill:#E3F2FD
    style MILVUS fill:#FFF3E0
    style GRIDFS fill:#C8E6C9
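
The service-to-store edges in the diagram can also be written down as a small lookup table, which helps answer questions such as "which services are affected if Milvus is unavailable?". This is an illustrative sketch only: the keys reuse the diagram's node IDs, and `services_using` is a hypothetical helper, not something taken from the codebase.

```python
# Service -> data store edges, transcribed from the diagram above.
# CRAWL reaches Milvus indirectly, through Azure OpenAI embeddings.
STORE_DEPENDENCIES: dict[str, set[str]] = {
    "AUTH": {"MONGO"},
    "USER": {"MONGO"},
    "CREATE": {"MONGO", "GRIDFS"},
    "MAINTAIN": {"MONGO"},
    "HISTORY": {"MONGO"},
    "RESP": {"MONGO", "MILVUS", "GRIDFS"},
    "CRAWL": {"MILVUS"},
}


def services_using(store: str) -> list[str]:
    """Return the services that depend on the given store, sorted by name."""
    return sorted(svc for svc, stores in STORE_DEPENDENCIES.items() if store in stores)


print(services_using("MILVUS"))  # ['CRAWL', 'RESP']
```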

💾 Database Summary

MongoDB (Cosmos DB)

Provider: Azure Cosmos DB for MongoDB API
Region: Central India
Tier: Provisioned throughput
Encryption: AES-256 (automatic)

Collections:

Collection	Purpose	Documents	Avg Size	Key Indexes
users_multichatbot_v2	User accounts	~10K	2KB	email, user_id
chatbot_selections	Chatbot configs	~5K	5KB	project_id, user_id
chatbot_history	Conversations	~500K	1KB	project_id, session_id
files	Uploaded files metadata	~2K	500B	user_id, project_id
files_secondary	Additional files	~1K	500B	user_id
system_prompts_user	Custom prompts	~3K	2KB	user_id, project_id
projectid_creation	Project metadata	~5K	1KB	user_id, project_id
organisation_data	Enterprise orgs	~50	3KB	organization_id
trash_collection_name	Soft deletes	~500	5KB	user_id

Total Storage: ~2GB (operational data)
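
As a rough sketch of how the key indexes in the table might be declared, the snippet below walks a collection-to-fields mapping and calls `create_index` for each field. It assumes a PyMongo-style database object, where `db[name]` returns a collection exposing `create_index(field)`. Single-field ascending indexes are an assumption, since the table does not say whether any index is compound or unique, and only three of the nine collections are shown.

```python
# Key indexes per collection, copied from the table above (subset).
KEY_INDEXES: dict[str, list[str]] = {
    "users_multichatbot_v2": ["email", "user_id"],
    "chatbot_selections": ["project_id", "user_id"],
    "chatbot_history": ["project_id", "session_id"],
}


def ensure_indexes(db) -> list[tuple[str, str]]:
    """Request one single-field index per listed key; return what was requested.

    `db` is expected to behave like a pymongo Database: `db[name]` yields a
    collection object with a `create_index(field)` method.
    """
    requested = []
    for collection, fields in KEY_INDEXES.items():
        for field in fields:
            db[collection].create_index(field)
            requested.append((collection, field))
    return requested
```

With PyMongo this would be called as `ensure_indexes(client["<database>"])`, where the database name is whatever the deployment uses.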

Milvus Vector Database

Deployment: Azure Container Instance
Storage Backend: Azure Blob Storage
Vector Dimensions: 1536 (OpenAI text-embedding-ada-002)

Collections:

Collection	Vectors	Index Type	Distance Metric	Purpose
chatbotvectors	Variable	IVF_FLAT	Cosine	RAG context retrieval

Total Vectors: ~1M across all projects
Storage: ~6GB (embeddings + metadata)
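
The ~6GB figure is consistent with simple arithmetic on the raw embedding payload. The sketch below assumes vectors are stored as 32-bit floats and ignores index and metadata overhead; `raw_vector_bytes` is a hypothetical helper for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is a 32-bit float


def raw_vector_bytes(num_vectors: int, dimensions: int = 1536) -> int:
    """Raw size of the embedding payload, excluding index and metadata overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


size = raw_vector_bytes(1_000_000)
print(f"{size / 1e9:.2f} GB")  # 6.14 GB
```

Under the same assumptions, 20M vectors would need roughly 123 GB for the raw payload alone.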

📊 Data Flow Overview

5 Major Data Flows

1. User Registration Flow

Frontend → Gateway → User Service → MongoDB (users) → Azure Email → OTP Verification

2. Chatbot Creation Flow

Frontend → Gateway → Create Chatbot → MongoDB (chatbot_selections, projectid_creation)

3. Data Ingestion Flow

User Upload → Backend → Processing → Azure OpenAI → Embeddings → Milvus
Website URL → Crawling → Chunking → Embeddings → Milvus
Q&A Pairs → Direct → Milvus
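
The chunking step in the website flow could look something like the sketch below. The source does not describe the actual strategy, so fixed-size character windows with overlap, the default sizes, and the name `chunk_text` are all assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some text twice.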

4. Chatbot Response Flow (RAG)

User Question → Embedding → Milvus Search → Top-K Context → LLM → Response
                                                                    ↓
                                                         MongoDB (chatbot_history)
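
The "Milvus Search → Top-K Context" step boils down to ranking stored vectors by similarity to the question embedding. The sketch below is a brute-force, in-memory stand-in for that step, not how Milvus executes it (Milvus uses an approximate index such as IVF_FLAT). The toy 3-dimensional vectors and the chunk IDs are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours by cosine similarity, best match first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; the real ones have 1536 dimensions.
store = {
    "pricing": [1.0, 0.0, 0.0],
    "refunds": [0.8, 0.6, 0.0],
    "careers": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], store, k=2)
print([chunk_id for chunk_id, _ in hits])  # ['pricing', 'refunds']
```

The text behind the returned chunk IDs is what gets passed to the LLM as context.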

5. Analytics Flow

Conversations (MongoDB) → Aggregation Pipeline → Dashboard
                       → Real-time WebSocket → Live Updates
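
A MongoDB aggregation pipeline is a list of stage documents. As a hedged example of what the analytics flow might run against `chatbot_history`, the function below builds a pipeline that finds the busiest sessions for one project. It relies only on the `project_id` and `session_id` fields listed in the collections table. How conversations are split into documents is not stated in the source, so the count is simply "documents per session".

```python
def busiest_sessions_pipeline(project_id: str, limit: int = 10) -> list[dict]:
    """Build an aggregation pipeline: sessions with the most documents for one project."""
    return [
        {"$match": {"project_id": project_id}},  # filter first so the index can be used
        {"$group": {"_id": "$session_id", "documents": {"$sum": 1}}},  # count per session
        {"$sort": {"documents": -1}},  # largest sessions first
        {"$limit": limit},
    ]
```

With PyMongo, the result would be passed to `db["chatbot_history"].aggregate(...)`.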

🗂️ Database Design Principles

1. Denormalization

User information duplicated in multiple collections for performance
No foreign key constraints (MongoDB philosophy)
Application-level referential integrity
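
Without foreign keys, the application has to clean up dependent documents itself. The sketch below shows the idea using plain dicts and lists as a stand-in for the database. The list of dependent collections is a subset chosen from the collections table, and a hard delete is assumed for simplicity even though the real system also has a soft-delete collection.

```python
# Collections that carry a user_id field, per the collections table (subset).
DEPENDENT_COLLECTIONS = ["chatbot_selections", "files", "system_prompts_user"]


def delete_user(db: dict[str, list[dict]], user_id: str) -> dict[str, int]:
    """Remove a user and every dependent document; return per-collection delete counts."""
    removed = {}
    for name in ["users_multichatbot_v2", *DEPENDENT_COLLECTIONS]:
        docs = db.get(name, [])
        db[name] = [doc for doc in docs if doc.get("user_id") != user_id]
        removed[name] = len(docs) - len(db[name])
    return removed
```

Because denormalized copies of user data live in several collections, a missed collection here would leave orphaned documents behind, which is the main cost of this design.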

2. Document-Oriented

Nested structures for related data
Arrays for one-to-many relationships
Rich documents vs. many tables

3. Flexible Schema

Dynamic fields for customization
Easy evolution without migrations
JSON-like structure

4. Scalability

Horizontal scaling via partitioning
Index optimization for query performance
Sharding strategy (planned for >1M users)

🔑 Key Design Decisions

Why MongoDB (Cosmos DB)?

Advantages:

✅ Flexible schema for rapid iteration
✅ Azure-managed (encryption, backups, scaling)
✅ Global distribution capability
✅ Compatibility with MongoDB drivers
✅ Strong consistency options

Trade-offs:

❌ Higher cost than MongoDB Atlas
❌ Limited to MongoDB 4.0 API (as of 2025)
⚠️ Need to manage indexes carefully

Why Milvus for Vectors?

Advantages:

✅ Purpose-built for vector similarity search
✅ High performance (<40ms search latency)
✅ Horizontal scalability
✅ Multiple index types (IVF, HNSW)
✅ Open-source (no vendor lock-in)

Alternatives Considered:

Pinecone: ❌ Cost prohibitive at scale
Weaviate: ⚠️ Less mature than Milvus
PostgreSQL pgvector: ❌ Not optimized for scale
Qdrant: ⚠️ Newer, less proven

Why GridFS for Files?

Advantages:

✅ Integrated with MongoDB
✅ Automatic chunking (255 KB chunks by default), so files can exceed the 16 MB BSON document limit
✅ Simplifies backup/restore
✅ Metadata stored with files
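
To get a feel for what automatic chunking means in practice, the sketch below estimates how many chunk documents GridFS writes for a file, using GridFS's default chunk size of 255 KiB. The helper name is hypothetical, and a deployment that overrides the chunk size would pass its own value.

```python
GRIDFS_DEFAULT_CHUNK_BYTES = 255 * 1024  # GridFS default chunk size (255 KiB)


def gridfs_chunk_count(file_bytes: int, chunk_bytes: int = GRIDFS_DEFAULT_CHUNK_BYTES) -> int:
    """Number of chunk documents needed to store a file of the given size."""
    return (file_bytes + chunk_bytes - 1) // chunk_bytes  # integer ceiling division


print(gridfs_chunk_count(20 * 1024 * 1024))  # 20 MiB file -> 81
```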

Alternatives:

Azure Blob Storage: Considered, but adds complexity
File links only: Security risk (external URLs)

📈 Data Growth Projections

Current (Q4 2024):

Users: ~10,000
Chatbots: ~5,000
Conversations: ~500,000
Vectors: ~1M

Projected (Q4 2025):

Users: ~100,000 (10x growth)
Chatbots: ~50,000 (10x)
Conversations: ~10M (20x)
Vectors: ~20M (20x)

Scalability Plan:

MongoDB sharding at 1M users
Milvus clustering at 10M vectors
Archive old conversations (>1 year) to cold storage
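
The two thresholds in the scalability plan can be checked against the current and projected figures above. This is a small illustrative sketch: the threshold values come from the plan, while `scaling_actions` and the action labels are hypothetical.

```python
SHARDING_USER_THRESHOLD = 1_000_000       # "MongoDB sharding at 1M users"
CLUSTERING_VECTOR_THRESHOLD = 10_000_000  # "Milvus clustering at 10M vectors"


def scaling_actions(users: int, vectors: int) -> list[str]:
    """Return the scaling steps whose thresholds have been reached."""
    actions = []
    if users >= SHARDING_USER_THRESHOLD:
        actions.append("shard MongoDB")
    if vectors >= CLUSTERING_VECTOR_THRESHOLD:
        actions.append("cluster Milvus")
    return actions


print(scaling_actions(users=10_000, vectors=1_000_000))    # current:   []
print(scaling_actions(users=100_000, vectors=20_000_000))  # projected: ['cluster Milvus']
```

On these numbers, the Milvus threshold is crossed within the projection window, while MongoDB sharding is not yet triggered.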

🔒 Data Security

Encryption:

At Rest: AES-256 (Azure-managed keys)
In Transit: TLS 1.3 (all connections)
Backups: AES-256 encrypted

Access Control:

Authentication: Azure managed identity
Authorization: RBAC (database level)
Audit: All database operations logged

Data Privacy:

PII Handling: Encrypted, GDPR-compliant deletion
Conversation Data: User-owned, exportable
Embeddings: No raw text stored in vectors

Details: See Security Architecture

📁 Related Files in This Section

Database Schema - Complete MongoDB schema (9 collections)
Vector Store - Milvus architecture and RAG pipeline
**** - Retention, quality, privacy policies
**** - Backup strategy and DR procedures
**** - Migration scripts and procedures
Data Dictionary - Complete field definitions

🔗 Related Documentation

Backend Services:

Response 3D Service - Uses all collections
User Service - User data management
Chatbot Maintenance - CRUD operations

Features:

Data Training - RAG pipeline, embedding strategy
Analytics & Reporting - Data aggregation

Security:

Encryption - Database encryption details
Backup policies - GDPR retention

Progress: Section 4 - 1/8 files complete (12.5%)

"Data is the new oil. Architecture is the refinery." 📊🏗️
