
Data Architecture & Governance

Section: 4-data-architecture-governance
Status: Comprehensive Data Architecture Documentation
Last Updated: 2025-12-30
Audience: Backend developers, database administrators, data engineers


🎯 Overview

MachineAvatars uses a hybrid data architecture combining MongoDB (Cosmos DB) for operational data and Milvus for vector embeddings, enabling both traditional CRUD operations and AI-powered semantic search through Retrieval-Augmented Generation (RAG).

Data Storage Technologies:

  • MongoDB (Cosmos DB) - Primary database for all operational data
  • Milvus - Vector database for semantic search
  • GridFS - Large file storage (MongoDB)
  • Azure Blob Storage - Milvus persistence layer

🏗️ Data Architecture Diagram

graph TB
    subgraph "Application Layer"
        FE[Frontend<br/>Next.js]
        GW[Gateway<br/>Port 8000/9000]
    end

    subgraph "Backend Services (23)"
        AUTH[Auth Service]
        USER[User Service]
        CREATE[Create Chatbot]
        MAINTAIN[Chatbot Maintenance]
        RESP[Response Services<br/>3D/Text/Voice]
        CRAWL[Data Crawling]
        HISTORY[Chat History]
    end

    subgraph "Data Storage"
        MONGO[(MongoDB<br/>Cosmos DB<br/>9 Collections)]
        MILVUS[(Milvus<br/>Vector DB<br/>Embeddings)]
        GRIDFS[(GridFS<br/>File Storage)]
    end

    subgraph "External Services"
        OPENAI[Azure OpenAI<br/>Embeddings]
    end

    FE --> GW
    GW --> AUTH
    GW --> USER
    GW --> CREATE
    GW --> MAINTAIN
    GW --> RESP

    AUTH --> MONGO
    USER --> MONGO
    CREATE --> MONGO
    MAINTAIN --> MONGO
    HISTORY --> MONGO

    CRAWL --> OPENAI
    OPENAI --> MILVUS

    RESP --> MILVUS
    RESP --> MONGO

    CREATE --> GRIDFS
    RESP --> GRIDFS

    style MONGO fill:#E3F2FD
    style MILVUS fill:#FFF3E0
    style GRIDFS fill:#C8E6C9

💾 Database Summary

MongoDB (Cosmos DB)

Provider: Azure Cosmos DB for MongoDB API
Region: Central India
Tier: Provisioned throughput
Encryption: AES-256 (automatic)

Collections:

| Collection | Purpose | Documents | Avg Size | Key Indexes |
|---|---|---|---|---|
| users_multichatbot_v2 | User accounts | ~10K | 2KB | email, user_id |
| chatbot_selections | Chatbot configs | ~5K | 5KB | project_id, user_id |
| chatbot_history | Conversations | ~500K | 1KB | project_id, session_id |
| files | Uploaded files metadata | ~2K | 500B | user_id, project_id |
| files_secondary | Additional files | ~1K | 500B | user_id |
| system_prompts_user | Custom prompts | ~3K | 2KB | user_id, project_id |
| projectid_creation | Project metadata | ~5K | 1KB | user_id, project_id |
| organisation_data | Enterprise orgs | ~50 | 3KB | organization_id |
| trash_collection_name | Soft deletes | ~500 | 5KB | user_id |

Total Storage: ~2GB (operational data)
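As a rough sanity check on the figures above, the raw document payload can be estimated by multiplying each collection's document count by its average size. This is only a sketch: it treats 1KB as 1,000 bytes, uses the approximate counts from the collection table, and ignores indexes, GridFS chunks, and storage overhead, which is presumably where the rest of the ~2GB total comes from (an assumption, not a measured breakdown).

```python
# (collection, approx. document count, approx. average size in bytes)
# Values are the rounded figures from the collection table.
COLLECTIONS = [
    ("users_multichatbot_v2", 10_000, 2_000),
    ("chatbot_selections", 5_000, 5_000),
    ("chatbot_history", 500_000, 1_000),
    ("files", 2_000, 500),
    ("files_secondary", 1_000, 500),
    ("system_prompts_user", 3_000, 2_000),
    ("projectid_creation", 5_000, 1_000),
    ("organisation_data", 50, 3_000),
    ("trash_collection_name", 500, 5_000),
]


def payload_bytes(collections):
    """Return the estimated raw payload per collection, in bytes."""
    return {name: count * size for name, count, size in collections}


sizes = payload_bytes(COLLECTIONS)
total_bytes = sum(sizes.values())
largest = max(sizes, key=sizes.get)

print(f"raw payload: {total_bytes / 1e9:.2f} GB, largest collection: {largest}")
```

On these numbers, chatbot_history dominates the payload, so it is the collection most worth watching as conversation volume grows.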


Milvus Vector Database

Deployment: Azure Container Instance
Storage Backend: Azure Blob Storage
Vector Dimensions: 1536 (OpenAI text-embedding-ada-002)

Collections:

| Collection | Vectors | Index Type | Distance Metric | Purpose |
|---|---|---|---|---|
| chatbotvectors | Variable | IVF_FLAT | Cosine | RAG context retrieval |

Total Vectors: ~1M across all projects
Storage: ~6GB (embeddings + metadata)
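The ~6GB figure is consistent with simple arithmetic on the vector count and dimension. The sketch below assumes each dimension is stored as a 4-byte float32 and counts only the raw embeddings, not index structures or metadata.

```python
def raw_vector_bytes(num_vectors, dims=1536, bytes_per_float=4):
    """Raw storage for dense vectors, assuming float32 components."""
    return num_vectors * dims * bytes_per_float


current_bytes = raw_vector_bytes(1_000_000)
print(f"{current_bytes / 1e9:.2f} GB")  # 1M vectors x 1536 dims x 4 bytes
```

Each 1536-dimensional vector costs 6,144 bytes under this assumption, which makes it easy to rescale the estimate for other vector counts.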


📊 Data Flow Overview

5 Major Data Flows

1. User Registration Flow

Frontend → Gateway → User Service → MongoDB (users) → Azure Email → OTP Verification

2. Chatbot Creation Flow

Frontend → Gateway → Create Chatbot → MongoDB (chatbot_selections, projectid_creation)

3. Data Ingestion Flow

User Upload → Backend → Processing → Azure OpenAI → Embeddings → Milvus
Website URL → Crawling → Chunking → Embeddings → Milvus
Q&A Pairs → Direct → Milvus
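The chunking step in the ingestion flow can be illustrated with a minimal sketch. It assumes fixed-size character windows with a small overlap; the function name, the sizes, and the character-based splitting are all hypothetical, and the actual pipeline may split by tokens or sentences instead.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(example)
```

Each chunk would then be sent to the embedding model and the resulting vector written to Milvus. The overlap keeps a sentence that straddles a boundary retrievable from either side.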

4. Chatbot Response Flow (RAG)

User Question → Embedding → Milvus Search → Top-K Context → LLM → Response
                                                                    ↓
                                                        MongoDB (chatbot_history)
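To make the "Milvus Search → Top-K Context" step concrete, here is a brute-force sketch of top-K retrieval by cosine similarity over an in-memory dictionary. It stands in for the idea only: Milvus with an IVF_FLAT index performs an approximate search over clustered vectors rather than this exhaustive scan, and the names `cosine_similarity` and `top_k` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, stored, k=3):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda chunk_id: cosine_similarity(query, stored[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional vectors; real embeddings here have 1536 dimensions.
stored = {"chunk_a": [1.0, 0.0], "chunk_b": [0.0, 1.0], "chunk_c": [1.0, 1.0]}
hits = top_k([1.0, 0.1], stored, k=2)
print(hits)
```

The text behind the returned chunk ids is what gets passed to the LLM as context, and the final exchange is written to chatbot_history.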

5. Analytics Flow

Conversations (MongoDB) → Aggregation Pipeline → Dashboard
                        → Real-time WebSocket → Live Updates
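An aggregation pipeline for the dashboard might look like the sketch below, which counts stored messages per project. The stage operators (`$match`, `$group`, `$sort`) are standard MongoDB aggregation stages, and `project_id` is taken from the chatbot_history indexes listed earlier. The function names and the choice of metric are assumptions for illustration. A small in-memory equivalent is included so the intended result is clear without a database connection.

```python
from collections import Counter


def messages_per_project_pipeline(project_ids):
    """Build a pipeline that counts chatbot_history documents per project."""
    return [
        {"$match": {"project_id": {"$in": list(project_ids)}}},
        {"$group": {"_id": "$project_id", "messages": {"$sum": 1}}},
        {"$sort": {"messages": -1}},
    ]


def messages_per_project_in_memory(docs, project_ids):
    """Same computation over a plain list of dicts, for illustration."""
    wanted = set(project_ids)
    counts = Counter(d["project_id"] for d in docs if d["project_id"] in wanted)
    return counts.most_common()


docs = [
    {"project_id": "p1"},
    {"project_id": "p2"},
    {"project_id": "p1"},
    {"project_id": "p3"},
]
pipeline = messages_per_project_pipeline(["p1", "p2"])
summary = messages_per_project_in_memory(docs, ["p1", "p2"])
print(summary)
```

With a MongoDB driver, the pipeline list would be passed to the collection's aggregate call. Both `$match` fields used here are covered by the indexes named in the collection table.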

🗂️ Database Design Principles

1. Denormalization

  • User information duplicated in multiple collections for performance
  • No foreign key constraints (MongoDB philosophy)
  • Application-level referential integrity
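Because the database enforces no foreign keys, reference checks have to live in application code. The sketch below shows the general shape using plain dictionaries in place of collections. The helper names, the `IntegrityError` class, and the cascade-on-delete behaviour are hypothetical and not taken from the MachineAvatars codebase.

```python
class IntegrityError(Exception):
    """Raised when a write would leave a dangling reference."""


def insert_project(store, project):
    """Insert a project only if its owning user exists."""
    if project["user_id"] not in store["users"]:
        raise IntegrityError(f"unknown user_id: {project['user_id']}")
    store["projects"][project["project_id"]] = project


def delete_user(store, user_id):
    """Delete a user and the projects that reference it; return their ids."""
    store["users"].pop(user_id, None)
    orphaned = [
        pid for pid, proj in store["projects"].items()
        if proj["user_id"] == user_id
    ]
    for pid in orphaned:
        del store["projects"][pid]
    return orphaned


store = {"users": {"u1": {"email": "a@example.com"}}, "projects": {}}
insert_project(store, {"project_id": "p1", "user_id": "u1"})
```

The same pattern applies to duplicated user fields: whichever service updates the source record is also responsible for updating the copies.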

2. Document-Oriented

  • Nested structures for related data
  • Arrays for one-to-many relationships
  • Rich documents vs. many tables

3. Flexible Schema

  • Dynamic fields for customization
  • Easy evolution without migrations
  • JSON-like structure

4. Scalability

  • Horizontal scaling via partitioning
  • Index optimization for query performance
  • Sharding strategy (planned for >1M users)

🔑 Key Design Decisions

Why MongoDB (Cosmos DB)?

Advantages:

  • ✅ Flexible schema for rapid iteration
  • ✅ Azure-managed (encryption, backups, scaling)
  • ✅ Global distribution capability
  • ✅ Compatibility with MongoDB drivers
  • ✅ Strong consistency options

Trade-offs:

  • ❌ Higher cost than MongoDB Atlas
  • ❌ Limited to MongoDB 4.0 API (as of 2025)
  • ⚠️ Need to manage indexes carefully

Why Milvus for Vectors?

Advantages:

  • ✅ Purpose-built for vector similarity search
  • ✅ High performance (<40ms search latency)
  • ✅ Horizontal scalability
  • ✅ Multiple index types (IVF, HNSW)
  • ✅ Open-source (no vendor lock-in)

Alternatives Considered:

  • Pinecone: ❌ Cost prohibitive at scale
  • Weaviate: ⚠️ Less mature than Milvus
  • PostgreSQL pgvector: ❌ Not optimized for scale
  • Qdrant: ⚠️ Newer, less proven

Why GridFS for Files?

Advantages:

  • ✅ Integrated with MongoDB
  • ✅ Automatic chunking (255KB chunks by default), so files can exceed the 16MB BSON document limit
  • ✅ Simplifies backup/restore
  • ✅ Metadata stored with files

Alternatives:

  • Azure Blob Storage: Considered, but adds complexity
  • File links only: Security risk (external URLs)
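GridFS chunking can be pictured as slicing a file's bytes into numbered pieces. The sketch below assumes GridFS's default chunk size of 255 KiB and mirrors the `n` and `data` fields of a GridFS chunk document. It is a simplified illustration, not the driver's implementation, and it omits the `files_id` link to the file's metadata document.

```python
GRIDFS_DEFAULT_CHUNK_SIZE = 255 * 1024  # 261,120 bytes


def split_into_chunks(data, chunk_size=GRIDFS_DEFAULT_CHUNK_SIZE):
    """Slice a byte string into sequentially numbered chunks."""
    return [
        {"n": index, "data": data[offset:offset + chunk_size]}
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


payload = bytes(600 * 1024)  # a 600 KiB file of zero bytes
chunks = split_into_chunks(payload)
print(len(chunks), [len(c["data"]) for c in chunks])
```

Reassembly is the reverse: concatenate the `data` fields in order of `n`. Because every chunk is a small document, a file of any size stays under the per-document limit.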

📈 Data Growth Projections

Current (Q4 2024):

  • Users: ~10,000
  • Chatbots: ~5,000
  • Conversations: ~500,000
  • Vectors: ~1M

Projected (Q4 2025):

  • Users: ~100,000 (10x growth)
  • Chatbots: ~50,000 (10x)
  • Conversations: ~10M (20x)
  • Vectors: ~20M (20x)

Scalability Plan:

  • MongoDB sharding at 1M users
  • Milvus clustering at 10M vectors
  • Archive old conversations (>1 year) to cold storage
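Applying the stated growth multipliers to the current figures shows which scaling thresholds the projection would cross. This is a minimal sketch using only the numbers in this section. The float32 storage estimate is an assumption carried over from the 1536-dimension embedding size.

```python
CURRENT = {
    "users": 10_000,
    "chatbots": 5_000,
    "conversations": 500_000,
    "vectors": 1_000_000,
}
GROWTH = {"users": 10, "chatbots": 10, "conversations": 20, "vectors": 20}

# Thresholds from the scalability plan: MongoDB sharding, Milvus clustering.
THRESHOLDS = {"users": 1_000_000, "vectors": 10_000_000}

projected = {name: CURRENT[name] * GROWTH[name] for name in CURRENT}
triggered = sorted(
    name for name, limit in THRESHOLDS.items() if projected[name] >= limit
)

# Raw embedding storage at the projected vector count (float32 assumed).
projected_vector_gb = projected["vectors"] * 1536 * 4 / 1e9

print(projected, triggered, f"{projected_vector_gb:.2f} GB")
```

On these figures the Milvus clustering threshold (10M vectors) is crossed within the projection window, while the MongoDB sharding threshold (1M users) is not.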

🔒 Data Security

Encryption:

  • At Rest: AES-256 (Azure-managed keys)
  • In Transit: TLS 1.3 (all connections)
  • Backups: AES-256 encrypted

Access Control:

  • Authentication: Azure managed identity
  • Authorization: RBAC (database level)
  • Audit: All database operations logged

Data Privacy:

  • PII Handling: Encrypted, GDPR-compliant deletion
  • Conversation Data: User-owned, exportable
  • Embeddings: No raw text stored in vectors

Details: See Security Architecture


  1. Database Schema - Complete MongoDB schema (9 collections)
  2. Vector Store - Milvus architecture and RAG pipeline
  3. **** - Retention, quality, privacy policies
  4. **** - Backup strategy and DR procedures
  5. **** - Migration scripts and procedures
  6. Data Dictionary - Complete field definitions


Progress: Section 4 - 1/8 files complete (12.5%)

"Data is the new oil. Architecture is the refinery." 📊🏗️