
ADR-003: Milvus for Vector Embeddings

Status: ✅ Accepted
Date: 2024-08-10 (Q3 2024)
Decision Makers: CTO, ML Engineering Lead, Backend Lead
Consulted: Data Engineering Team, Solution Architect
Informed: Engineering Team


Context

MachineAvatars implements Retrieval-Augmented Generation (RAG) to provide chatbots with contextual knowledge from user-uploaded documents (PDFs, text files, website content). This requires a vector database to:

  1. Store document embeddings: Convert text chunks to 384-dimensional vectors
  2. Perform similarity search: Find semantically similar chunks for user queries
  3. Support multi-tenancy: Isolate data per chatbot/user
  4. Scale: Handle millions of embedding vectors
  5. Low latency: < 100ms search time for 95% of queries

Current Scale:

  • 500K documents across all users
  • ~25M embedding vectors (50 chunks/document avg)
  • 10K searches/day

Projected Scale (6 months):

  • 2M documents
  • 100M embedding vectors
  • 100K searches/day

Requirements:

  • Vector dimensions: 384 (BAAI/bge-small-en-v1.5; see the sketch after this list)
  • Similarity metric: L2 distance or cosine
  • Multi-tenancy: Data isolation per chatbot
  • Query latency: < 50ms p95
  • Budget: < $500/month at current scale
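
The 384-dimension figure above comes directly from the chosen embedding model. As a point of reference, here is a minimal sketch of producing a query vector, assuming sentence-transformers is the encoder in use (the actual embedding service code is not shown in this ADR):

# Illustrative sketch only; assumes sentence-transformers is the encoder.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # outputs 384-dim embeddings

# Normalizing makes L2 distance rank-equivalent to cosine similarity,
# so either metric listed above gives the same ordering.
query_vector = model.encode("How do I reset my password?", normalize_embeddings=True)
assert query_vector.shape == (384,)  # must match the collection's FLOAT_VECTOR dim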

Decision

We selected Milvus (self-hosted on Azure) with partition-based multi-tenancy.

Architecture

graph TB
    A[User Query] --> B[Embedding Service]
    B --> C[384-dim Vector]
    C --> D[Milvus Search]

    subgraph "Milvus Cluster"
        D --> E[Collection: embeddings]
        E --> F[Partition: User_123_Project_1]
        E --> G[Partition: User_123_Project_2]
        E --> H[Partition: User_456_Project_3]
    end

    D --> I[Top-K Results]
    I --> J[LLM Context]

    style D fill:#FFF3E0
    style J fill:#E8F5E9

Implementation Details

# Milvus Configuration
Collection: "embeddings"
Dimensions: 384
Index Type: IVF_FLAT
Metric: L2
Parameters: {"nlist": 128, "nprobe": 10}

# Schema
Fields:
- id: INT64 (primary key, auto)
- document_id: VARCHAR(100)
- user_id: VARCHAR(100)
- project_id: VARCHAR(100)  # Partition key
- chunk_index: INT32
- text: VARCHAR(2000)
- embedding: FLOAT_VECTOR(384)
- data_type: VARCHAR(50)    # pdf, text, qa, url
- source_url: VARCHAR(500)
- created_at: VARCHAR(100)
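
The configuration and schema above map onto a straightforward collection definition. A sketch with pymilvus follows (field lengths mirror the schema table; the actual setup code in the repository may differ):

# Sketch: create the "embeddings" collection with the schema and IVF_FLAT index above.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("document_id", DataType.VARCHAR, max_length=100),
    FieldSchema("user_id", DataType.VARCHAR, max_length=100),
    FieldSchema("project_id", DataType.VARCHAR, max_length=100),   # used to derive the partition name
    FieldSchema("chunk_index", DataType.INT32),
    FieldSchema("text", DataType.VARCHAR, max_length=2000),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),
    FieldSchema("data_type", DataType.VARCHAR, max_length=50),     # pdf, text, qa, url
    FieldSchema("source_url", DataType.VARCHAR, max_length=500),
    FieldSchema("created_at", DataType.VARCHAR, max_length=100),
]

collection = Collection("embeddings", CollectionSchema(fields))

# nlist is an index-build parameter; nprobe is supplied per search request.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()  # load into query nodes before serving searches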

Alternatives Considered

Alternative 1: Pinecone (Managed Vector DB)

Evaluated: Pinecone Starter plan ($70/month) → Standard ($300/month)

Pros:

  • ✅ Fully managed (no DevOps)
  • ✅ Excellent developer experience
  • ✅ Built-in metadata filtering
  • ✅ Good documentation
  • ✅ Automatic scaling

Cons:

  • Cost: $70-300/month → $2,000+/month at 100M vectors
  • Vendor lock-in: Proprietary API
  • Data residency: US-only (free tier), Europe costs more
  • Limited control: Can't optimize indexes ourselves
  • Pricing uncertainty: Usage-based can spike unexpectedly

Cost Projection:

| Scale | Pinecone Cost | Milvus Cost | Savings |
|-------|---------------|-------------|---------|
| 25M vectors | $300/month | $120/month | $180/month |
| 100M vectors | $2,000/month | $280/month | $1,720/month |

Why Rejected: Too expensive at scale, vendor lock-in


Alternative 2: PostgreSQL with pgvector

Evaluated: Postgres 15 with pgvector extension on Azure

Pros:

  • ✅ Minimal cost ($150/month for VM)
  • ✅ Single database (vectors + metadata together)
  • ✅ Familiar SQL syntax
  • ✅ ACID transactions
  • ✅ Strong community

Cons:

  • Performance: Slow at > 10M vectors (linear scan)
  • Indexing limitations: HNSW index memory-intensive
  • Not purpose-built: General DB doing vector search
  • Scaling challenges: Vertical scaling only

Benchmark (10M vectors, 384 dims):

  • pgvector (HNSW): 200-500ms p95 ❌
  • Milvus (IVF_FLAT): 15-35ms p95 ✅

Why Rejected: 10x slower than specialized vector DB


Alternative 3: Weaviate (Open Source Vector DB)

Evaluated: Weaviate v1.20 self-hosted

Pros:

  • ✅ Open source (Apache 2.0)
  • ✅ GraphQL API (modern)
  • ✅ Built-in vectorization modules
  • ✅ Good documentation
  • ✅ Active development

Cons:

  • Complexity: More features = steeper learning curve
  • Resource usage: Memory-heavy (2x Milvus for same data)
  • Overkill: We don't need graph features
  • Community size: Smaller than Milvus

Resource Comparison (25M vectors):

  • Weaviate: 32GB RAM, 4 CPU cores
  • Milvus: 16GB RAM, 2 CPU cores

Why Rejected: Too complex for our simple use case


Alternative 4: Qdrant (Rust-Based Vector DB)

Evaluated: Qdrant v1.5 self-hosted

Pros:

  • ✅ Written in Rust (fast, memory-safe)
  • ✅ Simple API (REST + gRPC)
  • ✅ Good filtering capabilities
  • ✅ Low memory footprint

Cons:

  • Maturity: Newer than Milvus (less battle-tested)
  • Community: Smaller ecosystem
  • Documentation: Good but less comprehensive
  • Unknown scaling: Unproven at 100M+ vectors

Why Rejected: Too new, prefer proven technology (Milvus)


Alternative 5: FAISS (In-Process Library)

Evaluated: FAISS library integrated into the Python service

Pros:

  • ✅ Facebook-developed (mature)
  • ✅ Extremely fast (C++ optimized)
  • ✅ No external database (in-process)
  • ✅ Free (library, not service)

Cons:

  • No persistence: In-memory only (need custom persistence layer)
  • No multi-tenancy: Must implement partitioning ourselves
  • No scalability: Single-machine limit
  • Ops complexity: Build our own distributed system

Engineering effort: estimated 400+ hours to reach production readiness

Why Rejected: Too much custom development vs. using proven DB


Decision Rationale

Why Milvus?

1. Performance (Fastest for Our Scale)

| Operation | Milvus | Pinecone | pgvector | Weaviate |
|-----------|--------|----------|----------|----------|
| Insert 1K vectors | 100ms | 150ms | 200ms | 180ms |
| Search (exact) | 15ms | 25ms | 500ms | 30ms |
| Search (ANN, 25M vectors) | 18ms | 22ms | 300ms | 25ms |

Milvus wins on search speed (critical for user experience)


2. Cost (Up to ~85% Cheaper Than the Managed Option at Scale)

# Self-Hosted Milvus on Azure (current)
Azure VM D4s_v3: $140/month
Storage (SSD): $50/month
Bandwidth: $30/month
Total: $220/month

# vs. Pinecone Standard
25M vectors: $300/month
100M vectors: $2,000/month

Savings at 100M: $1,780/month ($21,360/year)

3. Partition-Based Multi-Tenancy (Built-in)

# Each chatbot = separate partition
collection.create_partition("User_123_Project_456")

# Search ONLY in user's partition (fast!)
search_results = collection.search(
    data=[query_vector],
    partition_names=["User_123_Project_456"],  # Scoped search
    limit=5
)

Benefits:

  • ✅ 10-100x faster (search smaller subset)
  • ✅ Data isolation (security)
  • ✅ Easy deletion (drop partition)
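
The last point is a single call in practice. A minimal sketch of tenant offboarding, assuming pymilvus and the partition-naming scheme shown above:

# Sketch: remove all embeddings for one project by dropping its partition.
from pymilvus import Collection, Partition

collection = Collection("embeddings")
partition_name = "User_123_Project_456"  # illustrative name from the example above

if collection.has_partition(partition_name):
    Partition(collection, partition_name).release()  # unload from query nodes before dropping
    collection.drop_partition(partition_name)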

vs. Metadata filtering (Pinecone, Weaviate):

# Slower (scans all vectors, filters after)
search_results = index.query(
    vector=query_vector,
    filter={"project_id": "Project_456"},  # Post-filter
    top_k=5
)

4. Open Source (No Vendor Lock-in)

  • Apache 2.0 license
  • Can migrate to any cloud or on-premises
  • Full control over upgrades
  • Active community (10K+ GitHub stars)

5. Mature & Battle-Tested

  • Used by: Shopify, Compass, Tokopedia
  • 3+ years in production
  • Scales to billions of vectors
  • Strong CNCF ecosystem

Consequences

Positive Consequences

  • Low Latency: 15ms p50, 35ms p95 (meets the < 50ms requirement)
  • Cost-Effective: $220/month (vs. $2K for Pinecone at scale)
  • High Performance: Handles 100M vectors easily
  • Multi-Tenancy: Partition-based isolation (fast + secure)
  • Scalability: Horizontal scaling (add nodes)
  • No Vendor Lock-in: Open source, portable
  • Control: Tune indexes, optimize queries

Negative Consequences

  • Operational Overhead: Must manage infrastructure
  • DevOps Complexity: Monitoring, backups, upgrades
  • Learning Curve: Team needs to learn Milvus
  • Initial Setup Time: 2 weeks vs. instant for Pinecone

Mitigation Strategies

For Operational Overhead:

  • Automated backups (daily)
  • Monitoring via DataDog
  • Runbooks for common issues

For Updates:

  • Quarterly upgrade schedule
  • Test in staging first
  • Rollback plan ready

Implementation Details

Deployment Architecture

Azure Container Apps (Milvus):
├── Milvus Standalone (< 1M vectors)
└── Milvus Cluster (production)
    ├── Query Node ×2 (handle searches)
    ├── Data Node ×2 (handle inserts)
    ├── Index Node ×1 (build indexes)
    ├── Coordinator ×1 (orchestration)
    ├── etcd (metadata storage)
    └── MinIO (object storage for vectors)
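
In either topology, a collection serves searches only after it has been loaded onto the query nodes. Below is a small sketch of the kind of readiness check the backend can run at startup (the endpoint name is illustrative):

# Sketch: verify Milvus is reachable and the collection is loaded before taking traffic.
from pymilvus import connections, utility, Collection

connections.connect(host="milvus.internal", port="19530")  # illustrative private endpoint

if not utility.has_collection("embeddings"):
    raise RuntimeError("embeddings collection is missing")

Collection("embeddings").load()  # safe to call repeatedly
print(utility.loading_progress("embeddings"))  # e.g. {'loading_progress': '100%'}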

Code Integration

# shared/database/milvus_embeddings_service.py
from pymilvus import connections, Collection

class MilvusEmbeddingsService:
    def __init__(self, host="localhost", port="19530"):
        connections.connect(host=host, port=port)

    def insert_embeddings(self, collection_name, embeddings_data):
        """
        Insert embeddings into the project's partition.
        Creates the partition if it doesn't exist.
        """
        collection = Collection(collection_name)
        project_id = embeddings_data[0]['project_id']
        partition_name = self._sanitize_partition_name(project_id)

        partition = self._get_or_create_partition(collection, partition_name)
        insert_result = partition.insert(embeddings_data)
        partition.flush()

        return list(insert_result.primary_keys)

    def search_embeddings(self, collection_name, query_vector,
                         user_id, project_id, top_k=5):
        """
        Search within user's partition only.
        """
        collection = Collection(collection_name)
        partition_name = self._sanitize_partition_name(project_id)

        search_results = collection.search(
            data=[query_vector],
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=top_k,
            partition_names=[partition_name]  # Partition-scoped!
        )

        return self._format_results(search_results)
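
A typical call site in the RAG retrieval step looks like the following. This is an illustrative sketch: it assumes the query is encoded with the same 384-dim model used at ingestion, and that _format_results returns dicts containing the chunk text.

# Illustrative usage of MilvusEmbeddingsService in the retrieval step.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
service = MilvusEmbeddingsService(host="milvus.internal", port="19530")  # illustrative host

query_vector = encoder.encode("What is the refund policy?").tolist()

chunks = service.search_embeddings(
    collection_name="embeddings",
    query_vector=query_vector,
    user_id="User_123",
    project_id="Project_456",
    top_k=5,
)

# Concatenate retrieved chunks into the LLM prompt context.
context = "\n\n".join(chunk["text"] for chunk in chunks)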

Performance Benchmarks

| Metric | Current (25M vectors) | Projected (100M vectors) |
|--------|-----------------------|--------------------------|
| Insert throughput | 5,000 vectors/sec | 4,000 vectors/sec |
| Search latency (p50) | 15ms | 22ms |
| Search latency (p95) | 35ms | 50ms |
| Memory usage | 16GB | 40GB |
| Disk usage | 50GB | 160GB |

All within acceptable limits


Compliance & Security

Data Encryption:

  • At rest: Azure Disk Encryption (AES-256)
  • In transit: TLS 1.2+

Access Control:

  • Private endpoint (Azure VNet only)
  • No public internet access
  • Service-to-service auth via Azure AD

Data Isolation:

  • Each chatbot = separate partition
  • No cross-tenant data leakage possible

Backup:

  • Daily snapshots to Azure Blob Storage
  • 30-day retention
  • Tested restore procedure

Migration Path

Scenario 1: Migrate to Pinecone

If operational overhead too high:

  1. Export Milvus data via bulk export
  2. Transform to Pinecone format
  3. Bulk upload to Pinecone
  4. Update code to use Pinecone SDK
  5. Estimated time: 3 weeks

Scenario 2: Migrate to Managed Milvus (Zilliz Cloud)

If want managed service but keep Milvus:

  1. Zilliz Cloud (official managed Milvus)
  2. Export from self-hosted → import to Zilliz
  3. Update connection strings
  4. Estimated time: 1 week

Review Schedule

Next Review: 2025-02-28 (6 months after implementation)

Review Criteria:

  • Search latency p95 < 50ms
  • Monthly cost < $300
  • Uptime > 99.5%
  • No scalability issues at 100M vectors
  • Ops burden < 10 hours/month

Triggers for Re-evaluation:

  • Monthly cost > $500
  • Latency p95 > 100ms
  • Frequent operational issues
  • Team requests managed service

Related ADRs:

  • ADR-001: LLM Selection - RAG use case
  • ADR-002: Cosmos DB - Metadata storage (separate from vectors)
  • ADR-009: Embedding Model Selection (BAAI/bge-small-en-v1.5) (planned)

Evidence & Testing

Load Test Results (Aug 2024):

  • 1,000 concurrent searches
  • 25M vector collection
  • Result: p95 = 38ms, p99 = 65ms
  • No errors, stable performance

Production Metrics (4 months):

  • Avg Search Latency: 18ms p50, 32ms p95
  • Uptime: 99.92% (2 planned maintenance windows)
  • Cost: $235/month average
  • Incidents: 1 (resolved in 45 minutes)



Last Updated: 2025-12-26
Review Date: 2025-02-28
Status: Active and performing excellently


"Speed matters: the fastest vector DB wins."