
ADR-003: Milvus for Vector Embeddings

Status: ✅ Accepted
Date: 2024-08-10 (Q3 2024)
Decision Makers: CTO, ML Engineering Lead, Backend Lead
Consulted: Data Engineering Team, Solution Architect
Informed: Engineering Team


Context

MachineAvatars implements Retrieval-Augmented Generation (RAG) to provide chatbots with contextual knowledge from user-uploaded documents (PDFs, text files, website content). This requires a vector database to:

  1. Store document embeddings: Convert text chunks to 384-dimensional vectors
  2. Perform similarity search: Find semantically similar chunks for user queries
  3. Support multi-tenancy: Isolate data per chatbot/user
  4. Scale: Handle millions of embedding vectors
  5. Low latency: < 100ms search time for 95% of queries

Current Scale:

  • 500K documents across all users
  • ~25M embedding vectors (50 chunks/document avg)
  • 10K searches/day

Projected Scale (6 months):

  • 2M documents
  • 100M embedding vectors
  • 100K searches/day

Requirements:

  • Vector dimensions: 384 (BAAI/bge-small-en-v1.5; see the sketch after this list)
  • Similarity metric: L2 distance or cosine
  • Multi-tenancy: Data isolation per chatbot
  • Query latency: < 50ms p95
  • Budget: < $500/month at current scale
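
The 384-dimension figure above comes directly from the chosen embedding model. As a point of reference, here is a minimal sketch of producing a query vector, assuming sentence-transformers is the encoder in use (the actual embedding service code is not shown in this ADR):

# Illustrative sketch only; assumes sentence-transformers is the encoder.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")  # outputs 384-dim embeddings

# Normalizing makes L2 distance rank-equivalent to cosine similarity,
# so either metric listed above gives the same ordering.
query_vector = model.encode("How do I reset my password?", normalize_embeddings=True)
assert query_vector.shape == (384,)  # must match the collection's FLOAT_VECTOR dim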

Decision

We selected Milvus (self-hosted on Azure) with partition-based multi-tenancy.

Architecture

graph TB
    A[User Query] --> B[Embedding Service]
    B --> C[384-dim Vector]
    C --> D[Milvus Search]

    subgraph "Milvus Cluster"
        D --> E[Collection: embeddings]
        E --> F[Partition: User_123_Project_1]
        E --> G[Partition: User_123_Project_2]
        E --> H[Partition: User_456_Project_3]
    end

    D --> I[Top-K Results]
    I --> J[LLM Context]

    style D fill:#FFF3E0
    style J fill:#E8F5E9

Implementation Details

# Milvus Configuration
Collection: "embeddings"
Dimensions: 384
Index Type: IVF_FLAT
Metric: L2
Parameters: {"nlist": 128, "nprobe": 10}

# Schema
Fields:
- id: INT64 (primary key, auto)
- document_id: VARCHAR(100)
- user_id: VARCHAR(100)
- project_id: VARCHAR(100)  # Partition key
- chunk_index: INT32
- text: VARCHAR(2000)
- embedding: FLOAT_VECTOR(384)
- data_type: VARCHAR(50)    # pdf, text, qa, url
- source_url: VARCHAR(500)
- created_at: VARCHAR(100)
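
The configuration and schema above map onto a straightforward collection definition. A sketch with pymilvus follows (field lengths mirror the schema table; the actual setup code in the repository may differ):

# Sketch: create the "embeddings" collection with the schema and IVF_FLAT index above.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("document_id", DataType.VARCHAR, max_length=100),
    FieldSchema("user_id", DataType.VARCHAR, max_length=100),
    FieldSchema("project_id", DataType.VARCHAR, max_length=100),   # used to derive the partition name
    FieldSchema("chunk_index", DataType.INT32),
    FieldSchema("text", DataType.VARCHAR, max_length=2000),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),
    FieldSchema("data_type", DataType.VARCHAR, max_length=50),     # pdf, text, qa, url
    FieldSchema("source_url", DataType.VARCHAR, max_length=500),
    FieldSchema("created_at", DataType.VARCHAR, max_length=100),
]

collection = Collection("embeddings", CollectionSchema(fields))

# nlist is an index-build parameter; nprobe is supplied per search request.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()  # load into query nodes before serving searches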

Alternatives Considered

Alternative 1: Pinecone (Managed Vector DB)

Evaluated: Pinecone Starter plan ($70/month) → Standard ($300/month)

Pros:

  • ✅ Fully managed (no DevOps)
  • ✅ Excellent developer experience
  • ✅ Built-in metadata filtering
  • ✅ Good documentation
  • ✅ Automatic scaling

Cons:

  • Cost: $70-300/month → $2,000+/month at 100M vectors
  • Vendor lock-in: Proprietary API
  • Data residency: US-only (free tier), Europe costs more
  • Limited control: Can't optimize indexes ourselves
  • Pricing uncertainty: Usage-based can spike unexpectedly

Cost Projection:

| Scale | Pinecone Cost | Milvus Cost | Savings |
|-------|---------------|-------------|---------|
| 25M vectors | $300/month | $120/month | $180/month |
| 100M vectors | $2,000/month | $280/month | $1,720/month |

Why Rejected: Too expensive at scale, vendor lock-in


Alternative 2: PostgreSQL with pgvector

Evaluated: Postgres 15 with pgvector extension on Azure

Pros:

  • ✅ Minimal cost ($150/month for VM)
  • ✅ Single database (vectors + metadata together)
  • ✅ Familiar SQL syntax
  • ✅ ACID transactions
  • ✅ Strong community

Cons:

  • Performance: Slow at > 10M vectors (linear scan)
  • Indexing limitations: HNSW index memory-intensive
  • Not purpose-built: General DB doing vector search
  • Scaling challenges: Vertical scaling only

Benchmark (10M vectors, 384 dims):

  • pgvector (HNSW): 200-500ms p95 ❌
  • Milvus (IVF_FLAT): 15-35ms p95 ✅

Why Rejected: 10x slower than specialized vector DB


Alternative 3: Weaviate (Open Source Vector DB)

Evaluated: Weaviate v1.20 self-hosted

Pros:

  • ✅ Open source (Apache 2.0)
  • ✅ GraphQL API (modern)
  • ✅ Built-in vectorization modules
  • ✅ Good documentation
  • ✅ Active development

Cons:

  • Complexity: More features = steeper learning curve
  • Resource usage: Memory-heavy (2x Milvus for same data)
  • Overkill: We don't need graph features
  • Community size: Smaller than Milvus

Resource Comparison (25M vectors):

  • Weaviate: 32GB RAM, 4 CPU cores
  • Milvus: 16GB RAM, 2 CPU cores

Why Rejected: Too complex for our simple use case


Alternative 4: Qdrant (Rust-Based Vector DB)

Evaluated: Qdrant v1.5 self-hosted

Pros:

  • ✅ Written in Rust (fast, memory-safe)
  • ✅ Simple API (REST + gRPC)
  • ✅ Good filtering capabilities
  • ✅ Low memory footprint

Cons:

  • Maturity: Newer than Milvus (less battle-tested)
  • Community: Smaller ecosystem
  • Documentation: Good but less comprehensive
  • Unknown scaling: Unproven at 100M+ vectors

Why Rejected: Too new, prefer proven technology (Milvus)


Alternative 5: FAISS (In-Process Library)

Evaluated: FAISS library integrated into the Python service

Pros:

  • ✅ Facebook-developed (mature)
  • ✅ Extremely fast (C++ optimized)
  • ✅ No external database (in-process)
  • ✅ Free (library, not service)

Cons:

  • No persistence: In-memory only (need custom persistence layer)
  • No multi-tenancy: Must implement partitioning ourselves
  • No scalability: Single-machine limit
  • Ops complexity: Build our own distributed system

Engineering effort: estimated 400+ hours to reach production readiness

Why Rejected: Too much custom development vs. using proven DB


Decision Rationale

Why Milvus?

1. Performance (Fastest for Our Scale)

| Operation | Milvus | Pinecone | pgvector | Weaviate |
|-----------|--------|----------|----------|----------|
| Insert 1K vectors | 100ms | 150ms | 200ms | 180ms |
| Search (exact) | 15ms | 25ms | 500ms | 30ms |
| Search (ANN, 25M vectors) | 18ms | 22ms | 300ms | 25ms |

Milvus wins on search speed (critical for user experience)


2. Cost (Up to ~85% Cheaper Than the Managed Option at Scale)

# Self-Hosted Milvus on Azure (current)
Azure VM D4s_v3: $140/month
Storage (SSD): $50/month
Bandwidth: $30/month
Total: $220/month

# vs. Pinecone Standard
25M vectors: $300/month
100M vectors: $2,000/month

Savings at 100M: $1,780/month ($21,360/year)

3. Partition-Based Multi-Tenancy (Built-in)

# Each chatbot = separate partition
collection.create_partition("User_123_Project_456")

# Search ONLY in user's partition (fast!)
search_results = collection.search(
    data=[query_vector],
    partition_names=["User_123_Project_456"],  # Scoped search
    limit=5
)

Benefits:

  • ✅ 10-100x faster (search smaller subset)
  • ✅ Data isolation (security)
  • ✅ Easy deletion (drop partition)
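
The last point is a single call in practice. A minimal sketch of tenant offboarding, assuming pymilvus and the partition-naming scheme shown above:

# Sketch: remove all embeddings for one project by dropping its partition.
from pymilvus import Collection, Partition

collection = Collection("embeddings")
partition_name = "User_123_Project_456"  # illustrative name from the example above

if collection.has_partition(partition_name):
    Partition(collection, partition_name).release()  # unload from query nodes before dropping
    collection.drop_partition(partition_name)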

vs. Metadata filtering (Pinecone, Weaviate):

# Slower (scans all vectors, filters after)
search_results = index.query(
    vector=query_vector,
    filter={"project_id": "Project_456"},  # Post-filter
    top_k=5
)

4. Open Source (No Vendor Lock-in)

  • Apache 2.0 license
  • Can migrate to any cloud or on-premises
  • Full control over upgrades
  • Active community (10K+ GitHub stars)

5. Mature & Battle-Tested

  • Used by: Shopify, Compass, Tokopedia
  • 3+ years in production
  • Scales to billions of vectors
  • Strong CNCF ecosystem

Consequences

Positive Consequences

  • Low Latency: 15ms p50, 35ms p95 (meets the < 50ms requirement)
  • Cost-Effective: $220/month (vs. $2K for Pinecone at scale)
  • High Performance: Handles 100M vectors easily
  • Multi-Tenancy: Partition-based isolation (fast + secure)
  • Scalability: Horizontal scaling (add nodes)
  • No Vendor Lock-in: Open source, portable
  • Control: Tune indexes, optimize queries

Negative Consequences

  • Operational Overhead: Must manage infrastructure
  • DevOps Complexity: Monitoring, backups, upgrades
  • Learning Curve: Team needs to learn Milvus
  • Initial Setup Time: 2 weeks vs. instant for Pinecone

Mitigation Strategies

For Operational Overhead:

  • Automated backups (daily)
  • Monitoring via DataDog
  • Runbooks for common issues

For Updates:

  • Quarterly upgrade schedule
  • Test in staging first
  • Rollback plan ready

Implementation Details

Deployment Architecture

Azure Container Apps (Milvus):
├── Milvus Standalone (< 1M vectors)
└── Milvus Cluster (production)
    ├── Query Node ×2 (handle searches)
    ├── Data Node ×2 (handle inserts)
    ├── Index Node ×1 (build indexes)
    ├── Coordinator ×1 (orchestration)
    ├── etcd (metadata storage)
    └── MinIO (object storage for vectors)
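
In either topology, a collection serves searches only after it has been loaded onto the query nodes. Below is a small sketch of the kind of readiness check the backend can run at startup (the endpoint name is illustrative):

# Sketch: verify Milvus is reachable and the collection is loaded before taking traffic.
from pymilvus import connections, utility, Collection

connections.connect(host="milvus.internal", port="19530")  # illustrative private endpoint

if not utility.has_collection("embeddings"):
    raise RuntimeError("embeddings collection is missing")

Collection("embeddings").load()  # safe to call repeatedly
print(utility.loading_progress("embeddings"))  # e.g. {'loading_progress': '100%'}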

Code Integration

# shared/database/milvus_embeddings_service.py
from pymilvus import connections, Collection

class MilvusEmbeddingsService:
    def __init__(self, host="localhost", port="19530"):
        connections.connect(host=host, port=port)

    def insert_embeddings(self, collection_name, embeddings_data):
        """
        Insert embeddings into the project's partition.
        Creates the partition if it doesn't exist.
        """
        collection = Collection(collection_name)
        project_id = embeddings_data[0]['project_id']
        partition_name = self._sanitize_partition_name(project_id)

        partition = self._get_or_create_partition(collection, partition_name)
        insert_result = partition.insert(embeddings_data)
        partition.flush()

        return list(insert_result.primary_keys)

    def search_embeddings(self, collection_name, query_vector,
                         user_id, project_id, top_k=5):
        """
        Search within user's partition only.
        """
        collection = Collection(collection_name)
        partition_name = self._sanitize_partition_name(project_id)

        search_results = collection.search(
            data=[query_vector],
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=top_k,
            partition_names=[partition_name]  # Partition-scoped!
        )

        return self._format_results(search_results)
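
A typical call site in the RAG retrieval step looks like the following. This is an illustrative sketch: it assumes the query is encoded with the same 384-dim model used at ingestion, and that _format_results returns dicts containing the chunk text.

# Illustrative usage of MilvusEmbeddingsService in the retrieval step.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
service = MilvusEmbeddingsService(host="milvus.internal", port="19530")  # illustrative host

query_vector = encoder.encode("What is the refund policy?").tolist()

chunks = service.search_embeddings(
    collection_name="embeddings",
    query_vector=query_vector,
    user_id="User_123",
    project_id="Project_456",
    top_k=5,
)

# Concatenate retrieved chunks into the LLM prompt context.
context = "\n\n".join(chunk["text"] for chunk in chunks)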

Performance Benchmarks

| Metric | Current (25M vectors) | Projected (100M vectors) |
|--------|-----------------------|--------------------------|
| Insert throughput | 5,000 vectors/sec | 4,000 vectors/sec |
| Search latency (p50) | 15ms | 22ms |
| Search latency (p95) | 35ms | 50ms |
| Memory usage | 16GB | 40GB |
| Disk usage | 50GB | 160GB |

All within acceptable limits


Compliance & Security

Data Encryption:

  • At rest: Azure Disk Encryption (AES-256)
  • In transit: TLS 1.2+

Access Control:

  • Private endpoint (Azure VNet only)
  • No public internet access
  • Service-to-service auth via Azure AD

Data Isolation:

  • Each chatbot = separate partition
  • No cross-tenant data leakage possible

Backup:

  • Daily snapshots to Azure Blob Storage
  • 30-day retention
  • Tested restore procedure

Migration Path

Scenario 1: Migrate to Pinecone

If operational overhead too high:

  1. Export Milvus data via bulk export
  2. Transform to Pinecone format
  3. Bulk upload to Pinecone
  4. Update code to use Pinecone SDK
  5. Estimated time: 3 weeks

Scenario 2: Migrate to Managed Milvus (Zilliz Cloud)

If want managed service but keep Milvus:

  1. Zilliz Cloud (official managed Milvus)
  2. Export from self-hosted → import to Zilliz
  3. Update connection strings
  4. Estimated time: 1 week

Review Schedule

Next Review: 2025-02-28 (6 months after implementation)

Review Criteria:

  • Search latency p95 < 50ms
  • Monthly cost < $300
  • Uptime > 99.5%
  • No scalability issues at 100M vectors
  • Ops burden < 10 hours/month

Triggers for Re-evaluation:

  • Monthly cost > $500
  • Latency p95 > 100ms
  • Frequent operational issues
  • Team requests managed service

Related ADRs:

  • ADR-001: LLM Selection - RAG use case
  • ADR-002: Cosmos DB - Metadata storage (separate from vectors)
  • ADR-009: Embedding Model Selection (BAAI/bge-small-en-v1.5) (planned)

Evidence & Testing

Load Test Results (Aug 2024):

  • 1,000 concurrent searches
  • 25M vector collection
  • Result: p95 = 38ms, p99 = 65ms
  • No errors, stable performance

Production Metrics (4 months):

  • Avg Search Latency: 18ms p50, 32ms p95
  • Uptime: 99.92% (2 planned maintenance windows)
  • Cost: $235/month average
  • Incidents: 1 (resolved in 45 minutes)



Last Updated: 2025-12-26
Review Date: 2025-02-28
Status: Active and performing excellently


"Speed matters: the fastest vector DB wins."