ADR-003: Milvus for Vector Embeddings¶
Status: ✅ Accepted
Date: 2024-08-10 (Q3 2024)
Decision Makers: CTO, ML Engineering Lead, Backend Lead
Consulted: Data Engineering Team, Solution Architect
Informed: Engineering Team
Context¶
MachineAvatars implements Retrieval-Augmented Generation (RAG) to provide chatbots with contextual knowledge from user-uploaded documents (PDFs, text files, website content). This requires a vector database to:
- Store document embeddings: Convert text chunks to 384-dimensional vectors
- Perform similarity search: Find semantically similar chunks for user queries
- Support multi-tenancy: Isolate data per chatbot/user
- Scale: Handle millions of embedding vectors
- Return results with low latency: < 100ms search time for 95% of queries
Current Scale:
- 500K documents across all users
- ~25M embedding vectors (50 chunks/document avg)
- 10K searches/day
Projected Scale (6 months):
- 2M documents
- 100M embedding vectors
- 100K searches/day
Requirements:
- Vector dimensions: 384 (BAAI/bge-small-en-v1.5)
- Similarity metric: L2 distance or cosine
- Multi-tenancy: Data isolation per chatbot
- Query latency: < 50ms p95
- Budget: < $500/month at current scale
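For reference, a minimal sketch of producing these 384-dimensional vectors, assuming the sentence-transformers library is used to run BAAI/bge-small-en-v1.5 (the full embedding pipeline is the subject of the planned ADR-009):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunks = ["MachineAvatars lets users upload PDFs, text files and website content."]

# encode() returns one 384-dim vector per chunk; normalizing the vectors makes
# L2 and cosine rankings equivalent, so either metric listed above works.
vectors = model.encode(chunks, normalize_embeddings=True)
print(vectors.shape)  # (1, 384)
```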
Decision¶
We selected Milvus (self-hosted on Azure) with partition-based multi-tenancy.
Architecture¶
graph TB
A[User Query] --> B[Embedding Service]
B --> C[384-dim Vector]
C --> D[Milvus Search]
subgraph "Milvus Cluster"
D --> E[Collection: embeddings]
E --> F[Partition: User_123_Project_1]
E --> G[Partition: User_123_Project_2]
E --> H[Partition: User_456_Project_3]
end
D --> I[Top-K Results]
I --> J[LLM Context]
style D fill:#FFF3E0
style J fill:#E8F5E9
Implementation Details¶
# Milvus Configuration
Collection: "embeddings"
Dimensions: 384
Index Type: IVF_FLAT
Metric: L2
Parameters: {"nlist": 128, "nprobe": 10}
# Schema
Fields:
- id: INT64 (primary key, auto)
- document_id: VARCHAR(100)
- user_id: VARCHAR(100)
- project_id: VARCHAR(100) # Partition key
- chunk_index: INT32
- text: VARCHAR(2000)
- embedding: FLOAT_VECTOR(384)
- data_type: VARCHAR(50) # pdf, text, qa, url
- source_url: VARCHAR(500)
- created_at: VARCHAR(100)
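As a reference point, a hedged sketch of how the collection and index above could be declared with pymilvus (field names follow the schema; host and port are illustrative):

```python
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("document_id", DataType.VARCHAR, max_length=100),
    FieldSchema("user_id", DataType.VARCHAR, max_length=100),
    FieldSchema("project_id", DataType.VARCHAR, max_length=100),
    FieldSchema("chunk_index", DataType.INT32),
    FieldSchema("text", DataType.VARCHAR, max_length=2000),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),
    FieldSchema("data_type", DataType.VARCHAR, max_length=50),
    FieldSchema("source_url", DataType.VARCHAR, max_length=500),
    FieldSchema("created_at", DataType.VARCHAR, max_length=100),
]
schema = CollectionSchema(fields, description="RAG document embeddings")
collection = Collection("embeddings", schema)

# IVF_FLAT index with L2 metric, matching the parameters above.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128},
    },
)
collection.load()
```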
Alternatives Considered¶
Alternative 1: Pinecone (Managed Vector DB)¶
Evaluated: Pinecone Starter plan ($70/month) → Standard ($300/month)
Pros:
- ✅ Fully managed (no DevOps)
- ✅ Excellent developer experience
- ✅ Built-in metadata filtering
- ✅ Good documentation
- ✅ Automatic scaling
Cons:
- ❌ Cost: $70-300/month → $2,000+/month at 100M vectors
- ❌ Vendor lock-in: Proprietary API
- ❌ Data residency: US-only (free tier), Europe costs more
- ❌ Limited control: Can't optimize indexes ourselves
- ❌ Pricing uncertainty: Usage-based can spike unexpectedly
Cost Projection:

| Scale | Pinecone Cost | Milvus Cost | Monthly Savings |
|-------|---------------|-------------|-----------------|
| 25M vectors | $300/month | $120/month | $180 |
| 100M vectors | $2,000/month | $280/month | $1,720 |
Why Rejected: Too expensive at scale, vendor lock-in
Alternative 2: PostgreSQL with pgvector¶
Evaluated: Postgres 15 with pgvector extension on Azure
Pros:
- ✅ Minimal cost ($150/month for VM)
- ✅ Single database (vectors + metadata together)
- ✅ Familiar SQL syntax
- ✅ ACID transactions
- ✅ Strong community
Cons:
- ❌ Performance: Slow at > 10M vectors (linear scan)
- ❌ Indexing limitations: HNSW index memory-intensive
- ❌ Not purpose-built: General DB doing vector search
- ❌ Scaling challenges: Vertical scaling only
Benchmark (10M vectors, 384 dims):
- pgvector (HNSW): 200-500ms p95 ❌
- Milvus (IVF_FLAT): 15-35ms p95 ✅
Why Rejected: 10x slower than specialized vector DB
Alternative 3: Weaviate (Open Source Vector DB)¶
Evaluated: Weaviate v1.20 self-hosted
Pros:
- ✅ Open source (Apache 2.0)
- ✅ GraphQL API (modern)
- ✅ Built-in vectorization modules
- ✅ Good documentation
- ✅ Active development
Cons:
- ❌ Complexity: More features = steeper learning curve
- ❌ Resource usage: Memory-heavy (2x Milvus for same data)
- ❌ Overkill: We don't need graph features
- ❌ Community size: Smaller than Milvus
Resource Comparison (25M vectors):
- Weaviate: 32GB RAM, 4 CPU cores
- Milvus: 16GB RAM, 2 CPU cores
Why Rejected: Too complex for our simple use case
Alternative 4: Qdrant (Rust-Based Vector DB)¶
Evaluated: Qdrant v1.5 self-hosted
Pros:
- ✅ Written in Rust (fast, memory-safe)
- ✅ Simple API (REST + gRPC)
- ✅ Good filtering capabilities
- ✅ Low memory footprint
Cons:
- ❌ Maturity: Newer than Milvus (less battle-tested)
- ❌ Community: Smaller ecosystem
- ❌ Documentation: Good but less comprehensive
- ❌ Unknown scaling: Unproven at 100M+ vectors
Why Rejected: Too new, prefer proven technology (Milvus)
Alternative 5: FAISS (Facebook AI Similarity Search)¶
Evaluated: FAISS library integrated into Python service
Pros:
- ✅ Facebook-developed (mature)
- ✅ Extremely fast (C++ optimized)
- ✅ No external database (in-process)
- ✅ Free (library, not service)
Cons:
- ❌ No persistence: In-memory only (need custom persistence layer)
- ❌ No multi-tenancy: Must implement partitioning ourselves
- ❌ No scalability: Single-machine limit
- ❌ Ops complexity: Build our own distributed system
Engineering effort: 400+ hours to build a production-ready system
Why Rejected: Too much custom development vs. using proven DB
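For comparison, a minimal sketch of what the FAISS route would have looked like in-process; it illustrates why persistence and per-tenant partitioning would have been custom work (data and file name are illustrative):

```python
import numpy as np
import faiss

dim = 384
index = faiss.IndexFlatL2(dim)                    # exact L2 search, in memory only
vectors = np.random.rand(10_000, dim).astype("float32")
index.add(vectors)                                # no partitions: one flat index per process

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)           # top-5 nearest neighbours

faiss.write_index(index, "embeddings.faiss")      # "persistence" = manual file snapshots
```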
Decision Rationale¶
Why Milvus?¶
1. Performance (Fastest for Our Scale)
| Operation | Milvus | Pinecone | pgvector | Weaviate |
|---|---|---|---|---|
| Insert 1K vectors | 100ms | 150ms | 200ms | 180ms |
| Search (exact) | 15ms | 25ms | 500ms | 30ms |
| Search (ANN, 25M) | 18ms | 22ms | 300ms | 25ms |
Milvus wins on search speed (critical for user experience)
2. Cost (70% cheaper than managed option)
# Self-Hosted Milvus on Azure (current)
Azure VM D4s_v3: $140/month
Storage (SSD): $50/month
Bandwidth: $30/month
Total: $220/month
# vs. Pinecone Standard
25M vectors: $300/month
100M vectors: $2,000/month
Savings at 100M: ~$1,720/month (~$20,640/year), based on the $280/month Milvus estimate in the cost projection above
3. Partition-Based Multi-Tenancy (Built-in)
# Each chatbot = a separate partition
collection.create_partition("User_123_Project_456")

# Search ONLY in the user's partition (fast!)
search_results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    partition_names=["User_123_Project_456"],  # scoped search
)
Benefits:
- ✅ 10-100x faster (search smaller subset)
- ✅ Data isolation (security)
- ✅ Easy deletion (drop partition)
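To illustrate the deletion benefit, a hedged sketch of removing a chatbot's data by dropping its partition (names are illustrative; a loaded partition must be released before it can be dropped):

```python
from pymilvus import Collection, Partition

collection = Collection("embeddings")
partition_name = "User_123_Project_456"

if collection.has_partition(partition_name):
    Partition(collection, partition_name).release()  # release the loaded partition first
    collection.drop_partition(partition_name)        # all of the chatbot's vectors are gone
```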
vs. Metadata filtering (Pinecone, Weaviate):
# Slower (scans all vectors, filters after)
search_results = index.query(
    vector=query_vector,
    filter={"project_id": "Project_456"},  # post-filter
    top_k=5
)
4. Open Source (No Vendor Lock-in)
- Apache 2.0 license
- Can migrate to any cloud or on-premises
- Full control over upgrades
- Active community (10K+ GitHub stars)
5. Mature & Battle-Tested
- Used by: Shopify, Compass, Tokopedia
- 3+ years in production
- Scales to billions of vectors
- Strong CNCF ecosystem
Consequences¶
Positive Consequences¶
✅ Low Latency: 15ms p50, 35ms p95 (meets < 50ms requirement)
✅ Cost-Effective: $220/month (vs. $2K for Pinecone at scale)
✅ High Performance: Handles 100M vectors easily
✅ Multi-Tenancy: Partition-based isolation (fast + secure)
✅ Scalability: Horizontal scaling (add nodes)
✅ No Vendor Lock-in: Open source, portable
✅ Control: Tune indexes, optimize queries
Negative Consequences¶
❌ Operational Overhead: Must manage infrastructure
❌ DevOps Complexity: Monitoring, backups, upgrades
❌ Learning Curve: Team needs to learn Milvus
❌ Initial Setup Time: 2 weeks vs. instant for Pinecone
Mitigation Strategies¶
For Operational Overhead:
- Automated backups (daily)
- Monitoring via DataDog
- Runbooks for common issues
For Updates:
- Quarterly upgrade schedule
- Test in staging first
- Rollback plan ready
Implementation Details¶
Deployment Architecture¶
Azure Container Apps (Milvus):
├── Milvus Standalone (< 1M vectors)
└── Milvus Cluster (production)
├── Query Node ×2 (handle searches)
├── Data Node ×2 (handle inserts)
├── Index Node ×1 (build indexes)
├── Coordinator ×1 (orchestration)
├── etcd (metadata storage)
└── MinIO (object storage for vectors)
Code Integration¶
# shared/database/milvus_embeddings_service.py
from pymilvus import connections, Collection


class MilvusEmbeddingsService:
    def __init__(self, host="localhost", port="19530"):
        connections.connect(host=host, port=port)

    def insert_embeddings(self, collection_name, embeddings_data):
        """
        Insert embeddings into the project's partition.
        Auto-creates the partition if it doesn't exist.
        """
        collection = Collection(collection_name)
        project_id = embeddings_data[0]["project_id"]
        partition_name = self._sanitize_partition_name(project_id)
        partition = self._get_or_create_partition(collection, partition_name)
        insert_result = partition.insert(embeddings_data)
        partition.flush()  # persist the segment so the vectors become searchable
        return list(insert_result.primary_keys)

    def search_embeddings(self, collection_name, query_vector,
                          user_id, project_id, top_k=5):
        """
        Search within the user's partition only.
        """
        collection = Collection(collection_name)
        partition_name = self._sanitize_partition_name(project_id)
        search_results = collection.search(
            data=[query_vector],
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=top_k,
            partition_names=[partition_name],  # partition-scoped search
        )
        return self._format_results(search_results)

    # _sanitize_partition_name, _get_or_create_partition and _format_results
    # are small helpers omitted here for brevity.
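A hypothetical usage sketch of the service above (hostnames, IDs and vector values are placeholders):

```python
service = MilvusEmbeddingsService(host="milvus.internal", port="19530")

chunks = [{
    "document_id": "doc_001",
    "user_id": "User_123",
    "project_id": "Project_456",
    "chunk_index": 0,
    "text": "MachineAvatars supports PDF, text, Q&A and URL sources.",
    "embedding": [0.0] * 384,   # placeholder for a bge-small-en-v1.5 vector
    "data_type": "pdf",
    "source_url": "",
    "created_at": "2024-08-10T00:00:00Z",
}]
service.insert_embeddings("embeddings", chunks)

results = service.search_embeddings(
    "embeddings",
    query_vector=[0.0] * 384,   # placeholder query embedding
    user_id="User_123",
    project_id="Project_456",
    top_k=5,
)
```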
Performance Benchmarks¶
| Metric | Current (25M vectors) | Projected (100M) |
|---|---|---|
| Insert throughput | 5,000 vectors/sec | 4,000 vectors/sec |
| Search latency (p50) | 15ms | 22ms |
| Search latency (p95) | 35ms | 50ms |
| Memory usage | 16GB | 40GB |
| Disk usage | 50GB | 160GB |
All within acceptable limits ✅
Compliance & Security¶
Data Encryption:
- At rest: Azure Disk Encryption (AES-256)
- In transit: TLS 1.2+
Access Control:
- Private endpoint (Azure VNet only)
- No public internet access
- Service-to-service auth via Azure AD
Data Isolation:
- Each chatbot = separate partition
- No cross-tenant data leakage possible
Backup:
- Daily snapshots to Azure Blob Storage
- 30-day retention
- Tested restore procedure
Migration Path¶
Scenario 1: Migrate to Pinecone¶
If operational overhead proves too high:
- Export Milvus data via bulk export
- Transform to Pinecone format
- Bulk upload to Pinecone
- Update code to use Pinecone SDK
- Estimated time: 3 weeks
Scenario 2: Migrate to Managed Milvus (Zilliz Cloud)¶
If we want a managed service while keeping Milvus:
- Zilliz Cloud (official managed Milvus)
- Export from self-hosted → import to Zilliz
- Update connection strings
- Estimated time: 1 week
Review Schedule¶
Next Review: 2025-02-28 (6 months after implementation)
Review Criteria:
- Search latency p95 < 50ms
- Monthly cost < $300
- Uptime > 99.5%
- No scalability issues at 100M vectors
- Ops burden < 10 hours/month
Triggers for Re-evaluation:
- Monthly cost > $500
- Latency p95 > 100ms
- Frequent operational issues
- Team requests managed service
Related ADRs¶
- ADR-001: LLM Selection - RAG use case
- ADR-002: Cosmos DB - Metadata storage (separate from vectors)
- ADR-009: Embedding Model Selection (BAAI/bge-small-en-v1.5) (planned)
Evidence & Testing¶
Load Test Results (Aug 2024):
- 1,000 concurrent searches
- 25M vector collection
- Result: p95 = 38ms, p99 = 65ms
- No errors, stable performance
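For reproducibility, a hedged sketch of how these percentiles could be measured with pymilvus (query vectors are random placeholders; nprobe matches the production configuration):

```python
import time
import numpy as np
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("embeddings")
collection.load()

latencies_ms = []
for _ in range(1000):
    query = np.random.rand(384).astype(np.float32).tolist()
    start = time.perf_counter()
    collection.search(
        data=[query],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=5,
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"p95={np.percentile(latencies_ms, 95):.1f}ms  "
      f"p99={np.percentile(latencies_ms, 99):.1f}ms")
```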
Production Metrics (4 months):
- Avg Search Latency: 18ms p50, 32ms p95
- Uptime: 99.92% (2 planned maintenance windows)
- Cost: $235/month average
- Incidents: 1 (resolved in 45 minutes)
Last Updated: 2024-12-26
Review Date: 2025-02-28
Status: Active and performing excellently
"Speed matters: the fastest vector DB wins."