Vector Store - Milvus Database
Section: 4-data-architecture-governance
Document: Milvus Vector Database Architecture
Technology: Milvus 2.x
Deployment: Azure Container Instance
Purpose: Semantic search and RAG (Retrieval-Augmented Generation)
Overview
Milvus is the vector database powering MachineAvatars' Retrieval-Augmented Generation (RAG) system, enabling semantic search across user-uploaded documents, crawled websites, and Q&A pairs to provide contextual chatbot responses.
Key Statistics:
- Total Vectors: ~1M (across all projects)
- Vector Dimensions: 1536 (OpenAI) or 384 (bge-small)
- Search Latency: <40ms (p95)
- Index Type: IVF_FLAT
- Distance Metric: Cosine similarity
Milvus Architecture
graph TB
subgraph "Data Ingestion"
UPLOAD[User Upload<br/>PDF/DOCX/TXT]
CRAWL[Website Crawling]
QNA[Q&A Pairs]
end
subgraph "Processing Pipeline"
EXTRACT[Text Extraction]
CHUNK[Text Chunking]
EMBED[Embedding Generation<br/>OpenAI / bge-small]
end
subgraph "Milvus Vector DB"
COLL[Collections<br/>Per Project]
INDEX[IVF_FLAT Index<br/>nlist=1024]
SEARCH[Similarity Search<br/>Cosine Distance]
end
subgraph "RAG Query"
QUERY[User Question]
QEMBED[Question Embedding]
RETRIEVE[Top-K Retrieval<br/>k=3-5]
LLM[LLM Context<br/>+ Response]
end
subgraph "Storage"
BLOB[Azure Blob Storage<br/>Persistence Layer]
end
UPLOAD --> EXTRACT
CRAWL --> EXTRACT
QNA --> CHUNK
EXTRACT --> CHUNK
CHUNK --> EMBED
EMBED --> COLL
COLL --> INDEX
QUERY --> QEMBED
QEMBED --> SEARCH
INDEX --> SEARCH
SEARCH --> RETRIEVE
RETRIEVE --> LLM
COLL -.-> BLOB
style EMBED fill:#FFF3E0
style INDEX fill:#E3F2FD
style SEARCH fill:#C8E6C9
style LLM fill:#FFE082
Collection Structure
Dynamic Collections
Naming Pattern: chatbot_vectors_{project_id}
Example Collections:
chatbot_vectors_User-123456_Project_1
chatbot_vectors_User-789456_Project_Support
Why Per-Project Collections?
- ✅ Isolation (user A can't search user B's data)
- ✅ Deletion (drop entire collection on chatbot delete)
- ✅ Performance (smaller collections = faster search)
- ❌ Many collections (management overhead)
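As a minimal sketch of the per-project naming convention above (assuming an already-open pymilvus connection; the helper name is illustrative, not the production service API), a collection can be resolved or created like so:
from pymilvus import Collection, CollectionSchema, utility

def get_or_create_project_collection(project_id: str, schema: CollectionSchema) -> Collection:
    # Per-project naming keeps one chatbot's vectors isolated from every other project
    name = f"chatbot_vectors_{project_id}"
    if utility.has_collection(name):
        return Collection(name)
    # Passing a schema creates the collection on the server if it does not exist yet
    return Collection(name=name, schema=schema)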
Vector Schema
Fields:
from pymilvus import CollectionSchema, FieldSchema, DataType
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536), # or 384
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=256),
FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=2048),
FieldSchema(name="project_id", dtype=DataType.VARCHAR, max_length=256),
FieldSchema(name="user_id", dtype=DataType.VARCHAR, max_length=256),
FieldSchema(name="file_type", dtype=DataType.VARCHAR, max_length=50), # "pdf", "url", "qna"
FieldSchema(name="timestamp", dtype=DataType.INT64) # Unix timestamp
]
schema = CollectionSchema(fields=fields, description="Chatbot knowledge base vectors")
Field Details:
| Field | Type | Dimension | Description |
|---|---|---|---|
| id | INT64 | - | Auto-generated primary key |
| embedding | FLOAT_VECTOR | 1536 or 384 | Vector embedding |
| text | VARCHAR | max 65535 | Original text chunk |
| chunk_id | VARCHAR | max 256 | Unique chunk identifier |
| url | VARCHAR | max 2048 | Source URL (if crawled) |
| project_id | VARCHAR | max 256 | Chatbot project ID |
| user_id | VARCHAR | max 256 | Owner user ID |
| file_type | VARCHAR | max 50 | Source type |
| timestamp | INT64 | - | Creation time |
Sample Vector Document
{
"id": 442893741234567,
"embedding": [0.123, -0.456, 0.789, ..., 0.234], // 1536 dimensions
"text": "Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time. We are closed on weekends and public holidays.",
"chunk_id": "User-123456_Project_1_chunk_0",
"url": "https://example.com/contact",
"project_id": "User-123456_Project_1",
"user_id": "User-123456",
"file_type": "url",
"timestamp": 1735214400
}
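For illustration only (a sketch against the schema above; the collection name and variable values are hypothetical), a chunk like this could be inserted with a column-ordered insert:
import time
from pymilvus import Collection

collection = Collection("chatbot_vectors_User-123456_Project_1")
embedding = [0.123, -0.456, 0.789] + [0.0] * 1533  # placeholder 1536-dim vector

# One column per schema field, in schema order, excluding the auto_id primary key
collection.insert([
    [embedding],                                                    # embedding
    ["Our business hours are Monday to Friday, 9 AM to 5 PM ET."],  # text
    ["User-123456_Project_1_chunk_0"],                              # chunk_id
    ["https://example.com/contact"],                                # url
    ["User-123456_Project_1"],                                      # project_id
    ["User-123456"],                                                # user_id
    ["url"],                                                        # file_type
    [int(time.time())]                                              # timestamp
])
collection.flush()  # persist and make the new vectors searchable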
Indexing Strategy
IVF_FLAT Index
Configuration:
from pymilvus import Collection

collection = Collection(f"chatbot_vectors_{project_id}")  # per-project collection

index_params = {
    "metric_type": "COSINE",   # Cosine similarity
    "index_type": "IVF_FLAT",  # Inverted File with Flat (exact) search
    "params": {"nlist": 1024}  # Number of cluster units
}
collection.create_index(field_name="embedding", index_params=index_params)
Parameters:
- metric_type: COSINE (cosine similarity, range: -1 to 1, higher = more similar)
- index_type: IVF_FLAT (good balance of speed and accuracy)
- nlist: 1024 (number of cluster centroids)
Search Parameters:
search_params = {
"metric_type": "COSINE",
"params": {"nprobe": 16} # Number of clusters to search
}
Performance:
- Build Time: ~10 seconds for 10K vectors
- Search Latency: 20-40ms for top-5 results
- Accuracy: 99%+ recall (near-exact)
Index Types Comparison
| Index Type | Speed | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| IVF_FLAT | Fast | High (99%+) | Medium | Current - Production |
| IVF_SQ8 | Faster | High (98%+) | Low | Memory-constrained |
| IVF_PQ | Very Fast | Medium (95%+) | Very Low | Large-scale (>10M vectors) |
| HNSW | Very Fast | Very High (99.9%+) | High | Real-time, low-latency |
| FLAT | Slow | Perfect (100%) | High | Small datasets (<10K) |
Why IVF_FLAT?
- ✅ Good balance for 1M vectors
- ✅ High accuracy needed for RAG
- ✅ Acceptable latency (<40ms)
- ⏳ May migrate to HNSW at 5M+ vectors (see the sketch below)
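Should that migration happen, an HNSW index could be configured roughly as follows (a sketch with commonly used starting parameters, not a tuned production setting):
# HNSW build parameters: M = graph connectivity, efConstruction = build-time quality
hnsw_index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 200}
}
collection.create_index(field_name="embedding", index_params=hnsw_index_params)

# At query time, "ef" replaces "nprobe" as the accuracy/latency knob
hnsw_search_params = {"metric_type": "COSINE", "params": {"ef": 64}}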
Similarity Search
Search Flow
1. Generate Question Embedding
# Using OpenAI
from openai import OpenAI
client = OpenAI(api_key="...")
response = client.embeddings.create(
model="text-embedding-ada-002",
input="What are your business hours?"
)
question_embedding = response.data[0].embedding # 1536 dimensions
2. Search Milvus
from pymilvus import Collection
collection = Collection(f"chatbot_vectors_{project_id}")
collection.load() # Load to memory
results = collection.search(
data=[question_embedding],
anns_field="embedding",
param={"metric_type": "COSINE", "params": {"nprobe": 16}},
limit=5, # Top-5 results
output_fields=["text", "url", "file_type"]
)
3. Result Format
for hits in results:
    for hit in hits:
        print(f"Score: {hit.score:.4f}")  # 0.95 (very similar)
        print(f"Text: {hit.entity.get('text')}")
        print(f"URL: {hit.entity.get('url')}")
        print(f"Type: {hit.entity.get('file_type')}")
Sample Output:
Score: 0.9512
Text: Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time...
URL: https://example.com/contact
Type: url
Score: 0.8734
Text: For urgent matters outside business hours, please call our emergency line...
URL: https://example.com/support
Type: pdf
Score: 0.8156
Text: Q: When are you open? A: Mon-Fri 9-5 EST
URL: None
Type: qna
RAG Pipeline Integration
Complete RAG Flow
sequenceDiagram
participant User
participant Frontend
participant Response3D as Response 3D Service
participant OpenAI as Azure OpenAI
participant Milvus
participant LLM as LLM Service
User->>Frontend: "What are your hours?"
Frontend->>Response3D: POST /get-response-3d
Note over Response3D: Step 1: Embed Question
Response3D->>OpenAI: Embed("What are your hours?")
OpenAI-->>Response3D: [0.123, -0.456, ...]
Note over Response3D: Step 2: Search Milvus
Response3D->>Milvus: search(embedding, top_k=5)
Milvus-->>Response3D: Top-5 similar chunks
Note over Response3D: Step 3: Build Context
Response3D->>Response3D: context = join(chunks)
Note over Response3D: Step 4: Generate Response
Response3D->>LLM: generate(context + question)
LLM-->>Response3D: "We're open Mon-Fri 9-5 EST"
Note over Response3D: Step 5: TTS + Lipsync
Response3D->>Response3D: Generate audio + lipsync
Response3D-->>Frontend: {text, audio, lipsync}
Frontend-->>User: Avatar speaks response
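Steps 1-3 of this sequence (embed, search, build context) can be condensed into a single retrieval helper. This is a sketch assuming the OpenAI Python client and pymilvus; it is not the Response 3D service's actual implementation:
from openai import OpenAI
from pymilvus import Collection

openai_client = OpenAI(api_key="...")  # assumed credentials

def retrieve_context(question: str, project_id: str, top_k: int = 5) -> list:
    # Step 1: embed the question with the same model used for stored vectors
    emb = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=question
    ).data[0].embedding
    # Step 2: similarity search against this project's collection only
    collection = Collection(f"chatbot_vectors_{project_id}")
    collection.load()
    results = collection.search(
        data=[emb],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"nprobe": 16}},
        limit=top_k,
        output_fields=["text"]
    )
    # Step 3: return the raw text chunks for prompt construction
    return [hit.entity.get("text") for hit in results[0]]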
Context Construction
Prompt Template:
def build_rag_prompt(question: str, context_chunks: list, system_prompt: str):
    context = "\n\n".join([f"[Source {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)])
    prompt = f"""
{system_prompt}
Context Information:
{context}
User Question: {question}
Instructions: Answer the question based on the provided context. If the context doesn't contain enough information, say so politely.
Answer:"""
    return prompt
Example:
You are Emma, a friendly customer support assistant.
Context Information:
[Source 1]: Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time.
[Source 2]: For urgent matters outside business hours, call our emergency line at 1-800-URGENT.
[Source 3]: Q: When are you open? A: Mon-Fri 9-5 EST
User Question: What are your business hours?
Instructions: Answer the question based on the provided context...
Answer:
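Finally, the assembled prompt is handed to the LLM. A minimal sketch using the OpenAI chat completions API (the model name, temperature, and variable names here are placeholders, not the production configuration):
# chunks from retrieve_context(...) above; question and system_prompt as in the template
completion = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": build_rag_prompt(question, chunks, system_prompt)}],
    temperature=0.3
)
answer = completion.choices[0].message.content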
Performance Optimization
Search Optimization
1. Collection Loading
# Load entire collection to memory for faster search
collection.load()
# Partial loading (if memory constrained)
collection.load(partition_names=["partition_2025"])
2. Batch Search
# Search multiple questions at once
question_embeddings = [emb1, emb2, emb3]  # Batch of 3
results = collection.search(
    data=question_embeddings,
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 16}},
    limit=5
)
# Faster than 3 individual searches
3. nprobe Tuning
# Lower nprobe = faster, less accurate
search_params = {"metric_type": "COSINE", "params": {"nprobe": 8}}   # Fast, ~97% accuracy
# Higher nprobe = slower, more accurate
search_params = {"metric_type": "COSINE", "params": {"nprobe": 32}}  # Slower, ~99.5% accuracy
# Sweet spot for our use case
search_params = {"metric_type": "COSINE", "params": {"nprobe": 16}}  # Balanced, ~99% accuracy
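A simple way to confirm the sweet spot empirically is to time a fixed query at each nprobe setting (a rough benchmarking sketch; recall measurement against exact FLAT results is omitted):
import time

def average_search_ms(collection, query_vectors, nprobe, runs=20):
    # Average wall-clock latency per search call, in milliseconds
    params = {"metric_type": "COSINE", "params": {"nprobe": nprobe}}
    start = time.perf_counter()
    for _ in range(runs):
        collection.search(data=query_vectors, anns_field="embedding", param=params, limit=5)
    return (time.perf_counter() - start) / runs * 1000

for nprobe in (8, 16, 32):
    print(nprobe, f"{average_search_ms(collection, [question_embedding], nprobe):.1f} ms")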
Embedding Model Comparison
| Model | Dimensions | Speed | Accuracy | Cost | Use Case |
|---|---|---|---|---|---|
| text-embedding-ada-002 | 1536 | Medium | High | $0.0001/1K | Current - Production |
| text-embedding-3-small | 1536 | Fast | Medium | $0.00002/1K | Cost-sensitive |
| text-embedding-3-large | 3072 | Slow | Very High | $0.00013/1K | High accuracy needed |
| bge-small-en-v1.5 | 384 | Very Fast | Medium | FREE | Alternative - Self-hosted |
| bge-large-en-v1.5 | 1024 | Medium | High | FREE | Self-hosted, high accuracy |
Current Usage:
- OpenAI ada-002: Response services (query time)
- bge-small: Data ingestion (document embedding)
Why Different Models?
- OpenAI for query: Better semantic understanding
- bge-small for docs: Cost savings (millions of embeddings)
- ⚠️ Issue: Embedding space mismatch! (1536-dim OpenAI vectors vs 384-dim bge vectors)
Solution: Use the same embedding model for queries and documents so both live in one vector space (see the sketch below)
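One way to close that gap, sketched below, is to serve the same self-hosted model on both paths with sentence-transformers (this would also mean a 384-dimension collection schema and re-embedding existing documents):
from sentence_transformers import SentenceTransformer

# One model for both documents and queries keeps everything in a single 384-dim space
bge = SentenceTransformer("BAAI/bge-small-en-v1.5")

doc_vectors = bge.encode(
    ["Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time."],
    normalize_embeddings=True
)  # ingestion path
query_vector = bge.encode("What are your business hours?", normalize_embeddings=True)  # query path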
Vector Deletion
Hard Delete on Chatbot Deletion
from database.milvus_embeddings_service import get_milvus_embeddings_service
milvus_service = get_milvus_embeddings_service()
# Delete all vectors for a project
deleted_count = milvus_service.delete_embeddings_by_user_project(
collection_name=f"chatbot_vectors_{project_id}",
user_id=user_id,
project_id=project_id
)
# Or drop entire collection
milvus_service.drop_collection(f"chatbot_vectors_{project_id}")
Note: Vectors are NOT moved to trash - hard deleted immediately!
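Underneath the service wrapper, the same effect can be sketched with raw pymilvus calls (illustrative only; delete-by-expression on non-primary-key fields requires a recent Milvus 2.x release):
from pymilvus import Collection, utility

name = f"chatbot_vectors_{project_id}"
if utility.has_collection(name):
    # Option A: delete only this user/project's vectors by boolean expression
    Collection(name).delete(expr=f'project_id == "{project_id}" and user_id == "{user_id}"')
    # Option B: drop the whole collection - schema, index, and data - in one call
    # utility.drop_collection(name)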
Persistence & Backup
Azure Blob Storage Backend
Configuration:
import os
from pymilvus import connections
connections.connect(
alias="default",
host=os.getenv("MILVUS_HOST"),
port=os.getenv("MILVUS_PORT", "19530")
)
Storage Path:
Persistence:
- Vectors persisted to Azure Blob automatically
- Metadata stored in Milvus
- Checkpoints every 5 minutes
Backup Strategy
Nightly Snapshots:
# Azure CLI - Create snapshot
az disk create \
--resource-group milvus-rg \
--name milvus-snapshot-2025-01-15 \
--source milvus-data-disk \
--sku Standard_LRS
# Retention: 7 days
Recovery:
# Restore from snapshot
az disk create \
--resource-group milvus-rg \
--name milvus-data-disk \
--source milvus-snapshot-2025-01-10
# Restart Milvus container
docker restart milvus-standalone
Monitoring & Metrics
Key Metrics
Search Performance:
- Search latency (p50, p95, p99)
- Queries per second (QPS)
- Index building time
- Memory usage
Data Metrics:
- Total vectors
- Vectors per collection
- Collection count
- Storage size
Health:
- Milvus uptime
- Connection pool status
- Error rate
Milvus Metrics API
from pymilvus import Collection, utility

collection = Collection(collection_name)
# Get collection entity count
print(f"Total entities: {collection.num_entities}")
# Get index building progress
progress = utility.index_building_progress(collection_name)
print(f"Index progress: {progress['indexed_rows']}/{progress['total_rows']}")
# Get memory usage
memory_usage = utility.get_query_segment_info(collection_name)
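These calls can be rolled into a small snapshot across all per-project collections (a sketch; scheduling and export to the monitoring stack are deployment-specific):
from pymilvus import Collection, utility

def vector_store_snapshot() -> dict:
    # Entity count per chatbot collection, plus totals for dashboards
    counts = {
        name: Collection(name).num_entities
        for name in utility.list_collections()
        if name.startswith("chatbot_vectors_")
    }
    return {
        "collections": len(counts),
        "total_vectors": sum(counts.values()),
        "per_collection": counts,
    }

print(vector_store_snapshot())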
Related Documentation
Services:
- Response 3D Service - RAG implementation
- Data Crawling Service - Embedding generation
Data Architecture:
- Database Schema - MongoDB files collection
- Index - Overall architecture
Features:
- Data Training - RAG strategy
Progress: Section 4 - 37.5% of files complete
"Vectors make semantics searchable."