Vector Store - Milvus Database

Section: 4-data-architecture-governance
Document: Milvus Vector Database Architecture
Technology: Milvus 2.x
Deployment: Azure Container Instance
Purpose: Semantic search and RAG (Retrieval-Augmented Generation)


🎯 Overview

Milvus is the vector database powering MachineAvatars' Retrieval-Augmented Generation (RAG) system, enabling semantic search across user-uploaded documents, crawled websites, and Q&A pairs to provide contextual chatbot responses.

Key Statistics:

  • Total Vectors: ~1M (across all projects)
  • Vector Dimensions: 1536 (OpenAI) or 384 (bge-small)
  • Search Latency: <40ms (p95)
  • Index Type: IVF_FLAT
  • Distance Metric: Cosine similarity

πŸ—οΈ Milvus ArchitectureΒΆ

graph TB
    subgraph "Data Ingestion"
        UPLOAD[User Upload<br/>PDF/DOCX/TXT]
        CRAWL[Website Crawling]
        QNA[Q&A Pairs]
    end

    subgraph "Processing Pipeline"
        EXTRACT[Text Extraction]
        CHUNK[Text Chunking]
        EMBED[Embedding Generation<br/>OpenAI / bge-small]
    end

    subgraph "Milvus Vector DB"
        COLL[Collections<br/>Per Project]
        INDEX[IVF_FLAT Index<br/>nlist=1024]
        SEARCH[Similarity Search<br/>Cosine Distance]
    end

    subgraph "RAG Query"
        QUERY[User Question]
        QEMBED[Question Embedding]
        RETRIEVE[Top-K Retrieval<br/>k=3-5]
        LLM[LLM Context<br/>+ Response]
    end

    subgraph "Storage"
        BLOB[Azure Blob Storage<br/>Persistence Layer]
    end

    UPLOAD --> EXTRACT
    CRAWL --> EXTRACT
    QNA --> CHUNK
    EXTRACT --> CHUNK
    CHUNK --> EMBED
    EMBED --> COLL
    COLL --> INDEX

    QUERY --> QEMBED
    QEMBED --> SEARCH
    INDEX --> SEARCH
    SEARCH --> RETRIEVE
    RETRIEVE --> LLM

    COLL -.-> BLOB

    style EMBED fill:#FFF3E0
    style INDEX fill:#E3F2FD
    style SEARCH fill:#C8E6C9
    style LLM fill:#FFE082
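
The chunking step in the processing pipeline above is not shown in code elsewhere in this document. A minimal sketch follows, assuming a simple fixed-size character splitter with overlap; chunk_text, chunk_size, and overlap are illustrative names and values, not the production settings.

# Sketch of the "Text Chunking" stage: fixed-size chunks with overlap so that
# sentences spanning a boundary still appear intact in at least one chunk.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step back by `overlap` characters
    return chunks

chunks = chunk_text(extracted_text)  # extracted_text comes from the extraction stage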

📊 Collection Structure

Dynamic Collections

Naming Pattern: chatbot_vectors_{project_id}

Example Collections:

  • chatbot_vectors_User-123456_Project_1
  • chatbot_vectors_User-789456_Project_Support

Why Per-Project Collections?

  • ✅ Isolation (user A can't search user B's data)
  • ✅ Deletion (drop entire collection on chatbot delete)
  • ✅ Performance (smaller collections = faster search)
  • ❌ Many collections (management overhead)
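
As a rough illustration of this per-project layout, the sketch below shows how a collection could be created on first use and dropped on chatbot deletion with pymilvus; the helper names are illustrative, not the actual service API, and `schema` is the CollectionSchema defined in the next subsection.

# Sketch: get-or-create and drop of a per-project collection.
from pymilvus import Collection, utility

def get_or_create_collection(project_id: str, schema) -> Collection:
    name = f"chatbot_vectors_{project_id}"       # naming pattern described above
    if utility.has_collection(name):
        return Collection(name)
    return Collection(name=name, schema=schema)  # creates the collection if missing

def drop_project_collection(project_id: str) -> None:
    name = f"chatbot_vectors_{project_id}"
    if utility.has_collection(name):
        utility.drop_collection(name)            # hard delete on chatbot removal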

Vector Schema

Fields:

from pymilvus import CollectionSchema, FieldSchema, DataType

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),  # or 384
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=2048),
    FieldSchema(name="project_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="user_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="file_type", dtype=DataType.VARCHAR, max_length=50),  # "pdf", "url", "qna"
    FieldSchema(name="timestamp", dtype=DataType.INT64)  # Unix timestamp
]

schema = CollectionSchema(fields=fields, description="Chatbot knowledge base vectors")

Field Details:

| Field | Type | Dimension | Description |
|-------|------|-----------|-------------|
| id | INT64 | - | Auto-generated primary key |
| embedding | FLOAT_VECTOR | 1536 or 384 | Vector embedding |
| text | VARCHAR | max 65535 | Original text chunk |
| chunk_id | VARCHAR | max 256 | Unique chunk identifier |
| url | VARCHAR | max 2048 | Source URL (if crawled) |
| project_id | VARCHAR | max 256 | Chatbot project ID |
| user_id | VARCHAR | max 256 | Owner user ID |
| file_type | VARCHAR | max 50 | Source type |
| timestamp | INT64 | - | Creation time |

Sample Vector Document

{
    "id": 442893741234567,
    "embedding": [0.123, -0.456, 0.789, ..., 0.234],  // 1536 dimensions
    "text": "Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time. We are closed on weekends and public holidays.",
    "chunk_id": "User-123456_Project_1_chunk_0",
    "url": "https://example.com/contact",
    "project_id": "User-123456_Project_1",
    "user_id": "User-123456",
    "file_type": "url",
    "timestamp": 1735214400
}
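
For reference, here is a hedged sketch of how a single embedded chunk could be written into a project collection with pymilvus, matching the schema above; embedding_vector, text_chunk, source_url, and the other variables are assumed to be in scope, and row-based insert requires pymilvus 2.2+.

# Sketch: insert one chunk; "id" is omitted because the primary key is auto_id.
import time
from pymilvus import Collection

collection = Collection(f"chatbot_vectors_{project_id}")
collection.insert([{
    "embedding": embedding_vector,                      # 1536 (or 384) floats
    "text": text_chunk,
    "chunk_id": f"{project_id}_chunk_{chunk_index}",
    "url": source_url or "",
    "project_id": project_id,
    "user_id": user_id,
    "file_type": "pdf",                                 # "pdf", "url", or "qna"
    "timestamp": int(time.time())
}])
collection.flush()  # make the new vectors visible to search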

πŸ” Indexing StrategyΒΆ

IVF_FLAT Index

Configuration:

from pymilvus import Collection

index_params = {
    "metric_type": "COSINE",  # Cosine similarity
    "index_type": "IVF_FLAT",  # Inverted File with Flat (exact) search
    "params": {"nlist": 1024}  # Number of cluster units
}

collection.create_index(field_name="embedding", index_params=index_params)

Parameters:

  • metric_type: COSINE (cosine similarity, range: -1 to 1, higher = more similar)
  • index_type: IVF_FLAT (good balance of speed and accuracy)
  • nlist: 1024 (number of cluster centroids)

Search Parameters:

search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16}  # Number of clusters to search
}

Performance:

  • Build Time: ~10 seconds for 10K vectors
  • Search Latency: 20-40ms for top-5 results
  • Accuracy: 99%+ recall (near-exact)

Index Types Comparison

| Index Type | Speed | Accuracy | Memory | Use Case |
|-----------|-------|----------|--------|----------|
| IVF_FLAT | Fast | High (99%+) | Medium | Current - Production |
| IVF_SQ8 | Faster | High (98%+) | Low | Memory-constrained |
| IVF_PQ | Very Fast | Medium (95%+) | Very Low | Large-scale (>10M vectors) |
| HNSW | Very Fast | Very High (99.9%+) | High | Real-time, low-latency |
| FLAT | Slow | Perfect (100%) | High | Small datasets (<10K) |

Why IVF_FLAT?

  • ✅ Good balance for 1M vectors
  • ✅ High accuracy needed for RAG
  • ✅ Acceptable latency (<40ms)
  • ⏳ May migrate to HNSW at 5M+ vectors (see the HNSW sketch below)
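
If that migration happens, an HNSW index would be configured with graph parameters instead of nlist; the values below are illustrative starting points, not decided settings.

# Hedged sketch of possible HNSW index parameters (not currently in production).
hnsw_index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {
        "M": 16,                # max graph edges per node
        "efConstruction": 200   # build-time search breadth (higher = better recall, slower build)
    }
}
collection.create_index(field_name="embedding", index_params=hnsw_index_params)

# HNSW searches tune "ef" rather than "nprobe"
hnsw_search_params = {"metric_type": "COSINE", "params": {"ef": 64}}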

Search Flow

1. Generate Question Embedding

# Using OpenAI
from openai import OpenAI
client = OpenAI(api_key="...")

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="What are your business hours?"
)
question_embedding = response.data[0].embedding  # 1536 dimensions

2. Search Milvus

from pymilvus import Collection

collection = Collection(f"chatbot_vectors_{project_id}")
collection.load()  # Load to memory

results = collection.search(
    data=[question_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 16}},
    limit=5,  # Top-5 results
    output_fields=["text", "url", "file_type"]
)

3. Result Format

for hits in results:
    for hit in hits:
        print(f"Score: {hit.score:.4f}")  # 0.95 (very similar)
        print(f"Text: {hit.entity.get('text')}")
        print(f"URL: {hit.entity.get('url')}")
        print(f"Type: {hit.entity.get('file_type')}")

Sample Output:

Score: 0.9512
Text: Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time...
URL: https://example.com/contact
Type: url

Score: 0.8734
Text: For urgent matters outside business hours, please call our emergency line...
URL: https://example.com/support
Type: pdf

Score: 0.8156
Text: Q: When are you open? A: Mon-Fri 9-5 EST
URL: None
Type: qna
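
To hand these hits to the RAG step described in the next section, the retrieved text fields can be collected into a list of context chunks. A minimal sketch; the 0.7 score cutoff is illustrative, not a documented production threshold.

# Sketch: turn Milvus hits into context chunks for the prompt builder below.
context_chunks = []
for hit in results[0]:                    # results[0] = hits for the single query vector
    if hit.score >= 0.7:                  # keep only reasonably similar chunks
        context_chunks.append(hit.entity.get("text"))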

🤖 RAG Pipeline Integration

Complete RAG Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Response3D as Response 3D Service
    participant OpenAI as Azure OpenAI
    participant Milvus
    participant LLM as LLM Service

    User->>Frontend: "What are your hours?"
    Frontend->>Response3D: POST /get-response-3d

    Note over Response3D: Step 1: Embed Question
    Response3D->>OpenAI: Embed("What are your hours?")
    OpenAI-->>Response3D: [0.123, -0.456, ...]

    Note over Response3D: Step 2: Search Milvus
    Response3D->>Milvus: search(embedding, top_k=5)
    Milvus-->>Response3D: Top-5 similar chunks

    Note over Response3D: Step 3: Build Context
    Response3D->>Response3D: context = join(chunks)

    Note over Response3D: Step 4: Generate Response
    Response3D->>LLM: generate(context + question)
    LLM-->>Response3D: "We're open Mon-Fri 9-5 EST"

    Note over Response3D: Step 5: TTS + Lipsync
    Response3D->>Response3D: Generate audio + lipsync

    Response3D-->>Frontend: {text, audio, lipsync}
    Frontend-->>User: Avatar speaks response

Context Construction

Prompt Template:

def build_rag_prompt(question: str, context_chunks: list, system_prompt: str):
    context = "\n\n".join([f"[Source {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)])

    prompt = f"""
{system_prompt}

Context Information:
{context}

User Question: {question}

Instructions: Answer the question based on the provided context. If the context doesn't contain enough information, say so politely.

Answer:"""

    return prompt

Example:

You are Emma, a friendly customer support assistant.

Context Information:
[Source 1]: Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time.
[Source 2]: For urgent matters outside business hours, call our emergency line at 1-800-URGENT.
[Source 3]: Q: When are you open? A: Mon-Fri 9-5 EST

User Question: What are your business hours?

Instructions: Answer the question based on the provided context...

Answer:
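
Step 4 of the sequence diagram (response generation) has no code sample in this document. A minimal sketch using an OpenAI-style chat completion follows; the model name and client setup are assumptions, not the production LLM Service interface.

# Sketch: generate the answer from the RAG prompt built above.
from openai import OpenAI

client = OpenAI(api_key="...")

prompt = build_rag_prompt(
    question="What are your business hours?",
    context_chunks=context_chunks,
    system_prompt="You are Emma, a friendly customer support assistant."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",   # assumed model; the actual deployment is not specified here
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3
)
answer = completion.choices[0].message.content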

📈 Performance Optimization

Search Optimization

1. Collection Loading

# Load entire collection to memory for faster search
collection.load()

# Partial loading (if memory constrained)
collection.load(partition_names=["partition_2025"])

2. Batch Search

# Search multiple questions at once
question_embeddings = [emb1, emb2, emb3]  # Batch of 3
results = collection.search(
    data=question_embeddings,
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 16}},
    limit=5
)
# Faster than 3 individual searches

3. nprobe Tuning

# Lower nprobe = faster, less accurate
search_params = {"metric_type": "COSINE", "params": {"nprobe": 8}}   # Fast, ~97% accuracy

# Higher nprobe = slower, more accurate
search_params = {"metric_type": "COSINE", "params": {"nprobe": 32}}  # Slower, ~99.5% accuracy

# Sweet spot for our use case
search_params = {"metric_type": "COSINE", "params": {"nprobe": 16}}  # Balanced, ~99% accuracy

Embedding Model Comparison

| Model | Dimensions | Speed | Accuracy | Cost | Use Case |
|-------|-----------|-------|----------|------|----------|
| text-embedding-ada-002 | 1536 | Medium | High | $0.0001/1K | Current - Production |
| text-embedding-3-small | 1536 | Fast | Medium | $0.00002/1K | Cost-sensitive |
| text-embedding-3-large | 3072 | Slow | Very High | $0.00013/1K | High accuracy needed |
| bge-small-en-v1.5 | 384 | Very Fast | Medium | FREE | Alternative - Self-hosted |
| bge-large-en-v1.5 | 1024 | Medium | High | FREE | Self-hosted, high accuracy |

Current Usage:

  • OpenAI ada-002: Response services (query time)
  • bge-small: Data ingestion (document embedding)

Why Different Models?

  • OpenAI for query: Better semantic understanding
  • bge-small for docs: Cost savings (millions of embeddings)
  • ⚠️ Issue: Embedding space mismatch! (1536 vs 384 dimensions)

Solution: Use the same embedding model for both documents and queries (a sketch follows below).
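
A minimal sketch of that fix: route both ingestion and query embedding through one shared function so documents and questions always live in the same embedding space. The helper name is illustrative; ada-002 is used here because it is already the query-time model.

# Sketch: one embedding function shared by the ingestion and query paths.
from openai import OpenAI

client = OpenAI(api_key="...")
EMBEDDING_MODEL = "text-embedding-ada-002"   # 1536 dimensions for both paths

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]

doc_vectors = embed(chunks)                           # ingestion path
query_vector = embed(["What are your hours?"])[0]     # query path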


πŸ—‘οΈ Vector DeletionΒΆ

Hard Delete on Chatbot Deletion

from database.milvus_embeddings_service import get_milvus_embeddings_service

milvus_service = get_milvus_embeddings_service()

# Delete all vectors for a project
deleted_count = milvus_service.delete_embeddings_by_user_project(
    collection_name=f"chatbot_vectors_{project_id}",
    user_id=user_id,
    project_id=project_id
)

# Or drop entire collection
milvus_service.drop_collection(f"chatbot_vectors_{project_id}")

Note: Vectors are NOT moved to trash - hard deleted immediately!
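
For reference, the same hard delete can be expressed with raw pymilvus calls; a sketch follows (the internal service above is the actual code path).

# Sketch: delete by boolean expression, or drop the per-project collection outright.
from pymilvus import Collection, utility

collection = Collection(f"chatbot_vectors_{project_id}")

# Remove only this user's vectors for the project
collection.delete(expr=f'user_id == "{user_id}" and project_id == "{project_id}"')

# Or remove the whole collection
utility.drop_collection(f"chatbot_vectors_{project_id}")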


💾 Persistence & Backup

Azure Blob Storage Backend

Configuration:

import os
from pymilvus import connections

connections.connect(
    alias="default",
    host=os.getenv("MILVUS_HOST"),
    port=os.getenv("MILVUS_PORT", "19530")
)

Storage Path:

Azure Blob Container: milvus-data
Path: /collections/{collection_name}/

Persistence:

  • Vectors persisted to Azure Blob automatically
  • Metadata stored in Milvus
  • Checkpoints every 5 minutes

Backup Strategy

Nightly Snapshots:

# Azure CLI - Create snapshot
az disk create \
  --resource-group milvus-rg \
  --name milvus-snapshot-2025-01-15 \
  --source milvus-data-disk \
  --sku Standard_LRS

# Retention: 7 days

Recovery:

# Restore from snapshot
az disk create \
  --resource-group milvus-rg \
  --name milvus-data-disk \
  --source milvus-snapshot-2025-01-10

# Restart Milvus container
docker restart milvus-standalone

📊 Monitoring & Metrics

Key Metrics

Search Performance:

  • Search latency (p50, p95, p99) (see the timing sketch after these lists)
  • Queries per second (QPS)
  • Index building time
  • Memory usage

Data Metrics:

  • Total vectors
  • Vectors per collection
  • Collection count
  • Storage size

Health:

  • Milvus uptime
  • Connection pool status
  • Error rate
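
A lightweight way to obtain the latency percentiles listed above is to time searches client-side; a minimal sketch (not the actual monitoring stack), assuming `collection` and `question_embedding` from the search examples earlier.

# Sketch: client-side p50/p95/p99 search latency over repeated queries.
import time
import numpy as np

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    collection.search(
        data=[question_embedding],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"nprobe": 16}},
        limit=5
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")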

Milvus Metrics API

from pymilvus import Collection, utility

collection_name = f"chatbot_vectors_{project_id}"
collection = Collection(collection_name)

# Get collection stats (entity count)
print(f"Total entities: {collection.num_entities}")

# Get index building progress
progress = utility.index_building_progress(collection_name)
print(f"Index progress: {progress['indexed_rows']}/{progress['total_rows']}")

# Get loaded-segment info (includes per-segment memory usage)
segment_info = utility.get_query_segment_info(collection_name)


Progress: Section 4 - 3/8 files complete (37.5%)

"Vectors make semantics searchable." πŸ”βœ