Vector Store - Milvus Database

Section: 4-data-architecture-governance
Document: Milvus Vector Database Architecture
Technology: Milvus 2.x
Deployment: Azure Container Instance
Purpose: Semantic search and RAG (Retrieval-Augmented Generation)


🎯 Overview

Milvus is the vector database powering MachineAvatars' Retrieval-Augmented Generation (RAG) system, enabling semantic search across user-uploaded documents, crawled websites, and Q&A pairs to provide contextual chatbot responses.

Key Statistics:

  • Total Vectors: ~1M (across all projects)
  • Vector Dimensions: 1536 (OpenAI) or 384 (bge-small)
  • Search Latency: <40ms (p95)
  • Index Type: IVF_FLAT
  • Distance Metric: Cosine similarity

πŸ—οΈ Milvus ArchitectureΒΆ

graph TB
    subgraph "Data Ingestion"
        UPLOAD[User Upload<br/>PDF/DOCX/TXT]
        CRAWL[Website Crawling]
        QNA[Q&A Pairs]
    end

    subgraph "Processing Pipeline"
        EXTRACT[Text Extraction]
        CHUNK[Text Chunking]
        EMBED[Embedding Generation<br/>OpenAI / bge-small]
    end

    subgraph "Milvus Vector DB"
        COLL[Collections<br/>Per Project]
        INDEX[IVF_FLAT Index<br/>nlist=1024]
        SEARCH[Similarity Search<br/>Cosine Distance]
    end

    subgraph "RAG Query"
        QUERY[User Question]
        QEMBED[Question Embedding]
        RETRIEVE[Top-K Retrieval<br/>k=3-5]
        LLM[LLM Context<br/>+ Response]
    end

    subgraph "Storage"
        BLOB[Azure Blob Storage<br/>Persistence Layer]
    end

    UPLOAD --> EXTRACT
    CRAWL --> EXTRACT
    QNA --> CHUNK
    EXTRACT --> CHUNK
    CHUNK --> EMBED
    EMBED --> COLL
    COLL --> INDEX

    QUERY --> QEMBED
    QEMBED --> SEARCH
    INDEX --> SEARCH
    SEARCH --> RETRIEVE
    RETRIEVE --> LLM

    COLL -.-> BLOB

    style EMBED fill:#FFF3E0
    style INDEX fill:#E3F2FD
    style SEARCH fill:#C8E6C9
    style LLM fill:#FFE082
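
The chunking step in the processing pipeline above is not shown in code elsewhere in this document. A minimal sketch follows, assuming a simple fixed-size character splitter with overlap; chunk_text, chunk_size, and overlap are illustrative names and values, not the production settings.

# Sketch of the "Text Chunking" stage: fixed-size chunks with overlap so that
# sentences spanning a boundary still appear intact in at least one chunk.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step back by `overlap` characters
    return chunks

chunks = chunk_text(extracted_text)  # extracted_text comes from the extraction stage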

📊 Collection Structure

Dynamic Collections

Naming Pattern: chatbot_vectors_{project_id}

Example Collections:

  • chatbot_vectors_User-123456_Project_1
  • chatbot_vectors_User-789456_Project_Support

Why Per-Project Collections?

  • ✅ Isolation (user A can't search user B's data)
  • ✅ Deletion (drop entire collection on chatbot delete)
  • ✅ Performance (smaller collections = faster search)
  • ❌ Many collections (management overhead)
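
As a rough illustration of this per-project layout, the sketch below shows how a collection could be created on first use and dropped on chatbot deletion with pymilvus; the helper names are illustrative, not the actual service API, and `schema` is the CollectionSchema defined in the next subsection.

# Sketch: get-or-create and drop of a per-project collection.
from pymilvus import Collection, utility

def get_or_create_collection(project_id: str, schema) -> Collection:
    name = f"chatbot_vectors_{project_id}"       # naming pattern described above
    if utility.has_collection(name):
        return Collection(name)
    return Collection(name=name, schema=schema)  # creates the collection if missing

def drop_project_collection(project_id: str) -> None:
    name = f"chatbot_vectors_{project_id}"
    if utility.has_collection(name):
        utility.drop_collection(name)            # hard delete on chatbot removal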

Vector Schema

Fields:

from pymilvus import CollectionSchema, FieldSchema, DataType

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),  # or 384
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="url", dtype=DataType.VARCHAR, max_length=2048),
    FieldSchema(name="project_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="user_id", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="file_type", dtype=DataType.VARCHAR, max_length=50),  # "pdf", "url", "qna"
    FieldSchema(name="timestamp", dtype=DataType.INT64)  # Unix timestamp
]

schema = CollectionSchema(fields=fields, description="Chatbot knowledge base vectors")

Field Details:

| Field | Type | Dimension | Description |
|-------|------|-----------|-------------|
| id | INT64 | - | Auto-generated primary key |
| embedding | FLOAT_VECTOR | 1536 or 384 | Vector embedding |
| text | VARCHAR | max 65535 | Original text chunk |
| chunk_id | VARCHAR | max 256 | Unique chunk identifier |
| url | VARCHAR | max 2048 | Source URL (if crawled) |
| project_id | VARCHAR | max 256 | Chatbot project ID |
| user_id | VARCHAR | max 256 | Owner user ID |
| file_type | VARCHAR | max 50 | Source type |
| timestamp | INT64 | - | Creation time |

Sample Vector Document

{
    "id": 442893741234567,
    "embedding": [0.123, -0.456, 0.789, ..., 0.234],  // 1536 dimensions
    "text": "Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time. We are closed on weekends and public holidays.",
    "chunk_id": "User-123456_Project_1_chunk_0",
    "url": "https://example.com/contact",
    "project_id": "User-123456_Project_1",
    "user_id": "User-123456",
    "file_type": "url",
    "timestamp": 1735214400
}
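
For reference, here is a hedged sketch of how a single embedded chunk could be written into a project collection with pymilvus, matching the schema above; embedding_vector, text_chunk, source_url, and the other variables are assumed to be in scope, and row-based insert requires pymilvus 2.2+.

# Sketch: insert one chunk; "id" is omitted because the primary key is auto_id.
import time
from pymilvus import Collection

collection = Collection(f"chatbot_vectors_{project_id}")
collection.insert([{
    "embedding": embedding_vector,                      # 1536 (or 384) floats
    "text": text_chunk,
    "chunk_id": f"{project_id}_chunk_{chunk_index}",
    "url": source_url or "",
    "project_id": project_id,
    "user_id": user_id,
    "file_type": "pdf",                                 # "pdf", "url", or "qna"
    "timestamp": int(time.time())
}])
collection.flush()  # make the new vectors visible to search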

πŸ” Indexing StrategyΒΆ

IVF_FLAT Index

Configuration:

from pymilvus import Collection

index_params = {
    "metric_type": "COSINE",  # Cosine similarity
    "index_type": "IVF_FLAT",  # Inverted File with Flat (exact) search
    "params": {"nlist": 1024}  # Number of cluster units
}

collection.create_index(field_name="embedding", index_params=index_params)

Parameters:

  • metric_type: COSINE (cosine similarity, range: -1 to 1, higher = more similar)
  • index_type: IVF_FLAT (good balance of speed and accuracy)
  • nlist: 1024 (number of cluster centroids)

Search Parameters:

search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16}  # Number of clusters to search
}

Performance:

  • Build Time: ~10 seconds for 10K vectors
  • Search Latency: 20-40ms for top-5 results
  • Accuracy: 99%+ recall (near-exact)

Index Types Comparison

| Index Type | Speed | Accuracy | Memory | Use Case |
|-----------|-------|----------|--------|----------|
| IVF_FLAT | Fast | High (99%+) | Medium | Current - Production |
| IVF_SQ8 | Faster | High (98%+) | Low | Memory-constrained |
| IVF_PQ | Very Fast | Medium (95%+) | Very Low | Large-scale (>10M vectors) |
| HNSW | Very Fast | Very High (99.9%+) | High | Real-time, low-latency |
| FLAT | Slow | Perfect (100%) | High | Small datasets (<10K) |

Why IVF_FLAT?

  • ✅ Good balance for 1M vectors
  • ✅ High accuracy needed for RAG
  • ✅ Acceptable latency (<40ms)
  • ⏳ May migrate to HNSW at 5M+ vectors (see the HNSW sketch below)
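
If that migration happens, an HNSW index would be configured with graph parameters instead of nlist; the values below are illustrative starting points, not decided settings.

# Hedged sketch of possible HNSW index parameters (not currently in production).
hnsw_index_params = {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {
        "M": 16,                # max graph edges per node
        "efConstruction": 200   # build-time search breadth (higher = better recall, slower build)
    }
}
collection.create_index(field_name="embedding", index_params=hnsw_index_params)

# HNSW searches tune "ef" rather than "nprobe"
hnsw_search_params = {"metric_type": "COSINE", "params": {"ef": 64}}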

Search Flow

1. Generate Question Embedding

# Using OpenAI
from openai import OpenAI
client = OpenAI(api_key="...")

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="What are your business hours?"
)
question_embedding = response.data[0].embedding  # 1536 dimensions

2. Search Milvus

from pymilvus import Collection

collection = Collection(f"chatbot_vectors_{project_id}")
collection.load()  # Load to memory

results = collection.search(
    data=[question_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 16}},
    limit=5,  # Top-5 results
    output_fields=["text", "url", "file_type"]
)

3. Result Format

for hits in results:
    for hit in hits:
        print(f"Score: {hit.score:.4f}")  # 0.95 (very similar)
        print(f"Text: {hit.entity.get('text')}")
        print(f"URL: {hit.entity.get('url')}")
        print(f"Type: {hit.entity.get('file_type')}")

Sample Output:

Score: 0.9512
Text: Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time...
URL: https://example.com/contact
Type: url

Score: 0.8734
Text: For urgent matters outside business hours, please call our emergency line...
URL: https://example.com/support
Type: pdf

Score: 0.8156
Text: Q: When are you open? A: Mon-Fri 9-5 EST
URL: None
Type: qna
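
To hand these hits to the RAG step described in the next section, the retrieved text fields can be collected into a list of context chunks. A minimal sketch; the 0.7 score cutoff is illustrative, not a documented production threshold.

# Sketch: turn Milvus hits into context chunks for the prompt builder below.
context_chunks = []
for hit in results[0]:                    # results[0] = hits for the single query vector
    if hit.score >= 0.7:                  # keep only reasonably similar chunks
        context_chunks.append(hit.entity.get("text"))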

🤖 RAG Pipeline Integration

Complete RAG Flow

sequenceDiagram
    participant User
    participant Frontend
    participant Response3D as Response 3D Service
    participant OpenAI as Azure OpenAI
    participant Milvus
    participant LLM as LLM Service

    User->>Frontend: "What are your hours?"
    Frontend->>Response3D: POST /get-response-3d

    Note over Response3D: Step 1: Embed Question
    Response3D->>OpenAI: Embed("What are your hours?")
    OpenAI-->>Response3D: [0.123, -0.456, ...]

    Note over Response3D: Step 2: Search Milvus
    Response3D->>Milvus: search(embedding, top_k=5)
    Milvus-->>Response3D: Top-5 similar chunks

    Note over Response3D: Step 3: Build Context
    Response3D->>Response3D: context = join(chunks)

    Note over Response3D: Step 4: Generate Response
    Response3D->>LLM: generate(context + question)
    LLM-->>Response3D: "We're open Mon-Fri 9-5 EST"

    Note over Response3D: Step 5: TTS + Lipsync
    Response3D->>Response3D: Generate audio + lipsync

    Response3D-->>Frontend: {text, audio, lipsync}
    Frontend-->>User: Avatar speaks response

Context Construction

Prompt Template:

def build_rag_prompt(question: str, context_chunks: list, system_prompt: str):
    context = "\n\n".join([f"[Source {i+1}]: {chunk}" for i, chunk in enumerate(context_chunks)])

    prompt = f"""
{system_prompt}

Context Information:
{context}

User Question: {question}

Instructions: Answer the question based on the provided context. If the context doesn't contain enough information, say so politely.

Answer:"""

    return prompt

Example:

You are Emma, a friendly customer support assistant.

Context Information:
[Source 1]: Our business hours are Monday to Friday, 9 AM to 5 PM Eastern Time.
[Source 2]: For urgent matters outside business hours, call our emergency line at 1-800-URGENT.
[Source 3]: Q: When are you open? A: Mon-Fri 9-5 EST

User Question: What are your business hours?

Instructions: Answer the question based on the provided context...

Answer:
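
Step 4 of the sequence diagram (response generation) has no code sample in this document. A minimal sketch using an OpenAI-style chat completion follows; the model name and client setup are assumptions, not the production LLM Service interface.

# Sketch: generate the answer from the RAG prompt built above.
from openai import OpenAI

client = OpenAI(api_key="...")

prompt = build_rag_prompt(
    question="What are your business hours?",
    context_chunks=context_chunks,
    system_prompt="You are Emma, a friendly customer support assistant."
)

completion = client.chat.completions.create(
    model="gpt-4o-mini",   # assumed model; the actual deployment is not specified here
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3
)
answer = completion.choices[0].message.content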

📈 Performance Optimization

Search Optimization

1. Collection Loading

# Load entire collection to memory for faster search
collection.load()

# Partial loading (if memory constrained)
collection.load(partition_names=["partition_2025"])

2. Batch Search

# Search multiple questions at once
question_embeddings = [emb1, emb2, emb3]  # Batch of 3
results = collection.search(
    data=question_embeddings,
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"nprobe": 16}},
    limit=5
)
# Faster than 3 individual searches

3. nprobe Tuning

# Lower nprobe = faster, less accurate
search_params = {"metric_type": "COSINE", "params": {"nprobe": 8}}   # Fast, ~97% accuracy

# Higher nprobe = slower, more accurate
search_params = {"metric_type": "COSINE", "params": {"nprobe": 32}}  # Slower, ~99.5% accuracy

# Sweet spot for our use case
search_params = {"metric_type": "COSINE", "params": {"nprobe": 16}}  # Balanced, ~99% accuracy

Embedding Model Comparison

| Model | Dimensions | Speed | Accuracy | Cost | Use Case |
|-------|-----------|-------|----------|------|----------|
| text-embedding-ada-002 | 1536 | Medium | High | $0.0001/1K | Current - Production |
| text-embedding-3-small | 1536 | Fast | Medium | $0.00002/1K | Cost-sensitive |
| text-embedding-3-large | 3072 | Slow | Very High | $0.00013/1K | High accuracy needed |
| bge-small-en-v1.5 | 384 | Very Fast | Medium | FREE | Alternative - Self-hosted |
| bge-large-en-v1.5 | 1024 | Medium | High | FREE | Self-hosted, high accuracy |

Current Usage:

  • OpenAI ada-002: Response services (query time)
  • bge-small: Data ingestion (document embedding)

Why Different Models?

  • OpenAI for query: Better semantic understanding
  • bge-small for docs: Cost savings (millions of embeddings)
  • ⚠️ Issue: Embedding space mismatch! (1536 vs 384 dimensions)

Solution: Use the same embedding model for both documents and queries (a sketch follows below).
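
A minimal sketch of that fix: route both ingestion and query embedding through one shared function so documents and questions always live in the same embedding space. The helper name is illustrative; ada-002 is used here because it is already the query-time model.

# Sketch: one embedding function shared by the ingestion and query paths.
from openai import OpenAI

client = OpenAI(api_key="...")
EMBEDDING_MODEL = "text-embedding-ada-002"   # 1536 dimensions for both paths

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]

doc_vectors = embed(chunks)                           # ingestion path
query_vector = embed(["What are your hours?"])[0]     # query path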


πŸ—‘οΈ Vector DeletionΒΆ

Hard Delete on Chatbot Deletion

from database.milvus_embeddings_service import get_milvus_embeddings_service

milvus_service = get_milvus_embeddings_service()

# Delete all vectors for a project
deleted_count = milvus_service.delete_embeddings_by_user_project(
    collection_name=f"chatbot_vectors_{project_id}",
    user_id=user_id,
    project_id=project_id
)

# Or drop entire collection
milvus_service.drop_collection(f"chatbot_vectors_{project_id}")

Note: Vectors are NOT moved to trash - hard deleted immediately!
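
For reference, the same hard delete can be expressed with raw pymilvus calls; a sketch follows (the internal service above is the actual code path).

# Sketch: delete by boolean expression, or drop the per-project collection outright.
from pymilvus import Collection, utility

collection = Collection(f"chatbot_vectors_{project_id}")

# Remove only this user's vectors for the project
collection.delete(expr=f'user_id == "{user_id}" and project_id == "{project_id}"')

# Or remove the whole collection
utility.drop_collection(f"chatbot_vectors_{project_id}")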


💾 Persistence & Backup

Azure Blob Storage Backend

Configuration:

import os
from pymilvus import connections

connections.connect(
    alias="default",
    host=os.getenv("MILVUS_HOST"),
    port=os.getenv("MILVUS_PORT", "19530")
)

Storage Path:

Azure Blob Container: milvus-data
Path: /collections/{collection_name}/

Persistence:

  • Vectors persisted to Azure Blob automatically
  • Metadata stored in Milvus
  • Checkpoints every 5 minutes

Backup Strategy

Nightly Snapshots:

# Azure CLI - Create snapshot
az disk create \
  --resource-group milvus-rg \
  --name milvus-snapshot-2025-01-15 \
  --source milvus-data-disk \
  --sku Standard_LRS

# Retention: 7 days

Recovery:

# Restore from snapshot
az disk create \
  --resource-group milvus-rg \
  --name milvus-data-disk \
  --source milvus-snapshot-2025-01-10

# Restart Milvus container
docker restart milvus-standalone

📊 Monitoring & Metrics

Key Metrics

Search Performance:

  • Search latency (p50, p95, p99) (see the timing sketch after these lists)
  • Queries per second (QPS)
  • Index building time
  • Memory usage

Data Metrics:

  • Total vectors
  • Vectors per collection
  • Collection count
  • Storage size

Health:

  • Milvus uptime
  • Connection pool status
  • Error rate
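
A lightweight way to obtain the latency percentiles listed above is to time searches client-side; a minimal sketch (not the actual monitoring stack), assuming `collection` and `question_embedding` from the search examples earlier.

# Sketch: client-side p50/p95/p99 search latency over repeated queries.
import time
import numpy as np

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    collection.search(
        data=[question_embedding],
        anns_field="embedding",
        param={"metric_type": "COSINE", "params": {"nprobe": 16}},
        limit=5
    )
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")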

Milvus Metrics API

from pymilvus import Collection, utility

collection_name = f"chatbot_vectors_{project_id}"
collection = Collection(collection_name)

# Get collection stats (entity count)
print(f"Total entities: {collection.num_entities}")

# Get index building progress
progress = utility.index_building_progress(collection_name)
print(f"Index progress: {progress['indexed_rows']}/{progress['total_rows']}")

# Get loaded-segment info (includes per-segment memory usage)
segment_info = utility.get_query_segment_info(collection_name)


Progress: Section 4 - 3/8 files complete (37.5%)

"Vectors make semantics searchable." πŸ”βœ