Embedding Strategy & Implementation¶
Purpose: Complete documentation of embedding model, generation process, and optimization strategies
Audience: ML Engineers, Backend Developers
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0
Embedding Model¶
Model Specification¶
Model: BAAI/bge-small-en-v1.5
Provider: Beijing Academy of Artificial Intelligence (BAAI)
Type: Sentence Transformer (BERT-based)
Architecture: Small-scale BERT for efficient embeddings
Technical Specifications¶
| Parameter | Value | Description |
|---|---|---|
| Model Name | BAAI/bge-small-en-v1.5 | Official HuggingFace identifier |
| Dimensions | 384 | Output vector dimensionality |
| Max Input Length | 512 tokens | Maximum sequence length |
| Model Size | ~34 MB | Small footprint for fast loading |
| Language | English (v1.5) | Optimized for English text |
| Embedding Speed | ~50ms p50, ~100ms p95 | Per embedding generation |
Why BAAI/bge-small-en-v1.5?¶
Advantages:
- Compact Size: 384 dimensions vs. 768 (bge-base) or 1536 (OpenAI text-embedding-ada-002)
  - 50% faster search
  - 50% less storage than 768-dim models (1,536 bytes vs. 3,072 bytes per vector)
  - Better Milvus performance
- Fast Inference: Small model = fast embedding generation
  - Average: 50ms per embedding
  - Can process ~20 embeddings/second on CPU
- Strong Performance:
  - Competitive with larger models on semantic similarity tasks
  - Optimized for retrieval tasks (RAG)
- Cost-Effective:
  - Self-hosted (no API costs)
  - Runs on CPU (no GPU required)
  - Small memory footprint (~100 MB RAM)
- Production-Ready:
  - Battle-tested on the MTEB benchmark
  - Stable v1.5 release
  - Good documentation and community support
Alternative Models Considered:
| Model | Dimensions | Why NOT Chosen |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | Slightly lower accuracy |
| BAAI/bge-large-en-v1.5 | 1024 | 3x slower, 2.7x larger storage |
| text-embedding-ada-002 (OpenAI) | 1536 | API costs ($0.0001/1K tokens), vendor lock-in |
| text-embedding-3-small (OpenAI) | 512-1536 | API costs, unnecessary complexity |
Embedding Generation Process¶
Implementation¶
Code Location: data-crawling-service/src/main.py
from typing import List

from fastembed import Embedding

# Initialize embedder (singleton, loaded once at service startup)
embedder = Embedding(
    model_name="BAAI/bge-small-en-v1.5",
    max_length=512  # Truncate longer texts
)

# Generate embedding for a chunk
def generate_embedding(text: str) -> List[float]:
    """
    Generate a 384-dimensional embedding vector.

    Args:
        text: Input text (max 512 tokens)

    Returns:
        List of 384 floats
    """
    # embed() returns a generator; convert to list
    embedding = list(embedder.embed([text]))[0]
    # Convert to a list of Python floats (Milvus requirement)
    embedding_list = [float(x) for x in embedding]
    return embedding_list  # 384 floats
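A minimal sanity check of the function above (a sketch, assuming the singleton embedder has loaded):

```python
vector = generate_embedding("What pricing plans do you offer?")
assert len(vector) == 384                        # matches the collection schema
assert all(isinstance(x, float) for x in vector) # plain floats, as Milvus expects
```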
Embedding Pipeline¶
Step 1: Text Preprocessing
Before embedding, text is cleaned:
import re

def preprocess_text(text: str) -> str:
    """
    Clean text before embedding to improve quality.
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', ' ', text)
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
    # Remove phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '', text)
    # Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
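For illustration, here is how the helper strips a noisy snippet (hypothetical input):

```python
raw = "<p>Contact us at support@example.com or 555-123-4567. Visit https://example.com</p>"
print(preprocess_text(raw))
# -> "Contact us at or . Visit"
```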
Why preprocess?
- Removes noise (HTML, emails, phone numbers)
- Focuses embedding on semantic content
- Improves retrieval relevance
Step 2: Chunking
Text is split into 1000-character chunks with 200-character overlap:
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Fixed-size character chunking with overlap.

    Chunk size: 1000 characters ≈ 250 tokens (well within the 512-token limit)
    Overlap: 200 characters preserves context across boundaries
    """
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk_content = text[start:end]
        chunks.append({
            "chunk_index": len(chunks),
            "content": chunk_content,
            "start_pos": start,
            "end_pos": end,
            "length": len(chunk_content)
        })
        # Move the start position back by the overlap (200 chars)
        start = end - overlap
        if end == text_length:
            break
    return chunks
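A quick check of the boundaries this produces for a 2,500-character input (illustrative):

```python
chunks = chunk_text("x" * 2500)
print([(c["start_pos"], c["end_pos"]) for c in chunks])
# -> [(0, 1000), (800, 1800), (1600, 2500)]  (200-character overlap between neighbours)
```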
Why 1000/200?
- 1000 chars ≈ 250 tokens (safe margin from 512 max)
- 200-char overlap ensures sentences aren't cut mid-context
- Balances granularity vs. search performance
Step 3: Embedding Generation
Each chunk is embedded independently:
for chunk in chunks:
    if chunk['content'].strip():
        try:
            # Generate embedding
            embedding = list(embedder.embed([chunk['content']]))[0]
            chunk['embedding'] = [float(x) for x in embedding]
        except Exception as e:
            logger.warning(f"Embedding failed for chunk {chunk['chunk_index']}: {e}")
            chunk['embedding'] = None  # Skip this chunk
Error Handling:
- If embedding fails (rare), chunk is skipped
- Doesn't block entire document processing
- Logged for debugging
Step 4: Milvus Insertion
Embeddings are inserted into the partition-based Milvus collection:
milvus_embeddings.insert_embeddings(
    collection_name="embeddings",
    embeddings_data=[{
        "document_id": doc_id,
        "user_id": user_id,
        "project_id": project_id,            # Partition key
        "chunk_index": chunk['chunk_index'],
        "text": chunk['content'][:2000],     # Truncate for storage
        "embedding": chunk['embedding'],     # 384 floats
        "data_type": "pdf",                  # pdf, text, qa, url, org
        "source_url": source_url or "",
        "created_at": datetime.utcnow().isoformat()
    }]
)
Query-Time Embedding¶
When a user asks a question:
def retrieve_relevant_documents(user_id: str, project_id: str, question: str, top_k: int = 5):
    """
    Generate query embedding and search Milvus.
    """
    # Step 1: Embed the query
    question_embedding = list(embedder.embed([question]))[0]
    question_embedding_list = [float(x) for x in question_embedding]

    # Step 2: Search Milvus (partition-scoped)
    search_results = milvus_embeddings.search_embeddings(
        collection_name="embeddings",
        query_vector=question_embedding_list,  # 384-dim
        user_id=user_id,
        project_id=project_id,                 # Searches ONLY this partition
        top_k=top_k
    )

    # Step 3: Format results
    return [{
        "content": result.get("text", ""),
        "similarity": result.get("score", 0.0),
        "document_id": result.get("document_id", ""),
        "chunk_index": result.get("chunk_index", 0)
    } for result in search_results]
Performance: ~50-100ms total (50ms embedding + 15-35ms Milvus search)
Similarity Metric: L2 Distance¶
Milvus Configuration:
# Collection index parameters
index_params = {
    "metric_type": "L2",        # Euclidean distance
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128}
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10}
}
Why L2 (Euclidean Distance)?
L2 distance measures straight-line distance between vectors:
Lower distance = More similar
L2 vs. Cosine Similarity:
| Metric | Use Case | MachineAvatars Choice |
|---|---|---|
| Cosine | Direction matters, magnitude doesn't | ❌ Not used |
| L2 | Both direction AND magnitude matter | ✅ Used |
For normalized embeddings (like BAAI/bge), L2 and cosine give similar results, but L2 is slightly faster in Milvus.
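The equivalence can be seen directly: for unit-length vectors, ||a - b||² = 2 · (1 - cos(a, b)), so both metrics produce the same ranking. A small sketch with random stand-in vectors (not real embeddings), assuming the vectors are L2-normalized as bge models typically are:

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # normalize to unit length

cosine = float(a @ b)
l2_squared = float(((a - b) ** 2).sum())
print(round(l2_squared, 6), round(2 * (1 - cosine), 6))  # the two values match
```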
Score Conversion:
Milvus returns L2 distance. We convert to similarity score (0-1):
similarity_score = 1 / (1 + distance)
# Examples:
# distance = 0.0 → score = 1.0 (identical)
# distance = 1.0 → score = 0.5 (moderate similarity)
# distance = 10.0 → score = 0.09 (very different)
Performance Optimization¶
1. Batch Embedding¶
For multiple chunks, embed in batches:
# Slow (sequential): one embed() call per chunk
for chunk in chunks:
    embedding = list(embedder.embed([chunk['content']]))[0]

# Fast (batch): a single embed() call for all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = list(embedder.embed(chunk_texts))  # All at once
for i, chunk in enumerate(chunks):
    chunk['embedding'] = [float(x) for x in embeddings[i]]
Speedup: 3-5x faster for batch processing
2. Caching¶
Frequently asked questions are pre-embedded:
# Cache common queries
QUERY_CACHE = {}

def get_query_embedding(question: str):
    if question in QUERY_CACHE:
        return QUERY_CACHE[question]
    embedding = list(embedder.embed([question]))[0]
    embedding_list = [float(x) for x in embedding]
    QUERY_CACHE[question] = embedding_list
    return embedding_list
Hit Rate: ~15% (common FAQs)
3. Singleton Embedder¶
Embedder loaded once at service startup:
# Service startup (once)
embedder = Embedding(model_name="BAAI/bge-small-en-v1.5", max_length=512)

# Then reuse for all requests (fast)
embedding = list(embedder.embed([text]))[0]
vs. Loading per-request:
- Startup: ~2 seconds (loads model weights)
- Per-request with singleton: ~50ms
- Per-request without singleton: ~2000ms + 50ms
Savings: 40x faster
Storage Efficiency¶
Embedding Size¶
Per embedding:
- 384 dimensions × 4 bytes/float = 1,536 bytes = 1.5 KB
Per document (50 chunks):
- 50 chunks × 1.5 KB = 75 KB
For 1 million chunks:
- 1M × 1.5 KB = 1.5 GB
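The same arithmetic as a quick back-of-the-envelope script (raw float32 vectors only; Milvus index and metadata overhead are not included):

```python
def vector_storage_bytes(num_vectors: int, dims: int = 384, bytes_per_float: int = 4) -> int:
    """Raw storage for float32 embedding vectors."""
    return num_vectors * dims * bytes_per_float

print(vector_storage_bytes(1))          # 1,536 bytes ≈ 1.5 KB per embedding
print(vector_storage_bytes(50))         # 76,800 bytes ≈ 75 KB per 50-chunk document
print(vector_storage_bytes(1_000_000))  # ~1.5 GB for 1M chunks
```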
Comparison¶
| Model | Dimensions | Storage per 1M Chunks |
|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | 1.5 GB |
| BAAI/bge-large-en-v1.5 | 1024 | 4.0 GB |
| OpenAI ada-002 | 1536 | 6.0 GB |
Savings: 75% less storage vs. OpenAI embeddings
Quality Metrics¶
MTEB Benchmark Performance¶
BAAI/bge-small-en-v1.5 Scores:
| Task | Score | Notes |
|---|---|---|
| Retrieval | 51.7 | Good for RAG use cases |
| Semantic Similarity | 68.2 | Strong performance |
| Classification | 61.4 | Not our primary use case |
| Clustering | 42.1 | Not our primary use case |
Overall MTEB Score: 58.9 / 100
vs. Alternatives:
- all-MiniLM-L6-v2: 56.3
- BAAI/bge-large-en-v1.5: 63.6 (slower, bigger)
- text-embedding-ada-002: 60.0 (API costs)
Conclusion: Best performance/cost trade-off for RAG
Monitoring & Observability¶
Key Metrics¶
logger.info("Embedding generation", extra={
    "model": "BAAI/bge-small-en-v1.5",
    "text_length": len(text),
    "embedding_dims": len(embedding),
    "generation_time_ms": elapsed_ms
})
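For context, generation_time_ms can be captured around the embed() call with time.perf_counter. A sketch; the service's actual instrumentation may differ, and the helper name is hypothetical:

```python
import time

def timed_embedding(text: str):
    """Embed one text and return (vector, elapsed_ms) for the log fields above."""
    start = time.perf_counter()
    embedding = list(embedder.embed([text]))[0]   # embedder: the singleton defined earlier
    elapsed_ms = (time.perf_counter() - start) * 1000
    return [float(x) for x in embedding], elapsed_ms
```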
Collected Metrics¶
| Metric | Target | Alert Threshold |
|---|---|---|
| Embedding latency (p95) | < 200ms | > 500ms |
| Embedding failures | < 0.1% | > 1% |
| Model load time | < 3s | > 10s |
| Memory usage | < 200 MB | > 500 MB |
Troubleshooting¶
Issue: Slow Embedding Generation¶
Symptoms: Embedding takes > 500ms
Causes:
- Model not cached (reloading every request)
- Large text (> 512 tokens)
- CPU overload
Solutions:
- Use singleton embedder pattern
- Chunk text before embedding
- Scale horizontally (add more service instances)
Issue: Low Retrieval Relevance¶
Symptoms: Retrieved chunks not relevant to query
Causes:
- Poor text preprocessing
- Chunks too large/small
- Query phrasing mismatch
Solutions:
- Improve preprocessing (remove noise)
- Adjust chunk size (test 500/800/1000)
- Query expansion (add synonyms)
- Hybrid search (combine with BM25 keyword search)
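As an illustration of the last point, hybrid search can be as simple as a weighted fusion of normalized vector-similarity and keyword (e.g. BM25) scores. This is a sketch, not the production implementation; function names and the alpha weighting are hypothetical:

```python
def _min_max_normalize(scores: dict) -> dict:
    """Rescale scores to [0, 1] so the two score types are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_scores(vector_scores: dict, keyword_scores: dict, alpha: float = 0.7) -> dict:
    """Weighted fusion: alpha weights the embedding score, (1 - alpha) the keyword score."""
    v = _min_max_normalize(vector_scores)
    k = _min_max_normalize(keyword_scores)
    return {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }

# Example usage with illustrative scores keyed by chunk ID
print(hybrid_scores({"c1": 0.82, "c2": 0.61, "c3": 0.40}, {"c2": 12.4, "c3": 3.1}))
```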
Issue: Memory Leak¶
Symptoms: Service memory grows over time
Causes:
- Embeddings not garbage collected
- Cache unbounded growth
Solutions:
# Limit cache size
from functools import lru_cache

@lru_cache(maxsize=1000)  # Max 1000 cached queries
def get_query_embedding(question: str):
    ...
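Combining the earlier cache helper with lru_cache, a bounded version could look like the sketch below; the tuple return keeps the cached value immutable (the function name is hypothetical):

```python
from functools import lru_cache

@lru_cache(maxsize=1000)                      # bounds memory: at most 1000 cached queries
def get_query_embedding_cached(question: str) -> tuple:
    # Return a tuple so callers cannot mutate the cached value in place.
    embedding = list(embedder.embed([question]))[0]
    return tuple(float(x) for x in embedding)
```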
Future Enhancements¶
Planned Improvements¶
Q1 2025:
- Fine-tune BAAI/bge-small on MachineAvatars domain data
- A/B test vs. BAAI/bge-base-en-v1.5 (768 dims)
- Implement query expansion
Q2 2025:
- Multi-lingual embeddings (support Hindi, Spanish)
- Hybrid search (embeddings + BM25)
- Embedding quantization (reduce to 256 dims)
Q3 2025:
- Custom embedding model training
- Late interaction embeddings (ColBERT-style)
Related Documentation¶
Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead
"384 dimensions of semantic understanding."