
Embedding Strategy & Implementation

Purpose: Complete documentation of embedding model, generation process, and optimization strategies
Audience: ML Engineers, Backend Developers
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0


Embedding Model

Model Specification

Model: BAAI/bge-small-en-v1.5
Provider: Beijing Academy of Artificial Intelligence (BAAI)
Type: Sentence Transformer (BERT-based)
Architecture: Small-scale BERT for efficient embeddings

Technical Specifications

Parameter | Value | Description
Model Name | BAAI/bge-small-en-v1.5 | Official HuggingFace identifier
Dimensions | 384 | Output vector dimensionality
Max Input Length | 512 tokens | Maximum sequence length
Model Size | ~34 MB | Small footprint for fast loading
Language | English (v1.5) | Optimized for English text
Embedding Speed | ~50ms p50, ~100ms p95 | Per embedding generation

Why BAAI/bge-small-en-v1.5?

Advantages:

  1. Compact Size: 384 dimensions vs. 768 (BAAI/bge-base-en-v1.5) or 1536 (OpenAI text-embedding-ada-002)
     • ~50% faster search
     • 50% less storage than a 768-dimension model (1,536 bytes vs. 3,072 bytes per float32 vector)
     • Better Milvus performance

  2. Fast Inference: a small model means fast embedding generation
     • ~50ms per embedding on average
     • Can process ~20 embeddings/second on CPU

  3. Strong Performance:
     • Competitive with larger models on semantic similarity tasks
     • Optimized for retrieval tasks (RAG)

  4. Cost-Effective:
     • Self-hosted (no API costs)
     • Runs on CPU (no GPU required)
     • Small memory footprint (~100 MB RAM)

  5. Production-Ready:
     • Battle-tested on the MTEB benchmark
     • Stable v1.5 release
     • Good documentation and community support

Alternative Models Considered:

Model | Dimensions | Why NOT Chosen
all-MiniLM-L6-v2 | 384 | Slightly lower accuracy
BAAI/bge-large-en-v1.5 | 1024 | 3x slower, 2.7x larger storage
text-embedding-ada-002 (OpenAI) | 1536 | API costs ($0.0001/1K tokens), vendor lock-in
text-embedding-3-small (OpenAI) | 512-1536 | API costs, unnecessary complexity

Embedding Generation Process

Implementation

Code Location: data-crawling-service/src/main.py

from typing import List

from fastembed import Embedding

# Initialize embedder (singleton, loaded once at service startup)
embedder = Embedding(
    model_name="BAAI/bge-small-en-v1.5",
    max_length=512  # Truncate longer texts
)

# Generate embedding for a chunk
def generate_embedding(text: str) -> List[float]:
    """
    Generate 384-dimensional embedding vector.

    Args:
        text: Input text (max 512 tokens)

    Returns:
        List of 384 floats
    """
    # embed() returns generator, convert to list
    embedding = list(embedder.embed([text]))[0]

    # Convert to float list (Milvus requirement)
    embedding_list = [float(x) for x in embedding]

    return embedding_list  # 384 floats
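
A minimal call-site sketch (the sample sentence is purely illustrative):

vector = generate_embedding("What are your support hours?")
print(len(vector))   # 384
print(vector[:3])    # first three floats; exact values depend on the input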

Embedding Pipeline

Step 1: Text Preprocessing

Before embedding, text is cleaned:

import re

def preprocess_text(text: str) -> str:
    """
    Clean text before embedding to improve quality.
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', ' ', text)

    # Remove emails
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)

    # Remove phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '', text)

    # Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)

    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

Why preprocess?

  • Removes noise (HTML, emails, phone numbers)
  • Focuses embedding on semantic content
  • Improves retrieval relevance
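
A quick before/after illustration using preprocess_text above (the raw string is made up):

raw = "<p>Visit https://example.com or email sales@example.com</p>   Call 555-123-4567 today."
print(preprocess_text(raw))
# -> "Visit or email Call today."  (tags, URL, email, and phone number stripped)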

Step 2: Chunking

Text is split into 1000-character chunks with 200-character overlap:

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Fixed-size character chunking with overlap.

    Chunk size: 1000 characters ≈ 250 tokens (well within 512 limit)
    Overlap: 200 characters preserves context across boundaries
    """
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk_content = text[start:end]

        chunks.append({
            "chunk_index": len(chunks),
            "content": chunk_content,
            "start_pos": start,
            "end_pos": end,
            "length": len(chunk_content)
        })

        # Move start position (overlap = 200)
        start = end - overlap
        if end == text_length:
            break

    return chunks

Why 1000/200?

  • 1000 chars ≈ 250 tokens (safe margin from 512 max)
  • 200-char overlap ensures sentences aren't cut mid-context
  • Balances granularity vs. search performance
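
A quick sanity check of the overlap behaviour with chunk_text above (synthetic input):

text = "x" * 2500
chunks = chunk_text(text)
print([(c["start_pos"], c["end_pos"]) for c in chunks])
# -> [(0, 1000), (800, 1800), (1600, 2500)]  each chunk re-reads the last 200 chars of the previous one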

Step 3: Embedding Generation

Each chunk is embedded independently:

for chunk in chunks:
    if chunk['content'].strip():
        try:
            # Generate embedding
            embedding = list(embedder.embed([chunk['content']]))[0]
            chunk['embedding'] = [float(x) for x in embedding]
        except Exception as e:
            logger.warning(f"Embedding failed for chunk {chunk['chunk_index']}: {e}")
            chunk['embedding'] = None  # Skip this chunk

Error Handling:

  • If embedding fails (rare), chunk is skipped
  • Doesn't block entire document processing
  • Logged for debugging

Step 4: Milvus Insertion

Embeddings inserted into partition-based Milvus:

milvus_embeddings.insert_embeddings(
    collection_name="embeddings",
    embeddings_data=[{
        "document_id": doc_id,
        "user_id": user_id,
        "project_id": project_id,  # Partition key
        "chunk_index": chunk['chunk_index'],
        "text": chunk['content'][:2000],  # Truncate for storage
        "embedding": chunk['embedding'],  # 384 floats
        "data_type": "pdf",  # pdf, text, qa, url, org
        "source_url": source_url or "",
        "created_at": datetime.utcnow().isoformat()
    }]
)
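
In practice, all valid chunks of a document can be assembled into one list and inserted in a single call, assuming the insert_embeddings wrapper accepts multiple rows in embeddings_data. A sketch of that assembly, reusing the variables from the steps above and skipping chunks whose embedding failed:

from datetime import datetime

rows = []
for chunk in chunks:
    if chunk.get("embedding") is None:
        continue  # embedding failed in Step 3, skip this chunk
    rows.append({
        "document_id": doc_id,
        "user_id": user_id,
        "project_id": project_id,            # partition key
        "chunk_index": chunk["chunk_index"],
        "text": chunk["content"][:2000],
        "embedding": chunk["embedding"],     # 384 floats
        "data_type": "pdf",
        "source_url": source_url or "",
        "created_at": datetime.utcnow().isoformat()
    })

if rows:
    milvus_embeddings.insert_embeddings(collection_name="embeddings", embeddings_data=rows)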

Query-Time Embedding

When a user asks a question:

def retrieve_relevant_documents(user_id: str, project_id: str, question: str, top_k: int = 5):
    """
    Generate query embedding and search Milvus.
    """
    # Step 1: Embed query
    question_embedding = list(embedder.embed([question]))[0]
    question_embedding_list = [float(x) for x in question_embedding]

    # Step 2: Search Milvus (partition-scoped)
    search_results = milvus_embeddings.search_embeddings(
        collection_name="embeddings",
        query_vector=question_embedding_list,  # 384-dim
        user_id=user_id,
        project_id=project_id,  # Searches ONLY this partition
        top_k=top_k
    )

    # Step 3: Format results
    return [{
        "content": result.get("text", ""),
        "similarity": result.get("score", 0.0),
        "document_id": result.get("document_id", ""),
        "chunk_index": result.get("chunk_index", 0)
    } for result in search_results]

Performance: ~50-100ms total (50ms embedding + 15-35ms Milvus search)


Similarity Metric: L2 Distance

Milvus Configuration:

# Collection index parameters
index_params = {
    "metric_type": "L2",  # Euclidean distance
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128}
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10}
}

Why L2 (Euclidean Distance)?

L2 distance measures straight-line distance between vectors:

distance = sqrt((v1[0]-v2[0])² + (v1[1]-v2[1])² + ... + (v1[383]-v2[383])²)

Lower distance = More similar

L2 vs. Cosine Similarity:

Metric | Use Case | MachineAvatars Choice
Cosine | Direction matters, magnitude doesn't | ❌ Not used
L2 | Both direction AND magnitude matter | ✅ Used

For normalized (unit-length) embeddings such as the BAAI/bge vectors, L2 and cosine produce the same ranking (for unit vectors, distance² = 2 - 2 × cosine similarity), and L2 is slightly faster in Milvus.

Score Conversion:

Milvus returns L2 distance. We convert to similarity score (0-1):

similarity_score = 1 / (1 + distance)

# Examples:
# distance = 0.0 → score = 1.0 (identical)
# distance = 1.0 → score = 0.5 (moderate similarity)
# distance = 10.0 → score = 0.09 (very different)
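
The conversion is easy to sanity-check, and the same few lines show why L2 and cosine agree for unit-length vectors (2-D stand-ins for the 384-D embeddings):

import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v1 = [1.0, 0.0]
v2 = [math.cos(0.5), math.sin(0.5)]   # unit vector at 0.5 rad from v1

d = l2(v1, v2)
cos_sim = sum(x * y for x, y in zip(v1, v2))

print(round(d, 4))                           # 0.4948
print(round(1 / (1 + d), 4))                 # 0.669  (similarity score)
print(round(math.sqrt(2 - 2 * cos_sim), 4))  # 0.4948  (distance² = 2 - 2*cosine for unit vectors)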

Performance Optimization

1. Batch Embedding

For multiple chunks, embed in batches:

# Slow (sequential)
for chunk in chunks:
    embedding = embedder.embed([chunk['content']])

# Fast (batch)
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = list(embedder.embed(chunk_texts))  # All at once

for i, chunk in enumerate(chunks):
    chunk['embedding'] = [float(x) for x in embeddings[i]]

Speedup: 3-5x faster for batch processing
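
The speedup can be measured directly; a rough timing harness (sample texts are synthetic, and the exact factor depends on hardware and batch size):

import time

texts = [f"sample chunk number {i}" for i in range(100)]

start = time.perf_counter()
for t in texts:
    _ = list(embedder.embed([t]))      # sequential: one call per chunk
sequential_s = time.perf_counter() - start

start = time.perf_counter()
_ = list(embedder.embed(texts))        # batched: one call for all chunks
batched_s = time.perf_counter() - start

print(f"sequential={sequential_s:.2f}s  batched={batched_s:.2f}s  speedup={sequential_s / batched_s:.1f}x")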

2. Caching

Frequently asked questions are pre-embedded:

# Cache common queries
QUERY_CACHE = {}

def get_query_embedding(question: str):
    if question in QUERY_CACHE:
        return QUERY_CACHE[question]

    embedding = list(embedder.embed([question]))[0]
    embedding_list = [float(x) for x in embedding]

    QUERY_CACHE[question] = embedding_list
    return embedding_list

Hit Rate: ~15% (common FAQs)

3. Singleton Embedder

Embedder loaded once at service startup:

# Service startup (once)
embedder = Embedding(model_name="BAAI/bge-small-en-v1.5", max_length=512)

# Then reuse for all requests (fast)
embedding = embedder.embed([text])

vs. Loading per-request:

  • Startup: ~2 seconds (loads model weights)
  • Per-request with singleton: ~50ms
  • Per-request without singleton: ~2000ms + 50ms

Savings: 40x faster


Storage Efficiency

Embedding Size

Per embedding:

  • 384 dimensions × 4 bytes/float = 1,536 bytes = 1.5 KB

Per document (50 chunks):

  • 50 chunks × 1.5 KB = 75 KB

For 1 million chunks:

  • 1M × 1.5 KB = 1.5 GB
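
The same arithmetic as a quick check:

DIMS = 384
BYTES_PER_FLOAT32 = 4

per_vector_kb = DIMS * BYTES_PER_FLOAT32 / 1e3               # 1.536 KB ≈ 1.5 KB
per_document_kb = 50 * per_vector_kb                         # ≈ 76.8 KB (~75 KB per 50-chunk document)
per_million_gb = 1_000_000 * DIMS * BYTES_PER_FLOAT32 / 1e9  # ≈ 1.54 GB (~1.5 GB for 1M chunks)

print(per_vector_kb, per_document_kb, round(per_million_gb, 2))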

Comparison

Model | Dimensions | Storage per 1M Chunks
BAAI/bge-small-en-v1.5 | 384 | 1.5 GB
BAAI/bge-large-en-v1.5 | 1024 | 4.0 GB
OpenAI ada-002 | 1536 | 6.0 GB

Savings: 75% less storage vs. OpenAI embeddings


Quality Metrics

MTEB Benchmark Performance

BAAI/bge-small-en-v1.5 Scores:

Task | Score | Notes
Retrieval | 51.7 | Good for RAG use cases
Semantic Similarity | 68.2 | Strong performance
Classification | 61.4 | Not our primary use case
Clustering | 42.1 | Not our primary use case

Overall MTEB Score: 58.9 / 100

vs. Alternatives:

  • all-MiniLM-L6-v2: 56.3
  • BAAI/bge-large-en-v1.5: 63.6 (slower, bigger)
  • text-embedding-ada-002: 60.0 (API costs)

Conclusion: Best performance/cost trade-off for RAG


Monitoring & Observability

Key Metrics

logger.info("Embedding generation", extra={
    "model": "BAAI/bge-small-en-v1.5",
    "text_length": len(text),
    "embedding_dims": len(embedding),
    "generation_time_ms": elapsed_ms
})
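
The generation_time_ms value is assumed to come from a simple timer around the embed call; one way to capture it:

import time

start = time.perf_counter()
embedding = list(embedder.embed([text]))[0]
elapsed_ms = (time.perf_counter() - start) * 1000   # logged as generation_time_ms above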

Collected Metrics

Metric | Target | Alert Threshold
Embedding latency (p95) | < 200ms | > 500ms
Embedding failures | < 0.1% | > 1%
Model load time | < 3s | > 10s
Memory usage | < 200 MB | > 500 MB

Troubleshooting

Issue: Slow Embedding Generation

Symptoms: Embedding takes > 500ms

Causes:

  1. Model not cached (reloading every request)
  2. Large text (> 512 tokens)
  3. CPU overload

Solutions:

  1. Use singleton embedder pattern
  2. Chunk text before embedding
  3. Scale horizontally (add more service instances)

Issue: Low Retrieval Relevance

Symptoms: Retrieved chunks not relevant to query

Causes:

  1. Poor text preprocessing
  2. Chunks too large/small
  3. Query phrasing mismatch

Solutions:

  1. Improve preprocessing (remove noise)
  2. Adjust chunk size (test 500/800/1000)
  3. Query expansion (add synonyms)
  4. Hybrid search (combine with BM25 keyword search)

Issue: Memory Leak

Symptoms: Service memory grows over time

Causes:

  1. Embeddings not garbage collected
  2. Cache unbounded growth

Solutions:

# Limit cache size
from functools import lru_cache

@lru_cache(maxsize=1000)  # Max 1000 cached queries
def get_query_embedding(question: str):
    ...

Future Enhancements

Planned Improvements

Q1 2025:

  • Fine-tune BAAI/bge-small on MachineAvatars domain data
  • A/B test vs. BAAI/bge-base-en-v1.5 (768 dims)
  • Implement query expansion

Q2 2025:

  • Multi-lingual embeddings (support Hindi, Spanish)
  • Hybrid search (embeddings + BM25)
  • Embedding quantization (reduce to 256 dims)

Q3 2025:

  • Custom embedding model training
  • Late interaction embeddings (ColBERT-style)


Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead


"384 dimensions of semantic understanding."