Embedding Strategy & Implementation¶
Purpose: Complete documentation of embedding model, generation process, and optimization strategies
Audience: ML Engineers, Backend Developers
Owner: ML Engineering Lead
Last Updated: 2025-12-26
Version: 1.0
Embedding Model¶
Model Specification¶
Model: BAAI/bge-small-en-v1.5
Provider: Beijing Academy of Artificial Intelligence (BAAI)
Type: Sentence Transformer (BERT-based)
Architecture: Small-scale BERT for efficient embeddings
Technical Specifications¶
| Parameter | Value | Description |
|---|---|---|
| Model Name | BAAI/bge-small-en-v1.5 | Official HuggingFace identifier |
| Dimensions | 384 | Output vector dimensionality |
| Max Input Length | 512 tokens | Maximum sequence length |
| Model Size | ~34 MB | Small footprint for fast loading |
| Language | English (v1.5) | Optimized for English text |
| Embedding Speed | ~50ms p50, ~100ms p95 | Per embedding generation |
Why BAAI/bge-small-en-v1.5?¶
Advantages:
- Compact Size: 384 dimensions vs. 768 (bge-base) or 1536 (OpenAI text-embedding-ada-002)
  - 50% faster search
  - 50% less storage than 768-dim models (1,536 bytes vs. 3,072 bytes per vector)
  - Better Milvus performance
- Fast Inference: Small model = fast embedding generation
  - Average: 50ms per embedding
  - Can process ~20 embeddings/second on CPU
- Strong Performance:
  - Competitive with larger models on semantic similarity tasks
  - Optimized for retrieval tasks (RAG)
- Cost-Effective:
  - Self-hosted (no API costs)
  - Runs on CPU (no GPU required)
  - Small memory footprint (~100 MB RAM)
- Production-Ready:
  - Battle-tested on the MTEB benchmark
  - Stable v1.5 release
  - Good documentation and community support
Alternative Models Considered:
| Model | Dimensions | Why NOT Chosen |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | Slightly lower accuracy |
| BAAI/bge-large-en-v1.5 | 1024 | 3x slower, 2.7x larger storage |
| text-embedding-ada-002 (OpenAI) | 1536 | API costs ($0.0001/1K tokens), vendor lock-in |
| text-embedding-3-small (OpenAI) | 512-1536 | API costs, unnecessary complexity |
Embedding Generation Process¶
Implementation¶
Code Location: data-crawling-service/src/main.py
from typing import List

from fastembed import Embedding

# Initialize embedder (singleton, loaded once at service startup)
embedder = Embedding(
    model_name="BAAI/bge-small-en-v1.5",
    max_length=512  # Truncate longer texts
)

# Generate embedding for a chunk
def generate_embedding(text: str) -> List[float]:
    """
    Generate a 384-dimensional embedding vector.

    Args:
        text: Input text (max 512 tokens)

    Returns:
        List of 384 floats
    """
    # embed() returns a generator; convert to list
    embedding = list(embedder.embed([text]))[0]
    # Convert to a list of Python floats (Milvus requirement)
    embedding_list = [float(x) for x in embedding]
    return embedding_list  # 384 floats
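A minimal sanity check of the function above (a sketch, assuming the singleton embedder has loaded):

```python
vector = generate_embedding("What pricing plans do you offer?")
assert len(vector) == 384                        # matches the collection schema
assert all(isinstance(x, float) for x in vector) # plain floats, as Milvus expects
```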
Embedding Pipeline¶
Step 1: Text Preprocessing
Before embedding, text is cleaned:
import re

def preprocess_text(text: str) -> str:
    """
    Clean text before embedding to improve quality.
    """
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', ' ', text)
    # Remove email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text)
    # Remove phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '', text)
    # Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)
    # Collapse excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
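For illustration, here is how the helper strips a noisy snippet (hypothetical input):

```python
raw = "<p>Contact us at support@example.com or 555-123-4567. Visit https://example.com</p>"
print(preprocess_text(raw))
# -> "Contact us at or . Visit"
```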
Why preprocess?
- Removes noise (HTML, emails, phone numbers)
- Focuses embedding on semantic content
- Improves retrieval relevance
Step 2: Chunking
Text is split into 1000-character chunks with 200-character overlap:
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200):
    """
    Fixed-size character chunking with overlap.

    Chunk size: 1000 characters ≈ 250 tokens (well within the 512-token limit)
    Overlap: 200 characters preserves context across boundaries
    """
    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = min(start + chunk_size, text_length)
        chunk_content = text[start:end]
        chunks.append({
            "chunk_index": len(chunks),
            "content": chunk_content,
            "start_pos": start,
            "end_pos": end,
            "length": len(chunk_content)
        })
        # Move the start position back by the overlap (200 chars)
        start = end - overlap
        if end == text_length:
            break
    return chunks
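A quick check of the boundaries this produces for a 2,500-character input (illustrative):

```python
chunks = chunk_text("x" * 2500)
print([(c["start_pos"], c["end_pos"]) for c in chunks])
# -> [(0, 1000), (800, 1800), (1600, 2500)]  (200-character overlap between neighbours)
```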
Why 1000/200?
- 1000 chars ≈ 250 tokens (safe margin from 512 max)
- 200-char overlap ensures sentences aren't cut mid-context
- Balances granularity vs. search performance
Step 3: Embedding Generation
Each chunk is embedded independently:
for chunk in chunks:
    if chunk['content'].strip():
        try:
            # Generate embedding
            embedding = list(embedder.embed([chunk['content']]))[0]
            chunk['embedding'] = [float(x) for x in embedding]
        except Exception as e:
            logger.warning(f"Embedding failed for chunk {chunk['chunk_index']}: {e}")
            chunk['embedding'] = None  # Skip this chunk
Error Handling:
- If embedding fails (rare), chunk is skipped
- Doesn't block entire document processing
- Logged for debugging
Step 4: Milvus Insertion
Embeddings are inserted into the partition-based Milvus collection:
milvus_embeddings.insert_embeddings(
    collection_name="embeddings",
    embeddings_data=[{
        "document_id": doc_id,
        "user_id": user_id,
        "project_id": project_id,            # Partition key
        "chunk_index": chunk['chunk_index'],
        "text": chunk['content'][:2000],     # Truncate for storage
        "embedding": chunk['embedding'],     # 384 floats
        "data_type": "pdf",                  # pdf, text, qa, url, org
        "source_url": source_url or "",
        "created_at": datetime.utcnow().isoformat()
    }]
)
Query-Time Embedding¶
When a user asks a question:
def retrieve_relevant_documents(user_id: str, project_id: str, question: str, top_k: int = 5):
    """
    Generate query embedding and search Milvus.
    """
    # Step 1: Embed the query
    question_embedding = list(embedder.embed([question]))[0]
    question_embedding_list = [float(x) for x in question_embedding]

    # Step 2: Search Milvus (partition-scoped)
    search_results = milvus_embeddings.search_embeddings(
        collection_name="embeddings",
        query_vector=question_embedding_list,  # 384-dim
        user_id=user_id,
        project_id=project_id,                 # Searches ONLY this partition
        top_k=top_k
    )

    # Step 3: Format results
    return [{
        "content": result.get("text", ""),
        "similarity": result.get("score", 0.0),
        "document_id": result.get("document_id", ""),
        "chunk_index": result.get("chunk_index", 0)
    } for result in search_results]
Performance: ~50-100ms total (50ms embedding + 15-35ms Milvus search)
Similarity Metric: L2 Distance¶
Milvus Configuration:
# Collection index parameters
index_params = {
    "metric_type": "L2",        # Euclidean distance
    "index_type": "IVF_FLAT",
    "params": {"nlist": 128}
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10}
}
Why L2 (Euclidean Distance)?
L2 distance measures straight-line distance between vectors:
Lower distance = More similar
L2 vs. Cosine Similarity:
| Metric | Use Case | MachineAvatars Choice |
|---|---|---|
| Cosine | Direction matters, magnitude doesn't | ❌ Not used |
| L2 | Both direction AND magnitude matter | ✅ Used |
For normalized embeddings (like BAAI/bge), L2 and cosine give similar results, but L2 is slightly faster in Milvus.
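The equivalence can be seen directly: for unit-length vectors, ||a - b||² = 2 · (1 - cos(a, b)), so both metrics produce the same ranking. A small sketch with random stand-in vectors (not real embeddings), assuming the vectors are L2-normalized as bge models typically are:

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = rng.normal(size=384), rng.normal(size=384)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # normalize to unit length

cosine = float(a @ b)
l2_squared = float(((a - b) ** 2).sum())
print(round(l2_squared, 6), round(2 * (1 - cosine), 6))  # the two values match
```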
Score Conversion:
Milvus returns L2 distance. We convert to similarity score (0-1):
similarity_score = 1 / (1 + distance)
# Examples:
# distance = 0.0 → score = 1.0 (identical)
# distance = 1.0 → score = 0.5 (moderate similarity)
# distance = 10.0 → score = 0.09 (very different)
Performance Optimization¶
1. Batch Embedding¶
For multiple chunks, embed in batches:
# Slow (sequential): one embed() call per chunk
for chunk in chunks:
    embedding = list(embedder.embed([chunk['content']]))[0]

# Fast (batch): a single embed() call for all chunks
chunk_texts = [chunk['content'] for chunk in chunks]
embeddings = list(embedder.embed(chunk_texts))  # All at once
for i, chunk in enumerate(chunks):
    chunk['embedding'] = [float(x) for x in embeddings[i]]
Speedup: 3-5x faster for batch processing
2. Caching¶
Frequently asked questions are pre-embedded:
# Cache common queries
QUERY_CACHE = {}

def get_query_embedding(question: str):
    if question in QUERY_CACHE:
        return QUERY_CACHE[question]
    embedding = list(embedder.embed([question]))[0]
    embedding_list = [float(x) for x in embedding]
    QUERY_CACHE[question] = embedding_list
    return embedding_list
Hit Rate: ~15% (common FAQs)
3. Singleton Embedder¶
Embedder loaded once at service startup:
# Service startup (once)
embedder = Embedding(model_name="BAAI/bge-small-en-v1.5", max_length=512)

# Then reuse for all requests (fast)
embedding = list(embedder.embed([text]))[0]
vs. Loading per-request:
- Startup: ~2 seconds (loads model weights)
- Per-request with singleton: ~50ms
- Per-request without singleton: ~2000ms + 50ms
Savings: 40x faster
Storage Efficiency¶
Embedding Size¶
Per embedding:
- 384 dimensions × 4 bytes/float = 1,536 bytes = 1.5 KB
Per document (50 chunks):
- 50 chunks × 1.5 KB = 75 KB
For 1 million chunks:
- 1M × 1.5 KB = 1.5 GB
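The same arithmetic as a quick back-of-the-envelope script (raw float32 vectors only; Milvus index and metadata overhead are not included):

```python
def vector_storage_bytes(num_vectors: int, dims: int = 384, bytes_per_float: int = 4) -> int:
    """Raw storage for float32 embedding vectors."""
    return num_vectors * dims * bytes_per_float

print(vector_storage_bytes(1))          # 1,536 bytes ≈ 1.5 KB per embedding
print(vector_storage_bytes(50))         # 76,800 bytes ≈ 75 KB per 50-chunk document
print(vector_storage_bytes(1_000_000))  # ~1.5 GB for 1M chunks
```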
Comparison¶
| Model | Dimensions | Storage per 1M Chunks |
|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | 1.5 GB |
| BAAI/bge-large-en-v1.5 | 1024 | 4.0 GB |
| OpenAI ada-002 | 1536 | 6.0 GB |
Savings: 75% less storage vs. OpenAI embeddings
Quality Metrics¶
MTEB Benchmark Performance¶
BAAI/bge-small-en-v1.5 Scores:
| Task | Score | Notes |
|---|---|---|
| Retrieval | 51.7 | Good for RAG use cases |
| Semantic Similarity | 68.2 | Strong performance |
| Classification | 61.4 | Not our primary use case |
| Clustering | 42.1 | Not our primary use case |
Overall MTEB Score: 58.9 / 100
vs. Alternatives:
- all-MiniLM-L6-v2: 56.3
- BAAI/bge-large-en-v1.5: 63.6 (slower, bigger)
- text-embedding-ada-002: 60.0 (API costs)
Conclusion: Best performance/cost trade-off for RAG
Monitoring & Observability¶
Key Metrics¶
logger.info("Embedding generation", extra={
    "model": "BAAI/bge-small-en-v1.5",
    "text_length": len(text),
    "embedding_dims": len(embedding),
    "generation_time_ms": elapsed_ms
})
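For context, generation_time_ms can be captured around the embed() call with time.perf_counter. A sketch; the service's actual instrumentation may differ, and the helper name is hypothetical:

```python
import time

def timed_embedding(text: str):
    """Embed one text and return (vector, elapsed_ms) for the log fields above."""
    start = time.perf_counter()
    embedding = list(embedder.embed([text]))[0]   # embedder: the singleton defined earlier
    elapsed_ms = (time.perf_counter() - start) * 1000
    return [float(x) for x in embedding], elapsed_ms
```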
Collected Metrics¶
| Metric | Target | Alert Threshold |
|---|---|---|
| Embedding latency (p95) | < 200ms | > 500ms |
| Embedding failures | < 0.1% | > 1% |
| Model load time | < 3s | > 10s |
| Memory usage | < 200 MB | > 500 MB |
Troubleshooting¶
Issue: Slow Embedding Generation¶
Symptoms: Embedding takes > 500ms
Causes:
- Model not cached (reloading every request)
- Large text (> 512 tokens)
- CPU overload
Solutions:
- Use singleton embedder pattern
- Chunk text before embedding
- Scale horizontally (add more service instances)
Issue: Low Retrieval Relevance¶
Symptoms: Retrieved chunks not relevant to query
Causes:
- Poor text preprocessing
- Chunks too large/small
- Query phrasing mismatch
Solutions:
- Improve preprocessing (remove noise)
- Adjust chunk size (test 500/800/1000)
- Query expansion (add synonyms)
- Hybrid search (combine with BM25 keyword search)
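As an illustration of the last point, hybrid search can be as simple as a weighted fusion of normalized vector-similarity and keyword (e.g. BM25) scores. This is a sketch, not the production implementation; function names and the alpha weighting are hypothetical:

```python
def _min_max_normalize(scores: dict) -> dict:
    """Rescale scores to [0, 1] so the two score types are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_scores(vector_scores: dict, keyword_scores: dict, alpha: float = 0.7) -> dict:
    """Weighted fusion: alpha weights the embedding score, (1 - alpha) the keyword score."""
    v = _min_max_normalize(vector_scores)
    k = _min_max_normalize(keyword_scores)
    return {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }

# Example usage with illustrative scores keyed by chunk ID
print(hybrid_scores({"c1": 0.82, "c2": 0.61, "c3": 0.40}, {"c2": 12.4, "c3": 3.1}))
```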
Issue: Memory Leak¶
Symptoms: Service memory grows over time
Causes:
- Embeddings not garbage collected
- Cache unbounded growth
Solutions:
# Limit cache size
from functools import lru_cache

@lru_cache(maxsize=1000)  # Max 1000 cached queries
def get_query_embedding(question: str):
    ...
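Combining the earlier cache helper with lru_cache, a bounded version could look like the sketch below; the tuple return keeps the cached value immutable (the function name is hypothetical):

```python
from functools import lru_cache

@lru_cache(maxsize=1000)                      # bounds memory: at most 1000 cached queries
def get_query_embedding_cached(question: str) -> tuple:
    # Return a tuple so callers cannot mutate the cached value in place.
    embedding = list(embedder.embed([question]))[0]
    return tuple(float(x) for x in embedding)
```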
Future Enhancements¶
Planned Improvements¶
Q1 2025:
- Fine-tune BAAI/bge-small on MachineAvatars domain data
- A/B test vs. BAAI/bge-base-en-v1.5 (768 dims)
- Implement query expansion
Q2 2025:
- Multi-lingual embeddings (support Hindi, Spanish)
- Hybrid search (embeddings + BM25)
- Embedding quantization (reduce to 256 dims)
Q3 2025:
- Custom embedding model training
- Late interaction embeddings (ColBERT-style)
Related Documentation¶
Last Updated: 2025-12-26
Version: 1.0
Owner: ML Engineering Lead
"384 dimensions of semantic understanding."