Data Training & Knowledge Base Features¶

Category: Content Management & AI Training
Target Users: All users (varies by plan)
Business Value: Makes chatbots intelligent and accurate

🎯 Overview¶

This section covers the RAG (Retrieval-Augmented Generation) pipeline that powers intelligent conversations. Upload data, train your chatbot, and watch it become an expert in your domain.

🌐 Website Data Collection¶

Automated Website Crawling¶

What it does:
Automatically extract all content from your website to train your chatbot.

Process:

Step 1: Add URLs

Enter website URLs (up to 50, depending on plan)
Supports:
Homepage
Product pages
FAQ pages
Blog posts
Documentation
Any public page

Step 2: Crawl Configuration

Depth: How many links deep the crawler follows
Include/exclude patterns
Respect robots.txt
Rate limiting (polite crawling)
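
As a minimal sketch of how these settings could gate each URL, assuming a hypothetical `CrawlConfig` container and `should_crawl` helper (neither name comes from the service itself), the checks might look like this. The robots.txt rules are parsed from inline text so the example runs offline.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


@dataclass
class CrawlConfig:
    """Hypothetical container for the crawl settings listed above."""
    max_depth: int = 2
    include: list[str] = field(default_factory=lambda: ["*"])
    exclude: list[str] = field(default_factory=list)
    delay_seconds: float = 1.0  # pause between requests (polite crawling)


def should_crawl(url: str, depth: int, config: CrawlConfig,
                 robots: RobotFileParser) -> bool:
    """Return True if the URL passes the depth, pattern, and robots.txt checks."""
    if depth > config.max_depth:
        return False
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pattern) for pattern in config.exclude):
        return False
    if not any(fnmatch(path, pattern) for pattern in config.include):
        return False
    return robots.can_fetch("*", url)


# Parse robots.txt rules from text instead of fetching them over the network.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /admin/"])

config = CrawlConfig(max_depth=2, exclude=["/cart*"])
print(should_crawl("https://example.com/faq", 1, config, robots))
```

A real crawler would also sleep for `delay_seconds` between requests to the same host.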

Step 3: Auto-Crawl

Scrapes all text content
Extracts headings, paragraphs, lists
Ignores images/scripts/styles
Follows links within domain
Processes in background
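
The text-extraction step can be illustrated with a short sketch. The service is described as using BeautifulSoup; to stay dependency-free, this example swaps in Python's standard-library `html.parser`, and the `TextExtractor` and `extract_text` names are illustrative assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when we are not inside a skipped element.
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html: str) -> list[str]:
    """Return the text fragments of a page in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts
```

Images carry no text nodes, so they drop out without special handling.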

Step 4: Content Processing

Clean HTML tags
Remove duplicates
Chunk into digestible pieces (1024 tokens)
Generate embeddings (vector representations)
Store in Milvus vector database
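
The de-duplication and chunking steps above can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: whitespace-separated words stand in for model tokens, and the function names are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip()


def remove_duplicates(passages: list[str]) -> list[str]:
    """Drop empty passages and case-insensitive exact duplicates, keeping order."""
    seen: set[str] = set()
    unique = []
    for passage in passages:
        cleaned = normalize(passage)
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


def chunk_words(text: str, max_tokens: int = 1024) -> list[str]:
    """Split text into chunks of at most max_tokens words (a stand-in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Each resulting chunk would then be embedded and stored; a production pipeline would count tokens with the embedding model's own tokenizer.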

Limits by Plan:

Free: 5 URLs
Pro: 10 URLs
Business: 25 URLs
Premium: 50 URLs

Technical:

Backend: client-data-collection-service
Method: BeautifulSoup HTML parsing
Embeddings: OpenAI text-embedding-ada-002
Storage: Milvus collections partitioned by project_id

URL Management¶

What it does:
Manage your website data sources.

Features:

View URLs: See all crawled URLs
Add URLs: Add new pages anytime
Remove URLs: Delete specific pages
Re-crawl: Update content when website changes
Status Tracking: See crawl progress
Error Handling: Retry failed URLs

Auto-Updates:

Schedule re-crawls (weekly/monthly)
Detect content changes
Incremental updates
No duplicate processing
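
One common way to detect content changes and avoid reprocessing unchanged pages is to compare content hashes between crawls. The sketch below assumes that approach; the source does not say how the service detects changes, and `fingerprint` and `plan_recrawl` are hypothetical names.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so layout-only changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_recrawl(stored: dict[str, str], fetched: dict[str, str]) -> dict[str, list[str]]:
    """Classify fetched pages against the previous crawl.

    stored maps URL -> fingerprint from the previous crawl;
    fetched maps URL -> page text from the current crawl.
    """
    plan: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": []}
    for url, text in fetched.items():
        if url not in stored:
            plan["new"].append(url)
        elif stored[url] != fingerprint(text):
            plan["changed"].append(url)
        else:
            plan["unchanged"].append(url)
    return plan
```

Only the `new` and `changed` URLs would need re-chunking and re-embedding, which is what makes the update incremental.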

📄 File Uploads¶

Supported File Types¶

What it does:
Upload documents containing your knowledge base.

Supported Formats:

1. PDF Files

Product catalogs
User manuals
Policy documents
Research papers
Presentations (PDF)

2. Microsoft Documents

Word (.docx)
Excel (.xlsx) - tables extracted
PowerPoint (.pptx)

3. Text Files

Plain text (.txt)
Markdown (.md)
CSV (structured data)

Processing:

Extract text content
Preserve formatting context
Handle multi-page documents
OCR for scanned PDFs (on the roadmap, not yet available)

Limits by Plan:

Free: 5 MB total
Pro: 50 MB total
Business: 200 MB total
Premium: 1 GB total

Technical:

Backend: chatbot-file-upload-service
Collections: files, files_secondary
File storage: MongoDB GridFS
Text extraction: PyPDF2, python-docx, openpyxl

File Management¶

Features:

Upload Files: Drag & drop or browse
View Files: List of all uploaded files
Download Files: Retrieve original files
Delete Files: Remove specific files
File Metadata: Size, type, upload date
Processing Status: Track embedding progress

Smart Chunking:

Split large documents into chunks
Maintain context (overlapping chunks)
Chapter/section awareness
Optimal chunk size for retrieval
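
Overlapping chunks can be produced with a sliding window. The sketch below is a generic illustration; the overlap size is an assumption, since the source does not state one, and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, to avoid a trailing fragment
        # that only repeats tokens already covered.
        if start + size >= len(tokens):
            break
    return chunks


print(chunk_with_overlap(list("abcdefghij"), size=4, overlap=1))
```

The shared tokens at each boundary let a sentence that straddles two chunks stay retrievable from either one.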

✍️ Manual Text Input¶

Direct Text Entry¶

What it does:
Type or paste content directly into the knowledge base.

Use Cases:

Quick facts
Company policies not on website
Special instructions
Temporary information
Q&A pairs

Features:

Rich text editor
Formatting support
Preview before save
Character count
Categorization

Technical:

Immediate embedding
Same vector storage
Full-text search enabled

❓ Q&A Builder¶

Question-Answer Pairs¶

What it does:
Train your chatbot with specific question-answer pairs for precise responses.

Why It's Powerful:

Exact Matches: When a user asks the exact question, they get the exact answer
Common Questions: Pre-load FAQs
Corrections: Override RAG when needed
Consistent Answers: The same question always gets the same answer

Interface:

Add Q&A:

Question: What are your business hours?
Answer: We're open Monday-Friday, 9 AM - 5 PM EST.

Bulk Import:

Upload CSV with Q&A columns
Excel import support
Validate before import
Duplicate detection
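
Validation and duplicate detection for a bulk CSV import could look roughly like this. The column names `question` and `answer` and the `load_qa_rows` helper are assumptions made for illustration; the actual import format is not specified here.

```python
import csv
import io


def load_qa_rows(csv_text: str) -> tuple[list[dict[str, str]], list[str]]:
    """Return (valid rows, error messages) for CSV text with question/answer columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    valid: list[dict[str, str]] = []
    errors: list[str] = []
    seen: set[str] = set()
    for row in reader:
        line_no = reader.line_num  # line of the source file just consumed
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if not question or not answer:
            errors.append(f"line {line_no}: question and answer are both required")
            continue
        key = question.casefold()  # case-insensitive duplicate check
        if key in seen:
            errors.append(f"line {line_no}: duplicate question")
            continue
        seen.add(key)
        valid.append({"question": question, "answer": answer})
    return valid, errors
```

Returning the errors alongside the valid rows lets the interface show a validation report before anything is imported.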

Edit Q&A:

Inline editing
Version history
A/B test different answers
Activation toggle (enable/disable)

Search Q&A:

Find specific questions
Filter by category
Sort by frequency asked
Export to CSV

Limits:

Free: 50 Q&A pairs
Pro: 200 Q&A pairs
Business: 1,000 Q&A pairs
Premium: Unlimited

Technical:

Collection: qa_pairs (custom structure)
Priority: Q&A checked before RAG retrieval
Matching: Fuzzy matching for similar questions
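
The "Q&A first, then RAG" priority with fuzzy matching can be sketched with the standard library's `difflib`. The similarity threshold, the normalization rules, and the function names are assumptions; the service's actual matching algorithm is not documented here.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def normalize(question: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(question.lower().split()).rstrip("?!. ")


def match_qa(user_question: str, qa_pairs: dict[str, str],
             threshold: float = 0.85) -> Optional[str]:
    """Return the stored answer whose question is most similar, if similar enough."""
    best_answer, best_score = None, 0.0
    for stored_question, stored_answer in qa_pairs.items():
        score = SequenceMatcher(
            None, normalize(user_question), normalize(stored_question)
        ).ratio()
        if score > best_score:
            best_answer, best_score = stored_answer, score
    return best_answer if best_score >= threshold else None


def answer(user_question: str, qa_pairs: dict[str, str],
           rag_fallback: Callable[[str], str]) -> str:
    """Check Q&A pairs first and fall back to RAG retrieval when nothing matches."""
    matched = match_qa(user_question, qa_pairs)
    return matched if matched is not None else rag_fallback(user_question)
```

A higher threshold favors consistency (only near-identical questions hit the stored answer), while a lower one catches more paraphrases at the risk of wrong matches.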

🧠 RAG Pipeline¶

How Knowledge Becomes Answers¶

What it does:
The AI engine that retrieves relevant information and generates responses.

5-Step Process:

Step 1: User Question

User: "What's your return policy?"

Step 2: Question Embedding

Convert question to vector (1536 dimensions)
OpenAI embedding model
Semantic representation

Step 3: Similarity Search

Query Milvus vector database
Find top 5 most similar chunks
Cosine similarity scoring
Filter by project_id partition
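
To make the scoring concrete, here is a small in-memory sketch of a cosine-similarity top-k search restricted to one project. It is a stand-in for the Milvus query, not the Milvus client API, and it uses 3-dimensional toy vectors instead of 1536-dimensional embeddings; the record layout is assumed.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vector: list[float], chunks: list[dict],
                 project_id: str, k: int = 5) -> list[tuple[float, str]]:
    """Return up to k (score, text) pairs for one project, best match first."""
    candidates = [c for c in chunks if c["project_id"] == project_id]
    scored = [(cosine_similarity(query_vector, c["vector"]), c["text"])
              for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Filtering by `project_id` before scoring mirrors the partition filter above: chunks from other projects are never candidates.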

Step 4: Context Building

Retrieve matched content chunks
Add conversation history (last 10 messages)
Include system prompt
Total context: ~3,000 tokens
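
Context assembly under a token budget might be sketched as below. Because five 1024-token chunks alone would exceed a ~3,000-token budget, this sketch keeps the highest-ranked chunks that fit and drops the rest. The word-count tokenizer, the prompt layout, and the `build_context` name are all assumptions.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def build_context(system_prompt: str, chunks: list[str], history: list[str],
                  max_tokens: int = 3000, history_limit: int = 10) -> str:
    """Combine the system prompt, retrieved chunks, and recent messages within a budget.

    chunks are assumed to be sorted most relevant first.
    """
    recent = history[-history_limit:]
    used = count_tokens(system_prompt) + sum(count_tokens(m) for m in recent)
    kept = []
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # lower-ranked chunks no longer fit
        kept.append(chunk)
        used += cost
    return "\n\n".join([system_prompt, "Context:", *kept, "Conversation:", *recent])
```

The prompt and recent history are reserved first here, so retrieved chunks are what gets trimmed when space runs out; a different priority order is equally plausible.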

Step 5: LLM Generation

Send the context to the chosen LLM (GPT-4, Llama, etc.)
Generate natural response
Apply guardrails
Return answer

Response:

Bot: "We offer a 30-day return policy on all products. Items must be unused and in original packaging. Contact support@company.com to initiate a return."

Technical:

Embedding: text-embedding-ada-002
Vector DB: Milvus (40 ms search time)
LLMs: 11 models available
Context window: 4K-128K tokens (model dependent)

Advanced RAG Features¶

Hybrid Search:

Vector similarity (semantic)
Keyword matching (exact)
Combined scoring
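
A simple way to combine the two signals is a weighted sum. The sketch below assumes that scheme and an illustrative weight `alpha`; the source does not say how the scores are actually combined, and the keyword score here is a basic term-overlap ratio.

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def hybrid_score(vector_score: float, kw_score: float, alpha: float = 0.7) -> float:
    """Weighted blend: alpha for semantic similarity, (1 - alpha) for keyword match."""
    return alpha * vector_score + (1 - alpha) * kw_score
```

With `alpha` close to 1 the ranking is mostly semantic; lowering it rewards chunks that contain the user's exact words, such as product codes.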

Re-ranking:

Prioritize recent content
Boost based on user feedback
Domain-specific relevance

Conversation Context:

Remember last 10 messages
Understand follow-ups
Maintain topic coherence

Guardrails:

Filter inappropriate content
Restrict to allowed topics
Fallback responses
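
As a deliberately simple illustration of these three guardrails, the sketch below uses substring checks for topics and blocked terms. Real guardrails are usually more sophisticated (classifiers, moderation services); the keyword approach, the fallback wording, and the function name are assumptions.

```python
FALLBACK = "I'm sorry, I can only help with questions about our products and services."


def apply_guardrails(question: str, draft_answer: str,
                     allowed_topics: list[str], blocked_terms: list[str],
                     fallback: str = FALLBACK) -> str:
    """Return the draft answer, or the fallback if a guardrail is triggered."""
    lowered_question = question.lower()
    # Restrict to allowed topics: at least one topic keyword must appear.
    if not any(topic in lowered_question for topic in allowed_topics):
        return fallback
    # Filter inappropriate content in the generated answer.
    lowered_answer = draft_answer.lower()
    if any(term in lowered_answer for term in blocked_terms):
        return fallback
    return draft_answer
```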

📊 Data Quality & Analytics¶

Knowledge Base Health¶

What it does:
Monitor and improve your data quality.

Metrics:

Coverage:

Topics covered
Data source breakdown
Content freshness
Gap analysis

Performance:

Average confidence score
Retrieval accuracy
Response quality
User satisfaction

Optimization Tips:

Add more Q&A for common questions
Update outdated content
Fill knowledge gaps
Remove redundant data

Dashboard:

Visual analytics
Question coverage heatmap
Unanswered questions log
Improvement suggestions

Testing & Validation¶

What it does:
Test your knowledge base before deployment.

Features:

Test Console:

Ask questions
See retrieved chunks
View confidence scores
Check response quality

Batch Testing:

Upload test questions
Auto-evaluate accuracy
Compare against expected answers
Generate quality report
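
Batch testing can be pictured as running each test question through the bot and scoring the reply against the expected answer. The sketch below uses character-level similarity from `difflib` as the scoring method, which is an assumption, as are the `evaluate` name, the pass threshold, and the report layout.

```python
from difflib import SequenceMatcher
from typing import Callable


def evaluate(test_cases: list[dict[str, str]], ask: Callable[[str], str],
             threshold: float = 0.8) -> dict:
    """Run each question through `ask` and compare the reply with the expected answer."""
    results = []
    for case in test_cases:
        actual = ask(case["question"])
        score = SequenceMatcher(None, actual.lower(), case["expected"].lower()).ratio()
        results.append({
            "question": case["question"],
            "score": round(score, 2),
            "passed": score >= threshold,
        })
    passed = sum(1 for r in results if r["passed"])
    accuracy = passed / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}
```

String similarity penalizes correct answers that are phrased differently, so a semantic comparison (for example, embedding similarity) may suit free-form responses better.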

A/B Testing:

Test different prompts
Compare LLM models
Optimize chunk size
Measure improvements

🔄 Data Updates & Synchronization¶

Keep Knowledge Current¶

What it does:
Automatically keep your chatbot's knowledge up-to-date.

Auto-Refresh:

Schedule URL re-crawls
Detect website changes
Incremental updates
No downtime

Manual Updates:

Re-upload files
Edit Q&A
Add new content
Remove outdated info

Version Control:

Track data changes
Rollback if needed
Compare versions
Audit trail

🔗 Related Documentation¶

Backend:

Client Data Collection Service
Response Services - RAG implementation

Frontend:

Technical:

Embedding Pipeline

"Transform data into intelligent conversations." 🧠📚
