Data Training & Knowledge Base Features
Category: Content Management & AI Training
Target Users: All users (varies by plan)
Business Value: Makes chatbots intelligent and accurate
🎯 Overview
This section covers the RAG (Retrieval-Augmented Generation) pipeline that powers intelligent conversations. Upload data, train your chatbot, and watch it become an expert in your domain.
🌐 Website Data Collection
Automated Website Crawling
What it does:
Automatically extract all content from your website to train your chatbot.
Process:
Step 1: Add URLs
- Enter website URLs (up to 50, depending on plan)
- Supports:
- Homepage
- Product pages
- FAQ pages
- Blog posts
- Documentation
- Any public page
Step 2: Crawl Configuration
- Depth: How many links deep
- Include/exclude patterns
- Respect robots.txt
- Rate limiting (polite crawling)
Step 3: Auto-Crawl
- Scrapes all text content
- Extracts headings, paragraphs, lists
- Ignores images/scripts/styles
- Follows links within domain
- Processes in background
Step 4: Content Processing
- Clean HTML tags
- Remove duplicates
- Chunk into digestible pieces (1024 tokens)
- Generate embeddings (vector representations)
- Store in Milvus vector database
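As a rough illustration of the first three bullets of Step 4, the sketch below strips tags, drops duplicate blocks, and splits text into fixed-size chunks. It is not the service's code: the function names are hypothetical, whitespace-separated words stand in for model tokens, and the embedding and Milvus steps are left out.

```python
import hashlib
import re


def clean_html(raw: str) -> str:
    """Strip tags and collapse whitespace (a crude stand-in for a real HTML parser)."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()


def dedupe(blocks: list[str]) -> list[str]:
    """Keep the first occurrence of each block, compared case-insensitively."""
    seen: set[str] = set()
    unique = []
    for block in blocks:
        digest = hashlib.sha256(block.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(block)
    return unique


def chunk(text: str, max_tokens: int = 1024) -> list[str]:
    """Split text into pieces of at most max_tokens words (assumed token proxy)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
```

A production pipeline would count tokens with the embedding model's own tokenizer rather than by words.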
Limits by Plan:
- Free: 5 URLs
- Pro: 10 URLs
- Business: 25 URLs
- Premium: 50 URLs
Technical:
- Backend: client-data-collection-service
- Method: BeautifulSoup HTML parsing
- Embeddings: OpenAI text-embedding-ada-002
- Storage: Milvus collections partitioned by project_id
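To make Steps 2 and 3 more concrete, here is a minimal sketch of text extraction and link filtering. The service is described as using BeautifulSoup; this example swaps in the standard library's html.parser so it runs without third-party packages. PageExtractor, extract, and should_crawl are hypothetical names, and robots.txt handling and rate limiting are not shown.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageExtractor(HTMLParser):
    """Collect visible text and same-domain links from one HTML page."""

    SKIPPED = {"script", "style"}

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.domain = urlparse(base_url).netloc
        self.text_parts: list[str] = []
        self.links: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                url = urljoin(self.base_url, href)
                if urlparse(url).netloc == self.domain:
                    self.links.append(url)  # follow links within the domain only

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is ignored; everything else is kept.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str, base_url: str) -> tuple[str, list[str]]:
    """Return (visible_text, same_domain_links) for one page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def should_crawl(url: str, depth: int, max_depth: int,
                 include: list[str], exclude: list[str]) -> bool:
    """Apply the depth limit and include/exclude glob patterns from Step 2."""
    if depth > max_depth:
        return False
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return not include or any(fnmatchcase(url, pattern) for pattern in include)
```

Glob patterns are only one way to express include/exclude rules; the actual configuration format is not specified here.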
URL Management
What it does:
Manage your website data sources.
Features:
- View URLs: See all crawled URLs
- Add URLs: Add new pages anytime
- Remove URLs: Delete specific pages
- Re-crawl: Update content when website changes
- Status Tracking: See crawl progress
- Error Handling: Retry failed URLs
Auto-Updates:
- Schedule re-crawls (weekly/monthly)
- Detect content changes
- Incremental updates
- No duplicate processing
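One common way to detect content changes and avoid reprocessing unchanged pages is to compare a hash of each page against the hash stored at the last crawl. The sketch below assumes that approach; the function names and the URL-to-fingerprint mapping are illustrative, not the service's actual schema.

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Hash whitespace-normalized text so cosmetic whitespace changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pages_to_reprocess(stored: dict[str, str], crawled: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or changed since the last crawl.

    stored maps URL -> fingerprint from the previous crawl;
    crawled maps URL -> freshly extracted page text.
    """
    changed = []
    for url, text in crawled.items():
        if stored.get(url) != content_fingerprint(text):
            changed.append(url)
    return changed
```

Only the returned URLs would need to be re-chunked and re-embedded, which is what makes the update incremental.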
📄 File Uploads
Supported File Types
What it does:
Upload documents containing your knowledge base.
Supported Formats:
1. PDF Files
- Product catalogs
- User manuals
- Policy documents
- Research papers
- Presentations (PDF)
2. Microsoft Documents
- Word (.docx)
- Excel (.xlsx) - tables extracted
- PowerPoint (.pptx)
3. Text Files
- Plain text (.txt)
- Markdown (.md)
- CSV (structured data)
Processing:
- Extract text content
- Preserve formatting context
- Handle multi-page documents
- OCR for scanned PDFs (on the roadmap, not yet available)
Limits by Plan:
- Free: 5 MB total
- Pro: 50 MB total
- Business: 200 MB total
- Premium: 1 GB total
Technical:
- Backend: chatbot-file-upload-service
- Collections: files, files_secondary
- File storage: MongoDB GridFS
- Text extraction: PyPDF2, python-docx, openpyxl
File Management
Features:
- Upload Files: Drag & drop or browse
- View Files: List of all uploaded files
- Download Files: Retrieve original files
- Delete Files: Remove specific files
- File Metadata: Size, type, upload date
- Processing Status: Track embedding progress
Smart Chunking:
- Split large documents into chunks
- Maintain context (overlapping chunks)
- Chapter/section awareness
- Optimal chunk size for retrieval
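Overlapping chunks can be produced with a sliding window, so that text near a chunk boundary appears in both neighbors. The following is a minimal sketch of that idea, assuming the document is already tokenized into a list; chunk_with_overlap is a hypothetical name, and section awareness is not modeled.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

For example, seven tokens with size 4 and overlap 2 yield three windows, the last one shorter than the others.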
✍️ Manual Text Input
Direct Text Entry
What it does:
Type or paste content directly into the knowledge base.
Use Cases:
- Quick facts
- Company policies not on website
- Special instructions
- Temporary information
- Q&A pairs
Features:
- Rich text editor
- Formatting support
- Preview before save
- Character count
- Categorization
Technical:
- Immediate embedding
- Same vector storage
- Full-text search enabled
❓ Q&A Builder
Question-Answer Pairs
What it does:
Train your chatbot with specific question-answer pairs for precise responses.
Why It's Powerful:
- Exact Matches: When a user asks the exact question, they get the exact answer
- Common Questions: Pre-load FAQs
- Corrections: Override RAG when needed
- Consistent Answers: The same question always gets the same answer
Interface:
Add Q&A:
Bulk Import:
- Upload CSV with Q&A columns
- Excel import support
- Validate before import
- Duplicate detection
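The validation and duplicate-detection steps of a bulk import might look roughly like the sketch below. It assumes the CSV has columns named question and answer; the real column names and error format are not documented here, and validate_qa_csv is a hypothetical helper.

```python
import csv
import io


def validate_qa_csv(csv_text: str) -> tuple[list[dict[str, str]], list[str]]:
    """Return (valid_rows, errors) for a CSV with 'question' and 'answer' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames or []
    if "question" not in fields or "answer" not in fields:
        return [], ["missing required columns: question, answer"]

    valid: list[dict[str, str]] = []
    errors: list[str] = []
    seen: set[str] = set()
    for row in reader:
        line_no = reader.line_num  # physical line number of the row just read
        question = (row["question"] or "").strip()
        answer = (row["answer"] or "").strip()
        key = question.lower()  # duplicates are compared case-insensitively
        if not question or not answer:
            errors.append(f"line {line_no}: empty question or answer")
        elif key in seen:
            errors.append(f"line {line_no}: duplicate question")
        else:
            seen.add(key)
            valid.append({"question": question, "answer": answer})
    return valid, errors
```

Returning errors alongside the valid rows lets the interface show problems before anything is imported.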
Edit Q&A:
- Inline editing
- Version history
- A/B test different answers
- Activation toggle (enable/disable)
Search Q&A:
- Find specific questions
- Filter by category
- Sort by frequency asked
- Export to CSV
Limits:
- Free: 50 Q&A pairs
- Pro: 200 Q&A pairs
- Business: 1,000 Q&A pairs
- Premium: Unlimited
Technical:
- Collection: qa_pairs (custom structure)
- Priority: Q&A checked before RAG retrieval
- Matching: Fuzzy matching for similar questions
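The "Q&A first, RAG second" priority with fuzzy matching can be sketched with the standard library's difflib. The matching algorithm the product actually uses is not stated, so treat the similarity measure, the 0.85 threshold, and the function names below as assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).rstrip("?!. ")


def match_qa(question: str, qa_pairs: dict[str, str],
             threshold: float = 0.85) -> Optional[str]:
    """Return the best-matching stored answer, or None to fall through to RAG retrieval."""
    best_answer, best_score = None, 0.0
    target = normalize(question)
    for stored_question, answer in qa_pairs.items():
        score = SequenceMatcher(None, target, normalize(stored_question)).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score >= threshold else None
```

A caller would try match_qa first and run the retrieval pipeline only when it returns None.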
🧠 RAG Pipeline
How Knowledge Becomes Answers
What it does:
The AI engine that retrieves relevant information and generates responses.
5-Step Process:
Step 1: User Question
Step 2: Question Embedding
- Convert question to vector (1536 dimensions)
- OpenAI embedding model
- Semantic representation
Step 3: Similarity Search
- Query Milvus vector database
- Find top 5 most similar chunks
- Cosine similarity scoring
- Filter by project_id partition
Step 4: Context Building
- Retrieve matched content chunks
- Add conversation history (last 10 messages)
- Include system prompt
- Total context: ~3,000 tokens
Step 5: LLM Generation
- Send context to chosen LLM (GPT-4, Llama, etc.)
- Generate natural response
- Apply guardrails
- Return answer
Response:
Bot: "We offer a 30-day return policy on all products. Items must be unused and in original packaging. Contact support@company.com to initiate a return."
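Step 3 can be illustrated with a brute-force version of what the vector database does: score every chunk in the project by cosine similarity and keep the top five. This is a pure-Python sketch with a hypothetical in-memory chunk format, not a Milvus client call; Milvus itself uses indexed search rather than a linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: list[dict], project_id: str,
          k: int = 5) -> list[tuple[float, str]]:
    """Return up to k (score, text) pairs for one project, best match first.

    Each chunk is a dict with 'project_id', 'text', and 'vector' keys (assumed layout).
    """
    scored = [
        (cosine_similarity(query, c["vector"]), c["text"])
        for c in chunks
        if c["project_id"] == project_id  # mimic the per-project partition filter
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```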
Technical:
- Embedding: text-embedding-ada-002
- Vector DB: Milvus (40ms search time)
- LLMs: 11 models available
- Context window: 4K-128K tokens (model dependent)
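Context building (Step 4) is essentially a budgeting problem: the system prompt, the recent history, and the question are fixed costs, and retrieved chunks fill whatever budget remains. The sketch below assumes that policy and approximates tokens by whitespace-separated words; build_context and its parameters are illustrative names, and the real prompt layout may differ.

```python
def build_context(system_prompt: str, chunks: list[str], history: list[str],
                  question: str, max_tokens: int = 3000, max_history: int = 10) -> str:
    """Assemble a prompt from the system prompt, retrieved chunks, and recent history."""

    def count(text: str) -> int:
        return len(text.split())  # word count as a rough token estimate

    recent = history[-max_history:]  # keep only the last N messages
    fixed = [system_prompt, *recent, question]
    budget = max_tokens - sum(count(part) for part in fixed)

    kept = []
    for chunk in chunks:  # chunks are assumed to be sorted best match first
        cost = count(chunk)
        if cost > budget:
            break
        kept.append(chunk)
        budget -= cost

    return "\n\n".join([system_prompt, *kept, *recent, question])
```

Stopping at the first chunk that does not fit keeps the highest-ranked chunks and preserves their order.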
Advanced RAG Features
Hybrid Search:
- Vector similarity (semantic)
- Keyword matching (exact)
- Combined scoring
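One simple form of combined scoring is a weighted sum of the semantic score and a keyword-overlap score. The weighting scheme, the alpha value, and the function names below are assumptions for illustration; the product's actual formula is not documented here.

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of distinct query words that appear verbatim in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def hybrid_score(vector_score: float, query: str, text: str, alpha: float = 0.7) -> float:
    """Blend the two signals: alpha weights semantic similarity, 1 - alpha weights keywords."""
    return alpha * vector_score + (1 - alpha) * keyword_score(query, text)
```

With alpha at 1.0 this reduces to pure vector search, and at 0.0 to pure keyword matching.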
Re-ranking:
- Prioritize recent content
- Boost based on user feedback
- Domain-specific relevance
Conversation Context:
- Remember last 10 messages
- Understand follow-ups
- Maintain topic coherence
Guardrails:
- Filter inappropriate content
- Restrict to allowed topics
- Fallback responses
📊 Data Quality & Analytics
Knowledge Base Health
What it does:
Monitor and improve your data quality.
Metrics:
Coverage:
- Topics covered
- Data source breakdown
- Content freshness
- Gap analysis
Performance:
- Average confidence score
- Retrieval accuracy
- Response quality
- User satisfaction
Optimization Tips:
- Add more Q&A for common questions
- Update outdated content
- Fill knowledge gaps
- Remove redundant data
Dashboard:
- Visual analytics
- Question coverage heatmap
- Unanswered questions log
- Improvement suggestions
Testing & Validation
What it does:
Test your knowledge base before deployment.
Features:
Test Console:
- Ask questions
- See retrieved chunks
- View confidence scores
- Check response quality
Batch Testing:
- Upload test questions
- Auto-evaluate accuracy
- Compare against expected answers
- Generate quality report
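A batch test run can be pictured as a loop that asks each question, compares the answer with the expected one, and summarizes the pass rate. The sketch below uses character-level similarity from difflib as a stand-in for whatever evaluation the product applies; evaluate, the ask callable, and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable


def evaluate(cases: list[tuple[str, str]], ask: Callable[[str], str],
             threshold: float = 0.8) -> dict:
    """Run (question, expected_answer) cases through `ask` and build a small report."""
    results = []
    for question, expected in cases:
        actual = ask(question)
        score = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
        results.append({"question": question, "score": score, "passed": score >= threshold})
    passed = sum(1 for r in results if r["passed"])
    accuracy = passed / len(results) if results else 0.0
    return {"accuracy": accuracy, "results": results}
```

String similarity is a blunt measure for free-form answers, so a real report would likely combine it with semantic or human review.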
A/B Testing:
- Test different prompts
- Compare LLM models
- Optimize chunk size
- Measure improvements
🔄 Data Updates & Synchronization
Keep Knowledge Current
What it does:
Automatically keep your chatbot's knowledge up-to-date.
Auto-Refresh:
- Schedule URL re-crawls
- Detect website changes
- Incremental updates
- No downtime
Manual Updates:
- Re-upload files
- Edit Q&A
- Add new content
- Remove outdated info
Version Control:
- Track data changes
- Rollback if needed
- Compare versions
- Audit trail
🔗 Related Documentation
Backend:
- Client Data Collection Service
- Response Services - RAG implementation
Frontend:
Technical:
"Transform data into intelligent conversations." 🧠📚