Skip to content

Data Training & Knowledge Base Features

Category: Content Management & AI Training
Target Users: All users (varies by plan)
Business Value: Makes chatbots intelligent and accurate


🎯 Overview

The RAG (Retrieval-Augmented Generation) pipeline that powers intelligent conversations. Upload data, train your chatbot, and watch it become an expert in your domain.


🌐 Website Data Collection

Automated Website Crawling

What it does:
Automatically extract all content from your website to train your chatbot.

Process:

Step 1: Add URLs

  • Enter website URLs (up to 50 by plan)
  • Supports:
  • Homepage
  • Product pages
  • FAQ pages
  • Blog posts
  • Documentation
  • Any public page

Step 2: Crawl Configuration

  • Depth: How many links deep
  • Include/exclude patterns
  • Respect robots.txt
  • Rate limiting (polite crawling)

Step 3: Auto-Crawl

  • Scrapes all text content
  • Extracts headings, paragraphs, lists
  • Ignores images/scripts/styles
  • Follows links within domain
  • Processes in background

Step 4: Content Processing

  • Clean HTML tags
  • Remove duplicates
  • Chunk into digestible pieces (1024 tokens)
  • Generate embeddings (vector representations)
  • Store in Milvus vector database

Limits by Plan:

  • Free: 5 URLs
  • Pro: 10 URLs
  • Business: 25 URLs
  • Premium: 50 URLs

Technical:

  • Backend: client-data-collection-service
  • Method: BeautifulSoup HTML parsing
  • Embeddings: OpenAI text-embedding-ada-002
  • Storage: Milvus collections partitioned by project_id

URL Management

What it does:
Manage your website data sources.

Features:

  • View URLs: See all crawled URLs
  • Add URLs: Add new pages anytime
  • Remove URLs: Delete specific pages
  • Re-crawl: Update content when website changes
  • Status Tracking: See crawl progress
  • Error Handling: Retry failed URLs

Auto-Updates:

  • Schedule re-crawls (weekly/monthly)
  • Detect content changes
  • Incremental updates
  • No duplicate processing

📄 File Uploads

Supported File Types

What it does:
Upload documents containing your knowledge base.

Supported Formats:

1. PDF Files

  • Product catalogs
  • User manuals
  • Policy documents
  • Research papers
  • Presentations (PDF)

2. Microsoft Documents

  • Word (.docx)
  • Excel (.xlsx) - tables extracted
  • PowerPoint (.pptx)

3. Text Files

  • Plain text (.txt)
  • Markdown (.md)
  • CSV (structured data)

Processing:

  • Extract text content
  • Preserve formatting context
  • Handle multi-page documents
  • OCR for scanned PDFs (roadmap)

Limits by Plan:

  • Free: 5 MB total
  • Pro: 50 MB total
  • Business: 200 MB total
  • Premium: 1 GB total

Technical:

  • Backend: chatbot-file-upload-service
  • Collections: files, files_secondary
  • File storage: MongoDB GridFS
  • Text extraction: PyPDF2, python-docx, openpyxl

File Management

Features:

  • Upload Files: Drag & drop or browse
  • View Files: List of all uploaded files
  • Download Files: Retrieve original files
  • Delete Files: Remove specific files
  • File Metadata: Size, type, upload date
  • Processing Status: Track embedding progress

Smart Chunking:

  • Split large documents into chunks
  • Maintain context (overlapping chunks)
  • Chapter/section awareness
  • Optimal chunk size for retrieval

✍️ Manual Text Input

Direct Text Entry

What it does:
Type or paste content directly into the knowledge base.

Use Cases:

  • Quick facts
  • Company policies not on website
  • Special instructions
  • Temporary information
  • Q&A pairs

Features:

  • Rich text editor
  • Formatting support
  • Preview before save
  • Character count
  • Categorization

Technical:

  • Immediate embedding
  • Same vector storage
  • Full-text search enabled

❓ Q&A Builder

Question-Answer Pairs

What it does:
Train your chatbot with specific question-answer pairs for precise responses.

Why It's Powerful:

  • Exact Matches: When user asks exact question, get exact answer
  • Common Questions: Pre-load FAQs
  • Corrections: Override RAG when needed
  • Consistent Answers: Same question always gets same answer

Interface:

Add Q&A:

Question: What are your business hours?
Answer: We're open Monday-Friday, 9 AM - 5 PM EST.

Bulk Import:

  • Upload CSV with Q&A columns
  • Excel import support
  • Validate before import
  • Duplicate detection

Edit Q&A:

  • Inline editing
  • Version history
  • A/B test different answers
  • Activation toggle (enable/disable)

Search Q&A:

  • Find specific questions
  • Filter by category
  • Sort by frequency asked
  • Export to CSV

Limits:

  • Free: 50 Q&A pairs
  • Pro: 200 Q&A pairs
  • Business: 1,000 Q&A pairs
  • Premium: Unlimited

Technical:

  • Collection: qa_pairs (custom structure)
  • Priority: Q&A checked before RAG retrieval
  • Matching: Fuzzy matching for similar questions

🧠 RAG Pipeline

How Knowledge Becomes Answers

What it does:
The AI engine that retrieves relevant information and generates responses.

5-Step Process:

Step 1: User Question

User: "What's your return policy?"

Step 2: Question Embedding

  • Convert question to vector (1536 dimensions)
  • OpenAI embedding model
  • Semantic representation

Step 3: Similarity Search

  • Query Milvus vector database
  • Find top 5 most similar chunks
  • Cosine similarity scoring
  • Filter by project_id partition

Step 4: Context Building

  • Retrieve matched content chunks
  • Add conversation history (last 10 messages)
  • Include system prompt
  • Total context: ~3,000 tokens

Step 5: LLM Generation

  • Send context to chosen LLM (GPT-4, Llama, etc.)
  • Generate natural response
  • Apply guardrails
  • Return answer

Response:

Bot: "We offer a 30-day return policy on all products. Items must be unused and in original packaging. Contact support@company.com to initiate a return."

Technical:

  • Embedding: text-embedding-ada-002
  • Vector DB: Milvus (40ms search time)
  • LLMs: 11 models available
  • Context window: 4K-128K tokens (model dependent)

Advanced RAG Features

Hybrid Search:

  • Vector similarity (semantic)
  • Keyword matching (exact)
  • Combined scoring

Re-ranking:

  • Prioritize recent content
  • Boost based on user feedback
  • Domain-specific relevance

Conversation Context:

  • Remember last 10 messages
  • Understand follow-ups
  • Maintain topic coherence

Guardrails:

  • Filter inappropriate content
  • Restrict to allowed topics
  • Fallback responses

📊 Data Quality & Analytics

Knowledge Base Health

What it does:
Monitor and improve your data quality.

Metrics:

Coverage:

  • Topics covered
  • Data source breakdown
  • Content freshness
  • Gap analysis

Performance:

  • Average confidence score
  • Retrieval accuracy
  • Response quality
  • User satisfaction

Optimization Tips:

  • Add more Q&A for common questions
  • Update outdated content
  • Fill knowledge gaps
  • Remove redundant data

Dashboard:

  • Visual analytics
  • Question coverage heatmap
  • Unanswered questions log
  • Improvement suggestions

Testing & Validation

What it does:
Test your knowledge base before deployment.

Features:

Test Console:

  • Ask questions
  • See retrieved chunks
  • View confidence scores
  • Check response quality

Batch Testing:

  • Upload test questions
  • Auto-evaluate accuracy
  • Compare against expected answers
  • Generate quality report

A/B Testing:

  • Test different prompts
  • Compare LLM models
  • Optimize chunk size
  • Measure improvements

🔄 Data Updates & Synchronization

Keep Knowledge Current

What it does:
Automatically keep your chatbot's knowledge up-to-date.

Auto-Refresh:

  • Schedule URL re-crawls
  • Detect website changes
  • Incremental updates
  • No downtime

Manual Updates:

  • Re-upload files
  • Edit Q&A
  • Add new content
  • Remove outdated info

Version Control:

  • Track data changes
  • Rollback if needed
  • Compare versions
  • Audit trail

Backend:

- Client Data Collection Service

Frontend:

Technical:


"Transform data into intelligent conversations." 🧠📚