Response Voice Chatbot Service - Complete Developer Documentation¶
Service: LLM Response Generation for Voice Chatbots (RAG + TTS)
Port: 8013
Purpose: Generate AI responses with voice synthesis using RAG and Azure TTS
Technology: FastAPI, Azure OpenAI (Multiple Models), Azure TTS, Milvus, FastEmbed
Code Location: /response-voice-chatbot-service/src/main.py (775 lines)
Owner: Backend Team
Last Updated: 2025-12-26
Table of Contents¶
- Service Overview
- Complete Architecture
- Multi-LLM Support
- Azure TTS Integration
- Complete Endpoint
- Security Analysis
- Performance
- Deployment
Service Overview¶
The Response Voice Chatbot Service extends the text chatbot RAG pipeline with voice synthesis capabilities. It generates AI responses and converts them to speech using Azure Cognitive Services TTS.
Key Responsibilities¶
✅ RAG Pipeline - Semantic search + LLM generation
✅ Milvus Vector Search - Find relevant content chunks
✅ Multi-LLM Support - 8 different LLM models
✅ Azure TTS - Convert responses to speech
✅ Base64 Audio - Return audio as base64 string
✅ Contact Scrubbing - Remove phone numbers/URLs from voice output
Technology Stack¶
| Technology | Purpose | Specifications |
|---|---|---|
| Azure OpenAI | Primary LLMs | GPT-4, GPT-3.5-Turbo-16k, GPT-4o-mini, o1-mini |
| Azure Llama | Alternative LLM | Llama-3.3-70B-Instruct |
| DeepSeek R1 | Alternative LLM | DeepSeek-R1 |
| Ministral | Alternative LLM | Ministral-3B |
| Phi-3 | Alternative LLM | Phi-3-small-8k |
| Azure TTS | Speech synthesis | Neural voices (10 options) |
| Milvus | Vector database | Cosine similarity search |
| FastEmbed | Embeddings | BAAI/bge-small-en-v1.5 (384D) |
Statistics¶
- Total Lines: 775
- Endpoints: 1 main endpoint
- Supported LLMs: 8 models
- Supported Voices: 10 (5 male, 5 female)
- System Prompts: 3 (Sales-Agent, Service-Agent, Informational-Agent)
- Default Top-K: 5 chunks
- Average Response Time: 3-7 seconds (includes TTS)
Complete Architecture¶
End-to-End Data Flow¶
graph TB
USER["User Question<br/>(Voice Input)"]
subgraph "Step 1: RAG Retrieval"
EMBED["Generate Embedding<br/>(BAAI/bge-small-en-v1.5)"]
MILVUS["Milvus Search<br/>(Top-5 Chunks)"]
TRUNCATE["Truncate Context<br/>(Max 10K tokens)"]
end
subgraph "Step 2: History & Config"
MONGO["MongoDB Lookup<br/>(chatbot_history)"]
CONFIG["Get Chatbot Config<br/>(purpose + voice)"]
SYSPROMPT["Build System Prompt<br/>(with chatbot name)"]
end
subgraph "Step 3: LLM Generation"
MESSAGES["Construct Messages<br/>(system + history + context)"]
TOKENCHECK["Token Limit Check<br/>(Max 16K)"]
LLM["Azure OpenAI<br/>GPT-3.5-Turbo-16k"]
CLEAN["Remove Markdown<br/>(*, #)"]
end
subgraph "Step 4: TTS Generation"
SCRUB["Scrub Contacts<br/>(Phone/URL removal)"]
AZURETTS["Azure TTS<br/>(Neural Voice)"]
WAV["Generate WAV<br/>File"]
B64["Encode Base64"]
CLEANUP["Delete WAV<br/>File"]
end
subgraph "Step 5: Save & Return"
SAVE["Save to<br/>chatbot_history"]
RESP["Return Text<br/>+ Audio"]
end
USER --> EMBED
EMBED --> MILVUS
MILVUS --> TRUNCATE
USER --> MONGO
USER --> CONFIG
CONFIG --> SYSPROMPT
TRUNCATE --> MESSAGES
MONGO --> MESSAGES
SYSPROMPT --> MESSAGES
MESSAGES --> TOKENCHECK
TOKENCHECK --> LLM
LLM --> CLEAN
CLEAN --> SCRUB
SCRUB --> AZURETTS
AZURETTS --> WAV
WAV --> B64
B64 --> CLEANUP
CLEAN --> SAVE
CLEANUP --> RESP
SAVE --> RESP
RESP --> USER
style USER fill:#e1f5fe
style LLM fill:#fff3e0
style AZURETTS fill:#ffecb3
style MILVUS fill:#f3e5f5
style SAVE fill:#c8e6c9
Multi-LLM Support¶
8 Supported LLM Models¶
The service integrates 8 different LLM models, though only GPT-3.5-Turbo-16k is actively used in the main endpoint.
1. Azure OpenAI GPT-4¶
Function: call_openai_4() (Lines 337-347)
Configuration:
endpoint_gpt4 = "https://machineagentopenai.openai.azure.com/openai/deployments/gpt-4-0613/..."
deployment_gpt4 = "gpt-4-0613"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!
client = AzureOpenAI(
azure_endpoint=endpoint_gpt4,
api_key=subscription_key,
api_version="2024-02-15-preview"
)
Usage:
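The call itself is not reproduced in this document; it presumably mirrors the GPT-3.5 call shown in the next section. A minimal illustrative sketch (the exact body of call_openai_4() at Lines 337-347 may differ):
# Illustrative only -- mirrors the documented GPT-3.5 usage pattern
messages = [
    {"role": "system", "content": "You are a helpful service agent."},
    {"role": "user", "content": "What are your pricing options?"},
]
response = client.chat.completions.create(
    model=deployment_gpt4,   # "gpt-4-0613"
    messages=messages,
    temperature=0.7
)
answer = response.choices[0].message.content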
2. Azure OpenAI GPT-3.5-Turbo-16k ⭐¶
Function: call_openai_35() (Lines 361-371)
⭐ ACTIVELY USED in main endpoint (Line 701)
Configuration:
endpoint_gpt35 = "https://machineagentopenai.openai.azure.com/openai/deployments/gpt-35-turbo-16k-0613/..."
deployment_gpt35 = "gpt-35-turbo-16k-0613"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!
client_gpt35 = AzureOpenAI(
azure_endpoint=endpoint_gpt35,
api_key=subscription_key,
api_version="2024-02-15-preview"
)
Usage in Endpoint:
response = client_gpt35.chat.completions.create(
model="gpt-35-turbo-16k-0613",
messages=messages,
temperature=0.7
)
3. Azure OpenAI GPT-4o-mini¶
Function: call_openai_4o() (Lines 385-395)
Configuration:
endpoint_gpt4o = "https://machineagentopenai.openai.azure.com/openai/deployments/gpt-4o-mini-2024-07-18/..."
deployment_gpt4o = "gpt-4o-mini-2024-07-18"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!
4. Azure OpenAI o1-mini¶
Function: call_openai_gpto1mini() (Lines 409-419)
Configuration:
endpoint_gpto1mini = "https://machineagentopenai.openai.azure.com/openai/deployments/o1-mini-2024-09-12/..."
deployment_gpto1mini = "o1-mini-2024-09-12"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!
api_version = "2024-09-12-" # Note the trailing dash (typo?)
5. Azure Llama 3.3-70B-Instruct¶
Function: call_llama() (Lines 432-440)
Configuration:
AZURE_LLAMA_ENDPOINT = "https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/chat/completions"
AZURE_API_KEY = "JOfcw0VW0dS31Z8XgkNRSP9tUaBiwUYZ" # ⚠️ Hardcoded!
llama_HEADERS = {
"Content-Type": "application/json",
"Authorization": f"Bearer {AZURE_API_KEY}"
}
Usage:
payload = {"messages": messages, "temperature": 0.7, "max_tokens": 50}
response = requests.post(AZURE_LLAMA_ENDPOINT, json=payload, headers=llama_HEADERS)
6. DeepSeek R1¶
Function: call_deepseek() (Lines 453-464)
Configuration:
deepseek_api_url = "https://DeepSeek-R1-imalr.eastus2.models.ai.azure.com/chat/completions"
deepseek_api_key = "GwUcGzHhhUbvApfMR4aq1ZPFUic6lbWE" # ⚠️ Hardcoded!
deepseekheaders = {
"Content-Type": "application/json",
"Authorization": f"Bearer {deepseek_api_key}"
}
Special Feature: Removes <think>...</think> tags from response
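A minimal sketch of how that tag stripping is typically done with a regex (the actual implementation at Lines 453-464 may differ):
import re

# Assumption: DeepSeek-R1 wraps its chain-of-thought in <think>...</think>;
# strip it so only the final answer is kept.
raw = response.json()["choices"][0]["message"]["content"]
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()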
7. Ministral-3B¶
Function: call_Ministral() (Lines 475-486)
Configuration:
Ministral_api_url = "https://Ministral-3B-rvgab.eastus2.models.ai.azure.com/chat/completions"
Ministral_api_key = "Z7fNcdnw5Tht1xAz6VlgUlLOeZoVTkIf" # ⚠️ Hardcoded!
8. Phi-3-small-8k¶
Function: call_phi() (Lines 497-508)
Configuration:
phi_api_url = "https://Phi-3-small-8k-instruct-qvlpq.eastus2.models.ai.azure.com/chat/completions"
phi_api_key = "T8I14He3lbMyAyUwNfffwG58e23EcXsU" # ⚠️ Hardcoded!
Azure TTS Integration¶
Supported Voices¶
10 Azure Neural Voices (Lines 207-218)
| Voice Code | Azure Voice Name | Gender | Accent | Notes |
|---|---|---|---|---|
| Male_1 | en-US-EricNeural | Male | US | Professional |
| Male_2 | en-US-GuyNeural | Male | US | Friendly |
| Male_3 | en-CA-LiamNeural | Male | Canadian | Neutral |
| Male_IND | en-IN-PrabhatNeural | Male | Indian | Regional |
| Female_1 | en-US-AvaMultilingualNeural | Female | US | Multilingual |
| Female_2 | en-US-JennyNeural | Female | US | Natural |
| Female_3 | en-US-EmmaMultilingualNeural | Female | US | Multilingual |
| Female_4 | en-AU-NatashaNeural | Female | Australian | Regional |
| Female_IND | en-IN-NeerjaExpressiveNeural | Female | Indian | Expressive |
| Female_IND2 | en-IN-NeerjaNeural | Female | Indian | Standard |
TTS Generation Function¶
Function: text_to_speech() (Lines 222-265)
Complete Flow:
Step 1: Clean Text
Step 2: Configure Azure Speech SDK
speech_config = speechsdk.SpeechConfig(
subscription="9N41NOfDyVDoduiD4EjlzmZU9CbUX3pPqWfLCORpl7cBf0l2lzVQJQQJ99BCACGhslBXJ3w3AAAYACOG2329", # ⚠️ HARDCODED!
region="centralindia"
)
speech_config.speech_synthesis_voice_name = voice # e.g., "en-US-JennyNeural"
Step 3: Generate WAV File
wav_file = os.path.join(OUTPUT_DIR, f"{user_id}.wav")
audio_config = speechsdk.audio.AudioOutputConfig(filename=wav_file)
speech_synthesizer = speechsdk.SpeechSynthesizer(
speech_config=speech_config,
audio_config=audio_config
)
result = speech_synthesizer.speak_text_async(cleaned_text).get()
Step 4: Check Result
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
print("Speech synthesis completed successfully.")
else:
error_details = result.cancellation_details
raise Exception(f"Speech synthesis failed: {error_details.reason}")
Returns: Path to WAV file
Contact Scrubbing¶
Function: remove_contact_numbers() (Lines 511-521)
Purpose: Remove sensitive information from voice responses
Patterns Removed:
1. Phone Numbers:
phone_number_pattern = r"\+?[0-9]{1,4}[-.\s]?[0-9]{1,3}[-.\s]?[0-9]{3}[-.\s]?[0-9]{3,4}"
# Matches: +1-555-123-4567, 555.123.4567, etc.
Replacement: "the number provided in the chat"
2. URLs:
url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
# Matches: https://example.com, http://site.com/path, etc.
Replacement: "via the url provided in the chat"
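Putting the two patterns together, a minimal sketch of the scrubbing step (the actual remove_contact_numbers() at Lines 511-521 may differ slightly):
import re

phone_number_pattern = r"\+?[0-9]{1,4}[-.\s]?[0-9]{1,3}[-.\s]?[0-9]{3}[-.\s]?[0-9]{3,4}"
url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

def remove_contact_numbers(text: str) -> str:
    # Swap phone numbers and URLs for spoken-friendly placeholders
    text = re.sub(phone_number_pattern, "the number provided in the chat", text)
    text = re.sub(url_pattern, "via the url provided in the chat", text)
    return text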
Example:
# Input
"Call us at +1-555-123-4567 or visit https://example.com/pricing"
# Output (for voice)
"Call us at the number provided in the chat or visit via the url provided in the chat"
Why? A voice assistant shouldn't read long phone numbers or URLs aloud; listeners can't retain them from speech.
Complete Endpoint¶
POST /v2/get-response-voice-chatbot¶
Purpose: Generate AI response with voice synthesis
Code Location: Lines 744-772 (entry point) + Lines 523-740 (processing)
Request:
POST /v2/get-response-voice-chatbot
Content-Type: multipart/form-data
user_id=User-123456
project_id=User-123456_Project_1
session_id=session_20250115_140530
question=What are your pricing options?
Response:
{
"text": "We offer three pricing tiers: Basic at 29 dollars per month, Pro at 99 dollars, and Enterprise with custom pricing. Which interests you most?",
"audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA..."
}
Fields:
- text - Clean text response (markdown removed)
- audio - Base64-encoded WAV file
Processing Flow¶
Entry Point (Lines 744-772):
@app.post("/v2/get-response-voice-chatbot")
async def ask_gpt_voicechat(
user_id: str = Form(...),
project_id: str = Form(...),
session_id: str = Form(...),
question: str = Form(...)
):
# 1. Fetch chatbot config
chatbot_config = chatbot_collection.find_one({"user_id": user_id, "project_id": project_id})
chatbot_purpose = chatbot_config.get("chatbot_purpose", "Service-Agent")
chatbot_voice = chatbot_config.get("voice", "Female_3")
chatbot_name = chatbot_config.get("hidden_name")
# 2. Execute in thread pool (for synchronous TTS)
loop = asyncio.get_event_loop()
with concurrent.futures.ThreadPoolExecutor() as pool:
result = await loop.run_in_executor(
pool, sync_ask_voicechatbot, user_id, project_id, session_id,
chatbot_purpose, chatbot_voice, question
)
return result
Why ThreadPoolExecutor? The Azure TTS SDK is synchronous and blocks; running it in a worker thread keeps the async event loop responsive.
Main Processing Logic¶
Function: sync_ask_voicechatbot() (Lines 523-740)
Complete Step-by-Step:
Step 1: Milvus RAG Retrieval (Lines 536-584)
# Generate question embedding
embedder = Embedding(model_name="BAAI/bge-small-en-v1.5", max_length=512)
question_embedding = list(embedder.embed([question]))[0]
question_embedding_list = [float(x) for x in question_embedding]
# Search Milvus
search_results = milvus_embeddings.search_embeddings(
collection_name="embeddings",
query_vector=question_embedding_list,
user_id=user_id,
project_id=project_id,
top_k=5
)
# Combine top chunks
combined_text = "\n\n".join([result.get("text", "") for result in search_results])
most_relevant_document = combined_text
# Truncate if too large (max 10K tokens)
MAX_DOCUMENT_TOKENS = 10000
if count_tokens(most_relevant_document) > MAX_DOCUMENT_TOKENS:
most_relevant_document = truncate_text_by_tokens(most_relevant_document, MAX_DOCUMENT_TOKENS)
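count_tokens() and truncate_text_by_tokens() are not shown in this section. A plausible tiktoken-based sketch of what they do (assumption -- the actual helpers may use a different encoding or strategy):
import tiktoken

# Assumption: cl100k_base is the encoding used for the GPT-3.5/GPT-4 family
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(encoding.encode(text))

def truncate_text_by_tokens(text: str, max_tokens: int) -> str:
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:max_tokens])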
Step 2: Build System Prompt (Lines 586-597)
chatbot_settings = chatbot_collection.find_one({"user_id": user_id, "project_id": project_id})
chatbot_name = chatbot_settings.get("hidden_name", "Unknown Chatbot")
# Inject chatbot name into system prompt
system_prompt = f"Your name is {chatbot_name}. {system_prompts.get(chatbot_purpose, system_prompts['Service-Agent'])}"
Example (illustrative): with hidden_name = "Nova" and chatbot_purpose = "Service-Agent", the system prompt becomes "Your name is Nova. " followed by the Service-Agent prompt text.
Step 3: Retrieve Chat History (Lines 599-660)
MAX_CONTEXT_TOKENS = 12000
chat_session = history_collection.find_one({
"project_id": project_id,
"user_id": user_id,
"session_id": session_id
})
# Build history with token limit
temp_history = []
total_tokens = 0
for msg in chat_session["chat_data"]:
message_text = f"User: {msg['input_prompt']}\nAssistant: {msg['output_response']}"
msg_tokens = count_tokens(message_text)
if total_tokens + msg_tokens > MAX_CONTEXT_TOKENS:
break
temp_history.append(message_text)
total_tokens += msg_tokens
chat_history_text = "\n\n".join(reversed(temp_history)) # Chronological order
Step 4: Construct Final Prompt (Lines 662-690)
context_parts = []
remaining_tokens = MAX_CONTEXT_TOKENS - count_tokens(system_prompt) - count_tokens(question) - 100
# Add chat history
if chat_history_text:
context_parts.append(f"Previous conversation:\n{chat_history_text}")
remaining_tokens -= count_tokens(chat_history_text)
# Add relevant document
if most_relevant_document:
if count_tokens(most_relevant_document) > remaining_tokens:
most_relevant_document = truncate_text_by_tokens(most_relevant_document, remaining_tokens)
context_parts.append(f"Relevant document:\n{most_relevant_document}")
prompt = "\n\n".join(context_parts + [f"Current question: {question}"])
# Final safety check
total_tokens = count_tokens(system_prompt) + count_tokens(prompt)
if total_tokens > 15000: # Buffer from 16K limit
# Keep only last 2 exchanges
recent_history = "\n\n".join(chat_history_text.split("\n\n")[-2:])
prompt = f"Recent conversation:\n{recent_history}\n\nCurrent question: {question}"
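Illustratively, the assembled prompt in the normal (non-truncated) path has this shape (content abbreviated):
Previous conversation:
User: ...
Assistant: ...

Relevant document:
<top-5 Milvus chunks, truncated to fit>

Current question: What are your pricing options?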
Step 5: LLM Generation (Lines 692-709)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": prompt}
]
response = client_gpt35.chat.completions.create(
model="gpt-35-turbo-16k-0613",
messages=messages,
temperature=0.7
)
raw_answer = response.choices[0].message.content
clean_answer = re.sub(r'[\*#]', '', raw_answer).strip() # Remove markdown
answer = clean_answer
Step 6: TTS Generation (Lines 711-724)
voice = SUPPORTED_VOICES.get(chatbot_voice) # e.g., "en-US-JennyNeural"
wav_file = asyncio.run(text_to_speech(answer, user_id, voice))
# Encode as base64
with open(wav_file, "rb") as audio_file:
audio_base64 = base64.b64encode(audio_file.read()).decode("utf-8")
# Cleanup
os.remove(wav_file)
Step 7: Save History (Lines 726-728)
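The save step is not reproduced here. A plausible sketch based on the chat_data fields read in Step 3 (assumption -- the actual update at Lines 726-728 may differ):
# Assumption: appends the new exchange to the session's chat_data array
history_collection.update_one(
    {"user_id": user_id, "project_id": project_id, "session_id": session_id},
    {"$push": {"chat_data": {"input_prompt": question, "output_response": answer}}},
    upsert=True
)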
Step 8: Return Response (Lines 730-734)
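The final return value matches the response shape documented above; a sketch:
return {"text": answer, "audio": audio_base64}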
Security Analysis¶
CRITICAL Security Issues¶
1. ⚠️ MULTIPLE HARDCODED API KEYS!
Azure OpenAI API Key (Lines 328, 352, 376, 401)
subscription_key = "AZxDVMYB08AaUip0i5ed1sy73ZpUsqencYYxKDbm6nfWfG1AqPZ3JQQJ99BCACYeBjFXJ3w3AAABACOGVUo7"
Azure TTS Subscription Key (Line 237)
Llama API Key (Line 424)
DeepSeek API Key (Line 445)
Ministral API Key (Line 468)
Phi-3 API Key (Line 490)
⚠️ 6 HARDCODED API KEYS IN TOTAL!
2. ⚠️ Hardcoded Database Name (Line 78)
Problem: Hardcoded to "Machine_agent_dev" even after loading from environment
3. ⚠️ File System Race Condition
Problem: WAV files are named {user_id}.wav, so concurrent requests from the same user will overwrite each other's files
Better:
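One possible fix (not in the current source) is to make the filename unique per request:
import os
import uuid

# Hypothetical fix: a unique filename per request avoids concurrent overwrites
wav_file = os.path.join(OUTPUT_DIR, f"{user_id}_{uuid.uuid4().hex}.wav")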
4. ⚠️ Limited Error Handling in TTS
Problem: If TTS fails, the entire request fails; there is no fallback to a text-only response
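A possible mitigation (not present in the source) would be to catch synthesis failures inside sync_ask_voicechatbot() and still return the text answer:
# Hypothetical fallback: degrade to a text-only response if synthesis fails
try:
    wav_file = asyncio.run(text_to_speech(answer, user_id, voice))
    with open(wav_file, "rb") as audio_file:
        audio_base64 = base64.b64encode(audio_file.read()).decode("utf-8")
    os.remove(wav_file)
except Exception as tts_error:
    print(f"TTS failed, returning text only: {tts_error}")
    audio_base64 = None
return {"text": answer, "audio": audio_base64}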
Performance¶
Response Time Breakdown¶
| Step | Avg Latency | Notes |
|---|---|---|
| 1. Embedding generation | 50-100ms | BAAI/bge-small-en-v1.5 |
| 2. Milvus search | 50-150ms | Top-5 chunks |
| 3. MongoDB history | 20-50ms | Simple query |
| 4. Azure OpenAI | 1-3 seconds | GPT-3.5-Turbo-16k |
| 5. Azure TTS | 1-3 seconds | Neural voice synthesis |
| 6. Base64 encoding | 10-50ms | Depends on audio length |
| 7. MongoDB save | 20-50ms | Update operation |
| TOTAL | 3-7 seconds | User-facing latency |
TTS adds 1-3 seconds compared to text-only responses.
Deployment¶
Docker Configuration¶
Dockerfile:
FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy shared modules
COPY shared/ ./shared/
COPY src/ .
# Create TTS output directory
RUN mkdir -p tts_audio
EXPOSE 8013
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8013"]
Requirements.txt¶
fastapi>=0.95.0
uvicorn[standard]>=0.22.0
pymongo>=4.3.3
python-multipart>=0.0.6
python-dotenv>=1.0.0
# Azure Services
azure-cognitiveservices-speech>=1.30.0
openai>=1.0.0
# Embeddings & ML
fastembed>=0.1.0
scikit-learn>=1.3.0
numpy>=1.24.0
# Utilities
tiktoken>=0.5.0
pytz>=2023.3
requests>=2.31.0
# Monitoring
ddtrace>=1.19.0
Environment Variables¶
# Azure OpenAI
AZURE_OPENAI_API_KEY=<your-key>
ENDPOINT_URL=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_NAME=gpt-35-turbo-16k-0613
# MongoDB
MONGO_URI=mongodb://...
MONGO_DB_NAME=Machine_agent_demo
# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530
# DataDog
DD_SERVICE=response-voice-chatbot-service
DD_ENV=production
Related Documentation¶
- Response Text Chatbot Service - Text variant
- Response 3D Chatbot Service - 3D variant (MOST CRITICAL - next)
- Data Crawling Service - Creates embeddings
Recommendations¶
CRITICAL (Security)¶
- ⚠️ Move ALL API Keys to Environment (6 hardcoded keys!)
- ⚠️ Fix Hardcoded Database Name - Use environment variable
- ⚠️ Fix WAV File Naming - Use UUID to prevent collisions
- ⚠️ Add TTS Fallback - Return text-only if TTS fails
Performance¶
- Cache TTS Outputs - Same response = reuse audio
- Async TTS - Use async Azure SDK if available
- Streaming Response - Stream text first, then audio
- Reduce LLM Clients - Only initialize models that are used
Code Quality¶
- Remove Unused LLM Functions - 8 models defined, only 1 used
- Extract Configuration - Move all endpoints/keys to config file
- Add Type Hints - Complete typing
- Add Unit Tests - Test contact scrubbing, voice selection
Last Updated: 2025-12-26
Code Version: response-voice-chatbot-service/src/main.py (775 lines)
Total Endpoints: 1
Supported LLMs: 8
Supported Voices: 10
Review Cycle: Monthly
"Conversations that speak for themselves."