
Response Voice Chatbot Service - Complete Developer Documentation

Service: LLM Response Generation for Voice Chatbots (RAG + TTS)
Port: 8013
Purpose: Generate AI responses with voice synthesis using RAG and Azure TTS
Technology: FastAPI, Azure OpenAI (Multiple Models), Azure TTS, Milvus, FastEmbed
Code Location: /response-voice-chatbot-service/src/main.py (775 lines)
Owner: Backend Team
Last Updated: 2025-12-26


Table of Contents

  1. Service Overview
  2. Complete Architecture
  3. Multi-LLM Support
  4. Azure TTS Integration
  5. Complete Endpoint
  6. Security Analysis
  7. Performance
  8. Deployment

Service Overview

The Response Voice Chatbot Service extends the text chatbot RAG pipeline with voice synthesis capabilities. It generates AI responses and converts them to speech using Azure Cognitive Services TTS.

Key Responsibilities

RAG Pipeline - Semantic search + LLM generation
Milvus Vector Search - Find relevant content chunks
Multi-LLM Support - 8 different LLM models
Azure TTS - Convert responses to speech
Base64 Audio - Return audio as base64 string
Contact Scrubbing - Remove phone numbers/URLs from voice output

Technology Stack

| Technology   | Purpose          | Specifications                                 |
|--------------|------------------|------------------------------------------------|
| Azure OpenAI | Primary LLMs     | GPT-4, GPT-3.5-Turbo-16k, GPT-4o-mini, o1-mini |
| Azure Llama  | Alternative LLM  | Llama-3.3-70B-Instruct                         |
| DeepSeek R1  | Alternative LLM  | DeepSeek-R1                                    |
| Ministral    | Alternative LLM  | Ministral-3B                                   |
| Phi-3        | Alternative LLM  | Phi-3-small-8k                                 |
| Azure TTS    | Speech synthesis | Neural voices (10 options)                     |
| Milvus       | Vector database  | Cosine similarity search                       |
| FastEmbed    | Embeddings       | BAAI/bge-small-en-v1.5 (384D)                  |

Statistics

  • Total Lines: 775
  • Endpoints: 1 main endpoint
  • Supported LLMs: 8 models
  • Supported Voices: 10 (5 male, 5 female)
  • System Prompts: 3 (Sales-Agent, Service-Agent, Informational-Agent)
  • Default Top-K: 5 chunks
  • Average Response Time: 3-7 seconds (includes TTS)

Complete Architecture

End-to-End Data Flow

graph TB
    USER["User Question<br/>(Voice Input)"]

    subgraph "Step 1: RAG Retrieval"
        EMBED["Generate Embedding<br/>(BAAI/bge-small-en-v1.5)"]
        MILVUS["Milvus Search<br/>(Top-5 Chunks)"]
        TRUNCATE["Truncate Context<br/>(Max 10K tokens)"]
    end

    subgraph "Step 2: History & Config"
        MONGO["MongoDB Lookup<br/>(chatbot_history)"]
        CONFIG["Get Chatbot Config<br/>(purpose + voice)"]
        SYSPROMPT["Build System Prompt<br/>(with chatbot name)"]
    end

    subgraph "Step 3: LLM Generation"
        MESSAGES["Construct Messages<br/>(system + history + context)"]
        TOKENCHECK["Token Limit Check<br/>(Max 16K)"]
        LLM["Azure OpenAI<br/>GPT-3.5-Turbo-16k"]
        CLEAN["Remove Markdown<br/>(*, #)"]
    end

    subgraph "Step 4: TTS Generation"
        SCRUB["Scrub Contacts<br/>(Phone/URL removal)"]
        AZURETTS["Azure TTS<br/>(Neural Voice)"]
        WAV["Generate WAV<br/>File"]
        B64["Encode Base64"]
        CLEANUP["Delete WAV<br/>File"]
    end

    subgraph "Step 5: Save & Return"
        SAVE["Save to<br/>chatbot_history"]
        RESP["Return Text<br/>+ Audio"]
    end

    USER --> EMBED
    EMBED --> MILVUS
    MILVUS --> TRUNCATE

    USER --> MONGO
    USER --> CONFIG
    CONFIG --> SYSPROMPT

    TRUNCATE --> MESSAGES
    MONGO --> MESSAGES
    SYSPROMPT --> MESSAGES

    MESSAGES --> TOKENCHECK
    TOKENCHECK --> LLM
    LLM --> CLEAN

    CLEAN --> SCRUB
    SCRUB --> AZURETTS
    AZURETTS --> WAV
    WAV --> B64
    B64 --> CLEANUP

    CLEAN --> SAVE
    CLEANUP --> RESP
    SAVE --> RESP

    RESP --> USER

    style USER fill:#e1f5fe
    style LLM fill:#fff3e0
    style AZURETTS fill:#ffecb3
    style MILVUS fill:#f3e5f5
    style SAVE fill:#c8e6c9

Multi-LLM Support

8 Supported LLM Models

The service integrates 8 different LLM models, though only GPT-3.5-Turbo-16k is actively used in the main endpoint.


1. Azure OpenAI GPT-4

Function: call_openai_4() (Lines 337-347)

Configuration:

endpoint_gpt4 = "https://machineagentopenai.openai.azure.com/openai/deployments/gpt-4-0613/..."
deployment_gpt4 = "gpt-4-0613"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!

client = AzureOpenAI(
    azure_endpoint=endpoint_gpt4,
    api_key=subscription_key,
    api_version="2024-02-15-preview"
)

Usage:

response = call_openai_4(messages)

2. Azure OpenAI GPT-3.5-Turbo-16k ⭐

Function: call_openai_35() (Lines 361-371)

⭐ ACTIVELY USED in main endpoint (Line 701)

Configuration:

endpoint_gpt35 = "https://machineagentopenai.openai.azure.com/openai/deployments/gpt-35-turbo-16k-0613/..."
deployment_gpt35 = "gpt-35-turbo-16k-0613"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!

client_gpt35 = AzureOpenAI(
    azure_endpoint=endpoint_gpt35,
    api_key=subscription_key,
    api_version="2024-02-15-preview"
)

Usage in Endpoint:

response = client_gpt35.chat.completions.create(
    model="gpt-35-turbo-16k-0613",
    messages=messages,
    temperature=0.7
)

3. Azure OpenAI GPT-4o-mini

Function: call_openai_4o() (Lines 385-395)

Configuration:

endpoint_gpt4o = "https://machineagentopenai.openai.azure.com/openai/deployments/gpt-4o-mini-2024-07-18/..."
deployment_gpt4o = "gpt-4o-mini-2024-07-18"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!

4. Azure OpenAI o1-mini

Function: call_openai_gpto1mini() (Lines 409-419)

Configuration:

endpoint_gpto1mini = "https://machineagentopenai.openai.azure.com/openai/deployments/o1-mini-2024-09-12/..."
deployment_gpto1mini = "o1-mini-2024-09-12"
subscription_key = "AZxDVMYB..." # ⚠️ Hardcoded!
api_version = "2024-09-12-"  # Note the trailing dash (typo?)

5. Azure Llama 3.3-70B-Instruct

Function: call_llama() (Lines 432-440)

Configuration:

AZURE_LLAMA_ENDPOINT = "https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/chat/completions"
AZURE_API_KEY = "JOfcw0VW0dS31Z8XgkNRSP9tUaBiwUYZ"  # ⚠️ Hardcoded!

llama_HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {AZURE_API_KEY}"
}

Usage:

payload = {"messages": messages, "temperature": 0.7, "max_tokens": 50}
response = requests.post(AZURE_LLAMA_ENDPOINT, json=payload, headers=llama_HEADERS)

6. DeepSeek R1

Function: call_deepseek() (Lines 453-464)

Configuration:

deepseek_api_url = "https://DeepSeek-R1-imalr.eastus2.models.ai.azure.com/chat/completions"
deepseek_api_key = "GwUcGzHhhUbvApfMR4aq1ZPFUic6lbWE"  # ⚠️ Hardcoded!

deepseekheaders = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {deepseek_api_key}"
}

Special Feature: Removes <think>...</think> tags from response

answer = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()
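
A quick illustration of the scrub (the response text here is invented):

import re

raw = "<think>The user is asking about pricing tiers...</think>We offer three tiers."
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print(answer)  # -> "We offer three tiers."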

7. Ministral-3B

Function: call_Ministral() (Lines 475-486)

Configuration:

Ministral_api_url = "https://Ministral-3B-rvgab.eastus2.models.ai.azure.com/chat/completions"
Ministral_api_key = "Z7fNcdnw5Tht1xAz6VlgUlLOeZoVTkIf"  # ⚠️ Hardcoded!

8. Phi-3-small-8k

Function: call_phi() (Lines 497-508)

Configuration:

phi_api_url = "https://Phi-3-small-8k-instruct-qvlpq.eastus2.models.ai.azure.com/chat/completions"
phi_api_key = "T8I14He3lbMyAyUwNfffwG58e23EcXsU"  # ⚠️ Hardcoded!
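
None of this dispatch logic exists in main.py; it is a hypothetical sketch of how the eight call_* functions above could be selected from chatbot config, assuming each takes the messages list and returns the answer text:

# Hypothetical model dispatcher -- NOT in main.py.
LLM_DISPATCH = {
    "gpt-4": call_openai_4,
    "gpt-35-turbo-16k": call_openai_35,
    "gpt-4o-mini": call_openai_4o,
    "o1-mini": call_openai_gpto1mini,
    "llama-3.3-70b": call_llama,
    "deepseek-r1": call_deepseek,
    "ministral-3b": call_Ministral,
    "phi-3-small-8k": call_phi,
}

def call_llm(model_key: str, messages: list) -> str:
    # Fall back to the one actively used model if the key is unknown.
    handler = LLM_DISPATCH.get(model_key, call_openai_35)
    return handler(messages)

A table like this would also make the "Reduce LLM Clients" recommendation below easier to act on: unused entries can simply be removed.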

Azure TTS Integration

Supported Voices

10 Azure Neural Voices (Lines 207-218)

| Voice Code  | Azure Voice Name             | Gender | Accent     | Notes        |
|-------------|------------------------------|--------|------------|--------------|
| Male_1      | en-US-EricNeural             | Male   | US         | Professional |
| Male_2      | en-US-GuyNeural              | Male   | US         | Friendly     |
| Male_3      | en-CA-LiamNeural             | Male   | Canadian   | Neutral      |
| Male_IND    | en-IN-PrabhatNeural          | Male   | Indian     | Regional     |
| Female_1    | en-US-AvaMultilingualNeural  | Female | US         | Multilingual |
| Female_2    | en-US-JennyNeural            | Female | US         | Natural      |
| Female_3    | en-US-EmmaMultilingualNeural | Female | US         | Multilingual |
| Female_4    | en-AU-NatashaNeural          | Female | Australian | Regional     |
| Female_IND  | en-IN-NeerjaExpressiveNeural | Female | Indian     | Expressive   |
| Female_IND2 | en-IN-NeerjaNeural           | Female | Indian     | Standard     |
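
The voice codes map to Azure voice names via the SUPPORTED_VOICES dict used in Step 6 of the endpoint below. The literal (Lines 207-218) is not reproduced here; reconstructed from the table above, it presumably reads:

SUPPORTED_VOICES = {
    "Male_1": "en-US-EricNeural",
    "Male_2": "en-US-GuyNeural",
    "Male_3": "en-CA-LiamNeural",
    "Male_IND": "en-IN-PrabhatNeural",
    "Female_1": "en-US-AvaMultilingualNeural",
    "Female_2": "en-US-JennyNeural",
    "Female_3": "en-US-EmmaMultilingualNeural",
    "Female_4": "en-AU-NatashaNeural",
    "Female_IND": "en-IN-NeerjaExpressiveNeural",
    "Female_IND2": "en-IN-NeerjaNeural",
}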

TTS Generation Function

Function: text_to_speech() (Lines 222-265)

Complete Flow:

Step 1: Clean Text

cleaned_text = remove_contact_numbers(text)
# Removes phone numbers and URLs for voice output

Step 2: Configure Azure Speech SDK

speech_config = speechsdk.SpeechConfig(
    subscription="9N41NOfDyVDoduiD4EjlzmZU9CbUX3pPqWfLCORpl7cBf0l2lzVQJQQJ99BCACGhslBXJ3w3AAAYACOG2329",  # ⚠️ HARDCODED!
    region="centralindia"
)
speech_config.speech_synthesis_voice_name = voice  # e.g., "en-US-JennyNeural"

Step 3: Generate WAV File

wav_file = os.path.join(OUTPUT_DIR, f"{user_id}.wav")
audio_config = speechsdk.audio.AudioOutputConfig(filename=wav_file)

speech_synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)

result = speech_synthesizer.speak_text_async(cleaned_text).get()

Step 4: Check Result

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesis completed successfully.")
else:
    error_details = result.cancellation_details
    raise Exception(f"Speech synthesis failed: {error_details.reason}")

Returns: Path to WAV file


Contact Scrubbing

Function: remove_contact_numbers() (Lines 511-521)

Purpose: Remove sensitive information from voice responses

Patterns Removed:

1. Phone Numbers:

phone_number_pattern = r"\+?[0-9]{1,4}[-.\s]?[0-9]{1,3}[-.\s]?[0-9]{3}[-.\s]?[0-9]{3,4}"
# Matches: +1-555-123-4567, 555.123.4567, etc.

Replacement: "the number provided in the chat"

2. URLs:

url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
# Matches: https://example.com, http://site.com/path, etc.

Replacement: "via the url provided in the chat"

Example:

# Input
"Call us at +1-555-123-4567 or visit https://example.com/pricing"

# Output (for voice)
"Call us at the number provided in the chat or visit via the url provided in the chat"

Why? Voice assistants shouldn't read out long numbers/URLs - users can't remember them!
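
Assembling the two patterns, the full function plausibly looks like this (a sketch built from the snippets above, not a verbatim copy of Lines 511-521):

import re

def remove_contact_numbers(text: str) -> str:
    # Replace phone numbers with a spoken-friendly placeholder.
    phone_number_pattern = r"\+?[0-9]{1,4}[-.\s]?[0-9]{1,3}[-.\s]?[0-9]{3}[-.\s]?[0-9]{3,4}"
    text = re.sub(phone_number_pattern, "the number provided in the chat", text)

    # Replace URLs with a spoken-friendly placeholder.
    url_pattern = r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    text = re.sub(url_pattern, "via the url provided in the chat", text)

    return text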


Complete Endpoint

POST /v2/get-response-voice-chatbot

Purpose: Generate AI response with voice synthesis

Code Location: Lines 744-772 (entry point) + Lines 523-740 (processing)

Request:

POST /v2/get-response-voice-chatbot
Content-Type: multipart/form-data

user_id=User-123456
project_id=User-123456_Project_1
session_id=session_20250115_140530
question=What are your pricing options?

Response:

{
  "text": "We offer three pricing tiers: Basic at 29 dollars per month, Pro at 99 dollars, and Enterprise with custom pricing. Which interests you most?",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA..."
}

Fields:

  • text - Clean text response (markdown removed)
  • audio - Base64-encoded WAV file
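
On the client side, turning the audio field back into a playable file takes a few lines of Python (a sketch; payload stands for the parsed JSON response above):

import base64

# `payload` is the parsed JSON response from the endpoint.
wav_bytes = base64.b64decode(payload["audio"])
with open("response.wav", "wb") as f:
    f.write(wav_bytes)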

Processing Flow

Entry Point (Lines 744-772):

@app.post("/v2/get-response-voice-chatbot")
async def ask_gpt_voicechat(
    user_id: str = Form(...),
    project_id: str = Form(...),
    session_id: str = Form(...),
    question: str = Form(...)
):
    # 1. Fetch chatbot config
    chatbot_config = chatbot_collection.find_one({"user_id": user_id, "project_id": project_id})

    chatbot_purpose = chatbot_config.get("chatbot_purpose", "Service-Agent")
    chatbot_voice = chatbot_config.get("voice", "Female_3")
    chatbot_name = chatbot_config.get("hidden_name")

    # 2. Execute in thread pool (for synchronous TTS)
    loop = asyncio.get_event_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        result = await loop.run_in_executor(
            pool, sync_ask_voicechatbot, user_id, project_id, session_id,
            chatbot_purpose, chatbot_voice, question
        )

    return result

Why ThreadPoolExecutor? The Azure TTS SDK is synchronous and blocks; running it in a worker thread keeps the async event loop responsive.
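
On Python 3.9+ (the Dockerfile's base image), asyncio.to_thread offers the same offloading with less ceremony; a sketch of the equivalent call:

# Equivalent offloading on Python 3.9+ (sketch, same arguments as above).
result = await asyncio.to_thread(
    sync_ask_voicechatbot, user_id, project_id, session_id,
    chatbot_purpose, chatbot_voice, question
)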


Main Processing Logic

Function: sync_ask_voicechatbot() (Lines 523-740)

Complete Step-by-Step:

Step 1: Milvus RAG Retrieval (Lines 536-584)

# Generate question embedding
embedder = Embedding(model_name="BAAI/bge-small-en-v1.5", max_length=512)
question_embedding = list(embedder.embed([question]))[0]
question_embedding_list = [float(x) for x in question_embedding]

# Search Milvus
search_results = milvus_embeddings.search_embeddings(
    collection_name="embeddings",
    query_vector=question_embedding_list,
    user_id=user_id,
    project_id=project_id,
    top_k=5
)

# Combine top chunks
combined_text = "\n\n".join([result.get("text", "") for result in search_results])
most_relevant_document = combined_text

# Truncate if too large (max 10K tokens)
MAX_DOCUMENT_TOKENS = 10000
if count_tokens(most_relevant_document) > MAX_DOCUMENT_TOKENS:
    most_relevant_document = truncate_text_by_tokens(most_relevant_document, MAX_DOCUMENT_TOKENS)
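
count_tokens() and truncate_text_by_tokens() are not shown in this excerpt. Given that tiktoken appears in requirements.txt, plausible implementations look like the following (the cl100k_base encoding is an assumption, not confirmed by the source):

import tiktoken

# cl100k_base is assumed; the actual encoding is not shown in the excerpt.
_encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_encoding.encode(text))

def truncate_text_by_tokens(text: str, max_tokens: int) -> str:
    tokens = _encoding.encode(text)
    return _encoding.decode(tokens[:max_tokens])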

Step 2: Build System Prompt (Lines 586-597)

chatbot_settings = chatbot_collection.find_one({"user_id": user_id, "project_id": project_id})
chatbot_name = chatbot_settings.get("hidden_name", "Unknown Chatbot")

# Inject chatbot name into system prompt
system_prompt = f"Your name is {chatbot_name}. {system_prompts.get(chatbot_purpose, system_prompts['Service-Agent'])}"

Example:

"Your name is Emma. You are a specialized chatbot acting as a service agent..."

Step 3: Retrieve Chat History (Lines 599-660)

MAX_CONTEXT_TOKENS = 12000

chat_session = history_collection.find_one({
    "project_id": project_id,
    "user_id": user_id,
    "session_id": session_id
})

# Build history with token limit
temp_history = []
total_tokens = 0

for msg in chat_session["chat_data"]:
    message_text = f"User: {msg['input_prompt']}\nAssistant: {msg['output_response']}"
    msg_tokens = count_tokens(message_text)

    if total_tokens + msg_tokens > MAX_CONTEXT_TOKENS:
        break

    temp_history.append(message_text)
    total_tokens += msg_tokens

chat_history_text = "\n\n".join(reversed(temp_history))  # Chronological order

Step 4: Construct Final Prompt (Lines 662-690)

context_parts = []
remaining_tokens = MAX_CONTEXT_TOKENS - count_tokens(system_prompt) - count_tokens(question) - 100

# Add chat history
if chat_history_text:
    context_parts.append(f"Previous conversation:\n{chat_history_text}")
    remaining_tokens -= count_tokens(chat_history_text)

# Add relevant document
if most_relevant_document:
    if count_tokens(most_relevant_document) > remaining_tokens:
        most_relevant_document = truncate_text_by_tokens(most_relevant_document, remaining_tokens)
    context_parts.append(f"Relevant document:\n{most_relevant_document}")

prompt = "\n\n".join(context_parts + [f"Current question: {question}"])

# Final safety check
total_tokens = count_tokens(system_prompt) + count_tokens(prompt)
if total_tokens > 15000:  # Buffer from 16K limit
    # Keep only last 2 exchanges
    recent_history = "\n\n".join(chat_history_text.split("\n\n")[-2:])
    prompt = f"Recent conversation:\n{recent_history}\n\nCurrent question: {question}"

Step 5: LLM Generation (Lines 692-709)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]

response = client_gpt35.chat.completions.create(
    model="gpt-35-turbo-16k-0613",
    messages=messages,
    temperature=0.7
)

raw_answer = response.choices[0].message.content
clean_answer = re.sub(r'[\*#]', '', raw_answer).strip()  # Remove markdown
answer = clean_answer

Step 6: TTS Generation (Lines 711-724)

voice = SUPPORTED_VOICES.get(chatbot_voice)  # e.g., "en-US-JennyNeural"
wav_file = asyncio.run(text_to_speech(answer, user_id, voice))

# Encode as base64
with open(wav_file, "rb") as audio_file:
    audio_base64 = base64.b64encode(audio_file.read()).decode("utf-8")

# Cleanup
os.remove(wav_file)

Step 7: Save History (Lines 726-728)

save_chat_history(user_id, project_id, session_id, question, answer)
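
save_chat_history() itself is not shown in the excerpt. Judging from the chat_data schema read back in Step 3, it plausibly appends one exchange with an upsert (a sketch; the timestamp field is an assumption):

from datetime import datetime, timezone

def save_chat_history(user_id, project_id, session_id, question, answer):
    # Append one exchange; create the session document if it does not exist.
    history_collection.update_one(
        {"user_id": user_id, "project_id": project_id, "session_id": session_id},
        {"$push": {"chat_data": {
            "input_prompt": question,
            "output_response": answer,
            "timestamp": datetime.now(timezone.utc),
        }}},
        upsert=True,
    )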

Step 8: Return Response (Lines 730-734)

return {
    "text": answer,
    "audio": audio_base64
}

Security Analysis

CRITICAL Security Issues

1. ⚠️ MULTIPLE HARDCODED API KEYS!

Azure OpenAI API Key (Lines 328, 352, 376, 401)

subscription_key = "AZxDVMYB08AaUip0i5ed1sy73ZpUsqencYYxKDbm6nfWfG1AqPZ3JQQJ99BCACYeBjFXJ3w3AAABACOGVUo7"

Azure TTS Subscription Key (Line 237)

subscription="9N41NOfDyVDoduiD4EjlzmZU9CbUX3pPqWfLCORpl7cBf0l2lzVQJQQJ99BCACGhslBXJ3w3AAAYACOG2329"

Llama API Key (Line 424)

AZURE_API_KEY = "JOfcw0VW0dS31Z8XgkNRSP9tUaBiwUYZ"

DeepSeek API Key (Line 445)

deepseek_api_key = "GwUcGzHhhUbvApfMR4aq1ZPFUic6lbWE"

Ministral API Key (Line 468)

Ministral_api_key = "Z7fNcdnw5Tht1xAz6VlgUlLOeZoVTkIf"

Phi-3 API Key (Line 490)

phi_api_key = "T8I14He3lbMyAyUwNfffwG58e23EcXsU"

⚠️ 6 HARDCODED API KEYS IN TOTAL!


2. ⚠️ Hardcoded Database Name (Line 78)

db = client["Machine_agent_dev"]  # Overwrites environment variable!

Problem: The database name is hardcoded to "Machine_agent_dev", overriding whatever MONGO_DB_NAME was loaded from the environment


3. ⚠️ File System Race Condition

Problem: WAV files are named {user_id}.wav, so concurrent requests from the same user overwrite each other's audio

Better:

import uuid
wav_file = os.path.join(OUTPUT_DIR, f"{user_id}_{uuid.uuid4()}.wav")

4. ⚠️ Limited Error Handling in TTS

Problem: If TTS fails, the entire request fails; there is no fallback to a text-only response


Performance

Response Time Breakdown

| Step                    | Avg Latency | Notes                    |
|-------------------------|-------------|--------------------------|
| 1. Embedding generation | 50-100 ms   | BAAI/bge-small-en-v1.5   |
| 2. Milvus search        | 50-150 ms   | Top-5 chunks             |
| 3. MongoDB history      | 20-50 ms    | Simple query             |
| 4. Azure OpenAI         | 1-3 s       | GPT-3.5-Turbo-16k        |
| 5. Azure TTS            | 1-3 s       | Neural voice synthesis   |
| 6. Base64 encoding      | 10-50 ms    | Depends on audio length  |
| 7. MongoDB save         | 20-50 ms    | Update operation         |
| TOTAL                   | 3-7 s       | User-facing latency      |

TTS adds 1-3 seconds compared to text-only responses.


Deployment

Docker Configuration

Dockerfile:

FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy shared modules
COPY shared/ ./shared/

COPY src/ .

# Create TTS output directory
RUN mkdir -p tts_audio

EXPOSE 8013

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8013"]

Requirements.txt

fastapi>=0.95.0
uvicorn[standard]>=0.22.0
pymongo>=4.3.3
python-multipart>=0.0.6
python-dotenv>=1.0.0

# Azure Services
azure-cognitiveservices-speech>=1.30.0
openai>=1.0.0

# Embeddings & ML
fastembed>=0.1.0
scikit-learn>=1.3.0
numpy>=1.24.0

# Utilities
tiktoken>=0.5.0
pytz>=2023.3
requests>=2.31.0

# Monitoring
ddtrace>=1.19.0

Environment Variables

# Azure OpenAI
AZURE_OPENAI_API_KEY=<your-key>
ENDPOINT_URL=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_NAME=gpt-35-turbo-16k-0613

# MongoDB
MONGO_URI=mongodb://...
MONGO_DB_NAME=Machine_agent_demo

# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530

# DataDog
DD_SERVICE=response-voice-chatbot-service
DD_ENV=production
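
Loading these at startup with python-dotenv (already in requirements.txt) is straightforward; a minimal sketch using the variable names above:

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment

AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]  # fail fast if missing
ENDPOINT_URL = os.getenv("ENDPOINT_URL")
DEPLOYMENT_NAME = os.getenv("DEPLOYMENT_NAME", "gpt-35-turbo-16k-0613")
MONGO_DB_NAME = os.getenv("MONGO_DB_NAME", "Machine_agent_demo")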


Recommendations

CRITICAL (Security)

  1. ⚠️ Move ALL API Keys to Environment (6 hardcoded keys!)
  2. ⚠️ Fix Hardcoded Database Name - Use environment variable
  3. ⚠️ Fix WAV File Naming - Use UUID to prevent collisions
  4. ⚠️ Add TTS Fallback - Return text-only if TTS fails

Performance

  1. Cache TTS Outputs - Same response = reuse audio (see the sketch after this list)
  2. Async TTS - Use async Azure SDK if available
  3. Streaming Response - Stream text first, then audio
  4. Reduce LLM Clients - Only initialize models that are used
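
A minimal sketch of item 1, caching synthesized audio keyed by voice and text. The synthesize_to_base64() helper is hypothetical shorthand for the TTS + base64 steps from the endpoint; a real deployment would likely back this with Redis or blob storage instead of an in-process dict:

import hashlib

# Hypothetical in-memory cache -- NOT in main.py.
_tts_cache: dict[str, str] = {}

def tts_cache_key(text: str, voice: str) -> str:
    return hashlib.sha256(f"{voice}:{text}".encode("utf-8")).hexdigest()

def get_or_synthesize(text: str, voice: str) -> str:
    key = tts_cache_key(text, voice)
    if key not in _tts_cache:
        # synthesize_to_base64 is assumed to wrap the TTS + base64 steps above.
        _tts_cache[key] = synthesize_to_base64(text, voice)
    return _tts_cache[key]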

Code Quality

  1. Remove Unused LLM Functions - 7 models defined, only 1 used
  2. Extract Configuration - Move all endpoints/keys to config file
  3. Add Type Hints - Complete typing
  4. Add Unit Tests - Test contact scrubbing, voice selection (example below)
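
A starting point for item 4 with pytest, grounded in the scrubbing example earlier in this document (the import path is assumed):

from main import remove_contact_numbers, SUPPORTED_VOICES

def test_scrubs_phone_and_url():
    text = "Call us at +1-555-123-4567 or visit https://example.com/pricing"
    scrubbed = remove_contact_numbers(text)
    assert "555" not in scrubbed
    assert "example.com" not in scrubbed
    assert "the number provided in the chat" in scrubbed

def test_voice_fallback():
    # Unknown codes should not crash voice selection.
    assert SUPPORTED_VOICES.get("Nonexistent_Voice") is None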

Last Updated: 2025-12-26
Code Version: response-voice-chatbot-service/src/main.py (775 lines)
Total Endpoints: 1
Supported LLMs: 8
Supported Voices: 10
Review Cycle: Monthly


"Conversations that speak for themselves."