LLM Model Service (Port 8016)¶
Service Path: machineagents-be/llm-model-service/
Port: 8016
Total Lines: 368
Purpose: Unified API gateway for multiple LLM providers, abstracting model-specific implementations and providing a single interface for all chatbot response services.
Table of Contents¶
- Service Overview
- Architecture & Dependencies
- Supported Models
- API Endpoint
- Model Implementations
- Request/Response Format
- Security Analysis
- Integration Points
Service Overview¶
Primary Responsibility¶
Unified LLM Gateway: Single endpoint /call-model/{model_name} routes requests to 9 different LLM providers
Key Capabilities¶
- ✅ Multi-Provider Support - 9 LLM models from 5 providers
- ✅ Standardized Interface - Same request/response format for all models
- ✅ Provider Abstraction - Response services don't need provider-specific code
- ✅ Error Handling - Consistent error responses across providers
- ✅ Logging - Standardized logging for all model calls
Architecture Pattern¶
Adapter Pattern:
Response Services
↓
POST /call-model/{model_name}
↓
Router (switch on model_name)
↓
Model-Specific Adapter Functions
↓
External LLM APIs
Architecture & Dependencies¶
Technology Stack¶
Framework:
- FastAPI (web framework)
- Uvicorn (ASGI server)
LLM SDK:
- Azure OpenAI Python SDK (for GPT-4, GPT-3.5)
- requests (for all other models)
Providers:
- Azure OpenAI (GPT-4, GPT-3.5)
- Azure AI Model Catalog (Llama, DeepSeek, Ministral, Phi, Grok)
- Google Gemini API
- Anthropic Claude API
Key Imports¶
from fastapi import FastAPI, HTTPException
from openai import AzureOpenAI
import requests
from pydantic import BaseModel
from typing import List
Environment Variables¶
Azure OpenAI:
ENDPOINT_GPT4=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_GPT4=gpt-4-0613
ENDPOINT_GPT35=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_GPT35=gpt-35-turbo-16k-0613
AZURE_OPENAI_API_KEY=AZxDVMYB08Aa... # ⚠️ HARDCODED
Azure AI Models:
AZURE_LLAMA_ENDPOINT=https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/...
AZURE_LLAMA_API_KEY=JOfcw0VW0dS31Z8XgkNRSP9tUaBiwUYZ # ⚠️ HARDCODED
DEEPSEEK_API_URL=https://DeepSeek-R1-imalr.eastus2.models.ai.azure.com/...
DEEPSEEK_API_KEY=GwUcGzHhhUbvApfMR4aq1ZPFUic6lbWE # ⚠️ HARDCODED
MINISTRAL_API_URL=https://Ministral-3B-rvgab.eastus2.models.ai.azure.com/...
MINISTRAL_API_KEY=Z7fNcdnw5Tht1xAz6VlgUlLOeZoVTkIf # ⚠️ HARDCODED
PHI_API_URL=https://Phi-3-small-8k-instruct-qvlpq.eastus2.models.ai.azure.com/...
PHI_API_KEY=T8I14He3lbMyAyUwNfffwG58e23EcXsU # ⚠️ HARDCODED
External APIs:
GEMINI_API_ENDPOINT=https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=AIzaSyCx... # ⚠️ KEY IN URL
GEMINI_API_KEY= # Empty (key in endpoint URL)
CLAUDE_API_ENDPOINT=https://api.anthropic.com/v1/messages
CLAUDE_API_KEY=sk-ant-api03-lUgPozzMkSUbxOXr0Hya... # ⚠️ HARDCODED
GROK_API_ENDPOINT=https://machineagents-resource.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
GROK_API_KEY=DZHPHaEk96KHbCgaD3fsaI7opVSL... # ⚠️ HARDCODED
Supported Models¶
9 LLM Models Across 4 Providers¶
| Model Name | Provider | Model ID | Endpoint Type | Max Tokens | Temperature |
|---|---|---|---|---|---|
| openai-4 | Azure OpenAI | gpt-4-0613 | SDK | Default | 0.7 |
| openai-35 | Azure OpenAI | gpt-35-turbo-16k-0613 | SDK | Default | 0.7 |
| llama | Azure AI | Llama-3-3-70B-Instruct | HTTP | 50 | 0.7 |
| deepseek | Azure AI | DeepSeek-R1 | HTTP | 100 | Default |
| ministral | Azure AI | Ministral-3B | HTTP | 100 | Default |
| phi | Azure AI | Phi-3-small-8k-instruct | HTTP | 100 | Default |
| Gemini Flash-2.5 | Google Gemini | gemini-2.0-flash | HTTP | 1000 | 0.7 |
| Claude sonnet 4 | Anthropic | claude-3-5-sonnet-20241022 | HTTP | 1024 | Default |
| Grok-3 | Azure AI (x.ai) | grok-3 | HTTP | 2048 | 1.0 |
Note: Model names are case-sensitive in the API request
API Endpoint¶
POST /call-model/{model_name}¶
Single Endpoint for All Models
Path Parameter:
model_name(string, case-sensitive) - One of:openai-4,openai-35,llama,deepseek,ministral,phi,Gemini Flash-2.5,Claude sonnet 4,Grok-3
Request Body:
{
"messages": [
{
"role": "user",
"content": "What is machine learning?"
},
{
"role": "assistant",
"content": "Machine learning is..."
},
{
"role": "user",
"content": "Tell me more"
}
]
}
Response (Success):
Response (Error):
Status Code: 400
Response (Model API Error):
Status Code: 500
Model Implementations¶
1. Azure OpenAI Models (GPT-4, GPT-3.5)¶
Implementation:
client_gpt4 = AzureOpenAI(
azure_endpoint=ENDPOINT_GPT4,
api_key=SUBSCRIPTION_KEY,
api_version="2024-02-15-preview"
)
def call_openai_gpt4(messages):
response = client_gpt4.chat.completions.create(
model=DEPLOYMENT_GPT4,
messages=messages,
temperature=0.7
)
return {"response": response.choices[0].message.content}
Features:
- Uses official Azure OpenAI SDK
- Same API key for both GPT-4 and GPT-3.5
- Temperature: 0.7 (fixed)
- No max_tokens limit (uses model default)
2. Azure AI Models (Llama, DeepSeek, Ministral, Phi, Grok)¶
Common Pattern:
HEADERS = {
"Content-Type": "application/json",
"Authorization": f"Bearer {API_KEY}"
}
payload = {
"messages": messages,
"max_tokens": 100,
"temperature": 0.7
}
response = requests.post(API_URL, json=payload, headers=HEADERS)
response.raise_for_status()
return {"response": response.json()["choices"][0]["message"]["content"]}
Llama Specifics:
- Max tokens: 50 (lowest among all models)
- Temperature: 0.7
DeepSeek Specifics:
- Max tokens: 100
- Extra logging for debugging
- Detailed error handling (RequestException, KeyError)
Ministral Specifics:
- ⚠️ Role Modification: Changes last message from
assistanttouserif needed - Max tokens: 100
Phi Specifics:
- Max tokens: 100
- Standard implementation
Grok Specifics:
- Max tokens: 2048 (highest)
- Temperature: 1.0 (highest, more creative)
- Model name in payload:
grok-3
3. Google Gemini¶
Message Format Transformation:
# Convert from OpenAI format to Gemini format
gemini_contents = [
{
"role": msg["role"],
"parts": [{"text": msg["content"]}]
}
for msg in messages
]
payload = {
"contents": gemini_contents,
"generationConfig": {
"temperature": 0.7,
"maxOutputTokens": 1000
}
}
# API key is in the URL, not headers
response = requests.post(GEMINI_API_ENDPOINT, json=payload)
Response Parsing:
response_json = response.json()
if "candidates" in response_json:
return {"response": response_json["candidates"][0]["content"]["parts"][0]["text"]}
elif "error" in response_json:
raise HTTPException(status_code=response.status_code, detail=response_json['error']['message'])
Features:
- Different message format (parts-based)
- API key in URL (not recommended for production)
- Complex response structure with candidates
- Max output tokens: 1000
4. Anthropic Claude¶
Implementation:
headers = {
"Content-Type": "application/json",
"X-API-Key": CLAUDE_API_KEY,
"anthropic-version": "2023-06-01" # Required header
}
payload = {
"model": "claude-3-5-sonnet-20241022",
"max_tokens": 1024,
"messages": messages
}
response = requests.post(CLAUDE_API_ENDPOINT, json=payload, headers=headers)
Response Parsing:
response_json = response.json()
if "content" in response_json:
# Join all text blocks
return {"response": "".join([
block["text"]
for block in response_json["content"]
if block["type"] == "text"
])}
Features:
- Custom header:
anthropic-versionrequired - Custom header format:
X-API-Keyinstead ofAuthorization - Multi-block response format
- Model: claude-3-5-sonnet-20241022
Request/Response Format¶
Pydantic Models¶
class Message(BaseModel):
role: str # "user" or "assistant"
content: str
class MessageRequest(BaseModel):
messages: List[Message]
Example Request¶
cURL:
curl -X POST "http://localhost:8016/call-model/openai-4" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi there!"},
{"role": "user", "content": "How are you?"}
]
}'
Python:
import requests
response = requests.post(
"http://localhost:8016/call-model/Gemini Flash-2.5",
json={
"messages": [
{"role": "user", "content": "What is AI?"}
]
}
)
print(response.json()["response"])
Error Responses¶
Invalid Model:
API Failure:
Security Analysis¶
🔴 CRITICAL: 9 Hardcoded API Keys¶
All API keys are hardcoded in the source code:
- Azure OpenAI (Line 36):
AZURE_OPENAI_API_KEY - Llama (Line 44):
AZURE_LLAMA_API_KEY - DeepSeek (Line 46):
DEEPSEEK_API_KEY - Ministral (Line 48):
MINISTRAL_API_KEY - Phi (Line 50):
PHI_API_KEY - Gemini (Line 53): API key in URL
- Claude (Line 57):
CLAUDE_API_KEY - Grok (Line 59):
GROK_API_KEY
Risk:
- Keys exposed in version control
- Billing fraud potential
- Unauthorized API access
- No key rotation possible
Fix:
# Remove defaults entirely
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
if not AZURE_OPENAI_API_KEY:
raise ValueError("AZURE_OPENAI_API_KEY environment variable not set")
🟠 SECURITY: Overly Permissive CORS¶
Lines 23-29:
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
Risk: Any website can call LLM API through this service
Fix:
🟡 CODE QUALITY: Inconsistent Logging¶
Issue: Mix of logger and logging
Lines 166-169 (DeepSeek):
logging.info(f"DeepSeek Payload: {payload}") # lowercase 'logging'
logger.info("✓ DeepSeek API call successful") # 'logger' object
Impact: Inconsistent log output format
Fix: Use logger consistently everywhere
🟡 CODE QUALITY: Ministral Role Modification¶
Lines 188-190:
if messages[-1]["role"] == "assistant":
messages[-1]["role"] = "user"
logging.info("Modified the last message role to 'user'.")
Issue: Modifies input data (side effect)
Impact:
- Unexpected behavior for callers
- May break conversation flow
- Hard to debug
Fix: Create a copy before modifying
messages_copy = [msg.copy() for msg in messages]
if messages_copy[-1]["role"] == "assistant":
messages_copy[-1]["role"] = "user"
Integration Points¶
1. Response Services Integration¶
All response services call this gateway:
# In response-3d-chatbot-service, response-text-chatbot-service, etc.
import requests
LLM_MODEL_SERVICE_URL = "http://llm-model-service:8016"
def get_llm_response(model_name, messages):
response = requests.post(
f"{LLM_MODEL_SERVICE_URL}/call-model/{model_name}",
json={"messages": messages}
)
return response.json()["response"]
Usage:
# Get response from GPT-4
response = get_llm_response("openai-4", [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": user_question}
])
2. Model Selection Flow¶
User selects model in dashboard
↓
Stored in CosmosDB (project config)
↓
Response service reads model name
↓
Calls /call-model/{model_name}
↓
LLM Model Service routes to correct provider
↓
Returns unified response
Summary¶
Service Statistics¶
- Total Lines: 368
- Total Endpoints: 1
- Total Models: 9
- Total Providers: 4
- Total API Keys: 9 (all hardcoded ⚠️)
Key Capabilities¶
- ✅ Unified Interface - Single endpoint for all models
- ✅ Multi-Provider - Azure OpenAI, Azure AI, Gemini, Claude
- ✅ Error Handling - Consistent error responses
- ✅ Logging - Standardized logging (mostly)
- ✅ Provider Abstraction - Response services don't need provider code
Critical Fixes Needed¶
- 🔴 Externalize all 9 API keys - Most critical security issue
- 🟠 Restrict CORS - Prevent unauthorized access
- 🟡 Fix inconsistent logging - Use
loggereverywhere - 🟡 Fix Ministral role modification - Don't modify input data
Deployment Notes¶
Docker Compose (Port 8016):
llm-model-service:
build: ./llm-model-service
container_name: llm-model-service
ports:
- "8016:8016"
environment:
- AZURE_OPENAI_API_KEY=*** # Must externalize
- AZURE_LLAMA_API_KEY=***
- DEEPSEEK_API_KEY=***
# ... all 9 keys
Dependencies:
- All response services depend on this gateway
- Single point of failure for all LLM calls
- No caching or rate limiting implemented
Documentation Complete: LLM Model Service (Port 8016)
Status: COMPREHENSIVE, DEVELOPER-GRADE, INVESTOR-GRADE, AUDIT-READY ✅