
LLM Model Service (Port 8016)

Service Path: machineagents-be/llm-model-service/
Port: 8016
Total Lines: 368
Purpose: Unified API gateway for multiple LLM providers, abstracting model-specific implementations and providing a single interface for all chatbot response services.


Table of Contents

  1. Service Overview
  2. Architecture & Dependencies
  3. Supported Models
  4. API Endpoint
  5. Model Implementations
  6. Request/Response Format
  7. Security Analysis
  8. Integration Points
  9. Summary
  10. Deployment Notes

Service Overview

Primary Responsibility

Unified LLM Gateway: Single endpoint /call-model/{model_name} routes requests to 9 different LLM providers

Key Capabilities

  1. Multi-Provider Support - 9 LLM models from 4 providers
  2. Standardized Interface - Same request/response format for all models
  3. Provider Abstraction - Response services don't need provider-specific code
  4. Error Handling - Consistent error responses across providers
  5. Logging - Standardized logging for all model calls

Architecture Pattern

Adapter Pattern:

Response Services
    → POST /call-model/{model_name}
    → Router (switch on model_name)
    → Model-Specific Adapter Functions
    → External LLM APIs

Architecture & Dependencies

Technology Stack

Framework:

  • FastAPI (web framework)
  • Uvicorn (ASGI server)

LLM SDK:

  • Azure OpenAI Python SDK (for GPT-4, GPT-3.5)
  • requests (for all other models)

Providers:

  1. Azure OpenAI (GPT-4, GPT-3.5)
  2. Azure AI Model Catalog (Llama, DeepSeek, Ministral, Phi, Grok)
  3. Google Gemini API
  4. Anthropic Claude API

Key Imports

from fastapi import FastAPI, HTTPException
from openai import AzureOpenAI
import requests
from pydantic import BaseModel
from typing import List

Environment Variables

Azure OpenAI:

ENDPOINT_GPT4=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_GPT4=gpt-4-0613
ENDPOINT_GPT35=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_GPT35=gpt-35-turbo-16k-0613
AZURE_OPENAI_API_KEY=AZxDVMYB08Aa...  # ⚠️ HARDCODED

Azure AI Models:

AZURE_LLAMA_ENDPOINT=https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/...
AZURE_LLAMA_API_KEY=JOfcw0VW0dS31Z8XgkNRSP9tUaBiwUYZ  # ⚠️ HARDCODED

DEEPSEEK_API_URL=https://DeepSeek-R1-imalr.eastus2.models.ai.azure.com/...
DEEPSEEK_API_KEY=GwUcGzHhhUbvApfMR4aq1ZPFUic6lbWE  # ⚠️ HARDCODED

MINISTRAL_API_URL=https://Ministral-3B-rvgab.eastus2.models.ai.azure.com/...
MINISTRAL_API_KEY=Z7fNcdnw5Tht1xAz6VlgUlLOeZoVTkIf  # ⚠️ HARDCODED

PHI_API_URL=https://Phi-3-small-8k-instruct-qvlpq.eastus2.models.ai.azure.com/...
PHI_API_KEY=T8I14He3lbMyAyUwNfffwG58e23EcXsU  # ⚠️ HARDCODED

External APIs:

GEMINI_API_ENDPOINT=https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=AIzaSyCx...  # ⚠️ KEY IN URL
GEMINI_API_KEY=  # Empty (key in endpoint URL)

CLAUDE_API_ENDPOINT=https://api.anthropic.com/v1/messages
CLAUDE_API_KEY=sk-ant-api03-lUgPozzMkSUbxOXr0Hya...  # ⚠️ HARDCODED

GROK_API_ENDPOINT=https://machineagents-resource.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
GROK_API_KEY=DZHPHaEk96KHbCgaD3fsaI7opVSL...  # ⚠️ HARDCODED

Supported Models

9 LLM Models Across 4 Providers

| Model Name | Provider | Model ID | Endpoint Type | Max Tokens | Temperature |
| --- | --- | --- | --- | --- | --- |
| openai-4 | Azure OpenAI | gpt-4-0613 | SDK | Default | 0.7 |
| openai-35 | Azure OpenAI | gpt-35-turbo-16k-0613 | SDK | Default | 0.7 |
| llama | Azure AI | Llama-3-3-70B-Instruct | HTTP | 50 | 0.7 |
| deepseek | Azure AI | DeepSeek-R1 | HTTP | 100 | Default |
| ministral | Azure AI | Ministral-3B | HTTP | 100 | Default |
| phi | Azure AI | Phi-3-small-8k-instruct | HTTP | 100 | Default |
| Gemini Flash-2.5 | Google Gemini | gemini-2.0-flash | HTTP | 1000 | 0.7 |
| Claude sonnet 4 | Anthropic | claude-3-5-sonnet-20241022 | HTTP | 1024 | Default |
| Grok-3 | Azure AI (x.ai) | grok-3 | HTTP | 2048 | 1.0 |

Note: Model names are case-sensitive in the API request


API Endpoint

POST /call-model/{model_name}

Single Endpoint for All Models

Path Parameter:

  • model_name (string, case-sensitive) - One of: openai-4, openai-35, llama, deepseek, ministral, phi, Gemini Flash-2.5, Claude sonnet 4, Grok-3

Request Body:

{
  "messages": [
    {
      "role": "user",
      "content": "What is machine learning?"
    },
    {
      "role": "assistant",
      "content": "Machine learning is..."
    },
    {
      "role": "user",
      "content": "Tell me more"
    }
  ]
}

Response (Success):

{
  "response": "Machine learning is a subset of artificial intelligence..."
}

Response (Error):

{
  "detail": "Invalid model name"
}

Status Code: 400

Response (Model API Error):

{
  "detail": "GPT-4 API Error: Rate limit exceeded"
}

Status Code: 500
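
Internally, the routing amounts to a dispatch on the path parameter. A minimal sketch of that shape, assuming a handler table and adapter names matching the sections below (only call_openai_gpt4 appears verbatim in this document; the rest are assumed):

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical dispatch table; the actual service may use if/elif branches
MODEL_HANDLERS = {
    "openai-4": call_openai_gpt4,
    "openai-35": call_openai_gpt35,
    "llama": call_llama,
    # ... one entry per supported model name
}

@app.post("/call-model/{model_name}")
def call_model(model_name: str, request: MessageRequest):
    handler = MODEL_HANDLERS.get(model_name)
    if handler is None:
        raise HTTPException(status_code=400, detail="Invalid model name")
    # Adapters expect plain dicts, not Pydantic models
    return handler([msg.dict() for msg in request.messages])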


Model Implementations

1. Azure OpenAI Models (GPT-4, GPT-3.5)

Implementation:

client_gpt4 = AzureOpenAI(
    azure_endpoint=ENDPOINT_GPT4,
    api_key=SUBSCRIPTION_KEY,
    api_version="2024-02-15-preview"
)

def call_openai_gpt4(messages):
    response = client_gpt4.chat.completions.create(
        model=DEPLOYMENT_GPT4,
        messages=messages,
        temperature=0.7
    )
    return {"response": response.choices[0].message.content}

Features:

  • Uses official Azure OpenAI SDK
  • Same API key for both GPT-4 and GPT-3.5
  • Temperature: 0.7 (fixed)
  • No max_tokens limit (uses model default)
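
GPT-3.5 presumably follows the same shape against its own deployment; a minimal sketch, assuming a client_gpt35 constructed like client_gpt4 above:

def call_openai_gpt35(messages):
    # Same pattern as GPT-4, pointed at the GPT-3.5 deployment (assumed)
    response = client_gpt35.chat.completions.create(
        model=DEPLOYMENT_GPT35,
        messages=messages,
        temperature=0.7
    )
    return {"response": response.choices[0].message.content}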

2. Azure AI Models (Llama, DeepSeek, Ministral, Phi, Grok)

Common Pattern:

# Shared shape repeated (with per-model tweaks) by the Llama, DeepSeek,
# Ministral, Phi, and Grok adapters; the function name here is illustrative
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

def call_azure_ai_model(messages):
    payload = {
        "messages": messages,
        "max_tokens": 100,   # per-model value, see specifics below
        "temperature": 0.7
    }
    response = requests.post(API_URL, json=payload, headers=HEADERS)
    response.raise_for_status()  # raise on 4xx/5xx
    return {"response": response.json()["choices"][0]["message"]["content"]}

Llama Specifics:

  • Max tokens: 50 (lowest among all models)
  • Temperature: 0.7

DeepSeek Specifics:

  • Max tokens: 100
  • Extra logging for debugging
  • Detailed error handling (RequestException, KeyError)

Ministral Specifics:

  • ⚠️ Role Modification: Changes last message from assistant to user if needed
    if messages[-1]["role"] == "assistant":
        messages[-1]["role"] = "user"
    
  • Max tokens: 100

Phi Specifics:

  • Max tokens: 100
  • Standard implementation

Grok Specifics:

  • Max tokens: 2048 (highest)
  • Temperature: 1.0 (highest, more creative)
  • Model name in payload: grok-3
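
Combining those settings, the request body sent to the Grok endpoint would look roughly like this (illustrative message content):

{
  "model": "grok-3",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 2048,
  "temperature": 1.0
}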

3. Google Gemini

Message Format Transformation:

# Convert from OpenAI format to Gemini format
gemini_contents = [
    {
        "role": msg["role"],
        "parts": [{"text": msg["content"]}]
    }
    for msg in messages
]

payload = {
    "contents": gemini_contents,
    "generationConfig": {
        "temperature": 0.7,
        "maxOutputTokens": 1000
    }
}

# API key is in the URL, not headers
response = requests.post(GEMINI_API_ENDPOINT, json=payload)

Response Parsing:

response_json = response.json()

if "candidates" in response_json:
    return {"response": response_json["candidates"][0]["content"]["parts"][0]["text"]}
elif "error" in response_json:
    raise HTTPException(status_code=response.status_code, detail=response_json['error']['message'])

Features:

  • Different message format (parts-based)
  • API key in URL (not recommended for production)
  • Complex response structure with candidates
  • Max output tokens: 1000
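
For reference, a successful generateContent response is shaped roughly as follows (abridged); the parsing above walks candidates[0].content.parts[0].text:

{
  "candidates": [
    {
      "content": {
        "parts": [{"text": "Machine learning is..."}],
        "role": "model"
      },
      "finishReason": "STOP"
    }
  ]
}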

4. Anthropic Claude

Implementation:

headers = {
    "Content-Type": "application/json",
    "X-API-Key": CLAUDE_API_KEY,
    "anthropic-version": "2023-06-01"  # Required header
}

payload = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "messages": messages
}

response = requests.post(CLAUDE_API_ENDPOINT, json=payload, headers=headers)

Response Parsing:

response_json = response.json()

if "content" in response_json:
    # Join all text blocks
    return {"response": "".join([
        block["text"]
        for block in response_json["content"]
        if block["type"] == "text"
    ])}

Features:

  • Custom header: anthropic-version required
  • Custom header format: X-API-Key instead of Authorization
  • Multi-block response format
  • Model: claude-3-5-sonnet-20241022
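
A successful Messages API response is shaped roughly as follows (abridged); the text blocks inside content are what the parser joins:

{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Machine learning is..."}
  ],
  "stop_reason": "end_turn"
}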

Request/Response Format

Pydantic Models

class Message(BaseModel):
    role: str  # "system", "user", or "assistant"
    content: str

class MessageRequest(BaseModel):
    messages: List[Message]

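FastAPI validates the body against these models before any adapter runs, so a malformed request (e.g. a message missing content) never reaches a provider and fails with an automatic 422 shaped roughly like:

{
  "detail": [
    {
      "loc": ["body", "messages", 0, "content"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}
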
Example Request

cURL:

curl -X POST "http://localhost:8016/call-model/openai-4" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"},
      {"role": "assistant", "content": "Hi there!"},
      {"role": "user", "content": "How are you?"}
    ]
  }'

Python:

import requests

response = requests.post(
    # note: requests percent-encodes the space in "Gemini Flash-2.5"
    "http://localhost:8016/call-model/Gemini Flash-2.5",
    json={
        "messages": [
            {"role": "user", "content": "What is AI?"}
        ]
    }
)

print(response.json()["response"])

Error Responses

Invalid Model:

HTTP 400
{
    "detail": "Invalid model name"
}

API Failure:

HTTP 500
{
    "detail": "DeepSeek API Request Error: Connection timeout"
}

Security Analysis

🔴 CRITICAL: 8 Hardcoded API Keys

All 8 API keys are hardcoded in the source code (the Azure OpenAI key is shared by GPT-4 and GPT-3.5; the Gemini key is embedded in its endpoint URL):

  1. Azure OpenAI (Line 36): AZURE_OPENAI_API_KEY
  2. Llama (Line 44): AZURE_LLAMA_API_KEY
  3. DeepSeek (Line 46): DEEPSEEK_API_KEY
  4. Ministral (Line 48): MINISTRAL_API_KEY
  5. Phi (Line 50): PHI_API_KEY
  6. Gemini (Line 53): API key in URL
  7. Claude (Line 57): CLAUDE_API_KEY
  8. Grok (Line 59): GROK_API_KEY

Risk:

  • Keys exposed in version control
  • Billing fraud potential
  • Unauthorized API access
  • No key rotation possible

Fix:

# Remove defaults entirely
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
if not AZURE_OPENAI_API_KEY:
    raise ValueError("AZURE_OPENAI_API_KEY environment variable not set")

🟠 SECURITY: Overly Permissive CORS

Lines 23-29:

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Risk: Any website can call LLM API through this service

Fix:

allow_origins=[
    "https://app.machineagents.ai",
    "https://admin.machineagents.ai"
]

🟡 CODE QUALITY: Inconsistent Logging

Issue: Mix of logger and logging

Lines 166-169 (DeepSeek):

logging.info(f"DeepSeek Payload: {payload}")  # lowercase 'logging'
logger.info("✓ DeepSeek API call successful")  # 'logger' object

Impact: Inconsistent log output format

Fix: Use logger consistently everywhere
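
A minimal sketch of the consistent form, assuming the module-level logger the service already defines:

import logging

logger = logging.getLogger(__name__)

# Route every message through the module logger, never the root 'logging'
logger.info(f"DeepSeek Payload: {payload}")
logger.info("✓ DeepSeek API call successful")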

🟡 CODE QUALITY: Ministral Role Modification

Lines 188-190:

if messages[-1]["role"] == "assistant":
    messages[-1]["role"] = "user"
    logging.info("Modified the last message role to 'user'.")

Issue: Modifies input data (side effect)

Impact:

  • Unexpected behavior for callers
  • May break conversation flow
  • Hard to debug

Fix: Create a copy before modifying

messages_copy = [msg.copy() for msg in messages]
if messages_copy[-1]["role"] == "assistant":
    messages_copy[-1]["role"] = "user"

Integration Points

1. Response Services Integration

All response services call this gateway:

# In response-3d-chatbot-service, response-text-chatbot-service, etc.
import requests

LLM_MODEL_SERVICE_URL = "http://llm-model-service:8016"

def get_llm_response(model_name, messages):
    response = requests.post(
        f"{LLM_MODEL_SERVICE_URL}/call-model/{model_name}",
        json={"messages": messages}
    )
    return response.json()["response"]

Usage:

# Get response from GPT-4
response = get_llm_response("openai-4", [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": user_question}
])

2. Model Selection Flow

User selects model in dashboard
    → Stored in CosmosDB (project config)
    → Response service reads model name
    → Calls /call-model/{model_name}
    → LLM Model Service routes to correct provider
    → Returns unified response

Summary

Service Statistics

  • Total Lines: 368
  • Total Endpoints: 1
  • Total Models: 9
  • Total Providers: 4
  • Total API Keys: 8 (all hardcoded ⚠️)

Key Capabilities

  1. Unified Interface - Single endpoint for all models
  2. Multi-Provider - Azure OpenAI, Azure AI, Gemini, Claude
  3. Error Handling - Consistent error responses
  4. Logging - Standardized logging (mostly)
  5. Provider Abstraction - Response services don't need provider code

Critical Fixes Needed

  1. 🔴 Externalize all 8 API keys - Most critical security issue
  2. 🟠 Restrict CORS - Prevent unauthorized access
  3. 🟡 Fix inconsistent logging - Use logger everywhere
  4. 🟡 Fix Ministral role modification - Don't modify input data

Deployment Notes

Docker Compose (Port 8016):

llm-model-service:
  build: ./llm-model-service
  container_name: llm-model-service
  ports:
    - "8016:8016"
  environment:
    - AZURE_OPENAI_API_KEY=*** # Must externalize
    - AZURE_LLAMA_API_KEY=***
    - DEEPSEEK_API_KEY=***
    # ... all 8 keys

Dependencies:

  • All response services depend on this gateway
  • Single point of failure for all LLM calls
  • No caching or rate limiting implemented
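
One way to externalize the keys without enumerating them in the compose file is Compose's env_file option; a sketch, assuming a git-ignored .env kept next to the service (hypothetical path):

llm-model-service:
  build: ./llm-model-service
  container_name: llm-model-service
  ports:
    - "8016:8016"
  env_file:
    - ./llm-model-service/.env  # git-ignored; holds all 8 keys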
