
LLM Model Service (Port 8016)

Service Path: machineagents-be/llm-model-service/
Port: 8016
Total Lines: 368
Purpose: Unified API gateway for multiple LLM providers, abstracting model-specific implementations and providing a single interface for all chatbot response services.


Table of Contents

  1. Service Overview
  2. Architecture & Dependencies
  3. Supported Models
  4. API Endpoint
  5. Model Implementations
  6. Request/Response Format
  7. Security Analysis
  8. Integration Points
  9. Summary
  10. Deployment Notes

Service Overview

Primary Responsibility

Unified LLM Gateway: Single endpoint /call-model/{model_name} routes requests to 9 different LLM providers

Key Capabilities

  1. Multi-Provider Support - 9 LLM models from 4 providers
  2. Standardized Interface - Same request/response format for all models
  3. Provider Abstraction - Response services don't need provider-specific code
  4. Error Handling - Consistent error responses across providers
  5. Logging - Standardized logging for all model calls

Architecture Pattern

Adapter Pattern:

Response Services
    → POST /call-model/{model_name}
    → Router (switch on model_name)
    → Model-Specific Adapter Functions
    → External LLM APIs

Architecture & Dependencies

Technology Stack

Framework:

  • FastAPI (web framework)
  • Uvicorn (ASGI server)

LLM SDK:

  • Azure OpenAI Python SDK (for GPT-4, GPT-3.5)
  • requests (for all other models)

Providers:

  1. Azure OpenAI (GPT-4, GPT-3.5)
  2. Azure AI Model Catalog (Llama, DeepSeek, Ministral, Phi, Grok)
  3. Google Gemini API
  4. Anthropic Claude API

Key Imports

from fastapi import FastAPI, HTTPException
from openai import AzureOpenAI
import requests
from pydantic import BaseModel
from typing import List

Environment Variables

Azure OpenAI:

ENDPOINT_GPT4=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_GPT4=gpt-4-0613
ENDPOINT_GPT35=https://machineagentopenai.openai.azure.com/...
DEPLOYMENT_GPT35=gpt-35-turbo-16k-0613
AZURE_OPENAI_API_KEY=AZxDVMYB08Aa...  # ⚠️ HARDCODED

Azure AI Models:

AZURE_LLAMA_ENDPOINT=https://Llama-3-3-70B-Instruct-ulmca.eastus.models.ai.azure.com/...
AZURE_LLAMA_API_KEY=JOfcw0VW0dS31Z8XgkNRSP9tUaBiwUYZ  # ⚠️ HARDCODED

DEEPSEEK_API_URL=https://DeepSeek-R1-imalr.eastus2.models.ai.azure.com/...
DEEPSEEK_API_KEY=GwUcGzHhhUbvApfMR4aq1ZPFUic6lbWE  # ⚠️ HARDCODED

MINISTRAL_API_URL=https://Ministral-3B-rvgab.eastus2.models.ai.azure.com/...
MINISTRAL_API_KEY=Z7fNcdnw5Tht1xAz6VlgUlLOeZoVTkIf  # ⚠️ HARDCODED

PHI_API_URL=https://Phi-3-small-8k-instruct-qvlpq.eastus2.models.ai.azure.com/...
PHI_API_KEY=T8I14He3lbMyAyUwNfffwG58e23EcXsU  # ⚠️ HARDCODED

External APIs:

GEMINI_API_ENDPOINT=https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=AIzaSyCx...  # ⚠️ KEY IN URL
GEMINI_API_KEY=  # Empty (key in endpoint URL)

CLAUDE_API_ENDPOINT=https://api.anthropic.com/v1/messages
CLAUDE_API_KEY=sk-ant-api03-lUgPozzMkSUbxOXr0Hya...  # ⚠️ HARDCODED

GROK_API_ENDPOINT=https://machineagents-resource.services.ai.azure.com/models/chat/completions?api-version=2024-05-01-preview
GROK_API_KEY=DZHPHaEk96KHbCgaD3fsaI7opVSL...  # ⚠️ HARDCODED

Supported Models

9 LLM Models Across 4 Providers

| Model Name | Provider | Model ID | Endpoint Type | Max Tokens | Temperature |
| --- | --- | --- | --- | --- | --- |
| openai-4 | Azure OpenAI | gpt-4-0613 | SDK | Default | 0.7 |
| openai-35 | Azure OpenAI | gpt-35-turbo-16k-0613 | SDK | Default | 0.7 |
| llama | Azure AI | Llama-3-3-70B-Instruct | HTTP | 50 | 0.7 |
| deepseek | Azure AI | DeepSeek-R1 | HTTP | 100 | Default |
| ministral | Azure AI | Ministral-3B | HTTP | 100 | Default |
| phi | Azure AI | Phi-3-small-8k-instruct | HTTP | 100 | Default |
| Gemini Flash-2.5 | Google Gemini | gemini-2.0-flash | HTTP | 1000 | 0.7 |
| Claude sonnet 4 | Anthropic | claude-3-5-sonnet-20241022 | HTTP | 1024 | Default |
| Grok-3 | Azure AI (x.ai) | grok-3 | HTTP | 2048 | 1.0 |

Note: Model names are case-sensitive in the API request


API Endpoint

POST /call-model/{model_name}

Single Endpoint for All Models

Path Parameter:

  • model_name (string, case-sensitive) - One of: openai-4, openai-35, llama, deepseek, ministral, phi, Gemini Flash-2.5, Claude sonnet 4, Grok-3

Request Body:

{
  "messages": [
    {
      "role": "user",
      "content": "What is machine learning?"
    },
    {
      "role": "assistant",
      "content": "Machine learning is..."
    },
    {
      "role": "user",
      "content": "Tell me more"
    }
  ]
}

Response (Success):

{
  "response": "Machine learning is a subset of artificial intelligence..."
}

Response (Error):

{
  "detail": "Invalid model name"
}

Status Code: 400

Response (Model API Error):

{
  "detail": "GPT-4 API Error: Rate limit exceeded"
}

Status Code: 500
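
Internally, the routing amounts to a dispatch on the path parameter. A minimal sketch of that shape, assuming a handler table and adapter names matching the sections below (only call_openai_gpt4 appears verbatim in this document; the rest are assumed):

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical dispatch table; the actual service may use if/elif branches
MODEL_HANDLERS = {
    "openai-4": call_openai_gpt4,
    "openai-35": call_openai_gpt35,
    "llama": call_llama,
    # ... one entry per supported model name
}

@app.post("/call-model/{model_name}")
def call_model(model_name: str, request: MessageRequest):
    handler = MODEL_HANDLERS.get(model_name)
    if handler is None:
        raise HTTPException(status_code=400, detail="Invalid model name")
    # Adapters expect plain dicts, not Pydantic models
    return handler([msg.dict() for msg in request.messages])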


Model Implementations

1. Azure OpenAI Models (GPT-4, GPT-3.5)

Implementation:

client_gpt4 = AzureOpenAI(
    azure_endpoint=ENDPOINT_GPT4,
    api_key=SUBSCRIPTION_KEY,
    api_version="2024-02-15-preview"
)

def call_openai_gpt4(messages):
    response = client_gpt4.chat.completions.create(
        model=DEPLOYMENT_GPT4,
        messages=messages,
        temperature=0.7
    )
    return {"response": response.choices[0].message.content}

Features:

  • Uses official Azure OpenAI SDK
  • Same API key for both GPT-4 and GPT-3.5
  • Temperature: 0.7 (fixed)
  • No max_tokens limit (uses model default)
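
GPT-3.5 presumably follows the same shape against its own deployment; a minimal sketch, assuming a client_gpt35 constructed like client_gpt4 above:

def call_openai_gpt35(messages):
    # Same pattern as GPT-4, pointed at the GPT-3.5 deployment (assumed)
    response = client_gpt35.chat.completions.create(
        model=DEPLOYMENT_GPT35,
        messages=messages,
        temperature=0.7
    )
    return {"response": response.choices[0].message.content}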

2. Azure AI Models (Llama, DeepSeek, Ministral, Phi, Grok)

Common Pattern:

# Shared shape repeated (with per-model tweaks) by the Llama, DeepSeek,
# Ministral, Phi, and Grok adapters; the function name here is illustrative
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

def call_azure_ai_model(messages):
    payload = {
        "messages": messages,
        "max_tokens": 100,   # per-model value, see specifics below
        "temperature": 0.7
    }
    response = requests.post(API_URL, json=payload, headers=HEADERS)
    response.raise_for_status()  # raise on 4xx/5xx
    return {"response": response.json()["choices"][0]["message"]["content"]}

Llama Specifics:

  • Max tokens: 50 (lowest among all models)
  • Temperature: 0.7

DeepSeek Specifics:

  • Max tokens: 100
  • Extra logging for debugging
  • Detailed error handling (RequestException, KeyError)

Ministral Specifics:

  • ⚠️ Role Modification: Changes last message from assistant to user if needed
    if messages[-1]["role"] == "assistant":
        messages[-1]["role"] = "user"
    
  • Max tokens: 100

Phi Specifics:

  • Max tokens: 100
  • Standard implementation

Grok Specifics:

  • Max tokens: 2048 (highest)
  • Temperature: 1.0 (highest, more creative)
  • Model name in payload: grok-3
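
Combining those settings, the request body sent to the Grok endpoint would look roughly like this (illustrative message content):

{
  "model": "grok-3",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 2048,
  "temperature": 1.0
}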

3. Google Gemini

Message Format Transformation:

# Convert from OpenAI format to Gemini format
gemini_contents = [
    {
        "role": msg["role"],
        "parts": [{"text": msg["content"]}]
    }
    for msg in messages
]

payload = {
    "contents": gemini_contents,
    "generationConfig": {
        "temperature": 0.7,
        "maxOutputTokens": 1000
    }
}

# API key is in the URL, not headers
response = requests.post(GEMINI_API_ENDPOINT, json=payload)

Response Parsing:

response_json = response.json()

if "candidates" in response_json:
    return {"response": response_json["candidates"][0]["content"]["parts"][0]["text"]}
elif "error" in response_json:
    raise HTTPException(status_code=response.status_code, detail=response_json['error']['message'])

Features:

  • Different message format (parts-based)
  • API key in URL (not recommended for production)
  • Complex response structure with candidates
  • Max output tokens: 1000
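
For reference, a successful generateContent response is shaped roughly as follows (abridged); the parsing above walks candidates[0].content.parts[0].text:

{
  "candidates": [
    {
      "content": {
        "parts": [{"text": "Machine learning is..."}],
        "role": "model"
      },
      "finishReason": "STOP"
    }
  ]
}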

4. Anthropic Claude

Implementation:

headers = {
    "Content-Type": "application/json",
    "X-API-Key": CLAUDE_API_KEY,
    "anthropic-version": "2023-06-01"  # Required header
}

payload = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 1024,
    "messages": messages
}

response = requests.post(CLAUDE_API_ENDPOINT, json=payload, headers=headers)

Response Parsing:

response_json = response.json()

if "content" in response_json:
    # Join all text blocks
    return {"response": "".join([
        block["text"]
        for block in response_json["content"]
        if block["type"] == "text"
    ])}

Features:

  • Custom header: anthropic-version required
  • Custom header format: X-API-Key instead of Authorization
  • Multi-block response format
  • Model: claude-3-5-sonnet-20241022
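
A successful Messages API response is shaped roughly as follows (abridged); the text blocks inside content are what the parser joins:

{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "content": [
    {"type": "text", "text": "Machine learning is..."}
  ],
  "stop_reason": "end_turn"
}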

Request/Response Format

Pydantic Models

class Message(BaseModel):
    role: str  # "system", "user", or "assistant"
    content: str

class MessageRequest(BaseModel):
    messages: List[Message]

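FastAPI validates the body against these models before any adapter runs, so a malformed request (e.g. a message missing content) never reaches a provider and fails with an automatic 422 shaped roughly like:

{
  "detail": [
    {
      "loc": ["body", "messages", 0, "content"],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}
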
Example Request

cURL:

curl -X POST "http://localhost:8016/call-model/openai-4" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello"},
      {"role": "assistant", "content": "Hi there!"},
      {"role": "user", "content": "How are you?"}
    ]
  }'

Python:

import requests

response = requests.post(
    # note: requests percent-encodes the space in "Gemini Flash-2.5"
    "http://localhost:8016/call-model/Gemini Flash-2.5",
    json={
        "messages": [
            {"role": "user", "content": "What is AI?"}
        ]
    }
)

print(response.json()["response"])

Error Responses

Invalid Model:

HTTP 400
{
    "detail": "Invalid model name"
}

API Failure:

HTTP 500
{
    "detail": "DeepSeek API Request Error: Connection timeout"
}

Security Analysis

🔴 CRITICAL: 8 Hardcoded API Keys

All 8 API keys are hardcoded in the source code (the Azure OpenAI key is shared by GPT-4 and GPT-3.5; the Gemini key is embedded in its endpoint URL):

  1. Azure OpenAI (Line 36): AZURE_OPENAI_API_KEY
  2. Llama (Line 44): AZURE_LLAMA_API_KEY
  3. DeepSeek (Line 46): DEEPSEEK_API_KEY
  4. Ministral (Line 48): MINISTRAL_API_KEY
  5. Phi (Line 50): PHI_API_KEY
  6. Gemini (Line 53): API key in URL
  7. Claude (Line 57): CLAUDE_API_KEY
  8. Grok (Line 59): GROK_API_KEY

Risk:

  • Keys exposed in version control
  • Billing fraud potential
  • Unauthorized API access
  • No key rotation possible

Fix:

# Remove defaults entirely
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
if not AZURE_OPENAI_API_KEY:
    raise ValueError("AZURE_OPENAI_API_KEY environment variable not set")

🟠 SECURITY: Overly Permissive CORS

Lines 23-29:

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Risk: Any website can call LLM API through this service

Fix:

allow_origins=[
    "https://app.machineagents.ai",
    "https://admin.machineagents.ai"
]

🟡 CODE QUALITY: Inconsistent Logging

Issue: Mix of logger and logging

Lines 166-169 (DeepSeek):

logging.info(f"DeepSeek Payload: {payload}")  # lowercase 'logging'
logger.info("✓ DeepSeek API call successful")  # 'logger' object

Impact: Inconsistent log output format

Fix: Use logger consistently everywhere
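
A minimal sketch of the consistent form, assuming the module-level logger the service already defines:

import logging

logger = logging.getLogger(__name__)

# Route every message through the module logger, never the root 'logging'
logger.info(f"DeepSeek Payload: {payload}")
logger.info("✓ DeepSeek API call successful")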

🟡 CODE QUALITY: Ministral Role Modification

Lines 188-190:

if messages[-1]["role"] == "assistant":
    messages[-1]["role"] = "user"
    logging.info("Modified the last message role to 'user'.")

Issue: Modifies input data (side effect)

Impact:

  • Unexpected behavior for callers
  • May break conversation flow
  • Hard to debug

Fix: Create a copy before modifying

messages_copy = [msg.copy() for msg in messages]
if messages_copy[-1]["role"] == "assistant":
    messages_copy[-1]["role"] = "user"

Integration Points

1. Response Services Integration

All response services call this gateway:

# In response-3d-chatbot-service, response-text-chatbot-service, etc.
import requests

LLM_MODEL_SERVICE_URL = "http://llm-model-service:8016"

def get_llm_response(model_name, messages):
    response = requests.post(
        f"{LLM_MODEL_SERVICE_URL}/call-model/{model_name}",
        json={"messages": messages}
    )
    return response.json()["response"]

Usage:

# Get response from GPT-4
response = get_llm_response("openai-4", [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": user_question}
])

2. Model Selection Flow

User selects model in dashboard
    → Stored in CosmosDB (project config)
    → Response service reads model name
    → Calls /call-model/{model_name}
    → LLM Model Service routes to correct provider
    → Returns unified response

Summary

Service Statistics

  • Total Lines: 368
  • Total Endpoints: 1
  • Total Models: 9
  • Total Providers: 4
  • Total API Keys: 8 (all hardcoded ⚠️)

Key Capabilities

  1. Unified Interface - Single endpoint for all models
  2. Multi-Provider - Azure OpenAI, Azure AI, Gemini, Claude
  3. Error Handling - Consistent error responses
  4. Logging - Standardized logging (mostly)
  5. Provider Abstraction - Response services don't need provider code

Critical Fixes Needed

  1. 🔴 Externalize all 8 API keys - Most critical security issue
  2. 🟠 Restrict CORS - Prevent unauthorized access
  3. 🟡 Fix inconsistent logging - Use logger everywhere
  4. 🟡 Fix Ministral role modification - Don't modify input data

Deployment Notes

Docker Compose (Port 8016):

llm-model-service:
  build: ./llm-model-service
  container_name: llm-model-service
  ports:
    - "8016:8016"
  environment:
    - AZURE_OPENAI_API_KEY=*** # Must externalize
    - AZURE_LLAMA_API_KEY=***
    - DEEPSEEK_API_KEY=***
    # ... all 8 keys

Dependencies:

  • All response services depend on this gateway
  • Single point of failure for all LLM calls
  • No caching or rate limiting implemented
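
One way to externalize the keys without enumerating them in the compose file is Compose's env_file option; a sketch, assuming a git-ignored .env kept next to the service (hypothetical path):

llm-model-service:
  build: ./llm-model-service
  container_name: llm-model-service
  ports:
    - "8016:8016"
  env_file:
    - ./llm-model-service/.env  # git-ignored; holds all 8 keys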
