The landscape of Generative AI is rapidly evolving, and with it comes the need for more efficient, flexible ways to deploy and interact with Large Language Models (LLMs). With the release of Meta's Llama 3 and MAX 24.6, developers now have access to a truly native Generative AI platform that simplifies the entire journey from development to production deployment.
MAX 24.6 introduces MAX GPU, our new vertically integrated serving stack that delivers high-performance inference without vendor-specific dependencies. At its core are two revolutionary technologies:
MAX Engine, our high-performance AI model compiler and runtime, and MAX Serve, a sophisticated Python-native serving layer engineered specifically for LLM applications. In this blog, we'll leverage these innovations to create a chat application that uses Llama 3.
In this blog, we will cover:
- How to set up a chat application using Llama 3 and MAX.
- Implementing efficient token management through rolling context windows.
- Handling concurrent requests for optimal performance.
- Containerizing and deploying your application with Docker Compose.

We'll walk through building a solution that showcases MAX's NVIDIA GPU-optimized capabilities, featuring efficient token management through rolling context windows, concurrent request handling, and straightforward deployment with Docker Compose. For more details on deployment, check out our tutorials on deploying Llama 3 on GPU with MAX Serve to AWS, GCP, or Azure, or on Kubernetes. Our implementation demonstrates how MAX Serve's native Hugging Face integration and OpenAI-compatible API make it simple to develop and deploy high-performance chat applications.
Whether you're building a proof-of-concept or scaling to production, this guide provides everything you need to get started with Llama 3 on MAX. Let's dive into creating your own GPU-accelerated chat application using our native serving stack, designed to deliver consistent and reliable performance even under heavy workloads.
Quick start: running the chat app
Getting started with our chat app is straightforward. Follow these steps to set up and run the application using Docker Compose:
Prerequisites
Ensure your system meets these requirements: Docker with Docker Compose and Buildx, a Hugging Face access token, and (for local serving) a supported NVIDIA GPU. Export your Hugging Face token as an environment variable:
Bash
export HUGGING_FACE_HUB_TOKEN="your-token-here"
Clone the repository
Clone the Llama 3 Chat repository to your local machine:
Bash
git clone https://github.com/modularml/devrel-extras/
cd devrel-extras/blogs/llama3-chat
Build the Docker images
Create and use a Docker builder (required only once):
Bash
docker buildx create --use --name mybuilder
Build the UI image for your platform:
Bash
# Intel, AMD
docker buildx bake --load --set "ui.platform=linux/amd64"
# OR for ARM such as Apple M-series
docker buildx bake --load --set "ui.platform=linux/arm64"
Start the services
If you don't have access to a supported NVIDIA GPU locally, you can instead follow our tutorials on deploying Llama 3 on GPU with MAX Serve to AWS, GCP, or Azure, or on Kubernetes to get a public IP (serving on port 80), and then run the UI component separately as follows:
Bash
docker run -p 7860:7860 \
-e "BASE_URL=http://PUBLIC_IP/v1" \
-e "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
llama3-chat-ui
If you do have a supported NVIDIA GPU available locally, launch both services via Docker Compose:
Bash
docker compose up
Once the Llama 3 server and UI server are running, open http://localhost:7860 to view the chat interface:
Chat interface

Development
Alternatively, particularly for development, you can run the MAX Serve Docker container on its own on a compatible GPU machine:
Bash
docker run -d \
--env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
--gpus 1 \
-p 8000:8000 \
--ipc=host \
modular/max-openai-api:24.6.0 \
--huggingface-repo-id modularai/llama-3.1 \
--max-cache-batch-size 1 \
--max-length 4096
Then launch the UI separately via the Magic CLI (install Magic if you haven't already; for more details, check out this step-by-step guide to Magic):
Bash
magic run python ui.py
Note: Check the available UI options with magic run python ui.py --help. For example, you can also launch the UI against a remote public IP as follows:
Bash
magic run python ui.py --base-url http://YOUR_PUBLIC_IP/v1
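Because MAX Serve exposes an OpenAI-compatible API, you can also sanity-check the server without the UI. The request below is a sketch: it assumes the server started by the docker run command above is listening on port 8000, and that the model name mirrors the --huggingface-repo-id flag.
Bash
# Quick check of the OpenAI-compatible chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "modularai/llama-3.1",
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant."},
          {"role": "user", "content": "Say hello in one sentence."}
        ]
      }'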
Features of Llama 3 chat app
- Gradio-based interface: A sleek, interactive UI built with Gradio for intuitive interactions.
- Seamless integration: Leverages Llama 3 models via MAX Serve on GPU, ensuring rapid and efficient chat responses.
- Customizable environment: Adjust settings like context window size, batch size, and system prompts to suit your needs.
- Efficient continuous chat: Employs a rolling context window implementation that dynamically maintains the chat context without exceeding the maximum token limit.

Architecture overview
Our chat application consists of three main components:
- Frontend layer: A Gradio-based web interface that provides real-time chat interactions.
- MAX Serve layer: Our OpenAI-compatible API server that handles:
  - Request batching and scheduling through advanced techniques such as continuous batching.
  - Token management and context windows.
  - Model inference optimization.
- Model layer: Llama 3 running on MAX Engine, optimized for GPU inference.

Chat application architecture

This architecture ensures:
- Efficient resource utilization through batched inference.
- Scalable request handling via concurrent processing.
- Optimized memory management with rolling context windows.

Technical deep dive

Continuous chat with rolling context window
A key feature of our chat application is the rolling context window. This mechanism ensures that conversations remain coherent and contextually relevant without overwhelming system resources. Here's an in-depth look at how this is achieved:
1. Dynamic token management
The ChatConfig class is responsible for tracking token usage and maintaining a rolling window of messages within the configured token limit. Tokens are the fundamental units processed by language models, and managing them efficiently is crucial for performance and cost-effectiveness.
Python
from typing import Dict, List

from transformers import AutoTokenizer


class ChatConfig:
    def __init__(self, base_url: str, max_context_window: int):
        self.base_url = base_url
        self.max_context_window = max_context_window
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

    def count_tokens(self, messages: List[Dict]) -> int:
        num_tokens = 0
        for message in messages:
            # Wrap each message with the chat special tokens before counting.
            text = f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
            num_tokens += len(self.tokenizer.encode(text))
        return num_tokens
How it works:
- Token counting: Each message's content is wrapped with special tokens (<|im_start|> and <|im_end|>) to denote the start and end of a message. The tokenizer then encodes this text and counts the number of tokens.
- Configuration: The max_context_window parameter defines the maximum number of tokens allowed in the conversation context. This ensures that the application doesn't exceed the model's capacity, maintaining efficiency.

2. Prioritized message inclusion
To maintain the conversation's relevance, the latest user and system messages are always included. Older messages are trimmed dynamically when the token count exceeds the window size.
Python
# Walk the history newest-first, adding message pairs while they fit in the window.
if chat_history:
    for user_msg, bot_msg in reversed(chat_history):
        new_messages = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": bot_msg},
        ]
        history_tokens = config.count_tokens(new_messages)
        if running_total + history_tokens <= config.max_context_window:
            history_messages = new_messages + history_messages
            running_total += history_tokens
        else:
            break
How it works:
- Reversed iteration: By iterating over the chat history in reverse, the system prioritizes the most recent messages.
- Token check: For each pair of user and assistant messages, the total tokens are calculated. If adding these messages keeps the total within the max_context_window, they are included in the active context.
- Dynamic trimming: Once the token limit is approached, older messages are excluded, ensuring the context remains within bounds.

3. Efficient resource usage
By keeping the active context concise and relevant, the system optimizes resource usage and maintains high performance even during extended interactions. This approach prevents unnecessary memory consumption and ensures the application remains responsive.
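For a concrete feel of the numbers involved, here is a minimal sketch that uses the ChatConfig class shown earlier; the sample messages are made up:
Python
# Minimal illustration: measure how much of the window a short exchange uses.
config = ChatConfig(base_url="http://localhost:8000/v1", max_context_window=4096)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is a rolling context window?"},
    {"role": "assistant", "content": "It keeps only the most recent messages that fit the token budget."},
]

used = config.count_tokens(messages)
print(f"Active context: {used}/{config.max_context_window} tokens")
# Older message pairs are appended only while the running total stays within the window.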
Chat user interface
The UI logic lives in the ui.py file and is central to the continuous chat interface. Here's how it enables the chat system:
Gradio integration
Gradio provides a user-friendly interface, making interactions intuitive and accessible.
Python
import gradio as gr


def create_interface(config: ChatConfig, client, system_prompt, concurrency_limit: int = 1):
    with gr.Blocks(theme="soft") as iface:
        gr.Markdown("# Chat with Llama 3 model\n\nPowered by Modular [MAX](https://docs.modular.com/max/) 🚀")
        chatbot = gr.Chatbot(height=400)
        msg = gr.Textbox(label="Message", placeholder="Type your message here...")
        clear = gr.Button("Clear")
        initial_usage = f"**Total Tokens Generated**: 0 | Context Window: {config.max_context_window}"
        token_display = gr.Markdown(initial_usage)

        async def respond_wrapped(message, chat_history):
            async for response in respond(message, chat_history, config, client, system_prompt):
                yield response

        msg.submit(
            respond_wrapped,
            [msg, chatbot],
            [chatbot, token_display],
            api_name="chat"
        ).then(lambda: "", None, msg)
        clear.click(lambda: ([], initial_usage), None, [chatbot, token_display], api_name="clear")

    iface.queue(default_concurrency_limit=concurrency_limit)
    return iface
Key components:
- Markdown: Displays the application title and branding.
- Chatbot component: Shows the conversation history.
- Textbox: Allows users to input messages.
- Clear button: Resets the conversation.
- Token display: Shows the total tokens generated and the current context window usage.
- Asynchronous response handling: Ensures smooth and non-blocking interactions.

Server interaction
The interface communicates with the Llama 3 model via the MAX Serve API to fetch chat completions.
Python
async def respond(message, chat_history, config: ChatConfig, client, system_prompt):
    chat_history = chat_history or []
    if not isinstance(message, str) or not message.strip():
        yield chat_history, f"**Active Context**: 0/{config.max_context_window}"
        return

    # Build the prompt: system prompt, then the rolling window of prior messages
    # (history_messages / running_total from the snippet above), then the new user message.
    messages = [system_prompt]
    current_message = {"role": "user", "content": message}
    messages.extend(history_messages)
    messages.append(current_message)
    chat_history.append([message, None])

    response = await client.chat.completions.create(
        model=config.model_repo_id,
        messages=messages,
        stream=True,
        max_tokens=config.max_context_window,
    )

    # Stream tokens back to the UI as they arrive.
    bot_message = ""
    async for chunk in response:
        if chunk.choices and chunk.choices[0].delta.content:
            bot_message += chunk.choices[0].delta.content
            chat_history[-1][1] = bot_message
            yield chat_history, f"**Active Context**: {running_total}/{config.max_context_window}"
Health checks
The wait_for_healthy function ensures the MAX Serve API is ready before processing requests, retrying until the server is live.
Python
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_fixed,
    retry_if_exception_type,
    retry_if_result,
)


def wait_for_healthy(base_url: str):
    @retry(
        stop=stop_after_attempt(20),
        wait=wait_fixed(60),
        retry=(
            retry_if_exception_type(requests.RequestException)
            | retry_if_result(lambda x: x.status_code != 200)
        ),
    )
    def _check_health():
        return requests.get(f"{base_url}/health", timeout=5)

    return _check_health()
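A minimal usage sketch, assuming the same base URL used elsewhere in this post (ui.py may wire this up differently):
Python
# Block at startup until the MAX Serve health endpoint returns HTTP 200.
wait_for_healthy("http://localhost:8000/v1")
print("MAX Serve is ready; launching the Gradio UI...")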
Explaining docker-compose.yml
The docker-compose.yml content is as follows:
docker-compose.yml
services:
  ui:
    container_name: llama3-chat-ui
    build:
      context: .
      dockerfile: Dockerfile.ui
    ports:
      - "7860:7860"
    depends_on:
      - server
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - BASE_URL=http://server:8000/v1
      - MAX_CONTEXT_WINDOW=${MAX_CONTEXT_WINDOW:-4096}
      - CONCURRENCY_LIMIT=${MAX_CACHE_BATCH_SIZE:-1}
      - SYSTEM_PROMPT="You are a helpful AI assistant."
      - API_KEY=${API_KEY:-local}

  server:
    image: modular/max-openai-api:24.6.0
    container_name: llama3-chat-server
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - HF_HUB_ENABLE_HF_TRANSFER=1
    volumes:
      - $HOME/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    command: "\
      --huggingface-repo-id ${HUGGINGFACE_REPO_ID:-modularai/llama-3.1} \
      --max-length ${MAX_CONTEXT_WINDOW:-4096} \
      --max-cache-batch-size ${MAX_CACHE_BATCH_SIZE:-1}"
The docker-compose.yml orchestrates both the UI and server components:

- UI service:
  - Builds the Gradio interface using Dockerfile.ui.
  - Communicates with the server via the environment variable BASE_URL.
- Server service:
  - Runs the MAX Serve API using the specified image.
  - Uses NVIDIA GPUs for inference, ensuring optimal performance.
  - Shares the necessary Hugging Face cache data via mounted volumes.

Dockerfile.ui component
We use the official magic-docker base image ghcr.io/modular/magic:noble to create our Dockerfile.ui as follows. It supports multi-platform builds by leveraging the following configuration:
Dockerfile.ui
FROM ghcr.io/modular/magic:noble AS build
RUN apt-get update && apt-get install -y build-essential
WORKDIR /app
COPY pyproject.toml ui.py ./
RUN magic clean
RUN magic install
RUN magic shell-hook > /shell-hook.sh && \
echo 'exec "$@" 2>&1' >> /shell-hook.sh
FROM ghcr.io/modular/magic:noble AS runtime
COPY --from=build /app /app
COPY --from=build /shell-hook.sh /shell-hook.sh
WORKDIR /app
ENV PYTHONUNBUFFERED=1
ENTRYPOINT ["/bin/bash", "/shell-hook.sh"]
CMD ["magic", "run", "python", "ui.py"]
To define target platforms for a multi-platform build, we include the following in docker-bake.hcl:
docker-bake.hcl
# Define the target platforms
variable "PLATFORMS" {
  default = ["linux/amd64", "linux/arm64"]
}

# Default target group
group "default" {
  targets = ["ui"]
}

# UI service target
target "ui" {
  context    = "."
  dockerfile = "Dockerfile.ui"
  platforms  = "${PLATFORMS}"
  tags       = ["llama3-chat-ui"]
  output     = ["type=docker"]
}
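If you want to see how the bake file resolves (which platforms, tags, and Dockerfile the ui target uses) without actually building, Buildx can print the resolved configuration:
Bash
# Print the resolved build definition for the ui target without building it
docker buildx bake --print ui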
Configuration and customization

Environment variables
- MAX_CONTEXT_WINDOW: Maximum tokens for the context window (default: 4096).
- CONCURRENCY_LIMIT: Must match MAX_CACHE_BATCH_SIZE, which enables continuous batching in MAX Serve for efficient handling of concurrent streaming requests.
- SYSTEM_PROMPT: Default system prompt for the AI assistant.

Dependency management
All dependencies are managed via Magic and defined in pyproject.toml to ensure consistency across environments:
pyproject.toml
[project]
dependencies = [
    "gradio>=5.8.0,<6",
    "fastapi>=0.115.6,<0.116",
    "requests>=2.32.3,<3",
    "openai>=1.57.3,<2",
    "tenacity>=9.0.0,<10",
    "transformers>=4.47.0,<5",
    "click>=8.1.7,<9"
]
description = "Chat with llama3 MAX Serve on GPU"
name = "llama3-chat"
requires-python = ">= 3.9,<3.13"
version = "0.0.0"

[build-system]
build-backend = "hatchling.build"
requires = ["hatchling"]

[tool.hatch.build.targets.wheel]
packages = ["."]

[tool.pixi.project]
channels = ["conda-forge", "https://conda.modular.com/max"]
platforms = ["linux-64", "osx-arm64", "linux-aarch64"]

[tool.pixi.pypi-dependencies]
llama3_chat = { path = ".", editable = true }
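As a quick example of wiring these environment variables into the Docker Compose setup, you can override the defaults at launch time; the values below are purely illustrative:
Bash
# Illustrative overrides; docker-compose.yml falls back to its defaults otherwise
export MAX_CONTEXT_WINDOW=8192
export MAX_CACHE_BATCH_SIZE=4
docker compose up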
Performance considerations
When deploying your chat application, consider these key factors:

Context window size
- Default: 4096 tokens (set via --max-length in MAX Serve).
- Larger windows increase memory usage but maintain more conversation context.
- Recommended: Start with 4096 and adjust based on your use case.

Continuous batching
- MAX_CACHE_BATCH_SIZE controls concurrent request handling via continuous batching (set via --max-cache-batch-size in MAX Serve).
- Higher values increase throughput but may impact latency.
- Recommended: Start with 1 and increase based on your GPU capacity. MAX Serve also prints a recommendation for the optimal size at startup.

Memory management
- Monitor GPU memory usage with nvidia-smi.
- Consider implementing additional caching for frequent responses.

Check out the various configuration options by running:
Bash
git clone https://github.com/modularml/max
cd max/pipelines/python
magic run serve --help
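While the server handles requests, you can keep an eye on GPU memory as suggested above, for example:
Bash
# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi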
On the serving side, make sure to check out the benchmarking blog too.
Clean up (Optional)
To stop and clean up all resources:
Bash
docker compose down
Remove all images related to the project:
Bash
docker rmi $(docker images -q modular/max-openai-api:24.6.0)
docker rmi $(docker images -q llama3-chat-ui)
Remove the Docker builder (if no longer needed):
Bash
docker buildx rm mybuilder
Conclusion
In this tutorial, we've built a functional chat application using Llama 3 and MAX 24.6. We've explored:
- Basic setup: Using Docker and NVIDIA GPU support to create a working environment.
- Architecture overview: Creating a three-layer system with a Gradio frontend, MAX Serve API, and Llama 3 model backend.
- Token management: Implementing rolling context windows to maintain conversation history.
- Performance basics: Understanding batch processing and concurrent request handling.
- Simple deployment: Using Docker Compose to run the application.
- Configuration options: Managing environment variables and dependencies.

This demo shows how MAX's GPU-optimized serving stack can be combined with Llama 3 to create interactive chat applications. While this implementation focuses on the basics, it provides a foundation that you can build upon for your own projects.
Next steps
- Deploy Llama 3 on GPU with MAX Serve to AWS, GCP, or Azure, or on Kubernetes.
- Explore MAX's documentation for additional features.
- Join our Modular Forum and Discord community to share your experiences and get support.
We're excited to see what you'll build with Llama 3 and MAX! Share your projects and experiences with us using #ModularAI on social media.