December 17, 2024

Build a Continuous Chat Interface with Llama 3 and MAX Serve

The landscape of Generative AI is rapidly evolving, and with it comes the need for more efficient, flexible ways to deploy and interact with Large Language Models (LLMs). With the release of Meta's Llama 3 and MAX 24.6, developers now have access to a truly native Generative AI platform that simplifies the entire journey from development to production deployment.

MAX 24.6 introduces MAX GPU, our new vertically integrated serving stack that delivers high-performance inference without vendor-specific dependencies. At its core are two revolutionary technologies:

  • MAX Engine, our high-performance AI model compiler and runtime, and
  • MAX Serve, a sophisticated Python-native serving layer engineered specifically for LLM applications.

In this blog, we'll leverage these innovations to create a chat application that uses Llama 3.

Specifically, we will cover:

  • How to set up a chat application using Llama 3 and MAX.
  • Implementing efficient token management through rolling context windows.
  • Handling concurrent requests for optimal performance.
  • Containerizing and deploying your application with Docker Compose.

We'll walk through building a solution that showcases MAX's NVIDIA GPU-optimized capabilities, featuring efficient token management through rolling context windows, concurrent request handling, and a straightforward Docker Compose deployment for demonstration purposes. For more details on deployment, check out our tutorials on deploying Llama 3 on GPU with MAX Serve to AWS, GCP, or Azure, or on Kubernetes. Our implementation demonstrates how MAX Serve's native Hugging Face integration and OpenAI-compatible API make it simple to develop and deploy high-performance chat applications.

Whether you're building a proof-of-concept or scaling to production, this guide provides everything you need to get started with Llama 3 on MAX. Let's dive into creating your own GPU-accelerated chat application using our native serving stack, designed to deliver consistent and reliable performance even under heavy workloads.

Quick start: running the chat app

Getting started with our chat app is straightforward. Follow these steps to set up and run the application using Docker Compose:

Prerequisites

Before you begin, make sure you have Docker installed, access to a supported NVIDIA GPU (or a cloud deployment, as described below), and a Hugging Face access token exported as an environment variable:

Bash
export HUGGING_FACE_HUB_TOKEN="your-token-here"

Clone the repository

Clone the Llama 3 Chat repository to your local machine:

Bash
git clone https://github.com/modularml/devrel-extras/
cd devrel-extras/blogs/llama3-chat

Build the Docker images

Create and use a Docker builder (required only once):

Bash
docker buildx create --use --name mybuilder

Build the UI image for your platform:

Bash
# Intel, AMD
docker buildx bake --load --set "ui.platform=linux/amd64"
# OR for ARM such as Apple M-series
docker buildx bake --load --set "ui.platform=linux/arm64"

Start the services

  • If you don't have access to a supported NVIDIA GPU locally, you can instead follow our tutorials on deploying Llama 3 on GPU with MAX Serve to AWS, GCP, or Azure, or on Kubernetes, to get a public IP (serving on port 80), and then run the UI component separately as follows:
Bash
docker run -p 7860:7860 \
  -e "BASE_URL=http://PUBLIC_IP/v1" \
  -e "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  llama3-chat-ui
  • Or, if you do have access to a supported NVIDIA GPU locally, launch both services via Docker Compose:
Bash
docker compose up

Once the Llama 3 server and the UI are running, open http://localhost:7860 to view the chat interface:

Chat interface

Development

Alternatively, particularly during development, you can run the MAX Serve Docker container on its own on a compatible GPU machine:

Bash
docker run -d \
  --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --gpus 1 \
  -p 8000:8000 \
  --ipc=host \
  modular/max-openai-api:24.6.0 \
  --huggingface-repo-id modularai/llama-3.1 \
  --max-cache-batch-size 1 \
  --max-length 4096

and launch the UI separately via the Magic CLI (install Magic if you haven't already; for more details, check out this step-by-step guide to Magic):

Bash
magic run python ui.py

Note: Check the available UI options with magic run python ui.py --help. For example, you can also launch the UI against a remote public IP as follows:

Bash
magic run python ui.py --base-url http://YOUR_PUBLIC_IP/v1

Features of Llama 3 chat app

  • Gradio-based interface: A sleek, interactive UI built with Gradio for intuitive interactions.
  • Seamless integration: Leverages Llama 3 models via MAX Serve on GPU, ensuring rapid and efficient chat responses.
  • Customizable environment: Adjust settings like context window size, batch size, and system prompts to suit your needs.
  • Efficient continuous chat: Employs a rolling context window implementation that dynamically maintains the chat context without exceeding the maximum token limit.

Architecture overview

Our chat application consists of three main components:

  1. Frontend layer: A Gradio-based web interface that provides real-time chat interactions.
  2. MAX Serve layer: Our OpenAI-compatible API server that handles:
    • Request batching and scheduling through advanced techniques such as continuous batching.
    • Token management and context windows.
    • Model inference optimization.
  3. Model layer: Llama 3 running on MAX Engine, optimized for GPU inference.
Chat application architecture

This architecture ensures:

  • Efficient resource utilization through batched inference.
  • Scalable request handling via concurrent processing.
  • Optimized memory management with rolling context windows.
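
Because MAX Serve exposes an OpenAI-compatible API, any standard OpenAI client can talk to the serving layer directly. Here is a minimal sketch, assuming the local quick-start setup on port 8000, the modularai/llama-3.1 model name, and the placeholder API key from docker-compose.yml:

Python
from openai import OpenAI

# Point the standard OpenAI client at the local MAX Serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

response = client.chat.completions.create(
    model="modularai/llama-3.1",  # matches the --huggingface-repo-id used in this blog
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Say hello in one sentence."},
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)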

Technical deep dive

Continuous chat with rolling context window

A key feature of our chat application is the rolling context window. This mechanism ensures that conversations remain coherent and contextually relevant without overwhelming system resources. Here's an in-depth look at how this is achieved:

1. Dynamic token management

The ChatConfig class is responsible for tracking token usage and maintaining a rolling window of messages within the configured token limit. Tokens are the fundamental units processed by language models, and managing them efficiently is crucial for performance and cost-effectiveness.

Python
from typing import Dict, List

from transformers import AutoTokenizer


class ChatConfig:
    def __init__(self, base_url: str, max_context_window: int):
        self.base_url = base_url
        self.max_context_window = max_context_window
        self.tokenizer = AutoTokenizer.from_pretrained(
            "meta-llama/Meta-Llama-3.1-8B-Instruct"
        )

    def count_tokens(self, messages: List[Dict]) -> int:
        num_tokens = 0
        for message in messages:
            # Wrap each message in the chat template's special tokens before counting
            text = f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
            num_tokens += len(self.tokenizer.encode(text))
        return num_tokens

How it works:

  • Token counting: Each message's content is wrapped with special tokens (<|im_start|> and <|im_end|>) to denote the start and end of a message. The tokenizer then encodes this text and counts the number of tokens.
  • Configuration: The max_context_window parameter defines the maximum number of tokens allowed in the conversation context. This ensures that the application doesn't exceed the model's capacity, maintaining efficiency.
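
As a quick sanity check, the counter can be exercised in isolation. The snippet below is a minimal sketch: it assumes the ChatConfig class above is in scope, a Hugging Face token with access to the Llama 3.1 tokenizer, and placeholder messages.

Python
# Count tokens for a toy conversation against the configured window.
config = ChatConfig(base_url="http://localhost:8000/v1", max_context_window=4096)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is a rolling context window?"},
]
print(f"{config.count_tokens(messages)} / {config.max_context_window} tokens used")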

2. Prioritized message inclusion

To maintain the conversation's relevance, the latest user and system messages are always included. Older messages are trimmed dynamically when the token count exceeds the window size.

Python
if chat_history:
    for user_msg, bot_msg in reversed(chat_history):
        new_messages = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": bot_msg},
        ]
        history_tokens = config.count_tokens(new_messages)
        if running_total + history_tokens <= config.max_context_window:
            history_messages = new_messages + history_messages
            running_total += history_tokens
        else:
            break

How it works:

  • Reversed iteration: By iterating over the chat history in reverse, the system prioritizes the most recent messages.
  • Token check: For each pair of user and assistant messages, the total tokens are calculated. If adding these messages keeps the total within the max_context_window, they are included in the active context.
  • Dynamic trimming: Once the token limit is approached, older messages are excluded, ensuring the context remains within bounds.
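
Putting these pieces together, the payload sent to the server is roughly the system prompt, followed by the trimmed history, followed by the newest user message. The sketch below is a simplified, standalone version of that assembly; the function name build_messages and its exact signature are illustrative, not taken from the actual ui.py.

Python
from typing import Dict, List, Tuple

def build_messages(
    config,                               # the ChatConfig shown earlier
    system_prompt: Dict,                  # e.g. {"role": "system", "content": "..."}
    chat_history: List[Tuple[str, str]],  # (user_msg, bot_msg) pairs, oldest first
    new_message: str,
) -> List[Dict]:
    current = {"role": "user", "content": new_message}
    running_total = config.count_tokens([system_prompt, current])

    history_messages: List[Dict] = []
    # Walk the history newest-first so the most recent turns survive trimming.
    for user_msg, bot_msg in reversed(chat_history):
        pair = [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": bot_msg},
        ]
        pair_tokens = config.count_tokens(pair)
        if running_total + pair_tokens <= config.max_context_window:
            history_messages = pair + history_messages
            running_total += pair_tokens
        else:
            break  # older turns no longer fit in the window

    return [system_prompt, *history_messages, current]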

3. Efficient resource usage

By keeping the active context concise and relevant, the system optimizes resource usage and maintains high performance even during extended interactions. This approach prevents unnecessary memory consumption and ensures the application remains responsive.

Chat user interface

The UI logic lives in the ui.py file and is central to the continuous chat interface. Here's how it enables the chat system:

Gradio integration

Gradio provides a user-friendly interface, making interactions intuitive and accessible.

Python
def create_interface(config: ChatConfig, client, system_prompt, concurrency_limit: int = 1):
    with gr.Blocks(theme="soft") as iface:
        gr.Markdown(
            "# Chat with Llama 3 model\n\nPowered by Modular [MAX](https://docs.modular.com/max/) 🚀"
        )
        chatbot = gr.Chatbot(height=400)
        msg = gr.Textbox(label="Message", placeholder="Type your message here...")
        clear = gr.Button("Clear")
        initial_usage = f"**Total Tokens Generated**: 0 | Context Window: {config.max_context_window}"
        token_display = gr.Markdown(initial_usage)

        async def respond_wrapped(message, chat_history):
            async for response in respond(message, chat_history, config, client, system_prompt):
                yield response

        msg.submit(
            respond_wrapped,
            [msg, chatbot],
            [chatbot, token_display],
            api_name="chat",
        ).then(lambda: "", None, msg)
        clear.click(
            lambda: ([], initial_usage),
            None,
            [chatbot, token_display],
            api_name="clear",
        )

        iface.queue(default_concurrency_limit=concurrency_limit)
    return iface

Key components:

  • Markdown: Displays the application title and branding.
  • Chatbot component: Shows the conversation history.
  • Textbox: Allows users to input messages.
  • Clear button: Resets the conversation.
  • Token display: Shows the total tokens generated and the current context window usage.
  • Asynchronous response handling: Ensures smooth and non-blocking interactions.

Server interaction

The interface communicates with the Llama 3 model via the MAX Serve API to fetch chat completions.

Python
async def respond(message, chat_history, config: ChatConfig, client, system_prompt):
    chat_history = chat_history or []
    if not isinstance(message, str) or not message.strip():
        yield chat_history, f"**Active Context**: 0/{config.max_context_window}"
        return

    # ... rolling-context-window step shown earlier: builds `history_messages`
    # and tracks `running_total` tokens ...

    messages = [system_prompt]
    current_message = {"role": "user", "content": message}
    messages.extend(history_messages)
    messages.append(current_message)
    chat_history.append([message, None])

    response = await client.chat.completions.create(
        model=config.model_repo_id,  # e.g. "modularai/llama-3.1"
        messages=messages,
        stream=True,
        max_tokens=config.max_context_window,
    )

    bot_message = ""
    async for chunk in response:
        if chunk.choices[0].delta.content:
            bot_message += chunk.choices[0].delta.content
            chat_history[-1][1] = bot_message
            yield chat_history, f"**Active Context**: {running_total}/{config.max_context_window}"

Health checks

The wait_for_healthy function ensures the MAX Serve API is ready before processing requests, retrying until the server is live.

Python
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_fixed,
    retry_if_exception_type,
    retry_if_result,
)


def wait_for_healthy(base_url: str):
    @retry(
        stop=stop_after_attempt(20),
        wait=wait_fixed(60),
        retry=(
            retry_if_exception_type(requests.RequestException)
            | retry_if_result(lambda x: x.status_code != 200)
        ),
    )
    def _check_health():
        return requests.get(f"{base_url}/health", timeout=5)

    return _check_health()
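
For instance, the UI startup might call it along these lines before constructing the Gradio interface. This is a sketch only; the exact startup flow and health-check path in ui.py may differ.

Python
import os

# Block until MAX Serve answers its health check, then proceed to build the UI.
base_url = os.environ.get("BASE_URL", "http://localhost:8000/v1")
wait_for_healthy(base_url)  # retries GET {base_url}/health until it returns 200
print(f"MAX Serve is live at {base_url}")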

Explaining docker-compose.yml

The content of docker-compose.yml is as follows:

docker-compose.yml
services:
  ui:
    container_name: llama3-chat-ui
    build:
      context: .
      dockerfile: Dockerfile.ui
    ports:
      - "7860:7860"
    depends_on:
      - server
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - BASE_URL=http://server:8000/v1
      - MAX_CONTEXT_WINDOW=${MAX_CONTEXT_WINDOW:-4096}
      - CONCURRENCY_LIMIT=${MAX_CACHE_BATCH_SIZE:-1}
      - SYSTEM_PROMPT="You are a helpful AI assistant."
      - API_KEY=${API_KEY:-local}
  server:
    image: modular/max-openai-api:24.6.0
    container_name: llama3-chat-server
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
      - HF_HUB_ENABLE_HF_TRANSFER=1
    volumes:
      - $HOME/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ipc: host
    command: "\
      --huggingface-repo-id ${HUGGINGFACE_REPO_ID:-modularai/llama-3.1} \
      --max-length ${MAX_CONTEXT_WINDOW:-4096} \
      --max-cache-batch-size ${MAX_CACHE_BATCH_SIZE:-1}"

The docker-compose.yml orchestrates both the UI and server components:

  1. UI Service:
    • Builds the Gradio interface using Dockerfile.ui.
    • Communicates with the server via the environment variable BASE_URL.
  2. Server Service:
    • Runs the MAX Serve API using the specified image.
    • Uses NVIDIA GPUs for inference, ensuring optimal performance.
    • Shares necessary Hugging Face cache data via mounted volumes.

Dockerfile.ui component

We use the official Magic Docker base image ghcr.io/modular/magic:noble to create our Dockerfile.ui as follows. Multi-platform builds are enabled through the docker-bake.hcl configuration shown after it:

Dockerfile.ui
FROM ghcr.io/modular/magic:noble AS build

RUN apt-get update && apt-get install -y build-essential
WORKDIR /app
COPY pyproject.toml ui.py ./
RUN magic clean
RUN magic install
RUN magic shell-hook > /shell-hook.sh && \
    echo 'exec "$@" 2>&1' >> /shell-hook.sh

FROM ghcr.io/modular/magic:noble AS runtime

COPY --from=build /app /app
COPY --from=build /shell-hook.sh /shell-hook.sh
WORKDIR /app
ENV PYTHONUNBUFFERED=1
ENTRYPOINT ["/bin/bash", "/shell-hook.sh"]
CMD ["magic", "run", "python", "ui.py"]

To define target platforms for a multi-platform build, we include the following in docker-bake.hcl:

docker-bake.hcl
# Define the target platforms
variable "PLATFORMS" {
  default = ["linux/amd64", "linux/arm64"]
}

# Default target group
group "default" {
  targets = ["ui"]
}

# UI service target
target "ui" {
  context    = "."
  dockerfile = "Dockerfile.ui"
  platforms  = "${PLATFORMS}"
  tags       = ["llama3-chat-ui"]
  output     = ["type=docker"]
}

Configuration and customization

Environment variables

  • MAX_CONTEXT_WINDOW: Max tokens for the context window (default: 4096).
  • CONCURRENCY_LIMIT: Must match MAX_CACHE_BATCH_SIZE, which enables continuous batching in MAX Serve for efficient handling of concurrent streaming requests.
  • SYSTEM_PROMPT: Default system prompt for the AI assistant. (A sketch of how the UI might read these variables follows this list.)
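
These variables are injected into the UI container by docker-compose.yml. Below is a minimal sketch of reading them on the Python side, with defaults mirroring the compose file; the actual ui.py may organize this differently.

Python
import os

# Defaults mirror docker-compose.yml; override via environment variables.
BASE_URL = os.environ.get("BASE_URL", "http://localhost:8000/v1")
MAX_CONTEXT_WINDOW = int(os.environ.get("MAX_CONTEXT_WINDOW", "4096"))
CONCURRENCY_LIMIT = int(os.environ.get("CONCURRENCY_LIMIT", "1"))
SYSTEM_PROMPT = {
    "role": "system",
    "content": os.environ.get("SYSTEM_PROMPT", "You are a helpful AI assistant."),
}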

Dependency management

All dependencies are managed via Magic and defined in pyproject.toml to ensure consistency across environments:

pyproject.toml
[project]
dependencies = [
    "gradio>=5.8.0,<6",
    "fastapi>=0.115.6,<0.116",
    "requests>=2.32.3,<3",
    "openai>=1.57.3,<2",
    "tenacity>=9.0.0,<10",
    "transformers>=4.47.0,<5",
    "click>=8.1.7,<9"
]
description = "Chat with llama3 MAX Serve on GPU"
name = "llama3-chat"
requires-python = ">= 3.9,<3.13"
version = "0.0.0"

[build-system]
build-backend = "hatchling.build"
requires = ["hatchling"]

[tool.hatch.build.targets.wheel]
packages = ["."]

[tool.pixi.project]
channels = ["conda-forge", "https://conda.modular.com/max"]
platforms = ["linux-64", "osx-arm64", "linux-aarch64"]

[tool.pixi.pypi-dependencies]
llama3_chat = { path = ".", editable = true }

Performance considerations

When deploying your chat application, consider these key factors:

  1. Context window size
  • Default: 4096 tokens (set via MAX Serve --max-length).
  • Larger windows increase memory usage but maintain more conversation context.
  • Recommended: Start with 4096 and adjust based on your use case.
  2. Continuous batching
  • MAX_CACHE_BATCH_SIZE controls concurrent request handling via continuous batching (set via MAX Serve --max-cache-batch-size).
  • Higher values increase throughput but may impact latency; see the sketch after this list for a quick way to test this.
  • Recommended: Start with 1 and increase based on your GPU capacity. MAX Serve also prints a recommendation for the optimal size at startup.
  3. Memory management
  • Monitor GPU memory usage with nvidia-smi.
  • Consider implementing additional caching for frequent responses.
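
To get a feel for how MAX_CACHE_BATCH_SIZE affects throughput and latency, you can fire a few concurrent requests at the server and compare per-request latencies. The sketch below uses the async OpenAI client against the local quick-start endpoint; the model name, API key, and prompts are placeholders.

Python
import asyncio
import time

from openai import AsyncOpenAI

# Assumes the quick-start server on localhost:8000 and the placeholder API key.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="modularai/llama-3.1",
        messages=[{"role": "user", "content": f"Share GPU fun fact #{i}."}],
        max_tokens=64,
    )
    return time.perf_counter() - start

async def main(n: int = 4) -> None:
    # Launch n requests concurrently; continuous batching decides how they are scheduled.
    latencies = await asyncio.gather(*(one_request(i) for i in range(n)))
    for i, latency in enumerate(latencies):
        print(f"request {i}: {latency:.2f}s")

asyncio.run(main())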

Explore the available configuration options by running:

Bash
git clone https://github.com/modularml/max
cd max/pipelines/python
magic run serve --help

On the serving side, make sure to check out the benchmarking blog too.

Clean up (Optional)

To stop and clean up all resources:

Bash
docker compose down

Remove all images related to the project:

Bash
docker rmi $(docker images -q modular/max-openai-api:24.6.0)
docker rmi $(docker images -q llama3-chat-ui)

Remove the Docker builder (if no longer needed):

Bash
docker buildx rm mybuilder

Conclusion

In this tutorial, we've built a functional chat application using Llama 3 and MAX 24.6. We've explored:

  • Basic setup: Using Docker and NVIDIA GPU support to create a working environment
  • Architecture overview: Creating a three-layer system with a Gradio frontend, MAX Serve API, and Llama 3 model backend
  • Token management: Implementing rolling context windows to maintain conversation history
  • Performance basics: Understanding batch processing and concurrent request handling
  • Simple deployment: Using Docker Compose to run the application
  • Configuration options: Managing environment variables and dependencies

This demo shows how MAX's GPU-optimized serving stack can be combined with Llama 3 to create interactive chat applications. While this implementation focuses on the basics, it provides a foundation that you can build upon for your own projects.

Next Steps

Deploy Llama 3 on GPU with MAX Serve to AWS, GCP or Azure or on Kubernetes.

Explore MAX's documentation for additional features.

Join our Modular Forum and Discord community to share your experiences and get support.

We're excited to see what you'll build with Llama 3 and MAX! Share your projects and experiences with us using #ModularAI on social media.

Ehsan M. Kermani, AI DevRel


Ehsan is a Seasoned Machine Learning Engineer with a decade of experience and a rich background in Mathematics and Computer Science. His expertise lies in the development of cutting-edge Machine Learning and Deep Learning systems ranging from Natural Language Processing, Computer Vision, Generative AI and LLMs, Time Series Forecasting and Anomaly Detection while ensuring proper MLOps practices are in-place. Beyond his technical skills, he is very passionate about demystifying complex concepts by creating high-quality and engaging content. His goal is to empower and inspire the developer community through clear, accessible communication and innovative problem-solving. Ehsan lives in Vancouver, Canada.