With MAX 24.6, we introduced MAX Serve: our cutting-edge LLM serving solution, delivering state-of-the-art performance on NVIDIA A100 GPUs. While still in its early days, MAX Serve already offers a combination of capabilities that unlock value for AI Engineers—especially those looking to build using Retrieval-Augmented Generation (RAG), tool use, and AI safety.
Out of the box, MAX Serve:
- Runs the same code on your laptop or an NVIDIA-equipped server, with zero configuration required
- Downloads and serves any PyTorch LLM from Hugging Face, with special acceleration for LlamaForCausalLM-compatible models
- Provides an OpenAI-compatible chat completion endpoint, making it a drop-in replacement for other solutions

Enabled by Modular's groundbreaking work in AI compilers and infrastructure, MAX provides this feature set from a single command at your terminal. In this post, you'll experience how quickly MAX and Open WebUI get you up-and-running with RAG, web search, and Llama 3.1 on GPU.
Spoiler alert: MAX Serve is great as the force behind Open WebUI.

About Open WebUI

Building on the solid foundation MAX provides, adding a robust user interface is a natural next step in creating a full-stack web app. At Modular, we often use Open WebUI in our own work, as it seamlessly integrates with our technology stack. This powerful platform offers a familiar chat-driven interface for interacting with open-source AI models.
Like MAX, Open WebUI empowers users to maintain complete ownership of their AI infrastructure, avoiding vendor lock-in and enhancing privacy. By combining MAX with Open WebUI, you gain instant access to a streamlined development environment—you'll spend less time troubleshooting your CUDA configuration and more time building.
About RAG and web search

As we discussed in our previous post about RAG:
Training a large language model with new knowledge is not feasible for most people—it’s time-consuming and prohibitively expensive. To overcome these limits, we can provide new knowledge to the model by retrieving specific, relevant information from external sources. With RAG, the content source is often a vector database containing proprietary documents. Meanwhile, with web search, the source is an API capable of searching the entire web, like those from DuckDuckGo and Google. Regardless of the tool, the system searches for documents, then essentially copy-pastes the results into the LLM’s context window. This grounds the model’s response in the information the documents contain.
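To make this concrete, here's a minimal sketch of what "pasting retrieved text into the context window" looks like against an OpenAI-compatible endpoint such as MAX Serve. The document snippet and the question are made up for illustration, and the request assumes a MAX Serve instance listening locally on port 8000 with the modularai/llama-3.1 model (we set one up later in this post):
Bash
# A hypothetical retrieved chunk; in a real RAG pipeline this text would come
# from a vector database (or a web search API) at query time. It is simply
# pasted into the prompt alongside the user's question.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "messages": [
      {"role": "system", "content": "Answer using only this context: Acme Corp allows product returns within 45 days of purchase."},
      {"role": "user", "content": "How long do I have to return an item?"}
    ]
  }'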
Before you begin

Install Magic

Magic is Modular's CLI, and you'll need it to follow along with this post. To install it, run this command at your terminal and follow the instructions:
Bash
curl -ssL https://magic.modular.com/ | bash
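Once the installer finishes (you may need to restart your shell so the magic command lands on your PATH), you can confirm the installation with a quick version check; this assumes the standard --version flag is available:
Bash
magic --version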
Install Docker

In this post, we’ll use Docker to run the Open WebUI container. Follow the instructions in the Docker documentation if you need to install it.
Set up Hugging Face access

For our work here, we’ll leverage MAX’s ability to run any PyTorch LLM from Hugging Face. Before we can begin, you must obtain an access token from Hugging Face to download models hosted there. Follow the instructions in the Hugging Face documentation to obtain one.
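If you plan to run locally (Option A below), a simple way to make the token available to MAX is to export it in the terminal session where you'll start the server; the cloud option stores it in a .env file instead. The placeholder is yours to fill in:
Bash
export HUGGING_FACE_HUB_TOKEN=<YOUR TOKEN HERE>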
Start MAX and Open WebUI

At this point, you have a choice: run locally on your laptop, or in the cloud on a GPU-equipped instance.
Option A: Run locally

The local-to-cloud developer experience is something we care deeply about here at Modular. Simply follow our getting started guide to run MAX on your laptop.
To start MAX Serve, customize the guide's magic run serve command with a sequence length long enough to support the RAG workload:
Bash
magic run serve --huggingface-repo-id modularai/llama-3.1 --max-length 16384
MAX Serve is ready once you see a log message containing:
Bash
Uvicorn running on http://...
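Before wiring up the UI, you can optionally sanity-check the endpoint with a quick request from another terminal. This is just a sketch; it assumes the default port 8000 and the model name used above:
Bash
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'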
Leave the terminal window open and open a second terminal window, then run the following Docker command:
Bash
docker run \
-v open-webui:/app/backend/data \
-e "WEBUI_AUTH=false" \
-e "OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1" \
-e "OPENAI_API_KEY=123" \
-p 8080:8080 \
ghcr.io/open-webui/open-webui:main
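Note: the command above reaches MAX Serve through host.docker.internal, which Docker Desktop (macOS and Windows) provides automatically. If you're running Docker Engine on a Linux laptop and that hostname doesn't resolve inside the container, one option is to map it yourself with Docker's --add-host flag, for example:
Bash
docker run \
  --add-host host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e "WEBUI_AUTH=false" \
  -e "OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1" \
  -e "OPENAI_API_KEY=123" \
  -p 8080:8080 \
  ghcr.io/open-webui/open-webui:main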
Open WebUI is ready once you see a log message containing:
Bash
Uvicorn running on http://...
Leave both terminal windows open and visit http://localhost:8080/ in your web browser to access Open WebUI.
Note: performance will be much slower locally than on a GPU-equipped cloud instance.
Option B: Run in the cloud

This option requires SSH access to an NVIDIA GPU-equipped cloud instance running Linux.
Note: If you don’t have a cloud GPU instance, follow our tutorial to create one on AWS, Azure, or GCP. Once your instance is up, SSH into it and stop the MAX container; we’ll start another one in the following steps.
To get going, SSH into your cloud instance, then run the following commands to start a new Magic project and change into its directory:
Bash
magic init max-open-webui --format pyproject
cd max-open-webui
First, we will store our Hugging Face access token in a .env file. Such files are a best practice for storing environment-specific configuration variables and sensitive data, such as API keys and other application settings. Create a new file in your favorite code editor, placing it in the max-open-webui directory with the name .env, and add your token like so:
Bash
HUGGING_FACE_HUB_TOKEN=<YOUR TOKEN HERE>
Next, create a new file in your code editor, placing it in the max-open-webui directory with the name docker-compose.yml, and paste in the following contents:
YAML
services:
  max-openai-api:
    image: modular/max-openai-api:24.6.0
    container_name: max-openai-api
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      HUGGING_FACE_HUB_TOKEN: "${HUGGING_FACE_HUB_TOKEN}"
      HF_HUB_ENABLE_HF_TRANSFER: "1"
    command: [
      "--huggingface-repo-id", "modularai/llama-3.1",
      "--max-length", "16384"
    ]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    environment:
      WEBUI_AUTH: "False"
      OPENAI_API_BASE_URL: "http://max-openai-api:8000/v1"
      OPENAI_API_KEY: "123"
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "8080:8080"
    depends_on:
      - max-openai-api
volumes:
  open-webui:
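With the .env and docker-compose.yml files in place, you can optionally ask Docker Compose to render the resolved configuration; this is a quick way to confirm your Hugging Face token is being interpolated from the .env file (note that it prints the token in plain text):
Bash
docker compose -f docker-compose.yml config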
This is a typical docker-compose configuration for an app with a backend service (MAX) and web UI (Open WebUI). Let’s briefly walk through what the configuration does:

- We start by defining two containers: max-openai-api and open-webui.
- For the max-openai-api container:
  - We set deploy.resources.reservations.devices.count to 1. This indicates that we require a GPU.
  - We pass the necessary settings for Hugging Face into the environment, including getting the token from our .env file.
  - The command specifies the model we want to use—weights for Llama 3.1 that Modular makes available for convenience—and a max-length that ensures a large enough context window for RAG and web search.
  - The line containing ~/.cache/huggingface:/root/.cache/huggingface syncs downloads from Hugging Face between the instance’s persistent storage and the container.
- For the open-webui container:
  - We set several environment settings:
    - WEBUI_AUTH: Open WebUI supports multiple user accounts. Setting this value to false causes Open WebUI to enter single-user mode, which is convenient to get up and running quickly for development.
    - OPENAI_API_BASE_URL: This is actually your MAX Serve endpoint, which matches the chat completion API that OpenAI uses.
    - OPENAI_API_KEY: Use any value here. MAX Serve does not need an API key, but the OpenAI library requires one.
  - The depends_on: max-openai-api means the open-webui container will start after the max-openai-api container.
- The last section, volumes.open-webui, is simply a stub that tells Docker to create persistent on-disk storage for the open-webui container. (The colon at the end might look odd if you’re new to docker-compose; rest assured it's intentional.)

Next, we’ll add some tasks to our Magic project as shortcuts for a few Docker commands. Open the pyproject.toml file and replace the [tool.pixi.tasks] section with the following:
TOML
[tool.pixi.tasks]
stop = "docker compose -f docker-compose.yml down"
start = { cmd = "docker compose -f docker-compose.yml up -dV", depends-on = ["stop"] }
logs = "docker compose logs -f"
Let’s dig into each task:
- stop: stops the containers and removes their entries from the Docker daemon
- start: first runs the stop task, and then starts the containers
- logs: streams the log output of the containers

Finally, we’re ready to start our app! Simply run these commands:
Bash
magic run start
magic run logs
The first command above will download and run the containers. The second command will stream logs for each container. The app is ready once you see the following message from both containers:
Bash
Uvicorn running on http://...
It’s safe to press CTRL+C here to exit the logs task; the containers will not stop until you run: magic run stop
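If you want to confirm that both containers are still up after exiting the logs, Docker Compose can list them along with their published ports:
Bash
docker compose -f docker-compose.yml ps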
Visit http://<YOUR_CLOUD_IP>:8080/ in your web browser to access Open WebUI.
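If your cloud firewall doesn't expose port 8080 publicly, an SSH tunnel from your laptop is a workable alternative; this is a sketch, and the username and host are placeholders to replace with your own:
Bash
ssh -L 8080:localhost:8080 <YOUR_USER>@<YOUR_CLOUD_IP>
With the tunnel open, Open WebUI is available at http://localhost:8080/ on your laptop.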
Configure Open WebUI

Before we can begin chatting, there are just a few features to set up.
Configure Connection to MAX

First, we need to manually provide our model name to Open WebUI. (Automatic model discovery is coming to MAX; manually specifying the model name will soon be unnecessary.) Access the Open WebUI admin panel by clicking the 👤User button in the bottom left corner of the page, then navigate to Admin Panel > Settings > Connections.
Turn off the Ollama API, then add modularai/llama-3.1 as a Model ID on the MAX Serve connection under OpenAI API. Your settings should look like this (your URL may differ from what is shown):
Edit Connection dialog for “OpenAI API Connections” containing modularai/llama-3.1 as a Model ID

Enable web search

To enable web search, navigate to Admin Panel > Settings > Web Search, turn on the Enable Web Search switch, then choose DuckDuckGo as the search engine. Using DuckDuckGo is free and does not require an API key.
Configure RAG

To perform RAG, we’ll use the knowledge and custom models features of Open WebUI. Knowledge is built upon Chroma, the same vector database that features in our previous post about RAG. Custom models are how we can augment our MAX model with tools and knowledge.
First, we’ll add some knowledge. Navigate to Workspace > Knowledge and click the + button to add a knowledge base. Provide a name and optionally a description. After creating the knowledge base, click the + button within it and either upload some files, or use the built-in editor to write / paste-in some text documents. Here’s an example of what you should have:
Knowledge base with multiple text documents; each shown in its own window

Next, we’ll add a custom model. Navigate to Workspace > Models and click the + button to add a new model. Provide a name and choose modularai/llama-3.1 as the base model. Under knowledge, choose your knowledge base. Optionally, give your model an avatar image. Finally, scroll to the bottom and choose Save & Create. Your custom model should look something like this:
Custom model configured with knowledge about the author’s cats

Use Open WebUI to chat with Llama 3.1

Now we’re ready to use Open WebUI with MAX!
RAG

Choose New Chat from the Open WebUI sidebar, then select your custom model from the model selector at the top of the chat surface. Ask it a question, and observe how it can augment the LLM with the knowledge you provide, like so:
Chatting with the custom model which accesses proprietary information kept in the knowledge base

As you can see above, the LLM is able to correctly answer a question about something not widely known. It even provides appropriate citations. Out of the box, MAX supports working with RAG pipelines like this.
Web search

To use the web search feature of Open WebUI, start a new chat, select modularai/llama-3.1 as the model, click the + button in the message composer, and turn on Web Search. Then try asking a question about a current event, such as: What are some highlights from CES 2025?
Chatting with a model that can use web search to gather up-to-date information

Web search can be an incredibly powerful tool for LLMs, and as you can see, MAX works out of the box with Open WebUI to support it.
Next Steps

In this post, we’ve only scratched the surface of what’s possible by combining MAX with Open WebUI. MAX Serve enables you to access an OpenAI-compatible endpoint for any PyTorch LLM from Hugging Face in minutes. Open WebUI provides a slick, feature-rich user interface with all the tools you expect, and more.
We encourage you to dig in and try more of Open WebUI with MAX, like its Tools feature (Workspace > Tools) to execute arbitrary Python functions. If you really want to go deeper, have a look at the Pipelines project from Open WebUI, which lets you “extend functionalities, integrate unique logic, and create dynamic workflows with just a few lines of code.”
We at Modular are proud to be part of the open source AI community—expect to hear more on this topic throughout the year. Join our new Modular Forum and Subreddit to share your experiences and connect with the Modular community!