Imagine unlocking a world of open innovation while ensuring secure, reliable, and enterprise-ready Gen AI deployments—MAX 24.6 enables enterprise AI teams to seamlessly run a vast range of cutting-edge AI models from Hugging Face on NVIDIA GPUs.
MAX 24.6 introduces a set of high-performance models built with the MAX Graph API. These MAX graphs are optimized implementations of leading model architectures, including Llama (via LlamaForCausalLM compatibility). Among the 20,000+ Llama variants and “finetunes” available on Hugging Face, Llama Guard stands out as a crucial tool for enterprises. Models like Llama Guard empower organizations to enforce critical guardrails around AI content, addressing safety, compliance, and ethical considerations.
In this post, we’ll show you how to rapidly evaluate multiple models from Hugging Face with MAX, using Llama Guard and other prompt-safety models as the working example. You’ll learn how to compare the performance of Meta’s Llama Guard and IBM’s Granite Guardian using Surge AI’s Toxicity dataset, exploring how MAX accelerates model evaluation. Whether your focus is compliance, brand protection, or creating a safe user experience, MAX provides the tools to enhance your AI strategy.
🛠️ Just want the code? A version of the full code for this post is available on GitHub.

About the models

Just as llamas serve as natural guardians in the wild, protecting sheep and goats from predators with their vigilant nature, Meta's Llama Guard plays a similar protective role in the AI landscape. This specialized model builds on the capabilities of Meta's Llama family, screening content across multiple languages and use cases—from user queries to AI-generated content. Like its animal namesake, it spots and categorizes potential threats with remarkable accuracy.
Using the code we provide here, and thanks to the flexibility of MAX, you can directly compare Llama Guard against a number of other models, helping you determine which model is the best fit for your needs. One such alternative is IBM Granite Guardian: it draws on IBM’s comprehensive AI Risk Atlas to create a robust safety system. Built through rigorous testing and diverse data training, including carefully constructed challenge scenarios, it offers the kind of reliability you might expect from IBM’s rich history of data security.
About the dataset

To evaluate Llama Guard’s ability to categorize content as safe or unsafe, we’ll use Surge AI’s Toxicity dataset. For our evaluation, we’ll download the subset of this dataset that’s freely available on GitHub. The set features 1,000 social media posts and comments that human reviewers labeled as toxic or non-toxic.
Set up Hugging Face access

For our work here, we’ll leverage MAX’s ability to run any PyTorch LLM from Hugging Face. Before we can begin, you must obtain an access token from Hugging Face to download models hosted there. Follow the instructions in the Hugging Face documentation to create one.
Additionally, Meta gates access to its Llama family of models on Hugging Face. Visit the Llama Guard model’s page and submit the request form; approval is usually granted within a few minutes. Access to IBM’s Granite family of models is open and does not require an approval step.
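Once your token is created and your access request is approved, you can optionally verify that the token works before starting the server. Here’s a small sketch using the huggingface_hub library, assuming the token is available to it (for example, via huggingface-cli login or the HUGGING_FACE_HUB_TOKEN environment variable):
Python
from huggingface_hub import whoami

# Prints the account associated with your token; raises an error if the token is missing or invalid.
print(whoami()["name"])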
Serve the model

Using MAX Serve is easy: we support Docker in the cloud and provide our Magic CLI for local development. If you’ve worked with other AI stacks before, you’ll notice there’s no hassling with CUDA here—MAX works at the hardware level, requiring only NVIDIA’s GPU driver. This enables you to switch between models and environments with ease.
Note: MAX supports the following NVIDIA GPUs: A100 (most optimized), A10G, L4 and L40.
Already have a GPU in the cloud?

The best way to experience model serving with MAX is on an NVIDIA GPU-equipped cloud instance. If you already have one that you can access via SSH, simply run the following commands at its terminal.
Bash
export HUGGING_FACE_HUB_TOKEN=<YOUR ACCESS TOKEN HERE>
export HUGGING_FACE_REPO_ID=meta-llama/Llama-Guard-3-8B
docker run \
--env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
--gpus 1 \
-p 8000:8000 \
--ipc=host \
modular/max-openai-api:24.6.0 \
--huggingface-repo-id ${HUGGING_FACE_REPO_ID}
Need to deploy a GPU in the cloud?

If you don’t have a GPU-equipped cloud instance, but you can provision one, we have a tutorial you can follow to get MAX running with Docker on AWS, Google Cloud, or Microsoft Azure.
Note: You must swap the model name used in the documentation with meta-llama/Llama-Guard-3-8B or ibm-granite/granite-guardian-3.0-2b.
Can’t access a GPU in the cloud?

MAX has your back! The local-to-cloud developer experience is something we care deeply about here at Modular. Simply follow the local setup section of this tutorial to run MAX locally on your laptop—just note that performance will be much slower locally than on a GPU-equipped cloud instance.
Evaluate the models

With our MAX Serve endpoint running, we’re ready to dive in and evaluate Llama Guard. We’ll use Python to download the free subset of Surge AI’s Toxicity dataset, then run each example in the set through the model for safety classification. Finally, we’ll calculate some standard metrics for measuring deviations between the model’s predictions and the human-labeled sample data.
Define models to evaluate

In the following code, we define each model to evaluate using Python's enum library. This Model class includes the Hugging Face repository ID and the keyword the model responds with to indicate unsafe content.
Python
from enum import Enum

class Model(Enum):
    LLAMA_GUARD = ("meta-llama/Llama-Guard-3-8B", "Unsafe")
    GRANITE_GUARD = ("ibm-granite/granite-guardian-3.0-2b", "Yes")

    def __init__(self, huggingface_repo_id, unsafe_classification_keyword):
        self.huggingface_repo_id = huggingface_repo_id
        self.unsafe_classification_keyword = unsafe_classification_keyword

selected_model = Model.LLAMA_GUARD
Above, we define our two models: LLAMA_GUARD, which uses the keyword "Unsafe," and GRANITE_GUARD, which uses the keyword "Yes." We also set selected_model to LLAMA_GUARD to use that model for our evaluation.
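Later, if you want to point the evaluation at Granite Guardian instead, the only client-side change is the selection below; just remember the served model has to match, as described in the serving section above. A small sketch:
Python
# Switch the model under evaluation; its unsafe keyword comes along with it.
selected_model = Model.GRANITE_GUARD
print(selected_model.huggingface_repo_id)            # ibm-granite/granite-guardian-3.0-2b
print(selected_model.unsafe_classification_keyword)  # Yes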
Configure API client

Next, we’ll use the OpenAI Python library to communicate with our MAX endpoint. You read that right—MAX Serve provides an OpenAI-compatible API endpoint!
Python
from openai import OpenAI

client = OpenAI(
    api_key="123",  # Use any value here; can't be blank or absent
    base_url="http://0.0.0.0:8000/v1",  # Replace 0.0.0.0 with your MAX deployment's URL
)
Let’s break down the code above:
- api_key: Use any value here; it just can’t be blank or absent.
- base_url: Replace 0.0.0.0 with your MAX Serve deployment’s URL.

Chat completions in MAX work just like they do with OpenAI. Let’s define a function to call the MAX endpoint:
Python
def get_llm_prediction(input):
    try:
        response = client.chat.completions.create(
            model=selected_model.huggingface_repo_id,
            messages=[{"role": "user", "content": input}]
        )
        content = response.choices[0].message.content
        return content
    except Exception as _:
        return None
As you can see above, MAX is a drop-in replacement for OpenAI.
Note: MAX Serve is in preview. The OpenAI chat completion API works, but you may find rough edges. Additional capabilities like function calling are not yet available.
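Before kicking off the full evaluation, it’s worth a quick sanity check against the endpoint. Here’s a minimal sketch; the exact wording of the response depends on which guard model you’re serving:
Python
# One benign and one clearly hostile prompt; the guard model should flag only the second.
print(get_llm_prediction("Have a great day!"))
print(get_llm_prediction("I'm going to hurt you."))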
Download the dataset

Next, let’s download the free subset of Surge AI’s Toxicity dataset:
Python
import requests
import pandas as pd
from io import StringIO

def download_dataset():
    url = "https://raw.githubusercontent.com/surge-ai/toxicity/refs/heads/main/toxicity_en.csv"
    try:
        response = requests.get(url)
        response.raise_for_status()
        df = pd.read_csv(StringIO(response.text))
        df['is_toxic'] = df['is_toxic'].apply(lambda x: 1 if x == 'Toxic' else 0)
        return df
    except Exception as e:
        print("Problem downloading dataset:", e)
        return None
In the code above, we define a function to retrieve the dataset in CSV format. We use the requests library to make an HTTP GET request to the URL and check for any errors. If successful, we read the CSV data into a pandas DataFrame, mapping the is_toxic column values from 'Toxic' to 1 and everything else to 0.
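If you’d like to inspect the data before running the evaluation, here’s a quick look at the DataFrame, assuming the download succeeds:
Python
df = download_dataset()
if df is not None:
    print(len(df))                        # number of examples in the free subset
    print(df["is_toxic"].value_counts())  # label distribution: 1 = toxic, 0 = non-toxic
    print(df["text"].iloc[0])             # a sample post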
Run the evaluation

Now we’re ready to run our evaluation! We need to define a function that takes the dataset as input, then sends each example in the set to Llama Guard. Since it will take some time to process all 1,000 examples in the dataset, we’ll use the rich library to display a progress bar.
Python
import time
from rich.progress import Progress

def evaluate_llm(dataset, keyword="unsafe"):
    results = pd.DataFrame(columns=[
        "content",
        "y_true",
        "y_pred"
    ])
    size = len(dataset)
    start_time = time.time()
    with Progress() as progress:
        task = progress.add_task(
            f"[cyan]💬 Evaluating {selected_model.huggingface_repo_id}[/cyan]",
            total=size
        )
        for i, row in dataset.iterrows():
            content = row['text']
            y_true = row['is_toxic']
            response = get_llm_prediction(content)
            if response:
                y_pred = 1 if keyword.lower() in response.lower() else 0
                results.loc[i] = [content, y_true, y_pred]
            else:
                size -= 1
            progress.update(task, advance=1)
        progress.update(
            task,
            description=f"[green]✅ Evaluated {selected_model.huggingface_repo_id} [/green]"
        )
    elapsed_time = time.time() - start_time
    elapsed_time_formatted = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
    print(f"Time Elapsed: {elapsed_time_formatted}")
    print("Sample Size:", size)
    return results
There’s quite a bit going on here, so let’s break it down:
- We initialize a DataFrame to store the results.
- We use the variable size to keep track of how many examples the model successfully processes.
- We keep track of the start_time using the time library.
- We track the evaluation progress using Progress from the rich library.
- For each prediction:
  - If the response is valid, we update the results set.
  - If the model was unable to respond to a given example, we reduce our size count by one.
- After processing the dataset, we calculate and print the elapsed time and sample size.
- Lastly, we return the results for further analysis.

Calculating metrics

Finally, we can calculate some metrics to help us understand how closely the model aligns with the sample data. We’ll use the scikit-learn library here, as it’s an incredibly useful tool for this type of analysis.
Python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_metrics(results):
    accuracy = accuracy_score(results["y_true"], results["y_pred"])
    precision, recall, f1, _ = precision_recall_fscore_support(
        results["y_true"],
        results["y_pred"],
        average="binary"
    )
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 Score: {f1:.2f}")
In the code above, we compute some standard binary classification metrics comparing Llama Guard’s predictions against human-labeled ground truths from the Toxicity dataset:
- Accuracy: How often do the model’s predictions match the ground-truth labels?
- Precision: Of all the model's "unsafe" predictions, how many did humans label as toxic?
- Recall: Of all the human-labeled toxic items, how many did the model identify as unsafe?
- F1 Score: The harmonic mean of precision and recall, a single score that balances both metrics. The harmonic mean gives more weight to the smaller value, which is why it’s a useful way to combine precision and recall.

Run the code

A version of the full code for this post is available on GitHub. You can clone the repo and run the script as follows. Alternatively, you can copy-paste the code blocks above and run them in a Jupyter notebook—just be sure to manually install the necessary library dependencies if you choose to use Jupyter.
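If you go the notebook route, you’ll also need a small driver to tie the functions together. A minimal sketch using the code above:
Python
# Download the dataset, classify every example, and report the metrics.
df = download_dataset()
if df is not None:
    results = evaluate_llm(df, keyword=selected_model.unsafe_classification_keyword)
    calculate_metrics(results)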
We'll use the Magic CLI to create a development environment on your local workstation and install the required packages.
Don't have the Magic CLI yet? Run this command in your terminal and follow the instructions:
Bash
curl -ssL https://magic.modular.com/ | bash
With Magic installed, run the following commands in the terminal of your local workstation:
Bash
git clone https://github.com/modularml/devrel-extras.git
cd devrel-extras/blogs/max-guardrails-eval
magic run help
The last command will output:
Bash
Usage: python -m guardrails_eval [OPTIONS]

Options:
  -s, --server TEXT           URL from "Uvicorn running on <URL>"
  -m, --model TEXT            Model to evaluate
  -k, --keyword TEXT          Response model gives when content is unsafe
  -n, --num_examples INTEGER  Number of examples to evaluate
  --help                      Show this message and exit.
The script provides default options for running the evaluation with Llama Guard, but you still must provide your MAX Serve endpoint URL like this:
Bash
magic run eval --server "<YOUR MAX URL HERE>"
The script takes some time to run. When it finishes, it will output something like this:
Bash
✅ Evaluated meta-llama/Llama-Guard-3-8B
Time Elapsed: 00:02:52
Sample Size: 995
Accuracy: 0.63
Precision: 0.96
Recall: 0.27
F1 Score: 0.42
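As a quick check on the harmonic-mean point from earlier, you can recompute the F1 score from the reported precision and recall:
Python
precision, recall = 0.96, 0.27
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2f}")  # 0.42; the low recall drags the combined score down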
To run the script against IBM Granite Guardian, stop your MAX Serve endpoint, start it again with the --huggingface-repo-id ibm-granite/granite-guardian-3.0-2b flag, and then use the following command:
Bash
magic run eval \
  --server "<YOUR MAX URL HERE>" \
  --model "ibm-granite/granite-guardian-3.0-2b" \
  --keyword "Yes"
This command will output results in the same format as Llama Guard, but you’ll notice the numbers differ quite a bit. What does this tell us? For one, Granite Guardian’s definition of unsafe content more closely aligns with the subset of Surge AI’s Toxicity dataset we evaluated against.
Next steps

Through this practical exploration of Llama Guard, we've demonstrated how you can rapidly evaluate and deploy robust AI safety solutions with MAX. Its high-performance architecture and broad support for open models provide a production-ready foundation for responsible AI governance. We hope the evaluation framework we offer here provides a blueprint for you and your teams to assess and implement AI guardrails, enabling you to confidently innovate while maintaining necessary safeguards.
Learn more about the MAX 24.6 GPU Preview release and its SOTA performance, and join the Modular community to share your feedback and experiences.
Until next time! 🔥