Imagine unlocking a world of open innovation while ensuring secure, reliable, and enterprise-ready Gen AI deployments—MAX 24.6 enables enterprise AI teams to seamlessly run a vast range of cutting-edge AI models from Hugging Face on NVIDIA GPUs.
MAX 24.6 introduces a set of high-performance models built with the MAX Graph API. These MAX graphs are precision implementations of leading model architectures, including Llama (via `LlamaForCausalLM` compatibility). Among the 20,000+ Llama variants and "finetunes" available on Hugging Face, Llama Guard stands out as a crucial tool for enterprises. Models like Llama Guard empower organizations to enforce critical guardrails around AI content, addressing safety, compliance, and ethical considerations.
In this post, we’ll show you how to rapidly evaluate multiple models from Hugging Face with MAX, within the context of using Llama Guard and other prompt safety models. You’ll learn how to compare the performance of Meta’s Llama Guard and IBM’s Granite Guardian by using Surge AI’s Toxicity dataset, exploring how MAX accelerates model evaluation like never before. Whether your focus is compliance, brand protection, or creating a safe user experience, MAX provides the tools to enhance your AI strategy.
🛠️ Just want the code?
A version of the full code for this post is available on GitHub.
About the models
Just as llamas serve as natural guardians in the wild, protecting sheep and goats from predators with their vigilant nature, Meta's Llama Guard plays a similar protective role in the AI landscape. This specialized model builds on the capabilities of Meta's Llama family, screening content across multiple languages and use cases—from user queries to AI-generated content. Like its animal namesake, it spots and categorizes potential threats with remarkable accuracy.
Using the code we provide here, and thanks to the flexibility of MAX, you can directly compare Llama Guard against a number of other models, helping you determine which model is the best fit for your needs. One such alternative is IBM Granite Guardian: it draws on IBM’s comprehensive AI Risk Atlas to create a robust safety system. Built through rigorous testing and diverse data training, including carefully constructed challenge scenarios, it offers the kind of reliability you might expect from IBM’s rich history of data security.
About the dataset
To evaluate Llama Guard’s ability to categorize content as safe or unsafe, we’ll use Surge AI’s Toxicity dataset. For our evaluation, we’ll download the subset of this dataset that’s freely available on GitHub. The set features 1000 social media posts and comments that human reviewers chose to label as toxic or non-toxic.
Set up Hugging Face access
For our work here, we’ll leverage MAX’s ability to run any PyTorch LLM from Hugging Face. Before we can begin, you must obtain an access token from Hugging Face to download models hosted there. Follow the instructions in the Hugging Face documentation to obtain one.
Additionally, Meta gates access to its Llama family of models on Hugging Face. Visit the Llama Guard model’s page and submit the request form; approval is usually granted within a few minutes. Access to IBM’s Granite family of models is open and does not require an approval step.
Serve the model
Using MAX Serve is easy: we support Docker in the cloud and provide our Magic CLI for local development. If you’ve worked with other AI stacks before, you’ll notice there’s no hassling with CUDA here—MAX works at the hardware level, requiring only NVIDIA’s GPU driver. This enables you to switch between models and environments with ease.
Note: MAX supports the following NVIDIA GPUs: A100 (most optimized), A10G, L4 and L40.
Already have a GPU in the cloud?
The best way to experience model serving with MAX is on an NVIDIA GPU-equipped cloud instance. If you already have one that you can access via SSH, simply run the following commands at its terminal.
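As a rough sketch, serving Llama Guard on such an instance looks something like the following. The container image name below is a placeholder and the port is an assumption; the exact `docker run` invocation is in the MAX 24.6 documentation. The `--huggingface-repo-id` flag is the one used throughout this post.

```sh
# Sketch only: substitute the MAX Serve container image name from the MAX 24.6 docs.
export HUGGING_FACE_HUB_TOKEN=<your-hugging-face-token>

docker run --gpus=1 \
  -p 8000:8000 \
  --env HUGGING_FACE_HUB_TOKEN \
  <max-serve-image> \
  --huggingface-repo-id meta-llama/Llama-Guard-3-8B
```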
Need to deploy a GPU in the cloud?
If you don’t have a GPU-equipped cloud instance, but you can provision one, we have a tutorial you can follow to get MAX running with Docker on AWS, Google Cloud, or Microsoft Azure.
Note: You must swap the model name used in the documentation with `meta-llama/Llama-Guard-3-8B` or `ibm-granite/granite-guardian-3.0-2b`.
Can’t access a GPU in the cloud?
MAX has your back! The local-to-cloud developer experience is something we care deeply about here at Modular. Simply follow the local setup section of this tutorial to run MAX locally on your laptop—just note that performance will be much slower locally than on a GPU-equipped cloud instance.
Evaluate the models
With our MAX Serve endpoint running, we’re ready to dive in and evaluate Llama Guard. We’ll use Python to download the free subset of Surge AI’s Toxicity dataset, then run each example in the set through the model for safety classification. Finally, we’ll calculate some standard metrics for measuring deviations between the model’s predictions and the human-labeled sample data.
Define models to evaluate
In the following code, we define each model to evaluate using Python's enum library. This Model class includes the Hugging Face repository ID and the keyword the model responds with to indicate unsafe content.
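The exact implementation lives in the GitHub repo, but a minimal sketch along those lines could look like this (the `repo_id` and `unsafe_keyword` attribute names are our own):

```python
from enum import Enum


class Model(Enum):
    """Models to evaluate: Hugging Face repo ID plus the keyword each
    model uses to flag unsafe content."""

    LLAMA_GUARD = ("meta-llama/Llama-Guard-3-8B", "Unsafe")
    GRANITE_GUARD = ("ibm-granite/granite-guardian-3.0-2b", "Yes")

    def __init__(self, repo_id: str, unsafe_keyword: str):
        self.repo_id = repo_id
        self.unsafe_keyword = unsafe_keyword


# Use Llama Guard for this evaluation run.
selected_model = Model.LLAMA_GUARD
```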
Above, we define our two models: `LLAMA_GUARD`, which uses the keyword "Unsafe," and `GRANITE_GUARD`, which uses the keyword "Yes." We also set `selected_model` to `LLAMA_GUARD` to use that model for our evaluation.
Configure API client
Next, we’ll use the OpenAI Python library to communicate with our MAX endpoint. You read that right—MAX Serve provides an OpenAI-compatible API endpoint!
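A minimal sketch of the client setup, assuming MAX Serve is listening on the default port 8000 and exposing the standard `/v1` path:

```python
from openai import OpenAI

# Point the OpenAI client at MAX Serve instead of api.openai.com.
# Replace 0.0.0.0 with the address of your deployment; the API key just
# needs to be a non-empty string.
client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)
```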
Let’s break down the code above:
- `api_key`: Use any value here, it just can't be blank or absent
- `base_url`: Replace `0.0.0.0` with your MAX Serve deployment's URL
Chat completions in MAX work just like they do with OpenAI. Let’s define a function to call the MAX endpoint:
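Here's a sketch of such a helper, using the standard chat completions call from the OpenAI client; the function name and error handling are our own, and the real version is in the GitHub repo:

```python
def get_chat_completion(message: str, model: Model = selected_model) -> str | None:
    """Ask the guard model to classify a single piece of content."""
    try:
        response = client.chat.completions.create(
            model=model.repo_id,
            messages=[{"role": "user", "content": message}],
        )
        return response.choices[0].message.content
    except Exception:
        # Treat any transport or model error as "no prediction" for this example.
        return None
```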
As you can see above, MAX is a drop-in replacement for OpenAI.
Note: MAX Serve is in preview. The OpenAI chat completion API works, but you may find rough edges. Additional capabilities like function calling are not yet available.
Download the dataset
Next, let’s download the free subset of Surge AI’s Toxicity dataset:
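A sketch of the download step follows. The raw CSV URL and the `text`/`is_toxic` column names reflect the public Surge AI repo layout, so double-check them against the code in the accompanying GitHub repo:

```python
from io import StringIO

import pandas as pd
import requests

# Raw CSV location for the free subset of Surge AI's Toxicity dataset.
DATASET_URL = "https://raw.githubusercontent.com/surge-ai/toxicity/main/toxicity_en.csv"


def download_dataset(url: str = DATASET_URL) -> pd.DataFrame:
    """Fetch the CSV and map the human labels to binary ground truth."""
    response = requests.get(url)
    response.raise_for_status()  # surface any HTTP errors
    df = pd.read_csv(StringIO(response.text))
    # 'Toxic' becomes 1; everything else becomes 0.
    df["is_toxic"] = (df["is_toxic"] == "Toxic").astype(int)
    return df
```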
In the code above, we define a function to retrieve the dataset in CSV format. We use the `requests` library to make an HTTP GET request to the URL and check for any errors. If successful, we read the CSV data into a pandas `DataFrame`, mapping values in the `is_toxic` column to 1 for 'Toxic' and 0 otherwise.
Run the evaluation
Now we're ready to run our evaluation! We need to define a function that takes the dataset as input, then sends each example in the set to Llama Guard. Since it will take some time to process all 1,000 examples in the dataset, we'll use the `rich` library to display a progress bar.
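Here's a sketch of that evaluation loop, building on the earlier sketches. The `text` column name and the keyword-matching logic are assumptions; the exact implementation is in the GitHub repo.

```python
import time

from rich.progress import Progress


def evaluate(dataset: pd.DataFrame, model: Model = selected_model) -> pd.DataFrame:
    """Classify every example in the dataset and collect predictions."""
    results = pd.DataFrame(columns=["prediction", "ground_truth"])
    size = len(dataset)  # shrinks if the model fails on an example
    start_time = time.time()

    with Progress() as progress:
        task = progress.add_task("Evaluating...", total=len(dataset))
        for _, row in dataset.iterrows():
            response = get_chat_completion(row["text"], model)
            if response is not None:
                # A response containing the model's unsafe keyword counts as a
                # positive ("toxic") prediction.
                prediction = int(model.unsafe_keyword.lower() in response.lower())
                results.loc[len(results)] = [prediction, row["is_toxic"]]
            else:
                size -= 1  # drop examples the model couldn't process
            progress.update(task, advance=1)

    elapsed = time.time() - start_time
    print(f"Processed {size} examples in {elapsed:.1f} seconds")
    return results
```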
There’s quite a bit going on here, so let’s break it down:
- We initialize a `DataFrame` to store the `results`
- We use the variable `size` to keep track of how many elements the model successfully processes
- We keep track of the `start_time` using the `time` library
- We track the evaluation progress using `Progress` from the `rich` library
- For each prediction:
  - If the response is valid, we update the `results` set
  - If the model was unable to respond to a given example, we reduce our `size` count by one
- After processing the dataset, we calculate and print the elapsed time and sample size
- Lastly, we return the `results` for further analysis
Calculate metrics
Finally, we can calculate some metrics to help us understand how closely the model aligns with the sample data. We'll use the `scikit-learn` library here, as it's an incredibly useful tool for this type of analysis.
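A sketch of that metrics step, reusing the `results` DataFrame returned by the evaluation function sketched above:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def print_metrics(results: pd.DataFrame) -> None:
    """Compare the model's predictions against the human labels."""
    y_true = results["ground_truth"].astype(int)
    y_pred = results["prediction"].astype(int)

    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
    print(f"Precision: {precision_score(y_true, y_pred):.3f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
    print(f"F1 score:  {f1_score(y_true, y_pred):.3f}")
```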
In the code above, we compute some standard binary classification metrics comparing Llama Guard’s predictions against human-labeled ground truths from the Toxicity dataset:
- Accuracy: How often do the model's predictions match the ground-truth labels?
- Precision: Of all the model's "unsafe" predictions, how many did humans label as toxic?
- Recall: Of all the human-labeled toxic items, how many did the model identify as unsafe?
- F1 Score: The harmonic mean of precision and recall, a single score that balances both metrics.
  - The harmonic mean gives more weight to smaller values, which is why it's a useful way to combine precision and recall into a single F1 score.
Run the code
A version of the full code for this post is available on GitHub. You can clone the repo and run the script as follows. Alternatively, you can copy-paste the code blocks above and run them in a Jupyter notebook—just be sure to manually install the necessary library dependencies if you choose to use Jupyter.
We'll use the Magic CLI to create a development environment on your local workstation and install the required packages.
Don't have the Magic CLI yet? Run this command in your terminal and follow the instructions:
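At the time of writing, the install command looked like this; check the Modular documentation for the current one:

```sh
curl -ssL https://magic.modular.com | bash
```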
With Magic installed, run the following commands in the terminal of your local workstation:
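The repo URL and script name below are placeholders, not the actual values; substitute them from the post's GitHub repository:

```sh
# Clone the example repo, let Magic set up the environment, and show the
# evaluation script's usage. <repo-url>, <repo-directory>, and evaluate.py
# are placeholders for the names used in the actual repo.
git clone <repo-url>
cd <repo-directory>
magic run python evaluate.py --help
```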
The last command will output:
The script provides default options for running the evaluation with Llama Guard, but you still must provide your MAX Serve endpoint URL like this:
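For example (the script and flag names here are illustrative; the script's actual arguments are defined in the repo):

```sh
# Point the script at your running MAX Serve deployment.
magic run python evaluate.py --base-url http://<your-instance-address>:8000/v1
```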
The script takes some time to run. When it finishes, it will output something like this:
To run the script against IBM Granite Guardian, use the following command. Make sure you stop your MAX Serve endpoint and start it again using the `--huggingface-repo-id ibm-granite/granite-guardian-3.0-2b` flag.
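An illustrative invocation (again, the script and flag names come from the repo, so treat these as placeholders):

```sh
# Re-run the evaluation, selecting Granite Guardian instead of Llama Guard.
magic run python evaluate.py --base-url http://<your-instance-address>:8000/v1 --model granite_guard
```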
This command will output results in the same format as Llama Guard, but you’ll notice the numbers differ quite a bit. What does this tell us? For one, this data tells us Granite Guardian’s definition of unsafe content more closely aligns with the subset of Surge AI’s Toxicity dataset we evaluated against.
Next steps
Through this practical exploration of Llama Guard, we've demonstrated how you can rapidly evaluate and deploy robust AI safety solutions with MAX. Its high-performance architecture and broad support for open models provide a production-ready foundation for responsible AI governance. We hope the evaluation framework we offer here provides a blueprint for you and your teams to assess and implement AI guardrails, enabling you to confidently innovate while maintaining necessary safeguards.
Learn more about the MAX 24.6 GPU Preview release and its SOTA performance, and join the Modular community to share your feedback and experiences.
Until next time! 🔥