December 19, 2024

Evaluating Llama Guard with MAX 24.6 and Hugging Face

Imagine unlocking a world of open innovation while ensuring secure, reliable, and enterprise-ready Gen AI deployments—MAX 24.6 enables enterprise AI teams to seamlessly run a vast range of cutting-edge AI models from Hugging Face on NVIDIA GPUs.

MAX 24.6 introduces a set of high-performance models built with the MAX Graph API. These MAX graphs are precision implementations of leading model architectures, including Llama (via LlamaForCausalLM compatibility). Among the 20,000+ Llama variants and “finetunes” available on Hugging Face, Llama Guard stands out as a crucial tool for enterprises. Models like Llama Guard empower organizations to enforce critical guardrails around AI content, addressing safety, compliance, and ethical considerations.

In this post, we’ll show you how to rapidly evaluate multiple models from Hugging Face with MAX, within the context of using Llama Guard and other prompt safety models. You’ll learn how to compare the performance of Meta’s Llama Guard and IBM’s Granite Guardian by using Surge AI’s Toxicity dataset, exploring how MAX accelerates model evaluation like never before. Whether your focus is compliance, brand protection, or creating a safe user experience, MAX provides the tools to enhance your AI strategy.

🛠️ Just want the code?
A version of the full code for this post is available on GitHub.

About the models

Just as llamas serve as natural guardians in the wild, protecting sheep and goats from predators with their vigilant nature, Meta's Llama Guard plays a similar protective role in the AI landscape. This specialized model builds on the capabilities of Meta's Llama family, screening content across multiple languages and use cases—from user queries to AI-generated content. Like its animal namesake, it spots and categorizes potential threats with remarkable accuracy.

Using the code we provide here, and thanks to the flexibility of MAX, you can directly compare Llama Guard against a number of other models, helping you determine which model is the best fit for your needs. One such alternative is IBM Granite Guardian: it draws on IBM’s comprehensive AI Risk Atlas to create a robust safety system. Built through rigorous testing and diverse data training, including carefully constructed challenge scenarios, it offers the kind of reliability you might expect from IBM’s rich history of data security.

About the dataset

To evaluate Llama Guard’s ability to categorize content as safe or unsafe, we’ll use Surge AI’s Toxicity dataset. For our evaluation, we’ll download the subset of this dataset that’s freely available on GitHub. The set features 1,000 social media posts and comments that human reviewers labeled as toxic or non-toxic.

Set up Hugging Face access

For our work here, we’ll leverage MAX’s ability to run any PyTorch LLM from Hugging Face. Before we can begin, you must obtain an access token from Hugging Face to download models hosted there. Follow the instructions in the Hugging Face documentation to obtain one.

Additionally, Meta gates access to its Llama family of models on Hugging Face. Visit the Llama Guard model’s page and submit the request form; approval is usually granted within a few minutes. Access to IBM’s Granite family of models is open and does not require an approval step.

Serve the model

Using MAX Serve is easy: we support Docker in the cloud and provide our Magic CLI for local development. If you’ve worked with other AI stacks before, you’ll notice there’s no hassling with CUDA here—MAX works at the hardware level, requiring only NVIDIA’s GPU driver. This enables you to switch between models and environments with ease.

Note: MAX supports the following NVIDIA GPUs: A100 (most optimized), A10G, L4 and L40.

Already have a GPU in the cloud?

The best way to experience model serving with MAX is on an NVIDIA GPU-equipped cloud instance. If you already have one that you can access via SSH, simply run the following commands at its terminal.

Bash
export HUGGING_FACE_HUB_TOKEN=<YOUR ACCESS TOKEN HERE>
export HUGGING_FACE_REPO_ID=meta-llama/Llama-Guard-3-8B

docker run \
  --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --gpus 1 \
  -p 8000:8000 \
  --ipc=host \
  modular/max-openai-api:24.6.0 \
  --huggingface-repo-id ${HUGGING_FACE_REPO_ID}

Need to deploy a GPU in the cloud?

If you don’t have a GPU-equipped cloud instance, but you can provision one, we have a tutorial you can follow to get MAX running with Docker on AWS, Google Cloud, or Microsoft Azure.

Note: You must swap the model name used in the documentation with meta-llama/Llama-Guard-3-8B or ibm-granite/granite-guardian-3.0-2b.

Can’t access a GPU in the cloud?

MAX has your back! The local to cloud developer experience is something we care deeply about here at Modular. Simply follow the local setup section of this tutorial to run MAX locally on your laptop—just note performance will be much slower locally than on a GPU-equipped cloud instance.

Evaluate the Models

With our MAX Serve endpoint running, we’re ready to dive in and evaluate Llama Guard. We’ll use Python to download the free subset of Surge AI’s Toxicity dataset, then run each example in the set through the model for safety classification. Finally, we’ll calculate some standard metrics for measuring deviations between the model’s predictions and the human-labeled sample data.

Define models to evaluate

In the following code, we define each model to evaluate using Python's enum library. This Model class includes the Hugging Face repository ID and the keyword the model responds with to indicate unsafe content.

Python
from enum import Enum

class Model(Enum):
    LLAMA_GUARD = ("meta-llama/Llama-Guard-3-8B", "Unsafe")
    GRANITE_GUARD = ("ibm-granite/granite-guardian-3.0-2b", "Yes")

    def __init__(self, huggingface_repo_id, unsafe_classification_keyword):
        self.huggingface_repo_id = huggingface_repo_id
        self.unsafe_classification_keyword = unsafe_classification_keyword

selected_model = Model.LLAMA_GUARD

Above, we define our two models: LLAMA_GUARD, which uses the keyword "Unsafe," and GRANITE_GUARD, which uses the keyword "Yes." We also set selected_model to LLAMA_GUARD to use that model for our evaluation.
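
If you want to confirm which model and keyword are active before going further, a quick check like the following works—a minimal sketch, assuming the Model enum above has already been run in the same session:

Python
# Minimal sanity check of the selected model (assumes the Model enum above is defined)
print(selected_model.huggingface_repo_id)             # meta-llama/Llama-Guard-3-8B
print(selected_model.unsafe_classification_keyword)   # Unsafe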

Configure API client

Next, we’ll use the OpenAI Python library to communicate with our MAX endpoint. You read that right—MAX Serve provides an OpenAI-compatible API endpoint!

Python
from openai import OpenAI

client = OpenAI(
    api_key="123",  # Use any value here; can't be blank or absent
    base_url="http://0.0.0.0:8000/v1",  # Replace 0.0.0.0 with your MAX deployment's URL
)

Let’s break down the code above:

  • api_key: Use any value here; it just can't be blank or absent
  • base_url: Replace 0.0.0.0 with your MAX Serve deployment's URL

Chat completions in MAX work just like they do with OpenAI. Let’s define a function to call the MAX endpoint:

Python
def get_llm_prediction(input):
    try:
        response = client.chat.completions.create(
            model=selected_model.huggingface_repo_id,
            messages=[{"role": "user", "content": input}]
        )
        content = response.choices[0].message.content
        return content
    except Exception as _:
        return None

As you can see above, MAX is a drop-in replacement for OpenAI.
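
Before kicking off the full evaluation, it can be worth confirming the endpoint responds to a single prompt. Here’s a minimal smoke test, assuming the client and get_llm_prediction defined above; the example string is a hypothetical one of ours, not taken from the dataset:

Python
# Quick smoke test of the endpoint (assumes client and get_llm_prediction from above)
sample = "You are a horrible person and everyone hates you."  # hypothetical example text
response = get_llm_prediction(sample)
if response is None:
    print("No response—check that the MAX Serve endpoint is running.")
else:
    keyword = selected_model.unsafe_classification_keyword
    print("Model response:", response)
    print("Classified unsafe?", keyword.lower() in response.lower())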

Note: MAX Serve is in preview. The OpenAI chat completion API works, but you may find rough edges. Additional capabilities like function calling are not yet available.

Download the dataset

Next, let’s download the free subset of Surge AI’s Toxicity dataset:

Python
import requests
import pandas as pd
from io import StringIO

def download_dataset():
    url = "https://raw.githubusercontent.com/surge-ai/toxicity/refs/heads/main/toxicity_en.csv"
    try:
        response = requests.get(url)
        response.raise_for_status()
        df = pd.read_csv(StringIO(response.text))
        df['is_toxic'] = df['is_toxic'].apply(lambda x: 1 if x == 'Toxic' else 0)
        return df
    except Exception as e:
        print("Problem downloading dataset:", e)
        return None

In the code above, we define a function to retrieve the dataset in CSV format. We use the requests library to make an HTTP GET request to the URL and check for any errors. If successful, we read the CSV data into a pandas DataFrame, converting the is_toxic column values to 1 for 'Toxic' and 0 otherwise.
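
As a quick check that the download worked, you can peek at the first few rows and the label balance—a short sketch, assuming the function above returned a DataFrame:

Python
# Quick look at the downloaded data (assumes download_dataset() from above)
df = download_dataset()
if df is not None:
    print(df[['text', 'is_toxic']].head())   # the columns used later in the evaluation
    print(df['is_toxic'].value_counts())     # how many toxic (1) vs. non-toxic (0) examples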

Run the evaluation

Now we’re ready to run our evaluation! We need to define a function that takes the dataset as input, then sends each example in the set to Llama Guard. Since it will take some time to process all 1,000 examples in the dataset, we’ll use the rich library to display a progress bar.

Python
import time
from rich.progress import Progress

def evaluate_llm(dataset, keyword="unsafe"):
    results = pd.DataFrame(columns=["content", "y_true", "y_pred"])
    size = len(dataset)
    start_time = time.time()

    with Progress() as progress:
        task = progress.add_task(
            f"[cyan]💬 Evaluating {selected_model.huggingface_repo_id}[/cyan]",
            total=size
        )
        for i, row in dataset.iterrows():
            content = row['text']
            y_true = row['is_toxic']
            response = get_llm_prediction(content)
            if response:
                y_pred = 1 if keyword.lower() in response.lower() else 0
                results.loc[i] = [content, y_true, y_pred]
            else:
                size -= 1
            progress.update(task, advance=1)
        progress.update(
            task,
            description=f"[green]✅ Evaluated {selected_model.huggingface_repo_id} [/green]"
        )

    elapsed_time = time.time() - start_time
    elapsed_time_formatted = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
    print(f"Time Elapsed: {elapsed_time_formatted}")
    print("Sample Size:", size)
    return results

There’s quite a bit going on here, so let’s break it down:

  • We initialize a DataFrame to store the results
  • We use the variable size to keep track of how many elements the model successfully processes
  • We keep track of the start_time using the time library
  • We track the evaluation progress using Progress from the rich library
  • For each prediction:
    • If the response is valid, we update the results set
    • If the model was unable to respond to a given example, we reduce our size count by one
  • After processing the dataset, we calculate and print the elapsed time and sample size
  • Lastly, we return the results for further analysis
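
Putting the pieces so far together, a typical invocation looks like this—a short sketch, assuming the functions defined above are in scope and the MAX Serve endpoint is running:

Python
# Run the evaluation end to end (assumes download_dataset and evaluate_llm from above)
dataset = download_dataset()
if dataset is not None:
    results = evaluate_llm(
        dataset,
        keyword=selected_model.unsafe_classification_keyword
    )
    print(results.head())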

Calculating metrics

Finally, we can calculate some metrics to help us understand how closely the model aligns with the sample data. We’ll use the scikit-learn library here, as it’s an incredibly useful tool for this type of analysis.

Python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_metrics(results):
    accuracy = accuracy_score(results["y_true"], results["y_pred"])
    precision, recall, f1, _ = precision_recall_fscore_support(
        results["y_true"], results["y_pred"], average="binary"
    )
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 Score: {f1:.2f}")

In the code above, we compute some standard binary classification metrics comparing Llama Guard’s predictions against human-labeled ground truths from the Toxicity dataset:

  • Accuracy: How often do the model’s predictions match the ground-truth labels?
  • Precision: Of all the model's "unsafe" predictions, how many did humans label as toxic? 
  • Recall: Of all the human-labeled toxic items, how many did the model identify as unsafe?
  • F1 Score: The harmonic mean of precision and recall, a single score that balances both metrics
    • The harmonic mean gives more weight to the smaller value, which is why it’s a useful way to combine precision and recall into one score (see the quick check after this list)
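
To make the harmonic-mean point concrete, here’s a small sketch that computes F1 by hand from a precision/recall pair like the one our sample run later in this post produces, then runs calculate_metrics on the evaluation output—assuming the results DataFrame from the previous section is still in scope:

Python
# F1 as the harmonic mean of precision and recall (values mirror the sample run shown later)
precision, recall = 0.96, 0.27
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 Score: {f1:.2f}")  # ~0.42, pulled down toward the smaller value (recall)

# Full metrics report on the evaluation results from the previous step
calculate_metrics(results)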

Run the code

A version of the full code for this post is available on GitHub. You can clone the repo and run the script as follows. Alternatively, you can copy-paste the code blocks above and run them in a Jupyter notebook—just be sure to manually install the necessary library dependencies if you choose to use Jupyter.

We'll use the Magic CLI to create a development environment on your local workstation and install the required packages.

Don't have the Magic CLI yet? Run this command in your terminal and follow the instructions:

Bash
curl -ssL https://magic.modular.com/ | bash

With Magic installed, run the following commands in the terminal of your local workstation:

Bash
git clone https://github.com/modularml/devrel-extras.git
cd devrel-extras/blogs/max-guardrails-eval
magic run help

The last command will output:

Bash
Usage: python -m guardrails_eval [OPTIONS]

Options:
  -s, --server TEXT           URL from "Uvicorn running on <URL>"
  -m, --model TEXT            Model to evaluate
  -k, --keyword TEXT          Response model gives when content is unsafe
  -n, --num_examples INTEGER  Number of examples to evaluate
  --help                      Show this message and exit.

The script provides default options for running the evaluation with Llama Guard, but you still must provide your MAX Serve endpoint URL like this:

Bash
magic run eval --server "<YOUR MAX URL HERE>"

The script takes some time to run. When it completes, it will output something like this:

Bash
✅ Evaluated meta-llama/Llama-Guard-3-8B
Time Elapsed: 00:02:52
Sample Size: 995
Accuracy: 0.63
Precision: 0.96
Recall: 0.27
F1 Score: 0.42

To run the script against IBM Granite Guardian, first stop your MAX Serve endpoint and start it again with the --huggingface-repo-id ibm-granite/granite-guardian-3.0-2b flag, then use the following command:

Bash
magic run eval \
  --server "<YOUR MAX URL HERE>" \
  --model "ibm-granite/granite-guardian-3.0-2b" \
  --keyword "Yes"

This command will output results in the same format as Llama Guard, but you’ll notice the numbers differ quite a bit. What does this tell us? For one, that Granite Guardian’s definition of unsafe content more closely aligns with the subset of Surge AI’s Toxicity dataset we evaluated against.

Next steps

Through this practical exploration of Llama Guard, we've demonstrated how you can rapidly evaluate and deploy robust AI safety solutions with MAX. Its high-performance architecture and broad support for open models provide a production-ready foundation for responsible AI governance. We hope the evaluation framework we offer here provides a blueprint for you and your teams to assess and implement AI guardrails, enabling you to confidently innovate while maintaining necessary safeguards.

Learn more about the MAX 24.6 GPU Preview release and its SOTA performance, and join the Modular community to share your feedback and experiences.

Until next time! 🔥

Bill Welense, AI Developer Advocate

Bill is a versatile technologist with a diverse background. He began his career in broadcasting and the performing arts before stepping into tech with a startup in Chicago. Holding an MS in Human-Computer Interaction, Bill has over a decade of experience in UX design and creative technology and has served as Adjunct Faculty at DePaul University. Prior to joining Modular, he spent over five years in the audio industry, where he worked as both an iOS Developer and an AI Software Engineer. Bill currently resides in Chicago, where he enjoys spending quality time with his family.

billw@modular.com