Deploying MAX on Amazon SageMaker

March 27, 2024

Shashank Prasanna

AI Developer Advocate

Model deployment is often the domain of IT professionals and cloud infrastructure experts who understand how to securely and reliably host model endpoints that scale with usage demand. Thankfully, Amazon SageMaker is fully managed and handles all the underlying infrastructure, allowing developers and data scientists like you and me, who are not IT experts, to use simple APIs to host secure, low-latency, and highly scalable model endpoints.

In this blog post, I’ll share an end-to-end guide on how to host a MAX optimized model endpoint using MAX Serving and Amazon SageMaker. Here are the steps we’ll follow:

  1. Download a pre-trained Roberta model from HuggingFace
  2. Upload the model to Amazon S3 so Amazon SageMaker and the MAX Serving container have access to it.
  3. Pull the latest MAX Serving container image and push it to Amazon Elastic Container Registry (Amazon ECR)
  4. Create an Amazon SageMaker model and deploy it to a specified instance type. We’ll use Amazon EC2 c6i.4xlarge, on which MAX Engine can deliver up to 2.6x faster performance vs. TensorFlow.
  5. Invoke the endpoint to test it
  6. Clean up AWS resources

If you’re just getting started with MAX, I also recommend reading this getting started blog post on how to optimize models and run inference with MAX, and this blog post on evaluating MAX Engine performance and accuracy.

Where can I get this example: All the code in this blog post is available as a runnable Jupyter Notebook on GitHub.

Step 0: Setup 

I ran this example on an Amazon SageMaker notebook instance, which you can create from AWS Console > Amazon SageMaker > Notebook > Notebook instances > Create notebook instance. Follow the steps to create a new notebook instance with the default Amazon SageMaker execution role.

After the notebook instance is up and running you’ll get access to a hosted Jupyter notebook client. Choose the conda_tensorflow2_p310 environment, since we’ll need TensorFlow to save our model in the TensorFlow SavedModel format.

Note: This is our development instance only. SageMaker will spin up a separate, dedicated instance for model hosting, as we’ll see in Step 4.

If you want to run this entire workflow on another system, such as your laptop or an Amazon EC2 instance, make sure that you have permissions to access resources in your AWS account. The IAM managed policy AmazonSageMakerFullAccess grants all the necessary permissions. See the AWS documentation for more details.

Next, download the example from GitHub. You can clone the entire repository or just get the Jupyter Notebook.

Now we’re ready to walk through the steps in the Jupyter Notebook.

Step 1: Download a pre-trained Roberta model from HuggingFace

The first step is to download the model we want to serve. Let’s start with some basic imports, create the boto3 and SageMaker sessions, and get the execution role, bucket name, account number, and region, all of which SageMaker needs to manage the deployment for us.

Python
import shutil
import os
import boto3
import sagemaker
import tensorflow as tf
from transformers import AutoTokenizer, TFRobertaForSequenceClassification

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["TRANSFORMERS_VERBOSITY"] = "critical"

# Create boto3 and sagemaker sessions, get role, bucket name, account number and region
sess = boto3.Session()
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket_name = sagemaker_session.default_bucket()
account = boto3.client('sts').get_caller_identity().get('Account')
region = sess.region_name

Next, I define a function to download and save the Roberta sentence classification model.

Python
def download_and_save_model(hf_model_name, saved_model_dir):
    model = TFRobertaForSequenceClassification.from_pretrained(hf_model_name)
    shutil.rmtree(saved_model_dir, ignore_errors=True)
    tf.saved_model.save(model, saved_model_dir + "/1/saved_model/")

saved_model_dir = "model-repository/roberta"
hf_model_name = "cardiffnlp/twitter-roberta-base-emotion-multilabel-latest"
download_and_save_model(hf_model_name, saved_model_dir)

The MAX Serving container is based on the NVIDIA Triton Inference Server, and it expects models to reside in the specific layout shown below (see the docs for more info). It also expects a config.pbtxt file that tells the server to use the MAX Engine backend for high-performance inference instead of the default backend.

Bash
%%sh
cat > model-repository/roberta/config.pbtxt <<EOL
instance_group {
  kind: KIND_CPU
}
default_model_filename: "saved_model"
backend: "max"
EOL
tree model-repository

Output: the tree command prints the model-repository layout, with the SavedModel under roberta/1/saved_model/ and config.pbtxt alongside it.

Step 2: Upload the model to Amazon S3 so Amazon SageMaker and the MAX Serving container have access to it

Now that you have the saved model in the format expected by Amazon SageMaker and MAX Serving, we need to compress it into a tar.gz file and upload it to Amazon S3. You can upload it to any bucket that both you and SageMaker have access to. In this example I choose the default bucket and capture the path in model_uri.

Python
# Remove any stale archive, then compress the model repository and upload it to Amazon S3
if os.path.exists('model.tar.gz'):
    os.remove('model.tar.gz')
!tar -C model-repository -czf model.tar.gz roberta
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix="max-serving-models/roberta/")

You can verify on the AWS console that the model is in Amazon S3.
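If you prefer to verify it from the notebook instead, a quick check with boto3 does the same thing (a minimal sketch that simply lists the objects under the prefix we uploaded to):

Python
# List the uploaded model artifact to confirm it landed in Amazon S3
s3_client = boto3.client("s3")
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix="max-serving-models/roberta/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])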

Step 3: Pull the latest MAX Serving container image and push it to Amazon Elastic Container Registry (Amazon ECR)

Amazon SageMaker expects the container image used to host the model to live in a private Amazon Elastic Container Registry (Amazon ECR) repository. Modular provides a pre-built container image, public.ecr.aws/modular/max-serving-de, so we must first pull the image to our system and then push it to our private ECR repository. Note: If your development instance has a different architecture than your deployment instance, be sure to choose the right tag when pulling the MAX Serving container. You can find the tags for all platforms here: https://gallery.ecr.aws/modular/max-serving-de
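If you're not sure which architecture your development machine is running, a quick check from the notebook (a minimal sketch) is:

Python
import platform
# Prints the local machine architecture (e.g. 'x86_64' or 'aarch64');
# pick the MAX Serving image tag that matches your deployment instance
print(platform.machine())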

Python
repo_name = 'sagemaker-max-serving'
image_label = 'v1'
max_serving_image_uri = "public.ecr.aws/modular/max-serving-de"
image = f'{account}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{image_label}'

!aws ecr create-repository --repository-name {repo_name}
!docker pull {max_serving_image_uri}
!docker tag {max_serving_image_uri} {image}
!$(aws ecr get-login --no-include-email --region {region})  # For aws-cli v1.x
# If using aws-cli v2.x run:
# !aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {f'{account}.dkr.ecr.{region}.amazonaws.com'}
!docker push {image}

Here’s what’s happening in each line of code.

Create a new repository called sagemaker-max-serving.

  • aws ecr create-repository --repository-name {repo_name}

Pull the MAX Serving container hosted by Modular

  • docker pull {max_serving_image_uri}

Tag the MAX Serving container with the name that matches the repository we created.

  • docker tag {max_serving_image_uri} {image}

Log in to Amazon ECR in your region

  • For aws-cli v1.x: $(aws ecr get-login --no-include-email --region {region})
  • For aws-cli v2.x: aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {f'{account}.dkr.ecr.{region}.amazonaws.com'}

Finally, push a copy of the MAX Serving container to your ECR repository.

  • docker push {image}

You can head over to Amazon ECR to confirm that your image is now available for Amazon SageMaker.
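You can also confirm it from the notebook with the ECR API (a minimal sketch that lists the image tags in the repository we just pushed to):

Python
# List image tags in the new repository to confirm the push succeeded
ecr_client = boto3.client('ecr')
image_details = ecr_client.describe_images(repositoryName=repo_name)['imageDetails']
print([detail.get('imageTags') for detail in image_details])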

Step 4: Create an Amazon SageMaker model and deploy to specified instance type

In this example I’ll deploy our Roberta model to an Amazon EC2 c6i.4xlarge instance. If you head over to our performance dashboard, you can see that MAX Engine can deliver up to 2.6x faster performance vs. TensorFlow on this model and EC2 instance.

Python
from sagemaker.model import Model
from datetime import datetime

date = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
model_name = f"MAX-model-roberta-{date}"

max_model = Model(
    model_data=model_uri,
    name=model_name,
    role=role,
    image_uri=image,
)

In the above code, we first create a Model using the Amazon SageMaker SDK and specify attributes including:

  • Model path on Amazon S3
  • Path to MAX Serving container image in Amazon ECR
  • IAM Role 
  • Model name (optional)

With the model created, a single API call deploys it to the specified instance type on Amazon SageMaker.

Python
date = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
endpoint_name = f"MAX-endpoint-roberta-{date}"

predictor = max_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c6i.4xlarge",
    endpoint_name=endpoint_name,
)

Notice that I specify initial_instance_count=1. You can specify a higher number to load balance a larger volume of requests, or head over to AWS Console > Amazon SageMaker > Inference > Endpoint Configuration > the endpoint configuration that was just created and add scaling policies that automatically add or remove instances based on traffic.
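Here’s a minimal sketch of what such a scaling policy could look like using the Application Auto Scaling API. Note that the variant name AllTraffic is the default assigned by the SageMaker SDK, and the target of 70 invocations per instance, the maximum of 4 instances, and the cooldowns are just illustrative values:

Python
# Register the endpoint variant as a scalable target and attach a target-tracking policy
autoscaling = boto3.client('application-autoscaling')
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName=f"max-roberta-scaling-{date}",
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        # Scale out when the average invocations per instance exceed this target
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 300,
    },
)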

You can also confirm that the endpoint was created and is operational on the AWS Console.
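Alternatively, you can check the endpoint status from the notebook (a small sketch; 'InService' means the endpoint is ready to accept requests):

Python
# Describe the endpoint and print its current status
sm_client = boto3.client('sagemaker')
print(sm_client.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus'])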

You can also check Amazon CloudWatch for the inference request logs.
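If you’d rather pull those logs from the notebook, here’s a minimal sketch using the CloudWatch Logs API. SageMaker writes endpoint logs to the /aws/sagemaker/Endpoints/<endpoint-name> log group:

Python
# Fetch the most recent log stream for the endpoint and print its events
logs = boto3.client('logs')
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
streams = logs.describe_log_streams(logGroupName=log_group, orderBy='LastEventTime', descending=True)
if streams['logStreams']:
    latest_stream = streams['logStreams'][0]['logStreamName']
    events = logs.get_log_events(logGroupName=log_group, logStreamName=latest_stream)
    for event in events['events']:
        print(event['message'])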

Step 5: Invoke the endpoint to test it

With our MAX Serving endpoint hosted and operational, let’s send some inference requests to it!

To keep things simple, MAX Serving only runs model inference in this example; I did not set it up to do any pre- or post-processing steps such as tokenization or converting IDs back to labels. Instead, I’ll do those steps in the notebook. You can alternatively include the pre- and post-processing steps in the MAX Serving container, and I’ll demonstrate that in an upcoming blog post. Let us know on Discord if you want to see more content on deployment.

Python
import numpy as np
import json

model = TFRobertaForSequenceClassification.from_pretrained(hf_model_name)
client = boto3.client("sagemaker-runtime")

text = "MAX Serving and Amazon SageMaker are a match made in heaven"
tokenizer = AutoTokenizer.from_pretrained(hf_model_name)
inputs = tokenizer(text, return_tensors="np", return_token_type_ids=True)

payload = {
    "inputs": [
        {"name": "input_ids", "shape": inputs["input_ids"].shape, "datatype": "INT32", "data": inputs["input_ids"].tolist()},
        {"name": "attention_mask", "shape": inputs["attention_mask"].shape, "datatype": "INT32", "data": inputs["attention_mask"].tolist()},
        {"name": "token_type_ids", "shape": inputs["token_type_ids"].shape, "datatype": "INT32", "data": inputs["token_type_ids"].tolist()},
    ]
}

I’ll use the boto3 client to invoke the endpoint with our payload from above, get the response, and find the most confident classification.

Python
http_response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
)
response = json.loads(http_response["Body"].read().decode("utf8"))
outputs = response["outputs"]

predicted_class_id = np.argmax(outputs[0]['data'], axis=-1)
classification = model.config.id2label[predicted_class_id]
print(f"The sentiment of the input statement is: {classification}")


Output
The sentiment of the input statement is: joy
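If you want the scores for every emotion rather than just the top class, you can post-process the raw outputs yourself. Here’s a minimal sketch, assuming the output tensor holds one logit per label in model.config.id2label (since this is a multi-label model, a sigmoid gives an independent score per label):

Python
# Convert the raw logits into independent per-label scores
logits = np.array(outputs[0]['data'], dtype=np.float32).flatten()
scores = 1 / (1 + np.exp(-logits))
for idx, score in enumerate(scores):
    print(f"{model.config.id2label[idx]}: {score:.3f}")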

Step 6: Clean up AWS resources

On AWS you only pay for what you use, which means when you are done using services, you have to clean up resources. You can run the following commands to delete the endpoint, endpoint config, model, Amazon S3 artifacts, and Amazon ECR repository we created.

Python
sm = sess.client('sagemaker')
endpoint_config_name = sm.describe_endpoint(EndpointName=endpoint_name)['EndpointConfigName']
model_name = sm.describe_endpoint_config(EndpointConfigName=endpoint_config_name)['ProductionVariants'][0]['ModelName']

# Delete endpoint and clean up model and endpoint config
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=model_name)

# Delete model artifacts in Amazon S3
s3 = boto3.resource("s3")
bucket = s3.Bucket(bucket_name)
bucket.objects.filter(Prefix="max-serving-models/roberta/").all().delete()

# Delete the Amazon ECR repository and all the images we created
ecr = boto3.client('ecr')
ecr.delete_repository(registryId=account, repositoryName=repo_name, force=True)

Conclusion

AWS offers many options to host models for inference, including Amazon EC2, Amazon Elastic Kubernetes Service (EKS) for managing container orchestration, and fully managed services such as Amazon SageMaker, which we discussed in this blog post. There are benefits to every approach depending on how much flexibility and control you want over your deployments.

For most data scientists and developers who are not already IT professionals or MLOps experts, Amazon SageMaker offers a great balance of ease of use, flexibility, and scalability, making it easy to experiment with models and deploy them quickly.

I hope you enjoyed this walkthrough! I sure enjoyed writing it. The Jupyter notebook with all the code in this blog post is available on GitHub. Check it out and share your feedback on Discord.

Until next time! 🔥

Additional resources:

Report feedback, including issues, on our Mojo and MAX GitHub tracker.
