MAX Engine is a high-performance AI compiler and runtime designed to deliver low-latency, high-throughput inference for AI applications. We've shared how you can get started quickly with MAX in this getting started guide, and how you can deploy MAX Engine optimized models as a microservice using MAX Serving. In this blog post, I'll show you how MAX Engine optimized models provide huge performance gains while still delivering highly accurate inference results.
When you provide MAX Engine with a TensorFlow, PyTorch, or ONNX model, it performs several graph-level optimizations such as operator and kernel fusion, kernel specialization, memory layout optimization, shape inference, constant folding, and more. These graph-level optimizations do not change the underlying computation in the graph; instead, they restructure the graph so the same operations run faster and more efficiently while maintaining high numerical accuracy.
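To make the idea concrete, here is a toy illustration of what operator fusion means. This is a hand-written NumPy sketch of the general technique, not MAX Engine's actual internals:

```python
import numpy as np

# Unfused: three separate "ops", each materializing an intermediate tensor
# that has to be written to and read back from memory.
def matmul_bias_relu_unfused(x, w, b):
    y = x @ w                # op 1: matmul (stand-in for a conv)
    y = y + b                # op 2: bias add
    return np.maximum(y, 0)  # op 3: relu

# Fused: the same math in a single expression, with no intermediate buffers
# handed between ops. A graph compiler emits one kernel for patterns like
# this; the computation is unchanged, which is why accuracy is preserved.
def matmul_bias_relu_fused(x, w, b):
    return np.maximum(x @ w + b, 0)

x = np.random.rand(8, 64).astype(np.float32)
w = np.random.rand(64, 32).astype(np.float32)
b = np.random.rand(32).astype(np.float32)
assert np.allclose(matmul_bias_relu_unfused(x, w, b),
                   matmul_bias_relu_fused(x, w, b))
```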
Through the rest of this blog post, I'll walk you through setting up and running an experiment to compare MAX Engine's inference performance and accuracy against native TensorFlow execution on the famous ImageNet dataset. We'll see that MAX Engine delivers massive performance gains while maintaining the same high accuracy you get from TensorFlow.
MAX Engine accuracy on the ImageNet dataset using the ResNet50 model

The ImageNet dataset is one of the most influential datasets of all time, since it set the modern AI revolution in motion with the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. The dataset is huge, with 1.2 million images for training and 50,000 images for validation. This makes it a good dataset for benchmarking and testing model performance and accuracy.
The ResNet50 model (Kaiming He et al.) won the ILSVRC competition in 2015 and is now a standard for image classification, thanks to its high accuracy and ease of training. ResNet50 is also commonly used as a backbone architecture for various other models, such as the R-CNN family of models and some of the YOLO family of models used for object detection.
Due to its popularity, deep learning frameworks like TensorFlow and PyTorch include highly optimized implementations of ResNet50. However, as you'll see in this example, MAX Engine can squeeze out 2.4x faster inference performance over native TensorFlow while maintaining the same high accuracy. You can see the inference accuracy and performance results below:
In the table, you can see that the validation accuracy is identical for TensorFlow+Keras model execution and MAX Engine execution across all 50,000 images in the ImageNet validation dataset. At the same time, you get a 2.4x throughput speedup and sub-100-millisecond inference latency for a batch size of 8. Check out performance.modular.com for more performance data.
Download and prepare the dataset

You can get the ImageNet dataset directly from the ImageNet website by signing up and requesting an access key to download the data. The dataset is also available on Kaggle as an alternative source. The entire dataset is approximately 167GB, but I'll only be using the approximately 7GB validation split to measure accuracy. I also used this helpful script from the good people on the TensorFlow team to convert the raw images into TFRecord format, which is far more efficient to read and process using TensorFlow's tf.data API. I used an Amazon EC2 c6i.4xlarge instance running Ubuntu 20.04 to run this example. Make sure you have sufficient storage space for the dataset. All of the code below is available on GitHub.
Set up the data loading pipeline

In order to feed our model with batches of images, we have to set up a data loading pipeline. The pipeline includes creating a TFRecordDataset, preprocessing the images, and performing data augmentation. The code for that is below:
```python
import os
import shutil
import time

import numpy as np
import pandas as pd

# Force CPU execution; this must be set before TensorFlow is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.applications.resnet import preprocess_input, ResNet50


def deserialize_image_record(record):
    # Parse a single ImageNet TFRecord into the encoded JPEG bytes,
    # the integer class label, and the human-readable class name.
    feature_map = {'image/encoded': tf.io.FixedLenFeature([], tf.string, ''),
                   'image/class/label': tf.io.FixedLenFeature([1], tf.int64, -1),
                   'image/class/text': tf.io.FixedLenFeature([], tf.string, '')}
    obj = tf.io.parse_single_example(serialized=record, features=feature_map)
    imgdata = obj['image/encoded']
    label = tf.cast(obj['image/class/label'], tf.int32)
    label_text = tf.cast(obj['image/class/text'], tf.string)
    return imgdata, label, label_text


def val_preprocessing(record):
    imgdata, label, label_text = deserialize_image_record(record)
    # TFRecord labels are 1-indexed; the Keras model expects 0-indexed labels.
    label -= 1

    image = tf.io.decode_jpeg(imgdata, channels=3,
                              fancy_upscaling=False,
                              dct_method='INTEGER_FAST')

    # Resize so the shorter side is 256 pixels, preserving the aspect ratio.
    shape = tf.shape(image)
    height = tf.cast(shape[0], tf.float32)
    width = tf.cast(shape[1], tf.float32)
    side = tf.cast(tf.convert_to_tensor(256, dtype=tf.int32), tf.float32)

    scale = tf.cond(tf.greater(height, width),
                    lambda: side / width,
                    lambda: side / height)

    new_height = tf.cast(tf.math.rint(height * scale), tf.int32)
    new_width = tf.cast(tf.math.rint(width * scale), tf.int32)

    image = tf.image.resize(image, [new_height, new_width], method='bicubic')
    # Center-crop to the 224x224 input size ResNet50 expects, then apply
    # the standard ResNet50 input preprocessing.
    image = tf.image.resize_with_crop_or_pad(image, 224, 224)
    image = preprocess_input(image)
    return image, label, label_text


def get_dataset(batch_size, use_cache=False):
    data_dir = '/path/to/imagenet/dataset/tf-records/validation/*'
    files = tf.io.gfile.glob(os.path.join(data_dir))
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(map_func=val_preprocessing, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.batch(batch_size=batch_size)
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat(count=1)

    if use_cache:
        shutil.rmtree('tfdatacache', ignore_errors=True)
        os.mkdir('tfdatacache')
        dataset = dataset.cache('./tfdatacache/imagenet_val')

    return dataset
```
Make sure you update data_dir = '/path/to/imagenet/dataset/tf-records/validation/*' to point to where your ImageNet validation TFRecord files are.
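Before kicking off a full run, it's worth pulling a single batch through the pipeline to confirm everything is wired up correctly. This is a quick sanity-check sketch, assuming your TFRecord files are already in place:

```python
# Pull one batch through the pipeline and confirm shapes and labels look sane.
ds = get_dataset(batch_size=8)
for images, labels, label_texts in ds.take(1):
    print(images.shape)              # expected: (8, 224, 224, 3)
    print(labels.numpy().squeeze())  # integer class IDs in [0, 999]
    print(label_texts.numpy()[:2])   # human-readable class names
```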
Download a ResNet50 model

We can download the ResNet50 model using the Keras API and save it in TensorFlow saved_model format, since MAX Engine expects models to be in this format. MAX Engine also supports TorchScript and ONNX model formats.
```python
def download_and_save_model(keras_model, saved_model_dir):
    # Instantiate the model with pretrained ImageNet weights and
    # export it as a TensorFlow SavedModel.
    model = keras_model(weights='imagenet')
    shutil.rmtree(saved_model_dir, ignore_errors=True)
    model.save(saved_model_dir,
               include_optimizer=False,
               save_format='tf')


saved_model_dir = "resnet50_saved_model"
download_and_save_model(ResNet50, saved_model_dir)
```
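The input keyword we'll pass to MAX Engine later (input_1) and the output key ('predictions') come from this model's serving signature. If you want to verify the names for your own model, you can inspect the SavedModel directly; this is an optional check, not required for the benchmark:

```python
# Inspect the serving signature to find the input tensor name ('input_1'
# for this Keras ResNet50) and the output name ('predictions').
loaded = tf.saved_model.load(saved_model_dir)
infer = loaded.signatures['serving_default']
print(infer.structured_input_signature)
print(infer.structured_outputs)
```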
Inference on the ImageNet validation dataset using TensorFlow and Keras

First, let's run the inference loop over the ImageNet validation dataset using TensorFlow only, and report its performance and accuracy results.
```python
model = tf.keras.models.load_model(saved_model_dir)

display_every = 500
display_threshold = display_every
pred_labels = []
actual_labels = []
iter_times = []
batch_size = 8

# Get the tf.data.TFRecordDataset object for the ImageNet2012 validation dataset
dataset = get_dataset(batch_size)

walltime_start = time.time()
for i, (validation_ds, batch_labels, _) in enumerate(dataset):
    # Time only the forward pass, not the data loading.
    start_time = time.time()
    pred_prob_keras = model(validation_ds)
    iter_times.append(time.time() - start_time)

    actual_labels.extend(label for label_list in batch_labels.numpy() for label in label_list)
    pred_labels.extend(list(np.argmax(pred_prob_keras, axis=1)))

    # Print running throughput and cumulative accuracy every 500 images.
    if i * batch_size >= display_threshold:
        avg_throughput = np.mean(batch_size / np.array(iter_times[-display_every:]))
        cum_acc = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)
        print(f'Images {i * batch_size}/50000. Average i/s {avg_throughput:.4f}. Cum. acc: {cum_acc:.4f}')
        display_threshold += display_every

iter_times = np.array(iter_times)
acc_keras_cpu = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)

# Summarize accuracy, throughput, and latency statistics in a DataFrame.
keras_results = pd.DataFrame(columns=[f'keras_cpu_{batch_size}'])
keras_results.loc['user_batch_size'] = [batch_size]
keras_results.loc['accuracy'] = [acc_keras_cpu]
keras_results.loc['prediction_time'] = [np.sum(iter_times)]
keras_results.loc['wall_time'] = [time.time() - walltime_start]
keras_results.loc['images_per_sec_mean'] = [np.mean(batch_size / iter_times)]
keras_results.loc['images_per_sec_std'] = [np.std(batch_size / iter_times, ddof=1)]
keras_results.loc['latency_mean'] = [np.mean(iter_times) * 1000]
keras_results.loc['latency_99th_percentile'] = [np.percentile(iter_times, q=99, interpolation="lower") * 1000]
keras_results.loc['latency_median'] = [np.median(iter_times) * 1000]
keras_results.loc['latency_min'] = [np.min(iter_times) * 1000]

# display() renders the DataFrame in a notebook; use print() in a plain script.
display(keras_results)
```
Output:
```
Images 504/50000. Average i/s 41.4532. Cum. acc: 0.7402
Images 1000/50000. Average i/s 41.5001. Cum. acc: 0.7401
Images 1504/50000. Average i/s 41.5313. Cum. acc: 0.7526
Images 2000/50000. Average i/s 41.5613. Cum. acc: 0.7480
Images 2504/50000. Average i/s 41.5802. Cum. acc: 0.7560
...
Images 48000/50000. Average i/s 41.9211. Cum. acc: 0.7491
Images 48504/50000. Average i/s 41.9232. Cum. acc: 0.7489
Images 49000/50000. Average i/s 41.9431. Cum. acc: 0.7486
Images 49504/50000. Average i/s 41.9518. Cum. acc: 0.7494
```
We get about 41 images/second with native TensorFlow inference, and you can see the full summary below:
Inference on the ImageNet validation dataset using MAX Engine

Now let's take a look at MAX Engine's performance and accuracy. To use the same model with MAX Engine, you first have to compile the model, and all that takes is three lines of code:
```python
from max import engine

# Create an inference session and compile the SavedModel with MAX Engine.
sess = engine.InferenceSession()
model = sess.load(saved_model_dir)
```
After that, you just have to replace model(validation_ds) with model.execute(input_1=validation_ds), and the rest of the code remains the same (except for any variable name changes you want to make).
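For example, here's a single batch run through the compiled model. This is a minimal sketch; note that model.execute returns a dictionary keyed by the model's output name, which is 'predictions' for this ResNet50:

```python
# model.execute takes named inputs and returns a dict of named outputs.
for images, labels, _ in get_dataset(batch_size=8).take(1):
    outputs = model.execute(input_1=images)
    top1 = np.argmax(outputs['predictions'], axis=1)  # top-1 class per image
    print(top1)
```

With that change in place, the full benchmarking loop looks like this: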
```python
### MAX Engine Python API ###
from max import engine

sess = engine.InferenceSession()
model = sess.load(saved_model_dir)

display_every = 500
display_threshold = display_every
pred_labels = []
actual_labels = []
iter_times = []
batch_size = 8

# Get the tf.data.TFRecordDataset object for the ImageNet2012 validation dataset
dataset = get_dataset(batch_size)

walltime_start = time.time()
for i, (validation_ds, batch_labels, _) in enumerate(dataset):
    # Time only the forward pass; execute() returns a dict keyed by output name.
    start_time = time.time()
    pred_prob_max = model.execute(input_1=validation_ds)
    iter_times.append(time.time() - start_time)

    actual_labels.extend(label for label_list in batch_labels.numpy() for label in label_list)
    pred_labels.extend(list(np.argmax(pred_prob_max['predictions'], axis=1)))

    # Print running throughput and cumulative accuracy every 500 images.
    if i * batch_size >= display_threshold:
        avg_throughput = np.mean(batch_size / np.array(iter_times[-display_every:]))
        cum_acc = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)
        print(f'Images {i * batch_size}/50000. Average i/s {avg_throughput:.4f}. Cum. acc: {cum_acc:.4f}')
        display_threshold += display_every

iter_times = np.array(iter_times)
acc_max = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)

# Summarize accuracy, throughput, and latency statistics in a DataFrame.
max_results = pd.DataFrame(columns=[f'max_cpu_{batch_size}'])
max_results.loc['user_batch_size'] = [batch_size]
max_results.loc['accuracy'] = [acc_max]
max_results.loc['prediction_time'] = [np.sum(iter_times)]
max_results.loc['wall_time'] = [time.time() - walltime_start]
max_results.loc['images_per_sec_mean'] = [np.mean(batch_size / iter_times)]
max_results.loc['images_per_sec_std'] = [np.std(batch_size / iter_times, ddof=1)]
max_results.loc['latency_mean'] = [np.mean(iter_times) * 1000]
max_results.loc['latency_99th_percentile'] = [np.percentile(iter_times, q=99, interpolation="lower") * 1000]
max_results.loc['latency_median'] = [np.median(iter_times) * 1000]
max_results.loc['latency_min'] = [np.min(iter_times) * 1000]

# display() renders the DataFrame in a notebook; use print() in a plain script.
display(max_results)
```
Output:
```
Compiling model.
Done!
Images 504/50000. Average i/s 98.4731. Cum. acc: 0.7402
Images 1000/50000. Average i/s 98.8026. Cum. acc: 0.7401
Images 1504/50000. Average i/s 98.8956. Cum. acc: 0.7526
Images 2000/50000. Average i/s 99.0765. Cum. acc: 0.7480
Images 2504/50000. Average i/s 99.4256. Cum. acc: 0.7560
...
Images 48000/50000. Average i/s 93.7445. Cum. acc: 0.7491
Images 48504/50000. Average i/s 93.3451. Cum. acc: 0.7489
Images 49000/50000. Average i/s 93.3041. Cum. acc: 0.7486
Images 49504/50000. Average i/s 93.2523. Cum. acc: 0.7494
```
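With both runs complete, you can line the two result DataFrames up for a side-by-side comparison. This is a small convenience snippet, assuming keras_results and max_results from the runs above are still in scope:

```python
# Join the two summary DataFrames for a side-by-side comparison.
results = pd.concat([keras_results, max_results], axis=1)

# Compute the throughput speedup of MAX Engine over native TensorFlow.
speedup = (results.loc['images_per_sec_mean', f'max_cpu_{batch_size}'] /
           results.loc['images_per_sec_mean', f'keras_cpu_{batch_size}'])
print(f'MAX Engine speedup over TensorFlow: {speedup:.2f}x')
display(results)
```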
This gives us about a 2.4x speedup on this model compared to native TensorFlow execution. But we're even more interested in accuracy, and you can see that the cumulative accuracy at every checkpoint is identical to the TensorFlow-only run above.
Conclusion

With MAX Engine you get huge performance gains over an already optimized TensorFlow implementation while maintaining high accuracy! Check out the MAX performance dashboard for the speedups MAX can deliver across a range of popular computer vision, natural language, recommender system, and other models: performance.modular.com. The code example used in this blog post is available on GitHub. Download MAX, try it out, and share your feedback with us!
Until next time!🔥
Additional resources:
Report feedback, including issues, on our Mojo and MAX GitHub tracker