March 13, 2024

Evaluating MAX Engine inference accuracy on the ImageNet dataset

MAX Engine is a high-performance AI compiler and runtime designed to deliver low-latency, high-throughput inference for AI applications. We've shared how you can get started quickly with MAX in this getting started guide, and how you can deploy MAX Engine optimized models as a microservice using MAX Serving. In this blog post, I'll show you how MAX Engine optimized models provide huge performance gains while still delivering highly accurate inference results.

When you provide MAX Engine with a TensorFlow, PyTorch, or ONNX model, it performs several graph-level optimizations such as operator and kernel fusion, kernel specialization, memory layout optimization, shape inference, constant folding, and more. These graph-level optimizations do not change the underlying computation in the graph; instead, they restructure the graph to perform the operations much faster and more efficiently while maintaining high numerical accuracy.
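To build some intuition for what a rewrite like operator fusion looks like, here's a toy sketch in plain NumPy. This is purely illustrative and not how MAX Engine is implemented: fusing a multiply, add, and ReLU into one kernel avoids materializing intermediate tensors, but the math, and therefore the result, is unchanged.

Python
import numpy as np

# Illustrative sketch only -- not MAX Engine internals.
# "Unfused" graph: three separate ops, each materializing an intermediate tensor.
def scale_shift_relu_unfused(x, w, b):
    y = x * w                # op 1
    y = y + b                # op 2
    return np.maximum(y, 0)  # op 3

# "Fused" graph: one pass over the data, no intermediate buffers.
def scale_shift_relu_fused(x, w, b):
    return np.maximum(x * w + b, 0)

x = np.random.rand(4, 4).astype(np.float32)
w, b = np.float32(2.0), np.float32(-1.0)

# Same computation, same numerical result
assert np.allclose(scale_shift_relu_unfused(x, w, b),
                   scale_shift_relu_fused(x, w, b))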

Through the rest of this blog post, I'll walk you through setting up and running an experiment to compare MAX Engine's inference performance and accuracy against native TensorFlow execution on the famous ImageNet dataset. We'll see that MAX Engine delivers massive performance gains while maintaining the same high accuracy you get from TensorFlow.

MAX Engine accuracy on the ImageNet dataset using the ResNet50 model

The ImageNet dataset is one of the most influential datasets of all time, since it set the modern AI revolution in motion with the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. The dataset is huge, with 1.2 million images for training and 50,000 images for validation. This makes it a good dataset for benchmarking and testing model performance and accuracy.

The ResNet50 model (Kaiming He et al.) won the ILSVRC competition in 2015 and is now a standard for image classification, thanks to its high accuracy and ease of training. ResNet50 is also commonly used as a backbone architecture for various other models, such as the R-CNN family of models and some YOLO family models used for object detection.

Due to its popularity, deep learning frameworks like TensorFlow and PyTorch include highly optimized implementations of ResNet50. However, as you'll see in this example, MAX Engine can squeeze out 2.4x faster inference performance over native TensorFlow while maintaining the same high accuracy. You can see the inference accuracy and performance results below:

In the table, you can see that the validation accuracy is identical for both TensorFlow+Keras model execution and MAX Engine execution across all 50,000 images in the ImageNet validation dataset. At the same time, you get a 2.4x throughput speedup and sub-100-millisecond inference for a batch size of 8. Check out performance.modular.com for more performance data.

Download and prepare the dataset

You can get the ImageNet dataset directly from the ImageNet website by signing up and requesting an access key to download the data. The dataset is also available on Kaggle as an alternative source. The entire dataset is approximately 167GB, but I'll only be using the approximately 7GB validation set to measure accuracy. I also used this helpful script from the good people on the TensorFlow team to convert the raw images into TFRecord format, which is far more efficient to read and process using TensorFlow's tf.data API. I used an Amazon EC2 c6i.4xlarge instance running Ubuntu 20.04 to run this example. Make sure you have sufficient storage space for the dataset. All of the code below is available on GitHub.

Set up the data loading pipeline

To feed our model with batches of images, we have to set up a data loading pipeline. The pipeline creates a TFRecordDataset and preprocesses each image: decoding the JPEG, resizing so the shorter side is 256 pixels, center-cropping to 224x224, and applying ResNet50's input normalization. The code for that is below:

Python
import os
import shutil
import time

import numpy as np
import pandas as pd

# Run on CPU only
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.applications.resnet import preprocess_input, ResNet50


def deserialize_image_record(record):
    feature_map = {'image/encoded': tf.io.FixedLenFeature([], tf.string, ''),
                   'image/class/label': tf.io.FixedLenFeature([1], tf.int64, -1),
                   'image/class/text': tf.io.FixedLenFeature([], tf.string, '')}
    obj = tf.io.parse_single_example(serialized=record, features=feature_map)
    imgdata = obj['image/encoded']
    label = tf.cast(obj['image/class/label'], tf.int32)
    label_text = tf.cast(obj['image/class/text'], tf.string)
    return imgdata, label, label_text


def val_preprocessing(record):
    imgdata, label, label_text = deserialize_image_record(record)
    label -= 1  # TFRecord labels are 1-indexed; Keras class indices are 0-indexed

    image = tf.io.decode_jpeg(imgdata, channels=3,
                              fancy_upscaling=False,
                              dct_method='INTEGER_FAST')

    # Scale the shorter side to 256 pixels, preserving the aspect ratio
    shape = tf.shape(image)
    height = tf.cast(shape[0], tf.float32)
    width = tf.cast(shape[1], tf.float32)
    side = tf.cast(tf.convert_to_tensor(256, dtype=tf.int32), tf.float32)

    scale = tf.cond(tf.greater(height, width),
                    lambda: side / width,
                    lambda: side / height)

    new_height = tf.cast(tf.math.rint(height * scale), tf.int32)
    new_width = tf.cast(tf.math.rint(width * scale), tf.int32)

    image = tf.image.resize(image, [new_height, new_width], method='bicubic')
    # Center-crop to the 224x224 input that ResNet50 expects
    image = tf.image.resize_with_crop_or_pad(image, 224, 224)
    image = preprocess_input(image)
    return image, label, label_text


def get_dataset(batch_size, use_cache=False):
    data_dir = '/path/to/imagenet/dataset/tf-records/validation/*'
    files = tf.io.gfile.glob(os.path.join(data_dir))
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.map(map_func=val_preprocessing,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.batch(batch_size=batch_size)
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat(count=1)

    if use_cache:
        shutil.rmtree('tfdatacache', ignore_errors=True)
        os.mkdir('tfdatacache')
        dataset = dataset.cache('./tfdatacache/imagenet_val')

    return dataset

Make sure you update data_dir = '/path/to/imagenet/dataset/tf-records/validation/*' to point to wherever your ImageNet validation TFRecord files are.
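Before running the full benchmark, it can help to pull a single batch through the pipeline to confirm the TFRecords parse and preprocess correctly. This quick check is my addition rather than part of the original benchmark code, and it uses only the get_dataset function defined above:

Python
# Sanity check: fetch one batch and inspect shapes and dtypes
ds = get_dataset(batch_size=8)
images, labels, label_texts = next(iter(ds))
print(images.shape, images.dtype)   # expect (8, 224, 224, 3) float32
print(labels.shape, labels.dtype)   # expect (8, 1) int32
print(label_texts[0].numpy())       # human-readable class name of the first image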

Download a ResNet50 model

We can download the ResNet50 model using the Keras API and save it in TensorFlow SavedModel format, since MAX Engine expects models in this format. MAX Engine also supports TorchScript and ONNX model formats.

Python
def download_and_save_model(keras_model, saved_model_dir):
    model = keras_model(weights='imagenet')
    shutil.rmtree(saved_model_dir, ignore_errors=True)
    model.save(saved_model_dir, include_optimizer=False, save_format='tf')

saved_model_dir = "resnet50_saved_model"
download_and_save_model(ResNet50, saved_model_dir)

Inference on the ImageNet validation dataset using TensorFlow and Keras

First, let's run the inference loop over the ImageNet validation dataset using TensorFlow only, and report its performance and accuracy results.

Python
model = tf.keras.models.load_model(saved_model_dir)

display_every = 500
display_threshold = display_every

pred_labels = []
actual_labels = []
iter_times = []

batch_size = 8
# Get the tf.data.TFRecordDataset object for the ImageNet2012 validation dataset
dataset = get_dataset(batch_size)

walltime_start = time.time()
for i, (validation_ds, batch_labels, _) in enumerate(dataset):
    start_time = time.time()
    pred_prob_keras = model(validation_ds)
    iter_times.append(time.time() - start_time)

    actual_labels.extend(label for label_list in batch_labels.numpy() for label in label_list)
    pred_labels.extend(list(np.argmax(pred_prob_keras, axis=1)))

    if i * batch_size >= display_threshold:
        avg_throughput = np.mean(batch_size / np.array(iter_times[-display_every:]))
        cum_acc = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)
        print(f'Images {i*batch_size}/50000. Average i/s {avg_throughput:.4f}. Cum. acc: {cum_acc:.4f}')
        display_threshold += display_every

iter_times = np.array(iter_times)
acc_keras_cpu = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)

keras_results = pd.DataFrame(columns=[f'keras_cpu_{batch_size}'])
keras_results.loc['user_batch_size'] = [batch_size]
keras_results.loc['accuracy'] = [acc_keras_cpu]
keras_results.loc['prediction_time'] = [np.sum(iter_times)]
keras_results.loc['wall_time'] = [time.time() - walltime_start]
keras_results.loc['images_per_sec_mean'] = [np.mean(batch_size / iter_times)]
keras_results.loc['images_per_sec_std'] = [np.std(batch_size / iter_times, ddof=1)]
keras_results.loc['latency_mean'] = [np.mean(iter_times) * 1000]
keras_results.loc['latency_99th_percentile'] = [np.percentile(iter_times, q=99, interpolation="lower") * 1000]
keras_results.loc['latency_median'] = [np.median(iter_times) * 1000]
keras_results.loc['latency_min'] = [np.min(iter_times) * 1000]
display(keras_results)

Output
Images 504/50000. Average i/s 41.4532. Cum. acc: 0.7402
Images 1000/50000. Average i/s 41.5001. Cum. acc: 0.7401
Images 1504/50000. Average i/s 41.5313. Cum. acc: 0.7526
Images 2000/50000. Average i/s 41.5613. Cum. acc: 0.7480
Images 2504/50000. Average i/s 41.5802. Cum. acc: 0.7560
...
Images 48000/50000. Average i/s 41.9211. Cum. acc: 0.7491
Images 48504/50000. Average i/s 41.9232. Cum. acc: 0.7489
Images 49000/50000. Average i/s 41.9431. Cum. acc: 0.7486
Images 49504/50000. Average i/s 41.9518. Cum. acc: 0.7494

We get about 41 images/second with native TensorFlow inference, and you can see the full summary below:

Inference on the ImageNet validation dataset using MAX Engine

Now let's take a look at MAX Engine's performance and accuracy. To use the same model with MAX Engine, you first have to compile the model with MAX Engine, and all it takes is three lines of code:

Python
from max import engine

sess = engine.InferenceSession()
model = sess.load(saved_model_dir)

After that, you just have to replace model(validation_ds) with model.execute(input_1=validation_ds). Note that model.execute returns a dictionary of output tensors keyed by output name, so we read the predictions with pred_prob_max['predictions'].

And the rest of the code remains the same (except for any variable name changes you want to make).
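One useful aside before we run the full loop: if you're ever unsure what a model's input tensor is called (here it's input_1, the default name Keras assigns to the input layer), you can inspect the compiled model's metadata. The snippet below is a sketch based on the MAX Engine Python API docs at the time of writing; double-check the current API reference for the exact attribute names:

Python
# Inspect the compiled model's input and output tensor specs
# (attribute names per the MAX Engine docs at the time of writing)
for tensor in model.input_metadata:
    print(f'input:  name={tensor.name}, shape={tensor.shape}, dtype={tensor.dtype}')
for tensor in model.output_metadata:
    print(f'output: name={tensor.name}, shape={tensor.shape}, dtype={tensor.dtype}')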

Python
### MAX Engine Python API ###
from max import engine

sess = engine.InferenceSession()
model = sess.load(saved_model_dir)

display_every = 500
display_threshold = display_every

pred_labels = []
actual_labels = []
iter_times = []

batch_size = 8
# Get the tf.data.TFRecordDataset object for the ImageNet2012 validation dataset
dataset = get_dataset(batch_size)

walltime_start = time.time()
for i, (validation_ds, batch_labels, _) in enumerate(dataset):
    start_time = time.time()
    pred_prob_max = model.execute(input_1=validation_ds)
    iter_times.append(time.time() - start_time)

    actual_labels.extend(label for label_list in batch_labels.numpy() for label in label_list)
    pred_labels.extend(list(np.argmax(pred_prob_max['predictions'], axis=1)))

    if i * batch_size >= display_threshold:
        avg_throughput = np.mean(batch_size / np.array(iter_times[-display_every:]))
        cum_acc = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)
        print(f'Images {i*batch_size}/50000. Average i/s {avg_throughput:.4f}. Cum. acc: {cum_acc:.4f}')
        display_threshold += display_every

iter_times = np.array(iter_times)
acc_max = np.sum(np.array(actual_labels) == np.array(pred_labels)) / len(actual_labels)

max_results = pd.DataFrame(columns=[f'max_cpu_{batch_size}'])
max_results.loc['user_batch_size'] = [batch_size]
max_results.loc['accuracy'] = [acc_max]
max_results.loc['prediction_time'] = [np.sum(iter_times)]
max_results.loc['wall_time'] = [time.time() - walltime_start]
max_results.loc['images_per_sec_mean'] = [np.mean(batch_size / iter_times)]
max_results.loc['images_per_sec_std'] = [np.std(batch_size / iter_times, ddof=1)]
max_results.loc['latency_mean'] = [np.mean(iter_times) * 1000]
max_results.loc['latency_99th_percentile'] = [np.percentile(iter_times, q=99, interpolation="lower") * 1000]
max_results.loc['latency_median'] = [np.median(iter_times) * 1000]
max_results.loc['latency_min'] = [np.min(iter_times) * 1000]
display(max_results)

Output
Compiling model.
Done!
Images 504/50000. Average i/s 98.4731. Cum. acc: 0.7402
Images 1000/50000. Average i/s 98.8026. Cum. acc: 0.7401
Images 1504/50000. Average i/s 98.8956. Cum. acc: 0.7526
Images 2000/50000. Average i/s 99.0765. Cum. acc: 0.7480
Images 2504/50000. Average i/s 99.4256. Cum. acc: 0.7560
...
Images 48000/50000. Average i/s 93.7445. Cum. acc: 0.7491
Images 48504/50000. Average i/s 93.3451. Cum. acc: 0.7489
Images 49000/50000. Average i/s 93.3041. Cum. acc: 0.7486
Images 49504/50000. Average i/s 93.2523. Cum. acc: 0.7494

This gives us about a 2.4x speedup on this model compared to TensorFlow execution. But we're more interested in accuracy, and you can see that it's identical to the TensorFlow-only results in the previous section.

Conclusion

With MAX Engine, you get huge performance gains over an optimized TensorFlow implementation while maintaining high accuracy! Check out the MAX performance dashboard at performance.modular.com for the speedups MAX can deliver across a range of popular computer vision, natural language, recommender system, and other models. The code example used in this blog post is available on GitHub. Download MAX, try it out, and share your feedback with us!

Until next time!🔥

Additional resources:

Report feedback, including issues, on our Mojo and MAX GitHub tracker

Shashank Prasanna
AI Developer Advocate

Shashank is an engineer, educator, and doodler. He writes and talks about machine learning, specialized machine learning hardware (AI accelerators), and AI infrastructure in the cloud. He previously worked at Meta, AWS, NVIDIA, MathWorks (MATLAB), and Oracle in developer relations and marketing, product management, and software development roles, and holds an M.S. in electrical engineering.