July 9, 2024

Bring your own PyTorch model

The adoption of AI by enterprises has surged significantly over the last couple of years, particularly with the advent of Generative AI (GenAI) and Large Language Models (LLMs). Most enterprises start by prototyping and building proof-of-concept products (POCs), using all-in-one API endpoints provided by big tech companies like OpenAI and Google, among others. However, as these enterprises transition to full-scale production, many look for ways to take control of their AI infrastructure. That requires the ability to effectively manage and deploy PyTorch.

PyTorch has become the top choice for researchers pushing the boundaries of AI, thanks to its blend of flexibility, Pythonic simplicity, and robust community support. Despite its popularity and success in research environments, PyTorch faces challenges in large-scale production deployments. What makes PyTorch excellent for development—like its Python eager mode—causes difficulties in production settings where resource management, latency targets, and reliability are critical.

While research teams favor Python for its ease of use, deployment teams often use high-performance languages and libraries like C++ and CUDA to optimize models for latency, throughput, and cost efficiency. Despite efforts to streamline the PyTorch deployment process (e.g., TorchScript), a universal method for deploying high-performance PyTorch models at scale remains elusive.

Recently, developers have resorted to building point-solutions to deploy LLMs (that is, use-case-specific solutions such as TRT-LLM and vLLM), which further fragment the industry and increase complexity for everyone.

This ongoing challenge underscores the complexities inherent in bridging the gap between research-driven AI development and robust, scalable AI deployment. But it also describes one of the core reasons we built MAX—the Modular Accelerated Xecution platform.

Simply put, MAX provides the best way to deploy PyTorch into production.

MAX unlocks the full power of PyTorch

MAX provides an inference API backed by a state-of-the-art compiler and inference runtime that works with a variety of models and hardware types, from local laptops to common cloud instances. Importantly, MAX doesn't require that you rewrite your PyTorch models—it meets you where you are now, using your existing model and existing code, with minimal changes to adopt our inference API (available in Python and C). Over time, you can incrementally add more MAX features for more performance, programmability, and portability.

MAX provides the following benefits for PyTorch deployments:

  1. Optimized performance: MAX provides advanced compiler and runtime optimizations that improve your resource efficiency and resource management. With only a few lines of code, MAX accelerates your models, reducing latency, improving user experience, and saving you valuable compute costs and resources. Relative to stock PyTorch, MAX runs PyTorch models up to 5x faster on CPU, depending on the specific workload and hardware. And GPUs are coming soon!
  2. Full compatibility: MAX reduces fragmentation in your workflow by meeting you where you are now, with your existing PyTorch models, tools, and libraries. We’ve taken away the complexity of converting your Python models to high-performance languages. Instead of using brittle model translators, MAX is compatible with the PyTorch and ONNX ecosystems, which means your models work out of the box.
  3. Simple extensibility: MAX allows you to progressively upgrade your AI infrastructure over time, as you want to optimize the performance of your models. For even the most sophisticated AI engineers, performance-tuning AI pipelines is complicated because it involves advanced knowledge of system programming languages, AI hardware, and PyTorch itself. With MAX, you’re able to extend your models with custom operations (ops) written in Mojo—a new programming language that looks like Python and provides the performance of C. Importantly, custom ops written in Mojo are natively portable to any hardware that MAX supports and automatically fuse into the graph, ensuring peak performance.

MAX brings your PyTorch models to their full potential and scales much further to meet the demands of GenAI as it continues to rapidly evolve. MAX seamlessly integrates with your existing PyTorch models, provides an unparalleled boost in performance and efficiency, and is easily customizable, so you can focus on what you do best — innovating and creating.

How to use MAX with PyTorch

As we mentioned earlier, there is no universal way to deploy PyTorch. As such, MAX provides a few different integration points with the PyTorch ecosystem. We detail each below.

MAX Engine for TorchScript Models

If you are using TorchScript, MAX Engine simplifies the process into three easy steps: load, compile, and execute.

Python
from max import engine

# Create a MAX Engine inference session.
session = engine.InferenceSession()

# Load and compile the model with MAX Engine, then execute it.
clip = session.load("clip_vit.torchscript")
outputs = clip.execute(inputs=inputs)
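If you don't already have a TorchScript file, here's a minimal sketch (an assumption, not part of the original example) of one way to produce clip_vit.torchscript: trace a Hugging Face CLIP vision model with torch.jit.trace. The model name and input shape are illustrative, and the `inputs` passed to execute() above are assumed to be the model's preprocessed pixel values.

Python
import torch
from transformers import CLIPVisionModel

# Load a CLIP vision model; torchscript=True makes it return tuples,
# which keeps torch.jit.trace happy. (The model name is an example.)
vision_model = CLIPVisionModel.from_pretrained(
    "openai/clip-vit-base-patch32", torchscript=True
)
vision_model.eval()

# Trace the model with a dummy batch of one 224x224 RGB image, then save it.
example_pixels = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(vision_model, example_pixels)
traced.save("clip_vit.torchscript")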

MAX Engine for ONNX Models

Similarly, for ONNX models, the process is streamlined into load, compile, and execute steps.

Python
from ultralytics import YOLO
from max import engine

# Export the YOLOv8 segmentation model to ONNX.
model = YOLO("yolov8n-seg.pt")
model.export(format="onnx")

# Create a MAX Engine inference session.
session = engine.InferenceSession()

# Load, compile, and execute the model with MAX Engine.
yolo = session.load("yolov8n-seg.onnx")
outputs = yolo.execute(images=input)
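In case you're wondering where `input` comes from, here's a minimal sketch (again, an assumption rather than part of the original example) that builds a placeholder batch matching YOLOv8's default 640x640 input. In practice you'd load and preprocess a real image into the same float32 NCHW layout.

Python
import numpy as np

# Placeholder batch: 1 image, 3 channels, 640x640, values in [0, 1].
# Replace with a real image resized to 640x640 and normalized the same way.
# (Named `input` only to match the snippet above.)
input = np.random.rand(1, 3, 640, 640).astype(np.float32)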

Torch.compile (coming soon!)

And coming soon, we will also provide a MAX backend for PyTorch 2.x’s torch.compile API. Below is how we would optimize a Stable Diffusion model with MAX in this scenario.

Python
import torch
import max
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion pipeline and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

# Compile the UNet with the upcoming MAX backend for torch.compile.
pipe.unet = torch.compile(pipe.unet, backend=max.engine.get_max_backend())

image = pipe("a photo of an astronaut riding a horse on mars").images[0]

Whether you're using cutting-edge PyTorch features such as torch.compile, or staying with more traditional ways of serving your models, such as TorchScript and ONNX, MAX is the best way to deploy your PyTorch workloads.

MAX is free! Download now

By adopting MAX in your enterprise, you can take control of your PyTorch models. Join the growing community of developers who trust MAX for their model optimization needs. Don't let performance limitations hold you back. With MAX, you can elevate your PyTorch and ONNX models, delivering faster and more efficient results.

Get started with MAX today and experience the difference!

Modular Team