Unified inference serving in the cloud ⚡️

A deployment solution for the MAX Engine that works as a drop-in replacement for your existing server-side inference system.

Quickly embed into your existing applications

Python, C++, and Mojo APIs make it easy to integrate MAX Serve into your client applications and use HTTP/REST or gRPC to communicate with supported inference servers.
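
For illustration, here is a minimal Python client sketch that sends an HTTP inference request to a Triton Inference Server using the `tritonclient` package (installable with `pip install tritonclient[http]`). The model name, tensor names, shapes, and dtypes below are assumptions for a BERT-style model and must match the configuration of whatever model you actually deploy.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton Inference Server on its default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy tokenized input for a BERT-style model; tensor names, shapes,
# and dtypes are assumptions and must match your model's config.
input_ids = np.zeros((1, 128), dtype=np.int32)
attention_mask = np.ones((1, 128), dtype=np.int32)

inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "INT32"),
    httpclient.InferInput("attention_mask", list(attention_mask.shape), "INT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

# Request the model's output tensor (name assumed to be "logits").
outputs = [httpclient.InferRequestedOutput("logits")]

result = client.infer(model_name="bert-base-uncased", inputs=inputs, outputs=outputs)
print(result.as_numpy("logits").shape)
```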

Full framework compatibility

MAX Serving wraps the MAX Engine, so you can deploy models built with any AI framework, including TensorFlow, PyTorch, ONNX, and more.

Maximum inference performance on any platform

Integrate MAX Engine with industry-standard inference servers (e.g., NVIDIA Triton) to get the best performance on x86 and Arm CPUs and NVIDIA GPUs. Maximize throughput with support for dynamic batching, streaming, and ensemble models.

Cost comparison by framework and instance (model: BERT-base-uncased):

Intel Xeon c5.4xlarge: TensorFlow $0.36, PyTorch $0.21, Modular Engine $0.13
AMD EPYC c5a.4xlarge: TensorFlow $0.49, PyTorch $0.45, Modular Engine $0.23
AWS c6g.4xlarge (Graviton2): TensorFlow $0.52, PyTorch $0.32, Modular Engine $0.13
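
In Triton, dynamic batching is enabled through a model's config.pbtxt. As an illustration only, the sketch below writes such a configuration from Python; the model name, repository layout, and batch-size values are assumptions, not recommended settings for any particular deployment.

```python
from pathlib import Path

# Minimal Triton model configuration (config.pbtxt) that turns on dynamic
# batching: requests are grouped into preferred batch sizes, waiting up to
# 100 microseconds to fill a batch. Values here are illustrative only.
config = """
name: "bert-base-uncased"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100
}
"""

# Assumed model-repository layout: <repo>/<model-name>/config.pbtxt
path = Path("model_repository/bert-base-uncased/config.pbtxt")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(config)
```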

Drop into existing serving systems

MAX Serving integrates easily with existing serving systems such as NVIDIA Triton Inference Server, TensorFlow Serving, and KServe. Scale your inference workloads using container infrastructure such as Kubernetes, and deploy models on major cloud providers.

Join the next-generation compute platform with MAX Serving

Get started now

Get started with MAX right now for your AI workloads.

See how you can get up and running with Modular