September 13, 2024

MAX 24.5 - With SOTA CPU Performance for Llama 3.1

We’re excited to announce the release of MAX 24.5, which ships with significant improvements to Llama 3.1 CPU performance, new Python graph API bindings, our biggest update to Mojo ever, industry-standard packaging, and a clarified license. Read on to learn more!

MAX 24.5 marks our final CPU-only release and ships with an improved Llama 3.1 pipeline, featuring up to 45% improved token generation over the 24.4 release*. This improvement is made possible by the addition of the new MAX Driver interface, which gives developers more control over the MAX engine and the accelerators it manages.

In addition to this performance boost, the MAX Llama pipeline has been rebuilt from the ground up on a technology preview of the new Python graph API bindings, bringing the power of MAX directly to Python developers.
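To give a feel for these new bindings, here is a minimal sketch of describing and executing a tiny graph from Python. The module paths and signatures shown (max.graph.Graph, TensorType, ops.add, max.engine.InferenceSession) are assumptions based on the technology preview, so check the MAX documentation for the exact API:

# Illustrative sketch only: module paths and signatures are assumptions
# about the Python graph API technology preview, not verbatim from the docs.
import numpy as np

from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Describe a graph that adds two float32 vectors of length 4.
input_type = TensorType(DType.float32, (4,))
with Graph("vector_add", input_types=(input_type, input_type)) as graph:
    x, y = graph.inputs
    graph.output(ops.add(x, y))

# Compile the graph, then execute it with NumPy inputs.
session = engine.InferenceSession()
model = session.load(graph)
a = np.arange(4, dtype=np.float32)
b = np.ones(4, dtype=np.float32)
print(model.execute(a, b))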

Get started with MAX 24.5 and the Llama 3.1 pipeline today using Magic, our new package manager. Magic delivers MAX and Mojo as a single package and gives you access to thousands of community-built packages for Python and other languages. You can install Magic with a single command from the Modular getting started page.

Once you have Magic installed, run the following commands to experience state-of-the-art Llama 3.1 performance on CPU:

git clone https://github.com/modularml/max.git
cd max/examples/graph-api/pipelines
magic run llama3 --prompt "Why is the sky blue?"

That’s it! Magic automatically installs MAX and the required dependencies, creates an isolated virtual environment, and launches the Llama pipeline.

In addition, this release of MAX ships with several improvements, including:

  • A unified MAX and Mojo package based on industry-standard Conda packaging, with a 30% reduction in download size.
  • MAX now works with your choice of PyTorch, delivering an even more streamlined and interoperable experience.
  • An update to MAX and Mojo’s Community License that outlines the many use cases you're empowered to build and monetize with MAX and Mojo. You can learn more in our licensing FAQ.
  • A new documentation site, including a growing collection of tutorials, examples, and getting started guides.
  • Support for Python 3.12, making MAX more widely available to developers.
  • Our biggest update to Mojo ever, with streamlined language features, performance improvements across the core language, and new standard library APIs that add features for strings, collections, and system interactions. Check out the full release notes to learn about these improvements, and so much more!

MAX 24.5 with Magic is available today! Download it now, experience state-of-the-art LLM performance with Llama 3.1 on CPUs, and get ready for what's coming in the next release.

* Up to 45% performance improvement in Llama 3.1 token generation over MAX 24.4 on an Apple M2 processor (macOS) using Q4_K quantization. Up to 35% improvement on Graviton systems (c7g.16xlarge), and up to 20% improvement on Intel (c6i.16xlarge).

Modular Team