Today, we're excited to announce the release of MAX 25.1, marking a significant evolution in our approach to delivering cutting-edge AI development tools to our community. This release substantially improves the developer experience for Agentic and LLM workflows, introduces a new nightly release model that includes a new GPU programming interface, and launches MAX Builds - your one-stop destination for GenAI development.
MAX Builds
We're thrilled to announce MAX Builds, your comprehensive hub for GenAI models, application recipes, and community-driven packages. While we're still expanding our coverage, MAX Builds serves as your go-to resource for:
- Models optimized for both CPU and GPU deployment
- Advanced quantization schemes
- Complete workflow recipes
- Community-contributed packages
- Simple, intelligent packaging in magic
This new platform represents our commitment to making AI development more accessible and efficient for developers at every skill level. Our goal is to keep making AI simpler and more approachable, covering everything from local to cloud deployment with a single stack that delivers incredible GenAI performance.

Come take a look, and tell us which models you want to see in MAX Builds!
MAX 25.1
MAX 25.1 introduces groundbreaking improvements across several key areas:
Enhanced Agent and RAG Capabilities
The release brings substantial quality-of-life improvements that streamline the development process:
- New GPU-accelerated mpnet2 model: Access the `/v1/embeddings` API endpoint to quickly generate embeddings for your RAG and agent applications.
- OpenAI-compatible function calling API: Leverage an industry-standard approach to interface MAX models with external services.
- Structured Output Generation: Guarantee that your LLM-generated responses conform to any API specification, perfect for agent workloads.
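Because the endpoints are OpenAI-compatible, requests use the standard OpenAI wire format. Below is a minimal sketch of the request payloads for embeddings and function calling; the endpoint URL and model identifiers are placeholders, not values prescribed by MAX.

```python
import json

# Hypothetical local serving endpoint (adjust host/port for your deployment).
BASE_URL = "http://localhost:8000/v1"

# 1) Embeddings request for a GPU-accelerated MPNet-style model.
embedding_request = {
    "model": "sentence-transformers/all-mpnet-base-v2",  # assumed model id
    "input": ["What is paged attention?", "How do I enable prefix caching?"],
}

# 2) Chat completion using OpenAI-style function calling ("tools").
chat_request = {
    "model": "my-max-model",  # placeholder model id
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

# Both payloads are plain JSON, ready to POST to f"{BASE_URL}/embeddings"
# or f"{BASE_URL}/chat/completions" with any HTTP client.
print(json.dumps(chat_request["tools"][0]["function"]["name"]))
```

Any OpenAI client library can be pointed at the MAX server's base URL and will send requests of exactly this shape.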
High-Performance LLM Workflows
We've implemented two significant performance enhancements:
- Support for paged attention: By enabling larger cache sizes during token generation, paged attention delivers up to a 5% improvement in token generation performance. It also improves memory efficiency, allowing for longer context lengths.
- A preview of prefix caching: By caching the key-value (KV) computation of existing inference requests, new queries can reuse the context encoded in the KV cache, eliminating redundant computation and improving performance for workloads with repeated prefixes. In workloads with repetitive inputs, prefix caching can commonly improve throughput by 30%.
You can read more about how to enable paged attention and prefix caching in the docs.
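To make the prefix-caching idea concrete, here is a toy sketch that memoizes work keyed by token prefixes. The real system caches GPU KV-cache pages; this stand-in just counts how many tokens actually need processing when a new request shares a prefix with an earlier one. All names here are illustrative, not MAX APIs.

```python
class PrefixCache:
    """Toy prefix cache: maps token prefixes to a simulated encoder state."""

    def __init__(self):
        self._cache = {}  # tuple of tokens -> simulated KV state

    def longest_cached_prefix(self, tokens):
        """Return the longest prefix of `tokens` already in the cache."""
        for end in range(len(tokens), 0, -1):
            prefix = tuple(tokens[:end])
            if prefix in self._cache:
                return prefix
        return ()

    def encode(self, tokens):
        """Encode tokens, reusing any cached prefix computation."""
        prefix = self.longest_cached_prefix(tokens)
        state = self._cache.get(prefix, [])
        computed = 0
        for tok in tokens[len(prefix):]:
            state = state + [tok]        # stand-in for a KV-cache update
            computed += 1
            self._cache[tuple(state)] = state
        return state, computed           # computed = tokens actually processed


cache = PrefixCache()
_, cold = cache.encode([1, 2, 3, 4])        # cold: processes all 4 tokens
_, warm = cache.encode([1, 2, 3, 4, 5, 6])  # warm: only the 2 new tokens
print(cold, warm)  # 4 2
```

The second request reuses the first request's cached prefix and only processes the two new tokens, which is the source of the throughput gains on repetitive workloads.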
Offline Batch Inference
MAX 25.1 supports offline batched inference for LLM workflows, allowing you to load a model and run inference directly from Python. Offline batch inference improves performance by grouping requests together and removing the latency associated with HTTP requests. Expected performance depends on the exact workload, with smaller batch jobs demonstrating 12% higher throughput relative to vLLM due to improved latency and model load time.
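The pattern looks roughly like the sketch below: load the model once in-process, then submit all prompts as a single batch instead of issuing one HTTP request per prompt. The function and model names are illustrative stand-ins, not the MAX Python API.

```python
def generate_batch(model, prompts):
    """Run all prompts through an in-process model in one batched call.

    A real engine would batch these into a single forward pass; the point
    is that there is no per-prompt HTTP round trip.
    """
    return [model(p) for p in prompts]


# Stand-in "model" for illustration: upper-cases its input.
model = str.upper

prompts = ["hello", "offline batching"]
results = generate_batch(model, prompts)
print(results)  # ['HELLO', 'OFFLINE BATCHING']
```

Loading the model once and amortizing it over the whole batch is where the latency and model-load-time savings come from.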
GPU Programming with Mojo via MAX Graphs
MAX is built on an exciting GPU programming interface that abstracts away the underlying hardware. MAX 25.1 introduces a new Custom Ops API that allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. Take a deep dive into the latest GPU programming examples on GitHub!
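Conceptually, a custom op is a named kernel registered with the engine and then invoked as a node in a graph. The Python toy below mirrors that shape — a registry of ops plus a graph that dispatches them — and is only an illustration of the idea, not the MAX Custom Ops API or Mojo syntax.

```python
# Registry of named operations, standing in for engine-registered kernels.
OPS = {}

def register_op(name):
    """Decorator that registers a function as a named custom op."""
    def wrap(fn):
        OPS[name] = fn
        return fn
    return wrap

@register_op("scaled_add")
def scaled_add(x, y, scale=1.0):
    """Custom elementwise op: x + scale * y."""
    return [a + scale * b for a, b in zip(x, y)]

def run_graph(graph, x, y):
    """Execute a 'graph': a sequence of (op_name, kwargs) pairs."""
    for name, kwargs in graph:
        x = OPS[name](x, y, **kwargs)
    return x

result = run_graph([("scaled_add", {"scale": 2.0})], [1.0, 2.0], [10.0, 20.0])
print(result)  # [21.0, 42.0]
```

In MAX, the registered kernel is written in Mojo and the engine handles compiling and dispatching it to CPU or GPU; the composability comes from the same registration-then-graph-node structure sketched here.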
Accelerating Innovation with MAX Nightlies
MAX 25.1 represents more than feature improvements; it marks a fundamental shift in our delivery approach. We're moving to a nightly-first model that emphasizes continuous innovation:
- Immediate Feature Access: New capabilities will be available in MAX nightlies as soon as they're ready. New models, pipeline optimizations, and APIs reach the nightlies first, with documentation and recipes available to get you started with each new feature.
- Community-Driven Development: Early access enables real-time feedback from our community, helping shape features during development. Follow along with the development of the GPU programming and other new APIs in the MAX nightlies!
The MAX GitHub repository, release packages, and Docker images will now default to nightly builds, ensuring you can always access the latest improvements.
The year of MAX
This release marks an exciting start to 2025 for the MAX Platform. Whether you're building groundbreaking AI applications, contributing to our growing ecosystem, or exploring the possibilities of GenAI, MAX 25.1 provides the tools and infrastructure you need to succeed.
Ready to dive in? Visit our documentation to get started with MAX 25.1, explore MAX Builds, and join our community discussion on Discourse.
Come build with us!
