February 27, 2025

Modverse #46: MAX 25.1, MAX Builds, and Democratizing AI Compute

Caroline Frasca

We recently introduced MAX 25.1, a major leap forward in AI development. This release enhances agentic and LLM workflows, introduces MAX Builds as a central hub for GenAI models and application recipes, and debuts a new GPU programming interface. Developers can now take advantage of GPU-accelerated embeddings, OpenAI-compatible function calling, structured output generation, and high-performance LLM optimizations like paged attention and prefix caching for improved efficiency.

MAX 25.1 also marks a shift to a nightly release model, giving developers early access to new features, real-time community-driven improvements, and continuous innovation. With offline batch inference, Mojo-powered GPU programming via MAX Graphs, and streamlined deployment from local to cloud, MAX 25.1 delivers improved performance and flexibility. Get started today by exploring MAX Builds, diving into the docs, and joining the community forum!
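
Because MAX Serve exposes OpenAI-compatible endpoints, existing OpenAI client code can be pointed at it with a one-line change. Here's a minimal sketch using the official `openai` Python client; the local base URL, port, and model name are illustrative assumptions and will depend on your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running MAX Serve instance.
# The base URL, port, and model name are assumptions; adjust them to match
# your deployment.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # hypothetical local MAX Serve endpoint
    api_key="EMPTY",                      # placeholder; no real key is needed locally
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model from the recipes below
    messages=[{"role": "user", "content": "In one sentence, what is paged attention?"}],
)
print(response.choices[0].message.content)
```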

Blogs, Tutorials, and Videos

  • Experience faster, smarter AI with MAX 25.1! Prefix caching and paged attention boost LLM performance, offline batch inference reduces latency and load time, and MAX Builds is your go-to hub for GenAI models, recipes, and packages. Get the full details in our recent blog post.
  • Our MAX 25.1 livestream was packed with insights, demos, powerful updates to MAX and Mojo, and a live audience Q&A with Chris Lattner and team.
  • In Community Meeting #13, we discussed Owen's structured async Mojo proposal and highlighted two standout community projects: Brian's EmberJSON for Mojo JSON parsing and Martin's Modo for generating Mojo docs.
  • In part 1 of our “Democratizing AI Compute” series, Chris Lattner explored how novel ideas, backed by highly focused teams, can unlock efficiency breakthroughs.
    • In part 2, he tackled the question, “What exactly is CUDA?”, revealing why DeepSeek and others are bypassing it entirely.
    • Part 3 examined how CUDA has become the dominant force in GPU computing, dissecting the layers of NVIDIA’s strategy.
    • In part 4, Chris addressed the question, “Is CUDA any good?”, digging into the perspectives of frequent users in the GenAI ecosystem.
  • We're thrilled to introduce Paged Attention and Prefix Caching in MAX Serve, delivering cutting-edge optimizations for LLM inference.
  • Chris Lattner delivered a keynote on MAX and spoke on an expert panel on compute at the Democratize Intelligence conference in San Francisco.
  • MAX Builds is now your go-to hub to get started building with MAX, featuring the latest GenAI models supported on both CPU and GPU, community-created packages, and application recipes.
  • Check out all of our new recipes, step-by-step guides for deploying GenAI with MAX:
    • Continuous Chat App With MAX Serve: build a chat application using Llama 3 and MAX by implementing efficient token management with rolling context windows, handling concurrent requests for optimal performance, and containerizing and deploying your application with Docker Compose.
    • Generate Embeddings with MAX Serve: run an OpenAI-compatible embeddings endpoint on MAX Serve with Docker, and generate embeddings with MPNet using the OpenAI Python client (a minimal client-side sketch appears after this list).
    • MAX Serve OpenAI Function Calling: explore LLM function calling with MAX Serve and llama3-8B on both CPU and GPU, demo OpenAI's function calling for interacting with external tools, and run a working example locally (see the function-calling sketch after this list).
    • Offline Inference With MAX: use MAX to run inference with models from Hugging Face and generate text completions using the Llama-3.1 8B model.
    • Build Your Own AI Weather Agent: integrate OpenAI’s function calling with FastAPI and Llama 3.1 to build an interactive app that retrieves real-time data based on user queries.
    • Use Open WebUI With MAX Serve: use MAX Serve to create an OpenAI-compatible endpoint for Llama 3.1, set up Open WebUI for a robust chat interface, explore its RAG and web search capabilities, and configure the setup for multiple users.
    • MAX Serve Multi-Modal Structured Output: run a multimodal vision model with Llama 3.2 Vision and MAX Serve, implement structured output parsing with Pydantic models, convert image analysis into strongly-typed JSON, and leverage MAX Serve’s capabilities for multimodal models, type-safe parsing, and simple deployment with the magic CLI (a Pydantic-based sketch follows the list).
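
To give a flavor of the Generate Embeddings recipe, here's a minimal client-side sketch; the base URL and the specific MPNet checkpoint are assumptions based on the recipe description:

```python
from openai import OpenAI

# Assumes a MAX Serve embeddings endpoint is running locally; the URL and
# MPNet checkpoint below are illustrative, not the recipe's exact values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

result = client.embeddings.create(
    model="sentence-transformers/all-mpnet-base-v2",  # an MPNet model, per the recipe
    input=["MAX Serve exposes an OpenAI-compatible embeddings endpoint."],
)
print(len(result.data[0].embedding))  # dimensionality of the returned vector
```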
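
The function-calling recipe follows OpenAI's tool schema, so a sketch with the standard client looks like this; the `get_weather` tool, endpoint URL, and model identifier are hypothetical stand-ins:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A single external tool described in OpenAI's function-calling schema.
# The get_weather tool is a hypothetical stand-in for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # llama3-8B, as in the recipe
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools,
)

msg = response.choices[0].message
if msg.tool_calls:  # the model decided to call the tool
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)  # tool name plus JSON arguments
```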
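
For the structured output recipe, one common pattern is to derive a JSON schema from a Pydantic model and pass it as the response format. This sketch assumes MAX Serve accepts OpenAI-style `response_format`; the model name, fields, and image-free prompt are simplifications for illustration:

```python
import json
from openai import OpenAI
from pydantic import BaseModel

# A Pydantic model describing the strongly-typed JSON we want back.
# The fields are illustrative, not the recipe's exact schema.
class ImageAnalysis(BaseModel):
    caption: str
    objects: list[str]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # assumed vision model name
    # A real request would attach image content; text-only keeps the sketch short.
    messages=[{"role": "user", "content": "Describe the scene in the attached image."}],
    # Constrain generation to the schema derived from the Pydantic model.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "image_analysis",
            "schema": ImageAnalysis.model_json_schema(),
        },
    },
)

# Validate the constrained output back into a typed object.
analysis = ImageAnalysis.model_validate(json.loads(response.choices[0].message.content))
print(analysis.caption, analysis.objects)
```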

Awesome MAX + Mojo

Open-Source Contributions

If you’ve recently had your first PR merged, message Caroline Frasca in the forum to claim your epic Mojo swag!

Check out the recently merged contributions from our valued community members.

Coming Up

Beyond CUDA: Accelerating GenAI Workloads with Modular’s MAX Engine

Join us on March 4th in San Francisco for an exclusive in-person talk with Chris Lattner, exploring how the MAX Engine accelerates GenAI workloads on both CPUs and GPUs—without relying on CUDA. Chris will break down the next-gen graph compiler and runtime behind MAX, designed to make high-performance AI more accessible across diverse hardware.

Modular at NVIDIA GTC

Planning to attend NVIDIA GTC? Stop by booth #2315 to connect with the Modular team in person! You can also sign up for updates on Modular at GTC: as the conference approaches, we'll email you a full schedule of the live demos at our booth, directions to find it, and details on claiming your epic swag. Live demos will include programming GPUs with Mojo and deploying agent workflows on MAX.

Caroline Frasca
Technical Community Manager