We're excited to announce the availability of Paged Attention and Prefix Caching in MAX Serve, bringing state-of-the-art LLM inference optimizations. These features are available in MAX nightly and the MAX Serve nightly Docker image.
Try them now
To proceed, first make sure the `magic` CLI is installed:
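The installer is a one-line script; the URL below follows the Modular docs, so double-check it against the current documentation:

```sh
# One-line installer for the magic CLI (verify the URL in the current Modular docs)
curl -ssL https://magic.modular.com/ | bash
```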
Or, if you already have it, update it via:
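```sh
# Update an existing magic installation to the latest release
magic self-update
```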
Now install the `max-pipelines` package with a single command:
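```sh
# Install the max-pipelines package into magic's global environment
magic global install max-pipelines
```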
Serve with the optimizations enabled:
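The invocation below is a representative sketch: the model id and the optimization flags are illustrative, so run `max-pipelines serve --help` to confirm the exact names your installed version supports.

```sh
# Illustrative example only: the model id and flag names below are assumptions;
# check `max-pipelines serve --help` for the options in your installed version.
max-pipelines serve \
  --huggingface-repo-id modularai/llama-3.1 \
  --cache-strategy paged \
  --enable-prefix-caching
```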
Check out what’s available with:
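If the `list` subcommand isn’t available in your version, `max-pipelines --help` shows the equivalent.

```sh
# List the models/pipelines supported by the installed max-pipelines version
# (subcommand name is an assumption; see `max-pipelines --help` if it differs)
max-pipelines list
```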
These features are available for Modular's officially supported models, which leverage the highly optimized MAX Graph APIs.
Why do Paged Attention and Prefix Caching matter?
Multi-Head Attention (MHA) is a core building block of modern LLMs, but it is computationally intensive during inference: its cost scales quadratically with sequence length, O(n²), and linearly with batch size, making it particularly demanding for long sequences and large batches. A KV cache mitigates this by storing previously computed Key and Value projections, avoiding redundant computation during autoregressive generation. However, traditional KV caching runs into memory management challenges with long sequences.
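To make the caching step concrete, here is a minimal, framework-free sketch (illustrative only, not MAX's implementation) of a single decoding step that projects only the newest token, appends its K/V to the cache, and attends over everything cached so far:

```python
import numpy as np

# Minimal KV-cache sketch for autoregressive decoding (illustrative only,
# not MAX's implementation). Each step projects only the newest token and
# reuses the cached K/V of all earlier tokens.

d = 64                      # head dimension
k_cache, v_cache = [], []   # grow by one entry per generated token

def decode_step(x_new, W_q, W_k, W_v):
    """x_new: (d,) embedding of the newest token; returns its context vector."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)       # only the new token's K/V are computed
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, d) each
    scores = K @ q / np.sqrt(d)       # attend over all t cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Toy usage: five decoding steps with random weights and embeddings.
rng = np.random.default_rng(0)
W_q, W_k, W_v = rng.normal(size=(3, d, d))
for x in rng.normal(size=(5, d)):
    ctx = decode_step(x, W_q, W_k, W_v)
# Without the cache, every step would recompute K and V for all previous
# tokens instead of just the newest one.
```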
PagedAttention and Prefix Caching address these challenges.
Paged Attention: Memory-efficient KV Cache management
Paged attention, introduced by vLLM, rethinks how the KV cache is organized and managed during inference (see the sketch following the list below):
- Block-based memory management:
- Organizes the KV cache into fixed-size memory blocks (pages)
- Each block typically contains 16 or 32 tokens
- Enables efficient memory allocation and deallocation
- Key benefits:
- Minimal fragmentation: blocks need not be contiguous, eliminating the memory fragmentation of large preallocated caches
- Dynamic sequence management: Efficiently handles variable-length sequences
- Memory pooling: Shares memory across multiple requests
- GPU memory savings: Up to 40% reduction in memory usage
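As a rough illustration of the bookkeeping behind these benefits (a sketch under simplified assumptions, not MAX's or vLLM's actual implementation), the example below maintains a shared pool of fixed-size pages and a per-request block table that maps logical token positions to physical pages:

```python
# Illustrative sketch of paged KV-cache bookkeeping (not MAX's or vLLM's code).
# A shared pool of fixed-size pages is handed out on demand; each sequence keeps
# a "block table" mapping its logical blocks to physical page indices.

BLOCK_SIZE = 16          # tokens per page (16 or 32 are typical)
NUM_PAGES = 1024         # size of the shared physical pool

free_pages = list(range(NUM_PAGES))      # pool shared across all requests
block_tables: dict[str, list[int]] = {}  # sequence id -> physical page indices

def append_token(seq_id: str, position: int) -> tuple[int, int]:
    """Return (physical_page, offset) where this token's K/V should be stored."""
    table = block_tables.setdefault(seq_id, [])
    block_idx, offset = divmod(position, BLOCK_SIZE)
    if block_idx == len(table):          # sequence crossed into a new block
        if not free_pages:
            raise MemoryError("KV cache pool exhausted")
        table.append(free_pages.pop())   # allocate one page at a time; no large
                                         # contiguous reservation is needed
    return table[block_idx], offset

def release(seq_id: str) -> None:
    """When a request finishes, its pages go straight back to the pool."""
    free_pages.extend(block_tables.pop(seq_id, []))

# Example: two concurrent sequences share the same physical pool.
for pos in range(40):
    append_token("request-a", pos)       # uses 3 pages (40 tokens / 16 per page)
for pos in range(10):
    append_token("request-b", pos)       # uses 1 page
release("request-a")                     # 3 pages returned for reuse
```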
Learn more about paged attention in the vLLM post, vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.
Prefix Caching: Optimizing similar prompts
Prefix caching, introduced by SGLang, provides a powerful optimization for structured LLM programs (see the sketch following the list below):
- Core concept:
- Identifies and caches common prefix patterns in text prompts
- Leverages program structure for optimal cache reuse
- Implements intelligent cache management using a prefix tree
- Key advantages:
- Smart prefix detection: Automatically identifies reusable prompt segments
- Program-aware caching: Optimizes for common patterns in LLM applications
- Throughput improvement: Up to 3x speedup for structured workflows
- Resource optimization: Efficient memory usage through structured sharing
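To show the flavor of prefix reuse (again a simplified sketch, not SGLang's RadixAttention implementation), the example below stores prompt tokens in a prefix tree and reports how much of a new prompt's prefill can be skipped:

```python
# Illustrative prefix-tree sketch (not SGLang's actual RadixAttention code).
# Cached prompts are inserted token by token; a new prompt only needs KV
# computation for the tokens past its longest cached prefix.

class PrefixNode:
    def __init__(self):
        self.children: dict[str, "PrefixNode"] = {}

root = PrefixNode()

def insert(tokens: list[str]) -> None:
    """Record a served prompt in the prefix tree."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, PrefixNode())

def cached_prefix_len(tokens: list[str]) -> int:
    """How many leading tokens already have cached KV entries."""
    node, length = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        length += 1
    return length

# A shared system prompt is cached once...
insert("You are a helpful assistant . Summarize :".split())

# ...so a later request that starts the same way skips most of its prefill.
query = "You are a helpful assistant . Summarize : this article".split()
hit = cached_prefix_len(query)
print(f"{hit}/{len(query)} prompt tokens reuse cached KV entries")
```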
Learn more about prefix caching in the SGLang paper, SGLang: Efficient Execution of Structured Language Model Programs.
What’s next?
These improvements reduce GPU memory usage by up to 40% and increase throughput by up to 3x. Here are a few resources to get you started:
- Get started with MAX
- Explore MAX Serve and MAX Container
- Check out the tutorial on how to deploy Llama 3 on GPU with MAX Serve
- Check out the relevant concept page
- Join our Discord and our Modular forum
We're excited to see what you'll build with MAX! Share your projects and experiences with us using #ModularAI on social media.