We're excited to announce the availability of Paged Attention and Prefix Caching in MAX Serve, bringing state-of-the-art LLM inference optimizations. These features are available in MAX nightly and the MAX Serve nightly Docker image.
Try them now
To proceed, first make sure the `magic` CLI is installed:
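The installer is a one-line script; the URL below follows the Modular docs, so double-check it against the current documentation:

```sh
# One-line installer for the magic CLI (verify the URL in the current Modular docs)
curl -ssL https://magic.modular.com/ | bash
```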
Or, if you already have it, update it via:
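```sh
# Update an existing magic installation to the latest release
magic self-update
```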
Now install the `max-pipelines` package with a single command:
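```sh
# Install the max-pipelines package into magic's global environment
magic global install max-pipelines
```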
Serve with the optimizations enabled:
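The invocation below is a representative sketch: the model id and the optimization flags are illustrative, so run `max-pipelines serve --help` to confirm the exact names your installed version supports.

```sh
# Illustrative example only: the model id and flag names below are assumptions;
# check `max-pipelines serve --help` for the options in your installed version.
max-pipelines serve \
  --huggingface-repo-id modularai/llama-3.1 \
  --cache-strategy paged \
  --enable-prefix-caching
```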
Check out what’s available with:
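If the `list` subcommand isn’t available in your version, `max-pipelines --help` shows the equivalent.

```sh
# List the models/pipelines supported by the installed max-pipelines version
# (subcommand name is an assumption; see `max-pipelines --help` if it differs)
max-pipelines list
```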
These features are available for Modular's officially supported models, which leverage the highly optimized MAX Graph APIs.
Why do Paged Attention and Prefix Caching matter?
Multi-Head Attention (MHA) is a core building block of modern LLMs, but it is computationally intensive during inference: its cost scales quadratically with sequence length, O(n²), and linearly with batch size, making it particularly demanding for long sequences and large batches. A KV cache mitigates this by storing previously computed Key and Value projections, avoiding redundant computation during autoregressive generation. However, traditional KV caching runs into memory management challenges with long sequences.
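To make the caching step concrete, here is a minimal, framework-free sketch (illustrative only, not MAX's implementation) of a single decoding step that projects only the newest token, appends its K/V to the cache, and attends over everything cached so far:

```python
import numpy as np

# Minimal KV-cache sketch for autoregressive decoding (illustrative only,
# not MAX's implementation). Each step projects only the newest token and
# reuses the cached K/V of all earlier tokens.

d = 64                      # head dimension
k_cache, v_cache = [], []   # grow by one entry per generated token

def decode_step(x_new, W_q, W_k, W_v):
    """x_new: (d,) embedding of the newest token; returns its context vector."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)       # only the new token's K/V are computed
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, d) each
    scores = K @ q / np.sqrt(d)       # attend over all t cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Toy usage: five decoding steps with random weights and embeddings.
rng = np.random.default_rng(0)
W_q, W_k, W_v = rng.normal(size=(3, d, d))
for x in rng.normal(size=(5, d)):
    ctx = decode_step(x, W_q, W_k, W_v)
# Without the cache, every step would recompute K and V for all previous
# tokens instead of just the newest one.
```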
PagedAttention and Prefix Caching address these challenges.
Paged Attention: Memory-efficient KV Cache management
Paged attention, introduced by vLLM, rethinks how the KV cache is organized and managed during inference (see the sketch following the list below):
- Block-based memory management:
- Organizes the KV cache into fixed-size memory blocks (pages)
- Each block typically contains 16 or 32 tokens
- Enables efficient memory allocation and deallocation
- Key benefits:
- Minimal fragmentation: blocks need not be contiguous, eliminating the memory fragmentation of large preallocated caches
- Dynamic sequence management: Efficiently handles variable-length sequences
- Memory pooling: Shares memory across multiple requests
- GPU memory savings: Up to 40% reduction in memory usage
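As a rough illustration of the bookkeeping behind these benefits (a sketch under simplified assumptions, not MAX's or vLLM's actual implementation), the example below maintains a shared pool of fixed-size pages and a per-request block table that maps logical token positions to physical pages:

```python
# Illustrative sketch of paged KV-cache bookkeeping (not MAX's or vLLM's code).
# A shared pool of fixed-size pages is handed out on demand; each sequence keeps
# a "block table" mapping its logical blocks to physical page indices.

BLOCK_SIZE = 16          # tokens per page (16 or 32 are typical)
NUM_PAGES = 1024         # size of the shared physical pool

free_pages = list(range(NUM_PAGES))      # pool shared across all requests
block_tables: dict[str, list[int]] = {}  # sequence id -> physical page indices

def append_token(seq_id: str, position: int) -> tuple[int, int]:
    """Return (physical_page, offset) where this token's K/V should be stored."""
    table = block_tables.setdefault(seq_id, [])
    block_idx, offset = divmod(position, BLOCK_SIZE)
    if block_idx == len(table):          # sequence crossed into a new block
        if not free_pages:
            raise MemoryError("KV cache pool exhausted")
        table.append(free_pages.pop())   # allocate one page at a time; no large
                                         # contiguous reservation is needed
    return table[block_idx], offset

def release(seq_id: str) -> None:
    """When a request finishes, its pages go straight back to the pool."""
    free_pages.extend(block_tables.pop(seq_id, []))

# Example: two concurrent sequences share the same physical pool.
for pos in range(40):
    append_token("request-a", pos)       # uses 3 pages (40 tokens / 16 per page)
for pos in range(10):
    append_token("request-b", pos)       # uses 1 page
release("request-a")                     # 3 pages returned for reuse
```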
Learn more about paged attention in the vLLM post, vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.
Prefix Caching: Optimizing similar prompts
Prefix caching, introduced by SGLang, provides a powerful optimization for structured LLM programs (see the sketch following the list below):
- Core concept:
- Identifies and caches common prefix patterns in text prompts
- Leverages program structure for optimal cache reuse
- Implements intelligent cache management using a prefix tree
- Key advantages:
- Smart prefix detection: Automatically identifies reusable prompt segments
- Program-aware caching: Optimizes for common patterns in LLM applications
- Throughput improvement: Up to 3x speedup for structured workflows
- Resource optimization: Efficient memory usage through structured sharing
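To show the flavor of prefix reuse (again a simplified sketch, not SGLang's RadixAttention implementation), the example below stores prompt tokens in a prefix tree and reports how much of a new prompt's prefill can be skipped:

```python
# Illustrative prefix-tree sketch (not SGLang's actual RadixAttention code).
# Cached prompts are inserted token by token; a new prompt only needs KV
# computation for the tokens past its longest cached prefix.

class PrefixNode:
    def __init__(self):
        self.children: dict[str, "PrefixNode"] = {}

root = PrefixNode()

def insert(tokens: list[str]) -> None:
    """Record a served prompt in the prefix tree."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, PrefixNode())

def cached_prefix_len(tokens: list[str]) -> int:
    """How many leading tokens already have cached KV entries."""
    node, length = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        length += 1
    return length

# A shared system prompt is cached once...
insert("You are a helpful assistant . Summarize :".split())

# ...so a later request that starts the same way skips most of its prefill.
query = "You are a helpful assistant . Summarize : this article".split()
hit = cached_prefix_len(query)
print(f"{hit}/{len(query)} prompt tokens reuse cached KV entries")
```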
Learn more about prefix caching in the SGLang paper, SGLang: Efficient Execution of Structured Language Model Programs.
What’s next?
These improvements reduce GPU memory usage by up to 40% and increase throughput by up to 3x. Here are a few resources to get you started:
- Get started with MAX
- Explore MAX Serve and MAX Container
- Check out the tutorial on how to deploy Llama 3 on GPU with MAX Serve
- Check out the relevant concept page
- Join our Discord and our Modular forum
We're excited to see what you'll build with MAX! Share your projects and experiences with us using #ModularAI on social media.