February 6, 2025

Paged Attention & Prefix Caching Now Available in MAX Serve

Ehsan M. Kermani

We're excited to announce the availability of Paged Attention and Prefix Caching in MAX Serve, bringing two state-of-the-art LLM inference optimizations to the platform. Both features are available in MAX nightly and the MAX Serve nightly Docker image.

Try them now

First, make sure the magic CLI is installed:

Bash
curl -ssL https://magic.modular.com/ | bash

Or update an existing installation:

Bash
magic self-update

Now install the max-pipelines package with a single command:

Bash
magic global install max-pipelines

Then serve a model with both optimizations enabled:

Bash
max-pipelines serve \
  --huggingface-repo-id modularai/llama-3.1 \
  --cache-strategy paged \
  --enable-prefix-caching
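
Once the server is up, you can exercise it through its OpenAI-compatible API. Below is a minimal sketch using the openai Python client; the base URL and port assume the server's defaults, so adjust them to match your deployment:

Python
from openai import OpenAI

# MAX Serve exposes an OpenAI-compatible endpoint; localhost:8000 is the
# assumed default here -- change it if your server runs elsewhere.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="modularai/llama-3.1",
    messages=[{"role": "user", "content": "What is paged attention?"}],
)
print(response.choices[0].message.content)
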
Check out everything that's available with:

Bash
max-pipelines serve --help

These features are available for Modular's officially supported models, which leverage the highly optimized MAX Graph APIs.

Why do Paged Attention and Prefix Caching matter?

Multi-Head Attention (MHA) is a core building block of modern LLMs, but it can be computationally intensive during inference: its cost scales quadratically with sequence length, O(n²), and linearly with batch size, making it particularly demanding for long sequences and large batches. The KV cache mitigates this by storing previously computed Key and Value projections, so autoregressive generation projects each new token exactly once instead of recomputing the whole sequence (see the sketch below). However, traditional KV caching faces memory management challenges: reserving one large contiguous region per sequence wastes memory, especially for long or variable-length sequences.
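
To make the KV cache concrete, here is a minimal single-head decoding sketch in plain NumPy. The names and shapes are illustrative only, not MAX internals: each step projects just the newest token and reuses every previously cached Key/Value row rather than reprojecting the whole sequence.

Python
import numpy as np

d = 64  # head dimension
rng = np.random.default_rng(0)
# Illustrative random projection weights for a single attention head.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """One autoregressive step: project only the newest token, then
    attend over all cached keys/values from earlier steps."""
    q = x_new @ Wq                  # query for the new token, shape (d,)
    k_cache.append(x_new @ Wk)      # earlier K/V rows are reused, never recomputed
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)  # shapes (t, d)
    scores = K @ q / np.sqrt(d)     # one new row of the t x t attention matrix
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V                # attention output for the new token

k_cache, v_cache = [], []
for _ in range(8):                  # 8 decode steps
    x_new = rng.standard_normal(d)  # stand-in for the new token's hidden state
    out = decode_step(x_new, k_cache, v_cache)

Without the cache, step t would redo t Key/Value projections; with it, each step does exactly one. The cost is memory: k_cache and v_cache grow with every generated token, which is exactly the management problem the techniques below address.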

Paged Attention and Prefix Caching address these challenges.

Paged Attention: Memory-efficient KV Cache management

Paged attention, introduced by vLLM, revolutionizes how the KV cache is managed during attention computation (a toy sketch follows the list below):

  • Block-based memory management:
    • Organizes KV cache into fixed-size memory blocks (pages)
    • Each block typically contains 16 or 32 tokens
    • Enables efficient memory allocation and deallocation
  • Key benefits:
    • Minimal fragmentation: Fixed-size pages eliminate the waste of reserving one large contiguous region per sequence
    • Dynamic sequence management: Efficiently handles variable-length sequences
    • Memory pooling: Shares memory across multiple requests
    • GPU memory savings: Up to 40% reduction in memory usage
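
To make the block-table idea concrete, here is a toy page allocator in Python. All names (PagedKVCache, append_token) are hypothetical; this sketches the bookkeeping only, not the actual MAX or vLLM implementation:

Python
BLOCK_SIZE = 16  # tokens per page, matching the typical sizes above

class PagedKVCache:
    """Toy allocator: a shared pool of fixed-size pages plus a per-sequence
    block table mapping logical token positions to physical pages."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # shared physical pool
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> page ids
        self.lengths: dict[int, int] = {}             # seq_id -> token count

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a (page, offset) slot for one new token's K/V entries."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                # current page is full: grab one
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Sequence finished: its pages return to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):                  # a 40-token sequence fills 3 pages
    page, offset = cache.append_token(seq_id=0)
cache.release(seq_id=0)              # all 3 pages become reusable at once

Because pages are allocated on demand, a sequence never reserves memory beyond its current length (plus at most one partially filled page), and freed pages are instantly available to other requests.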

Learn more about paged attention in vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.

Prefix Caching: Optimizing similar prompts

Prefix caching, introduced by SGLang, provides a powerful optimization for structured LLM programs (a toy sketch follows the list below):

  • Core concept:
    • Identifies and caches common prefix patterns in text prompts
    • Leverages program structure for optimal cache reuse
    • Implements intelligent cache management in a prefix tree
  • Key advantages:
    • Smart prefix detection: Automatically identifies reusable prompt segments
    • Program-aware caching: Optimizes for common patterns in LLM applications
    • Throughput improvement: Up to 3x speedup for structured workflows
    • Resource optimization: Efficient memory usage through structured sharing
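
Here is a toy token-level prefix tree in Python to illustrate the lookup. The names (PrefixCache, match_prefix) are hypothetical, and SGLang's RadixAttention uses a compressed radix tree rather than a plain trie, but the matching idea is the same:

Python
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}  # token id -> child node

class PrefixCache:
    """Toy prefix tree over token ids: walk the tree to find how many
    leading tokens of a new prompt already have cached K/V entries."""

    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record a processed prompt so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

cache = PrefixCache()
system_prompt = [101, 7, 42, 9]                 # tokens shared across requests
cache.insert(system_prompt + [11, 12])          # first request: full prefill
hit = cache.match_prefix(system_prompt + [99])  # second request: hit == 4
# Only the tokens after the 4-token shared prefix need prefill compute.

In a real server, each tree node would also hold handles to the KV cache pages for its tokens, which is where prefix caching and paged attention compose naturally.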

Learn more about prefix caching in SGLang: Efficient Execution of Structured Language Model Programs.

What’s next?

These improvements reduce GPU memory usage by up to 40% and increase throughput by up to 3x on structured workloads.

We're excited to see what you'll build with MAX! Share your projects and experiences with us using #ModularAI on social media.

Ehsan M. Kermani
AI DevRel

Ehsan is a seasoned Machine Learning Engineer with a decade of experience and a rich background in Mathematics and Computer Science. His expertise lies in developing cutting-edge Machine Learning and Deep Learning systems spanning Natural Language Processing, Computer Vision, Generative AI and LLMs, Time Series Forecasting, and Anomaly Detection, while ensuring proper MLOps practices are in place. Beyond his technical skills, he is passionate about demystifying complex concepts through high-quality, engaging content. His goal is to empower and inspire the developer community through clear, accessible communication and innovative problem-solving. Ehsan lives in Vancouver, Canada.