Updated: June 22, 2024

Efficient Memory Management for LLM Serving with PagedAttention

Title: Efficient Memory Management for Large Language Model Serving with PagedAttention

Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica.

Abstract Summary:

The paper proposes PagedAttention, an attention algorithm inspired by virtual memory and paging techniques in operating systems, to address inefficiencies in managing the key-value cache (KV cache) memory of large language models (LLMs). The resulting system, vLLM, achieves near-zero waste in KV cache memory and flexible sharing of the KV cache within and across requests, improving throughput by 2-4x over existing systems such as FasterTransformer and Orca.

Key Concepts:

  1. PagedAttention algorithm
  2. Virtual memory and paging techniques
  3. Key-value cache (KV cache) memory
  4. vLLM serving system
  5. High throughput and low latency
  6. Memory fragmentation
  7. Dynamic memory allocation
  8. Distributed execution
  9. Parallel sampling and beam search
  10. Memory sharing

Problem Statement:

The main problem addressed by this paper is the inefficient management of KV cache memory in large language model serving systems, which leads to significant memory waste, limits batch size, and reduces throughput.
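To make the scale of the problem concrete, here is a rough back-of-the-envelope estimate, written as a small Python sketch and assuming the OPT-13B configuration cited in the paper (40 layers, hidden size 5120, FP16 values):

# Rough estimate of the KV cache footprint per token for OPT-13B.
# Assumed figures: 40 layers, hidden size 5120, FP16 = 2 bytes per element.
num_layers = 40
hidden_size = 5120
bytes_per_element = 2   # FP16
kv_factor = 2           # one key vector and one value vector per layer

bytes_per_token = kv_factor * num_layers * hidden_size * bytes_per_element
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")                 # ~800 KiB
print(f"Per 2048-token request: {bytes_per_token * 2048 / 2**30:.2f} GiB")     # ~1.6 GiB

At roughly 1.6 GiB for a single maximum-length request, only a handful of requests fit on one GPU unless memory is allocated exactly as it is needed, so fragmentation and over-reservation translate directly into smaller batch sizes and lower throughput.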

Methods and Techniques:

  1. PagedAttention Algorithm: Inspired by virtual memory and paging in operating systems, it partitions the KV cache into blocks that need not be stored in contiguous memory, allowing more flexible and efficient memory management (a minimal sketch follows this list).
  2. vLLM System: A serving system built on top of PagedAttention, featuring block-level memory management and preemptive request scheduling, supporting popular LLMs and distributed execution.
  3. Centralized Scheduler: Coordinates the execution of distributed GPU workers, ensuring efficient memory usage and high throughput.
  4. KV Cache Manager: Manages the physical KV cache memory through instructions from the centralized scheduler, allowing dynamic memory allocation and efficient memory sharing.
  5. Copy-on-Write Mechanism: For handling parallel sampling and beam search, allowing shared memory to be copied only when modified.
  6. Fine-Grained Batching and Scheduling: Enables efficient processing of multiple requests with varying input and output lengths without significant queuing delays or memory wastage.
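To make item 1 above concrete, the following is a minimal sketch of block-table bookkeeping in the spirit of PagedAttention; the class name, method names, and block size are illustrative assumptions, not vLLM's actual implementation:

BLOCK_SIZE = 16  # tokens stored per KV block (illustrative)

class KVBlockManager:
    """Maps each request's logical KV blocks to non-contiguous physical blocks."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request_id -> list of physical block ids
        self.num_tokens = {}     # request_id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        n = self.num_tokens.get(request_id, 0)
        if n % BLOCK_SIZE == 0:                    # last block is full, or first token
            table.append(self.free_blocks.pop())   # grab any free physical block
        self.num_tokens[request_id] = n + 1
        return table[-1], n % BLOCK_SIZE

    def release(self, request_id):
        """Return all physical blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

Because a physical block is claimed only when the previous one fills up, the only memory that can be wasted is the unfilled tail of a request's last block, which is the near-zero fragmentation property the paper emphasizes.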

Key Results:

  • vLLM improves LLM serving throughput by 2-4x compared to FasterTransformer and Orca, without affecting model accuracy.
  • The improvements are more pronounced with longer sequences, larger models, and more complex decoding algorithms.
  • vLLM achieves 1.67x to 3.58x higher throughput in translation tasks with shared prefixes compared to Orca.
  • vLLM demonstrates significant memory savings (up to 55.2%) in beam search scenarios and higher throughput in chatbot applications.

Contributions and Innovations:

  • Introduction of PagedAttention, enabling non-contiguous storage of KV cache and reducing memory fragmentation.
  • Development of vLLM, a high-throughput distributed LLM serving system with near-zero memory waste.
  • Implementation of effective memory sharing techniques for parallel sampling and beam search, significantly reducing memory usage (see the copy-on-write sketch after this list).
  • Enhanced scheduling and preemption strategies to handle variable input and output lengths efficiently.
  • Support for various decoding algorithms and mixed decoding methods within the same batch, increasing overall throughput.
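As a sketch of the copy-on-write sharing mentioned in the third bullet (again with illustrative names, not vLLM's code): sibling sequences produced by parallel sampling or beam search share the physical blocks holding their common prefix via reference counts, and a block is copied only when a sequence writes into a block that is still shared:

class SharedBlockPool:
    """Reference-counted physical KV blocks with copy-on-write semantics."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}   # physical block id -> number of sequences sharing it

    def allocate(self):
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, parent_blocks):
        """A new sibling sequence shares its parent's blocks; nothing is copied."""
        for block in parent_blocks:
            self.ref_count[block] += 1
        return list(parent_blocks)

    def write(self, blocks, idx):
        """Before writing into blocks[idx], copy it if another sequence still shares it."""
        block = blocks[idx]
        if self.ref_count[block] > 1:        # shared: copy on write
            self.ref_count[block] -= 1
            blocks[idx] = self.allocate()    # caller also copies the KV data into the new block
        return blocks[idx]

In beam search this sharing extends beyond the prompt to blocks produced during decoding, which is where the memory savings of up to 55.2% reported above come from.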

Future Work:

The authors suggest exploring further optimizations in kernel-level operations, expanding support for more complex LLMs and additional decoding algorithms, and improving the system’s adaptability to different hardware configurations.

Applications:

  1. Programming Assistants: Enhancing coding efficiency by generating multiple code suggestions in parallel.
  2. Chatbots: Providing more responsive and cost-effective conversational agents by improving memory management and throughput.
  3. Machine Translation: Offering high-quality translations with complex decoding methods like beam search, benefiting from shared prefix techniques.
  4. Content Generation: Enabling efficient generation of long and complex text outputs for applications in marketing, entertainment, and education.
  5. Cloud Services: Reducing operational costs and improving performance for cloud-based LLM services, facilitating wider adoption and scalability.

Relevant Links:

  • Paper: https://arxiv.org/abs/2309.06180
  • vLLM project: https://github.com/vllm-project/vllm