Updated: November 16, 2024
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi)
Authors: Ofir Press, Noah A. Smith, Mike Lewis
Abstract Summary:
The paper asks how transformer models can extrapolate at inference time to sequences longer than those seen during training. The authors propose Attention with Linear Biases (ALiBi), which biases attention scores with a penalty proportional to the distance between each query-key pair. ALiBi enables efficient extrapolation, matching the perplexity of sinusoidal position embeddings while training faster and using less memory, and its inductive bias toward recency lets it outperform other position methods on the WikiText-103 benchmark.
Key Concepts:
- Extrapolation in Transformers: The ability of a model to perform well on input sequences longer than those seen during training.
- Position Embeddings: Methods to encode positional information into transformer models, traditionally using sinusoidal or learned embeddings.
- Attention with Linear Biases (ALiBi): A new method that biases attention scores with a penalty proportional to the distance between tokens, eliminating the need for positional embeddings and enabling efficient extrapolation.
- Perplexity: A measure of how well a probabilistic model predicts a sample; lower perplexity indicates better performance (see the formula after this list).
- WikiText-103 Benchmark: A dataset used for evaluating language models.
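For reference, perplexity is the exponentiated average negative log-likelihood of the held-out tokens; this is the standard definition, not a formula reproduced from the paper:

```latex
\mathrm{PPL}(x_1, \dots, x_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left( x_i \mid x_{<i} \right) \right)
```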
Problem Statement:
The main problem addressed is how to enable transformer models to extrapolate efficiently at inference time to sequences longer than those encountered during training, without incurring significant additional computation or memory.
Methods and Techniques:
- ALiBi Method: Adds a fixed, non-learned penalty to each attention score, proportional to the distance between the query and key tokens and scaled by a head-specific slope. Because the bias depends only on relative distance, it extends naturally to sequence lengths never seen in training (see the sketch after this list).
- Sinusoidal Position Embeddings: Traditional method where positional information is added to word embeddings using sinusoidal functions.
- Rotary Position Embeddings: Another method that multiplies keys and queries by sinusoidal embeddings at each layer.
- T5 Bias: Modifies attention values by adding a learned, shared bias dependent on the distance between tokens.
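A minimal sketch of the ALiBi bias under the paper's setup (causal attention, head-specific slopes forming the geometric sequence 2^(-8/n), 2^(-16/n), ... for n heads); the function and variable names here are illustrative, not taken from the authors' code:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Fixed (non-learned) per-head bias added to pre-softmax attention scores.

    Returns a tensor of shape (n_heads, seq_len, seq_len) whose entry [h, i, j]
    is -slope_h * (i - j) for keys j at or before query i, and 0 otherwise.
    """
    # Head-specific slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = j - i: zero on the diagonal, increasingly negative for older keys.
    distance = (pos[None, :] - pos[:, None]).clamp(max=0).float()
    return slopes[:, None, None] * distance[None, :, :]

# Usage: the bias is simply added to the scaled dot-product scores, e.g.
#   scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(n_heads, seq_len)
```

Because the penalty is fixed rather than learned, the same bias can be built for any inference-time sequence length, which is what lets a model trained on short sequences attend sensibly over longer ones.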
Key Results:
- ALiBi models trained on shorter sequences can extrapolate to much longer sequences without a significant drop in performance.
- A 1.3 billion parameter model trained on 1024 tokens with ALiBi achieved the same perplexity on 2048-token sequences as a sinusoidal model trained on 2048 tokens, but trained 11% faster and used 11% less memory.
- ALiBi outperforms sinusoidal, rotary, and T5 position methods on the WikiText-103 benchmark, maintaining strong performance even on very long sequences (up to 10,000 tokens).
Contributions and Innovations:
- Efficient Extrapolation: ALiBi allows transformer models to extrapolate efficiently, reducing the need for longer training sequences and saving computational resources.
- Inductive Bias Towards Recency: ALiBi’s bias towards more recent tokens improves performance on tasks where recent context is more relevant.
- Implementation Simplicity: ALiBi adds no learned parameters and can be implemented with only a few lines of changes to existing transformer code, making it easy to adopt (see the sketch after this list).
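As one illustration of that simplicity (a sketch assuming PyTorch 2.x and the hypothetical `alibi_bias` helper above, not the authors' implementation), the bias can be folded into the additive attention mask of an existing attention call:

```python
import torch
import torch.nn.functional as F

def causal_alibi_attention(q, k, v):
    """q, k, v: (batch, n_heads, seq_len, head_dim) tensors."""
    n_heads, seq_len = q.size(1), q.size(2)
    # Standard causal mask: -inf above the diagonal blocks attention to future tokens.
    causal = torch.full((seq_len, seq_len), float("-inf")).triu(diagonal=1)
    # Add the per-head linear distance penalty; broadcasting gives (n_heads, L, L).
    mask = (causal + alibi_bias(n_heads, seq_len)).to(q.device, q.dtype)
    # scaled_dot_product_attention adds a float `attn_mask` to the scores before softmax.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

No positional embeddings are added to the token embeddings at all; the distance penalty inside attention is the only positional signal.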
Future Work:
The authors suggest exploring further improvements in extrapolation efficiency and applying ALiBi to other tasks and models. They also propose combining ALiBi with other recent innovations in transformer models to achieve even better performance.
Applications:
- Language Modeling: Improving the performance and efficiency of large-scale language models.
- Text Generation: Enabling models to generate longer and more coherent text.
- Machine Translation: Applying ALiBi to translation models to handle longer input and output sequences effectively.