FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Published in ICLR, 2025
This paper introduces FlexPrefill, a flexible sparse pre-filling mechanism for large language models that dynamically adapts its sparse attention pattern and computational budget to each input, improving both speed and accuracy in long-sequence inference over prior sparse attention methods.
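As a rough illustration of the idea behind context-aware sparse attention, the sketch below (an assumption-laden simplification, not the paper's actual algorithm) scores mean-pooled key blocks against a query and keeps only the smallest set of blocks whose softmax mass reaches a coverage threshold `gamma`; the function name, block pooling scheme, and threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def select_blocks(q, k, gamma=0.95, block_size=4):
    """Pick the fewest key blocks whose attention mass covers `gamma`.

    q: (d,) query vector; k: (n, d) key matrix.
    Returns sorted block indices to attend to (hypothetical helper,
    loosely inspired by cumulative-attention-based sparsity).
    """
    n_blocks = k.shape[0] // block_size
    # Mean-pool each block of keys into a single representative vector.
    pooled = k[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    # Scaled dot-product scores of the query against each pooled block.
    scores = pooled @ q / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Greedily keep the highest-scoring blocks until mass >= gamma.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, gamma)) + 1]
    return np.sort(keep)
```

A full attention kernel would then compute exact attention only over the selected blocks, skipping the rest; varying `gamma` trades accuracy for sparsity per head and per input.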