FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Published in ICLR, 2025
This paper introduces FlexPrefill, a flexible sparse pre-filling mechanism for large language models that dynamically adapts its sparse attention pattern and computational budget to each input, improving both speed and accuracy in long-sequence inference over prior sparse attention methods.
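As a rough illustration of the idea behind context-aware sparse attention, the sketch below (an assumption-laden simplification, not the paper's actual algorithm) scores mean-pooled key blocks against a query and keeps only the smallest set of blocks whose softmax mass reaches a coverage threshold `gamma`; the function name, block pooling scheme, and threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def select_blocks(q, k, gamma=0.95, block_size=4):
    """Pick the fewest key blocks whose attention mass covers `gamma`.

    q: (d,) query vector; k: (n, d) key matrix.
    Returns sorted block indices to attend to (hypothetical helper,
    loosely inspired by cumulative-attention-based sparsity).
    """
    n_blocks = k.shape[0] // block_size
    # Mean-pool each block of keys into a single representative vector.
    pooled = k[: n_blocks * block_size].reshape(n_blocks, block_size, -1).mean(axis=1)
    # Scaled dot-product scores of the query against each pooled block.
    scores = pooled @ q / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Greedily keep the highest-scoring blocks until mass >= gamma.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, gamma)) + 1]
    return np.sort(keep)
```

A full attention kernel would then compute exact attention only over the selected blocks, skipping the rest; varying `gamma` trades accuracy for sparsity per head and per input.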