About me

Hi👋! My name is Xunhao Lai (赖勋豪). I am a Master’s student at the School of Intelligence Science and Technology at Peking University. Before that, I was an undergraduate at Yuanpei College, Peking University.

My research focuses on natural language processing and large language models. In particular, I work on long-context models: designing efficient sparse attention mechanisms and improving the efficiency of model training and inference.

Publications

[ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou

arXiv GitHub

FlexPrefill is a flexible sparse pre-filling mechanism for LLMs that dynamically adjusts its sparse attention patterns in real time, improving both speed and accuracy in long-sequence inference.
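For intuition, the core idea can be sketched as threshold-based index selection: keep only the key positions whose attention mass is needed to reach a coverage threshold. The sketch below is a generic illustration under that assumption, not the released FlexPrefill code; the name `gamma` and the tensor shapes are placeholders.

```python
import torch

def select_indices_by_cumulative_attention(scores, gamma=0.95):
    """Keep the smallest set of key positions whose softmax mass
    reaches `gamma` per query (illustrative of threshold-based
    sparse index selection; not the released FlexPrefill code).

    scores: (num_queries, num_keys) raw attention logits.
    Returns a boolean mask over keys for each query.
    """
    probs = torch.softmax(scores, dim=-1)
    sorted_probs, order = probs.sort(dim=-1, descending=True)
    cum = sorted_probs.cumsum(dim=-1)
    # Keep every position up to and including the first one that
    # pushes the cumulative attention mass past gamma.
    keep_sorted = (cum - sorted_probs) < gamma
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(-1, order, keep_sorted)
    return mask
```

The actual mechanism makes this decision per attention head and pairs it with efficient sparse kernels; see the paper and repository for details.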

Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang

arXiv GitHub

Fira is a training framework for LLMs that achieves full-rank training performance while retaining the memory efficiency of low-rank methods, in both pre-training and fine-tuning.

Model Merging in Pre-training of Large Language Models

Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Deyi Liu, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, LingJun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Siyuan Qiao, Liang Xiang, Yonghui Wu

arXiv

This paper comprehensively investigates model merging in pre-training, introducing PMA (Pre-trained Model Average), which averages checkpoints trained with a constant learning rate. Merging improves performance, and the paper provides ablation-driven insights into when and how to merge.
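Since model merging here is essentially weight averaging, a minimal sketch helps make the idea concrete. This is generic PyTorch state-dict averaging under my own assumptions (uniform weights, illustrative checkpoint paths), not the paper's implementation:

```python
import torch

def merge_checkpoints(paths, weights=None):
    """Average model parameters from several checkpoints.

    Sketch of a uniform average over checkpoints saved during a
    constant-learning-rate phase of training; `paths` and
    `weights` are placeholders, not the paper's setup.
    """
    if weights is None:
        weights = [1.0 / len(paths)] * len(paths)
    merged = None
    for path, w in zip(paths, weights):
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: w * v.float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += w * v.float()
    return merged

# Hypothetical usage: average three consecutive checkpoints.
# merged = merge_checkpoints(["ckpt_10000.pt", "ckpt_11000.pt", "ckpt_12000.pt"])
# model.load_state_dict(merged)
```

One intuition for the gains is that averaging checkpoints acts like a cheap ensemble in weight space, similar in spirit to stochastic weight averaging.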

Open-Source Projects

native-sparse-attention-triton

GitHub

Implemented DeepSeek's Native Sparse Attention (NSA) kernel in Triton, providing flexible and efficient sparse attention training code.

FlexPrefill

GitHub

Implemented the FlexPrefill long-context inference acceleration algorithm, offering a flexible and efficient acceleration solution for long-context LLMs.

ring-sliding-window-attention

GitHub

Implemented the Ring Attention algorithm for Sliding Window Attention, enabling context-parallel training for long sequences.

Contact

E-mail: laixunhao@pku.edu.cn

Address: Peking University, Beijing, China