Xiuhong Li

Cited by

	All	Since 2019
Citations	452	383
h-index	13	11
i10-index	15	14

Citations per year: 2016: 12, 2017: 20, 2018: 34, 2019: 54, 2020: 48, 2021: 54, 2022: 73, 2023: 67, 2024: 87

Public access

15 articles available
0 articles not available

Based on funding mandates

Co-authors

Yun (Eric) Liang, Professor of EECS, Peking University, ACM Distinguished Scientist. Verified email at pku.edu.cn
Shengen Yan, The Chinese University of Hong Kong. Verified email at ie.cuhk.edu.hk
Xiaolong Xie, Research Engineer, Damo Academy, Alibaba Group. Verified email at alibaba-inc.com
Size Zheng, Peking University. Verified email at pku.edu.cn
Xuechao Wei, Peking University. Verified email at pku.edu.cn

Xiuhong Li

Peking University

Verified email at pku.edu.cn

Research interests: GPGPU, Compiler, Deep Learning


Title, authors, venue	Cited by	Year
Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. X Xie, Y Liang, X Li, Y Wu, G Sun, T Wang, D Fan. Proceedings of the 48th International Symposium on Microarchitecture, 395-406, 2015	82	2015
TGPA: Tile-grained pipeline architecture for low latency CNN inference. X Wei, Y Liang, X Li, CH Yu, P Zhang, J Cong. 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 1-8, 2018	78	2018
A coordinated tiling and batching framework for efficient GEMM on GPUs. X Li, Y Liang, S Yan, L Jia, Y Li. Proceedings of the 24th symposium on principles and practice of parallel …, 2019	59	2019
AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction. S Zheng, R Chen, A Wei, Y Jin, Q Han, L Lu, B Wu, X Li, S Yan, Y Liang. Proceedings of the 49th Annual International Symposium on Computer …, 2022	42	2022
FlashDecoding++: Faster large language model inference on GPUs. K Hong, G Dai, J Xu, Q Mao, X Li, J Liu, K Chen, H Dong, Y Wang. arXiv preprint arXiv:2311.01282, 2023	21	2023
Enabling efficient fast convolution algorithms on GPUs via MegaKernels. L Jia, Y Liang, X Li, L Lu, S Yan. IEEE Transactions on Computers 69 (7), 986-997, 2020	20	2020
Performance-centric register file design for GPUs using racetrack memory. S Wang, Y Liang, C Zhang, X Xie, G Sun, Y Liu, Y Wang, X Li. 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), 25-30, 2016	20	2016
CRAT: Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. X Xie, Y Liang, X Li, Y Wu, G Sun, T Wang, D Fan. IEEE Transactions on Computers 67 (6), 890-897, 2017	17	2017
Efficient kernel management on GPUs. X Li, Y Liang. 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 85-90, 2016	16	2016
Chimera: An analytical optimizing framework for effective compute-intensive operators fusion. S Zheng, S Chen, P Song, R Chen, X Li, S Yan, D Lin, J Leng, Y Liang. 2023 IEEE International Symposium on High-Performance Computer Architecture …, 2023	14	2023
A survey on efficient inference for large language models. Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou, L Wang, Z Yuan, X Li, ... arXiv preprint arXiv:2404.14294, 2024	13	2024
cuMBIR: An efficient framework for low-dose X-ray CT image reconstruction on GPUs. X Li, Y Liang, W Zhang, T Liu, H Li, G Luo, M Jiang. Proceedings of the 2018 International Conference on Supercomputing, 184-194, 2018	13	2018
Efficient kernel management on GPUs. Y Liang, X Li. ACM Transactions on Embedded Computing Systems (TECS) 16 (4), 1-24, 2017	13	2017
Exploring cache bypassing and partitioning for multi-tasking on GPUs. Y Liang, X Li, X Xie. 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 9-16, 2017	12	2017
Neoflow: A flexible framework for enabling efficient compilation for high performance DNN training. S Zheng, R Chen, Y Jin, A Wei, B Wu, X Li, S Yan, Y Liang. IEEE Transactions on Parallel and Distributed Systems 33 (11), 3220-3232, 2021	11	2021
CuLDA: solving large-scale LDA Problems on GPUs. X Xie, Y Liang, X Li, W Tan. Proceedings of the 28th International Symposium on High-Performance Parallel …, 2019	8	2019
CuLDA_CGS: Solving large-scale LDA problems on GPUs. X Xie, Y Liang, X Li, W Tan. Proceedings of the 24th Symposium on Principles and Practice of Parallel …, 2019	6	2019
Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning. C Chen, X Li, Q Zhu, J Duan, P Sun, X Zhang, C Yang. Proceedings of the 29th ACM International Conference on Architectural …, 2024	2	2024
Theoretical linear convergence of deep unfolding network for block-sparse signal recovery. R Fu, Y Liu, X Li. Third International Conference on Computer Science and Communication …, 2022	2	2022
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics. K Hong, G Dai, J Xu, Q Mao, X Li, J Liu, Y Dong, Y Wang. Proceedings of Machine Learning and Systems 6, 148-161, 2024	1	2024
