DAPPLE: A pipelined data parallel approach for training large models S Fan, Y Rong, C Meng, Z Cao, S Wang, Z Zheng, C Wu, G Long, J Yang, ... Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of …, 2021 | 229 | 2021 |
Understanding and bridging the gaps in current GNN performance optimizations K Huang, J Zhai, Z Zheng, Y Yi, X Shen Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of …, 2021 | 84 | 2021 |
AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures Z Zheng, X Yang, P Zhao, G Long, K Zhu, F Zhu, W Zhao, X Liu, J Yang, ... Proceedings of the 27th ACM International Conference on Architectural …, 2022 | 63 | 2022 |
Whale: Efficient giant model training over heterogeneous GPUs X Jia, L Jiang, A Wang, W Xiao, Z Shi, J Zhang, X Li, L Chen, Y Li, ... 2022 USENIX Annual Technical Conference (USENIX ATC 22), 673-688, 2022 | 47 | 2022 |
Refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer H Fu, J Liao, W Xue, L Wang, D Chen, L Gu, J Xu, N Ding, X Wang, C He, ... SC'16: Proceedings of the International Conference for High Performance …, 2016 | 44 | 2016 |
Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity H Xia, Z Zheng, Y Li, D Zhuang, Z Zhou, X Qiu, Y Li, W Lin, SL Song arXiv preprint arXiv:2309.10285, 2023 | 40 | 2023 |
VersaPipe: a versatile programming framework for pipelined computing on GPU Z Zheng, C Oh, J Zhai, X Shen, Y Yi, W Chen Proceedings of the 50th Annual IEEE/ACM International Symposium on …, 2017 | 40 | 2017 |
FusionStitching: boosting memory intensive computations for deep learning workloads Z Zheng, P Zhao, G Long, F Zhu, K Zhu, W Zhao, L Diao, J Yang, W Lin arXiv preprint arXiv:2009.10924, 2020 | 33 | 2020 |
DISC: A dynamic shape compiler for machine learning workloads K Zhu, WY Zhao, Z Zheng, TY Guo, PZ Zhao, JJ Bai, J Yang, XY Liu, ... Proceedings of the 1st Workshop on Machine Learning and Systems, 89-95, 2021 | 30 | 2021 |
Optimizing distributed training deployment in heterogeneous GPU clusters X Yi, S Zhang, Z Luo, G Long, L Diao, C Wu, Z Zheng, J Yang, W Lin Proceedings of the 16th International Conference on emerging Networking …, 2020 | 29 | 2020 |
DREW: Efficient Winograd CNN inference with deep reuse R Wu, F Zhang, J Guan, Z Zheng, X Du, X Shen Proceedings of the ACM Web Conference 2022, 1807-1816, 2022 | 18 | 2022 |
Exploring deep reuse in Winograd CNN inference R Wu, F Zhang, Z Zheng, X Du, X Shen Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of …, 2021 | 13 | 2021 |
GOPipe: a granularity-oblivious programming framework for pipelined stencil executions on GPU C Oh, Z Zheng, X Shen, J Zhai, Y Yi Proceedings of the ACM International Conference on Parallel Architectures …, 2020 | 13 | 2020 |
FP6-LLM: Efficiently serving large language models through FP6-centric algorithm-system co-design H Xia, Z Zheng, X Wu, S Chen, Z Yao, S Youn, A Bakhtiari, M Wyatt, ... arXiv preprint arXiv:2401.14112, 2024 | 11 | 2024 |
BladeDISC: Optimizing dynamic shape machine learning workloads via compiler approach Z Zheng, Z Pan, D Wang, K Zhu, W Zhao, T Guo, X Qiu, M Sun, J Bai, ... Proceedings of the ACM on Management of Data 1 (3), 1-29, 2023 | 11 | 2023 |
HiWayLib: A software framework for enabling high performance communications for heterogeneous pipeline computations Z Zheng, C Oh, J Zhai, X Shen, Y Yi, W Chen Proceedings of the Twenty-Fourth International Conference on Architectural …, 2019 | 11 | 2019 |
ZeroQuant(4+2): Redefining LLMs quantization with a new FP6-centric strategy for diverse generative tasks X Wu, H Xia, S Youn, Z Zheng, S Chen, A Bakhtiari, M Wyatt, Y He, ... arXiv preprint arXiv:2312.08583, 2023 | 8 | 2023 |
RECom: A compiler approach to accelerating recommendation model inference with massive embedding columns Z Pan, Z Zheng, F Zhang, R Wu, H Liang, D Wang, X Qiu, J Bai, W Lin, ... Proceedings of the 28th ACM International Conference on Architectural …, 2023 | 7 | 2023 |
Auto-MAP: A DQN framework for exploring distributed execution plans for DNN workloads S Wang, Y Rong, S Fan, Z Zheng, LS Diao, G Long, J Yang, X Liu, W Lin arXiv preprint arXiv:2007.04069, 2020 | 7 | 2020 |
Optimizing DNN compilation for distributed training with joint OP and tensor fusion X Yi, S Zhang, L Diao, C Wu, Z Zheng, S Fan, S Wang, J Yang, W Lin IEEE Transactions on Parallel and Distributed Systems 33 (12), 4694-4706, 2022 | 5 | 2022 |