Title | Authors | Venue | Cited by | Year |
Exploring the limits of transfer learning with a unified text-to-text transformer | C Raffel, N Shazeer, A Roberts, K Lee, S Narang, M Matena, Y Zhou, W Li, ... | The Journal of Machine Learning Research 21 (1), 5485-5551, 2020 | 11557 | 2020 |
Deep speech 2: End-to-end speech recognition in english and mandarin | D Amodei, S Ananthanarayanan, R Anubhai, J Bai, E Battenberg, C Case, ... | International conference on machine learning, 173-182, 2016 | 3412 | 2016 |
Palm: Scaling language modeling with pathways | A Chowdhery, S Narang, J Devlin, M Bosma, G Mishra, A Roberts, ... | arXiv preprint arXiv:2204.02311, 2022 | 2112 | 2022 |
Mixed precision training | P Micikevicius, S Narang, J Alben, G Diamos, E Elsen, D Garcia, ... | arXiv preprint arXiv:1710.03740, 2017 | 1483 | 2017 |
Llama 2: Open foundation and fine-tuned chat models | H Touvron, L Martin, K Stone, P Albert, A Almahairi, Y Babaei, ... | arXiv preprint arXiv:2307.09288, 2023 | 902 | 2023 |
Deep voice 3: Scaling text-to-speech with convolutional sequence learning | W Ping, K Peng, A Gibiansky, SO Arik, A Kannan, S Narang, J Raiman, ... | arXiv preprint arXiv:1710.07654, 2017 | 795* | 2017 |
Scaling instruction-finetuned language models | HW Chung, L Hou, S Longpre, B Zoph, Y Tay, W Fedus, Y Li, X Wang, ... | arXiv preprint arXiv:2210.11416, 2022 | 789 | 2022 |
Deep learning scaling is predictable, empirically | J Hestness, S Narang, N Ardalani, G Diamos, H Jun, H Kianinejad, ... | arXiv preprint arXiv:1712.00409, 2017 | 517 | 2017 |
Self-consistency improves chain of thought reasoning in language models | X Wang, J Wei, D Schuurmans, Q Le, E Chi, S Narang, A Chowdhery, ... | arXiv preprint arXiv:2203.11171, 2022 | 353 | 2022 |
Exploring sparsity in recurrent neural networks | S Narang, E Elsen, G Diamos, S Sengupta | arXiv preprint arXiv:1704.05119, 2017 | 328 | 2017 |
DSD: regularizing deep neural networks with dense-sparse-dense training flow | S Han, J Pool, S Narang, H Mao, S Tang, E Elsen, B Catanzaro, J Tran, ... | | 303* | 2016 |
Exploring the limits of transfer learning with a unified text-to-text transformer | A Roberts, C Raffel, K Lee, M Matena, N Shazeer, PJ Liu, S Narang, W Li, ... | | 230 | 2019 |
Byt5: Towards a token-free future with pre-trained byte-to-byte models | L Xue, A Barua, N Constant, R Al-Rfou, S Narang, M Kale, A Roberts, ... | Transactions of the Association for Computational Linguistics 10, 291-306, 2022 | 220 | 2022 |
Block-sparse recurrent neural networks | S Narang, E Undersander, G Diamos | arXiv preprint arXiv:1711.02782, 2017 | 137 | 2017 |
Wt5?! training text-to-text models to explain their predictions | S Narang, C Raffel, K Lee, A Roberts, N Fiedel, K Malkan | arXiv preprint arXiv:2004.14546, 2020 | 136 | 2020 |
Scaling Up Models and Data with t5x and seqio | A Roberts, HW Chung, A Levskaya, G Mishra, J Bradbury, D Andor, ... | arXiv preprint arXiv:2203.17189, 2022 | 82 | 2022 |
Do transformer modifications transfer across implementations and applications? | S Narang, HW Chung, Y Tay, W Fedus, T Fevry, M Matena, K Malkan, ... | arXiv preprint arXiv:2102.11972, 2021 | 73 | 2021 |
Scale efficiently: Insights from pre-training and fine-tuning transformers | Y Tay, M Dehghani, J Rao, W Fedus, S Abnar, HW Chung, S Narang, ... | arXiv preprint arXiv:2109.10686, 2021 | 70 | 2021 |
Scaling instruction-finetuned language models | HW Chung, L Hou, S Longpre, B Zoph, Y Tay, W Fedus, E Li, X Wang, ... | arXiv preprint arXiv:2210.11416, 2022 | 51 | 2022 |
Scaling laws vs model architectures: How does inductive bias influence scaling? | Y Tay, M Dehghani, S Abnar, HW Chung, W Fedus, J Rao, S Narang, ... | arXiv preprint arXiv:2207.10551, 2022 | 34 | 2022 |