Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers | H Huang, Y Chen, Z Wang, R Huang, R Xu, T Wang, L Liu, X Cheng, ... | arXiv preprint arXiv:2312.08168, 2023 | 62 | 2023 |
WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling | S Ji, Z Jiang, W Wang, Y Chen, M Fang, J Zuo, Q Yang, X Cheng, Z Wang, ... | arXiv preprint arXiv:2408.16532, 2024 | 43 | 2024 |
Connecting multi-modal contrastive representations | Z Wang, Y Zhao, H Huang, J Liu, A Yin, L Tang, L Li, Y Wang, Z Zhang, ... | Advances in Neural Information Processing Systems 36, 22099-22114, 2023 | 41 | 2023 |
MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition | X Cheng, T Jin, R Huang, L Li, W Lin, Z Wang, Y Wang, H Liu, A Yin, ... | Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2023 | 25 | 2023 |
WavChat: A survey of spoken dialogue models | S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang, Z Jiang, L Zhou, S Liu, ... | arXiv preprint arXiv:2411.13577, 2024 | 23 | 2024 |
3DRP-Net: 3D relative position-aware network for 3D visual grounding | Z Wang, H Huang, Y Zhao, L Li, X Cheng, Y Zhu, A Yin, Z Zhao | arXiv preprint arXiv:2307.13363, 2023 | 22 | 2023 |
OpenSR: Open-modality speech recognition via maintaining multi-modality alignment | X Cheng, T Jin, L Li, W Lin, X Duan, Z Zhao | arXiv preprint arXiv:2306.06410, 2023 | 22 | 2023 |
Distilling coarse-to-fine semantic matching knowledge for weakly supervised 3D visual grounding | Z Wang, H Huang, Y Zhao, L Li, X Cheng, Y Zhu, A Yin, Z Zhao | Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2023 | 22 | 2023 |
TAVT: Towards transferable audio-visual text generation | W Lin, T Jin, W Pan, L Li, X Cheng, Y Wang, Z Zhao | Proceedings of the 61st Annual Meeting of the Association for Computational …, 2023 | 18 | 2023 |
AV-TranSpeech: Audio-visual robust speech-to-speech translation | R Huang, H Liu, X Cheng, Y Ren, L Li, Z Ye, J He, L Zhang, J Liu, X Yin, ... | arXiv preprint arXiv:2305.15403, 2023 | 18 | 2023 |
Exploring group video captioning with efficient relational approximation | W Lin, T Jin, Y Wang, W Pan, L Li, X Cheng, Z Zhao | Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2023 | 16 | 2023 |
FreeBind: Free lunch in unified multimodal space via knowledge fusion | Z Wang, Z Zhang, X Cheng, R Huang, L Liu, Z Ye, H Huang, Y Zhao, T Jin, ... | arXiv preprint arXiv:2405.04883, 2024 | 13 | 2024 |
Rethinking missing modality learning from a decoding perspective | T Jin, X Cheng, L Li, W Lin, Y Wang, Z Zhao | Proceedings of the 31st ACM International Conference on Multimedia, 4431-4439, 2023 | 13 | 2023 |
OmniBind: Large-scale omni multimodal representation via binding spaces | Z Wang, Z Zhang, H Zhang, L Liu, R Huang, X Cheng, H Zhao, Z Zhao | arXiv preprint arXiv:2407.11895, 2024 | 11 | 2024 |
AudioLCM: Text-to-audio generation with latent consistency models | H Liu, R Huang, Y Liu, H Cao, J Wang, X Cheng, S Zheng, Z Zhao | arXiv preprint arXiv:2406.00356, 2024 | 11 | 2024 |
Extending multi-modal contrastive representations | Z Zhang, Z Wang, L Liu, R Huang, X Cheng, Z Ye, H Liu, H Huang, ... | Advances in Neural Information Processing Systems 37, 91880-91903, 2024 | 10 | 2024 |
ControlSpeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec | S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen, Z Jiang, H Huang, ... | arXiv preprint arXiv:2406.01205, 2024 | 10 | 2024 |
TransFace: Unit-based audio-visual speech synthesizer for talking head translation | X Cheng, R Huang, L Li, T Jin, Z Wang, A Yin, M Li, X Duan, Z Zhao | arXiv preprint arXiv:2312.15197, 2023 | 10 | 2023 |
Weakly-supervised spoken video grounding via semantic interaction learning | Y Wang, W Lin, S Zhang, T Jin, L Li, X Cheng, Z Zhao | Proceedings of the 61st Annual Meeting of the Association for Computational …, 2023 | 8 | 2023 |
Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts | D Fu, X Cheng, X Yang, W Hanting, Z Zhao, T Jin | Proceedings of the 32nd ACM International Conference on Multimedia, 3838-3847, 2024 | 7 | 2024 |