Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
AVLM is a research project on modality fusion: it integrates visual and speech representations into a pre-trained SpeechLM for expressive speech generation.
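As a rough, illustrative sketch of the fusion idea (not the repository's actual implementation; all class, argument, and dimension names below are assumptions), visual features from a video encoder can be projected into the SpeechLM's embedding space and fused with the speech token embeddings before they enter the language model:

```python
import torch
import torch.nn as nn

class VisualPrefixFusion(nn.Module):
    """Hypothetical example of one fusion strategy: project visual features
    and prepend them as a prefix to the speech token embeddings."""

    def __init__(self, visual_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        # Map video-encoder features into the SpeechLM's hidden dimension.
        self.proj = nn.Linear(visual_dim, lm_dim)

    def forward(self, visual_feats: torch.Tensor, speech_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats:  (batch, n_frames, visual_dim) from a video encoder
        # speech_embeds: (batch, n_tokens, lm_dim) token embeddings from the SpeechLM
        visual_prefix = self.proj(visual_feats)
        # The fused sequence is then fed to the pre-trained SpeechLM as its input embeddings.
        return torch.cat([visual_prefix, speech_embeds], dim=1)
```

Other fusion strategies (e.g., cross-attention between the visual and speech streams) follow the same principle of aligning visual features with the SpeechLM's representation space; see scripts/avlm/ for the pre-training scripts of the fusion variants.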
Repository structure:

AVLM/
├── scripts/                      # main scripts
│   ├── avlm/                     # AVLM pre-training (with different fusion strategies)
│   ├── avlm_avsr/                # fine-tuning AVLM on the AVSR task
│   ├── avlm_emo/                 # fine-tuning AVLM for expressive speech generation
│   └── global.sh                 # config script for paths
└── src/                          # core source code
    ├── data_utils/               # data loading and preprocessing
    ├── models/                   # customized SpiritLM model
    ├── exp/                      # SpiritLM source code
    ├── preprocess/               # video data preprocessing
    └── task/                     # Lightning trainer files (see the sketch below)
        ├── avlm_iemocap_tune.py  # fine-tune AVLM for expressive dialogue generation
        └── train_avlm.py         # pre-train AVLM with different fusion strategies, or fine-tune it for the AVSR task
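As a loose sketch of how a Lightning trainer file under src/task/ might be organized (the class name, batch layout, and hyperparameters here are assumptions, not the repository's actual code):

```python
import torch
import pytorch_lightning as pl

class AVLMPretrainTask(pl.LightningModule):
    """Hypothetical skeleton of a Lightning task wrapping the AVLM model."""

    def __init__(self, model: torch.nn.Module, lr: float = 1e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # Assumes each batch pairs video features with speech tokens and the
        # wrapped model returns an object exposing a .loss attribute.
        loss = self.model(**batch).loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```

The actual entry points for pre-training and fine-tuning are train_avlm.py and avlm_iemocap_tune.py above, launched through the scripts in scripts/ (with paths configured in scripts/global.sh).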
Citation:

@misc{tan2025seeingbelievingemotionawareaudiovisual,
  title={Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation},
author={Weiting Tan and Jiachen Lian and Hirofumi Inaguma and Paden Tomasello and Philipp Koehn and Xutai Ma},
year={2025},
eprint={2508.16188},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.16188},
}