Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
AVLM is a research project on modality fusion: it integrates visual and speech representations into a pre-trained SpeechLM for expressive speech generation.
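As a rough, illustrative sketch of the fusion idea (not the repository's actual implementation; all class, argument, and dimension names below are assumptions), visual features from a video encoder can be projected into the SpeechLM's embedding space and fused with the speech token embeddings before they enter the language model:

```python
import torch
import torch.nn as nn

class VisualPrefixFusion(nn.Module):
    """Hypothetical example of one fusion strategy: project visual features
    and prepend them as a prefix to the speech token embeddings."""

    def __init__(self, visual_dim: int = 768, lm_dim: int = 4096):
        super().__init__()
        # Map video-encoder features into the SpeechLM's hidden dimension.
        self.proj = nn.Linear(visual_dim, lm_dim)

    def forward(self, visual_feats: torch.Tensor, speech_embeds: torch.Tensor) -> torch.Tensor:
        # visual_feats:  (batch, n_frames, visual_dim) from a video encoder
        # speech_embeds: (batch, n_tokens, lm_dim) token embeddings from the SpeechLM
        visual_prefix = self.proj(visual_feats)
        # The fused sequence is then fed to the pre-trained SpeechLM as its input embeddings.
        return torch.cat([visual_prefix, speech_embeds], dim=1)
```

Other fusion strategies (e.g., cross-attention between the visual and speech streams) follow the same principle of aligning visual features with the SpeechLM's representation space; see scripts/avlm/ for the pre-training scripts of the fusion variants.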
Repository structure:

AVLM/
├── scripts/                      # main scripts
│   ├── avlm/                     # AVLM pre-training (with different fusion strategies)
│   ├── avlm_avsr/                # fine-tuning AVLM on the AVSR task
│   ├── avlm_emo/                 # fine-tuning AVLM for expressive speech generation
│   └── global.sh                 # config script for paths
└── src/                          # core source code
    ├── data_utils/               # data loading and preprocessing
    ├── models/                   # customized SpiritLM model
    ├── exp/                      # SpiritLM source code
    ├── preprocess/               # video data preprocessing
    └── task/                     # Lightning trainer files (see the sketch below)
        ├── avlm_iemocap_tune.py  # fine-tune AVLM for expressive dialogue generation
        └── train_avlm.py         # pre-train AVLM with different fusion strategies, or fine-tune it for the AVSR task
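As a loose sketch of how a Lightning trainer file under src/task/ might be organized (the class name, batch layout, and hyperparameters here are assumptions, not the repository's actual code):

```python
import torch
import pytorch_lightning as pl

class AVLMPretrainTask(pl.LightningModule):
    """Hypothetical skeleton of a Lightning task wrapping the AVLM model."""

    def __init__(self, model: torch.nn.Module, lr: float = 1e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        # Assumes each batch pairs video features with speech tokens and the
        # wrapped model returns an object exposing a .loss attribute.
        loss = self.model(**batch).loss
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```

The actual entry points for pre-training and fine-tuning are train_avlm.py and avlm_iemocap_tune.py above, launched through the scripts in scripts/ (with paths configured in scripts/global.sh).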
Citation:

@misc{tan2025seeingbelievingemotionawareaudiovisual,
  title={Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation},
author={Weiting Tan and Jiachen Lian and Hirofumi Inaguma and Paden Tomasello and Philipp Koehn and Xutai Ma},
year={2025},
eprint={2508.16188},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.16188},
}