Music Emotion Recognition (MER) is a challenging task, given the nuances involved in defining emotions. While unimodal models provide a good baseline for MER, multimodal models are becoming fundamental for describing emotions in depth. Leveraging the multimodal MERGE dataset, we investigate how well audio-related deep embeddings, lyrics-informed features, and music-aware cues provide an informative feature set for computationally low-impact learning models. Results confirm that multimodal fusion outperforms unimodal approaches. Moreover, the experiments highlight the positive contribution of genre metadata and the potential of harmonic features for real-time, computationally low-impact applications. These findings confirm the importance of multimodal integration for robust and interpretable emotion recognition systems and open up future directions, including advanced feature fusion, user-specific model adaptation (user-tuning), and multi-label emotion representation.
Novacco, A., Gasparini, F., Rizzi, G., Saibene, A. (2026). Decoding Emotions: Multimodal Integration of Deep Embeddings, Lyrics and Music-Aware Cues. In Artificial Intelligence in Music, Sound, Art and Design - 15th International Conference, EvoMUSART 2026, Held as Part of EvoStar 2026, Toulouse, France, April 8–10, 2026, Proceedings (pp. 367–382). Springer [10.1007/978-3-032-24350-8_24].
Decoding Emotions: Multimodal Integration of Deep Embeddings, Lyrics and Music-Aware Cues
Novacco, A.; Gasparini, Francesca; Rizzi, Giulia; Saibene, Aurora
2026
| File | Size | Format |
|---|---|---|
| Novacco et al-2026-EvoMUSART-VoR.pdf (original article; Publisher’s Version, Version of Record, VoR; all rights reserved; access restricted to archive managers) | 2.18 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


