Text and Voice Integration with Wav2Vec and Specaugment for Improving Accuracy of Speech Emotion Recognition

Soenardi, Edward and Setiabudi, Djoni Haryadi and Sugiarto, Indar (2025) Text and Voice Integration with Wav2Vec and Specaugment for Improving Accuracy of Speech Emotion Recognition. In: The 8th International Conference on Data Science and Its Applications (ICoDSA) 2025, 05-07-2025, Jakarta, Indonesia.

Abstract

Emotions are crucial in understanding the human psychological state and can influence social interactions and decision-making. Integrating emotion processing into computational systems enables the creation of more natural and adaptive interfaces. This research focuses on Speech Emotion Recognition (SER), which aims to identify human emotions through voice signal analysis. Many previous studies have been limited to unimodal approaches, and only a few have explored multimodal approaches that combine voice and text simultaneously. Furthermore, little research directly integrates Automatic Speech Recognition (ASR) for text with voice feature extraction in a single step while also applying data augmentation to improve model generalization. This research contributes to improving accuracy on emotion recognition tasks using the IEMOCAP dataset, with a total of 5,797 voice and text samples spanning the angry, sad, happy, and neutral emotions, by incorporating the Wav2Vec model as a multimodal feature extractor (voice and text) and by applying SpecAugment to enrich data variety. Structurally, our proposed architecture consists of two branches: a voice branch and a text branch. Features extracted from Wav2Vec are sent to the voice branch, which uses an ECAPA-TDNN model, and to the text branch, which uses a BERT model. The two branches are then combined through fully connected layers for final classification. Experiments show that this multimodal approach achieves high performance, namely a weighted accuracy of 90.28% and an unweighted accuracy of 90.62%, while requiring special fine-tuning of the text model. These results indicate that combining multimodal, pretrained, and data augmentation approaches can significantly improve the performance of SER systems.
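Two mechanical pieces of the pipeline described above can be made concrete with short sketches. First, SpecAugment-style masking for data variety; the torchaudio transforms and mask widths below are illustrative stand-ins, not the paper's configuration:

import torch
import torchaudio.transforms as T

# SpecAugment-style masking on a time-frequency representation.
# Mask widths are placeholder values, not the paper's settings.
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),  # mask up to 15 frequency bins
    T.TimeMasking(time_mask_param=35),       # mask up to 35 time frames
)
spec = torch.randn(1, 80, 300)  # dummy (channel, freq bins, frames) input
augmented = augment(spec)

Second, the fusion step: embeddings from the voice and text branches are concatenated and classified with fully connected layers. The embedding sizes assumed here (192 for an ECAPA-TDNN utterance embedding, 768 for a BERT pooled output), hidden width, and dropout are illustrative assumptions, not values taken from the paper:

import torch
import torch.nn as nn

class TwoBranchSER(nn.Module):
    # Fuses a voice-branch embedding with a text-branch embedding
    # through fully connected layers for 4-way emotion classification.
    def __init__(self, voice_dim=192, text_dim=768, hidden=256, n_classes=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(voice_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),  # angry, sad, happy, neutral
        )

    def forward(self, voice_emb, text_emb):
        # Concatenate branch embeddings, then classify.
        return self.classifier(torch.cat([voice_emb, text_emb], dim=-1))

model = TwoBranchSER()
logits = model(torch.randn(8, 192), torch.randn(8, 768))  # batch of 8
print(logits.shape)  # torch.Size([8, 4])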

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: voice signal analysis, speech emotion recognition, multimodal, feature extraction, Wav2Vec
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Industrial Technology > Electrical Engineering Department
Depositing User: Admin
Date Deposited: 20 Sep 2025 17:53
Last Modified: 26 Sep 2025 17:19
URI: https://repository.petra.ac.id/id/eprint/21873
