Title
Multi-modal Emotion Recognition using Speech Features and Text Embedding
Authors
Ju-Hee Kim (김주희); Seok-Pil Lee (이석필)
DOI
https://doi.org/10.5370/KIEE.2021.70.1.108
Keywords
Speech emotion recognition; Emotion recognition; Multi-modal emotion recognition; Deep learning
Abstract
Many studies have conducted emotion recognition using audio signals, as they are easy to collect; however, the accuracy is lower than that of methods using facial images or video signals. In this paper, we propose an emotion recognition method that uses speech signals and text simultaneously to achieve better performance. For training, we generate 43 feature vectors, such as MFCCs, spectral features, and harmonic features, from the audio data. In addition, 256-dimensional embedding vectors are extracted from the text data using a pretrained Tacotron encoder. The feature vectors and text embedding vectors are fed into separate LSTM layers followed by fully connected layers, each of which produces a probability distribution over the predicted output classes. By averaging the two results, each utterance is assigned to one of four emotion categories: anger, happiness, sadness, or neutrality. Our proposed model outperforms previous state-of-the-art methods on a Korean emotional speech dataset.
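The late-fusion architecture described in the abstract (two per-modality LSTM branches whose class distributions are averaged) can be sketched as below. This is a minimal illustration, not the authors' implementation: the hidden size, the use of the final LSTM hidden state, and the random inputs are assumptions; only the input dimensions (43 speech features, 256-dimensional text embeddings) and the four output classes come from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalEmotionNet(nn.Module):
    """Late-fusion sketch: one LSTM branch per modality,
    class probabilities averaged at the output."""

    def __init__(self, n_speech_feats=43, n_text_emb=256,
                 hidden=128, n_classes=4):  # hidden size is an assumption
        super().__init__()
        # Speech branch: 43 frame-level features (MFCC, spectral, harmonic)
        self.speech_lstm = nn.LSTM(n_speech_feats, hidden, batch_first=True)
        self.speech_fc = nn.Linear(hidden, n_classes)
        # Text branch: 256-dim embeddings (e.g. from a pretrained Tacotron encoder)
        self.text_lstm = nn.LSTM(n_text_emb, hidden, batch_first=True)
        self.text_fc = nn.Linear(hidden, n_classes)

    def forward(self, speech, text):
        # Use the final hidden state of each branch as its utterance summary
        _, (h_speech, _) = self.speech_lstm(speech)
        _, (h_text, _) = self.text_lstm(text)
        p_speech = F.softmax(self.speech_fc(h_speech[-1]), dim=-1)
        p_text = F.softmax(self.text_fc(h_text[-1]), dim=-1)
        # Average the two probability distributions (late fusion)
        return (p_speech + p_text) / 2

model = MultiModalEmotionNet()
speech = torch.randn(2, 100, 43)   # (batch, frames, speech features)
text = torch.randn(2, 20, 256)     # (batch, tokens, text embedding dim)
probs = model(speech, text)
print(probs.shape)                 # torch.Size([2, 4])
```

Averaging the softmax outputs keeps the two modalities independent until the final decision, so either branch can also be trained or evaluated on its own.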