Cross-corpus speech emotion recognition based on adversarial training
-
Abstract: In cross-corpus speech emotion recognition, the distributions of the training and testing data differ markedly, which causes a large gap between validation and testing performance. To address this problem, a cross-corpus speech emotion recognition method based on adversarial training is proposed. Through adversarial training between corpora, the method effectively reduces the discrepancies between different corpora and improves the model's ability to extract domain-invariant emotion features. In addition, a multi-head self-attention mechanism is introduced to model the dependencies between elements at different positions in the speech sequence, strengthening the extraction of emotion-salient features. Experiments with IEMOCAP as the source domain and MSP-IMPROV as the target domain, and with MSP-IMPROV as the source domain and IEMOCAP as the target domain, show relative UAR improvements over the baseline methods of 0.91%–12.22% and 2.27%–6.90%, respectively. Therefore, when emotion labels are unavailable in the target domain, the proposed cross-corpus method extracts domain-invariant, emotion-salient features more effectively.
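A common way to realize the corpus-level adversarial training described above is a gradient reversal layer (GRL) in front of a domain classifier, in the style of domain-adversarial neural networks; that this is the paper's exact construction is an assumption here. A minimal PyTorch sketch, with all layer sizes as illustrative placeholders rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient flows back into the shared feature extractor.
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts which corpus an utterance embedding came from.
    feat_dim=128 and the hidden width are illustrative placeholders."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 2),  # two domains: source corpus vs. target corpus
        )

    def forward(self, feats, lam=1.0):
        # Because of the GRL, minimizing the domain loss maximizes domain
        # confusion upstream, pushing the extractor toward invariant features.
        return self.net(GradReverse.apply(feats, lam))
```

During training, the emotion classifier would be optimized on labeled source utterances while the domain classifier sees embeddings from both corpora; the reversed gradient drives the shared feature extractor toward representations the domain classifier cannot separate.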
-
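Likewise, the multi-head self-attention mechanism mentioned in the abstract can be sketched with PyTorch's built-in nn.MultiheadAttention. The sequence length and embedding size below are placeholders chosen to match the 128-channel, 74-step CNN output in Table 1, not values confirmed by the paper:

```python
import torch
import torch.nn as nn

# Self-attention over a sequence of frame-level features.
# Shapes are illustrative: 74 time steps of 128-dim features,
# matching the final pooling output in Table 1.
attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)

feats = torch.randn(16, 74, 128)          # (batch, time, channels)
out, weights = attn(feats, feats, feats)  # self-attention: query = key = value
print(out.shape, weights.shape)           # (16, 74, 128) and (16, 74, 74)
```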
Table 1  Convolutional layer architecture

Type                  Output shape  Kernel size  Stride  Padding
Conv                  40×751×128    7×7          1×1     SAME
BatchNorm             40×751×128    --           --      --
Nonlinear activation  40×751×128    --           --      --
MaxPool               20×376×128    2×2          2×2     SAME
Conv                  1×370×128     20×7         1×1     VALID
BatchNorm             1×370×128     --           --      --
Nonlinear activation  1×370×128     --           --      --
MaxPool               1×74×128      1×5          1×5     SAME
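Read top to bottom, Table 1 describes two conv-BN-activation-pool stages over a 40×751 input (40 frequency bins by 751 frames, one channel, inferred from the listed shapes). A PyTorch sketch that reproduces those shapes, assuming ReLU as the unspecified nonlinearity; ceil_mode=True emulates the SAME pooling that maps 751 frames to 376:

```python
import torch
import torch.nn as nn

# Two-stage CNN matching Table 1, for a (batch, 1, 40, 751) input.
# Comments give PyTorch channel-first shapes (C, H, W).
cnn = nn.Sequential(
    nn.Conv2d(1, 128, kernel_size=7, padding=3),            # SAME 7x7  -> (128, 40, 751)
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),  # SAME pool -> (128, 20, 376)
    nn.Conv2d(128, 128, kernel_size=(20, 7)),               # VALID 20x7 -> (128, 1, 370)
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 5), stride=(1, 5)),        # -> (128, 1, 74)
)

x = torch.randn(4, 1, 40, 751)
print(cnn(x).shape)  # torch.Size([4, 128, 1, 74])
```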
Table 2  Number of samples per class in each corpus

Corpus      Low {1, 2}  Medium {3}  High {4, 5}  Total
IEMOCAP     3181        1641        1994         6816
MSP-IMPROV  2160        2961        2731         7852
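The UAR figures quoted in the abstract are unweighted average recall: the mean of per-class recalls, which weights the low, medium, and high classes equally despite the imbalanced counts in Table 2. A minimal sketch with made-up labels (0 = low, 1 = medium, 2 = high):

```python
import numpy as np

def uar(y_true, y_pred, num_classes=3):
    """Unweighted average recall: mean of per-class recalls,
    so every class counts equally regardless of its size."""
    recalls = []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.any():
            recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))

# Toy example with the three classes from Table 2.
y_true = np.array([0, 0, 0, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 0])
print(uar(y_true, y_pred))  # (2/3 + 1/1 + 1/2) / 3 ≈ 0.722
```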