• Indexed as a Peking University Core Journal (A Guide to the Core Journals of China, 2017 edition)
  • Indexed as a China Science and Technology Core Journal (statistical source journal for Chinese sci-tech papers)
  • Indexed in the JST (Japan Science and Technology Agency) database


Cross-corpus speech emotion recognition based on adversarial training

XUE Yan-fei, ZHANG Jian-ming

Citation: XUE Yan-fei, ZHANG Jian-ming. Cross-corpus speech emotion recognition based on adversarial training[J]. Microelectronics & Computer, 2021, 38(3): 77-83.


Funding:

Supported by the General Program of the National Natural Science Foundation of China (Grant No. 61672267)

Article information
    Author biographies:

    XUE Yan-fei, male, born in 1993, M.S. candidate. Research interest: speech recognition. E-mail: 2221708068@stmail.ujs.edu.cn

    ZHANG Jian-ming, male, born in 1964, Ph.D., professor. Research interests: image processing and pattern recognition.

  • CLC number: TP183


  • Abstract: In cross-corpus speech emotion recognition, the distributions of the training and test data differ markedly, so validation and test performance diverge widely. To address this problem, a cross-corpus speech emotion recognition method based on adversarial training is proposed. Adversarial training between corpora effectively narrows the gap between different corpora and improves the model's ability to extract domain-invariant emotion features. At the same time, a multi-head self-attention mechanism is introduced to model the dependencies between elements at different positions in a speech sequence, strengthening the extraction of emotionally salient features. Experiments with IEMOCAP as the source domain and MSP-IMPROV as the target domain, and with MSP-IMPROV as the source domain and IEMOCAP as the target domain, show that the proposed method improves relative UAR over the baseline methods by 0.91%–12.22% and 2.27%–6.90%, respectively. Thus, when target-domain labels are unavailable, the proposed cross-corpus method extracts domain-invariant, emotionally salient features more effectively.
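The abstract does not spell out how the corpus-level adversarial training is wired, but the DANN baseline it is compared against ([3], [8]) does it with a gradient reversal layer between the feature extractor and a domain (corpus) discriminator: identity on the forward pass, sign-flipped gradient on the backward pass. A minimal numpy sketch of that layer, with a hand-rolled backward pass and an illustrative trade-off weight `lam` (an assumption, not the paper's actual code):

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer in the style of DANN [8]: the forward
    pass is the identity, while the backward pass multiplies the
    incoming gradient by -lam. Features routed through it are pushed
    to *maximize* the domain discriminator's loss, i.e. to become
    corpus-invariant."""

    def __init__(self, lam=1.0):
        self.lam = lam  # trade-off between emotion and domain losses

    def forward(self, x):
        return x  # identity: the domain discriminator sees raw features

    def backward(self, grad_output):
        # reversed, scaled gradient flows back to the feature extractor
        return -self.lam * grad_output
```

In a full model, the emotion classifier's gradient passes through the shared feature extractor unchanged while the discriminator's gradient arrives reversed, so a single optimizer step improves emotion accuracy and corpus confusion simultaneously.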
  • Figure 1  Architecture of the CRMHSAN_AT network

    Figure 2  Parallel structure of the multi-head self-attention mechanism

    Figure 3  Visualization of the source- and target-domain feature distributions
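Figure 2's parallel heads follow the standard formulation of Vaswani et al. [7]: project queries, keys, and values, attend in each subspace independently, then concatenate and project back. A self-contained numpy sketch (the weight matrices and dimensions here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # project, then split the feature axis into parallel heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    # scaled dot-product attention within each head
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh          # (n_heads, seq_len, d_head)
    # concatenate heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```

Each head attends over the full sequence, which is how the mechanism captures dependencies between elements at different positions, as described in the abstract.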

    Table 1  Convolutional layer architecture

    Type                  Output size   Kernel  Stride  Padding
    Convolution           40×751×128    7×7     1×1     SAME
    Batch normalization   40×751×128    --      --      --
    Nonlinear activation  40×751×128    --      --      --
    Max pooling           20×376×128    2×2     2×2     SAME
    Convolution           1×370×128     20×7    1×1     VALID
    Batch normalization   1×370×128     --      --      --
    Nonlinear activation  1×370×128     --      --      --
    Max pooling           1×74×128      1×5     1×5     SAME
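The output sizes in Table 1 are consistent with TensorFlow-style SAME/VALID padding arithmetic. The sketch below recomputes each row from a 40×751 input (reading the first row as 40 mel-filterbank bands by 751 frames, which is an assumption); batch normalization and the nonlinear activation preserve shape and are omitted:

```python
import math

def conv_out(size, kernel, stride, padding):
    """Output length of one spatial dimension under TF-style padding."""
    if padding == "SAME":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # VALID

# (kernel, stride, padding) for the shape-changing layers of Table 1
layers = [((7, 7), (1, 1), "SAME"),    # convolution
          ((2, 2), (2, 2), "SAME"),    # max pooling
          ((20, 7), (1, 1), "VALID"),  # convolution
          ((1, 5), (1, 5), "SAME")]    # max pooling

h, w = 40, 751  # assumed input: 40 mel bands x 751 frames
for (kh, kw), (sh, sw), pad in layers:
    h, w = conv_out(h, kh, sh, pad), conv_out(w, kw, sw, pad)
    print(f"{h}x{w}x128")
```

Running it prints 40x751x128, 20x376x128, 1x370x128, and 1x74x128, matching the table row by row; the 20×7 VALID convolution is what collapses the frequency axis to height 1.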

    Table 2  Number of samples per class in each corpus

    Corpus      {1, 2}  {3}    {4, 5}  Total
    IEMOCAP     3 181   1 641  1 994   6 816
    MSP-IMPROV  2 160   2 961  2 731   7 852

    Table 3  UAR (%) comparison for cross-corpus emotion recognition

    Method             Source: MSP-IMPROV  Source: IEMOCAP
                       Target: IEMOCAP     Target: MSP-IMPROV
    IS10+DNN[6]        42.90               42.65
    ACNN[5]            42.62               42.91
    MADDoG[4]          47.40               44.40
    DANN[3]            46.08               43.73
    E_EDFLM[2]         46.36               44.58
    CRMHSAN            45.51               43.32
    CRMHSAN_AT (Ours)  47.83               45.59
  • [1] LIU N, ZONG Y, ZHANG B F, et al. Unsupervised cross-corpus speech emotion recognition using domain-adaptive subspace learning[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, AB, Canada: IEEE, 2018: 5144-5148. DOI: 10.1109/ICASSP.2018.8461848.
    [2] MAO Q R, XU G P, XUE W T, et al. Learning emotion-discriminative and domain-invariant features for domain adaptation in speech emotion recognition[J]. Speech Communication, 2017, 93: 1-10. DOI:  10.1016/j.specom.2017.06.006.
    [3] ABDELWAHAB M, BUSSO C. Domain adversarial for acoustic emotion recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, 26(12): 2423-2435. DOI:  10.1109/TASLP.2018.2867099.
    [4] GIDEON J, MCINNIS M, PROVOST E M. Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG)[J]. IEEE Transactions on Affective Computing, 2019. DOI:  10.1109/TAFFC.2019.2916092.
    [5] NEUMANN M, VU N G T. Cross-lingual and multilingual speech emotion recognition on English and French[C]//Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Calgary, AB, Canada: IEEE, 2018: 5769-5773. DOI: 10.1109/ICASSP.2018.8462162.
    [6] LEE S W. The generalization effect for multilingual speech emotion recognition across heterogeneous languages[C]//Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK: IEEE, 2019: 5881-5885. DOI: 10.1109/ICASSP.2019.8683046.
    [7] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, CA, USA: Curran Associates Inc., 2017: 6000-6010.
    [8] GANIN Y, USTINOVA E, AJAKAN H, et al. Domain-adversarial training of neural networks[J]. The Journal of Machine Learning Research, 2016, 17(1): 2096-2030.
    [9] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359. DOI:  10.1007/s10579-008-9076-6.
    [10] BUSSO C, PARTHASARATHY S, BURMANIA A, et al. MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception[J]. IEEE Transactions on Affective Computing, 2017, 8(1): 67-80. DOI:  10.1109/TAFFC.2016.2515617.
    [11] CHANG J, SCHERER S. Learning representations of emotional speech with deep convolutional generative adversarial networks[C]//Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). New Orleans, LA, USA: IEEE, 2017: 2746-2750. DOI: 10.1109/ICASSP.2017.7952656.
    [12] SCHULLER B, ZHANG Z X, WENINGER F, et al. Selecting training data for cross-corpus speech emotion recognition: prototypicality vs. generalization[C]//Proceedings of the 2011 Afeka-AVIOS Speech Processing Conference. Tel Aviv, Israel: TUM, 2011.
Publication history
  • Received:  2020-06-08
  • Revised:  2020-07-21
  • Published:  2021-03-05
