Citation: DUAN Y, SHAO Y B, LONG H, et al. Language identification based on joint decision of nonlinear spectrograms[J]. Microelectronics & Computer, 2024, 41(5): 99-108. doi: 10.19304/J.ISSN1000-7180.2023.0298

Language identification based on joint decision of nonlinear spectrograms

  • Abstract: Gray-scale logarithmic spectrograms stretch the fundamental-frequency region excessively, which limits recognition-rate gains on short utterances. To address this, a language identification method based on joint decision over nonlinear spectrograms is proposed. First, the speech is energy-normalized and its logarithmic power spectrum is extracted; the frequency scale is then mapped nonlinearly according to human auditory perception to obtain a nonlinear spectrogram. Next, the nonlinear spectrogram is split at equal intervals according to word-association characteristics, and a joint decision layer is added at the back end of a ResNet network, which outputs the language of the speech. Experimental results show that the proposed method effectively mitigates the shortcomings of the gray-scale logarithmic spectrogram and outperforms both the spectrogram and the improved features. Joint decision performs best on speech samples segmented into 1.0 s pieces, reaching a recognition rate of 94.25% on a broadcast audio dataset and 98.94% on the VoxForge public corpus.
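The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the nonlinear frequency mapping or the joint decision rule, so the mel-scale warp and the posterior averaging below are assumptions chosen only for illustration, as are all function names, the frame parameters, and the 16 kHz sample rate. The ResNet classifier is omitted; its per-segment outputs are represented by hand-written posterior vectors.

```python
import numpy as np

def log_power_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Energy-normalize the signal and return its log power spectrum,
    shaped (n_frames, n_fft // 2 + 1). Frame sizes are assumed values."""
    signal = signal / (np.sqrt(np.mean(signal ** 2)) + 1e-12)  # unit RMS energy
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power + 1e-10)

def warp_frequency_axis(log_spec, sample_rate=16000, n_bins=128):
    """Resample the linear frequency axis onto a mel-spaced grid.
    The mel scale is an assumed stand-in for the paper's perceptual mapping."""
    hz = np.linspace(0.0, sample_rate / 2, log_spec.shape[1])
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_bins)
    target_hz = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    return np.stack([np.interp(target_hz, hz, frame) for frame in log_spec])

def split_equal(spec, frames_per_segment):
    """Split a spectrogram into equal-length segments, dropping the remainder."""
    n_segments = spec.shape[0] // frames_per_segment
    return [spec[i * frames_per_segment:(i + 1) * frames_per_segment]
            for i in range(n_segments)]

def joint_decision(segment_posteriors):
    """Average per-segment class posteriors and pick the most likely class.
    Averaging is an assumed decision rule, not taken from the paper."""
    mean_post = np.mean(np.asarray(segment_posteriors, dtype=float), axis=0)
    return int(np.argmax(mean_post)), mean_post

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(3 * 16000)        # 3 s of placeholder audio
    spec = warp_frequency_axis(log_power_spectrogram(speech))
    segments = split_equal(spec, 100)              # 100 frames is about 1.0 s at hop=160
    print(spec.shape, len(segments))
    # Hypothetical per-segment outputs of a classifier over two languages.
    label, posterior = joint_decision([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
    print(label, posterior)
```

With a 10 ms hop, 100 frames correspond roughly to the 1.0 s segment length that the abstract reports as giving the best results. Averaging lets confident segments outweigh an ambiguous one, which is one plausible reason a joint decision helps on short utterances.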

     
