Abstract:
HMM-RNN hybrid systems have recently proved successful in speech recognition. However, using an HMM for these tasks requires the inputs and outputs to be pre-aligned for training to work effectively. In addition, when the signal is divided into frames, adjacent frames share an overlapping segment; because the computation of an RNN is context-dependent, this overlapping part increases the training time. This paper combines an RNN with connectionist temporal classification (CTC) instead of an HMM, removes the overlapping part during framing, and models phone recognition on the TIMIT dataset. The experiments show that CTC-BLSTM performs better than HMM-BLSTM on phone recognition, and that removing the overlapping part of the frames makes the system more efficient while preserving accuracy to a certain degree.
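The effect of the frame overlap on sequence length can be illustrated with a minimal sketch. The helper `frame_signal`, the 25 ms frame length, and the 10 ms hop are assumptions chosen for illustration and are not taken from the paper; only the 16 kHz sampling rate of TIMIT is a known property of the dataset. With overlap, the hop is smaller than the frame; without overlap, the hop equals the frame length, so the RNN has fewer time steps to process.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split a sequence of samples into frames of frame_len samples.

    Consecutive frames start hop_len samples apart. Trailing samples that
    do not fill a whole frame are dropped (an illustrative choice).
    """
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop_len
    return [samples[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]


sample_rate = 16000                  # TIMIT audio is sampled at 16 kHz
signal = list(range(sample_rate))    # one second of placeholder samples
frame_len = 400                      # 25 ms frame (assumed value)

# Overlapping framing: 10 ms hop (assumed), so adjacent frames share 240 samples.
overlapped = frame_signal(signal, frame_len, hop_len=160)

# Non-overlapping framing: hop equals the frame length, so no samples are shared.
non_overlapped = frame_signal(signal, frame_len, hop_len=frame_len)

print(len(overlapped), len(non_overlapped))  # prints "98 40"
```

Under these assumed settings, one second of audio yields 98 frames with overlap and 40 without, which is the kind of reduction in sequence length that would shorten RNN training.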