Speech Synthesis: Text Processing, Data and Evaluation

Tue-1-7-7 Distant Supervision for Polyphone Disambiguation in Mandarin Chinese

Jiawen Zhang (University of Chinese Academy of Sciences), Yuanyuan Zhao (Kwai), Jiaqi Zhu (Institute of Software, Chinese Academy of Sciences) and Jinba Xiao (Kwai)
Abstract: Grapheme-to-phoneme (G2P) conversion plays an important role in building a Mandarin Chinese text-to-speech (TTS) system, where polyphone disambiguation is an indispensable task. However, most previous polyphone disambiguation models are trained on manually annotated datasets, which suffer from data scarcity, narrow coverage, and unbalanced data distribution. In this paper, we propose a framework that predicts the pronunciations of Chinese characters, with a core model trained in a distantly supervised way. Specifically, we use the alignment procedure employed for the acoustic model to produce abundant character-phoneme sequence pairs, which are then used to train a Seq2Seq model with an attention mechanism. We also use a language model of phoneme sequences to alleviate the impact of noise in the auto-generated dataset. Experimental results demonstrate that, even without additional syntactic features or pre-trained embeddings, our approach achieves competitive prediction results and, in particular, improves predictive accuracy for unbalanced polyphonic characters. In addition, compared with manually annotated training datasets, the auto-generated dataset is more diverse and yields results that are more consistent with the pronunciation habits of most people.