End-to-End Speech Synthesis for Bangla with Text Normalization
Published in 2018 5th International Conference on Computational Science/Intelligence and Applied Informatics (CSII), 2018
Recommended citation: Pial, Tanzir Islam, Shahreen Salim Aunti, Shabbir Ahmed, and Hasnain Heickal. "End-to-End Speech Synthesis for Bangla with Text Normalization." In 2018 5th International Conference on Computational Science/Intelligence and Applied Informatics (CSII), pp. 66-71. IEEE, 2018. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8457103
ABSTRACT
Text-to-speech synthesis is a well-researched area, yet no system developed so far can claim to be as convincing as a human voice. In the context of speech synthesis, an end-to-end system is one that can synthesize speech from text using training data as minimal as transcribed audio, without any language-specific knowledge or phoneme dictionaries. An end-to-end system should nevertheless be able to integrate language-specific rules to improve its performance. In this paper, we propose an end-to-end speech synthesis system for Bangla (also known as Bengali) that uses a minimal front end and a neural network as its statistical parametric model. We also propose a Text Normalization Procedure (TNP) for Bangla and incorporate it into the end-to-end system. We conducted extensive experiments using different models. Feedback from the participants showed that they responded more positively to the system when TNP was incorporated. A Wilcoxon signed-rank test was conducted to validate the results, and the probability that they arose from experimental error rather than from TNP was calculated to be less than 5%.
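To illustrate what a text normalization step does before synthesis, here is a minimal sketch in Python. It is not the paper's TNP, whose rules are not given in this abstract. It assumes a single hypothetical rule, reading each Bangla digit aloud one at a time; the names `BANGLA_DIGIT_WORDS` and `normalize_digits` are invented for this example. A real procedure would likely also handle full number names, dates, abbreviations, and symbols.

```python
# Hypothetical illustration only: the paper's actual normalization rules
# are not described in this abstract. This sketch expands Bangla digits
# into their spoken words, one digit at a time.

BANGLA_DIGIT_WORDS = {
    "০": "শূন্য",  # 0
    "১": "এক",    # 1
    "২": "দুই",    # 2
    "৩": "তিন",   # 3
    "৪": "চার",   # 4
    "৫": "পাঁচ",   # 5
    "৬": "ছয়",    # 6
    "৭": "সাত",   # 7
    "৮": "আট",    # 8
    "৯": "নয়",    # 9
}


def normalize_digits(text: str) -> str:
    """Replace every Bangla digit in `text` with its spoken word."""
    pieces = []
    for ch in text:
        if ch in BANGLA_DIGIT_WORDS:
            # Pad with spaces so adjacent digits become separate words.
            pieces.append(" " + BANGLA_DIGIT_WORDS[ch] + " ")
        else:
            pieces.append(ch)
    # Collapse the whitespace runs introduced by the padding.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    print(normalize_digits("১২"))
```

Reading digit by digit is a deliberate simplification: it keeps the sketch short but would mispronounce quantities such as prices or years, which is the kind of language-specific rule a fuller normalization procedure would need to encode.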