
Open Access Journal

International Journal of Engineering Research in Electronics and Communication Engineering (IJERECE)

Monthly Journal for Electronics and Communication Engineering

ISSN: 2394-6849 (Online)

Learning a Phoneme Manifold Using Multitask Learning for DNN based Synthesis of Children's Stories

Authors: Naina Teertha, Sai Sirisha Rallabandi, Sai Krishna Rallabandi, Dr. Kumaraswamy, Suryakanth V Gangashetty

Date of Publication: 7th May 2016

Abstract: Deep Neural Networks (DNNs) use a cascade of hidden representations to learn complex mappings from input to output features, and have been shown to produce more natural synthetic speech than conventional HMM-based statistical parametric systems. However, even though DNN-based statistical parametric speech synthesis (SPSS) offers greater flexibility and controllability than unit selection, the naturalness of the speech it generates is still below that of human speech and cannot compete with good unit selection systems. DNNs are very powerful models, and it may be that we have not yet found the best way to use them. In this paper, we investigate learning a phoneme manifold as a secondary task in a multitask learning setting for acoustic modeling, and show that the hidden representation used within a DNN can be improved by such a method. The rationale behind the technique is independent of the architecture and also extends to recurrent and recursive variants of neural networks.
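
To make the multitask setup concrete, below is a minimal PyTorch sketch of the idea, not the paper's actual implementation: a shared feed-forward trunk maps linguistic input features to a hidden representation, a primary head regresses acoustic features, and a secondary head predicts phoneme-manifold coordinates assumed to be precomputed offline by some embedding method. All layer sizes, feature dimensionalities, the loss weight lam, and the class name MultitaskTTS are illustrative assumptions, not values from the paper.

    # Minimal multitask-learning sketch (illustrative; not the paper's exact model).
    # Primary task: linguistic input features -> acoustic features.
    # Secondary task: the same shared hidden layers -> phoneme-manifold coordinates.
    import torch
    import torch.nn as nn

    class MultitaskTTS(nn.Module):
        def __init__(self, in_dim=300, hid_dim=512, acoustic_dim=187, manifold_dim=10):
            super().__init__()
            # Shared trunk: the hidden representation that both tasks shape.
            self.trunk = nn.Sequential(
                nn.Linear(in_dim, hid_dim), nn.Tanh(),
                nn.Linear(hid_dim, hid_dim), nn.Tanh(),
            )
            self.acoustic_head = nn.Linear(hid_dim, acoustic_dim)  # primary task
            self.manifold_head = nn.Linear(hid_dim, manifold_dim)  # secondary task

        def forward(self, x):
            h = self.trunk(x)
            return self.acoustic_head(h), self.manifold_head(h)

    model = MultitaskTTS()
    mse = nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    lam = 0.3  # weight of the secondary (manifold) loss; a tunable assumption

    # Dummy batch standing in for real linguistic inputs and targets.
    x = torch.randn(32, 300)      # linguistic input features
    y_ac = torch.randn(32, 187)   # acoustic targets (dimensionality is illustrative)
    y_mf = torch.randn(32, 10)    # precomputed phoneme-manifold coordinates

    pred_ac, pred_mf = model(x)
    loss = mse(pred_ac, y_ac) + lam * mse(pred_mf, y_mf)
    opt.zero_grad()
    loss.backward()
    opt.step()

At synthesis time only the acoustic head would be used; the manifold head serves purely as a training-time auxiliary objective that regularizes the shared hidden representation.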


