Open Access Journal

ISSN : 2394 - 6849 (Online)

International Journal of Engineering Research in Electronics and Communication Engineering(IJERECE)

Monthly Journal for Electronics and Communication Engineering


An Investigation of End-To-End Speaker Recognition Using Deep Neural Networks

Authors: Lakshmi H R¹, Sivanand Achanta², Suryakanth V Gangashetty³, R Kumaraswamy⁴

Date of Publication: 7th May 2016

Abstract: State-of-the-art automatic speaker recognition (SR) has been dominated by Gaussian mixture model-universal background model (GMM-UBM) based i-vector feature extraction methods. Although these systems are robust, i-vector extraction is very time consuming, and a separate classifier must be trained for the final decision. To alleviate these disadvantages, in this paper we propose to use deep neural networks for end-to-end speaker recognition. We perform several experiments to determine the architecture, hyper-parameter tuning algorithm, and initialization scheme best suited to the SR task. The proposed method combines the feature extraction and classification steps and has a very low footprint. Through an objective metric (equal error rate), we show that the proposed method outperforms the conventional GMM-UBM system.
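The equal error rate (EER) used as the objective metric in the abstract is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). As a rough illustration only (this is not the authors' evaluation code; the function name and the brute-force threshold sweep are our own sketch, and it assumes both target and impostor trials are present), EER can be estimated from a set of trial scores as:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER by sweeping a decision threshold over the scores.

    scores: similarity scores, higher means more likely a target trial.
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    Assumes at least one target and one impostor trial.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))  # candidate operating points
    n_target = labels.sum()
    n_impostor = len(labels) - n_target
    best_gap, eer = np.inf, 0.0
    for t in thresholds:
        accept = scores >= t
        far = np.sum(accept & (labels == 0)) / n_impostor  # impostors accepted
        frr = np.sum(~accept & (labels == 1)) / n_target   # targets rejected
        if abs(far - frr) < best_gap:                      # closest FAR/FRR crossing
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is zero; overlapping target and impostor score distributions push it upward, so a lower EER indicates a better verification system.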

References:

    [1] J. Hansen and T. Hasan, “Speaker recognition by machines and humans: A tutorial review,” Signal Processing Magazine, IEEE, vol. 32, no. 6, pp. 74–99, Nov 2015.

    [2] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.

    [3] William M Campbell, Douglas E Sturim, and Douglas A Reynolds, “Support vector machines using GMM supervectors for speaker verification,” Signal Processing Letters, IEEE, vol. 13, no. 5, pp. 308–311, 2006.

    [4] Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-end factor analysis for speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 4, pp. 788–798, 2011.

    [5] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel, “Speaker and session variability in GMM-based speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 4, pp. 1448–1460, 2007.

    [6] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel, “Joint factor analysis versus eigenchannels in speaker recognition,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 4, pp. 1435–1447, 2007.

    [7] Shou-Chun Yin, Richard Rose, and Patrick Kenny, “A joint factor analysis approach to progressive model adaptation in text-independent speaker verification,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 7, pp. 1999–2010, 2007.

    [8] Fred Richardson, Douglas Reynolds, and Najim Dehak, “A unified deep neural network for speaker and language recognition,” in Proc. of Interspeech, 2015, pp. 1146–1150.

    [9] Najim Dehak, Pedro A Torres-Carrasquillo, Douglas A Reynolds, and Reda Dehak, “Language recognition via i-vectors and dimensionality reduction,” in INTERSPEECH, 2011, pp. 857–860.

    [10] Patrick Kenny, “A small footprint i-vector extractor,” in Odyssey 2012: The Speaker and Language Recognition Workshop, 2012.

    [11] Yoshua Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

    [12] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.

    [13] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” Signal Processing Magazine, 2012.

    [14] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Jorge Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4052–4056.

    [15] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.

    [16] Yan Song, Ruilian Cui, Xinhai Hong, Ian McLoughlin, Jiong Shi, and Lirong Dai, “Improved language identification using deep bottleneck network,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4200–4204.

    [17] Pavel Matejka, Le Zhang, Tim Ng, HS Mallidi, Ondrej Glembek, Jeff Ma, and Bing Zhang, “Neural network bottleneck features for language identification,” Proc. IEEE Odyssey, pp. 299–304, 2014.

    [18] B Yegnanarayana and S Prahallad Kishore, “AANN: an alternative to GMM for pattern recognition,” Neural Networks, vol. 15, no. 3, pp. 459–469, 2002.

    [19] SP Kishore, B Yegnanarayana, and Suryakanth V Gangashetty, “Online text-independent speaker verification system using autoassociative neural network models,” in Neural Networks, 2001. Proceedings. IJCNN’01. International Joint Conference on. IEEE, 2001, vol. 2, pp. 1548–1553.

    [20] Sriram Ganapathy, Kyu Han, Samuel Thomas, Mohamed Omar, Maarten Van Segbroeck, and Shrikanth S Narayanan, “Robust language identification using convolutional neural network features,” in INTERSPEECH.

    [21] Yun Lei, Luciana Ferrer, Aaron Lawson, Mitchell McLaren, and Nicolas Scheffer, “Application of convolutional neural networks to language identification in noisy conditions,” Proc. Odyssey-14, Joensuu, Finland, 2014.

    [22] Ignacio Lopez-Moreno, Jorge Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pablo Moreno, “Automatic language identification using deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 5337–5341.

    [23] Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

    [24] Geoffrey E Hinton and Ruslan R Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

    [25] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, “On the importance of initialization and momentum in deep learning,” in Proc. of ICML, 2013, pp. 1139–1147.

    [26] David Sussillo, “Random walks: Training very deep nonlinear feed-forward networks with smart initialization,” arXiv preprint arXiv:1412.6558, 2014.

    [27] Andrew M Saxe, James L McClelland, and Surya Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.

    [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” arXiv preprint arXiv:1502.01852, 2015.

    [29] Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML- 10), 2010, pp. 807–814.

    [30] Matthew D Zeiler et al., “On rectified linear units for speech processing,” in Proc. of ICASSP, 2013, pp. 3517–3521.

    [31] Matthew D Zeiler, “ADADELTA: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

    [32] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in Proc. of the 3rd International Conference on Learning Representations (ICLR), 2015.

    [33] Seyed Omid Sadjadi, Malcolm Slaney, and Larry Heck, “MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker recognition research,” Speech and Language Processing Technical Committee Newsletter, November 2013.

    [34] Sivanand Achanta, Tejas Godambe, and Suryakanth V Gangashetty, “An investigation of recurrent neural network architectures for statistical parametric speech synthesis,” in Proc. of Interspeech, 2015, pp. 2524–2528.
