Automatic text-independent speaker verification using convolutional deep belief network
I.A. Rakhmanenko¹, A.A. Shelupanov¹, E.Y. Kostyuchenko¹
¹ Tomsk State University of Control Systems and Radioelectronics,
 
  prospect Lenina 40, 634050, Tomsk, Russia
 PDF, 1382 kB
DOI: 10.18287/2412-6179-CO-621
Pages: 596-605.
Full text of article: Russian language.
 
Abstract:
This paper is devoted to the use of a convolutional deep belief network as a speech feature extractor for automatic text-independent speaker verification. The paper describes the scope and problems of automatic speaker verification systems. Types of modern speaker verification systems and types of speech features used in them are considered. The structure and learning algorithm of convolutional deep belief networks are described. The use of speech features extracted from three layers of a trained convolutional deep belief network is proposed. Experimental studies of the proposed features were performed on two speech corpora: the authors' own corpus, comprising audio recordings of 50 speakers, and the TIMIT corpus, comprising audio recordings of 630 speakers. The accuracy of the proposed features was assessed using different types of classifiers. Direct use of these features did not increase accuracy compared to traditional spectral speech features such as mel-frequency cepstral coefficients. However, using these features in a classifier ensemble reduced the equal error rate to 0.21% on the 50-speaker corpus and to 0.23% on the TIMIT corpus.
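The equal error rate (EER) reported above is the standard operating point for verification systems: the threshold at which the false-acceptance rate (FAR) equals the false-rejection rate (FRR). As an illustration only (this is not the authors' code), a minimal sketch of computing the EER from verification scores might look like this:

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Return the EER: the point where the false-acceptance rate (FAR)
    equals the false-rejection rate (FRR), found by scanning thresholds."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        far = np.mean(impostor >= t)  # impostor trials wrongly accepted
        frr = np.mean(genuine < t)    # genuine trials wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:            # keep the threshold where FAR ≈ FRR
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0.0.
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))
```

In practice the reported EER of 0.21% would be computed over many thousands of genuine and impostor trials rather than the toy score lists above.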
Keywords:
speaker recognition, speaker verification, Gaussian mixture models, GMM-UBM system, speech features, speech processing, deep learning, neural networks, pattern recognition.
Citation:
Rakhmanenko IA, Shelupanov AA, Kostyuchenko EYu. Automatic text-independent speaker verification using convolutional deep belief network. Computer Optics 2020; 44(4): 596-605. DOI: 10.18287/2412-6179-CO-621.
Acknowledgements:
This work was funded under the basic part of the state assignment of the Ministry of Education and Science of the Russian Federation, project 8.9628.2017/8.9.
  © 2009, IPSI RAS
151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: ko@smr.ru; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor), Fax: +7 (846) 332-56-20