  
Method for visual analysis of driver's face for automatic lip-reading in the wild
  A.A. Axyonov 1, D.A. Ryumin 1, A.M. Kashevnik 1, D.V. Ivanko 1, A.A. Karpov 1
1 St. Petersburg Federal Research Center of the RAS (SPC RAS),
  199178, St. Petersburg, Russia, 14th Line V.O. 39
PDF, 13 MB
DOI: 10.18287/2412-6179-CO-1092
Pages: 955-962.
Full text of article: Russian language.
 
Abstract:
The paper proposes a method of visual analysis for automatic speech recognition of a vehicle driver. Speech recognition in acoustically noisy conditions is one of the big challenges of artificial intelligence. The problem of effective automatic lip-reading in a vehicle environment has not yet been solved due to various kinds of interference (frequent turns of the driver's head, vibration, varying lighting conditions, etc.). The problem is further aggravated by the lack of available databases on this topic. MediaPipe Face Mesh is used to find and extract the region of interest (ROI). We have developed an End-to-End neural network architecture for the analysis of visual speech. Visual features are extracted from a single image by a convolutional neural network (CNN) in conjunction with a fully connected layer. The extracted features are fed into a Long Short-Term Memory (LSTM) neural network. Because of the small amount of training data, we propose applying a Transfer Learning method. Experiments on visual analysis and speech recognition demonstrate great opportunities for solving the problem of automatic lip-reading. The experiments were performed on our in-house multi-speaker audio-visual dataset RUSAVIC. The maximum recognition accuracy over 62 commands is 64.09 %. The results can be used in various automatic speech recognition systems, especially in acoustically noisy road conditions (high speed, open windows or a sunroof, background music, poor noise insulation, etc.).
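The CNN-plus-LSTM pipeline described in the abstract (per-frame visual features from a CNN with a fully connected layer, fed into an LSTM over the frame sequence) can be sketched in PyTorch as below. This is a minimal illustrative sketch only: layer sizes, kernel choices, and the class name `LipReadingSketch` are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LipReadingSketch(nn.Module):
    """Hypothetical sketch of a CNN+LSTM visual speech recognizer.
    Layer sizes are illustrative assumptions, not the paper's exact model."""
    def __init__(self, num_classes=62, feat_dim=256, hidden=256):
        super().__init__()
        # Per-frame feature extractor: small CNN followed by a fully connected layer
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Temporal model over the sequence of frame features
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)  # 62 command classes

    def forward(self, x):               # x: (batch, time, 3, H, W) lip ROI crops
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.lstm(feats)       # (batch, time, hidden)
        return self.fc(out[:, -1])      # classify from the last time step

model = LipReadingSketch()
logits = model(torch.randn(2, 16, 3, 64, 64))  # 2 clips of 16 frames each
print(tuple(logits.shape))  # (2, 62)
```

In a real pipeline the ROI crops would come from MediaPipe Face Mesh landmarks, and the CNN would typically be initialized from pretrained weights (the transfer learning step mentioned in the abstract).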
Keywords:
vehicle, driver, visual speech recognition, automated lip-reading, machine learning, End-to-End, CNN, LSTM.
Citation:
Axyonov AA, Ryumin DA, Kashevnik AM, Ivanko DV, Karpov AA. Method for visual analysis of driver's face for automatic lip-reading in the wild. Computer Optics 2022; 46(6): 955-962. DOI: 10.18287/2412-6179-CO-1092.
Acknowledgements:
  This work was partly funded by the Russian Foundation for Basic Research under grant No. 19-29-09081 and the state research project No. 0073-2019-0005.
References:
- Road traffic injuries. Source: <https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries>.
- Indicators of road safety. Source: <http://stat.gibdd.ru>.
- Ivanko D, Ryumin D. A novel task-oriented approach toward automated lip-reading system implementation. Int Arch Photogramm Remote Sens Spatial Inf Sci 2021; XLIV-2/W1-2021: 85-89. DOI: 10.5194/isprs-archives-XLIV-2-W1-2021-85-2021.
- McGurk H, MacDonald J. Hearing lips and seeing voices. Nature 1976; 264: 746-748.
- Chung JS, Zisserman A. Lip reading in the wild. Asian Conf on Computer Vision (ACCV) 2016: 87-103. DOI: 10.1007/978-3-319-54184-6_6.
- Yang S, Zhang Y, Feng D, Yang M, Wang C, Xiao J, Chen X. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. Int Conf on Automatic Face and Gesture Recognition (FG) 2019: 1-8. DOI: 10.1109/FG.2019.8756582.
- Chen X, Du J, Zhang H. Lipreading with DenseNet and resBi-LSTM. Signal Image Video Process 2020; 14: 981-989. DOI: 10.1007/s11760-019-01630-1.
- Feng D, Yang S, Shan S. An efficient software for building lip reading models without pains. Int Conf on Multimedia and Expo Workshops (ICMEW) 2021: 1-2. DOI: 10.1109/ICMEW53276.2021.9456014.
- Martinez B, Ma P, Petridis S, Pantic M. Lipreading using temporal convolutional networks. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2020: 6319-6323. DOI: 10.1109/ICASSP40776.2020.9053841.
- Zhang Y, Yang S, Xiao J, Shan S, Chen X. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition. Int Conf on Automatic Face and Gesture Recognition (FG) 2020: 356-363. DOI: 10.1109/FG47880.2020.00134.
- Ma P, Martinez B, Petridis S, Pantic M. Towards practical lipreading with distilled and efficient models. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2021: 7608-7612. DOI: 10.1109/ICASSP39728.2021.9415063.
- Sui C, Bennamoun M, Togneri R. Listening with your eyes: Towards a practical visual speech recognition system using deep Boltzmann machines. Proc Int Conf on Computer Vision (ICCV) 2015: 154-162.
- Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. Interspeech 2017: 3652-3656.
- Hlaváč M, Gruber I, Železný M, Karpov A. Lipreading with LipsID. Int Conf on Speech and Computer (SPECOM) 2020: 176-183. DOI: 10.1007/978-3-030-60276-5_18.
- Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. Proc Computer Society Conf on Computer Vision and Pattern Recognition (CVPR) 2001; 1: 511-518. DOI: 10.1109/CVPR.2001.990517.
- Cootes TF, Edwards GJ, Taylor CJ. Active appearance models. IEEE Trans Pattern Anal Mach Intell 2001; 23(6): 681-685. DOI: 10.1109/34.927467.
- Xu B, Wang J, Lu C, Guo Y. Watch to listen clearly: Visual speech enhancement driven multi-modality speech recognition. Proc IEEE/CVF Winter Conf on Applications of Computer Vision 2020: 1637-1646.
- Ryumina E, Ryumin D, Ivanko D, Karpov A. A novel method for protective face mask detection using convolutional neural networks and image histograms. Int Arch Photogramm Remote Sens Spatial Inf Sci 2021; XLIV-2/W1-2021: 177-182. DOI: 10.5194/isprs-archives-XLIV-2-W1-2021-177-2021.
- Ryumina E, Karpov A. Facial expression recognition using distance importance scores between facial landmarks. Graphicon, CEUR Workshop Proceedings 2020; 2744: 1-10.
- Ivanko D, Ryumin D, Axyonov A, Kashevnik A. Speaker-dependent visual command recognition in vehicle cabin: Methodology and evaluation. In Book: Karpov A, Potapova R, eds. Speech and Computer (SPECOM). Lecture Notes in Computer Science 2021; 12997: 291-302. DOI: 10.1007/978-3-030-87802-3_27.
- Noda K, Yamaguchi Y, Nakadai K, Okuno HG, Ogata T. Lipreading using convolutional neural network. Proc Annual Conf of the Int Speech Communication Association (INTERSPEECH) 2014: 1149-1153.
- Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997; 9(8): 1735-1780.
- Petridis S, Stafylakis T, Ma P, Cai F, Tzimiropoulos G, Pantic M. End-to-end audiovisual speech recognition. IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2018: 6548-6552.
- Kashevnik A, Lashkov I, Axyonov A, Ivanko D, Ryumin D, Kolchin A, Karpov A. Multimodal corpus design for audio-visual speech recognition in vehicle cabin. IEEE Access 2021; 9: 34986-35003. DOI: 10.1109/ACCESS.2021.3062752.
- Lashkov I, Axyonov A, Ivanko D, Ryumin D, Karpov A, Kashevnik A. Multimodal database of Russian speech of drivers in the cab of vehicles (RUSAVIC – RUSsian Audio-Visual speech In Cars) [In Russian]. Database State Registration Certificate N2020622063 of October 27, 2020.
- Ivanko D, Axyonov A, Ryumin D, Kashevnik A, Karpov A. RUSAVIC corpus: Russian audio-visual speech in cars. Proc Thirteenth Language Resources and Evaluation Conference (LREC'22) 2022: 1555-1559.
- Kashevnik A, Lashkov I, Gurtov A. Methodology and mobile application for driver behavior analysis and accident prevention. IEEE Trans Intell Transp Syst 2019; 21(6): 2427-2436.
- Kashevnik A, Lashkov I, Ponomarev A, Teslya N, Gurtov A. Cloud-based driver monitoring system using a smartphone. Sensors 2020; 20(12): 6701-6715.
- The multi-speaker audiovisual corpus RUSAVIC. Source: <https://mobiledrivesafely.com/corpus-rusavic>.
- Fung I, Mak B. End-to-end low-resource lip-reading with Maxout CNN and LSTM. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2018: 2511-2515. DOI: 10.1109/ICASSP.2018.8462280.
- Xu K, Li D, Cassimatis N, Wang X. LCANet: End-to-end lipreading with cascaded attention-CTC. Int Conf on Automatic Face and Gesture Recognition (FG) 2018: 548-555. DOI: 10.1109/FG.2018.00088.
- Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. Int Conf on Acoustics, Speech and Signal Processing (ICASSP) 2021: 7613-7617. DOI: 10.1109/ICASSP39728.2021.9414567.
- Lugaresi C, Tang J, Nash H, McClanahan C, Uboweja E, Hays M, Zhang F, Chang CL, Yong M, Lee J, Chang WT, Hua W, Georg M, Grundmann M. MediaPipe: A framework for building perception pipelines. arXiv Preprint. 2019. Source: <https://arxiv.org/abs/1906.08172>.
- Shin H, Roth H, Gao M, Lu L, Xu Z, Nogues I, Summers RM. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 2016; 35(5): 1285-1298. DOI: 10.1109/TMI.2016.2528162.
- Torchvision. Transforms. Source: <https://pytorch.org/vision/stable/transforms.html?highlight=randomequalize#torchvision.transforms.RandomEqualize>.
- Label smoothing. Source: <https://paperswithcode.com/method/label-smoothing>.
- 3D ResNet. Source: <https://pytorch.org/hub/facebookresearch_pytorchvideo_resnet/>.
- Zhong Z, Lin ZQ, Bidart R, Hu X, Ben Daya I, Li Z, Zheng W, Li J, Wong A. Squeeze-and-attention networks for semantic segmentation. Proc IEEE/CVF Conf on Computer Vision and Pattern Recognition 2020: 13065-13074.
- Cosine annealing warm restarts. Source: <https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html>.
    
    
  
  
  © 2009, IPSI RAS
151, Molodogvardeiskaya str., Samara, 443001, Russia; E-mail: journal@computeroptics.ru; Tel: +7 (846) 242-41-24 (Executive secretary), +7 (846) 332-56-22 (Issuing editor); Fax: +7 (846) 332-56-20