An audio-visual corpus for multimodal automatic speech recognition


A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded using both a microphone array and a microphone built into a mobile computer. For the purpose of training AVSR systems, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Owing to the inclusion of recordings made in noisy conditions, the corpus can also be used for testing the robustness of speech recognition systems in the presence of acoustic background noise. The process of building the corpus, including the recording, labeling and post-processing phases, is described in the paper. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested with the material contained in the corpus, are presented and discussed together with comparative test results employing a state-of-the-art/commercial ASR engine. To demonstrate the practical use of the corpus, it is made available for public use.



Article in a journal listed in the JCR
Published in: Journal of Intelligent Information Systems
ISSN: 0925-9902
Publication year: 2017
Bibliographic description:
Czyżewski A., Kostek B., Bratoszewski P., Kotus J., Szykulski M.: An audio-visual corpus for multimodal automatic speech recognition // Journal of Intelligent Information Systems. - Vol. 49, no. 2 (2017), pp. 167-192
Digital Object Identifier: 10.1007/s10844-016-0438-z
Verified by:
Gdańsk University of Technology
