Abstract
A review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings are provided. The new corpus, containing 31 hours of recordings, was created specifically to assist the development of audio-visual speech recognition (AVSR) systems. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras and a depth imaging stream from a Time-of-Flight camera, accompanied by audio recorded using both a microphone array and a microphone built into a mobile computer. To support AVSR system training, every utterance was manually labeled, and the resulting label files were added to the corpus repository. Because it includes recordings made in noisy conditions, the corpus can also be used to test the robustness of speech recognition systems in the presence of acoustic background noise. The process of building the corpus, including the recording, labeling and post-processing phases, is described in the paper. Results achieved with the developed audio-visual automatic speech recognition (ASR) engine, trained and tested with the material contained in the corpus, are presented and discussed together with comparative test results obtained with a state-of-the-art commercial ASR engine. To demonstrate its practical use, the corpus has been made publicly available.
Details
- Category: Articles
- Type: article in a journal listed in JCR
- Published in: JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, vol. 49, pages 167-192, ISSN: 0925-9902
- Language: English
- Publication year: 2017
- Bibliographic description: Czyżewski A., Kostek B., Bratoszewski P., Kotus J., Szykulski M.: An audio-visual corpus for multimodal automatic speech recognition // JOURNAL OF INTELLIGENT INFORMATION SYSTEMS. - Vol. 49, no. 2 (2017), pp. 167-192
- DOI: 10.1007/s10844-016-0438-z
- Verified by: Gdańsk University of Technology
Referenced datasets
- dataset MODALITY corpus - SPEAKER 35 - COMMANDS C1
- dataset MODALITY corpus - SPEAKER 21 - SEQUENCE S6
- dataset MODALITY corpus - SPEAKER 21 - COMMANDS C5
- dataset MODALITY corpus - SPEAKER 21 - SEQUENCE S4
- dataset MODALITY corpus - SPEAKER 10 - SEQUENCE S1
- dataset MODALITY corpus - SPEAKER 01 - SEQUENCE S2
- dataset MODALITY corpus - SPEAKER 39 - COMMANDS C1
- dataset MODALITY corpus - SPEAKER 01 - SEQUENCE S3
- dataset MODALITY corpus - SPEAKER 01 - COMMANDS C3
- dataset MODALITY corpus - SPEAKER 21 - SEQUENCE S2