Search results for: SPEECH RECOGNITION

Language Models in Speech Recognition

Publication

J. Daciuk

- Year 2022

This chapter describes language models used in speech recognition, It starts by indicating the role and the place of language models in speech recognition. Mesures used to compare language models follow. An overview of n-gram, syntactic, semantic, and neural models is given. It is accompanied by a list of popular software.

Full text to download in external service

Multimodal English corpus for automatic speech recognition

Publication

- Year 2013

A multimodal corpus developed for research of speech recognition based on audio-visual data is presented. Besides usual video and sound excerpts, the prepared database contains also thermovision images and depth maps. All streams were recorded simultaneously, therefore the corpus enables to examine the importance of the information provided by different modalities. Based on the recordings, it is also possible to develop a speech...

Speech recognition system for hearing impaired people.

Publication

- Year 2005

Praca przedstawia wyniki badań z zakresu rozpoznawania mowy. Tworzony system wykorzystujący dane wizualne i akustyczne będzie ułatwiał trening poprawnego mówienia dla osób po operacji transplantacji ślimaka i innych osób wykazujących poważne uszkodzenia słuchu. Active Shape models zostały wykorzystane do wyznaczania parametrów wizualnych na podstawie analizy kształtu i ruchu ust w nagraniach wideo. Parametry akustyczne bazują na...

Examining Influence of Distance to Microphone on Accuracy of Speech Recognition

Publication

- Year 2015

The problem of controlling a machine by the distant-talking speaker without a necessity of handheld or body-worn equipment usage is considered. A laboratory setup is introduced for examination of performance of the developed automatic speech recognition system fed by direct and by distant speech acquired by microphones placed at three different distances from the speaker (0.5 m to 1.5 m). For feature extraction from the voice signal...

Full text to download in external service

An audio-visual corpus for multimodal automatic speech recognition

Publication

- JOURNAL OF INTELLIGENT INFORMATION SYSTEMS - Year 2017

review of available audio-visual speech corpora and a description of a new multimodal corpus of English speech recordings is provided. The new corpus containing 31 hours of recordings was created specifically to assist audio-visual speech recognition systems (AVSR) development. The database related to the corpus includes high-resolution, high-framerate stereoscopic video streams from RGB cameras, depth imaging stream utilizing Time-of-Flight...

Full text available to download

Visual Lip Contour Detection for the Purpose of Speech Recognition

Publication

- Year 2014

A method for visual detection of lip contours in frontal recordings of speakers is described and evaluated. The purpose of the method is to facilitate speech recognition with visual features extracted from a mouth region. Different Active Appearance Models are employed for finding lips in video frames and for lip shape and texture statistical description. Search initialization procedure is proposed and error measure values are...

Automatic Image and Speech Recognition Based on Neural Network

Publication

D. Król
B. Szlachetko

- Journal of Information Technology Research - Year 2010

Full text to download in external service

Audiovisual speech recognition for training hearing impaired patients

Publication

- Year 2006

Praca przedstawia system rozpoznawania izolowanych głosek mowy wykorzystujący dane wizualne i akustyczne. Modele Active Shape Models zostały wykorzystane do wyznaczania parametrów wizualnych na podstawie analizy kształtu i ruchu ust w nagraniach wideo. Parametry akustyczne bazują na współczynnikach melcepstralnych. Sieć neuronowa została użyta do rozpoznawania wymawianych głosek na podstawie wektora cech zawierającego oba typy...

Optimizing Medical Personnel Speech Recognition Models Using Speech Synthesis and Reinforcement Learning

Publication

A. Czyżewski

- Journal of the Acoustical Society of America - Year 2023

Text-to-Speech synthesis (TTS) can be used to generate training data for building Automatic Speech Recognition models (ASR). Access to medical speech data is because it is sensitive data that is difficult to obtain for privacy reasons; TTS can help expand the data set. Speech can be synthesized by mimicking different accents, dialects, and speaking styles that may occur in a medical language. Reinforcement Learning (RL), in the...

Full text available to download

Auditory-model based robust feature selection for speech recognition

Publication

C. Koniaris
M. Kuropatwinski
W. Kleijn
M. Kuropatwiński

- Journal of the Acoustical Society of America - Year 2010

Full text to download in external service

Comparison of Language Models Trained on Written Texts and Speech Transcripts in the Context of Automatic Speech Recognition

Publication

S. Dziadzio
A. Nabożny
A. Smywiński-Pohl
B. Ziółko

- Year 2015

Full text to download in external service

Comparison of Acoustic and Visual Voice Activity Detection for Noisy Speech Recognition

Publication

- Year 2016

The problem of accurate differentiating between the speaker utterance and the noise parts in a speech signal is considered. The influence of utilizing a voice activity detection in speech signals on the accuracy of the automatic speech recognition (ASR) system is presented. The examined methods of voice activity detection are based on acoustic and visual modalities. The problem of detecting the voice activity in clean and noisy...

Analysis of 2D Feature Spaces for Deep Learning-based Speech Recognition

Publication

G. Korvel
P. Treigys
G. Tamulevicus
J. Bernataviciene
B. Kostek

- JOURNAL OF THE AUDIO ENGINEERING SOCIETY - Year 2018

convolutional neural network (CNN) which is a class of deep, feed-forward artificial neural network. We decided to analyze audio signal feature maps, namely spectrograms, linear and Mel-scale cepstrograms, and chromagrams. The choice was made upon the fact that CNN performs well in 2D data-oriented processing contexts. Feature maps were employed in the Lithuanian word recognition task. The spectral analysis led to the highest word...

A survey of automatic speech recognition deep models performance for Polish medical terms

Publication

- Year 2023

Among the numerous applications of speech-to-text technology is the support of documentation created by medical personnel. There are many available speech recognition systems for doctors. Their effectiveness in languages such as Polish should be verified. In connection with our project in this field, we decided to check how well the popular speech recognition systems work, employing models trained for the general Polish language....

Full text to download in external service

Combining visual and acoustic modalities to ease speech recognition by hearing impaired people

Publication

- Year 2005

Artykuł prezentuje system, którego celem działania jest ułatwienie procesu treningu poprawnej wymowy dla osób z poważnymi wadami słuchu. W analizie mowy wykorzystane zostały parametry akutyczne i wizualne. Do wyznaczenia parametrów wizualnych na podstawie kształtu i ruchu ust zostały wykorzystane modele Active Shape Models. Parametry akustyczne bazują na współczynnikach melcepstralnych. Do klasyfikacji wypowiadanych głosek została...

Hybrid of Neural Networks and Hidden Markov Models as a modern approach to speech recognition systems

Publication

- Pomiary Automatyka Robotyka - Year 2013

The aim of this paper is to present a hybrid algorithm that combines the advantages ofartificial neural networks and hidden Markov models in speech recognition for control purpos-es. The scope of the paper includes review of currently used solutions, description and analysis of implementation of selected artificial neural network (NN) structures and hidden Markov mod-els (HMM). The main part of the paper consists of a description...

Full text available to download

EXAMINING INFLUENCE OF VIDEO FRAMERATE AND AUDIO/VIDEO SYNCHRONIZATION ON AUDIO-VISUAL SPEECH RECOGNITION ACCURACY

Publication

- Year 2014

The problem of video framerate and audio/video synchronization in audio-visual speech recognition is considered. The visual features are added to the acoustic parameters in order to improve the accuracy of speech recognition in noisy conditions. The Mel-Frequency Cepstral Coefficients are used on the acoustic side whereas Active Appearance Model features are extracted from the image. The feature fusion approach is employed. The...

EXAMINING INFLUENCE OF VIDEO FRAMERATE AND AUDIO/VIDEO SYNCHRONIZATION ON AUDIO-VISUAL SPEECH RECOGNITION ACCURACY

Publication

- Year 2014

The problem of video framerate and audio/video synchronization in audio-visual speech recogni-tion is considered. The visual features are added to the acoustic parameters in order to improve the accuracy of speech recognition in noisy conditions. The Mel-Frequency Cepstral Coefficients are used on the acoustic side whereas Active Appearance Model features are extracted from the image. The feature fusion approach is employed. The...

Intra-subject class-incremental deep learning approach for EEG-based imagined speech recognition

Publication

J. S. Garcia Salinas
A. A. Torres-García
C. A. Reyes-Garćia
L. Villaseñor-Pineda

- Biomedical Signal Processing and Control - Year 2023

Brain–computer interfaces (BCIs) aim to decode brain signals and transform them into commands for device operation. The present study aimed to decode the brain activity during imagined speech. The BCI must identify imagined words within a given vocabulary and thus perform the requested action. A possible scenario when using this approach is the gradual addition of new words to the vocabulary using incremental learning methods....

Full text to download in external service

Language material for English audiovisual speech recognition system developmen . Materiał językowy do wykorzystania w systemie audiowizualnego rozpoznawania mowy angielskiej

Publication

A. Czyżewski
B. Kostek
T. Ciszewski
D. Majewicz

- Year 2013

The bi-modal speech recognition system requires a 2-sample language input for training and for testing algorithms which precisely depicts natural English speech. For the purposes of the audio-visual recordings, a training data base of 264 sentences (1730 words without repetitions; 5685 sounds) has been created. The language sample reflects vowel and consonant frequencies in natural speech. The recording material reflects both the...

Recognition of Emotions in Speech Using Convolutional Neural Networks on Different Datasets

Publication

- Electronics - Year 2022

Artificial Neural Network (ANN) models, specifically Convolutional Neural Networks (CNN), were applied to extract emotions based on spectrograms and mel-spectrograms. This study uses spectrograms and mel-spectrograms to investigate which feature extraction method better represents emotions and how big the differences in efficiency are in this context. The conducted studies demonstrated that mel-spectrograms are a better-suited...

Full text available to download

A Study of Cross-Linguistic Speech Emotion Recognition Based on 2D Feature Spaces

Publication

G. Tamulevicius
G. Korvel
A. B. Yayak
P. Treigys
J. Bernataviciene
B. Kostek

- Electronics - Year 2020

In this research, a study of cross-linguistic speech emotion recognition is performed. For this purpose, emotional data of different languages (English, Lithuanian, German, Spanish, Serbian, and Polish) are collected, resulting in a cross-linguistic speech emotion dataset with the size of more than 10.000 emotional utterances. Despite the bi-modal character of the databases gathered, our focus is on the acoustic representation...

Full text available to download

Material for Automatic Phonetic Transcription of Speech Recorded in Various Conditions

Publication

- Year 2016

Automatic speech recognition (ASR) is under constant development, especially in cases when speech is casually produced or it is acquired in various environment conditions, or in the presence of background noise. Phonetic transcription is an important step in the process of full speech recognition and is discussed in the presented work as the main focus in this process. ASR is widely implemented in mobile devices technology, but...

Full text to download in external service

Investigating Feature Spaces for Isolated Word Recognition

Publication

P. Treigys
G. Korvel
G. Tamulevicius
J. Bernataviciene
B. Kostek

- Year 2020

The study addresses the issues related to the appropriateness of a two-dimensional representation of speech signal for speech recognition tasks based on deep learning techniques. The approach combines Convolutional Neural Networks (CNNs) and time-frequency signal representation converted to the investigated feature spaces. In particular, waveforms and fractal dimension features of the signal were chosen for the time domain, and...

Full text to download in external service

Investigating Feature Spaces for Isolated Word Recognition

Publication

G. Korvel
G. Tamulevicus
P. Treigys
J. Bernataviciene
B. Kostek

- Year 2018

Much attention is given by researchers to the speech processing task in automatic speech recognition (ASR) over the past decades. The study addresses the issue related to the investigation of the appropriateness of a two-dimensional representation of speech feature spaces for speech recognition tasks based on deep learning techniques. The approach combines Convolutional Neural Networks (CNNs) and timefrequency signal representation...

A comparative study of English viseme recognition methods and algorithm

Publication

- MULTIMEDIA TOOLS AND APPLICATIONS - Year 2018

An elementary visual unit – the viseme is concerned in the paper in the context of preparing the feature vector as a main visual input component of Audio-Visual Speech Recognition systems. The aim of the presented research is a review of various approaches to the problem, the implementation of algorithms proposed in the literature and a comparative research on their effectiveness. In the course of the study an optimal feature vector...

Full text available to download

Methodology and technology for the polymodal allophonic speech transcription

Publication

- Journal of the Acoustical Society of America - Year 2016

A method for automatic audiovisual transcription of speech employing: acoustic and visual speech representations is developed. It adopts a combining of audio and visual modalities, which provide a synergy effect in terms of speech recognition accuracy. To establish a robust solution, basic research concerning the relation between the allophonic variation of speech, i.e. the changes in the articulatory setting of speech organs for...

Full text to download in external service

Methodology and technology for the polymodal allophonic speech transcription

Publication

- Journal of the Acoustical Society of America - Year 2016

A method for automatic audiovisual transcription of speech employing: acoustic, electromagnetical articulography and visual speech representations is developed. It adopts a combining of audio and visual modalities, which provide a synergy effect in terms of speech recognition accuracy. To establish a robust solution, basic research concerning the relation between the allophonic variation of speech, i.e., the changes in the articulatory...

Full text to download in external service

Introduction to the special issue on machine learning in acoustics

Publication

Z. Michalopoulou
P. Gerstoft
B. Kostek
M. A. Roch

- Journal of the Acoustical Society of America - Year 2021

When we started our Call for Papers for a Special Issue on “Machine Learning in Acoustics” in the Journal of the Acoustical Society of America, our ambition was to invite papers in which machine learning was applied to all acoustics areas. They were listed, but not limited to, as follows: • Music and synthesis analysis • Music sentiment analysis • Music perception • Intelligent music recognition • Musical source separation • Singing...

Full text available to download

The Impact of Foreign Accents on the Performance of Whisper Family Models Using Medical Speech in Polish

Publication

S. Zaporowski

- Year 2024

The article presents preliminary experiments investigating the impact of accent on the performance of the Whisper automatic speech recognition (ASR) system, specifically for the Polish language and medical data. The literature review revealed a scarcity of studies on the influence of accents on speech recognition systems in Polish, especially concerning medical terminology. The experiments involved voice cloning of selected individuals...

Full text available to download

Examining Feature Vector for Phoneme Recognition

Publication

G. Korvel
B. Kostek

- Year 2018

The aim of this paper is to analyze usability of descriptors coming from music information retrieval to the phoneme analysis. The case study presented consists in several steps. First, a short overview of parameters utilized in speech analysis is given. Then, a set of time and frequency domain-based parameters is selected and discussed in the context of stop consonant acoustical characteristics. A toolbox created for this purpose...

An Attempt to Create Speech Synthesis Model That Retains Lombard Effect Characteristics

Publication

G. Korvel
O. Kurasova
B. Kostek

- Year 2019

The speech with the Lombard effect has been extensively studied in the context of speech recognition or speech enhancement. However, few studies have investigated the Lombard effect in the context of speech synthesis. The aim of this paper is to create a mathematical model that allows for retaining the Lombard effect. These models could be used as a basis of a formant speech synthesizer. The proposed models are based on dividing...

Full text available to download

A comparative study of English viseme recognition methods and algorithms

Publication

- MULTIMEDIA TOOLS AND APPLICATIONS - Year 2018

An elementary visual unit – the viseme is concerned in the paper in the context of preparing the feature vector as a main visual input component of Audio-Visual Speech Recognition systems. The aim of the presented research is a review of various approaches to the problem, the implementation of algorithms proposed in the literature and a comparative research on their effectiveness. In the course of the study an optimal feature vector construction...

Full text available to download

Speech Analytics Based on Machine Learning

Publication

- Year 2019

In this chapter, the process of speech data preparation for machine learning is discussed in detail. Examples of speech analytics methods applied to phonemes and allophones are shown. Further, an approach to automatic phoneme recognition involving optimized parametrization and a classifier belonging to machine learning algorithms is discussed. Feature vectors are built on the basis of descriptors coming from the music information...

Full text to download in external service

Enhanced voice user interface employing spatial filtration of signals from acoustic vector sensor

Publication

- Year 2015

Spatial filtration of sound is introduced to enhance speech recognition accuracy in noisy conditions. An acoustic vector sensor (AVS) is employed. The signals from the AVS probe are processed in order to attenuate the surrounding noise. As a result the signal to noise ratio is increased. An experiment is featured in which speech signals are disturbed by babble noise. The signals before and after spatial filtration are processed...

Full text to download in external service

SYNTHESIZING MEDICAL TERMS – QUALITY AND NATURALNESS OF THE DEEP TEXT-TO-SPEECH ALGORITHM

Publication

- Journal of the Acoustical Society of America - Year 2023

The main purpose of this study is to develop a deep text-to-speech (TTS) algorithm designated for an embedded system device. First, a critical literature review of state-of-the-art speech synthesis deep models is provided. The algorithm implementation covers both hardware and algorithmic solutions. The algorithm is designed for use with the Raspberry Pi 4 board. 80 synthesized sentences were prepared based on medical and everyday...

Full text available to download

Noise profiling for speech enhancement employing machine learning models

Publication

K. Kąkol
G. Korvel
B. Kostek

- Journal of the Acoustical Society of America - Year 2022

This paper aims to propose a noise profiling method that can be performed in near real-time based on machine learning (ML). To address challenges related to noise profiling effectively, we start with a critical review of the literature background. Then, we outline the experiment performed consisting of two parts. The first part concerns the noise recognition model built upon several baseline classifiers and noise signal features...

Full text available to download

Decoding imagined speech for EEG-based BCI

Publication

C. A. Reyes-García
A. A. Torres-García
T. Hernández-del-Toro
J. S. Garcia Salinas
L. Villaseñor-Pineda

- Year 2024

Brain–computer interfaces (BCIs) are systems that transform the brain's electrical activity into commands to control a device. To create a BCI, it is necessary to establish the relationship between a certain stimulus, internal or external, and the brain activity it provokes. A common approach in BCIs is motor imagery, which involves imagining limb movement. Unfortunately, this approach allows few commands. As an alternative, this...

Full text to download in external service

Bimodal classification of English allophones employing acoustic speech signal and facial motion capture

Publication

- Journal of the Acoustical Society of America - Year 2018

A method for automatic transcription of English speech into International Phonetic Alphabet (IPA) system is developed and studied. The principal objective of the study is to evaluate to what extent the visual data related to lip reading can enhance recognition accuracy of the transcription of English consonantal and vocalic allophones. To this end, motion capture markers were placed on the faces of seven speakers to obtain lip...

Full text to download in external service

KORPUS MOWY ANGIELSKIEJ DO CELÓW MULTIMODALNEGO AUTOMATYCZNEGO ROZPOZNAWANIA MOWY

Publication

- Przegląd Telekomunikacyjny + Wiadomości Telekomunikacyjne - Year 2016

W referacie zaprezentowano audiowizualny korpus mowy zawierający 31 godzin nagrań mowy w języku angielskim. Korpus dedykowany jest do celów automatycznego audiowizualnego rozpoznawania mowy. Korpus zawiera nagrania wideo pochodzące z szybkoklatkowej kamery stereowizyjnej oraz dźwięk zarejestrowany przez matrycę mikrofonową i mikrofon komputera przenośnego. Dzięki uwzględnieniu nagrań zarejestrowanych w warunkach szumowych korpus...

Voice command recognition using hybrid genetic algorithm

Publication

- TASK Quarterly - Year 2010

Abstract: Speech recognition is a process of converting the acoustic signal into a set of words, whereas voice command recognition consists in the correct identification of voice commands, usually single words. Voice command recognition systems are widely used in the military, control systems, electronic devices, such as cellular phones, or by people with disabilities (e.g., for controlling a wheelchair or operating a computer...

Full text available to download

Automatic Emotion Recognition in Children with Autism: A Systematic Literature Review

Publication

A. Landowska
A. Karpus
T. Zawadzka
B. Robins
D. Erol Barkana
H. Kose
T. Zorcec
N. Cummins

- SENSORS - Year 2022

The automatic emotion recognition domain brings new methods and technologies that might be used to enhance therapy of children with autism. The paper aims at the exploration of methods and tools used to recognize emotions in children. It presents a literature review study that was performed using a systematic approach and PRISMA methodology for reporting quantitative and qualitative results. Diverse observation channels and modalities...

Full text available to download

WYKORZYSTANIE SIECI NEURONOWYCH DO SYNTEZY MOWY WYRAŻAJĄCEJ EMOCJE

Publication

- Year 2018

W niniejszym artykule przedstawiono analizę rozwiązań do rozpoznawania emocji opartych na mowie i możliwości ich wykorzystania w syntezie mowy z emocjami, wykorzystując do tego celu sieci neuronowe. Przedstawiono aktualne rozwiązania dotyczące rozpoznawania emocji w mowie i metod syntezy mowy za pomocą sieci neuronowych. Obecnie obserwuje się znaczny wzrost zainteresowania i wykorzystania uczenia głębokiego w aplikacjach związanych...

Examining Feature Vector for Phoneme Recognition / Analiza parametrów w kontekście automatycznej klasyfikacji fonemów

Publication

- Year 2017

The aim of this paper is to analyze usability of descriptors coming from music information retrieval to the phoneme analysis. The case study presented consists in several steps. First, a short overview of parameters utilized in speech analysis is given. Then, a set of time and frequency domain-based parameters is selected and discussed in the context of stop consonant acoustical characteristics. A toolbox created for this purpose...

Vocalic Segments Classification Assisted by Mouth Motion Capture

Publication

- Year 2018

Visual features convey important information for automatic speech recognition (ASR), especially in noisy environment. The purpose of this study is to evaluate to what extent visual data (i.e. lip reading) can enhance recognition accuracy in the multi-modal approach. For that purpose motion capture markers were placed on speakers' faces to obtain lips tracking data during speaking. Different parameterizations strategies were tested...

Full text to download in external service

PHONEME DISTORTION IN PUBLIC ADDRESS SYSTEMS

Publication

- Year 2015

The quality of voice messages in speech reinforcement and public address systems is often poor. The sound engineering projects of such systems take care of sound intensity and possible reverberation phenomena in public space without, however, considering the influence of acoustic interference related to the number and distribution of loudspeakers. This paper presents the results of measurements and numerical simulations of the...

Marking the Allophones Boundaries Based on the DTW Algorithm

Publication

J. Rafałko

- Year 2018

The paper presents an approach to marking the boundaries of allophones in the speech signal based on the Dynamic Time Warping (DTW) algorithm. Setting and marking of allophones boundaries in continuous speech is a difficult issue due to the mutual influence of adjacent phonemes on each other. It is this neighborhood on the one hand that creates variants of phonemes that is allophones, and on the other hand it affects that the border...

The Innovative Faculty for Innovative Technologies

Publication

- Year 2013

A leaflet describing Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology. Multimedia Systems Department described laboratories and prototypes of: Auditory-visual attention stimulator, Automatic video event detection, Object re-identification application for multi-camera surveillance systems, Object Tracking and Automatic Master-Slave PTZ Camera Positioning System, Passive Acoustic Radar,...

Full text to download in external service

Mispronunciation Detection in Non-Native (L2) English with Uncertainty Modeling

Publication

D. Korzekwa
J. Lorenzo-trueba
S. Zaporowski
S. Calamaro
T. Drugman
B. Kostek

- Year 2021

A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare it to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do not always hold, which can result...

Full text to download in external service

Minimizing Distribution and Data Loading Overheads in Parallel Training of DNN Acoustic Models with Frequent Parameter Averaging

Publication

- Year 2017

In the paper we investigate the performance of parallel deep neural network training with parameter averaging for acoustic modeling in Kaldi, a popular automatic speech recognition toolkit. We describe experiments based on training a recurrent neural network with 4 layers of 800 LSTM hidden states on a 100-hour corpora of annotated Polish speech data. We propose a MPI-based modiﬁcation of the training program which minimizes the...

Full text to download in external service

Search

Filters

Catalog

Category

Year

Options

Search results for: SPEECH RECOGNITION