Search results for: AUDIO-VISUAL SPEECH RECOGNITION - Bridge of Knowledge

  • Pursuing Listeners’ Perceptual Response in Audio-Visual Interactions - Headphones vs Loudspeakers: A Case Study

    Publication

This study investigates listeners’ perceptual responses in audio-visual interactions concerning binaural spatial audio. Audio stimuli are presented to the listeners with or without accompanying visual cues. The subjective test participants are asked to indicate the direction of the incoming sound while listening to the audio stimulus via loudspeakers or via headphones with a head-related transfer function (HRTF) plugin. First, the methodology...

    Full text available to download

  • Gaze-tracking based audio-visual correlation analysis employing quality of experience methodology

This paper investigates a new approach to audio-visual correlation assessment based on the gaze-tracking system developed at the Multimedia Systems Department (MSD) of Gdansk University of Technology (GUT). The gaze-tracking methodology, rooted in Human-Computer Interaction, borrows relevance feedback through gaze-tracking and applies it to a new area of interest: Quality of Experience. Results of subjective...

    Full text to download in external service

  • Audio-visual surveillance system for application in bank operating room

An audio-visual surveillance system able to detect, classify, and localize acoustic events in a bank operating room is presented. Algorithms for the detection and classification of abnormal acoustic events, such as screams or gunshots, are introduced. Two types of detectors are employed to detect impulsive sounds and vocal activity. A Support Vector Machine (SVM) classifier is used to discern between the different classes of acoustic...

  • A survey of automatic speech recognition deep models performance for Polish medical terms

    Among the numerous applications of speech-to-text technology is the support of documentation created by medical personnel. There are many available speech recognition systems for doctors. Their effectiveness in languages such as Polish should be verified. In connection with our project in this field, we decided to check how well the popular speech recognition systems work, employing models trained for the general Polish language....

    Full text to download in external service

  • Hybrid of Neural Networks and Hidden Markov Models as a modern approach to speech recognition systems

The aim of this paper is to present a hybrid algorithm that combines the advantages of artificial neural networks and hidden Markov models in speech recognition for control purposes. The scope of the paper includes a review of currently used solutions and a description and analysis of the implementation of selected artificial neural network (NN) structures and hidden Markov models (HMM). The main part of the paper consists of a description...

    Full text available to download

  • Automatic audio-visual threat detection

    Publication

    - Year 2010

The concept, practical realization, and application of a system for the detection and classification of hazardous situations based on multimodal sound and vision analysis are presented. The device consists of a new kind of multichannel miniature sound intensity sensor, digital Pan-Tilt-Zoom and fixed cameras, and a bundle of signal processing algorithms. The simultaneous analysis of multimodal signals can significantly improve the accuracy...

  • Emotions in polish speech recordings

    Open Research Data
    open access

The data set presents emotions recorded in sound files containing expressions of Polish speech. The statements were made by five men aged 21–23 (young voices). Each person said the following words (nie – no, oddaj – give back, podaj – pass, stop – stop, tak – yes, trzymaj – hold) five times, each time expressing a specific emotion, one of three: anger (a),...

  • Analysis of 2D Feature Spaces for Deep Learning-based Speech Recognition

    Publication

    - JOURNAL OF THE AUDIO ENGINEERING SOCIETY - Year 2018

...convolutional neural network (CNN), which is a class of deep, feed-forward artificial neural networks. We decided to analyze audio signal feature maps, namely spectrograms, linear and Mel-scale cepstrograms, and chromagrams. The choice was made upon the fact that CNNs perform well in 2D data-oriented processing contexts. Feature maps were employed in a Lithuanian word recognition task. The spectral analysis led to the highest word...
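As an aside to the entry above: the 2D feature maps it mentions are, at their simplest, magnitude spectrograms. A minimal pure-Python sketch (not the paper's implementation; the naive DFT, frame length, and hop size are illustrative assumptions) of turning a waveform into a time-frequency "image" a CNN could consume:

```python
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram via a naive DFT; frame_len and hop are assumed values."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Naive DFT magnitude for bins 0..frame_len//2 (real input spectrum is symmetric)
        mags = []
        for k in range(frame_len // 2 + 1):
            re = sum(x * math.cos(-2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            im = sum(x * math.sin(-2 * math.pi * k * n / frame_len)
                     for n, x in enumerate(frame))
            mags.append(math.hypot(re, im))
        frames.append(mags)
    return frames  # list of frames, each a list of frequency-bin magnitudes

# A sinusoid completing 8 cycles per 64-sample frame peaks in DFT bin 8
sig = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(sig)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])  # -> 8
```

In practice an FFT and a perceptual scale (Mel, as in the cepstrograms above) replace the naive DFT, but the framing-then-transform structure is the same.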

  • Marek Blok dr hab. inż.

    People

Marek Blok graduated in 1994 from the Faculty of Electronics at Gdansk University of Technology, receiving his M.Sc. in telecommunications. In 2003 he received his Ph.D. and in 2017 his D.Sc. in telecommunications from the Faculty of Electronics, Telecommunications and Informatics of Gdańsk University of Technology. His research interests are focused on applications of digital signal processing in telecommunications. He provides lectures, laboratory...

  • Michał Lech dr inż.

    People

Michał Lech was born in Gdynia in 1983. In 2007 he graduated from the Faculty of Electronics, Telecommunications and Informatics of Gdansk University of Technology. In June 2013 he received his Ph.D. degree. The subject of the dissertation was: “A Method and Algorithms for Controlling the Sound Mixing Processes by Hand Gestures Recognized Using Computer Vision”. The main focus of the thesis was the bias of audio perception caused...

  • Artur Gańcza mgr inż.

    I received the M.Sc. degree from the Gdańsk University of Technology (GUT), Gdańsk, Poland, in 2019. I am currently a Ph.D. student at GUT, with the Department of Automatic Control, Faculty of Electronics, Telecommunications and Informatics. My professional interests include speech recognition, system identification, adaptive signal processing and linear algebra.

  • Vowel recognition based on acoustic and visual features

The article presents a method that can facilitate speech learning for people with hearing impairments. The developed vowel recognition system uses a joint analysis of the acoustic and visual parameters of the speech signal. The acoustic parameters are based on mel-cepstral coefficients. Active Shape Models were used to determine the visual parameters from the shape and movement of the lips. An artificial neural network was used as the classifier. The operation of the system...

    Full text available to download

  • ALOFON corpus

The ALOFON corpus is one of the multimodal databases of word recordings in English available at http://www.modality-corpus.org/. The ALOFON corpus is oriented towards recording variants of speech equivalence. For this purpose, a total of 7 people who are native English speakers or speak English with native-speaker fluency, representing a variety of Standard Southern British...

  • Speech recognition system for hearing impaired people.

    Publication

    - Year 2005

The paper presents the results of research on speech recognition. The system under development, which uses visual and acoustic data, will facilitate correct-speech training for people after cochlear implant surgery and others with severe hearing impairments. Active Shape Models were used to determine visual parameters based on the analysis of lip shape and movement in video recordings. The acoustic parameters are based on...

  • Piotr Szczuko dr hab. inż.

Piotr Szczuko received his M.Sc. degree in 2002. His thesis examined correlation phenomena between the perception of sound and vision for surround sound and digital images. He finished his Ph.D. studies in 2007 and one year later completed a dissertation, "Application of Fuzzy Rules in Computer Character Animation", which received the award of the Prime Minister of Poland. His interests include: processing of audio and video, computer...

  • IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING

    Journals

    ISSN: 1063-6676

  • Audiovisual speech recognition for training hearing impaired patients

    Publication

The paper presents a system for recognizing isolated speech sounds using visual and acoustic data. Active Shape Models were used to determine visual parameters based on the analysis of lip shape and movement in video recordings. The acoustic parameters are based on mel-cepstral coefficients. A neural network was used to recognize the uttered sounds from a feature vector containing both types...

  • IEEE Transactions on Audio Speech and Language Processing

    Journals

    ISSN: 1558-7916

  • Combined Single Neuron Unit Activity and Local Field Potential Oscillations in a Human Visual Recognition Memory Task

    Publication
    • M. T. Kucewicz
    • B. M. Berry
    • M. R. Bower
    • J. Cymbalnik
    • V. Svehlik
    • S. M. Stead
    • G. A. Worrell

    - IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING - Year 2016

GOAL: Activities of neuronal networks range from the action potential firing of individual neurons to coordinated oscillations of local neuronal assemblies and distributed neural populations. Here, we describe recordings using hybrid electrodes, containing both micro- and clinical macroelectrodes, to simultaneously sample both large-scale network oscillations and single-neuron spiking activity in the medial temporal lobe structures...

    Full text to download in external service

  • Comparison of Language Models Trained on Written Texts and Speech Transcripts in the Context of Automatic Speech Recognition

    Publication
    • S. Dziadzio
    • A. Nabożny
    • A. Smywiński-Pohl
    • B. Ziółko

    - Year 2015

    Full text to download in external service

  • Auditory-model based robust feature selection for speech recognition

    Publication

    - Journal of the Acoustical Society of America - Year 2010

    Full text to download in external service

  • Human-Computer Interface Based on Visual Lip Movement and Gesture Recognition

The multimodal human-computer interface (HCI) called LipMouse is presented, allowing a user to work on a computer using movements and gestures made with his/her mouth only. Algorithms for lip movement tracking and lip gesture recognition are presented in detail. User face images are captured with a standard webcam. Face detection is based on a cascade of boosted classifiers using Haar-like features. A mouth region is located in...

    Full text to download in external service

  • Bożena Kostek prof. dr hab. inż.

  • IEEE-ACM Transactions on Audio Speech and Language Processing

    Journals

    ISSN: 2329-9290

  • Intra-subject class-incremental deep learning approach for EEG-based imagined speech recognition

    Publication

    - Biomedical Signal Processing and Control - Year 2023

    Brain–computer interfaces (BCIs) aim to decode brain signals and transform them into commands for device operation. The present study aimed to decode the brain activity during imagined speech. The BCI must identify imagined words within a given vocabulary and thus perform the requested action. A possible scenario when using this approach is the gradual addition of new words to the vocabulary using incremental learning methods....

    Full text to download in external service

  • Adaptive system for recognition of sounds indicating threats to security of people and property employing parallel processing of audio data streams

    Publication

    - Year 2015

A system for the recognition of threatening acoustic events employing parallel processing on a supercomputing cluster is featured. The methods for the detection, parameterization, and classification of acoustic events are introduced. The recognition engine is based on threshold-based detection with an adaptive threshold and Support Vector Machine classification. Spectral, temporal, and mel-frequency descriptors are used as signal features. The...
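The threshold-based detection with an adaptive threshold mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the smoothing factor `alpha`, and the `margin` multiplier are all assumed values:

```python
def detect_events(frame_energies, alpha=0.95, margin=3.0):
    """Flag frame indices whose energy exceeds an adaptively tracked noise floor.

    alpha  - smoothing factor of the running noise estimate (assumed value)
    margin - how many times above the noise floor counts as an event (assumed value)
    """
    noise_floor = frame_energies[0]
    events = []
    for i, energy in enumerate(frame_energies):
        if energy > margin * noise_floor:
            events.append(i)  # event frame: do not absorb it into the noise floor
        else:
            # Slowly adapt the noise floor to the current ambient level
            noise_floor = alpha * noise_floor + (1 - alpha) * energy
    return events

# Quiet background around 1.0 with a loud burst at frames 5-6
energies = [1.0, 1.1, 0.9, 1.0, 1.05, 9.0, 8.5, 1.0, 1.1]
print(detect_events(energies))  # prints [5, 6]
```

Skipping the floor update during detected events is what keeps a long, loud event from raising the threshold and masking itself.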

  • EURASIP Journal on Audio Speech and Music Processing

    Journals

    ISSN: 1687-4714 , eISSN: 1687-4722

  • Piotr Odya dr inż.

Piotr Odya was born in Gdansk in 1974. He received his M.Sc. in 1999 from the Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland. His thesis was related to the problem of sound quality improvement in the contemporary broadcasting studio. He is interested in video editing and multichannel sound systems. Mr. Odya's Ph.D. thesis concerned methods and algorithms for correcting...

  • IEEE Automatic Speech Recognition and Understanding Workshop

    Conferences

  • Jan Daciuk dr hab. inż.

    Jan Daciuk received his M.Sc. from the Faculty of Electronics of Gdansk University of Technology in 1986, and his Ph.D. from the Faculty of Electronics, Telecommunications and Informatics of Gdańsk University of Technology in 1999. He has been working at the Faculty from 1988. His research interests include finite state methods in natural language processing and computational linguistics including speech processing. Dr. Daciuk...

  • ISCA Tutorial and Research Workshop Automatic Speech Recognition

    Conferences

  • Introduction to the special issue on machine learning in acoustics

    Publication
    • Z. Michalopoulou
    • P. Gerstoft
    • B. Kostek
    • M. A. Roch

    - Journal of the Acoustical Society of America - Year 2021

When we started our Call for Papers for a Special Issue on “Machine Learning in Acoustics” in the Journal of the Acoustical Society of America, our ambition was to invite papers in which machine learning was applied to all areas of acoustics. The areas listed included, but were not limited to, the following: • Music and synthesis analysis • Music sentiment analysis • Music perception • Intelligent music recognition • Musical source separation • Singing...

    Full text available to download

  • Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

    Publication

    - Year 2021

    This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically de...

    Full text available to download

  • Enhanced voice user interface employing spatial filtration of signals from acoustic vector sensor

Spatial filtration of sound is introduced to enhance speech recognition accuracy in noisy conditions. An acoustic vector sensor (AVS) is employed. The signals from the AVS probe are processed in order to attenuate the surrounding noise. As a result, the signal-to-noise ratio is increased. An experiment is featured in which speech signals are disturbed by babble noise. The signals before and after spatial filtration are processed...

    Full text to download in external service
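The signal-to-noise-ratio improvement claimed in the entry above is easy to quantify when the signal and noise components are available separately. A generic SNR-in-dB sketch (an illustration of the metric only, not the paper's AVS processing; the halved noise amplitude is a made-up example):

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from sample lists of signal and noise."""
    p_signal = sum(x * x for x in signal) / len(signal)  # mean signal power
    p_noise = sum(x * x for x in noise) / len(noise)     # mean noise power
    return 10 * math.log10(p_signal / p_noise)

# A spatial filter that halves the noise amplitude raises SNR by 10*log10(4) ~ 6 dB
sig = [math.sin(0.1 * n) for n in range(1000)]
noise_before = [0.5 * math.sin(2.7 * n) for n in range(1000)]
noise_after = [0.25 * math.sin(2.7 * n) for n in range(1000)]
gain = snr_db(sig, noise_after) - snr_db(sig, noise_before)
```

Halving amplitude quarters power, hence the roughly 6 dB gain regardless of the signal itself.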

  • Investigating Feature Spaces for Isolated Word Recognition

    Publication

    - Year 2018

Much attention has been given by researchers to the speech processing task in automatic speech recognition (ASR) over the past decades. The study addresses the appropriateness of two-dimensional representations of speech feature spaces for speech recognition tasks based on deep learning techniques. The approach combines Convolutional Neural Networks (CNNs) and time-frequency signal representation...

  • WYKORZYSTANIE SIECI NEURONOWYCH DO SYNTEZY MOWY WYRAŻAJĄCEJ EMOCJE

    Publication

    - Year 2018

This article presents an analysis of speech-based emotion recognition solutions and the possibilities of using them in emotional speech synthesis with neural networks. Current solutions for recognizing emotions in speech and methods of speech synthesis using neural networks are presented. A significant increase in the interest in and use of deep learning can currently be observed in applications related...

  • Objectivization of audio-video correlation assessment experiments

    Publication

    - Year 2010

The purpose of this paper is to present a new method of conducting audio-visual correlation analysis employing a head-motion-free gaze-tracking system. First, a review of related works in the domain of sound and vision correlation is presented. Then, assumptions concerning audio-visual scene creation are briefly described. The objectivization process of carrying out correlation tests employing the gaze-tracking system is outlined....

    Full text to download in external service

  • Intelligent multimedia solutions supporting special education needs.

The role of computers in school education is briefly discussed. The development history of multimodal interfaces is briefly reviewed. Examples of applications of multimodal interfaces for learners with special educational needs are presented, including an interactive electronic whiteboard based on video image analysis, an application for controlling computers with facial expressions, and a speech-stretching audio interface representing the audio modality....

  • Intelligent video and audio applications for learning enhancement

The role of computers in school education is briefly discussed. The development history of multimodal interfaces is briefly reviewed. Examples of applications of multimodal interfaces for learners with special educational needs are presented, including an interactive electronic whiteboard based on video image analysis, an application for controlling computers with facial expressions, and a speech-stretching audio interface representing the audio modality....

    Full text to download in external service

  • Józef Kotus dr hab. inż.

  • Investigating Feature Spaces for Isolated Word Recognition

    Publication
    • P. Treigys
    • G. Korvel
    • G. Tamulevicius
    • J. Bernataviciene
    • B. Kostek

    - Year 2020

The study addresses issues related to the appropriateness of a two-dimensional representation of the speech signal for speech recognition tasks based on deep learning techniques. The approach combines Convolutional Neural Networks (CNNs) and time-frequency signal representation converted to the investigated feature spaces. In particular, waveforms and fractal dimension features of the signal were chosen for the time domain, and...

    Full text to download in external service

  • Testing A Novel Gesture-Based Mixing Interface

In a digital audio workstation, in contrast to the traditional mouse-and-keyboard computer interface, hand gestures can be used to mix audio with eyes closed. Mixing with a visual representation of audio parameters during the experiments led to a broadening of the panorama and a more intensive use of shelving equalizers. Listening tests proved that the use of hand gestures produces mixes that are aesthetically as good as those obtained using...

    Full text available to download

  • Analysis of Lombard speech using parameterization and the objective quality indicators in noise conditions

    Publication

    - Year 2018

The aim of the work is to analyze the Lombard speech effect in recordings and then modify the speech signal in order to improve objective speech quality indicators after mixing the useful signal with noise or with an interfering signal. The modifications made to the signal are based on the characteristics of Lombard speech, in particular on the effect of an increased fundamental frequency...

  • Biometria i przetwarzanie mowy 2023

    e-Learning Courses
    • J. Daciuk

The aim of the course is to familiarize students with: methods of establishing and confirming people's identity based on measurable characteristics of the body; the features of human speech, Polish in particular; speech recognition methods; and speech synthesis methods.

  • Biometria i przetwarzanie mowy 2024

    e-Learning Courses
    • J. Daciuk

The aim of the course is to familiarize students with: methods of establishing and confirming people's identity based on measurable characteristics of the body; the features of human speech, Polish in particular; speech recognition methods; and speech synthesis methods.

  • An Attempt to Create Speech Synthesis Model That Retains Lombard Effect Characteristics

    Publication

    - Year 2019

Speech with the Lombard effect has been extensively studied in the context of speech recognition and speech enhancement. However, few studies have investigated the Lombard effect in the context of speech synthesis. The aim of this paper is to create a mathematical model that retains the Lombard effect. Such models could be used as the basis of a formant speech synthesizer. The proposed models are based on dividing...

    Full text available to download

  • Auditory-visual attention stimulator

A new approach to the formation of lateralization irregularities is proposed. The emphasis is put on the relationship between visual and auditory attention stimulation. In this approach, hearing is stimulated using time-scale-modified speech and sight is stimulated by rendering the text of the currently heard speech. Moreover, the displayed text is modified using several techniques, e.g., zooming and highlighting. In the experimental part of...

    Full text to download in external service

  • Analiza stanu nawierzchni i klas pojazdów na podstawie parametrów ekstrahowanych z sygnału fonicznego

The aim of the research is to find feature vector parameters extracted from the audio signal for the automatic recognition of the road surface condition and the vehicle type. First, the influence of weather conditions on the spectral characteristics of the audio signal recorded for passing vehicles is presented. Then, the audio signal is parameterized and a correlation analysis is carried out in order to...

    Full text available to download

  • SYNTHESIZING MEDICAL TERMS – QUALITY AND NATURALNESS OF THE DEEP TEXT-TO-SPEECH ALGORITHM

The main purpose of this study is to develop a deep text-to-speech (TTS) algorithm designed for an embedded system device. First, a critical literature review of state-of-the-art deep speech synthesis models is provided. The algorithm implementation covers both hardware and algorithmic solutions. The algorithm is designed for use with the Raspberry Pi 4 board. Eighty synthesized sentences were prepared based on medical and everyday...

    Full text available to download

  • Ranking Speech Features for Their Usage in Singing Emotion Classification

    Publication

    - Year 2020

    This paper aims to retrieve speech descriptors that may be useful for the classification of emotions in singing. For this purpose, Mel Frequency Cepstral Coefficients (MFCC) and selected Low-Level MPEG 7 descriptors were calculated based on the RAVDESS dataset. The database contains recordings of emotional speech and singing of professional actors presenting six different emotions. Employing the algorithm of Feature Selection based...

    Full text available to download
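The Mel Frequency Cepstral Coefficients (MFCC) recurring throughout these entries rest on the mel scale. A minimal sketch of the standard hertz-to-mel mapping and its inverse, as commonly used to place MFCC filter-bank centers (the 2595/700 constants are the usual textbook convention, not values taken from any paper listed above):

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale mapping used when building MFCC filter banks."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place triangular filters back on the Hz axis."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centers of, e.g., 10 filters spaced evenly in mel between 0 Hz and 8 kHz
n_filters = 10
top_mel = hz_to_mel(8000.0)
centers_hz = [mel_to_hz(top_mel * (i + 1) / (n_filters + 1))
              for i in range(n_filters)]
```

Because the mapping is logarithmic above a few hundred Hz, the resulting filter centers crowd the low frequencies, mimicking the ear's resolution, which is exactly why MFCC features suit the speech and singing classification tasks described above.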