Annual Conference 2022 – Keynotes
Venue: Lecture Theatre 1, Diamond Building
Note: all keynote talks will be delivered remotely to an in-person audience. There will be no facility to join online.
Professor Mark Hasegawa-Johnson
University of Illinois, USA
Talk scheduled for: 3:20pm - 4:20pm on Wednesday 8 June
Title: Child-rearing families, language families, and the support for people with disabilities: Thoughts about what it means to be human
Abstract: Intelligent agents communicate data using symbol sequences. Animals communicate social signals acoustically. Only humans do both. In this talk, I'll talk a little about the relationship between speech processing and our common humanity, using three recent projects that I hope you'll find illustrative. First, in order to study the ways in which infants develop emotional bonds with their caregivers, we've developed a wearable device called LittleBeats that records an infant's acoustic environment, movement, and heartbeat. Automatic annotation of preverbal vocalizations (laugh, cry, babble, fuss) permits us to analyze the social interaction patterns of dozens of participant families, each spanning several hours of in-home recording. Visualizations reveal striking inter-familial differences that probably have yet-unknown correlates in the emotional development of the infant. Second, in order to study the relationships among languages, we have attempted to test automatic phone recognizers in languages other than the training languages. If the test language is drawn from a language family that is represented in the training dataset, then standard training methods give the best performance. If not, then it's better to use some type of regularized training that penalizes an estimate of the inter-language performance gap: we find that invariant risk minimization gives the best performance if the phone token boundary times are known, and regret minimization if they are unknown. Third, among several recent projects focused on improved speech recognition for people with disabilities, one of the most recent is focused on the automatic per-frame detection of stutter events. Most corpora of stuttering are labeled at the utterance level, not at the frame level; automatically determining the location of the stutter turns out to be somewhat challenging. Multiple instance learning (max pooling) solves the problem, at the expense of reducing the effective size of the training dataset. Perceptual tests (of the amount of stutter removed, and of the amount of content retained) suggest that the data reduction problem can be solved, at least for some types of stutter events, by pre-training the recognizer using artificially synthesized stutter events.
Mark Hasegawa-Johnson has been on the faculty at the University of Illinois since 1999, where he is currently a Professor of Electrical and Computer Engineering. He received his Ph.D. from MIT in 1996, with a thesis titled 'Formant and Burst Spectral Measures with Quantitative Error Models for Speech Sound Classification', after which he was a postdoc at UCLA from 1996 to 1999.
Professor Hasegawa-Johnson is a Fellow of the Acoustical Society of America and a Senior Member of IEEE and ACM. He is currently Treasurer of ISCA and Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing.
He has published 280 peer-reviewed journal articles and conference papers in the general area of automatic speech analysis, including machine learning models of articulatory and acoustic phonetics, prosody, dysarthria, non-speech acoustic events, audio source separation, and under-resourced languages.
Professor Emine Yilmaz
University College London, UK
Talk scheduled for: 10am - 10:30am on Tuesday 7 June
Title: Research Challenges in Devising the Next Generation Information Retrieval and Access Systems
Abstract: With the introduction of new types of devices in our everyday lives (e.g. smart phones, smart watches, smart glasses), the interfaces over which information retrieval (IR) systems are used are becoming increasingly small, which limits the interactions users may have. Searching on devices with such small interfaces is not easy, as it takes more effort to type and interact with such systems. Hence, building IR systems that can reduce the interactions needed with the device, while providing correct and unbiased information, is critical.
Devising such systems poses several challenges that must be tackled. In the first part of this talk, I will focus on the problems that need to be solved in designing next generation IR systems that can reduce the user effort needed, as well as the progress that we have made in these areas. In the second part of the talk, I will emphasize the importance of detecting online misinformation and bias and describe some of the work we have done to tackle these issues.
Emine Yilmaz is a Professor and Turing Fellow at University College London, Department of Computer Science. She is also an Amazon Scholar.
Her research interests lie in the areas of information retrieval, data mining, and applications of machine learning, probability and statistics. She is a recipient of the Early Career Fellowship from the Engineering and Physical Sciences Research Council (EPSRC). To date, she has received approximately £1.5 million of external funding from funding agencies including the European Union, EPSRC, Google, Elsevier and Bloomberg. She has served in various senior roles, including co-editor-in-chief of the Information Retrieval Journal, member of the editorial board of the AI Journal, and elected member of the executive committee of ACM SIGIR.
Prof Yilmaz was the recipient of the 2015 Karen Spärck Jones Award for her contributions to information retrieval research. She also received a Google Faculty Research Award in 2015 and a Bloomberg Data Science Research Award in 2018.
Dr Emmanuel Vincent
Senior Research Scientist & Head of Science, Inria Nancy - Grand Est
Talk scheduled for: 12pm - 1pm on Wednesday 8 June
Title: Speech anonymization
Abstract: Large-scale collection, storage, and processing of speech data pose severe privacy threats. Indeed, speech encapsulates a wealth of personal data (e.g., age and gender, ethnic origin, personality traits, health and socio-economic status) which can be linked to the speaker's identity via metadata or via automatic speaker recognition. Speech data may also be used for voice spoofing using voice cloning software. With firm backing from privacy legislation such as the European General Data Protection Regulation (GDPR), several initiatives are emerging to develop privacy preservation solutions for speech technology. This talk focuses on voice anonymization, that is, the task of concealing the speaker's voice identity without degrading the utility of the data for downstream tasks. I will i) explain how to assess privacy and utility, ii) describe the two baselines of the VoicePrivacy 2020 and 2022 Challenges and complementary methods based on adversarial learning, differential privacy, or slicing, and iii) conclude by stating open questions for future research.
Emmanuel Vincent is a Senior Research Scientist with Inria. His research interests include statistical machine learning for speech and audio; privacy preservation; learning from little or no labeled data; source localization, speech enhancement, and source separation; and robust speech and speaker recognition.
Dr Bhuvana Ramabhadran
Talk scheduled for: 12pm - 12:30pm on Tuesday 7 June
Title: Large Scale Self/Semi-Supervised Learning for Speech Recognition
Abstract: Supervised learning has been used extensively in speech and language processing over the years. However, as the demands for annotated data increase, the process becomes expensive, time-consuming, and prone to inconsistencies and bias in the annotations. To take advantage of the vast quantities of unlabeled data, semi-supervised and unsupervised learning have been used extensively in the literature. Self-supervised learning, first introduced in the field of computer vision, is used to refer to frameworks that learn labels or targets from the unlabeled input signal. In other words, self-supervised learning makes use of proxy supervised learning tasks, such as contrastive learning, to identify specific parts of the signal that carry information, thereby helping the neural network model to learn robust representations. Recently, self-supervised (pre-training) approaches for speech and audio processing that utilize both unspoken text and untranscribed audio have begun to gain popularity. These approaches cover a broad spectrum of training strategies, such as unsupervised data generation, masking and reconstruction to learn invariant and task-agnostic representations, and at times even guiding this pre-training process with limited supervision. With the advent of models pre-trained on diverse data at scale that can be adapted to a wide range of downstream applications, a new training paradigm is beginning to emerge with an increased focus on utilizing these pre-trained models effectively. This talk provides an overview of the successful self-supervised approaches in speech recognition, lessons learnt and promising future directions.
Bhuvana Ramabhadran (IEEE Fellow 2017, ISCA Fellow 2017) currently leads a team of researchers at Google, focusing on semi-supervised learning for speech recognition and multilingual speech recognition. Previously, she was a Distinguished Research Staff Member and Manager in IBM Research AI, at the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where she led a team of researchers in the Speech Technologies Group and coordinated activities across IBM’s worldwide laboratories in the areas of speech recognition, synthesis, and spoken term detection. She has served as the Area Chair for ICASSP (2011–2018), on the editorial board of the IEEE Transactions on Audio, Speech, and Language Processing (2011–2015), and on the IEEE SPS conference board (2017–2018). She also serves on the International Speech Communication Association (ISCA) board. She has published over 150 papers and been granted over 40 U.S. patents. Her research interests include speech recognition and synthesis algorithms, statistical modeling, signal processing, and machine learning. Some of her recent work has focused on the use of speech synthesis to improve core speech recognition performance and self-supervised learning.
Dr Duc Le
Talk scheduled for: 3pm - 3:30pm on Tuesday 7 June
Title: Automatic Speech Recognition Research at Meta
Abstract: Speech and audio technology plays a key role in many products at Meta, such as video captioning, Instagram stories creation, Portal devices for smart home conferencing, Oculus VR headsets, and Ray-Ban Stories smart glasses. These products require many technologies to come together and drive various research directions we are pursuing. In this talk, I will give an overview of the automatic speech recognition (ASR) research we have done at Meta, including the challenges and constraints that we faced and how we tackled them. I will also outline some of our long-term research visions and discuss potential avenues of collaboration between Meta and the research community.
Duc Le is a Research Scientist Manager at Meta AI Speech, where he supports a team of researchers and engineers working to push the state of the art in automatic speech recognition (ASR) and its applications. His research interests include end-to-end architectures, rare word and entity recognition, contextual biasing, language model fusion, multilingual ASR, and spoken language understanding. He received his Ph.D. from the University of Michigan, where he worked on automatic speech-language assessment and speech emotion recognition.