Venue: Lecture Theatre 1, Diamond Building
Note: all keynote talks will be delivered remotely to an in-person audience. There will be no facility to join online.
Speaker: Simon Dobnik, Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Sweden
Talk scheduled for: 10:15am - 11:15am
Title: Beyond pixels and words
Abstract: Words are not used in isolation. When we communicate, we relate them to our background knowledge, to the intent of the interaction – why we are saying something, who our partner is, what has been said before – to our common ground, and to our senses and perception of the physical world and the situations around us. Speech is also not the only way we convey information: we interact in writing and symbols, using different media and forms of communication, with eye gaze, gestures and other properties of our bodies. Language models in language technology extract meaning primarily from text and sometimes from a few other modalities such as images and the acoustic signal. This poses two questions: (i) to what extent can these modalities be a proxy for representing our semantic knowledge for different natural language processing tasks and applications; and (ii) how can we port the semantic knowledge captured by these modalities to other modalities – how can we bring large language models to the real world and take them for a walk? In this talk I will describe our research towards answering these questions and outline the challenges that lie ahead.
Speaker Bio
Simon Dobnik is a Professor of Computational Linguistics at the Department of Philosophy, Linguistics and Theory of Science (FLoV) at the University of Gothenburg, Sweden. He is a member of the Centre for Linguistic Theory and Studies in Probability (CLASP), where he leads the Cognitive Systems research group. He has worked on (i) data models and machine learning of meaning representations for language, action and perception, (ii) semantic models for language, action and perception (computational semantics), (iii) representation learning in language, inference and interpretability, (iv) interpretation and generation of spatial descriptions and reference, (v) interactive learning with small data, (vi) data bias and privacy, and (vii) multimodal dialogue, robotics and related topics.
Speaker: Carlos Busso, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, USA
Talk scheduled for: 2:00pm - 3:00pm
Title: Improving Generalization in Speech Emotion Recognition
Abstract: Emotions play a crucial role in human interactions, shaping our decision-making, our self-expression, and how others respond to us. Emotions are also important for regulating our conversations with others and understanding the message. Designing affect-aware spoken language systems therefore offers substantial benefits, with broad applications in detecting potential threats, assessing customer service quality, and monitoring cognitive states in intelligent tutoring systems. Advances in this area are also relevant in the healthcare domain, where they can facilitate the diagnosis and prognosis of many mental health conditions, including schizophrenia, depression, bipolar disorder, autism spectrum disorder, and post-traumatic stress disorder. A key barrier to deploying affect-aware technology in real-world applications is the lack of generalization of speech emotion recognition (SER) models when recognizing expressive behavior during natural, unrestricted human interaction. The performance of SER models trained in laboratory conditions often drops when they are evaluated in natural, unconstrained settings. What are the factors that prevent the generalization of speech emotion recognition systems? In this seminar, we will discuss key factors that impact the generalization of SER systems across different domains. We will explore how variations in speakers, languages, and interaction styles can significantly affect system performance. In addition to identifying these challenges, we will discuss principled algorithmic approaches designed to improve robustness and adaptability, enabling SER systems to perform reliably in diverse and dynamic real-world settings.
Speaker Bio: Carlos Busso is a Professor at the Language Technologies Institute, Carnegie Mellon University, where he is also the director of the Multimodal Speech Processing (MSP) Laboratory. His research interests are in human-centered multimodal machine intelligence and its applications, focusing on the broad areas of speech processing, affective computing, multimodal behavior generative models, and foundational models for multimodal processing. He was selected by the School of Engineering of Chile as the best electrical engineer to graduate from Chilean universities in 2003. He is a recipient of an NSF CAREER Award. In 2014, he received the ICMI Ten-Year Technical Impact Award. His students received the third prize of the IEEE ITSS Best Dissertation Award (N. Li) in 2015 and the AAAC Student Dissertation Award (W.-C. Lin) in 2024. He also received the Hewlett Packard Best Paper Award at IEEE ICME 2011 (with J. Jain) and the Best Paper Award at AAAC ACII 2017 (with Yannakakis and Cowie). He received the Best of IEEE Transactions on Affective Computing Paper Collection award in 2021 (with R. Lotfian) and the Best Paper Award from IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). In 2023, he received the Distinguished Alumni Award in the Mid-Career/Academia category from the Signal and Image Processing Institute (SIPI) at the University of Southern California. He also received the 2023 ACM ICMI Community Service Award. He is currently a Senior Area Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing. He is a member of AAAC, a senior member of ACM, an IEEE Fellow, and an ISCA Fellow.