The annual SLT CDT conference takes place on Monday 23 June 2025
Venue: Lecture Theatre 1, Diamond Building
Note: all research talks will be delivered in person.
Authors: Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swapnil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, Heidi Christensen (University of Sheffield)
Abstract: The early signs of cognitive decline are often noticeable in conversational speech, and identifying those signs is crucial in dealing with later, more serious stages of neurodegenerative diseases. Clinical detection is costly and time-consuming, and although there has been recent progress in the automatic detection of speech-based cues, those systems are trained on relatively small databases that lack detailed metadata and demographic information. This paper presents CognoSpeak and its associated data collection efforts. CognoSpeak asks long- and short-term memory-probing questions and administers standard cognitive tasks, such as verbal and semantic fluency and picture description, using a virtual agent on a mobile or web platform. In addition, it collects multimodal data such as audio and video, along with a rich set of metadata, from primary and secondary care, memory clinics and remote settings such as people's homes. Here, we present results from 126 subjects whose audio was manually transcribed. Several classic classifiers, as well as large language model-based classifiers, have been investigated and evaluated across the different types of prompts. We demonstrate a high level of performance; in particular, we achieved an F1-score of 0.873 using a DistilBERT model to discriminate people with cognitive impairment (dementia or mild cognitive impairment (MCI)) from healthy volunteers using the memory responses, fluency tasks and Cookie Theft picture description. CognoSpeak is an automatic, remote, low-cost, repeatable, non-invasive and less stressful alternative to existing clinical cognitive assessments.
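As an illustration of the kind of transcript-based classifier described above, the sketch below fine-tunes DistilBERT for a two-class decision (cognitively impaired vs. healthy volunteer) using the Hugging Face Transformers library. The example transcripts, label assignments and hyperparameters are placeholders, not the data or settings used in the paper.

    # Minimal sketch: fine-tuning DistilBERT to classify manually transcribed
    # responses as "cognitively impaired" (1) vs "healthy" (0). Texts, labels
    # and hyperparameters below are illustrative placeholders only.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    # Hypothetical data: one transcribed response per subject, with a binary label.
    texts = ["then she asked me to remember three words ...",
             "the boy is taking a cookie from the jar ..."]
    labels = [1, 0]

    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(labels))
    loader = DataLoader(dataset, batch_size=2, shuffle=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(3):
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()          # cross-entropy over the two classes
            optimizer.step()
            optimizer.zero_grad()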
Authors: Thomas Pickard (University of Sheffield), Aline Villavicencio (University of Exeter), Maggie Mi (University of Sheffield), Wei He (University of Exeter), Dylan Phelps (University of Sheffield), Marco Idiart (Federal University of Rio Grande do Sul)
Abstract: AdMIRe: Advancing Multimodal Idiomaticity Representation is a shared task at SemEval-2025, combining text and images to evaluate language models' processing of idiomatic language. This talk will present the task itself and the results obtained from both participating systems and human annotators. We will also discuss the practicalities of organising and running a challenge like this and our 'lessons learned' for anyone who might consider organising one in the future.
Authors: Xiaozhou Tan (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: I will give an introduction to the application of diffusion models in speech synthesis, and discuss how diffusion-like models (models that iteratively refine their output) can be further explored for speech synthesis.
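To make the idea of iterative refinement concrete, the toy sketch below runs a DDPM-style reverse (denoising) loop over a mel-spectrogram-shaped tensor. The denoiser network, noise schedule and step count are placeholders rather than any particular published speech synthesis system.

    # Toy sketch of DDPM-style iterative refinement for a mel-spectrogram.
    # `denoiser` is a placeholder network that predicts the noise added at step t.
    import torch

    T = 50                                   # number of refinement steps (illustrative)
    betas = torch.linspace(1e-4, 0.02, T)    # noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    def refine(denoiser, shape=(1, 80, 200)):
        x = torch.randn(shape)               # start from pure noise
        for t in reversed(range(T)):
            eps_hat = denoiser(x, torch.tensor([t]))          # predicted noise
            coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
            mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise           # one refinement step
        return x                              # progressively denoised mel-spectrogram

Each pass through the loop removes a little of the noise predicted by the network, so the output is refined gradually rather than generated in a single step.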
Authors: Jinzuomu Zhong (University of Edinburgh), Korin Richmond (University of Edinburgh), Suyuan Liu (University of British Columbia), Dan Wells (University of Edinburgh), Zhiba Su (Independent Researcher), Siqi Sun (University of Edinburgh)
Abstract: While recent Zero-Shot Text-to-Speech (ZS-TTS) models achieve high naturalness and speaker similarity, they fall short in accent fidelity and control - generating hallucinated accents that diverge from the input speech prompt. To address this, we introduce zero-shot accent generation, a new task aimed at synthesising speech with any target content, speaker, and accent. We present AccentBox, the first system capable of this task via a two-stage pipeline. In the first stage, we propose GenAID, a novel Accent Identification model that learns speaker-agnostic accent embeddings, achieving a 0.16 F1-score improvement on unseen speakers. In the second stage, a ZS-TTS model is conditioned on these embeddings, achieving 57.4-70.0% listener preference for accent fidelity, compared to strong baselines. We also advance evaluation methodologies for accent generation. Subjectively, we improve listener guidance with transcriptions and accent difference highlighting, with rigorous listener screening. Objectively, we propose pronunciation-sensitive metrics using vowel formant and phonetic posteriorgram distances, providing more reliable evaluation for underrepresented accents. Looking forward, we aim to expand AccentBox's capabilities to more accents via pseudo-labelling of in-the-wild data, and to improve accent fidelity via formant-guided generation - moving toward fairer and more inclusive speech synthesis for all accents.
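The phonetic posteriorgram (PPG) component of such an objective metric can be illustrated with a short sketch. This is not the paper's exact formulation; it only shows one plausible way to score pronunciation similarity as the average frame-wise cosine distance between two PPGs along a DTW alignment.

    # Illustrative sketch (not the paper's exact metric): a phonetic posteriorgram
    # distance between synthesised and reference speech, computed as the
    # accumulated frame-wise cosine distance along a DTW alignment.
    import numpy as np

    def ppg_distance(ppg_a: np.ndarray, ppg_b: np.ndarray) -> float:
        """ppg_a, ppg_b: (frames, phone_classes) posterior matrices."""
        # Pairwise cosine distance between all frame pairs.
        a = ppg_a / np.linalg.norm(ppg_a, axis=1, keepdims=True)
        b = ppg_b / np.linalg.norm(ppg_b, axis=1, keepdims=True)
        cost = 1.0 - a @ b.T                       # (len_a, len_b)

        # Standard DTW accumulation over the cost matrix.
        n, m = cost.shape
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                     acc[i, j - 1],
                                                     acc[i - 1, j - 1])
        # Normalise by sequence length so scores are comparable across utterances.
        return float(acc[n, m] / (n + m))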
Authors: Hend ElGhazaly (University of Sheffield), Bahman Mirheidari (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Ensuring fairness in Automatic Speech Recognition (ASR) models requires not only reducing biases but also making sure that fairness improvements generalize beyond the training domain. This challenge is particularly relevant for pre-trained models, which have already been trained on large-scale data and may overfit quickly during fine-tuning. In this work, we investigate contrastive learning as a fairness intervention, introducing a contrastive loss term alongside the standard cross-entropy loss to promote gender-invariant speech representations. Our results show that fairness-aware fine-tuning is highly dependent on training data diversity, with contrastive learning proving effective only when applied to diverse and representative datasets. Simply increasing training data without explicitly enforcing fairness does not ensure bias mitigation. Our findings highlight the need for fairness-aware dataset selection and evaluation beyond in-domain settings to build robust and equitable ASR systems.
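A minimal sketch of how a contrastive term might be combined with the standard objective is shown below, assuming pooled utterance embeddings and a binary gender attribute per example; the pairing strategy, similarity function and weighting are illustrative and not the exact formulation used in this work.

    # Illustrative sketch: combining the standard cross-entropy loss with a
    # contrastive term that pulls utterance embeddings of different genders
    # together, encouraging gender-invariant representations. Not the exact
    # objective used in the work.
    import torch
    import torch.nn.functional as F

    def fairness_aware_loss(ce_loss, embeddings, gender, weight=0.1, temperature=0.1):
        """
        ce_loss:     standard cross-entropy loss from the ASR objective
        embeddings:  (batch, dim) pooled encoder representations
        gender:      (batch,) binary attribute labels (0 / 1)
        """
        z = F.normalize(embeddings, dim=1)
        sim = z @ z.T / temperature                         # pairwise similarities
        # Treat cross-gender pairs as "positives" to encourage invariance.
        cross = gender.unsqueeze(0) != gender.unsqueeze(1)  # (batch, batch) bool
        eye = torch.eye(len(gender), dtype=torch.bool, device=z.device)
        logits = sim.masked_fill(eye, float("-inf"))        # ignore self-similarity
        log_prob = F.log_softmax(logits, dim=1)
        pos_count = cross.sum(dim=1).clamp(min=1)
        contrastive = -(log_prob.masked_fill(~cross, 0.0).sum(dim=1) / pos_count).mean()
        return ce_loss + weight * contrastive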
Authors: Shaun Cassini (University of Sheffield), Thomas Hain (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Emphasis plays a key role in spoken communication, conveying intent, emotion, and information structure. It is also a useful attribute for a range of speech technology tasks, such as intent prediction, emotion recognition, and punctuation recovery. Self-supervised speech models (S3Ms) learn general-purpose representations of speech, enabling broad transfer to downstream tasks. However, it remains unclear to what extent S3Ms encode emphasis. Existing studies typically detect only acoustic correlates of emphasis, or fine-tune a single model on an emphasis classification task. In this work, we address three open questions: 1) How is emphasis represented across speech foundation models? 2) How can its presence be quantified? 3) Is emphasis information removed, preserved, or enhanced through downstream fine-tuning? We propose a novel, non-parametric, unitless distance measure for quantifying emphasis encoding, and apply it to a diverse set of S3Ms. Our findings show that emphasis is clearly reflected in model representations, and becomes more accessible after fine-tuning on downstream tasks such as automatic speech recognition.
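The sketch below is not the measure proposed in the work; it only illustrates the kind of unitless, distribution-free score one might compute per layer: the median distance between emphasised and neutral word representations divided by the median within-group distance (a ratio of distances carries no units, and medians avoid distributional assumptions).

    # Illustrative sketch only -- not the measure proposed in the work.
    # Scores how separable emphasised and neutral word representations are
    # in one model layer, as a unitless ratio of median distances.
    import numpy as np

    def emphasis_separability(layer_reps: np.ndarray, is_emphasised: np.ndarray) -> float:
        """
        layer_reps:    (words, dim) pooled representations from one S3M layer
        is_emphasised: (words,) boolean labels for emphasised words
        """
        emph = layer_reps[is_emphasised]
        neut = layer_reps[~is_emphasised]

        between = np.linalg.norm(emph[:, None, :] - neut[None, :, :], axis=-1)
        within_e = np.linalg.norm(emph[:, None, :] - emph[None, :, :], axis=-1)
        within_n = np.linalg.norm(neut[:, None, :] - neut[None, :, :], axis=-1)
        within = np.concatenate([within_e[np.triu_indices(len(emph), k=1)],
                                 within_n[np.triu_indices(len(neut), k=1)]])

        return float(np.median(between) / np.median(within))  # > 1 => more separable

Computing such a score layer by layer, before and after fine-tuning, is one way to track whether emphasis information is removed, preserved, or enhanced.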