Annual SLT CDT conference is on Monday 23 June 2025
Venue: Basement exhibition space, Diamond Building
Note: all poster presentations will be delivered in person.
Authors: Ariadna Sanchez (Centre for Speech Technology Research, University of Edinburgh), Simon King (Centre for Speech Technology Research, University of Edinburgh)
Abstract: Redacted
Authors: Aaron Fletcher (University of Sheffield), Mark Stevenson (University of Sheffield)
Abstract: Conducting efficient systematic reviews is crucial given the growing volume of scientific literature. Technology Assisted Review (TAR) offers tools to streamline this process, but traditional stopping approaches focus on achieving specific target recall levels, which may not always satisfy the user's actual information need. This paper introduces utility-based stopping algorithms, including Harmonic Shrinkage and Confidence Boundary, that prioritise the value of retrieved information in addressing a defined information need. These algorithms analyse statistical properties of point estimates, such as sensitivity and false positive rate, to determine when sufficient evidence has been gathered. Experiments using the CLEF 2017-2019 datasets demonstrate that these moment-based approaches can substantially reduce the percentage of reviewed relevant documents (PRD) while maintaining high decision agreement, compared to traditional target-recall approaches. This research advances TAR methodologies by shifting from recall-oriented stopping rules to utility-driven approaches, aligning stopping decisions with users' information needs to enhance the efficiency and effectiveness of document review.
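For readers unfamiliar with statistical stopping rules, the sketch below illustrates the general idea of a point-estimate-based stopping check: screening stops once the lower bound of a Wilson confidence interval on estimated sensitivity reaches a target. It is a generic illustration only; the Harmonic Shrinkage and Confidence Boundary algorithms are not specified in the abstract, and the function names and the 0.95 target are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    if n == 0:
        return 0.0, 1.0
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - margin, centre + margin

def should_stop(relevant_found, relevant_in_sample, target_sensitivity=0.95):
    """Stop screening once the lower confidence bound on estimated sensitivity
    (relevant documents already retrieved / relevant documents in a labelled
    sample) reaches the target. Threshold is a hypothetical example."""
    lower, _ = wilson_interval(relevant_found, relevant_in_sample)
    return lower >= target_sensitivity
```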
Authors: Jacqueline Rowe (University of Edinburgh), Alexandra Birch (University of Edinburgh), Shannon Vallor (University of Edinburgh), Liane Guillou (University of Edinburgh / Aveni), Mateusz Klimaszewski (University of Edinburgh / University of Warsaw)
Abstract: Large language models (LLMs) encode and sometimes exacerbate gender biases present in their training texts, resulting in stereotypical reasoning, biased text generation and translation, and discriminatory outcomes when LLMs are applied to a range of contexts and tasks. While much work has been done on measuring and benchmarking gender biases in LLMs' English-language outputs, there have been only a few investigations into LLM gender bias in non-English languages, and these have been limited to select languages and tasks. With most widely-used language models now supporting dozens, if not hundreds, of languages, it is important to understand how gender biases might persist or change across different linguistic contexts to ensure that appropriate mitigations for these biases can be implemented when such models are used in practice. In this work, we extend an existing dataset for measuring multilingual gender bias across 29 languages, and use this new benchmark to compare gender biases across languages and across models, exploring whether model size, the presence or absence of grammatical gender, and training data composition affect the degree of gender bias in model outputs across different languages. This poster will present our methods and key findings.
Authors: Eyüp Sezer (Fenerbahçe Üniversitesi), Dilan Selen Akıl (Bahcelievler Ozel Egitim Merkezi), Kemal Eren Cengiz (Alrenas Technology AS.)
Abstract: The super nasal oral ratiometry system (SNORS) is a commercially available system which measures both nasal and oral airflow during speech, allowing the rapid movement of the velum to be measured. SNORS can be used for objective assessment, where the subject is required to speak a small number of words chosen to demonstrate velopharyngeal function. SNORS also provides biofeedback, using a simple real-time display of nasal and oral airflow during the procedure. Velopharyngeal insufficiency/incompetency (VPI) is the inability to make adequate velopharyngeal closure, and it results in abnormal speech characteristics, such as omissions, substitutions or weak articulation of consonants, and hypernasality. The patient, B. A., a 19-year-old male, had mild to moderate hypernasal speech and nasalance with no known medical history. Initial oral motor examination revealed sufficient velar movement, though without complete velopharyngeal sealing. The SNORS device was successfully used as both an assessment and a therapeutic tool in the management of this patient. Moreover, the effectiveness of conventional speech and language therapy was compared with that of SNORS biofeedback therapy. Initially, while there was some movement of the velum, the patient could not achieve velopharyngeal closure. Conventional therapy aimed to strengthen and improve the function of the velum, and following this there was some minimal improvement: the patient could now achieve, but not maintain, closure as needed in discourse. SNORS biofeedback therapy was then given as an alternative. This raised the patient's awareness of his velopharyngeal function, thus helping him to maintain closure and thereby reducing hypernasality. SNORS therapy proved significantly more effective than conventional speech and language therapy in this case. Key words: hypernasality, nasalance, nasal emission, velopharyngeal sealing.
Authors: Anthony Hughes (University of Sheffield), Ning Ma (University of Sheffield), Nikolaos Aletras (University of Sheffield)
Abstract: In medical and legal settings the appropriate management of data is essential. Protective laws, like HIPAA and GDPR, enforce that individuals' data is not leaked into the public domain. Although essential for confidentiality, this inhibits data sharing, consequently limiting access to potentially critical intelligence. Given that language models (LMs) have shown outstanding performance in text summarization, understanding to what extent LMs can provide privacy-preserving summaries given a non-private source document would be of great value to intelligence communities. In this paper, we perform a comprehensive study across 3 closed- and 2 open-weight LMs of different sizes and families. We experiment with prompting and fine-tuning strategies for privacy preservation across a range of summarization datasets. Our extensive quantitative and qualitative analysis, including human evaluation, shows that LMs often cannot prevent personally identifiable information (PII) leakage in their summaries and that current widely-used metrics cannot capture context-dependent privacy risks.
Authors: Amit Meghanani (University of Sheffield), Thomas Hain (University of Sheffield)
Abstract: Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named "LASER: Learning by Aligning Self-supervised Representations" is presented. LASER is based on the soft-DTW alignment loss with a temporal regularisation term. Experiments are conducted with HuBERT and WavLM models and evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). Relative improvements of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM, are observed for the ASR and PR tasks respectively, with less than 3 hours of fine-tuning on a single GPU.
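As background, soft-DTW is the differentiable alignment cost that LASER builds on. The sketch below shows the standard soft-DTW recursion for two sequences of frame vectors; it is a generic reference implementation, not the authors' LASER loss, and it omits the temporal regularisation term mentioned above.

```python
import numpy as np

def soft_min(values, gamma):
    # Differentiable soft minimum used by soft-DTW (log-sum-exp form).
    values = np.asarray(values) / -gamma
    m = values.max()
    return -gamma * (m + np.log(np.exp(values - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between two sequences of frame vectors
    x (n, d) and y (m, d), using squared Euclidean frame distances."""
    n, m = len(x), len(y)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)
            R[i, j] = cost + soft_min([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma)
    return R[n, m]
```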
Authors: Paul Gering (University of Sheffield), Heidi Christensen (University of Sheffield), Roger K Moore (University of Sheffield)
Abstract: Recent research has used machine learning approaches to implement an automatic, quantitative measure of interaction quality. Schmitt and colleagues used the LEGO spoken dialogue corpus, a publicly available database of telephone-based human-agent interactions, to predict turn-level interaction quality using SVM and LSTM classifiers. These classifiers were trained on automatically derived features relating to the performance of the dialogue system's automatic speech recognition (ASR) and dialogue management (DM) subcomponents. This project aims to extend the work of Schmitt and colleagues by training machine learning models on linguistic and paralinguistic speech features, such as pauses, intonation and turn-taking, from the LEGO corpus to predict interaction quality. These speech features have been used effectively for similar machine learning problems, such as engagement estimation and emotion detection. Additionally, it is assumed that individuals use these speech features to evaluate everyday interactions. Therefore, the performance of the trained models is expected to surpass those implemented by Schmitt and colleagues. This is the first of three planned experiments focusing on implementing an automatic, quantitative, local measure of interaction quality.
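As a rough illustration of the planned setup (not the project's actual pipeline), a turn-level interaction-quality classifier could be sketched as follows; the feature files, labels and hyperparameters are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical turn-level speech features (e.g. pause durations, F0 statistics,
# turn-taking gaps) and interaction-quality ratings, one row per dialogue turn.
X = np.load("turn_level_speech_features.npy")
y = np.load("interaction_quality_labels.npy")

# SVM classifier with feature standardisation, evaluated by cross-validation.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, X, y, cv=5).mean())
```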
Authors: Valeria Pastorino (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Framing in media critically shapes public perception by selectively emphasising some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.
Authors: Wing-Zin Leung (University of Sheffield), Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield), Stefan Goetze (South Westphalia University of Applied Sciences, Iserlohn, Germany)
Abstract: Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. However, progress in dysarthric ASR (DASR) has been limited by high variability in dysarthric speech and limited public availability of dysarthric training data. This paper demonstrates that data augmentation using text-to-dysarthric-speech (TTDS) synthesis for fine-tuning large ASR models is effective for DASR. Specifically, diffusion-based text-to-speech (TTS) models can produce speech samples similar to dysarthric speech that can be used as additional training data for fine-tuning ASR foundation models, in this case Whisper. Results show improved synthesis metrics and ASR performance for the proposed multi-speaker diffusion-based TTDS data augmentation for ASR fine-tuning compared to current DASR baselines.
Authors: Jason Clarke (University of Sheffield), Yoshihiko Gotoh (University of Sheffield), Stefan Goetze (South Westphalia University of Applied Sciences, Iserlohn, Germany)
Abstract: Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN), which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding relative improvements in mAP of 14.5% and 10.3% on the Ego4D benchmark, respectively.
Authors: Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Generative speech enhancement aims to improve the quality of speech recordings in noisy environments by learning a probability path between the noisy- and clean-speech distributions. Flow models learn an abstract velocity field that transports noisy samples to the clean distribution, effectively extracting the speech signal. Computing the velocity field requires differential equation solvers, which tie sample quality to the number of inference steps and result in slow inference. The best form of flow model for speech enhancement is unknown, and few one-step methods are available. In particular, we investigate the existing Schrödinger bridge and provide our own formulation of independent conditional flow matching for speech enhancement. Our work shows that flow models can generate high-quality samples in one step, particularly when the variance is time-invariant. We show that independent conditional flow matching models for speech enhancement produce good samples and, because the clean data is always part of the training objective, it is straightforward to adjust inference to accommodate one-step generation. Our findings provide a faster and more effective speech enhancement method than the baselines. Conditional flow matching is particularly useful thanks to its linear gradient and static variance, which are constant throughout the probability path, unlike those of Schrödinger bridges and diffusion models. Being able to run flow-model inference in one step has the potential to enable high-quality speech enhancement in real-time scenarios such as telephone calls.
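To make the objective concrete, the sketch below shows a generic independent conditional flow matching training step and a single-Euler-step enhancement pass. It assumes (batch, time) waveform tensors, a model(x, t, cond) interface and a fixed noise level sigma; it is an illustration of the technique, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(model, clean, noisy, sigma=0.05):
    """Generic independent conditional flow matching step (assumed interfaces).
    The probability path interpolates linearly between the noisy recording
    (t=0) and the clean speech (t=1) with a small time-invariant variance;
    the regression target is the constant velocity clean - noisy."""
    t = torch.rand(clean.shape[0], 1, device=clean.device)      # per-example time in [0, 1)
    x_t = (1 - t) * noisy + t * clean + sigma * torch.randn_like(clean)
    target_velocity = clean - noisy
    pred_velocity = model(x_t, t.squeeze(-1), cond=noisy)        # assumed model signature
    return F.mse_loss(pred_velocity, target_velocity)

def one_step_enhance(model, noisy):
    """Single Euler step from t=0 to t=1: x_1 is approximated by x_0 + v(x_0, 0)."""
    t0 = torch.zeros(noisy.shape[0], device=noisy.device)
    return noisy + model(noisy, t0, cond=noisy)
```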
Authors: Yaseen Mohammed Osman (University of Southampton), Stuart E. Middleton (University of Southampton), Geoff V. Merrett (University of Southampton)
Abstract: In this study, we conducted a series of comparative experiments to quantify the impact of rapid fine-tuning of large language models (LLMs), using SQuAD 2.0 as our benchmark. Based on our experiments, we observe that rapid fine-tuning is indeed sufficient and can outperform few-shot prompting. In fact, we find that fine-tuning an 8B Llama3.1 model for only 5 minutes, using just 0.01% of the SQuAD 2.0 training set, can result in a 12.31% increase in performance (i.e., exact match metric, from 77% to 83.69%) and makes the 8B model outperform the 70B variant. Deploying such a model can save 89% of the hardware costs, while also reducing both inference time and energy consumption by around 12% relative to the same-sized model. This study provides evidence that rapid retraining of LLMs can work. As future work, we plan to investigate Active Learning algorithms and explore their impact on in-context few-shot learning.
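As an illustration of what a short fine-tuning run on a roughly 0.01% subset might look like, the sketch below uses Hugging Face Transformers with LoRA adapters. The abstract does not specify the recipe; LoRA, the prompt format and all hyperparameters here are assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed model and adapter configuration; not the authors' exact setup.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# A tiny subset (roughly 0.01% of the ~130k SQuAD 2.0 training examples).
data = load_dataset("squad_v2", split="train").shuffle(seed=0).select(range(13))

def to_text(example):
    answers = example["answers"]["text"]
    answer = answers[0] if answers else "unanswerable"
    prompt = f"Context: {example['context']}\nQuestion: {example['question']}\nAnswer: {answer}"
    return tokenizer(prompt, truncation=True, max_length=512)

train_set = data.map(to_text, remove_columns=data.column_names)

Trainer(model=model,
        args=TrainingArguments(output_dir="rapid-ft", per_device_train_batch_size=1,
                               num_train_epochs=1, learning_rate=2e-4),
        train_dataset=train_set,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
```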
Authors: Boxuan Shan (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: In expressive speech synthesis, various factors can influence the synthesised speech, such as prosody, speaker identity, emotion, etc. However, little work has been done on the influence of sentiment on speech. Therefore, in this work, the effect of sentiment on speech is explored, and as a starting point, we focus on its effect on prosody. Firstly, we examined the differences in the distribution of prosody for different sentiment groups in real speech data. In this process, a sentiment analysis technique was adopted to perform sentiment grouping on the transcriptions of the speech dataset. Furthermore, some mainstream TTS models are examined to investigate whether their prediction performance is sensitive to sentiment grouping. Our experimental results show that sentiment has a significant impact on prosody, and that current TTS models are sensitive to sentiment, meaning that their prediction performance on prosody varies across different sentiment groups. However, the reason for this difference is unclear. These results suggest that the effect of sentiment on speech synthesis is a worthwhile direction for further research.
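A minimal sketch of this kind of analysis, assuming an off-the-shelf sentiment classifier and F0 extraction with librosa (neither of which is named in the abstract), might look like the following; the data format is a hypothetical list of (audio file, transcript) pairs.

```python
import librosa
import numpy as np
from transformers import pipeline

# Off-the-shelf sentiment classifier used only for grouping transcripts.
sentiment = pipeline("sentiment-analysis")

def mean_log_f0(wav_path):
    """A simple prosodic statistic: mean log F0 over voiced frames."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=sr)
    return float(np.nanmean(np.log(f0[voiced_flag])))

def group_by_sentiment(utterances):
    """utterances: list of (wav_path, transcript) pairs (hypothetical data)."""
    groups = {}
    for wav_path, text in utterances:
        label = sentiment(text)[0]["label"]         # e.g. "POSITIVE" / "NEGATIVE"
        groups.setdefault(label, []).append(mean_log_f0(wav_path))
    # Compare the prosody distribution (mean, std) across sentiment groups.
    return {label: (float(np.mean(v)), float(np.std(v))) for label, v in groups.items()}
```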
Authors: Donovan Wright (University of Sheffield), Robert Gaizauskas (University of Sheffield), Mark Stevenson (University of Sheffield)
Abstract: Journalistic writing conforms to an inverted pyramid structure in which the most important elements of the news content appear at the top of a news column, with supporting content following. This inverted pyramid structure of journalistic writing results in layout bias. Indeed, some researchers have exploited this layout bias in news content. Because of the bias within these datasets, abstractive summarization models developed with datasets comprising scraped news content are unlikely to generalize to datasets based on non-news content. This research focuses on the noise, or differences in similarity, between abstractive summarization models developed using unconstrained online news content, and seeks to make quantitative comparisons with datasets originating from non-news content.
Authors: Jason Chan (The University of Sheffield), Robert Gaizauskas (The University of Sheffield), Zhixue Zhao (The University of Sheffield)
Abstract: Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterises as "rulebreaker" scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognise and respond to rulebreakers (versus non-rulebreakers) in a human-like manner. Evaluating seven LLMs, we find that most models, including GPT-4o, achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilisation of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.
Authors: Yao Xiao (University of Sheffield), Heidi Christensen (University of Sheffield), Stefan Goetze (University of Sheffield)
Abstract: Alzheimer's dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts older people. AD can cause language-related deficits, such as difficulty in word-searching. This poses challenges for individuals in efficiently communicating with others, and consequently affects their quality of life. The linguistic and acoustic differences between impaired and unimpaired speech or language have enabled clinicians to perform screening for AD. Such cues have also inspired automated AD detection, serving as a non-intrusive, scalable, and cost-effective way to facilitate early detection, monitoring, and management of such conditions. This project investigates the use of speech and language technologies for the automated detection of AD. By analysing spontaneous speech from cognitive assessments, the project explores novel methods to achieve accurate AD screening and progression monitoring.
Authors: Fritz Peters (University of Sheffield), W Richard Bevan-Jones (University of Sheffield), Grace Threlfall (University of Sheffield), Jenny M Harris (University of Exeter), Julie S Snowden (University of Manchester), Matthew Jones (University of Manchester), Jennifer C Thompson (University of Manchester), Daniel J Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Primary progressive aphasia (PPA) encompasses a group of neurodegenerative disorders primarily affecting language abilities. Diagnosing PPA typically requires experienced clinicians, who are often only available in specialized hospital settings. Early in the disease course, individuals with PPA frequently exhibit noticeable changes in speech and language. In this study, we extracted acoustic, linguistic, and task-specific features from audio recordings to assess their effectiveness for PPA classification. Using a subset of task-specific features, we identified PPA with 97% accuracy. For subtyping, models trained on the complete feature set achieved 74% accuracy in distinguishing between the traditional three PPA variants. Our findings underscore the added value of task-specific features, which enhance conventional approaches. Furthermore, their visualisation provides an intuitive representation of task performance, improving clinical interpretability and potential diagnostic applications.
Authors: Minghui Zhao (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Diffusion models have shown strong performance in speech synthesis, primarily through non-autoregressive, parallel generation. While effective in terms of quality and diversity, these models can be less effective at capturing fine-grained frame-to-frame dependencies. In this work, we propose an autoregressive diffusion model that generates speech frame by frame, without adhering to a fixed temporal order. The model preserves autoregressive conditioning across frames. Experimental results show improved performance in pitch accuracy, as measured by log F0, demonstrating the model's ability to produce more natural and expressive speech.
Authors: Jack Cox (University of Sheffield), Jon Barker (University of Sheffield), Maha Elbayad (Meta)
Abstract: Learning disentangled representations of speech data has the potential to provide benefits in robustness, domain generalisation, controllability and interpretability, and fairness, but a lack of clarity around disentanglement has made progress difficult. We bring work in the broader machine learning community on causal representation learning to the context of speech disentanglement, and ask whether approaching speech as the result of a causal generative process can result in better representations.
Authors: Miles Williams (University of Sheffield), Nikolaos Aletras (University of Sheffield)
Abstract: Quantization and pruning form the foundation of compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated remarkable performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples that are used to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of quantization and pruning methods, datasets, tasks, and models. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.
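For context, the sketch below shows the typical role of calibration data in post-training compression: a handful of unlabeled examples are passed through the model and per-layer activation statistics are recorded, from which quantization scales or pruning scores are then derived. It is a generic illustration, not any specific method studied in the paper; the function names are hypothetical.

```python
import torch

def collect_activation_scales(model, calibration_batches, layer_types=(torch.nn.Linear,)):
    """Run a small set of unlabeled calibration examples through the model and
    record the maximum absolute input activation seen at each linear layer.
    Post-training quantization and pruning methods typically derive their
    scales or importance scores from statistics of this kind."""
    scales = {}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            amax = inputs[0].detach().abs().amax().item()
            scales[name] = max(scales.get(name, 0.0), amax)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, layer_types):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in calibration_batches:   # e.g. a few hundred tokenised examples
            model(**batch)

    for h in hooks:
        h.remove()
    return scales
```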
Authors: Mary Hewitt (University of Sheffield), Hamish Cunningham (University of Sheffield)
Abstract: Smart home systems often rely on cloud-based assistants for voice control, but these raise privacy concerns and struggle with underspecified or indirect requests due to rigid command structures. This work explores how large language models (LLMs) can address these limitations, enabling more flexible interpretation of spoken commands, though with increased risk of error. We evaluate open-source LLMs as the reasoning component of a local smart home assistant, running on consumer-grade hardware without internet access. Using an annotated dataset of natural language commands, we benchmark multiple LLMs across homes of varying complexity, assessing their ability to generate both structured control outputs and conversational responses grounded in a home model. We identify common error types and explore prompting techniques that improve model behaviour on this task. Our results highlight the potential of local LLMs for private, adaptable smart home control, and we outline methods for their safe deployment in real-world environments.
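As a purely hypothetical example of the two output types being benchmarked (the dataset's actual schema and device names are not given in the abstract), an underspecified command might map to a structured control action plus a conversational reply:

```python
# Hypothetical illustration of structured control output vs conversational
# response grounded in a home model; schema and device names are invented.
command = "It's a bit gloomy in here"

structured_output = {
    "action": "set_brightness",
    "device": "living_room_ceiling_light",
    "value": 80,                                   # percent
    "reason": "user reports the room is too dark",
}

conversational_response = "Sure, I'll brighten the living room lights for you."
```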
Authors: Robert Sutherland (University of Sheffield), Jason Clarke (University of Sheffield), Stefan Goetze (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: With machine learning techniques being developed and applied to the domain of speech enhancement for hearing aids, there comes a need for high-quality datasets which represent realistic scenarios that might be faced by hearing-impaired listeners in everyday life. In this work, we discuss the development of a dataset for Enhancing Conversations for the Hearing Impaired (ECHI), which comprises four-party conversations in a simulated cafe/restaurant environment. For this dataset, we recorded over 30 hours of conversational speech from over fifty 4-person conversations involving over 200 participants. For each session, we captured 55 channels of audio, including ego-centric recordings from the Meta Aria glasses and hearing aid microphones, synchronised with audio from eight separate 4-element fixed microphone arrays. In addition, we recorded six-degree-of-freedom head motion data for all four participants, and video from the Aria glasses worn by one of the participants. The data will have a number of possible use cases relevant to the tasks of speech enhancement for hearing aids, distant-microphone speech processing, and audio-visual speech processing, as well as investigations into head motion and behaviour in conversations in noisy environments. A subset of this data will be used as the dataset for the CHiME9-ECHI Challenge.
Authors: Olga Iakovenko (University of Sheffield), Thomas Hain (University of Sheffield)
Abstract: Code-switching (CS) is the practice of alternating between two or more languages within a single conversation. CS presents significant challenges for automatic speech recognition (ASR) systems due to abrupt language changes, mixed linguistic structures and varying phonetic and syntactic patterns. Another difficulty when dealing with CS is data scarcity compared to monolingual data. One crucial aspect of enhancing ASR performance for CS is accurate matrix language identification (MLID), as the matrix language provides the syntactic and structural framework for CS utterances. This paper investigates the impact of MLID on the effectiveness and accuracy of ASR systems when processing CS speech. The MLID was predicted from CS audio simultaneously with the ASR and language diarisation (LD) tasks. This was compared to a similar setup trained in a multi-task learning fashion combining ASR, LD and utterance-level language identification (LID), where CS utterances were regarded as a separate language. The proposed CS ASR system has shown an absolute Mixed Error Rate (MER) decrease of 0.5% in comparison to the baseline and at least a 0.2% absolute MER decrease in comparison to existing CS ASR multi-task learning setups. The proposed approach has demonstrated that predicting the matrix language as Mandarin leads to an increase in recognised function words, indicating that MLID informs the ASR decoder of the grammatical properties of the utterance. The study highlights the potential of MLID-aware ASR systems in various applications, from multilingual virtual assistants to real-time translation services, indicating a broader applicability of the approach.
Authors: Constantinos Karouzos (University of Sheffield), Nikos Aletras (University of Sheffield)
Abstract: Recent progress in aligning large language models (LLMs) with human preferences through techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) has led to significant improvements in generation quality. However, most studies train and evaluate models in a single, static domain, even though deployment settings often encounter substantial domain shifts. We present an extensive empirical study of preference optimization under domain shift. We evaluate a suite of preference optimization methods, including Supervised Fine-Tuning (SFT), RLHF, DPO, KTO, and GRPO, on summarization (using a Reddit TL;DR dataset as the source and CNN/DM as the target). Furthermore, we explore several domain adaptation strategies, such as domain-adaptive pre-training (DAPT), interleaved training, pseudo-labeling, and domain-adversarial techniques, to assess their impact on transferring alignment knowledge to new domains. Our results highlight significant discrepancies in performance between source and target domains and reveal promising approaches for mitigating the alignment gap under domain shift.
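Of the compared methods, DPO has a compact closed-form objective; the sketch below shows its standard formulation, included only to make the objective concrete, not as the paper's implementation. Inputs are assumed to be summed log-probabilities of chosen and rejected responses under the trained policy and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard Direct Preference Optimization loss. Each argument is a tensor
    of summed response log-probabilities; beta controls the strength of the
    implicit KL constraint towards the reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```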