Venue: Workroom 2, Diamond Building. Note: all poster presentations will be delivered in person.
11:45am - 12:20pm: Even numbered posters to be presented
12:25pm - 1:00pm: Odd numbered posters to be presented
A number of submissions are also in review at international conferences. As a result, their titles and abstracts have been redacted to ensure they conform with the requirements for double-blind review. The titles and abstracts will be published shortly before the SLT CDT Annual Conference.
Authors: Favour Yahdii Aghaebe (University of Sheffield), Tanefa Apekey (University of Sheffield), Elizabeth Williams (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
Authors: Christopher Bartley (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Redacted
Authors: Jason Chan (University of Sheffield), Zhixue Zhao (University of Sheffield), Robert Gaizauskas (University of Sheffield)
Abstract: Redacted
Authors: Joseph James (University of Sheffield), Chenghao Xiao (Durham University), Yucheng Li (University of Surrey), Nafise Sadat Moosavi (University of Sheffield), Chenghua Lin (University of Manchester)
Abstract: Redacted
Authors: Hezhao Zhang (School of Computer Science, University of Sheffield, United Kingdom), Huang-Cheng Chou (Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA), Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA), Thomas Hain (Department of Computer Science, University of Sheffield, United Kingdom)
Abstract: Redacted
Authors: Jo-Ku Cheng (University of Sheffield), Nikos Aletras (University of Sheffield), Marco Valentino (University of Sheffield)
Abstract: Redacted
Authors: Boxuan Shan (University of Sheffield), Adrián Barahona-Ríos (Sony Interactive Entertainment), Anton Ragni (University of Sheffield)
Abstract: Expressive speech can be influenced by various paralinguistic aspects, including sentiment, emotion, speaker identity, and style. Those aspects are widely adopted by controllable TTS systems as control signals, but their effectiveness has not been well understood. We therefore present an analysis of how these aspects affect speech, focusing on prosody as a fundamental component of expressive speech. Statistical tests show that emotion, style, and speaker identity produce clear prosodic differences, whereas sentiment yields significant but weaker effects, revealing a challenge for weak control aspects. A contrastive learning method has been introduced to encourage the model to better respond to paralinguistic controls. Finally, we present a distributional visualisation to give more insight into the effectiveness of contrastive learning. Our results highlight the difficulty of modelling and controlling weak paralinguistic aspects and provide insights for future controllable TTS research.
Authors: Ian Kennedy (University of Sheffield), Nafise S. Moosavi (University of Sheffield)
Abstract: Redacted
Authors: Wing-Zin Leung (University of Sheffield), Heidi Christensen (University of Sheffield), Stefan Goetze (University of Sheffield)
Abstract: Dysarthria is a type of motor speech disorder that reflects abnormalities in motor movements required for speech production. In clinical practice, identifying characteristic signs and symptoms of the neuropathophysiology underlying a dysarthria is vital for diagnosis and management. The gold standard for dysarthria assessment is auditory-perceptual evaluation by a speech and language therapist for differential diagnosis and management decisions. As the process is time-consuming for clinicians, there is growing interest in automatic dysarthria assessment (ADA). Recent approaches to ADA primarily focus on the classification of broad intelligibility or speech severity labels. However, this does not have much clinical utility and the assessment of communication-relevant parameters does not distinguish between dysarthria types and pathomechanisms. Studies on the classification of dysarthria function or clinical test protocol scores focusing on aspects of dysarthric speech production (such as the Frenchay dysarthria assessment (FDA)) are limited. Therefore, this paper focuses on the preliminary steps towards clinically interpretable ADA, including automatic FDA assessment. The phoneme posteriorgram (PPG) is a time-varying categorical distribution over acoustic speech units, and recent work demonstrates interpretable speech pronunciation distance for downstream tasks, e.g. pronunciation reconstruction. This work extends recent advances in posterior-based phoneme research and mispronunciation models to dysarthria assessment, exploring the extent to which dysarthric speech features in the FDA (identified by auditory-perceptual evaluation in clinical practice) are captured by PPG information. To achieve this, FDA aspects are systematically evaluated. The results show that interpretable PPG probability can capture dysarthric speech features that are related to motor system dysfunction.
Authors: Jasivan Sivakumar (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: This paper presents a rigorous mechanistic analysis of the failure modes in Large Language Model (LLM) numerical reasoning, focusing on the persistent "distraction" caused by prompt-induced biases. Despite attempting multiple mitigation strategies - including weighted rank predictions and bias-reduced decoding - we find that numerical anchors and verbal cues in the prompt exert a "gravitational pull" that is remarkably difficult to override during the decoding stage. We evaluate these dynamics by comparing masked versus unmasked inputs and analysing how prediction confidence often correlates more strongly with superficial prompt patterns than with mathematical accuracy. Our layer-wise diagnostic reveals why these interventions struggle: while MLP and Transformer probes confirm that correct mathematical information exists within the hidden states, this knowledge is frequently "buried" by a dominant rank bias that persists across the majority of the architecture. We detail our findings on rank analysis and early-exit performance, illustrating that the model's internal commitment to a biased answer often occurs early and stabilises regardless of contextual counter-evidence. By documenting these unsuccessful attempts to redirect the model's logic, we provide a detailed map of the structural barriers to fair numerical reasoning and offer critical insights into why standard decoding interventions remain insufficient.
Authors: Robert Flynn (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Automatic speech recognition (ASR) models are normally trained to operate over single utterances, with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but also reflects a common, but often inaccurate, modeling assumption that treats utterances as independent and identically distributed samples. When long-format audio recordings are available, to work with such systems, these recordings must first be segmented into short utterances and processed independently. In this work, we show that due to recent algorithmic and hardware advances, this is no longer necessary, and current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. Therefore, to gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths from 10 seconds up to 1 h. Through modifying various architectural components, we find that the method of encoding positional information and the model's width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data are constructed to help analyse the model's use of context.
Authors: Paul Gering (University of Sheffield), Roger K Moore (University of Sheffield)
Abstract: Redacted
Authors: Constantinos Karouzos (University of Sheffield), Xingwei Tan (University of Sheffield), Nikolaos Aletras (University of Sheffield)
Abstract: Redacted
Authors: Danae Sanchez Villegas (University of Copenhagen), Samuel Lewis-Lim (University of Sheffield), Nikolaos Aletras (University of Sheffield), Desmond Elliott (University of Copenhagen)
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two model families. We track confidence over Chain-of-Thought (CoT), measure reasoning's corrective effect, and evaluate intermediate reasoning steps. We find that models are prone to answer inertia, where early predictions are reinforced rather than revised. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient. Although this influence can appear in the CoT, its detectability varies across models and depends on what is monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer CoTs can still appear visually grounded while following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. These findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
Authors: Jack Cox (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.
Authors: Yanyi Pu (University of Sheffield), Damian Gonzalez-Salzberg (University of Birmingham), Yuan Zheng (University of Sheffield), Nikos Aletras (University of Sheffield)
Abstract: Redacted
Authors: Minghui Zhao (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Redacted
Authors: Anthony Hughes (University of Sheffield), Alex Goldberg (Carnegie Mellon University), Prince Jha (MBZUAI), Nikos Aletras (University of Sheffield), Niloofar Mireshghallah (Carnegie Mellon University)
Abstract: Redacted
Authors: Fritz Peters (University of Sheffield), Madhurananda Pahar (University of Sheffield), Dorota Braun (University of Sheffield), Caitlin Illingworth (University of Sheffield), Daniel Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Redacted
Authors: Michael Whealing (SLT CDT Affiliate), Thomas Hain (Speech and Hearing Research Group), Rob Gaizauskas (Natural Language Processing Research Group)
Abstract: Redacted
Authors: Valeria Pastorino (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Redacted
Authors: Maggie Mi (University of Sheffield), Golzar Atefi (Berliner Hochschule für Technik), Atsuki Yamaguchi (University of Sheffield), Felix Gers (Berliner Hochschule für Technik), Aline Villavicencio (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Redacted
Authors: Gerardo Roa-Dabike (University of Sheffield), Jon P. Barker (University of Sheffield), Michael A. Akeroyd (University of Nottingham), Scott Bannister (University of Leeds | University of Manchester), Trevor J. Cox (University of Salford), Bruno Fazenda (University of Salford), Jennifer Firth (University of Nottingham), Simone Graetzer (University of Salford), Alinka Greasley (University of Leeds), Rebecca R. Vos (University of Salford), William M. Whitmer (University of Nottingham)
Abstract: Understanding the lyrics in music is key to music enjoyment. However, people with hearing loss can have difficulty hearing lyrics clearly and effortlessly. In speech technology, having metrics to automatically evaluate intelligibility has driven improvements in speech enhancement. We wanted to do the same for music with lyrics. To address this gap, we presented the lyric intelligibility challenge. A new dataset, CLIP1, was introduced, comprising audio samples of popular western music paired with listener intelligibility scores. To model diverse listening profiles, samples were processed with no, mild and moderate simulated hearing loss. A total of 27 systems were submitted by 22 teams. Following the success of CLIP1, we are announcing the launch of CLIP2, the second lyric intelligibility challenge.
Authors: Robert Sutherland (University of Sheffield), Stefan Goetze (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: Redacted
Authors: Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Redacted
Authors: Yao Xiao (University of Sheffield), Fritz Peters (University of Sheffield), Madhurananda Pahar (University of Sheffield), Dorota A Braun (University of Sheffield), Caitlin H Illingworth (University of Sheffield), Stefan Goetze (University of Sheffield), Daniel Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Redacted
Authors: Xinying Wei (University of Sheffield), Eleni Vasilaki (University of Sheffield), Thomas Hain (University of Sheffield)
Abstract: Redacted
Authors: Xiaozhou Tan (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Diffusion models in speech enhancement and synthesis tend to have a heavy mathematical background. To ensure mathematical tractability, these models are based on idealized but not fully realistic assumptions. We find that most of these models share a common feature: they can all be reparameterized as a clean data prediction task. We also find that, although many forms of diffusion model have been proposed, the community moved on to more complex methods without exploring the most basic form of training and inference. Building on the feature shared by these more complex methods, we propose a simpler, more basic form of diffusion-like model and aim to fill this gap. We implemented two applications using this method: speech enhancement and synthesis. Besides its simplicity, this method is more flexible than classic diffusion or flow matching models in incorporating more diverse degradation processes in speech synthesis. This flexibility expands the design space and can potentially improve results.
Authors: Arezo Shakeri (University of Stavanger, Norway and University of Sheffield), Madhurananda Pahar (University of Sheffield), Ning Ma (University of Sheffield)
Abstract: This study investigates the use of neural audio codec representations for automatic detection of cognitive decline from speech. Cognitive disorders such as Mild Cognitive Impairment (MCI) and Alzheimer's disease (AD) often affect speech production, making speech analysis a promising non-invasive tool for early diagnosis. Unlike traditional approaches that rely on handcrafted acoustic features, linguistic features, or embeddings from large speech models, this work explores discrete speech tokens generated by the Mimi neural audio codec. These tokens jointly capture acoustic, phonetic, and linguistic information while providing compact and computationally efficient representations. Two GPT-2-based approaches were evaluated: (1) custom GPT-2 models trained on Mimi tokens to generate representations for external classifiers, and (2) a pretrained GPT-2 fine-tuned for direct classification. Additional methods using token frequencies, flattened token sequences, and token embeddings were also examined. Experiments were conducted on the PROCESS-2 corpus containing speech recordings from healthy controls, individuals with MCI, and individuals with dementia. The best-performing neural audio codec approach achieved a macro F1-score of 63% for three-class classification, outperforming a manual-transcription baseline (58%) without requiring transcripts. These findings demonstrate the potential of neural audio codecs for efficient, transcript-free cognitive decline detection and motivate future work on improved token utilisation and interpretability.
Authors: Xiaolei Xu (University of Sheffield), Chaoyue Niu (University of Sheffield), Guy J. Brown (University of Sheffield), Hector Romero (Passion for Life Healthcare), Ning Ma (University of Sheffield)
Abstract: Redacted