Venue: Workroom 2, Diamond Building. Note: all poster presentations will be delivered in-person.
11:45am - 12:20pm: Even numbered posters to be presented
12:25pm - 1:00pm: Odd numbered posters to be presented
Authors: Favour Yahdii Aghaebe (University of Sheffield), Tanefa Apekey (University of Sheffield), Elizabeth Williams (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.
Authors: Christopher Bartley (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Nearly half of the world's languages are endangered. Speech technologies, such as Automatic Speech Recognition (ASR), have been shown to be central to revival efforts, yet most languages remain unsupported because standard pipelines expect utterance-level supervised data. In this paper, we explore using speech-text pairs aligned at the level of words or short phrases as a more accessible entry point than conventional utterance-level corpora. Using a standard English ASR pipeline, we show how much performance is impacted by substituting a conventional corpus with short-form speech. We then source short-form data for 5 typologically diverse endangered languages and use them to force-align long-form speech, creating new utterance-level speech datasets for each language. Finally, we build ASR systems and compare them to SOTA multilingual models (OmniASR, MMS, Whisper), showing that better out-of-domain (OOD) performance can be achieved at a fraction of the computational cost.
Authors: Jason Chan (University of Sheffield), Zhixue Zhao (University of Sheffield), Robert Gaizauskas (University of Sheffield)
Abstract: Existing NLP work commonly treats contradictions as errors to be resolved by choosing which statements to accept or discard. Yet a key aspect of human reasoning in social interactions and professional domains is the ability to hypothesise explanations that reconcile contradictions. For example, "Cassie hates coffee" and "She buys coffee everyday" may appear contradictory, yet both are compatible if Cassie has the unenviable daily chore of buying coffee for all her coworkers. Despite the growing reasoning capabilities of large language models (LLMs), their ability to hypothesise such reconciliatory explanations remains largely unexplored. To address this gap, we introduce the task of reconciliatory explanation generation, where models must generate explanations that effectively render contradictory statements compatible. We propose a novel method of repurposing existing natural language inference (NLI) datasets, and introduce quality metrics that enable scalable automatic evaluation. Experiments with 18 LLMs show that most models achieve limited success in this task, and that the benefit of extending test-time compute by "thinking" plateaus as model size increases. Our results highlight an under-explored dimension of LLM reasoning and the need to address this limitation in enhancing LLMs' downstream applications such as chatbots and scientific aids.
Authors: Joseph James (University of Sheffield), Chenghao Xiao (Durham University), Yucheng Li (University of Surrey), Nafise Sadat Moosavi (University of Sheffield), Chenghua Lin (University of Manchester)
Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper's body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employs a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.
Authors: Hezhao Zhang (School of Computer Science, University of Sheffield, United Kingdom), Huang-Cheng Chou (Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA), Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA), Thomas Hain (Department of Computer Science, University of Sheffield, United Kingdom)
Abstract: Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Additionally, conventional speech LLM benchmarks overlook the inherent ambiguity of human emotion. Hence, we present VoxEmo, a comprehensive SER benchmark encompassing 35 emotion corpora across 15 languages for Speech LLMs. VoxEmo provides a standardized toolkit featuring varying prompt complexities, from direct classification to paralinguistic reasoning. To reflect real-world perception/application, we introduce a distribution-aware soft-label protocol and a prompt-ensemble strategy that emulates annotator disagreement. Experiments reveal that while zero-shot speech LLMs trail supervised baselines in hard-label accuracy, they uniquely align with human subjective distributions.
Authors: Jo-Ku Cheng (University of Sheffield), Nikos Aletras (University of Sheffield), Marco Valentino (University of Sheffield)
Abstract: Current Large Language Models (LLMs) are limited by monolithic architectures and pre-training paradigms that inextricably entangle the mechanisms for reasoning with those required for language and knowledge. This leads to systemic fragility, including "content effects" where logical validity is biased by semantic plausibility, as well as extreme training inefficiency. This project aims to address these challenges by redefining the current LLM pre-training pipeline, shifting from next-token prediction on raw linguistic corpora to a neuro-symbolic approach that explicitly separates symbolic reasoning from material knowledge acquisition to improve LLM efficiency and robustness.
Authors: Boxuan Shan (University of Sheffield), Adrián Barahona-Ríos (Sony Interactive Entertainment), Anton Ragni (University of Sheffield)
Abstract: Expressive speech can be influenced by various paralinguistic aspects, including sentiment, emotion, speaker identity, and style. Those aspects are widely adopted by controllable TTS systems as control signals, but their effectiveness has not been well understood. We therefore present an analysis of how these aspects affect speech, focusing on prosody as a fundamental component of expressive speech. Statistical tests show that emotion, style, and speaker identity produce clear prosodic differences, whereas sentiment yields significant but weaker effects, revealing a challenge for weak control aspects. A contrastive learning method has been introduced to encourage the model to better respond to paralinguistic controls. Finally, we present a distributional visualisation to give more insight into the effectiveness of contrastive learning. Our results highlight the difficulty of modelling and controlling weak paralinguistic aspects and provide insights for future controllable TTS research.
Authors: Ian Kennedy (University of Sheffield), Nafise S. Moosavi (University of Sheffield)
Abstract: Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality–compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning. Our code is publicly available.
Authors: Wing-Zin Leung (University of Sheffield), Heidi Christensen (University of Sheffield), Stefan Goetze (University of Sheffield)
Abstract: Dysarthria is a type of motor speech disorder that reflects abnormalities in motor movements required for speech production. In clinical practice, identifying characteristic signs and symptoms of the neuropathophysiology underlying a dysarthria is vital for diagnosis and management. The gold standard for dysarthria assessment is auditory-perceptual evaluation by a speech and language therapist for differential diagnosis and management decisions. As the process is time-consuming for clinicians, there is growing interest in automatic dysarthria assessment (ADA). Recent approaches to ADA primarily focus on the classification of broad intelligibility or speech severity labels. However, this does not have much clinical utility, and the assessment of communication-relevant parameters does not distinguish between dysarthria types and pathomechanisms. Studies on the classification of dysarthria function or clinical test protocol scores focusing on aspects of dysarthric speech production (such as the Frenchay dysarthria assessment (FDA)) are limited. Therefore, this paper focuses on the preliminary steps towards clinically interpretable ADA, including automatic FDA assessment. The phoneme posteriorgram (PPG) is a time-varying categorical distribution over acoustic speech units, and recent work demonstrates interpretable speech pronunciation distance for downstream tasks, e.g. pronunciation reconstruction. This work extends recent advances in posterior-based phoneme research and mispronunciation models to dysarthria assessment, exploring the extent to which dysarthric speech features in the FDA (identified by auditory-perceptual evaluation in clinical practice) are captured by PPG information. To achieve this, FDA aspects are systematically evaluated. The results show that interpretable PPG probability can capture dysarthric speech features that are related to motor system dysfunction.
Authors: Jasivan Sivakumar (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: This paper presents a rigorous mechanistic analysis of the failure modes in Large Language Model (LLM) numerical reasoning, focusing on the persistent "distraction" caused by prompt-induced biases. Despite attempting multiple mitigation strategies - including weighted rank predictions and bias-reduced decoding - we find that numerical anchors and verbal cues in the prompt exert a "gravitational pull" that is remarkably difficult to override during the decoding stage. We evaluate these dynamics by comparing masked versus unmasked inputs and analysing how prediction confidence often correlates more strongly with superficial prompt patterns than with mathematical accuracy. Our layer-wise diagnostic reveals why these interventions struggle: while MLP and Transformer probes confirm that correct mathematical information exists within the hidden states, this knowledge is frequently "buried" by a dominant rank bias that persists across the majority of the architecture. We detail our findings on rank analysis and early-exit performance, illustrating that the model's internal commitment to a biased answer often occurs early and stabilises regardless of contextual counter-evidence. By documenting these unsuccessful attempts to redirect the model's logic, we provide a detailed map of the structural barriers to fair numerical reasoning and offer critical insights into why standard decoding interventions remain insufficient.
Authors: Robert Flynn (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Automatic speech recognition (ASR) models are normally trained to operate over single utterances, with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but also reflects a common, but often inaccurate, modeling assumption that treats utterances as independent and identically distributed samples. When long-format audio recordings are available, to work with such systems, these recordings must first be segmented into short utterances and processed independently. In this work, we show that due to recent algorithmic and hardware advances, this is no longer necessary, and current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. Therefore, to gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths from 10 seconds up to 1 h. Through modifying various architectural components, we find that the method of encoding positional information and the model's width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data are constructed to help analyse the model's use of context.
Authors: Paul Gering (University of Sheffield), Roger K Moore (University of Sheffield)
Abstract: Previous studies have used system-dependent (SD) input features, such as system log data, and machine learning to automatically evaluate user interactions with spoken dialogue systems. However, this approach lacks generalisability across different systems. This study implements a system-agnostic (SA) approach, using acoustic, textual and temporal features to model interaction quality (IQ). Across two experimental phases, we compare the performance of IQ classifiers trained on static and fine-tuned SD and SA features. We found that while the SD features were superior in a static pipeline, the SA features achieved comparable performance when end-to-end fine-tuning was employed. Therefore, our results demonstrate that SA features, when fine-tuned, offer a viable, system-independent alternative for IQ modelling, enabling more scalable evaluations across different dialogue systems.
Authors: Constantinos Karouzos (University of Sheffield), Xingwei Tan (University of Sheffield), Nikolaos Aletras (University of Sheffield)
Abstract: Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.
Authors: Danae Sanchez Villegas (University of Copenhagen), Samuel Lewis-Lim (University of Sheffield), Nikolaos Aletras (University of Sheffield), Desmond Elliott (University of Copenhagen)
Abstract: Recent advances in vision language models (VLMs) offer reasoning capabilities, yet how these unfold and integrate visual and textual information remains unclear. We analyze reasoning dynamics in 18 VLMs covering instruction-tuned and reasoning-trained models from two model families. We track confidence over Chain-of-Thought (CoT), measure reasoning's corrective effect, and evaluate intermediate reasoning steps. We find that models are prone to answer inertia, where early predictions are reinforced rather than revised. While reasoning-trained models show stronger corrective behavior, their gains depend on modality conditions. Using controlled interventions with misleading textual cues, we show that models are consistently influenced by these cues even when visual evidence is sufficient. Although this influence can appear in the CoT, its detectability varies across models and depends on what is monitored. Reasoning-trained models are more likely to explicitly refer to the cues, but their longer CoTs can still appear visually grounded while following textual cues, obscuring modality reliance. In contrast, instruction-tuned models refer to the cues less explicitly, but their shorter traces reveal inconsistencies with the visual input. These findings indicate that CoT provides only a partial view of how different modalities drive VLM decisions, with important implications for the transparency and safety of multimodal systems.
Authors: Jack Cox (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learned subspaces.
Authors: Yanyi Pu (University of Sheffield), Damian Gonzalez-Salzberg (University of Birmingham), Yuan Zheng (University of Sheffield), Nikos Aletras (University of Sheffield)
Abstract: We introduce ECtHR-NPD, the first NLP benchmark for numerical prediction of non-pecuniary damage (NPD) award amounts from European Court of Human Rights (ECtHR) judgments. Unlike existing legal NLP benchmarks such as LexGLUE and LEXTREME, which frame legal prediction as classification, ECtHR-NPD poses the task as regression over continuous monetary values, reflecting the court's actual decision-making process. Our dataset comprises approximately 16,000 ECtHR cases spanning 1959 to 2026. We evaluate a comprehensive baseline hierarchy: floor baselines using global and group-level statistics, tree-based models with structured legal features, encoder-based models including ModernBERT with late fusion of structured features, and state-of-the-art large language models under zero-shot and knowledge-grounded agentic prompting conditions. Our central finding is that frontier LLMs are severely miscalibrated: systems claiming 90% confidence in their interval predictions achieve only approximately 47% empirical coverage. Beyond the NLP contribution, our structured feature analysis reveals systematic patterns in award amounts that correlate with geopolitical factors, raising substantive questions about consistency and fairness in judicial compensation decisions. ECtHR-NPD establishes a challenging, legally grounded regression benchmark at the intersection of NLP, computational law, and AI reliability.
Authors: Minghui Zhao (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Autoregressive speech synthesis has traditionally followed a left-to-right order, yet generation order is a modelling choice. This paper investigates decoding order through a masked diffusion framework that enables arbitrary decoding orders at inference time. To isolate the effect of decoding order from inductive biases introduced by learned discrete encoders, we operate on scalar quantised Mel-spectrograms. Our results demonstrate that the left-to-right strategy is suboptimal for capturing complex acoustic dependencies. The reverse, right-to-left, order consistently outperforms the traditional approach, while the adaptive confidence-based strategy top1 achieves the highest Mean Opinion Score (MOS) among all model outputs. Analysis reveals that the most effective orders maintain local clusters of consecutive frames to balance long-range temporal dependencies with local coherence. These results establish decoding order as a critical parameter for improving synthesis quality.
Authors: Anthony Hughes (University of Sheffield), Alex Goldberg (Carnegie Mellon University), Prince Jha (MBZUAI), Nikos Aletras (University of Sheffield), Niloofar Mireshghallah (Carnegie Mellon University)
Abstract: Safety classifiers are now essential infrastructure for providing user safety when interacting with language models. Prior work has shown that these classifiers are effective at mitigating the elicitation of misaligned content or flagging concerns regarding user behavior. However, this work does not address concerns regarding the privacy of the information used to train these models. In this work, we conduct the first study of privacy attacks on safety classifiers. We systematically study membership inference attacks (MIA) across a range of language models to take advantage of their memorization of training data documents and their labels. We introduce a novel boundary-based calibration of our inference attack. Our most concerning finding is that, through safe and unsafe flagging of prompts, an adversary can infer sensitive labels for partial and whole documents observed in datasets. We find that certain categories of data, such as self-harm, and datasets that provide emotional support are particularly vulnerable. An attacker given partial knowledge of a document can uncover memorized correlations with class labels. This is significant as it means an attacker can infer private labels about an individual without full knowledge of the training document or its target label. Finally, we demonstrate the power of this attack on classifiers fine-tuned on psychological support data, where the model leaks private information about specific documents and their labels.
Authors: Fritz Peters (University of Sheffield), Madhurananda Pahar (University of Sheffield), Dorota Braun (University of Sheffield), Caitlin Illingworth (University of Sheffield), Daniel Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Early detection of dementia is constrained by limited healthcare resources. Speech analysis from picture description tasks offers a non-invasive, scalable tool for detecting cognitive decline. However, the widely used Cookie Theft stimulus is limited by bias and reliance on manually curated Content Information Units (CIUs). Hence, we propose an automated, stimulus-agnostic approach for deriving CIUs from a large corpus of normative data using topic modelling. These topic models are then employed for extracting CIU sequences to construct graphs, representing spatio-semantic information of participants' descriptions. We show that graph features obtained from our proposed approach replicate patterns and diagnostic group differences for the Cookie Theft using an existing approach. Moreover, they generalize to an alternative, more complex picture. These results demonstrate a stimulus-agnostic method for the analysis of picture description tasks, promoting the use of more inclusive pictures.
Authors: Michael Whealing (SLT CDT Affiliate), Thomas Hain (Speech and Hearing Research Group), Rob Gaizauskas (Natural Language Processing Research Group)
Abstract: In this work, we introduce a Self-Supervised Learning (SSL) training regime to separate lexical and stylistic content from SSL feature embeddings. By using cross-utterance correspondence training, we distil high-dimensional variable-length frame embeddings into lower-dimensional latent vectors, namely Acoustic Word Embeddings (AWEs) and Acoustic Style Embeddings (ASEs), without explicit labelling. The resultant latent embeddings are concatenated and provide input into a decoder network that reconstructs SSL acoustic features, compelling the network to factorise speech into independent representations of content and style. Qualitative analysis indicates that the learnt latents form distinct clusters, corresponding to lexical and stylistic categories. Additionally, in downstream application the distilled representations prove highly informative for word and speaker identification tasks, outperforming raw SSL features in disentanglement quality without fine-tuning.
Authors: Valeria Pastorino (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Framing in media critically shapes public perception by selectively emphasizing some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.
Authors: Maggie Mi (University of Sheffield), Golzar Atefi (Berliner Hochschule für Technik), Atsuki Yamaguchi (University of Sheffield), Felix Gers (Berliner Hochschule für Technik), Aline Villavicencio (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)
Abstract: Idioms can be analysed in terms of their decomposability, the extent to which constituent meanings contribute to the figurative whole. Decomposability is thought to predict syntactic flexibility. Usage-based accounts instead attribute idiom behaviour to distributional experience, such as speaker familiarity and predictability. We examine these views using contextualised language models as controlled distributional learners. We propose a model-internal measure of decomposability and relate it to human ratings, syntactic flexibility, and predictability while tracking idiom learning during pretraining. Model-derived decomposability correlates weakly with human judgments and shows a small but consistent negative relationship with syntactic flexibility. Pretraining analyses show that stabilisation of idiom representations in models is not explained by frequency alone. Instead, surprisal, decomposability, and frequency all contribute, with decomposability showing the strongest training-dependent effect.
Authors: Gerardo Roa-Dabike (University of Sheffield), Jon P. Barker (University of Sheffield), Michael A. Akeroyd (University of Nottingham), Scott Bannister (University of Leeds | University of Manchester), Trevor J. Cox (University of Salford), Bruno Fazenda (University of Salford), Jennifer Firth (University of Nottingham), Simone Graetzer (University of Salford), Alinka Greasley (University of Leeds), Rebecca R. Vos (University of Salford) and William M. Whitmer (University of Nottingham)
Abstract: Understanding the lyrics in music is key for music enjoyment. However, people with hearing loss can have difficulties clearly and effortlessly hearing lyrics. In speech technology, having metrics to automatically evaluate intelligibility has driven improvements in speech enhancement. We wanted to do the same for music with lyrics. To address this gap, we presented the lyric intelligibility challenge. A new dataset, CLIP1, was introduced, comprising audio samples of popular western music paired with listener intelligibility scores. To model diverse listening profiles, samples were processed with no, mild and moderate simulated hearing loss. A total of 27 systems were submitted by 22 teams. Following the success of CLIP1, we are announcing the launch of CLIP2, the second lyric intelligibility challenge.
Authors: Robert Sutherland (University of Sheffield), Stefan Goetze (University of Sheffield), Jon Barker (University of Sheffield)
Abstract: For those with hearing impairments, improving the performance of assistive hearing devices is essential for making communication in noisy environments easier. The CHiME-9 “Enhancing Conversations to address Hearing Impairment (ECHI)” challenge introduces a large dataset to advance speech enhancement in conversational settings, with submitted systems evaluated through listening tests. While speech quality assessment is well established, methods for evaluating conversational speech intelligibility remain limited, motivating this work's focus on key challenges and a new evaluation approach. The challenge targets four-party conversations in noisy conditions, with participants designing systems to suppress background noise while preserving the speech of three conversational partners. To assess the performance of a system, a listening test is designed with the goal of assessing conversational intelligibility, emphasising ecologically valid tasks. Pilot studies highlight challenges such as designing semantically coherent, memorable segments, providing sufficient context for one-shot listening, and ensuring clear cues for identifying the target speaker. To address this, a novel listening test methodology is introduced, providing listeners with target speaker samples, conversational context, speaking cues, and training examples. All tools, audio data, and intelligibility labels will be open-sourced following the challenge.
Authors: Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Text-to-speech (TTS) systems commonly employ encoder–decoder architectures in which an acoustic model maps linguistic representations to mel spectrograms prior to waveform synthesis by a vocoder. Each phone in a sequence has an irregular duration. Typical upsampling methods duplicate each text encoding for a number of frames given by its predicted phone duration, producing a discrete sequence of structured frames for the decoder. The decoder then applies a transform with the same discrete structure as its input to produce a mel spectrogram prediction. Such methods are strong, yet the use of a discrete process runs counter to the continuous nature of acoustic signals. The fact that a mel spectrogram is a discretised representation does not imply that the decoder must also operate in this scheme. This motivates exploring continuous-time acoustic models for speech synthesis. Modelling a continuous time series from an irregularly sampled input is possible with Neural Controlled Differential Equations (Neural CDEs). Unlike recurrent or convolutional methods, a Neural CDE is a parametrised continuous-time vector field whose behaviour is controlled by a discrete input signal, which makes it a natural fit for acoustic modelling problems. This study investigates Neural CDEs as continuous-time mel spectrogram decoders for TTS. To our knowledge, this is the first study to model mel spectrogram generation as a dynamical process using Neural CDEs. This work aims to provide a principled framework for handling irregular alignment structure and for producing smoother, more flexible acoustic trajectories.
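The duplication-based upsampling that the abstract contrasts with continuous-time decoding can be sketched as follows. This is a minimal illustration only: the function name `length_regulate`, the toy encodings, and the durations are hypothetical and are not taken from the authors' system.

```python
from typing import List, Sequence


def length_regulate(
    encodings: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each phone encoding for its predicted number of frames.

    Hypothetical helper: it shows the discrete, duplication-based
    upsampling described in the abstract, not the proposed Neural CDE.
    """
    if len(encodings) != len(durations):
        raise ValueError("need exactly one duration per phone encoding")
    frames: List[List[float]] = []
    for vector, duration in zip(encodings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share mutable state.
        frames.extend(list(vector) for _ in range(duration))
    return frames


# Toy example: three phones with irregular durations (assumed values).
phone_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]
frames = length_regulate(phone_encodings, durations)
```

Every frame within a phone is identical after this step, which is the piecewise-constant structure that a continuous-time decoder would aim to avoid.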
Authors: Yao Xiao (University of Sheffield), Fritz Peters (University of Sheffield), Madhurananda Pahar (University of Sheffield), Dorota A Braun (University of Sheffield), Caitlin H Illingworth (University of Sheffield), Stefan Goetze (University of Sheffield), Daniel Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Early detection of dementia is critical, yet clinical assessments are often time-consuming and costly. Speech graphs offer a scalable, automated approach by modelling utterances as graphs to capture structural patterns that indicate cognitive decline. While existing methods overlook crucial information such as word meaning and pronunciation, this work proposes Weighted Speech Graphs (WSGs), integrating clinically motivated attributes into speech graphs, and introduces a novel set of graph-derived features for dementia detection. Remarkably, just one or two features can achieve performance comparable to a full baseline feature set, while providing interpretable insights aligned with clinical observations. Furthermore, an open-source Python framework for graph construction, feature extraction and visualisation is released, promoting reproducibility and extensibility. Together, these advances bridge practical application and clinical insights in speech graph-based dementia detection.
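As a rough illustration of the general speech-graph idea, the sketch below treats words as nodes and word-to-word transitions as directed edges. It is not the authors' Weighted Speech Graph framework: the edge weights here are plain transition counts rather than clinically motivated attributes, and the function names, feature names, and example utterance are all hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def build_speech_graph(utterance: str) -> Tuple[Set[str], Counter]:
    """Return (nodes, edges) where edges maps (word, next_word) to a count."""
    words = utterance.lower().split()
    edges: Counter = Counter(zip(words, words[1:]))
    return set(words), edges


def graph_features(nodes: Set[str], edges: Counter) -> Dict[str, int]:
    """A few simple structural features (illustrative, assumed set)."""
    return {
        "nodes": len(nodes),
        "edges": len(edges),  # distinct directed edges
        "repeated_edges": sum(1 for count in edges.values() if count > 1),
        "self_loops": sum(1 for (src, dst) in edges if src == dst),
    }


# Hypothetical utterance containing a word repetition ("is is").
nodes, edges = build_speech_graph(
    "the boy is is taking the cookie and the boy falls"
)
features = graph_features(nodes, edges)
```

Features such as self-loops and repeated edges reflect repetitions in the utterance, which is the kind of structural pattern the abstract refers to.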
Authors: Xinying Wei (University of Sheffield), Eleni Vasilaki (University of Sheffield), Thomas Hain (University of Sheffield)
Abstract: Test-Time Adaptation (TTA) aims to modify a model to better fit test data, using initial inference outcomes. Continual Test-Time Adaptation (CTTA) is TTA where models are incrementally updated, thus allowing them to retain knowledge acquired from prior samples. Applied to automatic speech recognition (ASR), CTTA can reduce Word Error Rates (WERs) on unseen utterances over a non-continual baseline. However, it remains unclear whether CTTA retains performance on already observed utterances, thus effectively accumulating knowledge at the distribution level. This study proposes a novel metric, Forgetting Rate for ASR Continual Test-Time Adaptation (FRACT), for evaluating the loss of learned information in ASR CTTA. The metric better highlights the differences between TTA methods across a range of models and datasets. Results reveal that while forgetting is still observable in prominent ASR CTTA techniques such as Adaptation Without Mode Collapse (AWMC) and Dynamic Single Utterance Test-time Adaptation (DSUTA), it is substantially mitigated compared to a vanilla CTTA baseline. Furthermore, additional evaluation on distribution-level knowledge retention suggests that, although vanilla CTTA accumulates errors as the amount of processed data increases, applying targeted techniques allows utterance-level gains in CTTA to translate into distribution-level improvements.
Authors: Xiaozhou Tan (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Diffusion models for speech enhancement and synthesis tend to rely on heavy mathematical machinery. To ensure mathematical tractability, these models are built on idealized but not fully realistic assumptions. We find that most of these models share a common feature: they can all be reparameterized as a clean data prediction task. We also find that, although many forms of diffusion model have been proposed, the community moved on to more complex methods without first exploring the most basic form of training and inference. Building on the common feature of these more complex methods, we propose a simpler, more basic diffusion-like model and aim to fill this gap. We implemented two applications using this method: speech enhancement and synthesis. Besides its simplicity, this method is more flexible than classic diffusion or flow matching models, as it can incorporate more diverse degradation processes in speech synthesis. This flexibility expands the design space and can potentially improve results.
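One way to see the clean data prediction view is through the common Gaussian forward process. The notation below ($\alpha_t$, $\sigma_t$, $\hat{\epsilon}_\theta$) is an assumption used for illustration and does not come from the abstract; it shows only the standard noise-prediction case, not the authors' proposed model.

```latex
% Forward (degradation) process: clean data x_0, Gaussian noise \epsilon
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% A network that predicts the noise, \hat{\epsilon}_\theta(x_t, t),
% implies an estimate of the clean data by rearranging the line above:
\hat{x}_0(x_t, t) = \frac{x_t - \sigma_t \, \hat{\epsilon}_\theta(x_t, t)}{\alpha_t}
```

Under this convention, training a noise predictor is equivalent, up to a time-dependent weighting of the loss, to training a model that predicts $x_0$ directly.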
Authors: Arezo Shakeri (University of Stavanger, Norway and University of Sheffield), Madhurananda Pahar (University of Sheffield), Ning Ma (University of Sheffield)
Abstract: This study investigates the use of neural audio codec representations for automatic detection of cognitive decline from speech. Cognitive disorders such as Mild Cognitive Impairment (MCI) and Alzheimer's disease (AD) often affect speech production, making speech analysis a promising non-invasive tool for early diagnosis. Unlike traditional approaches that rely on handcrafted acoustic features, linguistic features, or embeddings from large speech models, this work explores discrete speech tokens generated by the Mimi neural audio codec. These tokens jointly capture acoustic, phonetic, and linguistic information while providing compact and computationally efficient representations. Two GPT-2-based approaches were evaluated: (1) custom GPT-2 models trained on Mimi tokens to generate representations for external classifiers, and (2) a pretrained GPT-2 fine-tuned for direct classification. Additional methods using token frequencies, flattened token sequences, and token embeddings were also examined. Experiments were conducted on the PROCESS-2 corpus containing speech recordings from healthy controls, individuals with MCI, and individuals with dementia. The best-performing neural audio codec approach achieved a macro F1-score of 63% for three-class classification, outperforming a manual-transcription baseline (58%) without requiring transcripts. These findings demonstrate the potential of neural audio codecs for efficient, transcript-free cognitive decline detection and motivate future work on improved token utilisation and interpretability.
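For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of its size. The sketch below uses made-up labels and predictions; they are not data or results from the PROCESS-2 experiments.

```python
from typing import Sequence


def macro_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: healthy control, MCI, dementia.
labels = ["HC", "MCI", "DEM"]
y_true = ["HC", "HC", "MCI", "MCI", "DEM", "DEM"]
y_pred = ["HC", "MCI", "MCI", "MCI", "DEM", "HC"]
score = macro_f1(y_true, y_pred, labels)
```

In this toy case the per-class F1 scores are 0.5, 0.8 and 2/3, so the macro average is about 0.66.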
Authors: Xiaolei Xu (University of Sheffield), Chaoyue Niu (University of Sheffield), Guy J. Brown (University of Sheffield), Hector Romero (Passion for Life Healthcare), Ning Ma (University of Sheffield)
Abstract: Obstructive sleep apnea (OSA) is a common sleep disorder with significant health consequences, but many patients remain undiagnosed due to limited access to overnight polysomnography. Acoustic-based screening offers a scalable alternative, but it may generalise poorly across different home environments, recording devices, and noise conditions. Respiratory effort is a key signal used in the clinical scoring of OSA events. Combining respiratory effort with audio can provide a more generalisable OSA screener. However, directly measuring effort requires additional sensors, which reduces comfort and limits the scalability of smartphone-based screening. In this study, we propose a physiology-guided audio modelling framework that incorporates respiratory effort information during training. We hypothesise that respiratory dynamics are implicitly reflected in snoring and breathing acoustics. Explicitly modelling these latent cues can improve the robustness and generalisability of audio-based OSA screening without introducing additional sensors.