Posters

Annual Conference 2025 – Research posters

Venue: Basement exhibition space, Diamond Building. Note: all poster presentations will be delivered in person.

Poster Session 1 (morning): even-numbered posters (poster numbers 2, 4, 6, ...)

Poster Session 2 (afternoon): odd-numbered posters (poster numbers 1, 3, 5, ...)

Poster 1 Can we reconstruct a dysarthric voice with the large speech model Parler TTS?

Authors: Ariadna Sanchez (Centre for Speech Technology Research, University of Edinburgh), Simon King (Centre for Speech Technology Research, University of Edinburgh)

Abstract: Speech disorders can make communication hard or even impossible for those who develop them. Personalised Text-to-Speech is an attractive option as a communication aid. We attempt voice reconstruction using a large speech model, with which we generate an approximation of a dysarthric speaker's voice prior to the onset of their condition. In particular, we investigate whether a state-of-the-art large speech model, Parler TTS, can generate intelligible speech while maintaining speaker identity. We curate a dataset and annotate it with relevant speaker and intelligibility information, and use this to fine-tune the model. Our results show that the model can indeed learn to generate from the distribution of this challenging data, but struggles to control intelligibility and to maintain consistent speaker identity. We propose future directions to improve controllability of this class of model, for the voice reconstruction task.

Poster 2 Can we trust AI to detect cognitive decline of multilingual English speakers in the UK? An investigation using real-world conversational speech

Authors: Madhurananda Pahar, Caitlin Illingworth, Dorota Braun, Bahman Mirheidari, Lise Sproson, Daniel Blackburn, Heidi Christensen (University of Sheffield)

Abstract: Conversational speech often reveals early signs of cognitive decline, such as dementia and mild cognitive impairment (MCI). Numerous recent studies have demonstrated the high performance of AI models in detecting Alzheimer’s dementia among monolingual speakers in their native languages. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study investigates the trustworthiness of AI models, i.e. the presence of bias, in detecting cognitive decline among multilingual English speakers so that these tools can be clinically beneficial. Monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford, giving a total of 1,166 participants contributing 218.73 hours of speech. In addition to speaking English with a non-native accent, the multilingual participants spoke Somali, Chinese, or South Asian languages (Hindi, Urdu, Punjabi, Mirpuri, Arabic, etc.); they were further divided into two Yorkshire accent groups (West and South) to challenge the AI tools thoroughly. Although ASR systems (Whisper, Wav2Vec 2.0, and NeMo) showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilingual speakers were more likely to be misclassified as having more severe cognitive decline, and qualitative analysis, including word cloud visualisations, indicated that multilingual participants more frequently referenced their country or place of origin during memory tasks.
This study is the first of its kind to discover that, despite their strong overall performance, current AI models show bias against multilingual individuals from ethnic minority backgrounds in the UK, and they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with a more severe case of cognitive decline. We conclude that the existing AI tools are therefore not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.

Poster 3 Utility-Based stopping methods

Authors: Aaron Fletcher (University of Sheffield), Mark Stevenson (University of Sheffield)

Abstract: Conducting efficient systematic reviews is crucial given the growing volume of scientific literature. Technology Assisted Review offers tools to streamline this process, but traditional stopping approaches focus on achieving specific target recall levels, which may not always satisfy specific information needs. This paper introduces utility-based stopping algorithms, including Harmonic Shrinkage and Confidence Boundary, that prioritise the value of retrieved information in addressing a defined information need. These algorithms analyse statistical properties of point estimates, such as sensitivity and false positive rate, to determine when sufficient evidence has been gathered. Experiments using CLEF 2017-2019 datasets demonstrate that these moment-based approaches can significantly reduce the percentage of reviewed relevant documents (PRD) while maintaining high decision agreement. Notably, these methods achieve substantially reduced PRD while maintaining decision agreement when compared to traditional target-recall approaches. This research advances TAR methodologies by shifting from recall-oriented stopping rules to utility-driven approaches, aligning stopping decisions with users' information needs to enhance the efficiency and effectiveness of document review.

Poster 4 Lost in Translation? Measuring gender stereotypes across languages in multilingual LLMs

Authors: Jacqueline Rowe (University of Edinburgh), Mateusz Klimaszewski (Warsaw University of Technology), Liane Guillou (Aveni), Shannon Vallor (University of Edinburgh), Alexandra Birch (University of Edinburgh/Aveni)

Abstract: Large language models (LLMs) encode and sometimes exacerbate gender biases present in their training texts, resulting in stereotypical reasoning, biased text generation and translation, and discriminatory outcomes when LLMs are applied to a range of contexts and tasks. While much work has been done on measuring and benchmarking gender biases in LLMs' English-language outputs, there have been only a few investigations into LLM gender bias in non-English languages, and these have been limited to select languages and tasks. With most widely-used language models now supporting dozens, if not hundreds, of languages, it is important to understand how gender biases might persist or change across different linguistic contexts to ensure that appropriate mitigations for these biases can be implemented when such models are used in practice. In this work, we extend an existing dataset for measuring multilingual gender bias across 29 languages, and use this new benchmark to compare gender biases across languages and across models, exploring whether model size, presence or absence of grammatical gender, and training data composition impacts how gender-biased model outputs are across different languages. This poster will present our methods and key findings.

Poster 5 Instrumental assessment and management of velopharyngeal incompetency, using SNORS device: a pilot study

Authors: Eyüp Sezer (Fenerbahçe Üniversitesi), Dilan Selen Akıl (Bahcelievler Ozel Egitim Merkezi), Kemal Eren Cengiz (Alrenas Technology AS.)

Abstract: The super nasal oral ratiometry system (SNORS) is a commercially available system which measures both nasal and oral airflow during speech, allowing the rapid movement of the velum to be measured. SNORS can be used for objective assessment, where the subject is required to speak a few words selected to demonstrate velopharyngeal function. SNORS also provides biofeedback, using a simple real-time display of nasal and oral airflow during the procedure. Velopharyngeal insufficiency/incompetency (VPI) is the inability to achieve adequate velopharyngeal closure, and it results in abnormal speech characteristics, such as omissions, substitutions or weak articulation of consonants, and hypernasality. The patient, B. A., a 19-year-old male, had mild to moderate hypernasal speech and nasalance with no known medical history. Initial oral motor examination revealed sufficient velar movement, though without complete velopharyngeal sealing. The SNORS device was successfully used as both an assessment and a therapeutic tool in the management of this patient. Moreover, the effectiveness of conventional speech and language therapy was compared with that of SNORS biofeedback therapy. Initially, while there was some movement of the velum, the patient could not achieve velopharyngeal closure. Conventional therapy aimed to strengthen and improve the function of the velum, and following this there was some minimal improvement: the patient could now achieve, but not maintain, closure as needed in discourse. SNORS biofeedback therapy was then given as an alternative option. This raised the patient's awareness of his velopharyngeal function, helping him to maintain closure and thereby reducing hypernasality. In this case, SNORS therapy proved significantly more effective than conventional speech and language therapy. Key words: hypernasality, nasalance, nasal emission, velopharyngeal sealing

Poster 6 How Private are Language Models in Abstractive Summarization?

Authors: Anthony Hughes (University of Sheffield), Ning Ma (University of Sheffield), Nikolaos Aletras (University of Sheffield)

Abstract: In medical and legal settings, the appropriate management of data is essential. Protective laws, like HIPAA and GDPR, ensure that individuals' data is not leaked into the public domain. Although essential for confidentiality, this inhibits data sharing, consequently limiting access to potentially critical intelligence. Given that language models (LMs) have shown outstanding performance in text summarization, understanding to what extent LMs can provide privacy-preserving summaries given a non-private source document would be of great value to intelligence communities. In this paper, we perform a comprehensive study across 3 closed- and 2 open-weight LMs of different sizes and families. We experiment with prompting and fine-tuning strategies for privacy preservation across a range of summarization datasets. Our extensive quantitative and qualitative analysis, including human evaluation, shows that LMs often cannot prevent PII leakage in their summaries and that current widely used metrics cannot capture context-dependent privacy risks.

Poster 7 Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

Authors: Amit Meghanani (University of Sheffield), Thomas Hain (University of Sheffield)

Abstract: Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named "LASER: Learning by Aligning Self-supervised Representations" is presented. LASER is based on the soft-DTW alignment loss with a temporal regularisation term. Experiments are conducted with HuBERT and WavLM models and evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). Relative improvements of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM, are observed for the ASR and PR tasks respectively, with fewer than 3 hours of fine-tuning on a single GPU.

Poster 8 Predicting Interaction Quality from Spoken Dialogue

Authors: Paul Gering (University of Sheffield), Heidi Christensen (University of Sheffield), Roger K Moore (University of Sheffield)

Abstract: Recent research has used machine learning approaches to implement an automatic, quantitative measure of interaction quality. Schmitt and colleagues used the LEGO spoken dialogue corpus, a publicly available database of telephone-based human-agent interactions, to predict turn-level interaction quality using SVM and LSTM classifiers. These classifiers were trained on automatically derived features relating to the performance of the dialogue system's ASR and DM subcomponents. This project aims to extend the work of Schmitt and colleagues by training machine learning models on linguistic and paralinguistic speech features, such as pauses, intonation and turn-taking, from the LEGO corpus to predict interaction quality. These speech features have been used effectively for similar machine learning problems, such as engagement estimation and emotion detection. Additionally, it is assumed that individuals use these speech features to evaluate everyday interactions. Therefore, the performance of the trained models is expected to surpass that of the models implemented by Schmitt and colleagues. This is the first of three planned experiments focusing on implementing an automatic, quantitative, local measure of interaction quality.

Poster 9 Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?

Authors: Valeria Pastorino (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)

Abstract: Framing in media critically shapes public perception by selectively emphasising some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.

Poster 10 Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Authors: Wing-Zin Leung (University of Sheffield), Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield), Stefan Goetze (South Westphalia University of Applied Sciences, Iserlohn, Germany)

Abstract: Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. However, progress in dysarthric ASR (DASR) has been limited by high variability in dysarthric speech and limited public availability of dysarthric training data. This paper demonstrates that data augmentation using text-to-dysarthric-speech (TTDS) synthesis for fine-tuning large ASR models is effective for DASR. Specifically, diffusion-based text-to-speech (TTS) models can produce speech samples similar to dysarthric speech that can be used as additional training data for fine-tuning ASR foundation models, in this case Whisper. Results show improved synthesis metrics and ASR performance for the proposed multi-speaker diffusion-based TTDS data augmentation for ASR fine-tuning compared to current DASR baselines.

Poster 11 Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings

Authors: Jason Clarke (University of Sheffield), Yoshihiko Gotoh (University of Sheffield), Stefan Goetze (South Westphalia University of Applied Sciences, Iserlohn, Germany)

Abstract: Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN) which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this work focuses on improving speaker-embedding-informed ASD in the context of egocentric recordings, which can be characterised by acoustic noise and highly dynamic scenes. SCAN is implemented with two well-established baselines, namely TalkNet and Light-ASD, yielding a relative improvement in mAP of 14.5% and 10.3% on the Ego4D benchmark, respectively.

Poster 12 Flowing Straighter with Conditional Flow Matching for Accurate Speech Enhancement

Authors: Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: Current flow-based generative speech enhancement methods learn curved probability paths which model a mapping between clean and noisy speech. Despite impressive performance, the implications of curved probability paths are unknown. Methods such as Schrödinger bridges focus on curved paths, where time-dependent gradients and variance do not promote straight paths. Findings in machine learning research suggest that straight paths, such as conditional flow matching, are easier to train and offer better generalisation. Here we show the effect of path straightness on speech enhancement quality. We report experiments with the Schrödinger bridge, where we show that certain configurations are straighter. Conversely, we propose independent conditional flow-matching for speech enhancement, which models straight paths between noisy and clean speech. We identify empirically that a time-independent variance has a greater effect on sample quality than the gradient. Conditional flow matching improves several speech quality metrics, but requires more inference steps. We rectify this with a one-step solution by inferring the trained flow-based model as if it were directly predictive. Our work suggests that straighter time-independent probability paths improve generative speech enhancement over curved time-dependent paths.

Poster 13 Fast-Tuning of Large Language Models for Dynamically Changing Tasks

Authors: Yaseen Mohammed Osman (University of Southampton), Stuart E. Middleton (University of Southampton), Geoff V. Merrett (University of Southampton)

Abstract: In this study, we conducted a series of comparative studies to quantify the impact of rapid fine-tuning of large language models (LLMs), using SQuAD 2.0 as our benchmark. Based on our experiments, we observe that rapid fine-tuning is indeed sufficient and can be better than few-shot prompting. In fact, we find that fine-tuning an 8B Llama3.1 model for only 5 minutes using just 0.01% of the SQuAD 2.0 training dataset can result in a 12.31% increase in performance (i.e., exact match metric, from 77% to 83.69%) as well as making the 8B model outperform the 70B variant. Deploying such a model can save 89% of the hardware costs. Our approach can also achieve this while reducing both the inference time and energy consumption by around 12% compared with the same-sized model. This study provides evidence that rapid retraining of LLMs can work. As future work, we plan to investigate Active Learning algorithms to explore their impact on in-context few-shot learning.

Poster 14 From Text Sentiment to Speech Prosody: Cross-Modal Analysis and Implications for Synthesis

Authors: Boxuan Shan (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: In expressive speech synthesis, various factors can influence the synthesised speech, such as prosody, speaker identity, emotion, etc. However, little work has been done on the influence of sentiment on speech. Therefore, in this work, the effect of sentiment on speech is explored, and as a starting point, we focus on its effect on prosody. Firstly, we examined the differences in the distribution of prosody for different sentiment groups in real speech data. In this process, a sentiment analysis technique was adopted to perform sentiment grouping on the transcription of the speech dataset. Furthermore, some mainstream TTS models are examined to investigate whether their prediction performance is sensitive to sentiment grouping. Our experimental results show that sentiment has a significant impact on prosody, and that current TTS models are sensitive to sentiment, which means that their prediction performance on prosody varies across different sentiment groups. However, the reason for this difference is unclear. These results suggest that the effect of sentiment on speech synthesis is a worthwhile direction for further research.

Poster 15 Sources of Overfitting in Neural Abstractive Summarization of Scraped News and Business Text

Authors: Donovan Wright (University of Sheffield), Robert Gaizauskas (University of Sheffield), Mark Stevenson (University of Sheffield)

Abstract: Journalistic writing conforms to an inverted pyramid structure in which the most important elements of the news content appear at the top of a news column, with supporting content following. This inverted pyramid structure of journalistic writing results in layout bias. Indeed, some researchers have exploited the layout bias in news content. Because of the bias within the datasets, abstractive summarization models developed with datasets composed of scraped news content are unlikely to generalize to non-news content-based datasets. This research focusses on noise or differences in similarity between abstractive summarization models, which have been developed using unconstrained online news content, and seeks to make quantitative comparisons with datasets originating from non-news based content.

Poster 16 Alzheimer's Dementia Detection Using Perplexity from Paired Large Language Models

Authors: Yao Xiao (University of Sheffield), Heidi Christensen (University of Sheffield), Stefan Goetze (University of Sheffield)

Abstract: Alzheimer's dementia (AD) is a neurodegenerative disorder with cognitive decline that commonly impacts older people. AD can cause language-related deficits, such as difficulty in word-searching. This poses challenges for individuals in efficiently communicating with others, and consequently affects their quality of life. The linguistic and acoustic differences between impaired and unimpaired speech or language have enabled clinicians to perform screening for AD. Such cues have also inspired automated AD detection, serving as a non-intrusive, scalable, and cost-effective way to facilitate early detection, monitoring, and management of such conditions. This project investigates the use of speech and language technologies for the automated detection of AD. By analysing spontaneous speech from cognitive assessments, the project explores novel methods to achieve accurate AD screening and progression monitoring.

Poster 17 Integrating Task-Specific Features and Spatio-Semantic Graphs for the Automatic Detection and Subtyping of Primary Progressive Aphasia from Speech

Authors: Fritz Peters (University of Sheffield), W Richard Bevan-Jones (University of Sheffield), Grace Threlfall (University of Sheffield), Jenny M Harris (University of Exeter), Julie S Snowden (University of Manchester), Matthew Jones (University of Manchester), Jennifer C Thompson (University of Manchester), Daniel J Blackburn (University of Sheffield), Heidi Christensen (University of Sheffield)

Abstract: Primary progressive aphasia (PPA) encompasses a group of neurodegenerative disorders primarily affecting language abilities. Diagnosing PPA typically requires experienced clinicians, who are often only available in specialized hospital settings. Early in the disease course, individuals with PPA frequently exhibit noticeable changes in speech and language. In this study, we extracted acoustic, linguistic, and task-specific features from audio recordings to assess their effectiveness for PPA classification. Using a subset of task-specific features, we identified PPA with 97% accuracy. For subtyping, models trained on the complete feature set achieved 74% accuracy in distinguishing between the traditional three PPA variants. Our findings underscore the added value of task-specific features, which enhance conventional approaches. Furthermore, their visualisation provides an intuitive representation of task performance, improving clinical interpretability and potential diagnostic applications.

Poster 18 Autoregressive Diffusion Models for Frame-by-Frame Speech Synthesis

Authors: Minghui Zhao (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: Diffusion models have shown strong performance in speech synthesis, primarily through non-autoregressive, parallel generation. While effective in terms of quality and diversity, these models can be less effective at capturing fine-grained frame-to-frame dependencies. In this work, we propose an autoregressive diffusion model that generates speech frame by frame, without adhering to a fixed temporal order. The model preserves autoregressive conditioning across frames. Experimental results show improved performance in pitch accuracy, as measured by log F0, demonstrating the model's ability to produce more natural and expressive speech.

Poster 19 Causal Disentangled Speech Representations

Authors: Jack Cox (University of Sheffield), Prof. Jon Barker (University of Sheffield), Dr. Maha Elbayad (Meta)

Abstract: Learning disentangled representations of speech data has the potential to provide benefits in robustness, domain generalisation, controllability and interpretability, and fairness, but a lack of clarity around disentanglement has made progress difficult. We bring work in the broader machine learning community on causal representation learning to the context of speech disentanglement, and ask whether approaching speech as the result of a causal generative process can result in better representations.

Poster 20 On the Impact of Calibration Data in Post-training Quantization and Pruning

Authors: Miles Williams (University of Sheffield), Nikolaos Aletras (University of Sheffield)

Abstract: Quantization and pruning form the foundation of compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated remarkable performance in a post-training setting. They rely upon calibration data, a small set of unlabeled examples that are used to generate layer activations. However, no prior work has systematically investigated how the calibration data impacts the effectiveness of model compression methods. In this paper, we present the first extensive empirical study on the effect of calibration data upon LLM performance. We trial a variety of quantization and pruning methods, datasets, tasks, and models. Surprisingly, we find substantial variations in downstream task performance, contrasting existing work that suggests a greater level of robustness to the calibration data. Finally, we make a series of recommendations for the effective use of calibration data in LLM quantization and pruning.

Poster 21 Evaluating Large Language Models for Reasoning in Voice-Controlled Smart Homes

Authors: Mary Hewitt (University of Sheffield), Hamish Cunningham (University of Sheffield)

Abstract: Smart home systems often rely on cloud-based assistants for voice control, but these raise privacy concerns and struggle with underspecified or indirect requests due to rigid command structures. This work explores how large language models (LLMs) can address these limitations, enabling more flexible interpretation of spoken commands, though with increased risk of error. We evaluate open-source LLMs as the reasoning component of a local smart home assistant, running on consumer-grade hardware without internet access. Using an annotated dataset of natural language commands, we benchmark multiple LLMs across homes of varying complexity, assessing their ability to generate both structured control outputs and conversational responses grounded in a home model. We identify common error types and explore prompting techniques that improve model behaviour on this task. Our results highlight the potential of local LLMs for private, adaptable smart home control, and we outline methods for their safe deployment in real-world environments.

Poster 22 A Dataset for Enhancing Conversations for the Hearing Impaired

Authors: Robert Sutherland (University of Sheffield), Jason Clarke (University of Sheffield), Stefan Goetze (University of Sheffield), Jon Barker (University of Sheffield)

Abstract: With machine learning techniques being developed and applied to the domain of speech enhancement for hearing aids, there comes a need for high-quality datasets which represent realistic scenarios that might be faced by hearing-impaired listeners in everyday life. In this work, we discuss the development of a dataset for Enhancing Conversations for the Hearing Impaired (ECHI), which is comprised of four-party conversations in a simulated cafe/restaurant environment. For this dataset, we recorded over 30 hours of conversational speech from over fifty 4-person conversations involving over 200 participants. For each session, we captured 55 channels of audio - including ego-centric recordings from the Meta Aria glasses and hearing aid microphones, synchronised with audio from eight separate 4-element fixed microphone arrays. In addition, we recorded six-degree-of-freedom head motion data for all four participants, and video from the Aria glasses worn by one of the participants. The data will have a number of possible use cases, which are relevant to the tasks of speech enhancement for hearing aids, distant-microphone speech processing, audio-visual speech processing, as well as investigations into head motion and behaviour in conversations in noisy environments. A subset of this data will be used as the dataset for the CHiME9-ECHI Challenge.

Poster 23 On the impact of Matrix Language identification for Automatic Speech Recognition of Code-Switched speech

Authors: Olga Iakovenko (University of Sheffield), Thomas Hain (University of Sheffield)

Abstract: Code-switching (CS) is the alternation between two or more languages within a single conversation. CS presents significant challenges for automatic speech recognition (ASR) systems due to abrupt language changes, mixed linguistic structures and varying phonetic and syntactic patterns. Another difficulty when dealing with CS is data scarcity when compared to monolingual data. One crucial aspect of enhancing ASR performance for CS is the accurate identification of the Matrix Language (MLID), which provides the syntactic and structural framework for CS utterances. This paper investigates the impact of MLID on the effectiveness and accuracy of ASR systems when processing CS speech. The MLID was predicted from CS audio simultaneously with the ASR and language diarisation (LD) tasks. This was compared to a similar setup trained in a multi-task learning fashion combining the ASR, LD and utterance language identification (LID) tasks, where CS utterances were regarded as a separate language. The proposed CS ASR system showed an absolute Mixed Error Rate (MER) decrease of 0.5% in comparison to the baseline and an absolute MER decrease of at least 0.2% in comparison to existing CS ASR multitask learning setups. The proposed approach has demonstrated that predicting the MLID as Mandarin leads to an increase in recognised function words, indicating that MLID informs the ASR decoder of the grammatical properties of the utterance. The study highlights the potential of MLID-aware ASR systems in various applications, from multilingual virtual assistants to real-time translation services, indicating a broader applicability of the approach.

Poster 24 Preference Tuning Under Domain Shift

Authors: Constantinos Karouzos (University of Sheffield), Nikos Aletras (University of Sheffield)

Abstract: Recent progress in aligning large language models (LLMs) with human preferences through techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) has led to significant improvements in generation quality. However, most studies train and evaluate models in a single, static domain, even though deployment settings often encounter substantial domain shifts. In this work, we present an extensive empirical study of preference optimization under domain shift. We evaluate a suite of preference optimization methods, including Supervised Fine-Tuning (SFT), RLHF, DPO, KTO, and GRPO, on summarization (using a Reddit TL;DR dataset as the source and CNN/DM as the target). Furthermore, we explore several domain adaptation strategies, such as domain-adaptive pre-training (DAPT), interleaved training, pseudo-labeling, and domain-adversarial techniques, to assess their impact on transferring alignment knowledge to new domains. Our results highlight significant discrepancies in performance between source and target domains and reveal promising approaches for mitigating the alignment gap under domain shift.

Poster 25 Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context

Authors: Maggie Mi (University of Sheffield), Aline Villavicencio (University of Exeter), Nafise Sadat Moosavi (University of Sheffield)

Abstract: Recent work has shown that Large Language Models (LLMs) perform well on idiomaticity detection tasks. However, existing evaluation methods often contain confounds that undermine the validity of these results. In this work, we introduce a new evaluation dataset designed to more rigorously test whether LLMs can disambiguate idiomatic meaning based on context. Our findings show that LLMs frequently fail to effectively leverage contextual cues. Moreover, linguistic features known to influence human understanding of idioms, such as collocational frequency and sentence likelihood, which proxy for familiarity, do not consistently correlate with improved model performance.
