The annual SLT CDT conference takes place on Monday 23 June 2025
Venue: Lecture Theatre 1, Diamond Building
Note: all research talks will be delivered in person.
Authors: Madhurananda Pahar, Fuxiang Tao, Bahman Mirheidari, Nathan Pevy, Rebecca Bright, Swapnil Gadgil, Lise Sproson, Dorota Braun, Caitlin Illingworth, Daniel Blackburn, Heidi Christensen (University of Sheffield)
Abstract: The early signs of cognitive decline are often noticeable in conversational speech, and identifying those signs is crucial in dealing with later, more serious stages of neurodegenerative diseases. Clinical detection is costly and time-consuming, and although there has been recent progress in the automatic detection of speech-based cues, those systems are trained on relatively small databases that lack detailed metadata and demographic information. This paper presents CognoSpeak and its associated data collection efforts. CognoSpeak asks long- and short-term memory-probing questions and administers standard cognitive tasks, such as verbal and semantic fluency and picture description, using a virtual agent on a mobile or web platform. In addition, it collects multimodal data such as audio and video, along with a rich set of metadata, from primary and secondary care, memory clinics and remote settings such as people's homes. Here, we present results from 126 subjects whose audio was manually transcribed. Several classic classifiers, as well as large language model-based classifiers, have been investigated and evaluated across the different types of prompts. We demonstrate a high level of performance; in particular, we achieved an F1-score of 0.873 using a DistilBERT model to discriminate people with cognitive impairment (dementia or mild cognitive impairment (MCI)) from healthy volunteers using the memory responses, fluency tasks and Cookie Theft picture description. CognoSpeak is an automatic, remote, low-cost, repeatable, non-invasive and less stressful alternative to existing clinical cognitive assessments.
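As an illustration of the kind of transcript-based classifier described above, the sketch below fine-tunes DistilBERT for a two-class decision (cognitively impaired vs. healthy volunteer) using the Hugging Face Transformers library. The example transcripts, label assignments and hyperparameters are placeholders, not the data or settings used in the paper.

    # Minimal sketch: fine-tuning DistilBERT to classify manually transcribed
    # responses as "cognitively impaired" (1) vs "healthy" (0). Texts, labels
    # and hyperparameters below are illustrative placeholders only.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    # Hypothetical data: one transcribed response per subject, with a binary label.
    texts = ["then she asked me to remember three words ...",
             "the boy is taking a cookie from the jar ..."]
    labels = [1, 0]

    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(labels))
    loader = DataLoader(dataset, batch_size=2, shuffle=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(3):
        for input_ids, attention_mask, y in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()          # cross-entropy over the two classes
            optimizer.step()
            optimizer.zero_grad()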
Authors: Thomas Pickard (University of Sheffield), Aline Villavicencio (University of Exeter), Maggie Mi (University of Sheffield), Wei He (University of Exeter), Dylan Phelps (University of Sheffield), Marco Idiart (Federal University of Rio Grande do Sul)
Abstract: AdMIRe: Advancing Multimodal Idiomaticity Representation is a shared task at SemEval-2025, combining text and images to evaluate language models' processing of idiomatic language. This talk will present the task itself and the results obtained from both participating systems and human annotators. We will also discuss the practicalities of organising and running a challenge like this and our 'lessons learned' for anyone who might consider organising one in the future.
Authors: Xiaozhou Tan (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: I will give an introduction to the application of diffusion models in speech synthesis, and discuss how diffusion-like models (models that iteratively refine their output) can be further explored for speech synthesis.
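To make the idea of iterative refinement concrete, the toy sketch below runs a DDPM-style reverse (denoising) loop over a mel-spectrogram-shaped tensor. The denoiser network, noise schedule and step count are placeholders rather than any particular published speech synthesis system.

    # Toy sketch of DDPM-style iterative refinement for a mel-spectrogram.
    # `denoiser` is a placeholder network that predicts the noise added at step t.
    import torch

    T = 50                                   # number of refinement steps (illustrative)
    betas = torch.linspace(1e-4, 0.02, T)    # noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    def refine(denoiser, shape=(1, 80, 200)):
        x = torch.randn(shape)               # start from pure noise
        for t in reversed(range(T)):
            eps_hat = denoiser(x, torch.tensor([t]))          # predicted noise
            coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
            mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise           # one refinement step
        return x                              # progressively denoised mel-spectrogram

Each pass through the loop removes a little of the noise predicted by the network, so the output is refined gradually rather than generated in a single step.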
Authors: Jinzuomu Zhong (University of Edinburgh), Korin Richmond (University of Edinburgh), Suyuan Liu (University of British Columbia), Dan Wells (University of Edinburgh), Zhiba Su (Independent Researcher), Siqi Sun (University of Edinburgh)
Abstract: While recent Zero-Shot Text-to-Speech (ZS-TTS) models achieve high naturalness and speaker similarity, they fall short in accent fidelity and control - generating hallucinated accents that diverge from the input speech prompt. To address this, we introduce zero-shot accent generation, a new task aimed at synthesising speech with any target content, speaker, and accent. We present AccentBox, the first system capable of this task via a two-stage pipeline. In the first stage, we propose GenAID, a novel Accent Identification model that learns speaker-agnostic accent embeddings, achieving a 0.16 F1-score improvement on unseen speakers. In the second stage, a ZS-TTS model is conditioned on these embeddings, achieving 57.4-70.0% listener preference for accent fidelity, compared to strong baselines. We also advance evaluation methodologies for accent generation. Subjectively, we improve listener guidance with transcriptions and accent difference highlighting, with rigorous listener screening. Objectively, we propose pronunciation-sensitive metrics using vowel formant and phonetic posteriorgram distances, providing more reliable evaluation for underrepresented accents. Looking forward, we aim to expand AccentBox's capabilities to more accents via pseudo-labelling of in-the-wild data, and to improve accent fidelity via formant-guided generation - moving toward fairer and more inclusive speech synthesis for all accents.
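The phonetic posteriorgram (PPG) component of such an objective metric can be illustrated with a short sketch. This is not the paper's exact formulation; it only shows one plausible way to score pronunciation similarity as the average frame-wise cosine distance between two PPGs along a DTW alignment.

    # Illustrative sketch (not the paper's exact metric): a phonetic posteriorgram
    # distance between synthesised and reference speech, computed as the
    # accumulated frame-wise cosine distance along a DTW alignment.
    import numpy as np

    def ppg_distance(ppg_a: np.ndarray, ppg_b: np.ndarray) -> float:
        """ppg_a, ppg_b: (frames, phone_classes) posterior matrices."""
        # Pairwise cosine distance between all frame pairs.
        a = ppg_a / np.linalg.norm(ppg_a, axis=1, keepdims=True)
        b = ppg_b / np.linalg.norm(ppg_b, axis=1, keepdims=True)
        cost = 1.0 - a @ b.T                       # (len_a, len_b)

        # Standard DTW accumulation over the cost matrix.
        n, m = cost.shape
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                     acc[i, j - 1],
                                                     acc[i - 1, j - 1])
        # Normalise by sequence length so scores are comparable across utterances.
        return float(acc[n, m] / (n + m))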
Authors: Hend ElGhazaly (University of Sheffield), Bahman Mirheidari (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield), Heidi Christensen (University of Sheffield)
Abstract: Ensuring fairness in Automatic Speech Recognition (ASR) models requires not only reducing biases but also making sure that fairness improvements generalize beyond the training domain. This challenge is particularly relevant for pre-trained models, which have already been trained on large-scale data and may overfit quickly during fine-tuning. In this work, we investigate contrastive learning as a fairness intervention, introducing a contrastive loss term alongside the standard cross-entropy loss to promote gender-invariant speech representations. Our results show that fairness-aware fine-tuning is highly dependent on training data diversity, with contrastive learning proving effective only when applied to diverse and representative datasets. Simply increasing training data without explicitly enforcing fairness does not ensure bias mitigation. Our findings highlight the need for fairness-aware dataset selection and evaluation beyond in-domain settings to build robust and equitable ASR systems.
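A minimal sketch of how a contrastive term might be combined with the standard objective is shown below, assuming pooled utterance embeddings and a binary gender attribute per example; the pairing strategy, similarity function and weighting are illustrative and not the exact formulation used in this work.

    # Illustrative sketch: combining the standard cross-entropy loss with a
    # contrastive term that pulls utterance embeddings of different genders
    # together, encouraging gender-invariant representations. Not the exact
    # objective used in the work.
    import torch
    import torch.nn.functional as F

    def fairness_aware_loss(ce_loss, embeddings, gender, weight=0.1, temperature=0.1):
        """
        ce_loss:     standard cross-entropy loss from the ASR objective
        embeddings:  (batch, dim) pooled encoder representations
        gender:      (batch,) binary attribute labels (0 / 1)
        """
        z = F.normalize(embeddings, dim=1)
        sim = z @ z.T / temperature                         # pairwise similarities
        # Treat cross-gender pairs as "positives" to encourage invariance.
        cross = gender.unsqueeze(0) != gender.unsqueeze(1)  # (batch, batch) bool
        eye = torch.eye(len(gender), dtype=torch.bool, device=z.device)
        logits = sim.masked_fill(eye, float("-inf"))        # ignore self-similarity
        log_prob = F.log_softmax(logits, dim=1)
        pos_count = cross.sum(dim=1).clamp(min=1)
        contrastive = -(log_prob.masked_fill(~cross, 0.0).sum(dim=1) / pos_count).mean()
        return ce_loss + weight * contrastive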
Authors: Shaun Cassini (University of Sheffield), Thomas Hain (University of Sheffield), Anton Ragni (University of Sheffield)
Abstract: Emphasis plays a key role in spoken communication, conveying intent, emotion, and information structure. It is also a useful attribute for a range of speech technology tasks, such as intent prediction, emotion recognition, and punctuation recovery. Self-supervised speech models (S3Ms) learn general-purpose representations of speech, enabling broad transfer to downstream tasks. However, it remains unclear to what extent S3Ms encode emphasis. Existing studies typically detect only acoustic correlates of emphasis, or fine-tune a single model on an emphasis classification task. In this work, we address three open questions: 1) How is emphasis represented across speech foundation models? 2) How can its presence be quantified? 3) Is emphasis information removed, preserved, or enhanced through downstream fine-tuning? We propose a novel, non-parametric, unitless distance measure for quantifying emphasis encoding, and apply it to a diverse set of S3Ms. Our findings show that emphasis is clearly reflected in model representations, and becomes more accessible after fine-tuning on downstream tasks such as automatic speech recognition.
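The sketch below is not the measure proposed in the work; it only illustrates the kind of unitless, distribution-free score one might compute per layer: the median distance between emphasised and neutral word representations divided by the median within-group distance (a ratio of distances carries no units, and medians avoid distributional assumptions).

    # Illustrative sketch only -- not the measure proposed in the work.
    # Scores how separable emphasised and neutral word representations are
    # in one model layer, as a unitless ratio of median distances.
    import numpy as np

    def emphasis_separability(layer_reps: np.ndarray, is_emphasised: np.ndarray) -> float:
        """
        layer_reps:    (words, dim) pooled representations from one S3M layer
        is_emphasised: (words,) boolean labels for emphasised words
        """
        emph = layer_reps[is_emphasised]
        neut = layer_reps[~is_emphasised]

        between = np.linalg.norm(emph[:, None, :] - neut[None, :, :], axis=-1)
        within_e = np.linalg.norm(emph[:, None, :] - emph[None, :, :], axis=-1)
        within_n = np.linalg.norm(neut[:, None, :] - neut[None, :, :], axis=-1)
        within = np.concatenate([within_e[np.triu_indices(len(emph), k=1)],
                                 within_n[np.triu_indices(len(neut), k=1)]])

        return float(np.median(between) / np.median(within))  # > 1 => more separable

Computing such a score layer by layer, before and after fine-tuning, is one way to track whether emphasis information is removed, preserved, or enhanced.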