Annual Conference 2023 - Research talks

Venue: Lecture Theatre 2, The Wave. Note: all research talks will be delivered in person.

MTCue: Learning Zero-Shot Control of Extra-Textual Attributes by Leveraging Unstructured Context in Neural Machine Translation

Author: Sebastian Vincent, University of Sheffield
Co-authors: Rob Flynn and Dr Carolina Scarton, University of Sheffield

Talk scheduled for: 11:30am - 11:50am on Tuesday 13 June

Abstract: Efficient use of both intra- and extra-textual context is one of the critical gaps between machine and human translation. To date, research has mostly focused on individual, well-defined types of context, such as the surrounding text or discrete external variables like the speaker's gender. This work introduces MTCue, a novel neural machine translation framework which interprets all context (including discrete variables) as text and learns an abstract representation of context, enabling transfer across different data settings and leveraging similar attributes in low-resource settings.

Focusing on the domain of dialogue with access to document and metadata context, we perform an extensive evaluation on three language pairs in both directions; MTCue achieves impressive gains in translation quality over a non-contextual baseline as measured by BLEU (+1.28 to 2.83) and COMET (+2.5 to 5.03). Further analysis shows that the context encoder of MTCue learns a representation of the context space that is organised with respect to specific attributes, such as formality, effectively enabling their zero-shot control. Pre-training on context embeddings also improves MTCue's few-shot performance compared to a tagging baseline. Finally, we conduct an ablation study over model components and contextual variables, gathering further evidence of the robustness of MTCue.
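
The core idea described above is that discrete attributes and document context are all verbalised as text before being embedded by a context encoder. A minimal sketch of that verbalisation step is given below; the function name, attribute keys, and values are illustrative assumptions, not the authors' actual interface.

```python
# Illustrative sketch only: MTCue's core idea is that every kind of context,
# including discrete metadata, is verbalised as plain text before being
# embedded by a context encoder. The function and attribute names below are
# assumptions for illustration, not the authors' actual interface.

from typing import Mapping, Sequence

def verbalise_context(metadata: Mapping[str, str],
                      preceding_sentences: Sequence[str]) -> list:
    """Turn discrete attributes plus document context into textual cues."""
    cues = [f"{key}: {value}" for key, value in metadata.items()]
    cues.extend(preceding_sentences)  # surrounding dialogue lines
    return cues

cues = verbalise_context(
    {"speaker gender": "female", "formality": "informal", "genre": "drama"},
    ["Hey, long time no see!", "How have you been?"],
)
print(cues)
# Each cue would then be embedded (e.g. by a frozen sentence encoder) and the
# resulting vectors attended to by the translation model, so unseen attribute
# values can still be interpreted at test time (zero-shot control).
```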

What can MINERVA2 tell us about killing hope? Exploring L2 Collocational Processing as Memory Retrieval

Author: Sydelle de Souza, University of Edinburgh
Co-author: Ivan Vegner, University of Edinburgh

Talk scheduled for: 11:50am - 12:10pm on Tuesday 13 June

Abstract: Language processing relies on a division of labour: compositional, rule-based language is thought to be computed on the fly, while non-compositional idiomatic units are thought to be holistically stored in and retrieved from memory. Collocations (e.g., kill hope) are frequently occurring word combinations in which one word is used figuratively (kill) and the other literally (hope), characterised by an arbitrary restriction on substitution (#murder hope). As such, they are neither fully compositional nor fully idiomatic, and little is known about the underlying psycholinguistic mechanisms at play in collocational processing. Furthermore, collocations are notoriously difficult for L2 speakers to acquire, and behavioural evidence shows that they incur a processing cost relative to both fully compositional language and idioms.

Given that collocations possess idiosyncratic meanings, we hypothesise that their processing relies on memory retrieval. We therefore model L2 collocational processing using MINERVA2, a frequency-based, global-matching memory model. We use vector embeddings from DistilBERT as our input and systematically vary assumptions about how L2 speakers represent and store semantic information in memory, in order to simulate various trends in L2 collocational processing. We modify MINERVA2 to simulate reaction times and compare these to L1 and L2 acceptability judgement data on the same items. Under the assumptions that (i) the L2 lexicon develops in relation to that of the L1, and (ii) the L2 lexicon is sensitive to L1 frequencies, our results indicate that MINERVA2 can account for trends in both L1 and L2 collocational processing.
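
The global-matching computation at the heart of MINERVA2 is compact enough to sketch directly. The snippet below computes echo intensity over continuous embeddings; cubing the similarities follows the standard model, while cosine similarity and random vectors (in place of the DistilBERT inputs used in this work) are simplifying assumptions for illustration.

```python
# Minimal sketch of MINERVA2-style global matching over continuous vectors.
# Cubing the similarities follows the standard model; using cosine similarity
# over embeddings, and random vectors in place of the DistilBERT inputs used
# in this work, are simplifying assumptions for illustration.

import numpy as np

def echo_intensity(probe: np.ndarray, memory: np.ndarray) -> float:
    """probe: (dim,); memory: (n_traces, dim) matrix of stored exemplars."""
    sims = memory @ probe / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(probe) + 1e-12
    )                                # cosine similarity to every stored trace
    activations = sims ** 3          # MINERVA2: activation = similarity cubed
    return float(activations.sum())  # echo intensity = summed activation

rng = np.random.default_rng(0)
memory = rng.normal(size=(1000, 768))  # toy stand-in for stored collocations
probe = rng.normal(size=768)           # toy stand-in for a probe like "kill hope"
print(echo_intensity(probe, memory))
```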

Speech and Cognitive Decline

Author: Megan Thomas, University of Sheffield
Co-authors: Dr Traci Walker and Prof Heidi Christensen, University of Sheffield

Talk scheduled for: 12:10pm - 12:30pm on Tuesday 13 June

Abstract: Early work in this thesis examined the use of disfluency features as a method of differentiating between healthy controls and people at different stages of cognitive decline. We found that some disfluency features (including the length and number of pauses, phonetic additions and deletions, and word repetitions) can differentiate between people with different levels of cognition, and that this disfluency information can be a valuable addition to automatic cognitive decline detection systems. Current work focuses on conversation analysis and what it can tell us about how people interact with automatic cognitive decline detection systems, with the aim of increasing the amount of speech such systems can elicit from patients.
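
As an illustration of the simplest of the disfluency features mentioned above, the toy sketch below computes the number and mean length of pauses from word-level timestamps such as those produced by a forced aligner or ASR system; it is illustrative only, not the pipeline used in the thesis.

```python
# Illustrative toy, not the thesis pipeline: two of the disfluency features
# mentioned above (number and mean length of pauses), computed from
# word-level timestamps such as those produced by a forced aligner or ASR.

def pause_features(word_times, min_pause=0.25):
    """word_times: list of (start, end) times in seconds, one pair per word."""
    pauses = [
        nxt_start - prev_end
        for (_, prev_end), (nxt_start, _) in zip(word_times, word_times[1:])
        if nxt_start - prev_end >= min_pause
    ]
    return {
        "n_pauses": len(pauses),
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
    }

print(pause_features([(0.0, 0.4), (0.9, 1.3), (1.35, 1.8), (3.0, 3.5)]))
```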

DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models

Author: Amr Keleg, University of Edinburgh 

Talk scheduled for: 12:30pm - 12:50pm on Tuesday 13 June

Abstract: A few benchmarking datasets have been released to evaluate the factual knowledge of pretrained language models. These benchmarks (e.g., LAMA and ParaRel) were mainly developed in English and later translated to form multilingual versions (e.g., mLAMA and mParaRel). Results on these multilingual benchmarks suggest that using English prompts to recall facts from multilingual models usually yields significantly better and more consistent performance than using non-English prompts. Our analysis shows that mLAMA is biased toward facts from Western countries, which might affect the fairness of probing models. We propose a new framework for curating culturally diverse factual triples from Wikidata. A new benchmark, DLAMA-v1, is built from factual triples for three pairs of contrasting cultures, comprising a total of 78,259 triples covering 20 relation predicates. The three pairs represent (Arab and Western), (Asian and Western), and (South American and Western) countries respectively. Evaluation on this more balanced benchmark shows that mBERT performs better on Western facts than on non-Western ones, while monolingual Arabic, English, and Korean models tend to perform better on their culturally proximate facts. Moreover, both monolingual and multilingual models tend to make predictions that are culturally or geographically relevant to the correct label, even when the prediction is wrong.
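
A hedged sketch of the kind of curation step such a framework automates is given below: pulling (subject, relation, object) facts from Wikidata restricted to a chosen region, so that Western and non-Western facts can be balanced. It uses the public Wikidata SPARQL endpoint; the property and entity IDs (P30 for continent, P36 for capital, Q48 for Asia) are quoted from memory and should be verified, and the real framework covers many more relation predicates.

```python
# Hedged sketch of one curation step: retrieving (country, capital) facts
# from Wikidata restricted to a region. Uses the public SPARQL endpoint;
# the IDs (P30 = continent, P36 = capital, Q48 = Asia) are quoted from memory
# and should be verified, and the actual framework covers many more relations.

import requests

SPARQL = """
SELECT ?countryLabel ?capitalLabel WHERE {
  ?country wdt:P30 wd:Q48 .    # country located on the continent of Asia
  ?country wdt:P36 ?capital .  # relation predicate: capital
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 20
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "dlama-sketch/0.1 (example)"},
    timeout=60,
)
for row in response.json()["results"]["bindings"]:
    print(row["countryLabel"]["value"], "->", row["capitalLabel"]["value"])
```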

Perceive and predict: self-supervised speech representation based loss functions for speech enhancement

Author: George Close, University of Sheffield
Co-author: Dr Stefan Goetze, University of Sheffield

Talk scheduled for: 2:30pm - 2:50pm on Tuesday 13 June

Abstract: Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models, rather than the earlier feature encodings, and the use of self-supervised representations in this way is often not fully motivated. In this work, the distance between the feature encodings of clean and noisy speech is shown to correlate strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Speech enhancement models trained with loss functions based on these distance measures show improved performance over a number of commonly used baseline loss functions.
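
The loss idea can be sketched compactly: encode both the enhanced and the clean signal with a frozen feature encoder and penalise the distance between the encodings. In the sketch below the tiny convolutional stack is only a stand-in for a real self-supervised feature encoder (for example, the convolutional front end of a wav2vec 2.0-style model); the layer sizes and the L1 distance are illustrative choices, not the authors' configuration.

```python
# Minimal sketch of the loss idea: compare early feature encodings of
# enhanced and clean speech under a frozen encoder and use their distance as
# (part of) the training objective. The tiny Conv1d stack is only a stand-in
# for a real self-supervised feature encoder; layer sizes and the L1 distance
# are illustrative choices, not the authors' configuration.

import torch
import torch.nn as nn

class StandInFeatureEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2), nn.GELU(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        return self.net(wav.unsqueeze(1))  # (batch, channels, frames)

encoder = StandInFeatureEncoder().eval()
for p in encoder.parameters():             # frozen: used only inside the loss
    p.requires_grad_(False)

def feature_distance_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """Mean distance between feature encodings of enhanced and clean speech."""
    return nn.functional.l1_loss(encoder(enhanced), encoder(clean))

clean = torch.randn(2, 16000)                    # toy 1-second waveforms
enhanced = clean + 0.1 * torch.randn(2, 16000)   # stand-in enhancer output
print(feature_distance_loss(enhanced, clean))
```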

Memorisation maps for neural machine translation

Author: Verna Dankers, University of Edinburgh

Talk scheduled for: 2:50pm - 3:10pm on Tuesday 13 June

Abstract: Redacted

FERMAT: An Alternative to Accuracy for Numerical Reasoning

Author: Jasivan Sivakumar, University of Sheffield
Co-authors: Dr Nafise Moosavi, University of Sheffield

Talk scheduled for: 11:15am - 11:35am on Wednesday 14 June

Abstract: While pre-trained language models achieve impressive performance on various NLP benchmarks, they still struggle with tasks that require numerical reasoning. Recent advances in improving numerical reasoning are mostly achieved using very large language models that contain billions of parameters and are not accessible to everyone. In addition, numerical reasoning is measured using a single score on existing datasets. As a result, we do not have a clear understanding of the strengths and shortcomings of existing models on different mathematical aspects and, therefore, of potential ways to improve them other than scaling them up. Inspired by CheckList (Ribeiro et al., 2020), we introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT. Instead of reporting a single score on a whole dataset, FERMAT evaluates models on various key mathematical aspects such as number understanding, mathematical operations, and training dependency. Apart from providing a comprehensive evaluation of models on different mathematical aspects, FERMAT enables a systematic and automated generation of (a) an arbitrarily large evaluation set for each aspect, and (b) training examples for augmentation targeting the model's underperforming aspects. We make our dataset and code publicly available to generate further multi-view data for other tasks and languages.
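
To make the systematic, automated generation concrete, the sketch below produces evaluation items per aspect pair (operation by number type), so that per-aspect scores can be reported and underperforming aspects targeted with augmentation data. The aspect names and question templates are illustrative assumptions, not those of the released dataset.

```python
# Illustrative sketch of aspect-wise item generation: each item is tagged
# with the aspects it tests (operation, number type), so per-aspect scores
# can be reported and weak aspects targeted with augmentation data. Aspect
# names and templates are assumptions, not the released dataset's.

import random

OPERATIONS = {
    "addition": lambda a, b: (f"What is {a} plus {b}?", a + b),
    "subtraction": lambda a, b: (f"What is {a} minus {b}?", a - b),
    "multiplication": lambda a, b: (f"What is {a} times {b}?", a * b),
}

NUMBER_TYPES = {
    "small integers": lambda rng: rng.randint(2, 99),
    "large integers": lambda rng: rng.randint(1000, 99999),
    "one-decimal floats": lambda rng: round(rng.uniform(1, 99), 1),
}

def generate(operation, number_type, n, seed=0):
    """Generate n items for one (operation, number type) aspect pair."""
    rng = random.Random(seed)
    sample = NUMBER_TYPES[number_type]
    items = []
    for _ in range(n):
        question, answer = OPERATIONS[operation](sample(rng), sample(rng))
        items.append({"question": question, "answer": answer,
                      "aspects": (operation, number_type)})
    return items

# An arbitrarily large evaluation (or augmentation) set for one aspect pair:
print(generate("multiplication", "one-decimal floats", 3))
```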

Self-supervised predictive coding models encode speaker and phone information in orthogonal subspaces

Author: Oli Liu, University of Edinburgh 

Talk scheduled for: 11:35am - 11:55am on Wednesday 14 June

Abstract: Self-supervised speech representations are known to encode both speaker and phonetic information, but how they are distributed in the high-dimensional space remains largely unexplored. We hypothesise that they are encoded in orthogonal subspaces, a property that lends itself to simple disentanglement. Applying principal component analysis to representations from two predictive coding models, we identify two subspaces that capture speaker and phonetic variance, and confirm that they are nearly orthogonal. Based on this property, we propose a new speaker normalisation method which collapses the subspace that encodes speaker information, without requiring transcriptions. Probing experiments show that our method effectively eliminates speaker information and outperforms a previous baseline on phone discrimination tasks. Moreover, the approach generalises and can be used to remove information about unseen speakers.
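
A minimal sketch of the normalisation idea follows: fit PCA on frame-level representations, treat a set of components as the speaker subspace, and collapse it by projecting every representation onto the orthogonal complement. Identifying which components encode speaker information is the substantive part of the method; hard-coding the first k components (and using random vectors) is a simplifying assumption for illustration only.

```python
# Minimal sketch of the normalisation idea: fit PCA on frame-level
# representations and collapse an assumed "speaker subspace" by projecting
# onto its orthogonal complement. Hard-coding the first k components as the
# speaker subspace (and using random vectors) is a simplification; the actual
# method identifies the speaker directions from the data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
reps = rng.normal(size=(5000, 256))  # toy stand-in for model representations

pca = PCA(n_components=256).fit(reps)
k = 16                               # assumed size of the speaker subspace
speaker_basis = pca.components_[:k]  # (k, dim) orthonormal rows

def remove_speaker_subspace(x: np.ndarray) -> np.ndarray:
    """Project representations onto the complement of the speaker subspace."""
    centred = x - pca.mean_
    return centred - centred @ speaker_basis.T @ speaker_basis

normalised = remove_speaker_subspace(reps)
# Variance along the collapsed directions is (numerically) zero afterwards:
print(np.abs(normalised @ speaker_basis.T).max())
```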