Annual Conference 2023 - Research posters

Venue: Rear foyer area, The Wave. Note: all poster presentations will be delivered in person.

Poster Session 1 (Tuesday, 4:30pm): posters 1 - 34

Poster Session 2 (Wednesday, 9:45am): posters 35 - 66

Poster 1 Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature

Authors: Tomas Goldsack (University of Sheffield), Zhihao Zhang (Beihang University), Chenghua Lin (University of Sheffield), Carolina Scarton (University of Sheffield)

Abstract: Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming to rectify these issues, we present two novel lay summarisation datasets, PLOS (large-scale) and eLife (medium-scale), each of which contains biomedical journal articles alongside expert-written lay summaries. We provide a thorough characterisation of our lay summaries, highlighting differing levels of readability and abstractiveness between datasets that can be leveraged to support the needs of different applications. Finally, we benchmark our datasets using mainstream summarisation approaches and perform a manual evaluation with domain experts, demonstrating their utility and casting light on the key challenges of this task.

Poster 2 Explainable Query Answering with the FRANK System

Authors: Nick Ferguson (University of Edinburgh), Kwabena Nuamah (University of Edinburgh), Alan Bundy (University of Edinburgh), Liane Guillou (University of Edinburgh)

Abstract: Forthcoming

Poster 3 Improving Compositional Generalisation in Semantic Parsing

Authors: Matthias Lindemann (University of Edinburgh), Ivan Titov (University of Edinburgh & University of Amsterdam), Alexander Koller (Saarland University)

Abstract: Semantic parsing is the task of mapping natural language utterances to (executable) meaning representations. While seq2seq models have been successful on standard settings of this task, they have been shown to struggle with compositional generalisation. Compositional generalisation is the ability of a learner to handle deeper recursion and unseen compositions of phrases that have been seen individually during training. My goal is to improve compositional generalisation of neural network models. In work I completed this year, we phrase semantic parsing as a two-step process: we first tag each input token with a multiset of output tokens. Then we arrange the tokens into an output sequence using a new way of parameterising and predicting permutations. We formulate predicting a permutation as solving a regularised linear program, and we backpropagate through the solver. In contrast to my prior work, this approach does not place a priori restrictions on possible permutations, making it very expressive. Our model outperforms pretrained seq2seq models and prior work on realistic semantic parsing tasks that require generalisation to longer examples. For the first time, we show that a model without an inductive bias provided by trees achieves high accuracy on generalisation to deeper recursion on the COGS benchmark.

Poster 4 Using a Large Language Model to Control Speaking Style for Expressive TTS

Authors: Atli Sigurgeirsson (The Centre for Speech Technology Research, University of Edinburgh), Simon King (The Centre for Speech Technology Research, University of Edinburgh)

Abstract: Appropriate prosody is critical for successful spoken communication. Contextual word embeddings are proven to be helpful in predicting prosody but do not allow for choosing between plausible prosodic renditions. Reference-based TTS models attempt to address this by conditioning speech generation on a reference speech sample. These models can generate expressive speech but this requires finding an appropriate reference. Sufficiently large generative language models have been used to solve various language-related tasks. We explore whether such models can be used to suggest appropriate prosody for expressive TTS. We train a TTS model on a non-expressive corpus and then prompt the language model to suggest changes to pitch, energy and duration. The prompt can be designed for any task and we prompt the model to make suggestions based on target speaking style and dialogue context. The proposed method is rated most appropriate in 49.9% of cases compared to 31.0% for a baseline model.

Poster 5 Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation

Authors: William Ravenscroft (University of Sheffield), Stefan Goetze (University of Sheffield), Thomas Hain (University of Sheffield)

Abstract: Speech separation models are used for isolating individual speakers in many speech processing applications. Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks. One such class of models known as temporal convolutional networks (TCNs) has shown promising results for speech separation tasks. A limitation of these models is that they have a fixed receptive field (RF). Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal. In this work deformable convolution is proposed as a solution to allow TCN models to have dynamic RFs that can adapt to various reverberation times for reverberant speech separation. The proposed models are capable of achieving an 11.1 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement over the input signal on the WHAMR benchmark. A relatively small deformable TCN model of 1.3M parameters is proposed which gives comparable separation performance to larger and more computationally complex models.

Poster 6 A Joint Matrix Factorization Analysis of Multilingual Representations

Authors: Zheng Zhao (University of Edinburgh), Yftah Ziser (University of Edinburgh), Bonnie Webber (University of Edinburgh), Shay Cohen (University of Edinburgh)

Abstract: Redacted

Poster 7 Explainable Abuse Detection as Intent Classification and Slot Filling

Authors: Agostina Calabrese (University of Edinburgh), Björn Ross (University of Edinburgh), Mirella Lapata (University of Edinburgh)

Abstract: To proactively offer social media users a safe online experience, there is a need for systems that can detect harmful posts and promptly alert platform moderators. In order to guarantee the enforcement of a consistent policy, moderators are provided with detailed guidelines. In contrast, most state-of-the-art models learn what abuse is from labelled examples and as a result base their predictions on spurious cues, such as the presence of group identifiers, which can be unreliable. In this work we introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone. We propose a machine-friendly representation of the policy that moderators wish to enforce, by breaking it down into a collection of intents and slots. We collect and annotate a dataset of 3,535 English posts with such slots, and show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.

Poster 8 Extending counterfactual models with unconstrained social explanations

Authors: Stephanie Droop (CDT-NLP, Institute for Language, Cognition and Computation, University of Edinburgh), Neil Bramley (Department of Psychology, University of Edinburgh)

Abstract: In contrast to rationalist accounts, people do not always have consistent goals nor do they always explain other people's behaviour as driven by rational goal pursuit. Elsewhere, counterfactual accounts have shown how a situation model can be perturbed to measure the explanatory power of different causes. We take this approach to explore how people explain others' behaviour in two online experiments and a computational model. First, 90 UK-based adults rated the likelihood of various scenarios combining short biographies with trajectories through a gridworld. Then 49 others saw each scenario and outcome, and verbally gave their best explanations for why the character moved the way they did. Participants generated a range of explanations for even the most incongruous behaviour. We present an expanded version of a counterfactual effect size model which uses innovative features (crowdsourced parameters and free text responses) that not only can generalise to human situations and handle a range of surprising behaviours, but also performs better than the existing model it is based on.

Poster 9 On the trade-off between redundancy and local coherence in summarization

Authors: Ronald Cardenas (University of Edinburgh), Matthias Galle (Cohere AI), Shay B. Cohen (University of Edinburgh)

Abstract: Extractive summaries are presented as lists of sentences with no expected cohesion between them and, if not accounted for, with redundant information. In our work, we argue that controlling for inter-sentential cohesion and inter-sentential redundancy is beneficial for informativeness of extractive summaries. For this, we leverage a psycholinguistic theory of human reading comprehension which directly models lexical cohesiveness and redundancy. Implementing this theory, two unsupervised deterministic summarization systems are proposed. Our systems operate at the proposition level and exploit properties of human memory representations to rank similarly content units that are cohesive and non-redundant. When compared against strong supervised and unsupervised systems, our models extract summaries that exhibit more inter-sentential lexical cohesiveness and less inter-sentential redundancy, although they capture less informative content. Carefully designed human studies confirm that the perceived cohesiveness in our systems' output is higher, without sacrificing much informativeness.

Poster 10 Bridging the Communication Rate Gap: Enhancing Text Input for Augmentative and Alternative Communication (AAC)

Authors: Hussein Yusufali (University of Sheffield), Stefan Goetze (University of Sheffield), Roger K Moore (University of Sheffield) 

Abstract: Over 70 million people worldwide face communication difficulties, with many using Augmentative and Alternative Communication (AAC) strategies. While AAC systems have reduced the communication gap, the communication rate gap between speaking and non-speaking partners remains significant, leading to low sustained use of AAC systems. To address this, this paper presents a text prediction interface utilising BERT and RoBERTa language models to improve communication rates for AAC users. Three interface layouts were developed and tested, finding that a radial layout was the most efficient for users. RoBERTa models fine-tuned on conversational AAC corpora resulted in the highest communication rates of 25.75 Words Per Minute (WPM), with alphabetical ordering preferred over probabilistic ordering. Conversational corpora like TV and Reddit outperformed generic corpora such as COCA or Wikipedia. The study combined Human Computer Interaction (HCI) and language modelling techniques to improve ease of use, user satisfaction, and communication rates, highlighting the importance of reducing the communication gap for AAC users in open-domain interactions. However, limited availability of large-scale conversational AAC corpora presents a challenge for improving communication rates and robust AAC systems. Further research is needed to test the interfaces with specialist users and improve language modelling and interface design to narrow the communication rate gap.

Poster 11 An enquiry into the harms of language technologies, grounded in the social sciences 

Authors: Eddie Ungless (Edinburgh), Björn Ross (Edinburgh), Vaishak Belle (Edinburgh), Zachary Horne (Edinburgh), Amy Rafferty (Edinburgh), Hrichika Nag (Edinburgh), Seraphina Goldfarb-Tarrant (Edinburgh), Su Lin Blodgett (Microsoft), Atoosa Kasirzadeh (Edinburgh), Charlotte Bird (Edinburgh), Nina Markl (Edinburgh)

Abstract: My thesis constitutes an enquiry into the harms of language technologies, broadly understood. My work addresses the public response to biased NLP technologies, as it is only by understanding how those impacted by these technologies behave that we can develop suitable bias mitigation methods. To this end I am conducting a study on how users of TikTok attempt to circumvent biased censorship algorithms through creative use of language. I am also investigating how anthropomorphism of AI affects the public's behaviour in response to biased models. My work additionally encompasses the robust measurement of model bias. I have shown that many popular sentiment analysis tools show a queerphobic bias, and used a theory of stereotyping to measure bias in language models. I have investigated the harms of text-to-image models, particularly as related to the non-cisgender population. I wrote a survey paper demonstrating that many of the metrics for bias in generative models lack validity and rigour. What unites these strands is a desire to measure the real world impact of harmful language technologies.

Poster 12 Multimodal Multilingual Framing Bias Detection

Authors: Valeria Pastorino (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)

Abstract: The use of frames in mass communication is an important area of study for social scientists. Frames are used to selectively highlight certain aspects of reality while obscuring others, thus directing audiences towards specific conclusions. This can be achieved through linguistic and non-linguistic devices, and has a significant impact on how audiences understand and remember a problem, influencing their subsequent actions. The Natural Language Processing (NLP) community has become increasingly interested in the automatic detection of framing bias, which can be a time-consuming and labour-intensive process in manual analysis. This project aims to address shortcomings in the task of frame categorisation from a multidisciplinary point of view, drawing from social science, linguistics, and NLP research. The focus will be on the automatic detection of framing bias in contexts such as news, political discourse, and other forms of media communication. The research aims to detect framing bias in multilingual and multimodal settings, considering both the cultural and social aspects imbued into frames and the textual/visual nature of framing devices. By improving the task of framing bias detection in multilingual and multimodal settings, this project could provide a valuable tool for both social scientists and NLP researchers.

Poster 13 Evaluating adversarial networks for unsupervised speech recognition

Authors: Mattias Cross (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: Unsupervised speech recognition learns a mapping of speech and text without transcribed data. Recent advancements have shown that it is possible to learn an accurate mapping by leveraging unmatched speech and text data with generative adversarial networks and pre-trained self-supervised representations. Although this method works well with speech that has been included in the pre-training data, it does not generalise to out-of-domain speech. This poster explores what features are important for unsupervised speech recognition, and whether unsupervised speech recognition with GANs can generalise to data not seen in pre-training.

Poster 14 Fewer Data → Better Generalisation: Sample Relationship from Learning Dynamics Matters

Authors: Shangmin Guo (University of Edinburgh), Yi Ren (University of British Columbia), Stefano Albrecht (University of Edinburgh), Kenny Smith (University of Edinburgh)

Abstract: Although much research has been done on proposing new models or loss functions to improve the generalisation of artificial neural networks (ANNs), less attention has been directed to the impact of the training data on generalisation. In this work, we start from approximating the interaction between samples, i.e. how learning one sample would modify the model's prediction on other samples. Through analysing the terms involved in weight updates in supervised learning, we find that labels influence the interaction between samples. Therefore, we propose the labelled pseudo Neural Tangent Kernel (lpNTK) which takes label information into consideration when measuring the interactions between samples. We first prove that lpNTK asymptotically converges to the empirical neural tangent kernel in terms of the Frobenius norm under certain assumptions. Secondly, we illustrate how lpNTK helps to understand learning phenomena identified in previous work, specifically the learning difficulty of samples and forgetting events during learning. Moreover, we also show that using lpNTK to identify and remove poisoning training samples can help to improve the generalisation performance of ANNs in image classification tasks.

Poster 15 TwistList: Resources and Baselines for Tongue Twister Generation

Authors: Tyler Loakman (University of Sheffield), Chenghua Lin (University of Sheffield), Chen Tang (University of Surrey)

Abstract: Previous work in phonetically-grounded language generation has focused on lyrics and poetry. In this work, we present progress on a newly studied type of language generation, tongue twisters, a form of language that is required to be phonetically conditioned to maximise sound overlap, whilst maintaining semantic consistency with an input topic, and still being grammatically coherent. We present TwistList, the largest annotated dataset of tongue twisters, consisting of 2.1k+ human-authored examples, alongside several benchmark systems referred to as TwisterMisters for the task of tongue twister generation.

Poster 16 Language Evolution & Neurodiversity: Communicative efficiency and social biases affect language learning in autistic and allistic learners

Authors: Lauren Fletcher (University of Edinburgh), Jennifer Culbertson (University of Edinburgh), Hugh Rabagliati (University of Edinburgh)

Abstract: Studies on communicative efficiency and other cognitive factors that have been argued to shape language have, like the vast majority of studies in experimental psychology, been conducted only on neurotypical populations. Here, we examine how the influence of social biases on efficient communication is impacted by autistic traits. We find that autistic people's use of case in the absence of a social bias is comparable to that of their neurotypical peers. However, in the presence of a social bias, our results suggest that autistic people adhere more to the bias, increasing production effort to behave more like the group that they are biased towards. We argue that some autistic people may be more likely to adhere to a social bias as a result of learnt social behaviours. More broadly, our results underscore the importance of studying more diverse populations in language evolution research. We also present preliminary results of a project exploring the impact of neurotype in language accommodation (i.e. the process where one alters their language to be more like their partner's). We examine whether there is a difference in accommodation behaviours between neurotype-matched and neurotype-mixed pairs.

Poster 17 Unsupervised Code-switched Text Generation from Parallel Text

Authors: Jie Chi (University of Edinburgh), Brian Lu (Johns Hopkins University), Jason Eisner (Johns Hopkins University), Peter Bell (University of Edinburgh)

Abstract: There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. It is expensive and difficult to collect real transcribed spoken CS data due to the challenges of finding and identifying CS data in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require the existence of CS data during training, or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR. 

Poster 18 How much information do large language models actually need from the input tokens?

Authors: Ahmed Alajrami (University of Sheffield), Nikolaos Aletras (University of Sheffield)

Abstract: Understanding how and what large language models (LLMs) learn about language is an open challenge in natural language processing. Previous work has focused on identifying whether they capture semantic and syntactic information, and how the data or the pre-training objective affects their performance. However, to the best of our knowledge, no previous work has specifically examined how much information LLMs need from individual tokens in order to obtain satisfactory performance in downstream tasks. In this study, we address this gap by pre-training LLMs using small subsets of characters from individual tokens. Surprisingly, we find that even when pre-training under extreme settings, i.e. using only one character of each token, the performance retention in standard NLU benchmarks and probing tasks compared to full-token models is high. For instance, a single first character model achieves performance retention of approximately 93% and 77% of the full-token model in SuperGLUE and GLUE tasks, respectively. Our empirical results might be a step towards shedding light on the differences between human and machine reading comprehension. While LLMs are still able to find associations in data that are incomprehensible to humans, it is highly unlikely for humans to perform above chance on the same tasks by only having access to one or two characters of each token in a given text.

Poster 19 Data-driven and Discourse-aware Scientific text generation

Authors: Joseph James (University of Sheffield), Chenghua Lin (University of Sheffield), Nafise S Moosavi (University of Sheffield)

Abstract: Current neural network-based approaches for natural language generation generally do not account for discourse structure explicitly. This could lead to sub-optimal generation quality as the training data could contain biases, skewed representations and a lack of generality. Uncovering and representing the underlying discourse structure of texts could improve the coherence and cohesion of text generation, and enhance our understanding of how humans formulate text from cognitive and psycholinguistic perspectives. This project aims to build novel models for discourse-aware text generation for the scientific domain, especially for generating high quality long form text (e.g. literature reviews). Additionally, we will advance the areas of explainable and robust natural language generation. 

Poster 20 Adapting large language models to respond to domain changes in content moderation systems

Authors: Meredith Gibbons (University of Sheffield), Xingyi Song (University of Sheffield), Holly Francois (Ofcom), Nafise Moosavi (University of Sheffield)

Abstract: Social media platforms use content moderation systems to detect posts that are unacceptable according to their Terms of Service. For large platforms these systems usually use a combination of automated and human moderation and often include language models. Due to the fast-evolving nature of social media language, these models require constant updates to respond to new situations such as new slang or spellings (either due to natural changes, or those designed to deliberately fool an existing moderation system), or to scale systems to new functionalities or platforms. Recent developments in large language models demonstrate powerful language understanding and generation capability, including on the social media moderation task. However, updating large language models demands significant computational power and resources. In this project we are aiming to investigate language model compression and knowledge distillation techniques to produce small models. These models can be rapidly and efficiently adapted or fine-tuned to follow changes in social media topics and language domains, and reduce unnecessary carbon emissions. We will also explore methods to minimise the need for human labelled data during model updating. Hence the model can be updated more easily to accommodate emerging domain changes, ensuring a more adaptable and efficient content moderation system.

Poster 21 An Experimental Investigation of Unidirectionality in Semantic Extension 

Authors: Anna Kapron-King (CDT in NLP, University of Edinburgh), Simon Kirby (PPLS, University of Edinburgh), Kenny Smith (PPLS, University of Edinburgh)

Abstract: Grammaticalization is the process by which a lexical item acquires a more functional role over time, such as when a noun comes to be used as a preposition. Grammaticalization is often described as unidirectional, that is, change from functional items to lexical items is far less common. Where does this unidirectionality come from? We present results from two artificial language experiments designed to shed light on whether people show a unidirectional bias when engaging in semantic extension. We focus on the phenomenon of using body part nouns as spatial prepositions. For the first experiment, participants are given the English meaning of an artificial word, then asked to rate how likely it is that that word can also be used to refer to another meaning. One of the meanings is a body part and the other is a preposition. Assuming individuals have a unidirectional bias, we expected lower ratings when the first meaning is a preposition compared to when it is a body part. For the second experiment, we pair participants up to have them perform semantic extension in a communication game. In both experiments, we found that participants did not show the expected unidirectional bias. 

Poster 22 Leveraging Cross-Utterance Context For ASR Decoding

Authors: Robert Flynn (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: While external language models (LMs) are often incorporated into the decoding stage of automated speech recognition systems, these models usually operate with limited context. Cross-utterance information has been shown to be beneficial during second pass re-scoring; however, this limits the hypothesis space based on the local information available to the first pass LM. In this work, we investigate the incorporation of long-context transformer LMs for cross-utterance decoding of acoustic models via beam search, and compare against results from n-best rescoring. Results demonstrate that beam search allows for an improved use of cross-utterance context. When evaluating on the long-format dataset AMI, results show a 0.7% and 0.3% absolute reduction on dev and test sets compared to the single-utterance setting, with improvements when including up to 500 tokens of prior context. Evaluations are also provided for Tedlium-1 with less significant improvements of around 0.1% absolute.

Poster 23 Level of dialectness as a better formulation for Arabic dialect identification

Authors: Amr Keleg (University of Edinburgh), Sharon Goldwater (University of Edinburgh), Walid Magdy (University of Edinburgh)

Abstract: Automatically distinguishing between similar languages or dialects of the same language is still a challenging task. Dialects of the same language might have distinctive phonological, lexical, morphological, syntactic, and semantic features. Humans generally depend on these features for identifying the dialect. For languages such as Arabic and German which seem to have a mostly agreed upon standard variant of the language, dialect identification defaults to classifying a text to be written in the standard language. The text is later considered to be dialectal if dialectal markers show in the text. This categorical formulation assumes all dialectal texts to be the same, irrespective of the number of markers and the distinctiveness of these markers. Conversely, we formulate the "Level of Dialectness" as a new task of quantifying how a dialectal text diverges from the standard language. We provide canonical splits of Arabic online comments annotated for the level of dialectness that are recycled from another dataset, and evaluate how current models can be used to estimate the level of dialectness of a text. Lastly, we demonstrate how the "Level of Dialectness" provides a quantitative metric that could augment sociolinguistic studies meant to understand intraspeaker and interspeaker variation.

Poster 24 Using Adaptors and Auxiliary Tasks in Machine Translation

Authors: Faheem Kirefu (University of Edinburgh), Barry Haddow (University of Edinburgh), Alexandra Birch (University of Edinburgh)

Abstract: The use of pretraining and transfer learning techniques has generally been shown to produce better quality results in machine translation. This is especially the case for low resource languages/pairs, where higher resourced languages with plentiful monolingual data are leveraged to close the performance gap with high resource languages. In recent years, there has been a strong focus on parameter-efficient fine-tuning techniques (PEFTs), most notably through the use of adaptors. Adaptors make up a small percentage of the parameters of the overall model, and training only these parameters while keeping the rest of the pretrained model frozen leads to memory and storage efficiencies with only a small tradeoff in performance. However, many of the improvements yielded through such transfer learning techniques are still black box in nature, especially as to how the information learnt at pretraining is being leveraged. My work primarily focuses on how best to use transfer learning for low resource language pairs in machine translation.

Poster 25 Leveraging context for perceptual prediction using word embeddings 

Authors: Georgia-Ann Carter (Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh), Paul Hoffman (School of Philosophy, Psychology and Language Sciences, University of Edinburgh), Frank Keller (Institute for Language, Cognition and Computation; School of Informatics, University of Edinburgh)

Abstract: Pre-trained word embeddings have been used successfully in semantic NLP tasks to represent words. However, there is continued debate as to whether they encode useful information about the perceptual qualities of concepts. Previous research has shown mixed performance when embeddings are used to predict these perceptual qualities. Here, we tested if we could improve performance by providing an informative context. To this end, we generated decontextualised ("charcoal") and contextualised ("the brightness of charcoal") word2vec and BERT embeddings for a large set of concepts and compared their ability to predict human ratings of the concepts' brightness. We repeated this procedure to also probe for the shape of those concepts, finding that it can be better predicted than brightness. We consider the potential advantages of using context to probe specific aspects of meaning, including those currently thought to be poorly represented by language models. 

Poster 26 Zero-shot cross-lingual learning for implicit discourse relation recognition

Authors: Wanqiu Long (University of Edinburgh), Bonnie Webber (University of Edinburgh)

Abstract: Beyond English, there are PDTB-style datasets in other languages, including a Chinese TED discourse bank corpus (Long et al., 2020), a Turkish discourse Tree bank corpus (Zeyrek and Kurfali, 2017) and a TED Multilingual Discourse Bank (TED-MDB) corpus (Zeyrek et al., 2019), which covers 6 languages. However, most of them are very limited in size. I tried to develop cross-lingual learning methods for different languages for the task of implicit discourse relation recognition.

Poster 27 Test Your Skill Against ... The Idio-Matic!

Authors: Thomas Pickard (University of Sheffield), Aline Villavicencio (University of Sheffield), Carolina Scarton (University of Sheffield)

Abstract: Humans employ figurative language such as idioms like they're going out of fashion. Familiar idioms bring us closer to our interlocutors, and we're able to produce and understand them more quickly and efficiently than equivalent, literal phrases. Computational language models, however, tend to find figurative language hard to get to grips with. Contribute to research which aims to make language models better at processing idiomatic expressions by trying your hand against The Idio-Matic!

Poster 28 Speech Analytics for Detecting Neurological Conditions in Global English

Authors: Samuel Hollands (University of Sheffield), Heidi Christensen (Department of Computer Science, University of Sheffield, UK), Daniel Blackburn (Sheffield Institute for Translational Neuroscience, University of Sheffield, UK)

Abstract: Dementia is an umbrella term for the loss of cognitive and memory abilities caused by a wide variety of neurological conditions. It has been discovered that both the content of an individual’s discourse and the acoustics of their produced speech can be automatically analysed to detect dementia and other neurological conditions. Whilst the current cutting edge demonstrates effective diagnostic capabilities on L1 (native) speakers, this project is interested in exploring and tackling the potential issues that may arise from the language diversity of L2 (second language) English speakers. Existing research into the cognitive tests used to detect dementia in clinics has consistently demonstrated reduced efficacy for individuals of different language, educational, and socioeconomic backgrounds. We highlight two series of experiments. Firstly, a series of experiments conducted to assess and evaluate the calculation and subsequent classification impact of disfluency and linguistic speech analytics as feature sets for dementia classification. Secondly, a series of experiments conducted on ASR evaluating the reliability of automatic transcription on different parts of a cognitive assessment test suite. Results indicate strengths and challenges of feature calculation on L1 patients, with hypotheses generated for L2 classification, and alternative evaluation metrics beyond WER are proposed for task-based ASR evaluation.

Poster 29 Multi-accent Seq2Seq TTS Frontend Modelling

Authors: Siqi Sun (University of Edinburgh), Korin Richmond (University of Edinburgh)

Abstract: A high-performance linguistic frontend is regarded as necessary for a text-to-speech system. Moreover, it is desirable to have a frontend capable of modelling multiple accents to better match any given accent. My current work focuses on a multi-accent sentence-level Seq2Seq frontend that is capable of handling multiple accents in a single neural network. This approach both gives a compact model and encourages knowledge sharing between different accents.

Poster 30 Semantic Parsing for Conversational Question Answering over Knowledge Graphs

Authors: Laura Perez-Beltrachini (University of Edinburgh), Parag Jain (University of Edinburgh), Emilio Monti (Amazon), Mirella Lapata (University of Edinburgh)

Abstract: In this paper, we are interested in developing semantic parsers which understand natural language questions embedded in a conversation with a user and ground them to formal queries over definitions in a general purpose knowledge graph (KG) with very large vocabularies (covering thousands of concept names and relations, and millions of entities). To this end, we develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof. We present two different semantic parsing approaches and highlight the challenges of the task: dealing with large vocabularies, modelling conversation context, predicting queries with multiple entities, and generalising to new questions at test time. We hope our dataset will serve as a useful testbed for the development of conversational semantic parsers.

Poster 31 Explainable Evaluation Metrics for Natural Language Generation

Authors: Emmanouil Zaranis (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield), Nikolaos Aletras (University of Sheffield)

Abstract: The release of large pre-trained transformer-based language models has led to major improvements in natural language generation tasks. Although many metrics have been proposed and are widely used to evaluate the generated outputs, the latest research shows that they are poorly correlated with human judgments and that they also suffer in terms of explainability and robustness. Consequently, it is essential to build more reliable and explainable automatic metrics. While most of the metrics are common between generation tasks, we choose to start from the summarization task, which has been widely studied by the research community. However, the proposed research direction is extendable to other tasks as well. More specifically, this work focuses on metrics used in summarization for evaluating the factual consistency of the generated summaries. We plan to use post-hoc explainability methods firstly to measure which parts of the input contribute more to the provided evaluation scores and secondly to examine what types of factuality errors current metrics are able to capture. This research direction may provide a deeper understanding of current factuality metrics and encourage the use of explainability methods for improving current metrics via semi-supervised approaches.

Poster 32 Alternative Methods of Including Whitespace Information in Transformer Encoders

Authors: Edward Gow-Smith (University of Sheffield), Dylan Phelps (University of Sheffield), Harish Tayyar Madabushi (University of Bath), Carolina Scarton (University of Sheffield), Aline Villavicencio (University of Sheffield)

Abstract: Transformer-based models for NLP, such as BERT, all use subword tokenisation algorithms, such as WordPiece, to process input text. Due to this subword tokenisation, these models have no explicit knowledge of word boundaries, or whitespace information. However, since these algorithms use prefixes (e.g. "##") for continuing subwords, these models do have implicit knowledge of where the word boundaries are. In this work, we investigate how this knowledge impacts the performance of these models on downstream tasks. In particular, we investigate the modified tokenisation approach introduced previously by us, which has been shown to give more morphologically correct tokenisations, but results in sequences with no space information. We apply this modification to WordPiece, and we also try approaches for passing word boundary information in alternative ways.
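
As a minimal illustration of how continuation prefixes implicitly encode word boundaries, the sketch below regroups a WordPiece-style token sequence into words. The function name and the example tokens are hypothetical and are not taken from the poster; the only assumption carried over from the abstract is that continuing subwords are marked with a prefix.

```python
def regroup_words(tokens, prefix="##"):
    """Recover words from WordPiece-style tokens.

    A token that starts with the continuation prefix is attached to the
    previous word; any other token starts a new word.
    """
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            # Continuing subword: strip the prefix and extend the current word.
            words[-1] += token[len(prefix):]
        else:
            # No prefix: this token marks a word boundary.
            words.append(token)
    return words


# Hypothetical tokenisation, for illustration only.
example = ["un", "##believ", "##able", "results"]
print(regroup_words(example))
```

If the prefixes are removed from the token sequence, the same function can no longer tell where one word ends and the next begins, which is the loss of space information the abstract refers to.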

Poster 33 Differences in Human and Machine Interpretation of Non-literal Meanings: The Case of Fillers

Authors: Aida Tarighat (University of Edinburgh), Martin Corley (School of Philosophy, Psychology and Language Sciences, University of Edinburgh)

Abstract: Language models (LMs) are known to have difficulty recognizing and interpreting non-literal meanings such as idioms, metaphors, and sarcasm. In this work, we focus on the role of written natural language disfluencies, specifically fillers (e.g., um, hmm), in highlighting intended non-literal and sarcastic meanings. Our hypothesis is that for human readers, written disfluencies render the text more "speech-like" which in turn opens it up to less literal interpretation. In an online self-paced reading experiment, we hypothesize that words compatible with a sarcastic reading of a sentence (hunting blue whales is a really WISE move) are faster to read than those compatible with literal meanings (hunting blue whales is a really BAD move) when preceded by a (written) filler ("um wise/bad"). We use ease-of-reading as an index of predictability, for comparisons with next-word predictions made by an LM such as BERTweet in completing sentences with or without "um" (hunting blue whales is a really [um] ...). We expect the LM to predict words compatible with the literal meaning of the sentence and not to perform well in predicting sarcastic words, with or without the filler.

Poster 34 FERMAT: An Alternative to Accuracy for Numerical Reasoning

Authors: Jasivan Sivakumar (University of Sheffield), Nafise Sadat Moosavi (University of Sheffield)

Abstract: While pre-trained language models achieve impressive performance on various NLP benchmarks, they still struggle with tasks that require numerical reasoning. Recent advances in improving numerical reasoning are mostly achieved using very large language models that contain billions of parameters and are not accessible to everyone. In addition, numerical reasoning is measured using a single score on existing datasets. As a result, we do not have a clear understanding of the strengths and shortcomings of existing models on different numerical reasoning aspects and, therefore, of potential ways to improve them apart from scaling them up. Inspired by CheckList (Ribeiro et al., 2020), we introduce a multi-view evaluation set for numerical reasoning in English, called FERMAT. Instead of reporting a single score on a whole dataset, FERMAT evaluates models on various key numerical reasoning aspects such as number understanding, mathematical operations, and training dependency. Apart from providing a comprehensive evaluation of models on different numerical reasoning aspects, FERMAT enables a systematic and automated generation of an arbitrarily large training or evaluation set for each aspect. The datasets and code are publicly available to generate further multi-view data for other tasks and languages.

Poster 35 Leveraging Ambient and Latent Information for Active Speaker Detection in Egocentric Video

Authors: Jason Clarke (University of Sheffield), Yoshi Gotoh (University of Sheffield), Stefan Goetze (University of Sheffield)

Abstract: The recent proliferation of wearable devices has led to an increase in research into egocentric data. One resultant topic of particular interest is understanding human interaction from a social-communicative standpoint from the egocentric perspective. A fundamental component of social-communicative understanding is the ability to determine the speech activity of interlocutors. Egocentric data presents unique challenges for traditional audio-visual diarization methods. Environmental noise, overlapping speech, and short utterance exchanges pose challenges from the audio modality. Likewise, motion-induced blurring, heterogeneous lighting conditions, and crowded scenes pose challenges from the video modality. Despite video being a rich source of information, the occurrence of these obstacles in conjunction with one another results in audio-visual diarization methods performing worse than their audio-only counterparts. To address this issue, this work presents an initial exploration of methods that can leverage ambient and latent information present within egocentric video to supplement active speaker detection systems that typically model visual features of a scrutinized speaker in isolation.

Poster 36 Language identification at scale for code-switched text

Authors: Laurie Burchell (University of Edinburgh), Kenneth Heafield (University of Edinburgh), Alexandra Birch (University of Edinburgh)

Abstract: Language identification (LID) systems are a fundamental step of nearly any NLP pipeline, particularly for multilingual applications using web-domain text as training data. However, these algorithms usually assume that each input should be assigned exactly one language label. This is not an unreasonable assumption in many domains, but much of the text on the web (particularly in social media) features code-switching: the use of two or more languages in a single utterance. Current single-label LID systems cannot process such text correctly and so it is often discarded, leading to a lack of data (and consequent lower performance) for downstream NLP tasks. In this work, we investigate high-coverage (200+ languages) LID for code-switched text. We assess the current performance of scalable LID on monolingual and code-switched text and investigate whether existing systems can be adapted to improve their performance. We experiment with better modelling of ambiguity within LID systems with the aim of creating a model which can label both monolingual and multilingual text with the correct language label(s) at scale.

Poster 37 Select and Summarize: Scene Saliency for Movie Script Summarization

Authors: Rohit Saxena (University of Edinburgh), Frank Keller (University of Edinburgh)

Abstract: Redacted

Poster 38 Detecting Communication Situations in AI Enabled Hearing Aids

Authors: Robert Sutherland (University of Sheffield), Jon Barker (University of Sheffield), Stefan Goetze (University of Sheffield)

Abstract: While hearing loss is a widespread problem around the world, modern hearing aids still struggle to perform effective speech enhancement in everyday conversational settings. Modern devices feature multiple microphones, which offer the potential for sophisticated beamforming techniques to be applied, but they do not "understand the scene" to determine which strategy to apply, such as which sources should be amplified and which should be attenuated. The aim of this project is to investigate methods and features to distinguish between various common conversational scenes, so that the hearing aid can work out which speakers are a part of the conversation, and which sound sources are interferers. The latter becomes especially difficult when these interfering sources are also speech, such as another conversation happening nearby, or a TV show. This project will explore machine learning techniques that use not only the acoustic signal, but also the behavioural cues from the listener (such as head motion and the listener's own speech patterns). An interesting point to consider here is the microphone array movement; while this can present a challenge to the beamforming algorithms, it can also provide insight into the scene and where the listener's attention may be.

Poster 39 Unveiling Acoustic Embedding Space: Factorising Word Embeddings into Subword Embeddings

Authors: Amit Meghanani (University of Sheffield), Thomas Hain (University of Sheffield)

Abstract: Acoustic word embeddings (AWE) are widely used to represent spoken words as fixed dimensional vectors. Words are composed of smaller subword structures which are shared across multiple words. Understanding the relationships between these subwords in the embedding space is crucial for a deeper understanding of the AWE space. This work attempts to factorise AWEs into acoustic subword embeddings (ASE) and reconstruct them via simple mathematical operations. The proposed approach involves extracting ASEs from AWEs, without using any time boundaries of subwords. To evaluate the quality of derived ASEs and their properties in the embedding space, three new metrics are introduced. These metrics are adapted versions of the average precision metric used for evaluating AWEs. The experimental results demonstrate that the derived ASEs are of high quality and that reconstructing AWEs from the ASEs is possible by simple mathematical operations in the embedding space. Overall, this work sheds light on the relationships between subwords in AWEs, and opens up new possibilities for improving the accuracy and efficiency of various speech-related tasks.

Poster 40 Interaction-Driven Trustworthy AI for Autonomous Vehicles

Authors: Balint Gyevnar (University of Edinburgh), Cheng Wang (University of Edinburgh), Christopher G. Lucas (University of Edinburgh), Shay B. Cohen (University of Edinburgh), Stefano V. Albrecht (University of Edinburgh)

Abstract: While AI methods have shown impressive results in recent times, they are yet to be widely adopted by the public in high-risk domains such as health care or transportation. Our work combines technologies across disciplines, from explainable AI (XAI) and causal reasoning to natural language processing, to create trustworthy AI systems. We propose a general method for causal selection driven by counterfactual simulations and a human-in-the-loop interaction design, and evaluate our method with autonomous vehicles. To further boost interdisciplinary collaboration, we also study the legal challenges of adopting AI systems in high-risk environments as regulated by the proposed EU AI Act.

Poster 41 Cross-language activation in bilingual lexical access

Authors: Irene Winther (University of Edinburgh), Martin Pickering (University of Edinburgh)

Abstract: Previous research suggests that bilinguals co-activate linguistic representations from both the native and non-native language on-line during lexical access. The current research explores the role of learning in relation to different cross-language effects that have been taken as support for on-line co-activation and asks whether some of these effects could instead be explained by co-activation during the learning of a second language.

Poster 42 Humans to machines: representing exemplars

Authors: Rhiannon Mogridge (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: Some approaches in machine learning, most notably neural networks, are inspired by the human brain. These approaches are generally prototype-based, using data to learn large, parametrised models. In contrast, other theories of human cognition are exemplar-based, making use of specific examples, rather than parametrised models. Exemplar-based models have been popular for speech and NLP in the past, but more recent prototype-based methods have outperformed them for some years. Does this mean that exemplar-based approaches are inferior to prototype-based approaches? Theories of human cognition typically assume very good feature representations, which have generally not been available in the fields of speech and language with exemplar-based approaches. Modern, rich feature representations have improved performance in many machine learning tasks. HuBERT, for example, is widely used for speech tasks, and BERT-based word-, sentence- and document-level representations are available for NLP tasks. More effective feature representations provide the opportunity to revisit simple, interpretable exemplar-based models. Minerva 2 is an exemplar-based human memory model that has been shown to reproduce the results from numerous human studies. We show that, with modern feature representations, it can be implemented effectively for modern speech tasks, with additional benefits of interpretability. 

Poster 43 Converting Comment Trees into Argument Trees

Authors: Jonathan Clayton (University of Sheffield), Rob Gaizauskas (University of Sheffield), Marco Damonte (Amazon)

Abstract: This work applies techniques from Argument Mining (the field of using NLP techniques to analyse argument/debate) to the summarisation of online debate. Online debate threads, like other types of comment threads, have several features which make them difficult to read, such as their length and the presence of redundant and irrelevant comments. An Argument Graph (AG) is one potentially useful representation of an online thread, containing the propositions of an argument and the relations between them (support/attack). In this paper, we evaluate different techniques for the automatic generation of this kind of representation.

Poster 44 Speech analysis and training methods for atypical speech

Authors: Wing Leung (University of Sheffield), Stefan Goetze (University of Sheffield)

Abstract: Dysarthria is a type of motor speech disorder that reflects abnormalities in the motor movements required for speech production. Automatic speech recognition (ASR) has important implications for assisted communication devices and home environmental control systems. Although the accuracy of ASR systems for typical speech has improved significantly in recent years, there are challenges inherent in dysarthric speech due to high inter- and intra-speaker variability and the limited availability of dysarthric data. Furthermore, the classification of dysarthria (including measures of speech intelligibility) provides important metrics for the clinical (and social) management of dysarthria. In current practice, metrics are based on subjective listening evaluation by expert human listeners, which requires high human effort and cost. Therefore, the current project aims to 1) collect a corpus of dysarthric data to increase the volume of quality dysarthric data available to the research community, 2) improve the performance of dysarthric ASR systems, including investigation of methods of adapting ASR models trained on typical speech, and 3) create automated estimators for the classification of dysarthria.

Poster 45 Automatic Audio Description for Movies

Authors: Radina Dobreva (University of Edinburgh), Alexandra Birch (University of Edinburgh), Frank Keller (University of Edinburgh)

Abstract: Audio description is a narrated description of visual information that has the purpose of enabling access to this information for blind and visually-impaired people. My research focuses specifically on audio description for movies. The creation of audio description for movies is a specialised and time-consuming task, and work on automating aspects of it has been sparse. Nevertheless, there have been attempts at automatic audio description generation. This has mostly been formulated as video description, where the input is a short video clip and the output is the audio description. Recent work has additionally shown the importance of incorporating context in the form of preceding descriptions and related dialogue, which are vital for such a complex type of description. However, audio description generation remains a challenging task. My research looks into finding methods for generating descriptions that are more informative and more faithful to the source material.

Poster 46 Automating Subjective Evaluation Metrics in Speech Synthesis for Direct Speech to Speech Translation

Authors: Shaun Cassini (University of Sheffield), Thomas Hain (University of Sheffield), Anton Ragni (University of Sheffield)

Abstract: Direct speech-to-speech translation (dS2ST) has the potential to advance global communication by facilitating real-time conversation across languages while preserving speaker identity. However, achieving seamless dS2ST remains challenging due to the scarcity of speech data compared to textual data. This PhD project aims to address this issue by developing an automated evaluation metric for the quality and humanness of synthesized speech in dS2ST models. By employing sequence-to-sequence models (seq2seq) to automate the Mean Opinion Score (MOS) assessment, researchers can optimize their models more efficiently and enhance speech synthesis quality. Furthermore, this project explores the role of linguistic information in speech humanness by studying the encoder/decoder layers of seq2seq models trained on nonsensical languages. This research holds the potential to significantly advance speech synthesis models, bridging the data gap, and refining dS2ST capabilities. Ultimately, this project aims to contribute to improved cross-cultural communication and a more connected world.

Poster 47 MANA-Net: A News Sentiment Modelling and Text Processing Framework for Attention-based Market Forecasting

Authors: Mengyu Wang (University of Edinburgh), Tiejun Ma (University of Edinburgh)

Abstract: Millions of news items are released on the internet on a daily basis. Such news sentiment rapidly shifts investors' beliefs and has significantly influenced financial markets. Adopting state-of-the-art Natural Language Processing (NLP) and the latest machine learning models to quantify the impact of news and predict financial markets has become a hot area of interdisciplinary AI/NLP for Finance research. In this piece of work, we focus on (i) handling the dynamic large volume of news corpus without manual rules/labelling, and (ii) developing deep learning models to combine news sentiment and market prices to maximize specific financial measures such as risk-adjusted returns. We explore deep learning techniques and propose a novel market attention-weighted News Aggregation Network (MANA-Net) that leverages both news sentiments and stock prices to predict stock trends. Our proposed news aggregation and stock prediction approaches are integrated in a uniform pipeline with a single objective function of risk-adjusted returns. The major advantage of such an approach is that it enhances the model training process by allowing gradients to be back-propagated with each input news item and weighted with price changes. We employed S&P 500 index and Thomson Reuters news data from 2003 to 2018, with more than 35 million pieces of news. Our results have shown that MANA-Net outperforms existing machine learning models on the test set by 3% on average daily return and loss and by 0.69 on average Sharpe ratio.
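
For readers unfamiliar with the risk-adjusted measure reported above, the following is a minimal sketch of the standard Sharpe ratio: mean excess return divided by the standard deviation of excess returns. The function name, the zero risk-free rate and the sample returns are illustrative assumptions; the sketch is not the MANA-Net objective and does not reproduce any figure from the poster.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Sharpe ratio of a series of per-period returns.

    `risk_free` is the per-period risk-free rate (assumed zero here).
    No annualisation is applied.
    """
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)  # sample standard deviation
    if spread == 0:
        raise ValueError("returns have zero variance; ratio is undefined")
    return statistics.mean(excess) / spread


# Made-up daily returns, for illustration only.
daily_returns = [0.01, -0.005, 0.02, 0.0, 0.015]
print(round(sharpe_ratio(daily_returns), 3))
```

A higher value means more return per unit of volatility, which is why the ratio is a common target when raw return alone would reward excessive risk.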

Poster 48 A Low-Resource Pipeline for Text-to-Speech from Found Data with Application to Scottish Gaelic

Authors: Dan Wells (University of Edinburgh), Korin Richmond (University of Edinburgh), William Lamb (University of Edinburgh)

Abstract: There are two major issues when building a text-to-speech (TTS) system for a new language: 1) the availability of suitable speech recordings with matching text transcripts, and 2) the knowledge required to process input text and represent the target language symbolically. In this work we present an end-to-end pipeline for building a speech corpus and TTS system for a new language without reference to any expert-defined linguistic resources, with Scottish Gaelic as our target language. We begin by segmenting and aligning over 85 hours of short speech utterances and corresponding text transcripts from long-form audio recordings found online, using a purely character-based acoustic model. We then select utterances up to 2 or 8 hours in total duration to achieve good coverage of speech sounds based on discrete acoustic unit sequences extracted from a self-supervised speech representation model pre-trained on English. Finally, we train TTS models with these corpora, comparing the performance of character, acoustic unit and phone inputs. Subjective listening tests with Gaelic speakers show promising results for all systems, with some indication that using discrete acoustic units as TTS input symbols may be beneficial in especially low-data scenarios.

Poster 49 Prompting Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts

Authors: Wenyu Huang (University of Edinburgh), Guancheng Zhou (University of Edinburgh), Mirella Lapata (University of Edinburgh), Pavlos Vougiouklis (Huawei Technologies), Jeff Pan (University of Edinburgh, Huawei Technologies)

Abstract: Large Language Models (LLMs) are trained on a huge amount of real-world knowledge and can perform question-answering tasks effectively, but they still struggle with tasks that require rich world knowledge, especially when encountering long-tail facts. This highlights the limitations of relying solely on their parameters to encode a wealth of world knowledge. To mitigate this issue, LLMs are often prompted with contextual passages retrieved from a large knowledge corpus. However, this approach may introduce noise from irrelevant knowledge and is expensive. In this work, we hypothesise that knowledge graphs (KGs) are a good external knowledge source to mitigate the above issues, since triples in KGs contain much less noise (because we retrieve triples based on the entity) and are much shorter in length. To investigate our hypothesis, we create several question-answering datasets that require long-tail fact knowledge to answer, all annotated with related knowledge graphs. In our experiments, we found that prompting LLMs with KG triples improved question-answering performance dramatically: even relatively small models with only 7B parameters can outperform the latest GPT-4 model (without knowledge). Furthermore, when LLMs are instruction-finetuned on our dataset, KG prompting even outperforms gold passage prompting with the same LLM.

Poster 50 Dense Procedure Summarisation of Instructional videos

Authors: Anil Batra (University of Edinburgh), Frank Keller (University of Edinburgh), Laura Sevilla-Lara (University of Edinburgh)

Abstract: Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundary of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). Identifying the temporal boundary of each step is critical, as it helps to generate a correct summary of each step. However, current segmentation metrics often overestimate the step segmentation quality because they do not consider the temporal order of steps in the video. In our first work, we propose a new metric, SODA-D, that takes into account the temporal order of segments, giving a more reliable measure of the accuracy of predicted segmentation. Additionally, we utilize the differentiable version of SODA to perform the optimisation and observe an improvement of 6-7% over the state of the art on two instructional video datasets (YouCook2 and Tasty). However, both our method and prior methods work in a supervised setting on limited datasets, due to expensive annotation. To address this issue, we are currently working on large-scale pre-training using YouTube instructional videos.

Poster 51 What do Language Models know about "bow-ties" and "kangaroo courts"?

Authors: Maggie Mi (University of Sheffield), Aline Villavicencio (University of Sheffield)

Abstract: In recent years, pre-trained language models (PLMs) built using deep Transformer networks have been seen to dominate the field of SLT. They are trained on large amounts of data and achieve unmatched performance scores on a wide range of NLP tasks. Whilst many favour PLMs, there remains much unknown regarding why they work so well. This research aims to explore what PLMs learn about language by probing such models and decoding the knowledge that is stored in the representations. One way to do this is by prompting idiomatic expressions. Idiomatic expressions are commonly used in day-to-day language, but they prove difficult for machines to understand, due to their meaning being different from their surface forms. It is hoped that, through understanding how PLMs handle various linguistic phenomena, PLMs can be more interpretable, and their shortcomings can be precisely addressed.

Poster 52 The continuum of compositionality in natural language and its impact on neural models of language

Authors: Verna Dankers (University of Edinburgh), Chris Lucas (University of Edinburgh), Dieuwke Hupkes (Meta AI), Ivan Titov (University of Edinburgh)

Abstract: The field of natural language processing has recently seen increased interest in compositional generalisation: the objective of creating computational models of language that can build up the meaning of an input sequence from the meaning of the sequence's parts and the way they are combined. If computational models could compose meaning like humans, they would become more robust and better generalise to out-of-distribution data. However, the language produced by humans need not be compositional since it is rife with figurative and formulaic language. I elaborate on how the continuum of compositionality influences neural models that process language via my interpretability research.

Poster 53 Controllable Visual Storytelling

Authors: Danyang Liu (University of Edinburgh), Frank Keller (University of Edinburgh), Mirella Lapata (University of Edinburgh)

Abstract: We propose a new framework for visual storytelling, which provides good controllability compared to previous end-to-end visual storytelling models. Experimental results show that our approach outperforms baselines and SOTA models in terms of both automatic metrics and human evaluation metrics.

Poster 54 The Environmental Impact of Training Large Language Models

Authors: Miles Williams (University of Sheffield), Nikolaos Aletras (University of Sheffield)

Abstract: The upwards scaling of large language models (LLMs) has been shown to reliably improve performance across a broad set of downstream tasks. Recently, the phenomenon of emergent abilities has seen substantial research. An ability can be considered emergent if it is not present in smaller models, yet does appear in larger ones. Therefore, it is not possible to predict whether new abilities will emerge from further upwards scaling. Although the necessity for scaling LLMs continues to be robustly motivated, there are increasing concerns over the environmental impact of such models. For example, the training of GPT-3 (175 billion parameters) has been estimated as producing 502 tonnes of CO2 equivalent emissions. This poster considers the environmental impact of training a range of LLMs relative to examples from day-to-day life.

Poster 55 Automatic claims detection methodology for social science research

Authors: Sandrine Chausson (University of Edinburgh), Bjorn Ross (University of Edinburgh), Marion Fourcade (University of California, Berkeley), David Harding (University of California, Berkeley), Gregory Renard

Abstract: Many tasks in Computational Social Science involve classifying pieces of text based on the claims they contain. However, state-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce in both time and money. In light of this, we propose a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. The methodology involves using Natural Language Inference models to obtain the textual entailment between claims and documents from a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically sampled using the statistical heuristic of Probabilistic Bisection. Our methodology has the advantage of allowing expert knowledge to be easily incorporated in the NLP pipeline. We illustrate this methodology in the context of identifying claims relating to China in Tweets about the 2020 US elections.

Poster 56 UDALM: Unsupervised Domain Adaptation through Language Modeling

Authors: Constantinos Karouzos (University of Sheffield), Georgios Paraskevopoulos (NTUA), Alexandros Potamianos (NTUA)

Abstract: In this work, we explore Unsupervised Domain Adaptation (UDA) of pre-trained language models for downstream tasks. We introduce UDALM, a fine-tuning procedure using a mixed classification and Masked Language Model loss that can adapt to the target domain distribution in a robust and sample-efficient manner. Our experiments show that the performance of models trained with the mixed loss scales with the amount of available target data and the mixed loss can be effectively used as a stopping criterion during UDA training. Furthermore, we discuss the relationship between A-distance and the target error and explore some limitations of the Domain Adversarial Training approach. Our method is evaluated on twelve domain pairs of the Amazon Reviews Sentiment dataset, yielding 91.74% accuracy, a 1.11% absolute improvement over the state-of-the-art. (work presented in NAACL 2021)
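
The mixed loss described in this abstract can be illustrated with a minimal sketch. It assumes a simple weighted sum of a classification term and a Masked Language Model term; the `weight` parameter and both function names are hypothetical, since the abstract does not state how the two terms are balanced.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target entry in a probability vector."""
    return -math.log(probs[target_index])


def mixed_loss(clf_probs, clf_label, mlm_probs, mlm_token, weight=0.5):
    """Weighted sum of a classification loss and a masked-token loss.

    Assumption: `weight` balances the two terms; this is an illustrative
    choice, not the weighting scheme used by the authors.
    """
    clf = cross_entropy(clf_probs, clf_label)   # labelled source-domain example
    mlm = cross_entropy(mlm_probs, mlm_token)   # masked token from target-domain text
    return weight * clf + (1.0 - weight) * mlm


if __name__ == "__main__":
    # Toy probabilities standing in for model outputs.
    print(mixed_loss([0.5, 0.5], 0, [0.25, 0.75], 1))
```

With `weight=1.0` the sketch reduces to plain supervised classification, and with `weight=0.0` it reduces to language modelling on unlabelled target-domain text.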

Poster 57 Optimizing Spoken Human-Robot Interaction: User Profiling, Contextual Scenarios, and Affordances

Authors: Guanyu Huang (University of Sheffield), Roger K Moore (University of Sheffield)

Abstract: Speech-enabled interaction is a rapidly developing user interface that many technology products, including social robots, have adopted. In theory, such an interface is seen as a tool that can provide natural human-robot interaction (HRI). However, in reality, it has yet to be as effective. Due to the nature of spoken communication, social robots and human users are often considered mismatched partners. How can these partners achieve effective interaction? While many studies have focused on enhancing social robots' abilities to perceive and understand users, the work described in this poster places users in the central role. It examines how social robots' affordances, namely appearance, voice, and language behaviours, can be influenced by user profiling and user needs. In the case of user profiling, the relationship between users' knowledge, experience, and attitudes towards speaking agents and users' preferences and expectations of social robots is examined. In the case of user needs, the relationship between users' expectations of a social robot's warmth and competence and its affordance in scenarios is reviewed. The results demonstrate that a one-size-fits-all social robot does not work. They show that social robot affordance design should take a comprehensive approach to achieve more acceptable and effective HRI.

Poster 58 Self-supervised predictive coding models encode phone and speaker information in orthogonal subspaces

Authors: Oli Liu (University of Edinburgh), Hao Tang (University of Edinburgh), Sharon Goldwater (University of Edinburgh)

Abstract: Self-supervised speech representations are known to encode both speaker and phonetic information, but how they are distributed in the high-dimensional space remains largely unexplored. We hypothesize that they are encoded in orthogonal subspaces, a property that lends itself to simple disentanglement. Applying principal component analysis to representations of two predictive coding models, we identify two subspaces that capture speaker and phonetic variances, and confirm that they are nearly orthogonal. Based on this property, we propose a new speaker normalization method which collapses the subspace that encodes speaker information, without requiring transcriptions. Probing experiments show that our method effectively eliminates speaker information and outperforms a previous baseline in phone discrimination tasks. Moreover, the approach generalizes and can be used to remove information of unseen speakers.
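
Collapsing a subspace, as this abstract describes for speaker normalization, amounts to removing a vector's components along that subspace. The sketch below assumes an orthonormal basis for the speaker subspace is already available (for example, from principal component analysis); the function names and the toy basis are hypothetical and do not reproduce the authors' procedure.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def collapse_subspace(vector, basis):
    """Remove the components of `vector` that lie in the span of `basis`.

    Assumption: `basis` is a list of orthonormal direction vectors, so each
    projection coefficient can be computed from the original vector.
    """
    result = list(vector)
    for direction in basis:
        coeff = dot(vector, direction)
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result


if __name__ == "__main__":
    # Toy example: treat the first axis as the (hypothetical) speaker direction.
    speaker_basis = [[1.0, 0.0, 0.0]]
    print(collapse_subspace([3.0, 4.0, 5.0], speaker_basis))
```

After the projection, the result has zero inner product with every basis direction, while the components outside the collapsed subspace are unchanged.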

Poster 59 Multi-User Smart Speakers - A Narrative Review of Concerns and Problematic Interactions

Authors: Nicole Meng-Schneider (School of Informatics, University of Edinburgh), Rabia Yasa Kostas (School of Informatics, University of Edinburgh), Kami Vaniea (School of Informatics, University of Edinburgh), Maria K Wolters (School of Informatics, University of Edinburgh)

Abstract: Smart speakers in multi-user spaces, such as Amazon Echos, introduce risks to both owners and anyone sharing the space. They store voice recordings of user requests, and anyone in range can potentially interact with the device. As smart speakers are usually bound to a single account, despite being shareable by design, this introduces potential tensions between users. We systematically searched the literature for findings on concerns and scenarios in which problems may arise and synthesised the resulting 20 papers in a narrative review. Owners were concerned about other users' potentially malicious interactions, device faults, and third-party sharing. In contrast, bystanders worried about "being listened to" and a lack of awareness and protections. Our findings show a clear gap in the literature on the privacy concerns of regular and incidental secondary users of smart speakers.

Poster 60 Towards privacy-preserving Smart Homes via IoT device augmentation of voice control systems

Authors: Mary Hewitt (University of Sheffield), Hamish Cunningham (University of Sheffield)

Abstract: Voice control in the smart home is commonplace, enabling the convenient control of smart home Internet of Things (IoT) hubs, gateways and devices, along with information seeking dialogues. Cloud-based voice assistants are commonly relied upon to facilitate the interaction, yet rising privacy concerns are associated with both the cloud-side analysis and storage of user data. We look towards a private smart home assistant that uses IoT devices, gateways and local processing techniques to ensure the privacy of user data. Purely local voice control is not expected to reach a performance level sufficient to replace the cloud, yet systems that exploit the IoT to augment voice control might viably do so. The work analyses how IoT devices can provide additional context and input modalities to both assist and simplify the speech recognition task, in order to bridge the performance gap between local and cloud-based voice control systems.

Poster 61 Extending LLMs for Planning

Authors: Gautier Dagan (University of Edinburgh), Frank Keller (University of Edinburgh), Alex Lascarides (University of Edinburgh)

Abstract: While Large Language Models (LLMs), such as GPT-4 (OpenAI, 2023), can solve many NLP tasks in zero-shot settings, LLMs remain tricky to use in embodied agents. LLMs hallucinate and are brittle to specific prompts, which makes them unreliable long-term planners. Complex plans that require multi-step reasoning become difficult (and costly) as the context window grows. Planning in an environment requires understanding one's actions, how these affect the world, and the goal state and environment. Traditional symbolic planners, such as the Fast-Forward Planner (Hoffmann, 2001), find optimal solutions quickly. However, modern planners require problem and domain descriptions up front (McDermott et al., 1998), severely limiting their usage in real environments. Whereas modern LLMs need minimal information to figure out the task, traditional planners need maximal information. Our work presents a neuro-symbolic framework where an LLM works hand in hand with a traditional planner (Fast-Forward) to solve an embodied task.

Poster 62 Encoded Representations as Language

Authors: Henry Conklin (University of Edinburgh), Ivan Titov (University of Edinburgh, University of Amsterdam), Kenny Smith (University of Edinburgh)

Abstract: While large-scale neural networks have increasingly shown impressive performance across a broad array of linguistic tasks, they often struggle to generalize to novel examples drawn from outside their training data, even when those examples would prove easy for a human given the same training. Human language affords us robust generalization, letting us produce and interpret sentences we've never encountered before. In language, this generalizability is driven in part by two structural properties: compositionality, with whole utterances made up of reusable parts, and regularity, the tendency to map meanings to representations in predictable ways. During training, deep neural networks learn to map an input to a continuous hidden representation, then map that representation to their prediction. We show that the mapping from inputs to representations can be framed as a kind of language, and that to improve generalization performance we should encourage that mapping to exhibit both compositionality and regularity. We introduce a novel set of measures for assessing representational regularity, and show that the degree of regularity a model converges to is strongly predictive of how well it generalizes.

Poster 63 Towards Personification in Controllable Text to Speech 

Authors: Nicholas Sanders (University of Edinburgh), Korin Richmond (University of Edinburgh), Simon King (University of Edinburgh)

Abstract: When listening to a voice, listeners make inferences and assumptions about the speaker, constructing a mental model or persona of who that person is and how they behave. The act of a listener personifying a voice can heavily influence interaction. However, Text to Speech (TTS) stereotypically is maximized for subjective naturalness, while also lacking the control and variation that a human speaker may have. In this work, I'll present my general view on personification in TTS and how it may be subjectively measured, how speaker identity and behavioural traits may be reflected in synthetic voice, and how we can influence listener perceptions of persona through controllable TTS. I'll also present some recent work in the domain of contrastive stress through symbolic control.

Poster 64 Hallucinations and Contrastive Decoding in NMT

Authors: Jonas Waldendorf (University of Edinburgh), Alexandra Birch (University of Edinburgh), Barry Haddow (University of Edinburgh)

Abstract: Hallucinations in Neural Machine Translation undermine trust in translation systems and can lead to severe issues when users do not speak the target-side language. The root cause of hallucinations is when a model incorrectly attends to the source, leading to either non-factual or nonsensical translations. In high-resource scenarios they are a rare but critical failure mode, and as the amount and quality of data decrease, the rate of hallucinations increases. We propose to alleviate hallucinations by using contrastive decoding. By comparing the outputs of the original model (expert) to those of amateur models that intentionally hallucinate, we aim to encourage the generation of tokens that are only likely under the original model and hence less likely to be hallucinated.
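
The expert-versus-amateur comparison can be sketched for a single decoding step. The scoring rule here (difference of log-probabilities, restricted to tokens the expert finds plausible) is one common formulation of contrastive decoding and is an assumption, as are the function name, the `alpha` cut-off and the toy distributions; the abstract does not specify the exact rule used.

```python
import math


def contrastive_pick(expert_probs, amateur_probs, alpha=0.1):
    """Pick the next token by contrasting expert and amateur distributions.

    Assumptions: both arguments map the same tokens to non-zero
    probabilities, and `alpha` sets a plausibility cut-off relative to the
    expert's most likely token.
    """
    threshold = alpha * max(expert_probs.values())
    best_token, best_score = None, -math.inf
    for token, p_expert in expert_probs.items():
        if p_expert < threshold:
            continue  # skip tokens the expert itself considers implausible
        score = math.log(p_expert) - math.log(amateur_probs[token])
        if score > best_score:
            best_token, best_score = token, score
    return best_token


if __name__ == "__main__":
    expert = {"cat": 0.5, "dog": 0.4, "zzz": 0.1}
    amateur = {"cat": 0.3, "dog": 0.1, "zzz": 0.6}
    print(contrastive_pick(expert, amateur))
```

In this toy step, the token the amateur strongly prefers is penalised, and the token that is likely under the expert but unlikely under the amateur is selected.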

Poster 65 Singing Voice Banking and Conversion

Authors: Cliodhna Hughes (University of Sheffield), Guy Brown (University of Sheffield), Ning Ma (University of Sheffield)

Abstract: There has previously been a focus on voice banking for speech, for example to create personalised speech synthesisers for people with motor neurone disease and Parkinson's, to help preserve their vocal identity. Voice banking for singing is, however, relatively understudied. This project involves working with members of Trans Voices, the UK's first professional trans+ choir, to bank singing voices for use in singing voice conversion tools. These could be used creatively, for example, to show the progression of transgender voices through various voice-altering gender affirmation treatments. Recordings have been obtained from a member of Trans Voices before and after gender affirmation surgery. A preliminary acoustic analysis of this data will be presented. Since the project involves working with a client, this raises an issue with determining which evaluation methods should be used, as evaluation methods should both reflect the desires of the client, and demonstrate whether the research is an advance on the state-of-the-art. This poster will provide a description of the additional challenges associated with singing voice conversion compared with speech, a preliminary acoustic analysis of the recordings obtained from a singer before and after gender affirmation surgery, and an overview of the evaluation challenges associated with this project.

Poster 66 HPSG-based analysis of verb clustering in Germanic languages

Authors: Sara Grzelak (University of Sheffield), Mark Hepple (University of Sheffield)

Abstract: Verb clustering is a linguistic phenomenon in which two or more verbs occur adjacently in a sentence. Especially interesting are Germanic verb clusters because of their complexity and order variability. From the syntactic perspective, the development of a simple uniform analysis of such verb clusters poses a challenge. Influential analyses of verb clusters developed in Head-driven phrase structure grammar (HPSG) either fail to account for all possible orders or employ a complicated heavy-weight apparatus which departs from the general goal of providing the simplest possible explanation of such phenomena. Our goal is to propose a competing account that reduces the complexity of existing analyses while accounting for a full range of grammatical orders in German and Dutch verb clusters. We aim to achieve this goal by introducing flexible structure assignment to HPSG which originates from flexible categorial grammars. Flexible structure assignment means that verb clusters are no longer obliged to fit into one structural pattern, such as bottom-up, but they can be analysed more flexibly. With this novel approach to the analysis of verb clusters, we aim to provide an effective and more elegant HPSG-based account of this linguistic phenomenon.