Annual Conference 2021 – Posters
Poster 1 – Acoustic monitoring of sleep-disordered breathing
Author: Ning Ma
Co-authors: Hector Romero, Guy Brown
Abstract: Sleep-disordered breathing (SDB) is a debilitating condition that affects a significant proportion of the population. One of its most serious forms is obstructive sleep apnoea (OSA), which results from the partial or complete collapse of the upper airway during sleep. This severely interrupts breathing during sleep, leading to fatigue, daytime sleepiness, and increased risk of stroke, heart attack, high blood pressure and diabetes. The gold standard for diagnosing SDB is the polysomnography (PSG) test, but it is uncomfortable, time-consuming and expensive. Many OSA sufferers are not identified until these other medical problems become apparent, meaning that they are less likely to make lifestyle changes that could improve their condition without the need for treatment.
In our research, the acoustic analysis of breathing sounds during sleep has been leveraged as an inexpensive and less obtrusive alternative for OSA screening. We have demonstrated promising results for automatic monitoring of sleep disorders using both standard smartphones and newly developed bespoke recording devices, offering the prospect of cheap and continuous monitoring for SDB at home.
Poster 2 – Highly Efficient Knowledge Graph Embedding Learning with Orthogonal Procrustes Analysis
Author: Xutan Peng
Co-authors: Guanyi Chen (Utrecht University), Chenghua Lin (University of Sheffield), Mark Stevenson (University of Sheffield)
Abstract: Knowledge Graph Embeddings (KGEs) have been intensively explored in recent years due to their promise for a wide range of applications. However, existing studies focus on improving the final model performance without acknowledging the computational cost of the proposed approaches, in terms of execution time and environmental impact. This paper proposes a simple yet effective KGE framework which can reduce the training time and carbon footprint by orders of magnitude compared with state-of-the-art approaches, while producing competitive performance.
We highlight three technical innovations: full batch learning via relational matrices, closed-form Orthogonal Procrustes Analysis for KGEs, and non-negative-sampling training. In addition, as the first KGE method whose entity embeddings also store full relation information, our trained models encode rich semantics and are highly interpretable. Comprehensive experiments and ablation studies involving 13 strong baselines and two standard datasets verify the effectiveness and efficiency of our algorithm.
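The closed-form step named above is classical Orthogonal Procrustes Analysis. The sketch below (plain NumPy, with invented toy data) shows only that closed-form solution, not the full KGE training framework described in the abstract.

    import numpy as np

    def orthogonal_procrustes(A, B):
        """Closed-form solution of min_W ||A W - B||_F subject to W^T W = I.

        A, B: (n, d) matrices. Returns the orthogonal matrix W that best maps
        the rows of A onto the rows of B."""
        # The optimum is W = U V^T, where U S V^T is the SVD of A^T B.
        U, _, Vt = np.linalg.svd(A.T @ B)
        return U @ Vt

    # Toy usage: recover a random orthogonal map from noisy observations.
    rng = np.random.default_rng(0)
    W_true, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
    A = rng.normal(size=(100, 16))
    B = A @ W_true + 0.01 * rng.normal(size=(100, 16))
    W_est = orthogonal_procrustes(A, B)
    print(np.allclose(W_est, W_true, atol=0.05))           # True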
Poster 3 – Snorer Diarisation Based on Deep Neural Network Embeddings
Author: Hector Romero
Co-authors: Ning Ma and Guy J. Brown
Abstract: Acoustic analysis of sleep breathing sounds using a smartphone at home provides a much less obtrusive means of screening for sleep-disordered breathing (SDB) than assessment in a sleep clinic. However, application in a home environment is confounded by the problem that a bed partner may also be present and snore. This paper proposes a novel acoustic analysis system for snorer diarisation, a concept extrapolated from speaker diarisation research, which allows screening for SDB of both the user and the bed partner using a single smartphone.
The snorer diarisation system involves three steps. First, a deep neural network (DNN) is employed to estimate the number of concurrent snorers in short segments of monaural audio recordings. Second, the identified snore segments are clustered using snorer embeddings, a feature representation that allows different snorers to be discriminated. Finally, a snore transcription is automatically generated for each snorer by combining consecutive snore segments. The system is evaluated on both synthetic snore mixtures and real two-snorer recordings. The results show that it is possible to accurately screen a subject and their bed partner for SDB in the same session from recordings of a single smartphone.
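The second step above relies on grouping segment-level embeddings by snorer. The sketch below shows a generic embedding-clustering step with scikit-learn on random stand-in embeddings; it is not the paper's snorer-embedding model, and the shapes and times are hypothetical.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # One embedding vector per detected snore segment (random stand-ins).
    emb = np.random.randn(40, 128)          # 40 snore segments, 128-dim embeddings
    segment_times = np.arange(40) * 2.0     # dummy segment start times in seconds

    # Cluster the segments into two snorers.
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(emb)

    # Group segments by cluster label to build a per-snorer transcription.
    for snorer in (0, 1):
        times = segment_times[labels == snorer]
        print(f"snorer {snorer}: {len(times)} segments")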
Poster 4 – The Use of Voice Source Features for Sung Speech Recognition
Author: Gerardo Roa Dabike
Co-authors: Jon Barker
Abstract: In this paper, we ask whether vocal source features (e.g. pitch, shimmer, jitter) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs spoken voicing characteristics, including pitch range, syllable duration, vibrato, jitter and shimmer. We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus, using a state-of-the-art acoustic model and augmenting conventional features with various voice source parameters. Experiments are run with three standard (increasingly large) training sets: DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours). Pitch combined with degree of voicing produces a significant decrease in WER from 38.1% to 36.7% when training with DSing1; however, smaller decreases in WER observed when training with the larger, more varied DSing3 and DSing30 sets were not found to be statistically significant. Voicing quality characteristics did not improve recognition performance, although analysis suggests that they do contribute to improved discrimination between voiced/unvoiced phoneme pairs.
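As an illustration of the kind of voice source features discussed above, the sketch below extracts fundamental frequency and a per-frame degree of voicing with librosa's pYIN implementation. The file name is a placeholder, and jitter/shimmer would normally come from dedicated tools rather than this snippet.

    import librosa
    import numpy as np

    # Load a singing excerpt (path is a placeholder).
    y, sr = librosa.load("example_sung_phrase.wav", sr=16000)

    # Fundamental frequency and per-frame voicing probability via pYIN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )

    # Simple summary statistics of the kind compared between sung and spoken speech.
    f0_voiced = f0[~np.isnan(f0)]
    print("pitch range (Hz):", f0_voiced.min(), "-", f0_voiced.max())
    print("mean degree of voicing:", voiced_prob.mean())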
Poster 5 – Time-Domain Speech Extraction With Spatial Information And Multi Speaker Conditioning Mechanism
Author: Jisi Zhang
Co-authors: Jon Barker, Catalin Zorila, and Rama Doddipatla
Abstract: In this paper, we present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To efficiently inform the speaker information to the extraction model, we propose a new speaker conditioning mechanism by designing an additional speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline, and increases speech recognition accuracy by more than 16% relative over the same baseline.
Poster 6 – Speech to Speech translation through Language Embeddings
Author: Kyle Reed
Co-authors: Thomas Hain, Anton Ragni
Abstract: This project is concerned with speech-to-speech translation (S2ST). The traditional approach to this problem uses cascaded ASR, MT and TTS models applied in sequence to map from the source to the target language. However, this approach has limitations, including error compounding, loss of para-linguistic features, and large computational requirements. These limitations motivate models that map directly from the source to the target language without an intermediate textual representation, as has been achieved in speech-to-text translation.
A key challenge of S2ST is obtaining the appropriate data in the necessary quantities. While direct speech-to-speech data is scarce, there is a wealth of complementary data in the form of text-to-text translation or speech-to-text translation. We hypothesise that it is possible to leverage these better-resourced, complementary modalities to construct an 'effective' latent space for S2ST. We aim to explore physical and effective spaces to determine which proves more suitable for S2ST. We will additionally explore appropriate methodologies for projection of the source signal onto the space and generation of a target signal from it. We will also develop heuristic approaches to evaluate the effective space that are reliable and cost-effective.
Poster 7 – Improving Variational Autoencoder for Text Modelling with Timestep-Wise Regularisation
Author: Ruizhe Li
Co-authors: Chenghua Lin (University of Sheffield), Xiao Li (University of Sheffield), Guanyi Chen (Utrecht University)
Abstract: The Variational Autoencoder (VAE) is a popular and powerful model applied to text modelling to generate diverse sentences. However, an issue known as posterior collapse (or KL loss vanishing) arises when the VAE is used for text modelling, where the approximate posterior collapses to the prior, and the model totally ignores the latent variables, degenerating into a plain language model during text generation. This issue is particularly prevalent when RNN-based VAE models are employed for text modelling. In this paper, we propose a simple, generic architecture called Timestep-Wise Regularisation VAE (TWR-VAE), which can effectively avoid posterior collapse and can be applied to any RNN-based VAE model. The effectiveness and versatility of our model are demonstrated on different tasks, including language modelling and dialogue response generation.
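Posterior collapse is usually diagnosed by watching the KL term of the VAE objective shrink towards zero. The sketch below is just that standard KL term for a diagonal Gaussian posterior against a standard normal prior (PyTorch, dummy tensors); the paper's timestep-wise regularisation applies such a term at every RNN timestep rather than only at the final one.

    import torch

    def gaussian_kl(mu, logvar):
        """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior,
        summed over latent dimensions and averaged over the batch."""
        return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()

    # Posterior collapse shows up as this term shrinking towards zero during training.
    mu = torch.zeros(32, 64)        # batch of 32, 64-dim latent (dummy values)
    logvar = torch.zeros(32, 64)
    print(gaussian_kl(mu, logvar))  # 0.0 => posterior identical to the prior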
Poster 8 – Spoken Language Interaction Between Mismatched Partners
Author: Guanyu Huang
Co-authors: Roger Moore
Abstract: Spoken-language interactions between a human being and an artificial device (such as a social robot) have become very popular in recent years, but the user experience with such devices is not very satisfactory. Apart from the limits of current speech and language technologies, it is hypothesised that users' dissatisfaction is partly caused by the mismatched abilities of humans and artificial devices. Due to limited cognitive abilities, such agents cannot take the perceptions of others into consideration, nor react to situations accordingly, which results in unsuccessful communicative interaction. Communicative interaction efficiency involves multiple factors across disciplines such as linguistics, psychology, and artificial intelligence. However, the role of these influential factors and the mechanism by which they interact with each other are unknown.
Here the project aims to develop a unified framework which can characterise communicative interaction efficiency. It will investigate the factors and strategies that make a spoken-language interaction effective. Based on an understanding of the nature of spoken language interaction between humans and artificial devices, the next objective is to maximise the affordance of speech-enabled artefacts and to achieve more effective communicative interaction between a human being and an artificial device. It is hoped that the results of this project will provide a general guideline for communicative interaction between a human being and an artificial device.
Poster 9 – Towards Low-Resource Stargan Voice Conversion Using Weight Adaptive Instance Normalization
Author: Mingjie Chen
Co-authors: Yanpei Shi, Thomas Hain
Abstract: Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years. It remains challenging because of the lack of ground-truth parallel data. StarGAN-based models have gained attention because of their efficiency and effectiveness. However, most StarGAN-based work has focused on a small number of speakers and large amounts of training data. In this work, we aim to improve the data efficiency of the model and achieve many-to-many non-parallel StarGAN-based voice conversion for a relatively large number of speakers with limited training samples.
In order to improve data efficiency, the proposed model uses a speaker encoder for extracting speaker embeddings and weight adaptive instance normalization (W-AdaIN) layers. Experiments are conducted with 109 speakers under two low-resource conditions, where the number of training samples is 20 or 5 per speaker. An objective evaluation shows that the proposed model outperforms baseline methods significantly. Furthermore, a subjective evaluation shows that, for both naturalness and similarity, the proposed model outperforms the baseline methods.
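W-AdaIN is the authors' variant; the sketch below shows only standard adaptive instance normalisation conditioned on a speaker embedding, as a hypothetical PyTorch module illustrating the general mechanism rather than the paper's implementation.

    import torch
    import torch.nn as nn

    class AdaIN1d(nn.Module):
        """Adaptive instance normalisation over a (batch, channels, time) feature
        map, with scale and shift predicted from a speaker embedding. A generic
        sketch, not the paper's W-AdaIN."""

        def __init__(self, channels, speaker_dim):
            super().__init__()
            self.norm = nn.InstanceNorm1d(channels, affine=False)
            self.affine = nn.Linear(speaker_dim, 2 * channels)

        def forward(self, x, spk_emb):
            gamma, beta = self.affine(spk_emb).chunk(2, dim=-1)   # (batch, channels) each
            x = self.norm(x)
            return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

    layer = AdaIN1d(channels=80, speaker_dim=128)
    out = layer(torch.randn(4, 80, 200), torch.randn(4, 128))
    print(out.shape)   # torch.Size([4, 80, 200])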
Poster 10 – Adaptive Natural Language Generation for Technical Documents
Author: Tomas Goldsack
Co-authors: Chenghua Lin, Carolina Scarton
Abstract: Technical documents often contain content that is incomprehensible to a non-specialist audience, making it difficult for those without specific knowledge of the subject area to understand the ideas that such a piece of work conveys. Natural Language Generation (NLG) tasks such as text simplification, summarisation, and style transfer all revolve around the adaptation of input data to fit a particular purpose, whilst preserving the essence of its core content. Such techniques could be used to communicate the key ideas from technical documents in a format that is more digestible to a non-specialist audience. However, the style in which such ideas should be presented, and the degree to which content should be simplified, is dependent on the audience. This project will involve the research of automatic data-to-text NLG techniques that are able to adapt to a given audience. This could, for example, involve leveraging extra-textual audience information to provide a personalised output. These are to be used for the transformation of technical documents so that their content is more suited to a different audience or domain. As the data typically used for the aforementioned tasks is significantly shorter than such documents, techniques for handling large documents are also likely to be investigated.
Poster 11 – Insights on Neural Representations for End-to-End Speech Recognition
Author: Anna Ollerenshaw
Co-authors: Md Asif Jalal, Thomas Hain
Abstract: End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, there are limited tools available to understand the internal functions and the effect of hierarchical dependencies within the model architecture. It is crucial to understand the correlations between the layer-wise representations in order to derive insights into the relationship between neural representations and performance. Correlation analysis techniques have previously been used to investigate network similarities, but they have not been explored for end-to-end ASR models.
This paper analyses and explores the internal dynamics between layers during training for CNN-, LSTM- and Transformer-based approaches, using canonical correlation analysis (CCA) and centred kernel alignment (CKA) for the experiments. It was found that neural representations within CNN layers exhibit hierarchical correlation dependencies as layer depth increases, but this is mostly limited to cases where the neural representations correlate more closely. This behaviour is not observed in the LSTM architecture; however, a bottom-up pattern is observed across the training process, while Transformer encoder layers exhibit irregular correlation patterns as depth increases. Altogether, these results provide new insights into the role that neural architectures have upon speech recognition performance. More specifically, these techniques can be used as indicators to build better-performing speech recognition models.
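As an illustration of one of the analysis tools named above, the following is a minimal NumPy sketch of linear CKA between two layers' activations, following the standard formulation; the activation matrices here are random stand-ins for real layer outputs.

    import numpy as np

    def linear_cka(X, Y):
        """Linear centred kernel alignment between two layer representations.

        X: (n_examples, d1), Y: (n_examples, d2) activations for the same inputs.
        Returns a similarity in [0, 1]."""
        X = X - X.mean(axis=0, keepdims=True)
        Y = Y - Y.mean(axis=0, keepdims=True)
        hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
        return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 256))              # e.g. activations from layer i
    Y = X @ rng.normal(size=(256, 128))          # a linear transform of the same inputs
    print(linear_cka(X, X), linear_cka(X, Y))    # 1.0, and high for the transformed copy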
Poster 12 – Who and What is When and Where? And how often? The identification (and re-identification) of characters and actions using acoustical and visual data
Author: Joshua Smith
Co-authors: Dr Yoshi Gotoh, Dr Stefan Goetze
Abstract: Over the past decade, strides have been made in both audio and video analysis. To create scene or story descriptions for media, audio/video analysis can be combined with natural language generation to automatically create a synopsis of the plot or character details. This project aims to take the next step in combined audio-visual analysis, focusing on character identification and re-identification and on summarising what happens on screen. Visual and acoustic analysis can mutually benefit from the respective other domain when identifying actions taking place on screen. Methods for video identification already work well independently of the amount of acoustic noise, provided the object to be identified is clearly visible in the scene. For identification based on audio, the object can be in difficult lighting conditions and does not even have to appear on screen. Speaker identification and re-identification are possible from the audio signal even when no visual data is available. Based on information extracted from video and audio content, the project aims to create scene and plot descriptions useful as input for state-of-the-art metadata enrichment, as used in movie streaming platforms. This poster presents initial findings on audio-, video- and audio-video object identification and re-identification.
Poster 13 – Multilingual Content Reuse in Social Media and Misinformation Websites
Author: Melissa Thong
Co-authors: Professor Kalina Bontcheva and Dr Carolina Scarton
Abstract: Misinformation poses a serious threat to society through its power to mislead the public across domains such as politics, the environment and health. The quick generation and spread of misinformation online makes it extremely difficult to tackle this problem through manual verification alone. The use of natural language processing techniques and machine learning models can therefore facilitate the identification of pieces of misinformation and the detection of misinformation networks. This project aims to investigate the spread of misinformation through the angle of content reuse, where the same story is reused across websites or social media with modifications. We will first identify patterns and features of content reuse through methods such as semantic text similarity and plagiarism detection, before using these features to build machine learning classifiers. We will examine content reuse across multiple languages, as well as through additional features like images in order to create multimodal models.
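As an illustration of the semantic text similarity component mentioned above, the sketch below scores a pair of sentences with a publicly available multilingual sentence encoder. The model name and the example claims are illustrative choices only, not the project's actual pipeline or data.

    from sentence_transformers import SentenceTransformer, util

    # Multilingual sentence embeddings allow reuse to be detected across languages.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # Two invented example claims, one a Spanish rewrite of the other.
    original = "The vaccine was found to alter human DNA, scientists claim."
    reposted = "Según científicos, la vacuna modifica el ADN humano."

    emb = model.encode([original, reposted], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"cross-lingual similarity: {score:.2f}")   # high scores suggest content reuse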
Poster 14 – Robust Speech Recognition in Complex Acoustic Scenes
Author: George Close
Co-authors: Stefan Goetze
Abstract: This project aims to increase ASR robustness in challenging acoustic far-field scenarios. For this, methods from the area of signal processing for signal enhancement and separation will be applied to improve ASR performance, and novel methods for acoustic and language modelling will be developed. In this way, ASR performance will be increased for adverse conditions with noise and reverberation, and particularly in environments with competing speech signals. To achieve this goal, different techniques from the signal processing and machine learning areas, and combinations of both, are applied, with attention to the technical challenges associated with implementing such systems, particularly in smaller, low-power devices such as smart speakers or hearing aids. Additionally, novel techniques involving the use of semantic information to improve performance will be explored.
Poster 15 – Simulating realistically-spatialised simultaneous speech using video-driven speaker detection and the CHiME-5 dataset
Author: Jack Deadman
Co-authors: Jon Barker
Abstract: Simulated data plays a crucial role in the development and evaluation of novel distant-microphone ASR techniques. However, commonly used simulated datasets adopt uninformed and potentially unrealistic speaker location distributions. We wish to generate more realistic simulations driven by recorded human behaviour. Using devices with a paired microphone array and camera, we analyse unscripted dinner party scenarios (CHiME-5) to estimate the distribution of speaker separation in a realistic setting. We deploy face-detection and pose-detection techniques on 114 cameras to automatically locate speakers in 20 dinner party sessions. Our analysis found that, on average, the separation between speakers was only 17 degrees. We use this analysis to create datasets with realistic distributions and compare them with commonly used datasets of simulated signals. By changing the position of speakers, we show that the word error rate can increase by over 73.5% relative when using a strong speech enhancement and ASR system.
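For reference, the speaker-separation statistic reported above is simply the angle subtended at the microphone array by two speaker positions. The sketch below computes it with NumPy for hypothetical positions chosen to give roughly the 17-degree average found in the analysis.

    import numpy as np

    def angular_separation(mic_pos, spk_a, spk_b):
        """Angle (degrees) between two speakers as seen from a microphone array."""
        va = np.asarray(spk_a) - np.asarray(mic_pos)
        vb = np.asarray(spk_b) - np.asarray(mic_pos)
        cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    # Hypothetical positions in metres: array on a kitchen wall, two seated speakers.
    print(angular_separation(mic_pos=[0.0, 0.0, 1.5],
                             spk_a=[2.0, 1.0, 1.2],
                             spk_b=[2.0, 1.9, 1.2]))   # about 17 degrees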
Poster 16 – Accent-agnostic speech analytics for detecting neurodegenerative conditions
Author: Samuel Hollands
Co-authors: Heidi Christensen and Daniel J Blackburn
Abstract: Neurodegenerative conditions such as dementia are currently the leading cause of death in the United Kingdom and a rapidly emerging problem across the globe, as increased life expectancies correlate with the growth of conditions associated with ageing. This research acknowledges that, whilst important work is being conducted to detect cognitive impairment from diseases such as Alzheimer's using speech and language processing, limited energy is being devoted to the diversification and accessibility of these technologies for non-native speakers and individuals with varying accents.
This study initially aims to explore existing corpora and, where necessary, develop a corpus of L2 English speakers with varying levels of cognitive impairment. Once this is in place, experimentation will be conducted in two primary areas. Firstly, the exploration of new approaches to eliciting language features indicative of cognitive impairment, using accent- and language/culturally-agnostic strategies that diverge from existing Anglocentric and Eurocentric methodologies. Secondly, the exploration of existing approaches to classification and data pre-processing, experimenting with potential avenues of improvement in terms of system accuracy and bias reduction. The ultimate objective is to improve the accessibility of these currently experimental technologies to ensure a more universal benefit irrespective of patient background.
Poster 17 – Methods for Detection and Representation of Idiomatic Multiword Expressions
Author: Edward Gow-Smith
Co-authors: Carolina Scarton and Aline Villavicencio
Abstract: Computational methods for natural language processing (NLP) require representing sequences of language as vectors, or embeddings. Modern representational approaches generate contextualised word embeddings, where how a word is represented depends on the context it appears in. The embeddings generated by state-of-the-art language representation models such as BERT and ELMo have pushed performance on a variety of NLP tasks; however, there are still areas in which these embeddings perform poorly. One of these areas is the handling of multiword expressions (MWEs), which are phrases that are semantically or syntactically idiosyncratic. These are wide-ranging in human language, estimated to comprise half of a speaker's lexicon, and thus are an important area for research. Examples include 'kick the bucket' (to die), 'by and large' and 'car park'. In particular, state-of-the-art contextualised word embeddings have been shown to perform poorly at capturing the idiomaticity of MWEs, where the meaning of the expression is not deducible from the meanings of the individual words (non-compositionality). This inability is detrimental to the performance of NLP models on downstream tasks. In machine translation, for example, idiomatic phrases may be translated literally and thus the wrong meaning conveyed. This poster details the problem and approaches to solving it.
Poster 18 – Developing Interpretable and Computational Efficient Deep Learning models for NLP
Author: Ahmed Alajrami
Co-authors: Dr Nikolaos Aletras
Abstract: Deep neural network models have achieved state-of-the-art results in different Natural Language Processing tasks in recent years. Their superior performance comes at the cost of understanding how they work and how their output results can be justified. It is still challenging to interpret the intermediate representations and explain the inner workings of neural network models. The lack of justification for the decisions of deep neural network models is considered one of the main reasons that deep learning models are not widely used in some critical domains such as health and law. Therefore, there is a need to design more interpretable and explainable deep neural network models while reducing their computing requirements.
The aim of this PhD research project is to make deep neural network models for NLP more interpretable and explainable. As a first step, we will investigate these models' architecture and how each component works. Next, we will try different approaches to reduce the computational complexity of these models and improve their interpretability while achieving comparable accuracy and performance. In addition, the research project will investigate how model interpretations can help in designing more data-efficient models. Furthermore, we will investigate the design of novel methods for evaluating the quality of explanations of deep neural networks.
Poster 19 – A 'hybrid' approach to automatic speech recognition
Author: Rhiannon Mogridge
Co-authors: Anton Ragni
Abstract: Modern techniques for Automatic Speech Recognition (ASR) are typically data driven and depend on the availability of suitable training data. For some languages, such as native British and American English, there is ample data for building ASR systems. For many other languages, however, a large volume of accessible training data does not exist. How can the quality of ASR for under-resourced languages be improved? Current state-of-the-art ASR systems distil the information contained within vast amounts of training data into parametrised models. The data is then discarded and only the trained model is used for future decision-making. In contrast, some historic ASR techniques used the data directly. For example, Dynamic Time Warping (DTW), a method of ASR popular in the 1980s, used similarity metrics to identify the phones within a speech clip by comparing each frame with a library of examples. While modern data-driven methods are demonstrably superior in most situations, this project explores the usefulness of hybrid methods that use both parametrised models and direct examples to make better use of limited data.
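As a reminder of how the classic template-matching approach works, the following is a minimal NumPy sketch of DTW-based nearest-template classification; the feature sequences are random stand-ins for MFCC frames, and the labels are hypothetical.

    import numpy as np

    def dtw_distance(X, Y):
        """Dynamic time warping distance between two feature sequences
        X (n, d) and Y (m, d), e.g. MFCC frames of a test clip and a template."""
        n, m = len(X), len(Y)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(X[i - 1] - Y[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return cost[n, m]

    # Classify a clip by its nearest template in a small library of labelled examples.
    rng = np.random.default_rng(0)
    templates = {"pa": rng.normal(size=(40, 13)), "ba": rng.normal(size=(45, 13))}
    clip = rng.normal(size=(42, 13))
    print(min(templates, key=lambda label: dtw_distance(clip, templates[label])))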
Poster 20 – Summarisation of Argumentative Conversation
Author: Jonathan Clayton
Co-authors: Rob Gaizauskas
Abstract: My PhD project seeks to combine Argument Mining with text summarisation. That is, given some text involving a debate between multiple participants (e.g. in online comments on news or in a parliamentary debate), can we automatically produce a summary of the key topics under discussion and the viewpoints expressed by the participants? The question of how to produce a machine-generated summary of an argument is an open one. Several intriguing research questions may be addressed. How do we best represent the components of arguments under discussion? What techniques or models are suitable for extracting these from text? Additionally, can we envision a suitable algorithm to convert these argument components into a readable summary (either text or graph-based)? The difficulties inherent in this task are numerous. Principal among them is the fact that the key units of analysis in argumentative conversation rarely correspond to surface-level features like words, and inferring these latent structures is a non-trivial problem. Additionally, training data explicitly addressing these tasks are scarce. Overcoming these obstacles is, therefore, an interesting challenge, rarely addressed in the NLP literature.
Poster 21 – What's This Song About? A Top-k Topic Classification Approach Using Lyrics and Listeners' Interpretations
Author: Varvara Papazoglou
Co-authors: Robert Gaizauskas
Abstract: Automatic topic classification of song lyrics has recently drawn the attention of researchers. Previous work has shown that incorporating listeners' interpretations of the lyrics can significantly improve the accuracy of topic classification. These interpretations are in the form of comments about the whole song. Using interpretations in the form of comments about specific fragments of the lyrics, we experiment with four representations of song lyrics as input for classification systems. The results show that some representations are consistently better than the others, and also suggest that the similarity of topic classes along with the ambiguity of song lyrics may affect the classification accuracy. This argues for using top-k classification, which associates multiple top ranking classes with each song.
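To make the top-k idea concrete, the sketch below returns the k highest-scoring topic classes per song from a matrix of classifier scores; the class names and scores are invented for illustration.

    import numpy as np

    def top_k_predictions(scores, class_names, k=3):
        """Return the k highest-scoring topic classes for each song."""
        order = np.argsort(scores, axis=1)[:, ::-1][:, :k]
        return [[class_names[i] for i in row] for row in order]

    # Hypothetical classifier scores for two songs over five topic classes.
    classes = ["love", "heartbreak", "party", "social issues", "religion"]
    scores = np.array([[0.40, 0.35, 0.05, 0.15, 0.05],
                       [0.10, 0.05, 0.05, 0.45, 0.35]])
    print(top_k_predictions(scores, classes, k=2))
    # [['love', 'heartbreak'], ['social issues', 'religion']] -- under top-k evaluation,
    # a prediction counts as correct if the annotated topic appears anywhere in the list.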
Poster 22 – Dhasp: Differentiable Hearing Aid Speech Processing
Author: Zehai Tu
Co-authors: Ning Ma, Jon Barker
Abstract: We explore a data-driven approach to hearing aid fitting that automatically optimises the hearing aid gain parameters for maximum speech intelligibility by using a differentiable hearing aid speech processing (DHASP) framework. We use objective measures to compare our approach to the commonly used NAL-R fitting formula.
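The core idea above is that fitting becomes an optimisation problem once the processing chain is differentiable. The toy sketch below illustrates only that optimisation pattern, with learnable per-band gains and an invented stand-in audibility objective; it is not the DHASP auditory model or the NAL-R formula, and all values are dummies.

    import torch

    # Per-band hearing aid gains as learnable parameters, fitted by gradient descent
    # against a differentiable proxy objective (a stand-in, not DHASP's intelligibility model).
    bands = 6
    gains_db = torch.zeros(bands, requires_grad=True)              # gains to be fitted
    speech_db = torch.tensor([55., 52., 50., 45., 40., 35.])       # band levels (dummy)
    thresholds_db = torch.tensor([20., 25., 35., 50., 60., 70.])   # hearing loss (dummy)

    opt = torch.optim.Adam([gains_db], lr=1.0)
    for _ in range(200):
        opt.zero_grad()
        audibility = torch.sigmoid((speech_db + gains_db) - thresholds_db)
        loss = -audibility.mean() + 1e-3 * (gains_db ** 2).mean()  # penalise huge gains
        loss.backward()
        opt.step()

    print(gains_db.detach().round())   # fitted per-band gains in dB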
Poster 23 – Multi-task Estimation of Age and Cognitive Decline from Speech
Author: Yilin Pan
Co-authors: Heidi Christensen, Venkata Srikanth Nallanthighal (Centre for Language Studies (CLS), Radboud University Nijmegen), Daniel Blackburn (Sheffield Institute for Translational Neuroscience (SITraN), University of Sheffield), Aki Härmä (Philips Research, Eindhoven, The Netherlands)
Abstract: Speech is a common physiological signal that can be affected by both ageing and cognitive decline. Often the effect can be confounding, as would be the case for people at, e.g., very early stages of cognitive decline due to dementia. Despite this, the automatic predictions of age and cognitive decline based on cues found in the speech signal are generally treated as two separate tasks. In this paper, multi-task learning is applied for the joint estimation of age and the Mini-Mental State Examination (MMSE) criteria commonly used to assess cognitive decline.
To explore the relationship between age and MMSE, two neural network architectures are evaluated: a SincNet-based end-to-end architecture, and a system comprising a feature extractor followed by a shallow neural network. Both are trained with single-task or multi-task targets. For comparison, an SVM-based regressor is trained in a single-task setup. i-vector, x-vector and ComParE features are explored. Results are obtained on systems trained on the DementiaBank dataset and tested on an in-house dataset as well as the ADReSS dataset. The results show that both age and MMSE estimation are improved by applying multi-task learning, with state-of-the-art results achieved on the ADReSS acoustic-only task.
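The multi-task setup described above can be pictured as a shared feature extractor with one output head per target. The following is a generic PyTorch sketch with invented dimensions and dummy targets, not the paper's SincNet or x-vector systems.

    import torch
    import torch.nn as nn

    class MultiTaskRegressor(nn.Module):
        """Shared feature extractor with separate heads for age and MMSE.
        A generic sketch of the multi-task setup, not the paper's exact model."""

        def __init__(self, input_dim=512, hidden_dim=128):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            self.age_head = nn.Linear(hidden_dim, 1)
            self.mmse_head = nn.Linear(hidden_dim, 1)

        def forward(self, x):
            h = self.shared(x)
            return self.age_head(h), self.mmse_head(h)

    model = MultiTaskRegressor()
    x = torch.randn(8, 512)                                       # e.g. x-vectors for 8 recordings
    age_true, mmse_true = torch.randn(8, 1), torch.randn(8, 1)    # dummy targets
    age_pred, mmse_pred = model(x)
    loss = (nn.functional.mse_loss(age_pred, age_true)
            + nn.functional.mse_loss(mmse_pred, mmse_true))       # joint multi-task loss
    loss.backward()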
Poster 24 – Accented Data Augmentation for Speech Recognition using StarGAN Voice Conversion
Author: Mauro Nicolao
Co-authors: Samuel J. Broughton, Thomas Hain
Abstract: Automatic speech recognition (ASR) performance is often affected by speaker regional accents when a considerable number of similar examples have not been seen during acoustic model training. Large datasets of specific accented speech are uncommon and labour-intensive to create. In this paper, a method is introduced to create arbitrarily vast datasets of accented speech for training ASR acoustic models by using a state-of-the-art voice conversion (VC) component. VC is a technique aimed at translating a source speaker's voice into the style of a target speaker while preserving the linguistic information contained in the speech, and it can be used as a method for data augmentation. More specifically, artificial acoustic features of accented speech are generated using a StarGAN-VC framework that learns many-to-many speaker mappings while training on non-parallel data.
The performance of the proposed method is evaluated on differing training configurations of both non-accented (WSJ) and accented (L2-ARCTIC) datasets. Compatibility between the augmented features and the speech-extracted ones is ensured by the comparable word error rate (WER) of the augmented ASR system and its related baseline on non-accented speech. On the other hand, the proposed method improves performance by 8.9% on average when conditioning on specific speaker accents for accent-related recognition tasks.
Poster 25 – Attention Based Model for Segmental Pronunciation Error Detection
Author: Jose Lopez
Co-authors: Thomas Hain (University of Sheffield), Md Asif Jalal (University of Sheffield), Rosanna Milner (University of Sheffield)
Abstract: Automatic pronunciation assessment is usually done by estimating a score associated with the correctness of the produced phonemes. The Goodness of Pronunciation (GOP) measure is a popular example of this approach. Any pronunciation scoring method based on phoneme segments is dependent on detecting precise segment boundaries. These are obtained from an alignment process with canonical pronunciations and often contain significant errors. We propose an alternative that aims to reduce the dependency on phoneme boundaries. The method estimates the presence of an error in a segment containing more than one phoneme using a combination of sequence modelling and attention. We explore different configurations of this approach and compare it to a GOP baseline using data from young Dutch learners of English.
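For context, GOP-style scoring rates a phone segment by how strongly the acoustic model's posteriors favour the canonical phone. The sketch below is one common posterior-based formulation using dummy log posteriors; it is the baseline idea, not the proposed attention-based model.

    import numpy as np

    def gop_score(frame_log_posteriors, canonical_phone, start, end):
        """Posterior-based goodness-of-pronunciation score for one phone segment.

        frame_log_posteriors: (n_frames, n_phones) log posteriors from an acoustic model.
        Returns the average log ratio between the canonical phone and the best
        competing phone over the segment; values near 0 indicate good pronunciation."""
        seg = frame_log_posteriors[start:end]
        canonical = seg[:, canonical_phone]
        best = seg.max(axis=1)
        return float((canonical - best).mean())

    # Dummy log posteriors for a 10-frame segment over 40 phone classes.
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(10, 40))
    log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    print(gop_score(log_post, canonical_phone=5, start=0, end=10))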
Poster 26 – Multiword Expressions: The Road Ahead
Author: Harish Tayyar Madabushi
Co-authors: Marcos Garcia (Universidade de Santiago de Compostela, Spain), Carolina Scarton (University of Sheffield, UK), Marco Idiart (Federal University of Rio Grande do Sul, Brazil), Aline Villavicencio (University of Sheffield, UK)
Abstract: Despite the tremendous progress in natural language processing made possible by the introduction of pre-trained language models, multiple studies have shown the inability of these models to effectively detect and represent non-compositional multiword expressions (such as idioms). Given this, we present possible aspects of natural language understanding that can be improved by addressing this shortcoming, and the impact of such improvements on downstream tasks. For example, language models are currently sample-inefficient in learning representations of words, and methods of altering the representations of specific phrases have not been fully explored. Improvements in this regard could have a wide variety of implications in the form of improvements on downstream tasks such as machine translation, text summarisation, and the detection of satire, while simultaneously improving our understanding of and control over these models.
Poster 27 – A Pilot Study on Annotating Scenes in Narrative Text using SceneML
Author: Tarfah Alrashid
Co-authors: Rob Gaizauskas
Abstract: SceneML is a framework for annotating scenes in narratives, along with their attributes and relations [GA19]. It adopts the widely held view of scenes as narrative elements that exhibit continuity of time, location and character. Broadly, SceneML posits scenes as abstract discourse elements that comprise one or more scene description segments – contiguous sequences of sentences that are the textual realisation of the scene – and have associated with them a location, a time and a set of characters. A change in any of these three elements signals a change of scene. Additionally, scenes stand in narrative progression relations with other scenes, relations that indicate the temporal relations between scenes. In this paper we describe a first small-scale, multi-annotator pilot study on annotating selected SceneML elements in real narrative texts. Results show reasonable agreement on some but not all aspects of the annotation. Quantitative and qualitative analysis of the results suggest how the task definition and guidelines should be improved.
Poster 28 – Sources of Overfitting in Neural Abstractive Summarisation of Scraped News and Business Text
Author: Donovan Wright
Co-authors: Robert Gaizauskas and Mark Stevenson
Abstract: Text summarisation is the reduction of long documents into summaries containing only the most important and salient information from the source documents. Text summarisation is divided into two approaches: extractive summarisation and abstractive summarisation. Extractive summarisation is the older approach, which involves the statistical determination and extraction of the most significant elements of a document and their combination to form a summary. In the case of abstractive summarisation, a summary is generated by understanding the most important elements within the text and paraphrasing the source text in order to produce the summary.
Contemporary researchers in abstractive summarisation have tended to develop neural models focused on evaluation metrics, and have used datasets comprised of scraped news content. Journalistic content conforms to an inverted pyramid structure in which the most important elements of the news story appear at the top of a news column, with supporting content following. This inverted pyramid structure of journalistic writing results in layout bias; indeed, some researchers have exploited the layout bias in scraped news content. Because of this bias within the datasets, abstractive summarisation models developed using scraped news content are unlikely to generalise to non-news datasets. This proposed research focuses on noise generation within abstractive summarisation models that have been developed using unconstrained scraped online news content, and seeks to make quantitative comparisons with datasets originating from non-news content.
Poster 29 – Computational Analysis of Gendered Language in Kuwaiti Arabic
Author: Hesah Aldihan
Co-authors: Rob Gaizauskas and Susan Fitzmaurice
Abstract: The question of whether men and women speak differently has been explored extensively in the field of sociolinguistics, starting from Robin Lakoff's Language and Woman's Place (1973). However, with the development of the field of computational sociolinguistics, this question can now be explored using computational techniques that facilitate the collection and analysis of large datasets. As gender segregation is a defining factor in many aspects of life in Kuwait, this study aims at analysing the social variable of gender in the Kuwaiti Arabic (KA) dialect from a sociolinguistic perspective using computational techniques. This will be done by collecting and analysing KA textual data from social media to answer the question of whether men and women speak differently in Kuwait.
Poster 30 – Information Extraction and Entity Linkage in Historical Crime Records: OCR quality scoring and post-correction
Author: Callum Booth
Co-authors: Rob Gaizauskas
Abstract: This research seeks to develop a methodology for parsing crime reports within the OCR texts of nineteenth-century London newspapers in the British Library Newspapers (BLN) corpus. We seek to corroborate the existing information in the Digital Panopticon, a structured repository of criminal histories, by using newspaper reports of police court hearings to shed light on the criminal justice processes that took place before a case was tried in the Old Bailey, giving historians structured access to a valuable additional source of crime data. This work covers the methodology used to identify and rank high-quality OCR documents within the BLN corpus using genre-adjacent language modelling. We examine the landscape of the available data from a time and publication perspective, and introduce rules to reduce the dataset to a working corpus of relevant, high-quality documents. Finally, we explore means of mitigating the transcription noise introduced by physical source degradation, microfilm scan quality and varying OCR tooling, by exploring heuristic and neural OCR post-correction methods.