Annual Conference 2021 – PhD talks

Managing disagreement – controlling for noisy crowdsourced performance on an ambiguous annotation task

Author: Tom Green
Co-authors: Dr Diana Maynard, Dr Chenghua Lin

Session 1: 11.45am – 1.15pm on 15 June 2021

Abstract: 'Skills' are often used by recruiting professionals as an indicator of job fit; high similarity between a candidate's skills and those the job requires indicates good fit. Automatic solutions for job matching, which effectively filter the large volumes of data required for review, often use Skills as a measure of this.

However, there is a lack of academic consensus regarding the definition of a Skill, and a lack of public data for training and evaluating methods of Skill extraction from candidate CVs and job descriptions. An important first step in developing an automatic solution is to construct a labelled dataset of Skills and other relevant entities (such as Occupations and Domains) within job descriptions.

The development of this dataset is challenging because annotator disagreement is likely to be high on these fuzzy entities, despite using an iterative approach to annotation guideline and task development incorporating user feedback and a proficiency qualification task. Our solution is to use a learning model which avoids aggregating labels by leveraging disagreement between multiple noisy annotators. In this talk we describe the development of our dataset and investigate the 'ideal' level of accuracy required of crowdsourced workers for such a model to perform well.

Automatic Meeting Transcription for Highly Reverberant Environments

Author: Will Ravenscroft
Co-authors: Professor Thomas Hain, Dr Stefan Goetze

Session 2: 10.00am – 10.40am on 16 June 2021

Abstract: Significant progress has been made over the last decade in the area of multi-speaker far field automatic speech recognition (ASR). However, it still remains a challenging task in many scenarios. This is due to a number of factors including additive noise sources, overlapping speaker segments as well as high levels of reverberation.

Reverberation can negatively affect ASR systems in a number of ways. It has a temporal smearing effect on speech spectra. This can particularly be an issue in the lower frequency ranges of the voice where key information in the signal is obscured. The reflected paths of reverberation can also cause issues in speech enhancement front-ends such as beamformers that seek to target the direction of arrival from the speaker.

In this presentation, recent advances in far field speech recognition were discussed and novel solutions to some of these issues were presented. The proposed research aims to investigate the robustness of recent time domain neural networks used for both speech enhancement and speech separation as front-ends to far field ASR systems.

Integrating Text and Information from other Modalities for Social Media Analysis

Author: Danae Sanchez Villegas
Co-authors: Dr Nikolaos Aletras (University of Sheffield), Dr Saied Mokaram (Cambridge Assessment)

Session 1: 11.45am – 1.15pm on 15 June 2021

Abstract: Large-scale analysis of the language people use in social media has applications in a variety of fields such as linguistics, political science, journalism, and geography. Integrating contextual information from different modes into social media text is important to obtain a deeper understanding of the full context of the language in which it appears. This research will investigate the role of multimodality and the challenges that arise when integrating multimodal data to address computational social science tasks.

Commonsense Reasoning from Multimodal Data

Author: Peter Vickers
Co-authors: Loïc Barrault (University of Sheffield), Nikolaos Aletras (University of Sheffield), Emilio Monti (Amazon)

Session 1: 11.45am – 1.15pm on 15 June 2021

Abstract: Following the astonishing success of recent Machine Learning approaches to the NLP and Computer Vision domains, focus has increasingly turned to multimodal approaches. Such models, which interpret and reason over multiple input modalities, come closer to the human experience of interpreting, interacting, and communicating in a multimodal manner. Furthermore, leading theories of cognition stress the importance of inter-modal coordination.

Building on these theories and ML research, we study machine multimodal reasoning in an area long known to be problematic to AI: commonsense reasoning, the task of making ordinary, everyday judgements. Through the investigation of vision, language, and Knowledge Graphs (KG), we aim to quantify the contributions of these additional modalities to models performing commonsense reasoning tasks.

We explicitly respond to the following questions: Can existing models for commonsense reasoning, which operate on synthetic data, be transferred to real world datasets and achieve acceptable accuracy? Can external knowledge bases improve the accuracy of state-of-the-art models for multimodal AI Commonsense tasks? To what extent can AI Commonsense models be made interpretable, allowing for their process of inference to be examined?

Towards Personalised and Document-level Neural Machine Translation of Dialogue

Author: Sebastian Vincent
Co-authors: Dr Carolina Scarton, Dr Loïc Barrault

Session 2: 10.00am – 10.40am on 16 June 2021

Abstract: State-of-the-art (SOTA) neural machine translation (NMT) systems translate texts at sentence level, ignoring context: intra-textual information, like the previous sentence, and extra-textual information, like the gender of the speaker. Because of that, some sentences are translated incorrectly.

Personalised NMT (PersNMT) and document-level NMT (DocNMT) incorporate this information into the translation process. Both fields are relatively new and previous work within them is limited. Moreover, there are no readily available robust evaluation metrics for them, which makes it difficult to develop better systems, as well as track global progress and compare different methods.

Within this thesis, we focus on PersNMT and DocNMT for the domain of dialogue extracted from TV subtitles in five languages: English, Brazilian Portuguese, German, French and Polish. Three main challenges are addressed: (1) incorporating extra-textual information directly into NMT systems; (2) improving the machine translation of cohesion devices; (3) reliable evaluation for PersNMT and DocNMT.

Continuous End-to-End Streaming TTS as a Communications Aid for Individuals with Speaking Difficulties but Normal Mobility

Author: Hussein Yusufali
Co-authors: Prof Roger K Moore, Dr Stefan Goetze

Session 1: 11.45am – 1.15pm on 15 June 2021

Abstract: This project is aimed at individuals who have had trauma or surgery to their vocal apparatus and are unable to talk in a conventional manner, but who have full motor control of the rest of their body. Previous research in 'silent speech recognition' and 'direct speech synthesis' has used a wide variety of specialised/bespoke sensors to generate speech in real-time from residual articulatory movements.

However, such solutions are expensive (due to the requirement for specialised/bespoke equipment) and intrusive (due to the need to install the necessary sensors). As an alternative, it would be of great interest to investigate the potential for using a conventional keyboard as a readily-available and cheap alternative to specialised/bespoke sensors, ie a solution based on text-to-speech synthesis (TTS).

There are fundamental problems with using contemporary TTS as a communications aid:

  1. The conversion from typed input to speech output is non real-time and delayed.

  2. Even a trained touch-typist would be unable to enter text fast enough for a normal conversational speech rate.

  3. The output is typically non-personalised.

  4. It is not possible to control the prosody in real-time.

  5. It is not possible to control the affect in real-time.

These limitations mean that existing TTS users are unable to intervene/join-in a conversation, unable to keep up with the information rate of exchanges, unable to express themselves effectively (ie their individuality and their communicative intent) and suffer a loss of empathetic/social relations as a consequence.

This research aims to integrate a prediction mechanism into speech synthesis, by embedding conversational speaking rates into the system, enabling the user to further participate in social conversational settings. Furthermore, we aim to incorporate real-time control of affect, eg by analysing the acoustics of key-presses or facial expressions from a webcam, to change the emotional tone of the TTS.