Poster 1 How does the pre-training objective affect what large language models learn about linguistic properties?
Authors: Ahmed Alajrami and Nikolaos Aletras; University of Sheffield
Abstract: Several pre-training objectives, such as masked language modeling (MLM), have been proposed to pre-train language models (e.g. BERT) with the aim of learning better language representations. However, to the best of our knowledge, no previous work has investigated how different pre-training objectives affect what BERT learns about linguistic properties. We hypothesize that linguistically motivated objectives such as MLM should help BERT to acquire better linguistic knowledge compared to non-linguistically motivated objectives, for which the association between the input and the label to be predicted is not intuitive or is hard for humans to guess. To this end, we pre-train BERT with two linguistically motivated objectives and three non-linguistically motivated ones. We then probe for linguistic characteristics encoded in the representation of the resulting models. We find strong evidence that there are only small differences in probing performance between the representations learned by the two different types of objectives. These surprising results question the dominant narrative of linguistically informed pre-training.
Poster 2 Efficient Non-Autoregressive GAN Voice Conversion using VQWav2vec Features and Dynamic Convolution
Authors: Mingjie Chen, Yanghao Zhou, Heyan Huang, and Thomas Hain; University of Sheffield
Abstract: It was shown recently that a combination of ASR and TTS models yields highly competitive performance on standard voice conversion tasks such as the Voice Conversion Challenge 2020 (VCC2020). To obtain good performance, both models require pretraining on large amounts of data, thereby producing large models that are potentially inefficient in use. In this work we present a model that is significantly smaller and thereby faster in processing while obtaining equivalent performance. To achieve this, the proposed model, Dynamic-GAN-VC (DYGAN-VC), uses a non-autoregressive structure and makes use of vector quantised embeddings obtained from a VQWav2vec model. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring a small number of parameters. Objective and subjective evaluation was performed using the VCC2020 task, yielding MOS scores of up to 3.86, and character error rates as low as 4.3%. This was achieved with approximately half the number of model parameters, and up to 8 times faster decoding speed.
Poster 3 "An ache in the donkey" - teaching deep neural networks to understand idioms
Authors: Thomas Pickard, Aline Villavicencio, and Carolina Scarton; University of Sheffield
Abstract: Multi-word expressions occur frequently in language, and idioms in particular present significant challenges for language learners and computational linguists. While the meaning of a phrase like car thief can be understood from its constituent words, a one-armed bandit is completely opaque without further explanation. Deep neural network models, which underpin state-of-the-art natural language processing, construct their representations of phrases by combining individual words, and this means that they struggle to accurately handle idiomatic expressions. This limits their ability to perform all manner of natural language processing tasks; imagine translating "It's all Greek to me" into Greek one word at a time or captioning a picture of a zebra crossing while unfamiliar with those phrases. The aim of this project is to enhance the capabilities of neural network models to understand idiomatic expressions by finding ways to provide them with world knowledge, indications of which words go together in a phrase, or the ability to identify when the presence of an unexpected word might indicate an idiomatic usage. Understanding how humans handle unfamiliar expressions and learn them in second languages may also provide insight which we can apply to our models and make idioms less of a pain in the ass!
Poster 4 Robust Binaural Sound Localisation with Temporal Attention
Authors: Qi Hu (1, 2), Ning Ma (2), and Guy J. Brown (2); (1) Institute of Acoustics of Chinese Academy of Sciences, (2) University of Sheffield
Abstract: Despite there being clear evidence for attentional effects in biological spatial hearing, relatively few machine hearing systems exploit attention in binaural sound localisation. This paper addresses this issue by proposing a novel binaural machine hearing system with temporal attention for robust localisation of sound sources in noisy and reverberant conditions. The proposed system uses a convolutional neural network that operates directly on phase spectra of the left and right ears to extract noise-robust features which are similar to interaural phase difference. A temporal attention layer operates on top of these frame-level features by incorporating outputs of a temporal mask estimation module, which indicate speech dominance within each frame. The combined features are then exploited by fully connected layers, which map them to the corresponding source azimuth. Our evaluation shows that by training both the temporal mask estimation module and sound localisation module jointly in a multi-task learning manner, the proposed system is able to accurately estimate the azimuth of a sound source, even in challenging reverberant and noise conditions.
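The temporal attention step described in this abstract can be illustrated with a minimal sketch. It assumes the mask estimation module gives one speech-dominance value per frame and that the attention weights are simply those values normalised to sum to one. The function name, the normalisation and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def attention_pool(frame_features, frame_masks):
    """Weighted average of frame-level features.

    frame_features: list of per-frame feature vectors (lists of floats).
    frame_masks: one speech-dominance value in [0, 1] per frame (assumed).
    """
    total = sum(frame_masks)
    if total == 0:
        # No frame is speech-dominated: fall back to uniform weights.
        weights = [1.0 / len(frame_masks)] * len(frame_masks)
    else:
        weights = [m / total for m in frame_masks]
    dim = len(frame_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]


# Three frames with 2-D features; the middle frame is noise-dominated.
features = [[1.0, 0.0], [5.0, 5.0], [3.0, 2.0]]
masks = [1.0, 0.0, 1.0]
pooled = attention_pool(features, masks)
```

In this toy case the noise-dominated middle frame receives zero weight, so the pooled vector is the average of the first and last frames. In the system described above the pooled features would then be passed to fully connected layers that predict the azimuth.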
Poster 5 Improving Tokenisation by Alternative Treatment of Spaces
Authors: Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton and Aline Villavicencio; University of Sheffield
Abstract: Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity, and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hinder the ability of transformer-based models to handle complex words, and suggest that these problems are a result of allowing tokens to include spaces. We thus experiment with an alternative tokenisation approach where spaces are always treated as individual tokens. Specifically, we apply this modification to the BPE and Unigram algorithms. We find that our modified algorithms lead to improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks. Intrinsically, we find our modified algorithms give more morphologically correct tokenisations, in particular when handling prefixes. Given the results of our experiments, we advocate for always treating spaces as individual tokens as an improved tokenisation method.
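The difference between the two treatments of spaces can be sketched at the pre-tokenisation stage. The snippet below is a simplified illustration only: the function names and regular expressions are assumptions made for this example, and the subword learning step (BPE or Unigram) that would follow is not shown.

```python
import re


def pretokenise_spaces_separate(text):
    """Split text so that every space is its own token."""
    return re.findall(r" |[^ ]+", text)


def pretokenise_space_prefixed(text):
    """Conventional-style split: a space is attached to the following word."""
    return re.findall(r" ?[^ ]+", text)


sentence = "undo the undoing"
separate = pretokenise_spaces_separate(sentence)
prefixed = pretokenise_space_prefixed(sentence)
```

With spaces as individual tokens, the string "undo" looks the same to the subword learner whether it opens the sentence or follows a space. With space-prefixed pieces, the sentence-initial "undo" and the later " undoing" start with different characters, which is one way equivalent strings can end up represented differently.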
Poster 6 Towards Dialogue Systems with Social Intelligence
Authors: Tyler Loakman and Chenghua Lin; University of Sheffield
Abstract: Social intelligence is the ability of human beings to communicate in daily life and consists of two important competencies: social awareness and social facility. Social awareness is the ability to understand other people's feelings and social environment through social cognition (e.g., being empathetic), while social facility is the ability to interact smoothly and effectively based on social awareness. Generating socially intelligent responses like human beings in a dialogue system is one of the most challenging tasks in AI research. This may be realised as responding with adequate levels of humour or empathy, given specific conversational contexts and pragmatics. This project aims to build dialogue systems with Social Intelligence, with specific focus given to the humour dimension of social communication. It is with work of this ilk that AI may one day achieve human-level social competence, rather than pure omniscience. To achieve the aim, the project will: (1) explore linguistic and cognitive theories which can potentially provide heuristic grounds for the dialogue system design; (2) construct a high-quality labelled humour dialogue dataset for model training; and (3) develop novel computational models guided by linguistic and cognitive theories, which can generate responses with humour in a social communication setting.
Poster 7 Knowledge-Based Misinformation Detection
Authors: Dawid Grad and Xingyi Song; University of Sheffield
Abstract: The number of misinformation posts on social media has exhausted fact-checking resources and brought a new major challenge to government responses worldwide. An Artificial Intelligence-based automatic misinformation detection algorithm is required to minimise human fact-checking efforts and enable large-scale misinformation detection and analysis. This project firstly aims to develop a novel knowledge-based misinformation detection model, which incorporates a knowledge graph built from professional debunking articles. This approach will allow for a justification of the predictions made by the model. Another point that will be explored is the knowledge inference approach to extend the existing knowledge base in order to detect unseen misinformation that professional fact-checkers have not yet debunked. Finally, a multimodal approach to misinformation detection will be explored throughout the duration of the project, as many misinformation posts contain other modalities of information that could be essential for models to make correct predictions. The detection results will be evaluated on COVID-19 misinformation data created during the EU WeVerify project, and a qualitative user-based evaluation will also be carried out with journalist and fact-checker users of the InVID-WeVerify verification plugin.
Poster 8 Multi-Task Learning Model for Full Body Localization of Partially Occluded People in Video
Authors: Abdulaziz Alrashidi, Charith Abhayaratne, Peter Cudd and Yoshihiko Gotoh; University of Sheffield
Abstract: For autonomous video surveillance systems, people detection and distance calculation from the observing cameras are critical. Inverse perspective mapping (IPM) is often used to localise people in video sequences with respect to the world coordinates. Prior works have not addressed situations where the detected people are partially occluded and have different postures. Published datasets for occluded humans only give full-body labels for standing posture, which makes developing models for full-body estimates for diverse postures with occlusion even more difficult. This work proposes a post-processing Multi-Task Learning (MTL) approach which jointly trains posture classification and visibility ratio estimation for the detected occluded people. Evaluation results show that the proposed MTL outperforms the models trained using conventional approaches.
Poster 9 Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation
Authors: Jisi Zhang, Catalin Zorila, Rama Doddipatla, and Jon Barker; University of Sheffield
Abstract: In this work, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semi-supervised performance is comparable to a fully-supervised separation system trained using ten times the amount of supervised data.
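The MixIT criterion used to train the teacher can be sketched as a search over assignments of estimated sources to two reference mixtures. The sketch below is a toy version under stated assumptions: it uses mean squared error rather than a signal-to-noise-ratio loss, works on plain Python lists, and searches the assignments exhaustively. All names are hypothetical and it is not the authors' code.

```python
from itertools import product


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(mixture_1, mixture_2, estimates):
    """Return (loss, assignment) for the best source-to-mixture assignment.

    assignment[i] is 0 or 1: the mixture that estimated source i is added to.
    """
    n = len(mixture_1)
    best = None
    for assignment in product((0, 1), repeat=len(estimates)):
        remix = [[0.0] * n, [0.0] * n]
        for source, target in zip(estimates, assignment):
            for t in range(n):
                remix[target][t] += source[t]
        loss = mse(remix[0], mixture_1) + mse(remix[1], mixture_2)
        if best is None or loss < best[0]:
            best = (loss, assignment)
    return best


# Toy signals: two reference mixtures, each the sum of two sources.
s1, s2, s3, s4 = [1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]
mix_a = [a + b for a, b in zip(s1, s2)]
mix_b = [a + b for a, b in zip(s3, s4)]
loss, assignment = mixit_loss(mix_a, mix_b, [s1, s3, s2, s4])
```

Because the loss only checks that the remixed estimates reproduce the two input mixtures, no clean reference sources are needed, which is what makes the teacher stage unsupervised.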
Poster 10 Can computers really understand maths?
Authors: Jasivan Sivakumar and Nafise Sadat Moosavi; University of Sheffield
Abstract: In recent years, SLT models have become powerful tools for various applications from machine translation to smart speakers. However, one skill that they still struggle with is arithmetic reasoning. Powerful models like GPT can solve easier problems like "what is 12 times 4?" but fail when the numbers get too large. One possible reason is that they learn patterns and recall information instead of learning to multiply. It could also be that they don't understand numbers, e.g. they understand 1988 but not 1987. This research aims at exploring models' ability to learn maths and whether this can be helped by creating a purpose-made dataset. Such a dataset could include contrast questions where only the numbers or the text are altered, questions with different levels of difficulty (only using integers or extending to decimals and surds), multilinguality (learning maths language-agnostically) and zero/few-shot learning (can the model do maths given pre-requisite knowledge?).
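As a rough illustration of what a contrast question could look like, the sketch below builds a pair of questions that share their wording and differ only in the numbers. The template, the helper name and the chosen numbers are hypothetical and are not taken from the planned dataset.

```python
def make_contrast_pair(template, numbers, altered_numbers):
    """Return two questions that differ only in their numbers."""
    return template.format(*numbers), template.format(*altered_numbers)


template = "What is {} times {}?"
original, contrast = make_contrast_pair(template, (12, 4), (1987, 4))
answers = (12 * 4, 1987 * 4)
```

Comparing a model's accuracy on the two questions would indicate whether a failure is tied to the size of the numbers rather than to the wording.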
Poster 11 Simulation of Teacher-Learner Interaction in English Language Pronunciation Learning
Authors: Elaf Islam and Thomas Hain; University of Sheffield
Abstract: Second language acquisition (SLA) is a process in which a second language is learnt from a native speaker. Research related to SLA frequently draws on different disciplines, including psychology, education, and computer science. Acquiring an L2 is a challenging process, as learners tend to depend on the norms and categories of their L1 while learning to interpret and emit L2 sounds. Also, each learner has a distinct language learning strategy, that is, the particular actions or behaviours that the learner uses to learn the L2. The teacher plays an essential role in second language learning. Using a suitable teaching strategy to give corrective feedback improves the learner's language acquisition. Therefore, many researchers study the interactions between teacher and learner in SLA. To simulate the teacher role, computer-assisted language learning (CALL) was introduced. However, a CALL system does not provide a complete means of enhancing the learning process since it simulates the teacher role only. Simulating learner behaviours and their responses to different teaching strategies helps find the best teaching strategies. Some other research focuses on identifying the cognitive and perceptual operations that enable the learner's mind to act. This project aims to simulate teacher-learner interaction in English pronunciation learning. Each experiment simulates a different case in the pronunciation learning process.
Poster 12 Monitoring sleep disordered breathing of long-Covid patients at home using acoustic AI technology
Authors: Gerardo Roa Dabike, Ning Ma, and Guy J. Brown; University of Sheffield
Abstract: There is a recognised prevalence of obstructive sleep apnoea (OSA) in patients recovering from Covid-19, with long-Covid symptoms such as breathlessness and fatigue even two months after an infection. The team (the University of Sheffield, Sheffield Teaching Hospitals NHS Foundation Trust, and Passion for Life Healthcare Limited) is collaborating to collect acoustic and home sleep test data from patients who are suffering from long-Covid symptoms, in order to develop AI systems that will enable efficient screening of OSA at home using sound recordings. This will enable us to answer two research questions: 1) We have already collected a large corpus of OSA data from a previous study before Covid. How do acoustic data of OSA from non-Covid patients differ from those from long-Covid patients? 2) We have developed a robust machine learning technique for screening of OSA acoustically, but this has not been evaluated on patients in whom OSA is compounded with the symptoms of long-Covid. Is the system still robust for such patients? The AI software for screening of OSA at home would help earlier diagnosis of OSA in long-Covid patients, thereby relieving some of the strain on the waiting times in hospitals which have reduced capacity post-Covid.
Poster 13 Modelling Code-Switching for Automatic Speech Recognition
Authors: Olga Iakovenko and Thomas Hain; University of Sheffield
Abstract: Code-switching (CS) is the process of a speaker changing between languages within a single act of language production. This phenomenon can appear in communication between speakers when they are fluent in two or more languages. In Automatic Speech Recognition (ASR) this process can be modelled by using monolingual data in a single ASR system. Although this is feasible in theory, in practice models trained on highly resourced monolingual data perform worse than models trained on smaller amounts of CS data. This is because current ASR models are not able to capture the patterns of how exactly speech units are interchanged within an utterance with respect to their languages. One way to address this issue is to model code-switching production in spoken languages by using monolingual data and incorporating the linguistically informed Matrix Language Frame (MLF) and 4-M models. In the current work I analyse the data that is available for CS speech. There are multiple datasets of multilingual speech used for analysis and modelling of CS, but in reality not all of them represent CS. For this reason, the linguistic literature was analysed and the MLF and 4-M models were applied to verify whether those datasets represent code-switching between languages or a separate language in itself (a mixed language or creole).
Poster 14 Green NLP: Data and Resource Efficient Natural Language Processing
Authors: Miles Williams and Nikolaos Aletras; University of Sheffield
Abstract: Current state-of-the-art NLP technologies are underpinned by complex pre-trained neural network architectures that pose two main challenges: (1) they require large and expensive computational resources for training and inference; and (2) they need large amounts of data that usually entail expensive expert annotation for downstream tasks. Mass mainstream adoption of such systems with large carbon footprints would have serious environmental implications, making this technology unsustainable in the long term. It is estimated that training a large Transformer network with architecture search produces 5 times more CO2 emissions compared to driving a car for a lifetime. Moreover, the cost of developing and experimenting with large deep learning models in NLP introduces inherent inequalities between those that can afford it and those that cannot, both in the academic community and in industry. Given the importance of these challenges, this project has two main objectives: (1) to develop lightweight pre-trained language models that have substantially smaller computing resource requirements for training and inference; and (2) to investigate data efficiency when fine-tuning pre-trained transformer models.
Poster 15 Speaking to the Edge: IoT microcontrollers and Natural Language Control
Authors: Mary Hewitt and Hamish Cunningham; University of Sheffield
Abstract: The Internet of Things (IoT) has gained increasing attention over the past decade as a term to describe the connection of microcontrollers to the internet. Defined by tight constraints on power and cost, such microcontrollers are commonly hidden in special purpose devices (e.g. TVs, fridges), yet recently they have played an increasing role in user interfaces. Difficult privacy concerns surround the IoT, since the basic connection of the small devices to a network makes them vulnerable to cyber-security attacks. Additionally, the devices' computing resource constraints mean cloud services are commonly relied upon for outsourcing computation and storage. The project seeks to survey, prototype and quantify natural language control interfaces running on Internet of Things devices in the context of a smart home. Findings will relate to how much natural language control can be implemented on IoT devices, while privacy concerns will be explored through research into edge-based computation and open source local infrastructures. Multiple sensor technologies have potential to provide enhanced control interfaces, and we hope to implement some. Comparisons with state-of-the-art benchmarks will be used for evaluation of speech recognition components, and we hope to conduct some human evaluation of the system.
Poster 16 Speech Analysis and Training Methods for Transgender Speech
Authors: Sebastian Ellis, Stefan Goetze, and Heidi Christensen; University of Sheffield
Abstract: An issue present in the transgender community is that of voice-based dysphoria. Individuals looking to alleviate such discomfort benefit from regular speech therapy sessions provided by a trained speech therapist; however, these sessions are typically not regular or long enough. Through discussions with both speech therapists and members of the transgender community, this project aims to explore how well the current needs of this community are being met, whether and which voice analysis and training methods are currently being used, and how these methods could be adapted for a digital system for audio & acoustic training. Directions of interest include investigating signal analysis metrics & features, and the derivation of user feedback, from existing methods in related fields such as atypical speech; investigating existing gender classification methods; and eventually developing a system which is capable of supporting higher-frequency professional quality voice training for members of the transgender community.
Poster 17 Learning Idiom Representations using BERTRAM
Authors: Dylan Phelps; University of Sheffield
Abstract: This paper describes our system for SemEval-2022 Task 2 Multilingual Idiomaticity Detection and Sentence Embedding sub-task B. We modify a standard BERT sentence transformer by adding embeddings for each idiom, which are created using BERTRAM and a small number of contexts. We show that this technique increases the quality of idiom representations and leads to better performance on the task. We also perform analysis on our final results and show that the quality of the produced idiom embeddings is highly sensitive to the quality of the input contexts.
Poster 18 Improved simulation of realistically-spatialised simultaneous speech using multi-camera analysis in the CHiME-5 dataset
Authors: Jack Deadman and Jon Barker; University of Sheffield
Abstract: Acoustic room simulation is an essential tool in developing distant microphone ASR and speech separation. However, most commonly used simulated datasets adopt uninformed and potentially unrealistic speaker location distributions. In earlier work, we analysed a 50-hour audio-visual dataset of multiparty recordings made in real homes to estimate typical angular separations between speakers. We now refine and extend this work using a multi-camera analysis to estimate full 2-D speaker location distributions. Results show that commonly used simulated datasets use unrealistically large angular separations and unrealistic ranges for the target-to-interferer distance ratios. We generate more realistically distributed datasets and use them to re-evaluate state-of-the-art speech separation and ASR approaches. Our results suggest that imposing realistic angular separation distributions makes datasets more challenging. However, the pattern when using realistic distance ratios is more complicated and depends on room size.
Poster 19 Factorisation of Speech Embeddings
Authors: Amit Meghanani and Thomas Hain; University of Sheffield
Abstract: Speech signals are complex and are the result of the interaction of various factors (content, speaker, channels, etc.). The learning of these factors in embedding space can allow much richer factorizations. Instead of separating the speaker and content in the signal domain, the speech can be factorized in the embedding domain, as it offers various advantages over the signal domain. The first advantage is that the embedding space is fixed and low-dimensional compared to the variable length of speech in the signal domain. Another advantage of separating (de-mixing) speaker/content attributes in embedding space is that the result can be directly used in downstream tasks, which makes the training process more efficient. In this project, the objective is to learn mappings that allow relationships between separated factors to be encoded. It is however important that such relationship encoding remains simple in structure. Formulating such relationships can serve as a new way to express objective functions and will allow us to extract richer information about the relationship between these factors. Such relationships may be linear or in hierarchical form, or may be encoded in terms of a graph of a formal group of mathematical operations. One can imagine both supervised and unsupervised learning approaches.
Poster 20 A Model For Assessor Bias In Automatic Pronunciation Assessment
Authors: Jose Antonio Lopez Saenz and Thomas Hain; University of Sheffield
Abstract: In pronunciation assessment, the assessor's perception is influenced by a particular pronunciation reference. This assessor may hold a bias towards certain variations in pronunciation which may not affect communication, yet could be penalized during the assessment. This work proposes a model for pronunciation assessment as the combination of an assessor-independent (A) and an assessor-specific (B) component. The latter could be interpreted as the assessor bias. The resulting assessment function was implemented as a dual model trained to detect mispronounced speech segments. The models incorporate Long Short-Term Memory (LSTM) and saliency region selection using attention. An experiment was performed using recordings from young Dutch learners of English as a second language, which were annotated for mispronunciation by three trained phoneticians (a1, a2, a3). The combined models were able to detect mispronunciations given the assessor identity, achieving F1 scores of 0.77, 0.68 and 0.86 for a1, a2, a3 respectively on the Train set and 0.66, 0.53 and 0.81 on the Test set. Additionally, the attention weights of the B model were able to illustrate disagreements between assessors related to the bias.
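For readers unfamiliar with the F1 scores reported above, the sketch below shows how such a score is computed from binary mispronunciation labels. The label sequences are invented for illustration and do not come from the experiment.

```python
def f1_score(reference, predicted):
    """F1 for binary labels, where 1 marks a mispronounced segment."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels from one assessor and hypothetical model output.
assessor_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_output = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_score(assessor_labels, model_output)
```

Because each assessor supplies a separate set of reference labels, the same model output can yield a different F1 score per assessor, which is why the abstract reports one score for each of a1, a2 and a3.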
Poster 21 Exemplar-informed speech recognition
Authors: Rhiannon Mogridge and Anton Ragni; University of Sheffield
Abstract: Current state-of-the-art ASR systems are typically data-driven, relying on large amounts of data to train a parametrised model. In contrast, some historic ASR techniques make use of specific examples - or exemplars - directly. There is evidence that humans make use of both approaches, suggesting that there may be benefits to both. Modern ASR systems are rarely exemplar-based, likely because such approaches are not usually scalable. An alternative option is to use an exemplar-informed approach, in which a modern, parametric, data-driven model is given access to a set of exemplars. This allows the use of exemplars while retaining scalability. This work examines how exemplars can be used to augment a parametric model, including options for exemplar selection for different scenarios and exploration of different scenarios in which such a model might be useful.
Poster 22 Understanding and Disentangling Conversational Participants in a Multi-talker Environment
Authors: Jason Clarke, Stefan Goetze, and Yoshi Gotoh; University of Sheffield
Abstract: This project focuses on disentangling speech in a multi-talker environment by exploiting the multi-modality aspect of modern multi-purpose datasets such as Ego4D. Recent advances in automatic speech recognition (ASR) systems have shown significant improvements in word error rate (WER) over the past decade, with state of the art systems achieving scores as low as 1.4% on the LibriSpeech benchmark. Despite this, even highly sophisticated systems struggle with the more challenging task of ASR in a multi-talker context. One of the primary reasons for this is the prevalence of overlapping speech which is inherent to real life spontaneous conversations, such as heated political debates or casual conversations amongst friends. With the potential to improve the egocentric social understanding of virtual assistants and social robots, audio-visual data is promising since it contains information pertaining to whether certain speech is relevant to an active conversation or not. This is in addition to assisting with the process of diarization generally, thus mitigating some of the challenges present in the described problem. Methods to supplement conventional systems with pertinent features extracted from the visual data such as gaze and audio-visual localisation are to be explored as part of this project.
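Word error rate, the metric quoted above, is the word-level edit distance between a reference transcript and a recogniser's hypothesis, divided by the number of reference words. The sketch below is a minimal illustration with invented sentences. It assumes whitespace tokenisation and a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Here one of six reference words is missing from the hypothesis, so the rate is one sixth. Overlapping speech tends to raise this figure because words from several talkers are inserted, deleted or confused within the same stretch of audio.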
Poster 23 The Storyline of Individuals
Authors: Josh Smith, Yoshi Gotoh, and Stefan Goetze; University of Sheffield
Abstract: Within the world of audio and video analysis, strides have been made over the past few years. To create scene or story descriptions for media, audio/video analysis can be combined with natural language generation to automatically create a synopsis of a character or plotlines. This project aims to take the next step in combined audio-visual analysis, focusing on character identification and re-identification specifically in the field of professionally produced media such as TV shows and movies. Challenges such as the audio-visual speaker recognition tasks run by NIST (National Institute of Standards and Technology) have shown how visual and acoustical analysis can each benefit from the other domain when identifying what is taking place on screen. Audio identification can take place when an individual is obstructed, not visually on screen or in situations such as costume changes and other shifts in appearance. Meanwhile, identification based on visual data can work independently of loud background noise, overlapping speech or silent characters. By using information extracted from combining video and audio processing, the project aims to create scene descriptions which can be used in character synopses or as further input for state-of-the-art meta-data enrichment, aiding in localisation and dubbing processes.
Poster 24 MetricGAN+/-: Increasing Robustness of Noise Reduction on Unseen Data
Authors: George Close, Thomas Hain, and Stefan Goetze; University of Sheffield
Abstract: Training of speech enhancement systems often does not incorporate knowledge of human perception and thus can lead to unnatural sounding results. Incorporating psychoacoustically motivated speech perception metrics as part of model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system) which introduces an additional network - a "de-generator" which attempts to improve the robustness of the prediction network (and by extension of the generator) by ensuring observation of a wider range of metric scores in training. Experimental results on the VoiceBank-DEMAND dataset show relative improvement in PESQ score of 3.8% (3.05 vs 3.22 PESQ score), as well as better generalisation to unseen noise and speech.
Poster 25 Towards the Next Generation of Conversational Dialogue Modelling for Embodied Systems
Authors: Teresa Rodriguez and Roger K. Moore; University of Sheffield
Abstract: Nowadays, voice-controlled applications are used for casual information retrieval or as smart home devices, and users cherish how fast and easy it is to access information. However, these conversational agents behave like question & answer dialogue systems, only capable of turn-taking exchanges. Additionally, we have become accustomed to two practices: (i) using wake words like "Hey Siri" to initiate a dialogue and (ii) waiting for a short period of time until the voice-enabled device replies back. These adaptations are clear indicators that existing human-machine interactions are far from "conversational". Breaking from this rigid dialogue scheme may be possible by processing dialogue incrementally (word-by-word instead of at the end of the user's utterance), thus allowing for real-time processing of dialogue. Moreover, coupling an incremental dialogue architecture with an effort-based model may help to achieve more interactive behaviours. This is because we display regulatory behaviour in everyday speech and we adjust that communicative effort depending on who we are speaking to. Finally, multimodal embodied systems (such as situated social robots) are capable of using gestures and prosody to enrich a conversation and convey more information. Therefore, these systems prove to be an interesting experimental platform for conversational dialogue modelling.
Poster 26 Is Honesty the Best Policy?
Authors: Guanyu Huang and Roger K. Moore; University of Sheffield
Abstract: Spoken-language-based interactions between human beings and artificial devices have become very popular in recent years. To optimize the interaction between humans and social robots, a strong tendency in research is to make human-robot interaction (HRI) resemble human-human interaction (HHI). It is hoped that robots designed with anthropomorphic appearance and cognitive behaviours can enable humans to interact with social robots in ways similar to how they interact with other humans, and even to develop social bonds. However, researchers have also raised concerns about HRI that simulates HHI. This study aims to explore the approach of designing a social robot to be more consistent inside and out by aligning its visual, auditory, and behavioural cues with its capabilities, and to test users' perception of the social robot and its effectiveness in social chats as well as in task-oriented interaction. We hope to show that synchronized modalities which match a social robot's true capabilities can avoid the uncanny valley effect and achieve effective interaction. It is hoped that the results of this study will show that social robots can have appropriate affordances without being as human-like as possible. Instead of resembling humans perfectly, it is better to design, build and deploy social robots by striving for truth.
Poster 27 Summarising Scientific Articles
Authors: Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton; University of Sheffield
Abstract: Scientific articles contain knowledge that is essential to preserving and advancing our understanding of all scientific disciplines. In the current age of information, such articles are published at an ever-increasing rate, resulting in a vast and rapidly growing pool of technical knowledge. For those without a scientific background, this knowledge is obscured behind unintelligible jargon and unfamiliar formalised structure. Even for those possessing the domain expertise required to fully comprehend the contents of such texts, factors such as their size and quantity can prove a formidable barrier to the location and retrieval of desired information. The task of summarisation specialises in the retrieval and contextualisation of salient information. However, scientific articles pose a number of challenges to current summarisation models, with factors such as their prohibitive length and technical language often requiring special consideration. Our work focuses on the development of summarisation models which perform well on such articles, making their content easily accessible to a range of audiences. Specifically, we are currently looking at the parallel tasks of scientific summarisation and lay summarisation, with the future aim of developing models able to personalise their output to a given audience.
Poster 28 Point-of-Interest Type Prediction using Text and Images
Authors: Danae Sánchez Villegas and Nikolaos Aletras; University of Sheffield
Abstract: Point-of-interest (POI) type prediction is the task of inferring the type of a place from where a social media post was shared. Inferring a POI's type is useful for studies in computational social science including sociolinguistics, geosemiotics, and cultural geography, and has applications in geosocial networking technologies such as recommendation and visualization systems. Prior efforts in POI type prediction focus solely on text, without taking visual information into account. However, in reality, the variety of modalities, as well as their semiotic relationships with one another, shape communication and interactions in social media. This paper presents a study on POI type prediction using multimodal information from text and images available at posting time. For that purpose, we enrich a currently available data set for POI type prediction with the images that accompany the text messages. Our proposed method extracts relevant information from each modality to effectively capture interactions between text and image, achieving a macro F1 of 47.21 across eight categories and significantly outperforming the state-of-the-art text-only method for POI type prediction. Finally, we provide a detailed analysis to shed light on cross-modal interactions and the limitations of our best performing model.
Poster 29 Towards Prosodic Speech-to-Speech Translation
Authors: Kyle Reed, Thomas Hain, and Anton Ragni; University of Sheffield
Abstract: Speech-to-speech translation is the task of translating speech between a source and target language. In the absence of direct paired samples to model, the historically dominant approach has been to use a cascade of speech recognition, machine translation and text-to-speech synthesis models to transcribe the source utterance, translate it and synthesise a target utterance. Recent approaches have augmented existing machine translation and speech-to-text translation datasets to construct and learn from direct speech-to-speech pairs. However, the importance of prosodic characteristics of source utterances has not been explicitly addressed by direct approaches. This work explores the potential use of source utterance prosodic characteristics to improve the output of a speech-to-speech translation system. The literature indicates that source prosody should be reflected in the target utterance and may affect the lexical content of the target translation. An exploratory experiment reveals that there is significant translation ambiguity within a large-scale S2T corpus and introduces a method to quantify this ambiguity. Approaches are proposed to analyse the source of translation ambiguity, the extent to which source utterance prosody can resolve it, and how source prosody can otherwise be translated to a target utterance.
Poster 30 Argument Parsing: What granularity of input is best?
Authors: Jonathan Clayton¹, Marco Damonte², and Rob Gaizauskas¹; ¹University of Sheffield, ²Amazon
Abstract: Argument Structure Parsing (ASP) is a popular subtask of argument mining; many authors have proposed new architectures, but few have explored these systematically. Two important questions that have not been directly addressed are: (1) what is the appropriate level of granularity at which to segment texts prior to ASP? (2) for models relying on EDU segmentation, how much does ASP performance degrade when moving from gold standard to automatic EDU segmentation? We find that what we refer to as an essay-level input granularity (taking into account sentence-level representations of all relevant discourse units) is probably best. We additionally find that models relying on discourse segmentation are adversely affected by propagation of errors from the segmentation model.
Poster 31 Long Context Speech and Language Modelling
Authors: Robert Flynn and Anton Ragni; University of Sheffield
Abstract: The majority of current automatic speech recognition systems are designed to model speech as a conditionally independent set of utterances. However, in many real-world scenarios these utterances are highly interrelated, and previously processed data can be used to form a prior on which to condition future transcription. While previous work has investigated the incorporation of context from surrounding utterances, doing so effectively remains an open problem. Indeed, many of the current architectures used for modelling sequential data are limited in their ability to utilise very long-distance temporal dependencies (Khandelwal et al., 2018). This project will investigate approaches towards modelling long contexts and memory for applications in automatic speech recognition; real-world use cases include meetings and other long-format dialogue scenarios. Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284-294, Melbourne, Australia. Association for Computational Linguistics.
Poster 32 Understanding algorithmic bias in automatic speech recognition
Authors: Nina Markl; UKRI CDT for NLP, University of Edinburgh
Abstract: In this poster, I present my PhD research about algorithmic bias in speech and language technologies, in particular automatic speech recognition. I am interested in understanding the origins and consequences of algorithmic bias in this context - drawing on theories of sociolinguistic variation, language ideology and language policy and perspectives from human-computer interaction, science and technology studies and bias and fairness in machine learning. I present some work due to be published at FAccT 2022 showing that commercial British English ASR systems perform best for highly prestigious varieties of British English (Southern British English) and worse for varieties from the North of England, Northern Ireland, and Wales. In this context, I highlight the need for context-sensitive evaluation methods (based on work published at the HCI + NLP workshop 2021) and more thoughtful approaches to dataset compilation (EDI in LT workshop 2022 and LREC 2022). I also describe some future directions.
Poster 33 Adult to child voice conversion
Authors: Protima Nomo Sudro, Thomas Hain, and Anton Ragni; University of Sheffield
Abstract: Child actors play an important role in movies, cartoons, and dubbing in foreign languages. Although the human voice is the highest standard with respect to translated audio quality, dubbing is an expensive and time-consuming mode of audio translation due to its complexity and the many professionals it depends on. With advances in technology, dubbing has become easier and more cost-effective. In this direction, the present work aims to develop an adult-to-child voice conversion (VC) system. A VC system takes the speech from the source speaker (adult) and generates output speech that sounds like a target speaker (child). While performing the conversion, a VC system maintains the same linguistic content between the adult and child speakers. In the VC literature, VC techniques are broadly classified as parallel and non-parallel methods based on the use of training data. In parallel VC techniques, paired speech data is required from both the source and target speakers, whereas in non-parallel VC techniques, unpaired speech data is used to train the system. For adult-to-child VC in a limited training data scenario, parallel VC techniques are more suitable. Therefore, we initially focus on VC techniques that perform efficiently on paired speech data.