Our PhD students
Cohort 1 (2019)
Nearly everyone has a CV (or at least a career and/or education history).
Nearly every non-sole trader business recruits people.
Most people want to progress from the job they are in – over 60% of people are always at least passively looking for a job.
CV-to-Job Spec matching technologies are generally underperforming right now, but there is plenty of data available for machine learning solutions to improve performance. This would improve the quality and objectivity of recruitment, reduce the number of irrelevant applications recruiters receive, reduce the likelihood that a good candidate is overlooked, save time and lead to faster filling of open positions, and enhance the candidate experience.
One challenge is that job seekers are reluctant to provide structured data about themselves, and prefer to 'throw' a CV over to the recruiter and then 'throw' their CV over to the next recruiter at the next company and so on. Because of this, the data from a job seeker is often not in structured form. Similarly, a lot of the data the recruiter gives about the job is also unstructured – provided in a number of paragraphs of text. Yet all the key information is available to match someone. The job description specifies key skills in the text, the ideal candidate, the soft skills, the must-haves, the skills that are beneficial but not essential, and so on.
This project will involve researching a 'matching solution' that takes all of the above into consideration – the unstructured data needs to be parsed using Natural Language Processing (NLP) techniques, and the key components extracted and fed into a machine learning algorithm to determine how close a match someone is for a job and what the key differences are. This could help people see what skills they need to acquire in order to progress (their 'skills delta'), and that data could eventually feed into a training system to suggest courses (college, university or other).
A wide range of Information Extraction methods (including knowledge-engineering and supervised learning methods) will be explored to tackle parsing of unstructured CV and job specification data, and both shallow-learning and deep-learning methods will be considered for the 'matching' component of the matching solution.
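As a rough illustration of the 'skills delta' idea, the sketch below assumes the skills have already been extracted from the CV and the job specification, and uses a simple set overlap as the closeness score. The function name, the input lists and the scoring choice are all hypothetical; the project itself proposes learned matching rather than this keyword-level comparison.

```python
def skills_delta(cv_skills, must_have, nice_to_have=()):
    """Compare a candidate's skills with a job spec (all inputs are hypothetical)."""
    cv = {s.strip().lower() for s in cv_skills}
    required = {s.strip().lower() for s in must_have}
    optional = {s.strip().lower() for s in nice_to_have}

    wanted = required | optional
    union = cv | wanted
    # Jaccard overlap as a crude closeness score between 0 and 1.
    score = len(cv & wanted) / len(union) if union else 0.0

    return {
        "score": score,
        "missing_must_have": sorted(required - cv),
        "missing_nice_to_have": sorted(optional - cv),
    }


result = skills_delta(
    cv_skills=["Python", "SQL", "Excel"],
    must_have=["python", "machine learning"],
    nice_to_have=["sql", "docker"],
)
print(result)
```

A keyword comparison like this cannot tell whether a role is a natural progression or a side step, which is one reason the questions that follow go beyond simple matching.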
Investigation of some or all of the following questions will drive the methodology, by providing deeper insights that go beyond simple matching of keywords.
Is the job they are applying for a natural progression from their current role, a side step, or a bit left-field?
How successful were similar people in applying for similar jobs and how long did they stay in those jobs for?
How likely is someone to be hired for a particular job (based on empirical data)?
Does someone's location help or hinder their application? What about their school, or their previous places of work?
Is the job seeker a 'job hopper' or a contractor?
The field of automatic speech recognition (ASR) has seen significant advancements in recent years due to a combination of more advanced deep neural network models and the computational power necessary to train them. One area that still presents a significant challenge is the transcription of meetings involving multiple speakers in real-world settings, which are often noisy and reverberant, and where the number of speakers and their locations change. Not only is the conversational environment challenging, but recording voices in these settings also poses problems.
Multiple microphones or microphone arrays are commonly used, but these may be situated a fair distance away from the speakers, which in turn leads to a degraded signal-to-noise ratio. For this reason meeting transcription tends to require additional stages of processing, many of which are not required for simpler ASR tasks. One of these stages is the preprocessing of audio data, often involving a combination of signal processing and machine learning techniques to denoise and beamform the audio from microphone arrays. As no device is capturing audio from one speaker alone, speaker diarisation needs to be included in the system, which in turn makes it possible to adapt ASR models to the specific acoustic conditions and to the speaker in their present location.
Speech in meetings has additional complexities – it is conversational and thereby language complexity is increased. This also has implications for the acoustics: speakers talk concurrently and give backchannel signals. The speaking style is much more varied and is adjusted according to the conversational situation. It is also influenced by the conversation partners and the topic and location of the meeting. All of these qualities require a meeting transcription system to be highly adaptive and flexible.
The main focus of this project is on improving ASR technology for doctor-patient meetings in a healthcare setting. The challenges for these types of meetings are unique. For example, the environment where recordings take place may have hard surfaces leading to very reverberant speech signals. One can also not assume that recording microphones are in fixed places or even well located in relation to the speakers. While moving speakers occur in other meeting scenarios, one can expect this to be present more often here. The impulse response of a particular acoustic environment not only changes from room to room but also varies depending on where within the room speakers are in relation to microphones.
Beyond the acoustic challenges, large amounts of data for doctors' speech might be available, whereas data from individual patients will be sparse. This poses a challenge for commonly used adaptation techniques in ASR, in particular if speakers with non-standard speech are present – something that can be expected in health settings. Overall, the type and content of language is expected to be very specific to the healthcare domain. All of these unique attributes have an impact on ASR performance and can severely reduce the accuracy of outputs.
The aim for this project will be to improve upon existing ASR technology by examining some of the areas outlined. Particular areas of interest for advancing this technology will be on improving microphone array recording techniques and adaptation methods that address the specific properties found in real recordings. It is expected that any techniques developed will have to be highly adaptive to the recording settings. Further exploration will include investigation into the suitability of standard ASR metrics for assessing performance for users.
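Word error rate (WER) is probably the most widely used of the standard ASR metrics referred to above: the number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the number of reference words. The sketch below is a minimal implementation for illustration only; the example sentences are invented, and the project description does not say which metrics will actually be examined.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the patient has a mild fever",
                      "the patient has mild fever")
print(round(wer, 3))
```

WER weights every word equally, so a dropped article and a misrecognised drug name cost the same, which hints at why its suitability for assessing performance for users is worth investigating.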
The need to look after one's mental health and wellbeing is ever increasing. Recent events such as the Covid-19 pandemic have resulted in almost everyone being subjected to a lockdown, with many people reporting increased feelings of anxiety, depression, and isolation. Speech has been found to be a key method of identifying and examining a range of different mental health concerns, but as of yet little research has been conducted on how to integrate such knowledge into everyday life. A lacuna exists between the number of mental health and wellbeing apps and virtual 'chatbots' available, and the speech features that have the potential to aid in the diagnosis and monitoring of such conditions.
This project investigates the efficacy of a spoken language virtual agent, designed to prompt users to record themselves talking about how they're feeling over an extended period of time. This serves two purposes. Firstly, the virtual agent acts as an interactive tool aiding in alleviating feelings of loneliness and allowing people an ear for talking therapy, without the need for another human to converse with. Secondly, machine learning techniques will track features of the user's speech, such as phoneme rate and duration and acoustic features, in order to monitor the state of the user's mental wellbeing over a set amount of time. This provides a snapshot of a user's mental peaks and troughs, and in turn helps to train the AI to detect such changes.
This project will address multiple research questions, such as how people from different age groups would interact with such a virtual agent, and how their needs would differ in regard to what they are hoping to get out of such an interaction. Secondly, this project serves as a study into how much time a person would be required to interact with such a virtual agent in order to collect enough data to properly analyse and further train the AI with. Thirdly, this project offers an opportunity to investigate how such virtual agents could be used as a tool to combat loneliness across different age groups, attempting to combat early signs of mental health concerns brought about by feelings of isolation.
Many Natural Language Processing tasks are being successfully completed by sophisticated models which learn complex representations from gigantic text or task-specific datasets. However, a subset of NLP tasks which involve simple human reasoning remain intractable. We hypothesise that this is because the current class of state-of-the-art models are not adequate for learning either relational or common-sense reasoning problems.
We believe there is a two-fold deficiency. Firstly, the structure of the models is not adequate for the task of reasoning, as they fail to impose sensible priors on reasoning such as pairwise commutativity or feature order invariance. Secondly, the data presented to these models is often insufficient for them to learn the disentangled, grounded functions which represent common-sense reasoning. Furthermore, a high degree of bias has been found in many existing datasets designed for reasoning tasks, indicating that current models may be predicting correct answers from unintended hints within the questions.
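To make the two priors named above concrete, the toy sketch below scores a set of objects by summing a symmetric function over all unordered pairs. Because the pair function gives the same value when its arguments are swapped, and the sum runs over pairs rather than positions, the result does not depend on the order in which objects are presented. The dot-product pair function and the example vectors are placeholders chosen for illustration, not the models this project proposes.

```python
from itertools import combinations


def pair_score(a, b):
    # Hypothetical symmetric relation function: pair_score(a, b) == pair_score(b, a).
    return sum(x * y for x, y in zip(a, b))


def relational_summary(objects):
    # Summing over unordered pairs makes the result independent of object order.
    return sum(pair_score(a, b) for a, b in combinations(objects, 2))


objects = [(1, 0), (2, 1), (0, 3)]
shuffled = [objects[2], objects[0], objects[1]]

print(relational_summary(objects), relational_summary(shuffled))
```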
For the purposes of this project, we would define common-sense reasoning as the ability to:
Reason about relations between objects, both in single and multiple-step problems.
Achieve near human-level performance in 'common-sense' reasoning problems.
Ground task understanding with reference to the scene in which a question is asked, allowing for informed inference.
We aim to target real-world multi- and mono-modal relational reasoning tasks with models designed to capture the properties of relational reasoning. From this, we will investigate the possibility of extracting structured knowledge such as relational graphs or linguistic information for accuracy improvement on commonsense inference tasks such as VQA.
Investigate novel ways of allowing AI to learn object properties, such as using reinforcement learning in a physics engine to learn affordances (the properties of an object which define what actions can be performed on it).
Construct our own dataset to aid with the task of scene-grounded commonsense reasoning, which may include textual, visual, video, and physics information for the purpose of training a model which can learn object affordances and representations.
Can existing models for reasoning, which operate on synthetic data, be transferred to real world datasets and achieve acceptable accuracy?
Can features or models extracted from explicit reasoning tasks improve the accuracy of state-of-the-art models for multimodal commonsense tasks by augmenting the knowledge base to draw from?
What tasks and corresponding datasets would enable models to learn efficient, disentangled commonsense properties and relations which humans intuitively obtain? Are we able to create a useful and tractable task and dataset which advances the field?
Danae Sánchez Villegas
Multimodal communication is prominent on social platforms such as Twitter, Instagram, and Flickr. Every second, people express their views, opinions, feelings, emotions, and sentiments through different combinations of text, images, and videos. This provides an opportunity to create better language models that can exploit multimodal information. The purpose of this project is to understand the language from a multimodal perspective by modeling intra-modal and cross-modal interactions. The modes of our interest are text, visual, and acoustic.
The integration of visual and/or acoustic features into NLP systems has recently proved to be beneficial in several tasks such as sentiment analysis (Rahman, Wasifur, et al., 2019; Kumar, Akshi, et al., 2019; Tsai, Yao-Hung Hubert, et al., 2019), sarcasm detection (Cai, Yitao, et al., 2019; Castro, Santiago, et al., 2019), summarisation (Tiwari, Akanksha, et al., 2018), paraphrasing (Chu, Chenhui, et al., 2018), and crisis events classification (Abavisani, Mahdi, et al., 2020).
In this Ph.D. project, we will study and develop multimodal fusion methods and their integration in different stages of model architectures, such as neural networks and transformer models (Devlin, Jacob, et al., 2018). Moreover, we aim to evaluate these methods’ performance on sentiment analysis, social media analysis tasks such as abusive behavior analysis (Breitfeller, Luke, et al., 2019; Chowdhury, Arijit Ghosh, et al., 2019), and other NLP tasks.
Abavisani, Mahdi, et al. "Multimodal Categorization of Crisis Events in Social Media." arXiv preprint arXiv:2004.04917 (2020).
Breitfeller, Luke, et al. "Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Cai, Yitao, Huiyu Cai, and Xiaojun Wan. "Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Castro, Santiago, et al. "Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)." arXiv preprint arXiv:1906.01815 (2019).
Rahman, Wasifur, et al. "M-BERT: Injecting Multimodal Information in the BERT Structure." arXiv preprint arXiv:1908.05787 (2019).
Chowdhury, Arijit Ghosh, et al. "#YouToo? Detection of personal recollections of sexual harassment on social media." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Chu, Chenhui, Mayu Otani, and Yuta Nakashima. "iParaphrasing: Extracting visually grounded paraphrases via an image." arXiv preprint arXiv:1806.04284 (2018).
Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
Kumar, Akshi, and Geetanjali Garg. "Sentiment analysis of multimodal twitter data." Multimedia Tools and Applications 78.17 (2019): 24103-24119.
Tiwari, Akanksha, Christian Von Der Weth, and Mohan S. Kankanhalli. "Multimodal Multiplatform Social Media Event Summarization." ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14.2s (2018): 1-23.
Tsai, Yao-Hung Hubert, et al. "Multimodal transformer for unaligned multimodal language sequences." arXiv preprint arXiv:1906.00295 (2019).
Modern approaches to Natural Language Processing are often developed under the assumption that all textual resources should be static, as coherent and unambiguous as possible, inherently global in meaning (independent of local factors) and ordered. From this perspective, dialogue is a challenging type of input, as it violates most of these assumptions.
As an interactive exchange between a narrow group of individuals, dialogue is highly personalised and contextualised. Its primary task is efficient communication on a small scale, and so the evolution of these communication channels happens rapidly and much more locally, when compared to written texts of a given language. Contemporary NLP approaches are not sufficiently flexible to handle this phenomenon. Despite Dialogue Modeling being a self-standing task within NLP, most NLP methods expect correct grammar and logic in the texts they are applied to. They become inefficient when applied to language typical of dialogue. A paradigm shift in how one approaches interpreting dialogue is required.
Accommodation describes the phenomenon that people will attune their speech to the discourse situation in the most suitable, "optimal" way (Giles and Ogay 2007). It has been extensively studied in the field of socio-linguistics. According to Giles and Ogay (2007), individuals will amend their accent, vocabulary, facial expressions and other verbal and non-verbal features of interaction in order to become similar to their conversational partners. The benefits of doing so can include sounding more convincing, signifying certain attributes (e.g. membership to a specific social group), or minimising the "distance in rank" between speakers. Similarly, Danescu-Niculescu-Mizil and Lee (2011) show that within a conversation, a participant will tend to mimic the linguistic style of other participants. By conducting an experiment on movie scripts they showed that this mimicry is something automatic for humans, a reflex that is not only a tool for "gaining social rank", and therefore is an inherent characteristic of dialogue.
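One simple way to put a number on this kind of mimicry is to ask whether a reply is more likely to contain a class of style markers (articles, say) when the utterance it answers also contains one. The sketch below compares that conditional probability with the baseline probability of a reply containing the marker at all. It is loosely inspired by the comparison of conditional and baseline probabilities in Danescu-Niculescu-Mizil and Lee (2011), but the exact formulation in that paper may differ, and the function names, marker set and example exchanges here are invented for illustration.

```python
def has_marker(text, markers):
    """True if any whitespace-separated token of the text is in the marker set."""
    return any(token in markers for token in text.lower().split())


def coordination(exchanges, markers):
    """P(reply has marker | utterance has marker) minus P(reply has marker).

    exchanges is a list of (utterance, reply) string pairs.
    """
    if not exchanges:
        raise ValueError("need at least one exchange")
    baseline = [has_marker(reply, markers) for _, reply in exchanges]
    triggered = [has_marker(reply, markers)
                 for utterance, reply in exchanges
                 if has_marker(utterance, markers)]
    if not triggered:
        raise ValueError("no utterance contains the marker")
    return sum(triggered) / len(triggered) - sum(baseline) / len(baseline)


ARTICLES = {"a", "an", "the"}  # hypothetical marker class
exchanges = [
    ("did you see the car", "the red one"),
    ("i saw the film", "was it a good one"),
    ("hello there", "hi"),
    ("where now", "home"),
]
print(coordination(exchanges, ARTICLES))
```

A positive value suggests that replies echo the marker more often than chance, a negative value that they avoid it.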
Accommodation is one of the prisms through which we can look at conversational behaviour. It highlights that the addressee and the situation between her and the speaker are crucial to interpreting the exchange. It can be argued that a framework which is centred around these findings may enhance the performance of dialogue interpretation systems.
The aim of this project is to examine how the developments in the understanding of conversations can be introduced into various approaches to Natural Language Processing – how we can systematise and generalise them. As a case study, the task of translating subtitles for scripted entertainment from English to other languages will be considered. Only recently have novel approaches to Machine Translation started going beyond translating isolated sentences by incorporating document-wide aspects into the process. However, document-level Machine Translation applied to dialogue has been underexplored in the literature (Maruf, Saleh and Haffari 2019).
Within the context of MT, if an automatic translator could be made to preserve the accommodation aspects present in dialogues, the resulting translation could be more dynamic and natural as the contextual information and aspects of the mimicry of style between participants can remain approximately preserved.
Danescu-Niculescu-Mizil, C. and Lee, L. (2011) 'Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs'. Cognitive Modeling and Computational Linguistics Workshop at ACL 2011, pp. 76-87.
Maruf, S., Saleh, F. and Haffari, G. (2019) 'A Survey on Document-level Machine Translation: Methods and Evaluation', pp. 1-29.
Ogay, T. and Giles, H. (2007) 'Communication Accommodation Theory', in B. B. Whaley and W. Samter (Eds.), Explaining communication: Contemporary theories.
This project is aimed at individuals who have had trauma or surgery to their vocal apparatus, e.g. laryngectomy, tracheoectomy, glossectomy, who are unable to talk in a conventional manner but who have full motor control of the rest of their body. Previous research in 'silent speech recognition' and 'direct speech synthesis' has used a wide variety of specialised/bespoke sensors to generate speech in real-time from residual articulatory movements. However, such solutions are expensive (due to the requirement for specialised/bespoke equipment) and intrusive (due to the need to install the necessary sensors). As an alternative, it would be of great interest to investigate the potential for using a conventional keyboard as a readily-available and cheap alternative to specialised/bespoke sensors, i.e. a solution based on text-to-speech synthesis (TTS).
Of course, there are fundamental problems with using contemporary TTS as a communications aid:
the conversion from typed input to speech output is non real-time and delayed,
even a trained touch-typist would be unable to enter text fast enough for a normal conversational speech rate,
the output is typically non-personalised,
it is not possible to control the prosody in real-time, and
it is not possible to control the affect in real-time.
These limitations mean that existing TTS users are unable to intervene or join in a conversation, unable to keep up with the information rate of exchanges, unable to express themselves effectively (i.e. their individuality and their communicative intent) and suffer a loss of empathic/social relations as a consequence.
So, what is needed is a solution that overcomes these limitations, and it is proposed that they could be addressed by investigating/inventing a novel end-to-end TTS architecture that facilitates:
Simultaneous real-time input/output (i.e. sound is produced as soon as a key is pressed).
Conversational speech rates by embedding a suitable prediction mechanism (i.e. the spoken equivalent of autocorrect).
Configurable output that allows a user to 'dial-up' appropriate individualised vocal characteristics.
Real-time control of prosody, e.g. using special keys or additional sensors.
Real-time control of affect, e.g. by analysing the acoustics of key-presses or facial expressions from a webcam.
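The prediction mechanism in the list above could take many forms. As a minimal sketch, the class below completes a typed prefix with the most frequent matching words from some text the user has produced before. The class name, the frequency-based ranking and the tiny example corpus are all assumptions made for illustration; a practical system would more likely use a language model conditioned on the conversation so far.

```python
from collections import Counter


class PrefixPredictor:
    """Suggest likely word completions from word counts (hypothetical helper)."""

    def __init__(self, corpus):
        self.counts = Counter(corpus.lower().split())

    def predict(self, prefix, k=3):
        prefix = prefix.lower()
        candidates = [(word, count) for word, count in self.counts.items()
                      if word.startswith(prefix)]
        # Most frequent first; ties broken alphabetically so output is deterministic.
        candidates.sort(key=lambda item: (-item[1], item[0]))
        return [word for word, _ in candidates[:k]]


predictor = PrefixPredictor("thank you thanks that is great thank you so much")
print(predictor.predict("tha"))
```

With suggestions like these, a user could in principle trigger a whole word after a few key presses, which is one route towards conversational speech rates.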
Cohort 2 (2020)
Deep neural network models have achieved state-of-the-art results in different Natural Language Processing tasks in recent years. Their superior performance comes at the cost of understanding how they work and how their output results can be justified. It is still challenging to interpret the intermediate representations and explain the inner workings of neural network models. The lack of justification for deep neural network models' decisions is considered one of the main reasons that deep learning models are not widely used in some critical domains such as health and law. Therefore, there is a need for designing more interpretable and explainable deep neural network models while reducing their computing requirements.
This PhD research project aims to make deep neural network models for NLP more interpretable and explainable. As a first step, we will investigate these models' architecture and how each component works. Next, we will try different approaches to reduce the computational complexity of these models and improve their interpretability while achieving similar accuracy and performance. In addition, the research project will investigate how model interpretations can help in designing more data-efficient models. Furthermore, we will investigate novel methods for evaluating the quality of interpretations of deep neural networks.
Which components of deep neural networks affect the model's decisions?
How does the interpretability of deep learning models for NLP vary from task to task?
Can different pre-training approaches be more computationally efficient?
Can interpretability help in designing more data-efficient deep learning models for NLP?
How can model interpretations be evaluated?
My research project seeks to combine a well-established subfield of NLP, summarisation, with the growing area of Argument Mining (AM). In essence, AM involves automatic analysis of texts involving argument (language that uses reasoning to persuade) in order to identify its underlying components. These components can then be used in the generation of a textual summary that allows human readers to quickly grasp important points about a discussion.
As an illustrative problem for argument summarisation, we can think of a media company that wishes to report on recent parliamentary debates around a controversial issue. If the volume of debates is large, a summary would clearly be useful. An ideal system which solved this problem would be able to identify the key opposing stances in the debate and summarise the arguments for each.
The difficulties inherent in this task are numerous. Principal among them is the fact that the key units of analysis in argumentative conversation rarely correspond to surface-level features like words, and inferring these latent structures is a non-trivial problem.
Within a machine learning framework, this can be restated as the problem of choosing the most appropriate output representation for a supervised machine learning model. One commonly-used representation for argumentative text is the Toulmin model, which identifies six distinct Argument Discourse Units (ADUs) and defines relations between them. Many existing systems use only a subset of these units, since distinctions between some different types of unit are difficult to identify in practice.
To give an example of these challenges, one ADU that is commonly annotated in such tasks is the Backing unit, used to give evidence to a Claim. Though the presence of such units may be indicated by the presence of certain words or phrases ("To illustrate this...", "Consider...", "As an example..."), in many cases, such explicit indicators are lacking, and the relation of the Backing unit to a Claim unit may merely be implied by the juxtaposition of Backing and Claim in adjacent sentences (see eg the examples in ).
In such cases, it may be impossible to identify the relation between two argument components on the basis of the text alone, and some sort of world knowledge may be required to do this. This could be encoded implicitly, using a pre-trained language model such as BERT, or explicitly, using a knowledge graph.
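The explicit indicators quoted earlier are easy to detect, which makes the contrast with the implicit case clearer. The sketch below checks a sentence for the three example cue phrases only; the function name and the test sentences are invented, and a realistic cue lexicon would need to be far larger. When no cue is found the function simply returns nothing, which is exactly the situation where world knowledge or a learned model would have to take over.

```python
import re

# Cue phrases taken from the examples in the text; a real list would be much longer.
BACKING_CUES = ("to illustrate this", "consider", "as an example")


def find_backing_cue(sentence):
    """Return the first explicit Backing cue found in the sentence, or None."""
    lowered = sentence.lower()
    for cue in BACKING_CUES:
        # Word boundaries stop "consider" from matching inside "considering".
        if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
            return cue
    return None


print(find_backing_cue("As an example, wages fell in 2019."))
print(find_backing_cue("Wages fell in 2019."))
```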
Since we are also addressing the task of summarising an argument, we also expect to have to address the problems involved in the automatic generation of textual summaries. These include diversity, coverage and balance – in other words, producing summaries which do not repeat themselves, attempt to cover as much of the original text as possible, and give equal weight to different parts of the text with equal importance.
In conclusion, the problem of automatically summarising argumentative text is one which includes many open sub-questions we intend to address, including how to best represent argument components, how to learn them from textual data, and how to use these to generate a useful summary.
Lawrence, John, and Chris Reed. "Argument mining: A survey." Computational Linguistics 45.4 (2020): 765-818. (see section 3.1 on text segmentation).
Toulmin, Stephen E. The uses of argument. Cambridge university press, 2003.
El Baff, Roxanne, et al. "Computational argumentation synthesis as a language modeling task." Proceedings of the 12th International Conference on Natural Language Generation. 2019.
Fromm, Michael, Evgeniy Faerman, and Thomas Seidl. "TACAM: topic and context aware argument mining." 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, 2019.
Li, Liangda, et al. "Enhancing diversity, coverage and balance for summarization through structure learning." Proceedings of the 18th international conference on World Wide Web. 2009.
Human-machine interaction relying on robust far-field automatic speech recognition (ASR) is a topic of wide interest to both academia and industry. The constantly growing use of digital home/personal assistants is one such example.
In far-field scenarios, in which the recording microphone is at a considerable distance from the user, the microphone picks up not only the target speech signal but also noise, reverberation and possibly competing speakers. While large research efforts have been made to increase ASR performance in acoustically challenging conditions with high noise and reverberation, the conversational aspect adds further challenges for acoustic and language modelling. If more than one speaker is present in the conversation, tracking the target speaker or separating overlapping speech into individual audio sources poses further challenges to ASR. Another layer of complexity is introduced by the recording equipment, such as the number of microphones and their positioning, since these can have a significant influence on the recorded signal used for ASR.
Deep learning has significantly advanced the state of the art in ASR in recent years. Deep learning methods are widely explored to implement various components of the ASR system, such as front-end signal enhancement, acoustic models or the language model, independently or jointly, in either hybrid or end-to-end (E2E) scenarios.
This project aims at increasing ASR robustness in the challenging acoustic far-field scenario described above. For this, methods from the area of signal processing for signal enhancement and separation will be applied to improve ASR performance, and novel methods for acoustic and language modelling will be developed. In this way, ASR performance will be increased for adverse conditions with noise and reverberation, and particularly in environments with competing speech signals. To achieve this goal, techniques from signal processing, machine learning and combinations of both will be applied, with attention to the technical challenges associated with implementing such systems, particularly in smaller, low-power devices such as smart speakers or hearing aids. Additionally, novel techniques involving the use of semantic information to improve performance will be explored.
Technical documents, such as academic research papers, often contain content that is incomprehensible to a non-specialist audience. This can make it difficult for those without specific knowledge of the subject area to digest and understand ideas that such a piece of work conveys, even at a high level.
Natural Language Generation (NLG) tasks such as text simplification, text summarisation and style transfer all revolve around the adaptation of their input data (eg a body of text) to fit a particular purpose, whilst preserving the essence of its core content. Methods used for such tasks could feasibly be used to summarise and explain the key ideas from technical documents in a format that is more digestible to a non-specialist audience. However, the style in which such ideas should be presented, and the degree to which technical content should be simplified, is highly dependent on the audience itself. People naturally adapt the manner in which they communicate to better suit their audience, but this is something that, historically, has seldom been taken into account in NLG tasks. However, some more recent works have experimented with the incorporation of extra-textual information relating to the audience, such as gender or grade-level information, with promising results.
This project will involve the research and application of automatic data-to-text NLG techniques that are able to adapt to a given audience. This could, for example, involve leveraging such audience information to provide a personalised output. These are to be used for the transformation of existing technical documents in an effort to make their content more suitable to a different audience or domain. As input data typically used for the aforementioned tasks is often significantly shorter than such technical documents, techniques for handling large documents will also be investigated.
How can we expand on existing research [1, 6] into leveraging audience meta-information when performing the aforementioned data-to-text NLG tasks? How can such information be taken into account to better evaluate the performance of models?
How can we adapt current methods of evaluation for the given NLG tasks to allow them to perform at document-level? Do new methods of evaluation need to be developed?
How can we expand upon alternative approaches to NLG tasks (eg deep reinforcement learning, unsupervised learning) to allow them to perform on entire/multiple documents? Can existing/novel techniques for handling large/multiple documents (eg hierarchical attention mechanisms [2, 5]) be applied to allow for good performance at document-level?
Scarton, C., Madhyastha, P., & Specia, L. (2020). Deciding When, How and for Whom to Simplify. Proceedings of the 2020 Conference of the 24th European Conference on Artificial Intelligence.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical Attention Networks for Document Classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.
Zhang, X., & Lapata, M. (2017). Sentence Simplification with Deep Reinforcement Learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 584–594.
Surya, S., Mishra, A., Laha, A., Jain, P., & Sankaranarayanan, K. (2019). Unsupervised Neural Text Simplification. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2058–2068.
Miculicich, L., Ram, D., Pappas, N., & Henderson, J. (n.d.). Document-Level Neural Machine Translation with Hierarchical Attention Networks. Association for Computational Linguistics.
Chen, G., Zheng, Y., & Du, Y. (2020). Listener’s Social Identity Matters in Personalised Response Generation.
Computational methods for natural language processing (NLP) require representing sequences of language as vectors, or embeddings. Modern representational approaches generate contextualised word embeddings, where the representation of a word depends on the context in which it appears. The embeddings generated by state-of-the-art language representation models such as BERT and ELMo have pushed performance on a variety of NLP tasks; however, there are still areas in which these embeddings perform poorly.
One of these areas is handling multi-word expressions (MWEs), which are phrases that are semantically or syntactically idiosyncratic.
These are wide-ranging in human language, estimated to comprise half of a speaker's lexicon, and thus are an important area for research. Examples include "kick the bucket", "by and large" and "car park".
In particular, state-of-the-art contextualised word embeddings have been shown to perform poorly at capturing the idiomaticity of MWEs, where the meaning of the expression is not deducible from the meanings of the individual words (non-compositionality). As an example of this, the meaning of "eager beaver" (an enthusiastic person) is not deducible from the meaning of "beaver".
This inability is detrimental to the performance of NLP models on downstream tasks. In machine translation, for example, idiomatic phrases may be translated literally and thus the wrong meaning conveyed.
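One common way to make non-compositionality concrete is to compare the vector of a whole expression with a vector composed from its individual words. The sketch below is only an illustration: the three-dimensional vectors are hypothetical hand-written values (not the output of BERT, ELMo or any real model), and averaging is just one assumed choice of composition function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compositionality(phrase_vec, word_vecs):
    """Similarity between a phrase vector and the average of its word vectors."""
    composed = [sum(dims) / len(word_vecs) for dims in zip(*word_vecs)]
    return cosine(phrase_vec, composed)

# Hypothetical 3-d vectors; the dimensions loosely mean (animal, person, enthusiasm).
eager = [0.0, 0.1, 1.0]
beaver = [1.0, 0.0, 0.0]
student = [0.0, 1.0, 0.0]
eager_beaver = [0.0, 1.0, 0.8]   # idiomatic: an enthusiastic person
eager_student = [0.0, 1.0, 0.9]  # compositional: a student who is eager

idiomatic_score = compositionality(eager_beaver, [eager, beaver])
literal_score = compositionality(eager_student, [eager, student])
```

With these toy values the compositional phrase scores close to 1, while "eager beaver" scores much lower because the "animal" component of "beaver" is absent from the meaning of the expression.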
The aim of this project is to investigate methods of developing embeddings that better deal with the idiomaticity of MWEs, particularly in a cross-domain setting, ie across different genres of text (for example medical and legal).
The novel embedding methods developed in this project would be evaluated both intrinsically and extrinsically.
Intrinsically, we wish to demonstrate an improvement in the ability of the model to deal with idiomaticity. This would be done using, for example, a dataset of sentences labelled as idiomatic or not, to test the ability of a model to detect idiomaticity.
We then wish to show that an increase in intrinsic performance leads to an improved performance on downstream NLP tasks in a cross-domain setting (ie extrinsic evaluation). The evaluation method would depend on the task at hand, for example using the BLEU score for evaluation of cross-domain machine translation.
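To illustrate how a BLEU-style score penalises a literal rendering of an idiom, here is a deliberately simplified sketch: single reference, n-grams up to length 2, uniform weights and no smoothing. It is not the reference BLEU implementation, and the sentences are hypothetical English stand-ins for target-language output.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU (uniform weights, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: only candidates shorter than the reference are penalised.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "he died last year"
faithful = "he died last year"                # idiom translated by meaning
literal = "he kicked the bucket last year"    # idiom translated word for word

faithful_score = simple_bleu(faithful, reference)
literal_score = simple_bleu(literal, reference)
```

The faithful output matches the reference exactly, whereas the literal output shares only a few n-grams with it and so scores much lower. In practice an established toolkit would be used instead of a hand-written metric.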
Aside from looking at how we can improve model performance on certain tasks, we also want to investigate how the developed methods can be used to boost the interpretability of models. This means helping to understand why models give the outputs that they do. It is hoped that the explicit treatment of MWEs will allow for a better understanding of the decisions made by models.
Can novel embedding methods be developed that better handle idiomatic multiword expressions, particularly in a cross-domain setting?
How does the use of these novel embedding methods affect the performance of models on downstream NLP tasks?
Can the interpretability of NLP models be improved through the explicit handling of multiword expressions?
The detection of medical conditions using speech processing techniques is a rapidly emerging field, focusing on the development of non-invasive diagnostic strategies for conditions ranging from dementia, to depression and anxiety, and even Covid-19. One of the largest issues facing medical speech classification, and one that seldom affects other areas of speech technology (at least in English), is the scarcity of available data. Many of the largest corpora looking at dementia, for example, contain only tens of thousands of words. In addition, these datasets usually contain very few speakers. Training a model on so few speakers creates a risk of overfitting to idiosyncratic, dialectal, or accentual features of the participants, which in turn can gravely impact the efficacy of the classifier depending on the language features of the test subject.
Accent variation can have a large impact on speech technologies. The traditional approach to countering this impact is to use either a colossal corpus or separate per-accent models, each trained on individuals with a similar accent and either selected by the user or dictated by geography. Unfortunately, this approach depends on a rich dataset from which to train said models, which, in the case of medical classification systems, simply does not exist.
This project aims to explore approaches for reducing the impact of accent and language variation on onset cognitive impairment detection systems. It will explore the impact of accents both on the construction of cognitive impairment detection classifiers and on the compilation, initial processing and feature extraction of the datasets. Whilst much of this feature extraction work will take shape during the literature review, one example may be to investigate bilingual features.
Is it possible that dementia has a consistently detrimental impact on second language production that is distinctly different from the broken language you find in an early learner of the language?
For example, we know that individuals make the most speech production errors on phones which are more similar between their L1 and L2, particularly when learning the language. Do we see a loss in the ability to maintain phonetic distinctions as someone's cognitive state declines, and are the resulting features different from those of the inverse process in an L2 learner, and thus classifiable? This project aims to develop normalisation strategies and new feature extraction methods for limiting the impact of accents and language variation on medical speech classification systems.
The importance of this research stems from the growing enthusiasm to implement onset cognitive impairment detection systems into the medical industry. Issues here arise where the tools may only be effective on certain demographics of individuals creating significant concern over potential inadvertent segregation created by the technologies. Tools from facial recognition systems to credit scoring systems have all previously and presently seen substantial criticism for their impact on certain demographics of individuals where the systems either perform poorly or adversely impact certain groups of people. It remains vital that medical speech technology is non-discriminatory and provides universally stable efficacy across as many demographics of people as possible.
Spoken-language based interactions between a human being and an artificial device (such as a social robot) have become very popular in recent years. Multi-modal speech-enabled artefacts are seen in many fields, such as entertainment, education, and healthcare, but the user experience of spoken interaction with such devices is not very satisfactory. For example, these devices sometimes fail to understand what a command means, or fail to provide a relevant answer to questions; people do not know how to make their commands more understandable to artificial devices; and the spoken interaction is limited to a rigid style and lacks rich interactive behaviours.
Users' dissatisfaction is partly caused by the limited language abilities of social agents, such as inaccurate automatic speech recognition (ASR). Apart from that, it is hypothesised that the dissatisfaction is also caused by the mismatch in abilities between humans and artificial devices. Due to limited cognitive abilities, such agents cannot take the perceptions of others into consideration, nor react to the situation accordingly, which results in unsuccessful communicative interaction.
Communicative interaction efficiency involves multiple factors across disciplines such as languages, psychology, and artificial intelligence. However, the role of influential factors and the mechanism of how those factors interact with each other are unknown. Here the project aims to develop a unified framework which can characterise communicative interaction efficiency. It will investigate factors and strategies that make a spoken-language based interaction effective. Based on understandings of the nature of spoken language interaction between humans and artificial devices, the next objective is to maximise affordance of speech-enabled artefacts, and to achieve more effective communicative interaction between a human being and an artificial device.
Preliminary questions so far would appear to be as follows.
Definition of effectiveness: What makes a spoken-language based interaction effective in human-human interaction? Is it the same for interaction between a human and an artificial device? If so, in what way?
Affordance of speech-enabled artefacts: What affects the affordance of speech-enabled artefacts? How would it affect communicative interaction effectiveness? For example, some people may argue that naturalness is about making speech-enabled artefacts like humans, with human-like voices and appearances. Is naturalness helpful in maximising the usability of speech-enabled artefacts? If a natural voice or appearance is not matched by an artificial device's limited language and cognitive abilities, would it cause an 'uncanny valley' effect?
Short-term and long-term interaction: Do people's expectations of speech-enabled artefacts change over time? How would it change the way artificial devices interact with people?
Modelling: Can the communicative interaction effectiveness be modelled? What are different levels of communicative interaction effectiveness? How could those levels be applied to speech-enabled artefacts?
It is hoped that the results of this project will provide a general guideline for communicative interaction between a human being and an artificial device. It is anticipated that such a guideline will serve as a starting point when people design or improve multi-modal speech-enabled artefacts.
The effectiveness of data-driven Automatic Speech Recognition (ASR) depends on the availability of suitable training data. In most cases, this takes the form of labelled data, where speech clips (or a numerical representation of them) are matched to their corresponding words or phonetic symbols. Creating labelled training data is an expensive and time-consuming process. For some languages, such as native British and American English, there is ample data for building ASR systems. For many other languages, however, a large volume of accessible training data does not exist. How can the quality of ASR for under-resourced languages be improved?
Techniques for conducting ASR typically fall into two categories: parametric and non-parametric. Parametric ASR methods use data to train a model. This can be thought of as a 'black box' function, which takes speech as input and outputs the corresponding text, ideally with a high degree of accuracy. Once trained, the model is used without further reference to the training data. The training data is summarised by the model, but some of the information it contains is lost in the process, and it cannot be reinterpreted in the presence of additional data. Parametric models, such as deep learning-based neural networks, are usually trained from scratch for the specific task they will be used for, which makes them extremely effective, but means they cannot be easily generalised to other languages. They are often not easy to interpret, and they are usually not as effective when data is scarce because they have a lot of parameters, and determining the values of many parameters requires a lot of data.
Non-parametric approaches, such as Gaussian processes, instead use the training data directly. In this case, the labelled training data are available at the decision-making stage. This preserves all of the information in the training data, and requires fewer assumptions to be made when predicting text from speech. It also, at least in principle, allows sensible decision-making even with limited data. The downside is that it is not easily scalable, meaning that the computational power and data storage requirements increase rapidly as the amount of data increases. Large quantities of training data therefore become unfeasible.
The two approaches, parametric and non-parametric, have different strengths. When large amounts of data are available, capturing most of the information from a large data set is better than capturing all of the information from a smaller training data set, so that the parametric approach may perform better. When training data is scarce, the non-parametric approach may perform better, since all of the limited training information is preserved.
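The trade-off can be illustrated with a toy classification sketch. Two simple stand-ins are used here, neither of them the methods named above: a nearest-centroid classifier for the parametric case (rather than a neural network) and a 1-nearest-neighbour classifier for the non-parametric case (rather than a Gaussian process). The one-dimensional features and labels are hypothetical values, not real speech data.

```python
from statistics import mean

class CentroidClassifier:
    """Parametric stand-in: training data is summarised as one mean per class."""
    def fit(self, xs, ys):
        self.centroids = {
            label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)
        }
        return self

    def predict(self, x):
        return min(self.centroids, key=lambda label: abs(x - self.centroids[label]))

class NearestNeighbourClassifier:
    """Non-parametric stand-in: every labelled example is kept for prediction."""
    def fit(self, xs, ys):
        self.examples = list(zip(xs, ys))
        return self

    def predict(self, x):
        return min(self.examples, key=lambda ex: abs(x - ex[0]))[1]

# Hypothetical 1-d features; class "b" has one unusual example at 4.0.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 4.0]
ys = ["a", "a", "a", "b", "b", "b", "b"]

parametric = CentroidClassifier().fit(xs, ys)
non_parametric = NearestNeighbourClassifier().fit(xs, ys)

query = 4.2
parametric_label = parametric.predict(query)          # uses only the two class means
non_parametric_label = non_parametric.predict(query)  # consults all stored examples
```

The centroid model stores two numbers however much data it sees, so the unusual example at 4.0 is averaged away and the query is assigned to "a". The nearest-neighbour model retains that example and assigns "b", but its storage and lookup cost grow with every example added.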
The objective of this project is to explore and evaluate approaches inspired by both parametric and non-parametric methods for multi-lingual automatic speech recognition.
This project is concerned with the task of speech-to-speech translation (S2ST). The traditional approach to this problem uses cascaded models consisting of three components: ASR, MT and TTS models that are used in sequence to map from source to target language. However, there are limitations to the cascaded approach: errors can be compounded between components, it is difficult to conserve para-linguistic features such as prosody or emotion, and the computational requirements involved with three separate models are significant. These limitations motivate models that map directly from the source to the target language without an intermediate textual representation.
The advent of deep neural network models along with a concerted effort from the International Workshop on Spoken Language Translation (IWSLT) community have demonstrated that collapsing the ASR and MT components of the S2ST pipeline is viable, with direct speech-to-text models showing competitive performance to cascaded models. This success paves the way for work on direct speech-to-speech translation.
One of the key challenges with training a direct S2ST model is obtaining the appropriate data in the necessary quantities. Given sufficient examples of speech-pairs in the source and target languages, it would be possible to train a model end-to-end, constructing a physical embedding space onto which we project the source signal and from which we generate the target signal. The feasibility of this approach has been proved in concept using small datasets [1, 2]. However, the creation of large-scale datasets is extremely unlikely, given the difficulty of creating 'objective' or para-linguistically 'equivalent' spoken translations.
While direct speech-to-speech data is scarce, there is a relative wealth of complementary data in the form of text-to-text translation or speech-to-text translation that could be exploited for S2ST. This project aims to explore an alternative to constructing a physical space by directly training on speech-pairs, instead leveraging these better-resourced, complementary modalities to construct an 'effective' latent space. This space is expected to behave similarly to a semantic space, where a point represents the meaning of a sentence.
The project will additionally explore appropriate methodologies for projection of the source signal onto the space and generation of a target signal from it, determining the viability of zero-shot learning of S2ST using an effective space. The project will also develop heuristic approaches to evaluate the effective space that are reliable and cost-effective.
[1] Michelle Guo, Albert Haque, and Prateek Verma. End-to-end spoken language translation, 2019.
[2] Y. Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, M. Johnson, Z. Chen, and Yonghui Wu. Direct speech-to-speech translation with a sequence-to-sequence model. In INTERSPEECH, 2019.
The story of individuals concerns how we are able to understand the stories being told in TV shows and movies. To do this automatically we combine three main areas of study:
identification/reidentification of an individual;
recognition of actions and objects on the screen;
and text summarisation of these features.
The current state of the art research uses image processing (through neural networks) to identify high-level features such as an individual's gender, emotion and objects in the environment. What we will be working towards is further improvements in combining this visual data with audio features from the sounds associated with the video.
Many functions of visual processing can be easily done using toolkits that are already well known within the industry, such as OpenCV. Functions such as facial recognition and feature extraction are practically pre-made, while others like gesture analysis require some work but are well documented. Audio processing is in a similar state: speech recognition from video clips is widely available through industry models (Amazon/Google speech-to-text APIs, Kaldi, etc), whereas processes such as voice recognition have no pre-made toolkit. Through a mixture of customising these toolkits for most functions and building others using known techniques, the project aims to use the audio data from the clips to improve upon the current ability of visual processing for person re-identification.
The audio often contains crucial plot details such as names, intentions and emotions that are not always found strictly on screen. Even characters who are directly speaking to each other can be in different shots, never appearing on screen within the same frame. By including this information we are able to re-identify characters from their voice, physical features or a mixture of both, along with greater emotion recognition for establishing character reactions to major story beats. Work will be done on refining the balance between these two fields with regard to how reliant the final system is on different parts of the data and where priority is given between competing conclusions.
In addition to this, a possible model for identifying the cinematography shots used in mainstream media will also be researched. The choice of camera angles, distance and framing conveys vast amounts of information which is currently not accessed. For example, an 'over the shoulder' followed by a 'close up' is often used to show that two characters are talking together even though they may not be visible or audible in any number of frames. By creating a model that can extract from a clip what shots are being used (referred to as scene detection to avoid confusion with other widely used terms) we are able to add back this cinematographic information to build an improved, well-rounded picture of the events taking place on screen. Combining this with the already integrated visual and audio data can lead to a more robust method for identifying the storylines from multiple scenes throughout a piece of media, giving a more useful framework of information for the text summary of the plot in the final stage.
Misinformation poses a serious threat to society through its power to mislead the public across domains such as politics, the environment and health. The quick generation and spread of misinformation online makes it extremely difficult to tackle this problem through manual verification alone. The use of natural language processing techniques and machine learning models can therefore facilitate the identification of pieces of misinformation and the detection of misinformation networks.
Existing research in this field includes areas such as fake news and rumour detection through analysis of the features of the users, content and context of a potential piece of misinformation. This project aims to investigate the spread of misinformation through the angle of content reuse, where the same story is reused across websites or social media with minor modifications. This content reuse shall be examined across languages and through additional features such as images.
Whilst misinformation is defined as false information spread regardless of intention to deceive, disinformation is deliberately misleading. Both these aspects will be explored through this project.
Our first objective is to identify patterns and features of content reuse through methods such as semantic text similarity and plagiarism detection. We will use data from misinformation websites and social media that have been identified as pieces of content reuse. This will also be applied to a multilingual setting by looking at pieces of content that are reused across other languages, applying techniques such as zero-shot or few-shot learning.
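As a rough illustration of how reuse with minor modifications might be flagged, the sketch below compares texts by the Jaccard overlap of their word trigrams ("shingles"). This is a purely lexical baseline, closer to plagiarism detection than to semantic similarity, and would not capture paraphrase or cross-lingual reuse. The example sentences and function names are hypothetical.

```python
import re

def shingles(text, n=3):
    """Return the set of lower-cased word n-grams in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "The mayor announced that the bridge will close for repairs next week"
reused = "The mayor announced that the bridge will close forever next week"
unrelated = "Heavy rain is expected across the region this weekend"

reuse_score = jaccard(original, reused)
unrelated_score = jaccard(original, unrelated)
```

The lightly edited copy shares roughly half of its trigrams with the original, while the unrelated sentence shares none. A threshold on such a score is one simple, assumed way to shortlist candidate pairs for closer inspection.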
Based on the discovered language patterns and features of content reuse, we aim to create machine learning classifiers and enhance deep learning models with this information in order to identify new pieces of content reuse. We will also explore how these models perform on changing patterns of misinformation and evolving stories over time.
Finally, we will investigate other contextual features, such as out-of-context and manipulated images or interlinking of misinformation networks as an indicator of content reuse. For example, in the case of out-of-context images, we can analyse similarities between image and text within a given article or social media post, as well as the similarities across a series of articles or posts across time. We can then combine language features with visual features in order to create multimodal classifiers for content reuse.
What are the best models and most important features for detecting content reuse?
Are models capable of identifying content reuse where the story continues to evolve or has been more drastically modified?
How does the language of disinformation vary across different languages and can models generalise over them?
Is it possible to adapt models to deal with changing styles of language and new patterns of disinformation?
What other information, such as contextual features (interlinking of websites) or images, used alongside language would be beneficial to detecting content reuse?
Cohort 3 (2021)
Natural language generation
SLT for social change
Information retrieval and recommender systems
Text simplification, and other approaches that improve the accessibility of literature
Open-Domain Question Answering
Controllable Natural Language Generation
Uncertainty and explainability in neural networks
Toxicity detection in environments like social media and competitive video games
User-generated content analysis
Clinical applications of speech technology
NLP for social media analysis
Online misinformation detection
My interests on the speech side of things are emotion detection (generally, as well as mental health applications and effects related to intoxication) and voice cloning (particularly due to the criminal use of deepfakes and exploiting biometric security), whilst on the NLP side of things my primary interest would be mis/dis-information detection.
natural language processing
speech information processing
machine learning with applications to the healthcare domain
I fell in love with Computational Linguistics thanks to multi-word expressions, but I'm also interested in the practicalities of using SLTs - domain adaptation and robustness, emojis and rapidly-evolving language, fairness, equality and sustainability.
Teresa Rodríguez Muñoz
Design of frontend conversational agents for scientific and healthcare applications
ASR robustness including detecting different pronunciations and dialects
Effect of the environment on the user's experience with voice-controlled IVAs
Novel ML and deep learning algorithms for neural networks.
Novel and robust SLT models
Computationally efficient natural language processing
Interpretability in natural language processing models
Clinical applications of speech technology