Our PhD students
Cohort 1 (2019)
Project title: Using NLP to Resolve Mismatches Between Jobseekers and Positions in Recruitment
Supervisor: Dr Diana Maynard
Industry partner: Tribepad
Nearly everyone has a CV (or at least a career and/or education history).
Nearly every non-sole trader business recruits people.
Most people want to progress from the job they are in – over 60% of people are always at least passively looking for a job.
CV-to-Job Spec matching technologies are generally underperforming right now, but there is plenty of data available for machine learning solutions to improve performance. This would improve the quality and objectivity of recruitment, reduce the number of irrelevant applications recruiters receive, reduce the likelihood that a good candidate is overlooked, save time and lead to faster filling of open positions, and enhance the candidate experience.
One challenge is that job seekers are reluctant to provide structured data about themselves, preferring to 'throw' a CV over to one recruiter, then to the next recruiter at the next company, and so on. Because of this, the data from a job seeker is often not in structured form. Similarly, much of the data the recruiter provides about the job is also unstructured – given in a number of paragraphs of text. Yet all the key information needed to match a candidate is available: the job description specifies, in the text, the key skills, the ideal candidate, the soft skills, the must-haves, the skills that are beneficial but not essential, and so on.
This project will involve researching a 'matching solution' that takes all of the above into consideration – the unstructured data needs to be parsed using Natural Language Processing (NLP) techniques, the key components extracted, and fed into a machine learning algorithm to determine how close someone is for a job and what the key differences are. This could help people see what skills they need to acquire in order to progress (their 'skills delta'), and that data could eventually feed into a training system to suggest courses (college, university or other).
A wide range of Information Extraction methods (including knowledge-engineering and supervised learning methods) will be explored to tackle parsing of unstructured CV and job specification data, and both shallow-learning and deep-learning methods will be considered for the 'matching' component of the matching solution.
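The shallowest end of such a pipeline can be illustrated with a minimal sketch (purely illustrative, not the project's proposed method): a bag-of-words cosine similarity between CV and job-spec text, plus a naive 'skills delta' computed as a set difference. The `match` function, its field names and the toy data are all assumptions for demonstration.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tokenise(text: str) -> list:
    return text.lower().split()

def match(cv: str, job_spec: str, required_skills: set) -> dict:
    # Lexical similarity between the full CV and job spec texts.
    score = cosine(Counter(tokenise(cv)), Counter(tokenise(job_spec)))
    # 'Skills delta': required skills not evidenced anywhere in the CV.
    delta = required_skills - set(tokenise(cv))
    return {"similarity": score, "skills_delta": sorted(delta)}

result = match(
    cv="python developer with sql and data analysis experience",
    job_spec="seeking python engineer with sql docker and cloud skills",
    required_skills={"python", "sql", "docker"},
)
```

A learned matching component would replace the cosine score with a supervised model, and the skills delta would be computed over extracted skill entities rather than raw tokens.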
Investigation of some or all of the following questions will drive the methodology, by providing deeper insights that go beyond simple matching of keywords.
Is the job they are applying for a natural progression from their current role, a side step, or a bit left-field?
How successful were similar people in applying for similar jobs and how long did they stay in those jobs for?
How likely is someone to be hired for a particular job (based on empirical data)?
Does someone's location help or hinder their application? What about their school, or their previous places of work?
Is the job seeker a 'job hopper' or a contractor?
The field of automatic speech recognition (ASR) has seen significant advances in recent years, due to a combination of more advanced deep neural network models and the computational power necessary to train them. One area that still presents a significant challenge is the transcription of meetings involving multiple speakers in real-world settings, which are often noisy and reverberant, and in which the number of speakers and their locations change. Not only is the conversational environment challenging, but the recording of voices in these settings also poses issues.
Multiple microphones or microphone arrays are commonly used, but these may be situated a fair distance away from the speakers, which in turn leads to a degraded signal-to-noise ratio. For this reason, meeting transcription tends to require additional stages of processing, many of which are not required for simpler ASR tasks. One of these stages is the preprocessing of audio data, often involving a combination of signal processing and machine learning techniques to denoise and beamform the audio from microphone arrays. As no device is capturing audio from the speaker alone, speaker diarisation needs to be included in the systems, which then also makes it possible to adapt ASR models to the specific acoustic conditions and to the speaker in their present location.
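As a concrete illustration of the beamforming stage, here is a minimal delay-and-sum sketch (a deliberately simplified stand-in for the adaptive beamformers used in practice): each channel is advanced by its known integer-sample delay and the channels are averaged, reinforcing the target while averaging down uncorrelated noise. The toy pulse signal and the assumption of known delays are purely for demonstration.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: list) -> np.ndarray:
    """Align each microphone channel by its integer-sample delay and average.

    channels: array of shape (n_mics, n_samples)
    delays:   per-mic delays in samples, e.g. derived from a known source direction.
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)  # advance the channel so all mics line up
    return out / n_mics

# Toy example: the same pulse arrives one sample later at each successive mic.
rng = np.random.default_rng(0)
clean = np.zeros(64)
clean[10] = 1.0
mics = np.stack([np.roll(clean, d) for d in range(4)])
noisy = mics + 0.1 * rng.standard_normal(mics.shape)
enhanced = delay_and_sum(noisy, delays=[0, 1, 2, 3])
```

In real arrays the delays are fractional and must themselves be estimated, which is where the machine learning components enter.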
Speech in meetings has additional complexities – it is conversational, and thereby language complexity is increased. This also has implications for the acoustics: speakers talk concurrently and give backchannel signals. The speaking style is much more varied and is adjusted according to the conversational situation; it is also influenced by the conversation partners and by the topic and location of the meeting. All of these qualities require a meeting transcription system to be highly adaptive and flexible.
The main focus of this project is on improving ASR technology for doctor-patient meetings in a healthcare setting. The challenges for these types of meetings are unique. For example, the environment where recordings take place may have hard surfaces leading to very reverberant speech signals. One can also not assume that recording microphones are in fixed places or even well located in relation to the speakers. While moving speakers occur in other meeting scenarios, one can expect this to be present more often here. The impulse response of a particular acoustic environment not only changes from room to room but also varies depending on where within the room speakers are in relation to microphones.
Beyond the acoustic challenges, large amounts of data for doctors' speech might be available, whereas data from individual patients will be sparse. This poses a challenge for commonly used adaptation techniques in ASR, in particular if speakers with non-standard speech are present – something that can be expected in health settings. Overall, the type and content of language is expected to be very specific to the healthcare domain. All of these unique attributes have an impact on ASR performance and can severely reduce the accuracy of outputs.
The aim of this project will be to improve upon existing ASR technology by examining some of the areas outlined. Particular areas of interest will be improving microphone array recording techniques and adaptation methods that address the specific properties found in real recordings. It is expected that any techniques developed will have to be highly adaptive to the recording settings. Further exploration will include investigation into the suitability of standard ASR metrics for assessing performance for users.
The need to look after one's mental health and wellbeing is ever increasing. Recent events such as the Covid-19 pandemic have resulted in almost everyone being subjected to a lockdown, with many people reporting increased feelings of anxiety, depression, and isolation. Speech has been found to be a key means of identifying and examining a range of different mental health concerns, but as yet little research has been conducted on how to integrate such knowledge into everyday life. A gap exists between the many mental health and wellbeing apps and virtual 'chatbots' available, and the speech features that have the potential to aid in the diagnosis and monitoring of such conditions.
This project investigates the efficacy of a spoken language virtual agent, designed to prompt users to record themselves talking about how they're feeling over an extended period of time. This serves two purposes: firstly, the virtual agent acts as an interactive tool that helps alleviate feelings of loneliness, giving people an ear for talking therapy without the need for another human to converse with. Secondly, machine learning techniques will track features of the user's speech, such as phoneme rate and duration and acoustic features, in order to monitor the user's mental wellbeing over a set amount of time. This provides a snapshot of a user's mental peaks and troughs, and in turn helps to train the AI to detect such changes.
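A minimal sketch of the kind of low-level feature tracking described above (illustrative only; a real system would use far richer acoustic and prosodic features): short-time energy and zero-crossing rate computed per frame. The frame length and the toy tone-plus-silence signal are assumptions.

```python
import math

def frame_features(samples, frame_len=160):
    """Per-frame short-time energy and zero-crossing rate.

    Simple stand-ins for the kinds of acoustic features (alongside
    phoneme rate and duration) that could be tracked longitudinally.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        ) / (frame_len - 1)
        features.append({"energy": energy, "zcr": zcr})
    return features

# Toy signal: a loud 'voiced-like' 100 Hz sinusoid (at 8 kHz) then silence.
tone = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(160)]
silence = [0.0] * 160
feats = frame_features(tone + silence)
```

Tracking how such per-frame statistics drift across days or weeks is what would provide the longitudinal 'snapshot' described above.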
This project will address multiple research questions, such as how people from different age groups would interact with such a virtual agent, and how their needs would differ in regard to what they are hoping to get out of such an interaction. Secondly, this project serves as a study into how much time a person would be required to interact with such a virtual agent in order to collect enough data to properly analyse and further train the AI with. Thirdly, this project offers an opportunity to investigate how such virtual agents could be used as a tool to combat loneliness across different age groups, attempting to combat early signs of mental health concerns brought about by feelings of isolation.
Many Natural Language Processing tasks are being successfully completed by sophisticated models which learn complex representations from gigantic text or task-specific datasets. However, a subset of NLP tasks which involve simple human reasoning remain intractable. We hypothesise that this is because the current class of state-of-the-art models are not adequate for learning either relational or common-sense reasoning problems.
We believe there is a two-fold deficiency. Firstly, the structure of the models is not adequate for the task of reasoning, as they fail to impose sensible priors on reasoning such as pairwise commutativity or feature order invariance. Secondly, the data presented to these models is often insufficient for them to learn the disentangled, grounded functions which represent common-sense reasoning. Furthermore, a high degree of bias has been found in many existing datasets designed for reasoning tasks, indicating that current models may be predicting correct answers from unintended hints within the questions.
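The pairwise priors mentioned above can be made concrete with a small sketch: aggregating a relation function over all ordered object pairs (in the spirit of relation networks) yields an output that is invariant to the order in which objects are presented. The toy 1-d object 'features' and the relation function here are purely illustrative.

```python
from itertools import permutations

def relation_score(objects, g):
    """Sum a pairwise relation function g over all ordered object pairs.

    Summing over pairs builds in the priors discussed above: the
    aggregate is invariant to the order of the input objects.
    """
    return sum(
        g(a, b)
        for i, a in enumerate(objects)
        for j, b in enumerate(objects)
        if i != j
    )

# Toy relation: squared distance between 1-d object 'features'.
g = lambda a, b: (a - b) ** 2
objs = [1.0, 4.0, 2.0]
baseline = relation_score(objs, g)
# Every permutation of the input yields the same aggregate.
invariant = all(
    relation_score(list(p), g) == baseline for p in permutations(objs)
)
```

A learned model would replace `g` with a neural network over object embeddings, but the permutation-invariant aggregation is the structural prior itself.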
For the purposes of this project, we would define common-sense reasoning as the ability to:
Reason about relations between objects, both in single and multiple-step problems.
Achieve near human-level performance in 'common-sense' reasoning problems.
Ground task understanding with reference to the scene in which a question is asked, allowing for informed inference.
We aim to target real-world multi- and mono-modal relational reasoning tasks with models designed to capture the properties of relational reasoning. From this, we will investigate the possibility of extracting structured knowledge such as relational graphs or linguistic information for accuracy improvement on commonsense inference tasks such as VQA.
Investigate novel ways of allowing AI to learn object properties, such as using reinforcement learning in a physics engine to learn affordances – the properties of an object which define what actions can be performed on it.
Construct our own dataset to aid with the task of scene-grounded commonsense reasoning, which may include textual, visual, video, and physics information for the purpose of training a model which can learn object affordances and representations.
Can existing models for reasoning, which operate on synthetic data, be transferred to real world datasets and achieve acceptable accuracy?
Can features or models extracted from explicit reasoning tasks improve the accuracy of state-of-the-art models for multimodal commonsense tasks by augmenting the knowledge base to draw from?
What tasks and corresponding datasets would enable models to learn efficient, disentangled commonsense properties and relations which humans intuitively obtain? Are we able to create a useful and tractable task and dataset which advances the field?
Danae Sánchez Villegas
Multimodal communication is prominent on social platforms such as Twitter, Instagram, and Flickr. Every second, people express their views, opinions, feelings, emotions, and sentiments through different combinations of text, images, and videos. This provides an opportunity to create better language models that can exploit multimodal information. The purpose of this project is to understand language from a multimodal perspective by modeling intra-modal and cross-modal interactions. The modalities of interest are text, visual, and acoustic.
The integration of visual and/or acoustic features into NLP systems has recently proved beneficial in several tasks, such as sentiment analysis (Rahman, Wasifur, et al., 2019; Kumar, Akshi, et al., 2019; Tsai, Yao-Hung Hubert, et al., 2019), sarcasm detection (Cai, Yitao, et al., 2019; Castro, Santiago, et al., 2019), summarisation (Tiwari, Akanksha, et al., 2018), paraphrasing (Chu, Chenhui, et al., 2018), and crisis event classification (Abavisani, Mahdi, et al., 2020).
In this Ph.D. project, we will study and develop multimodal fusion methods and their integration in different stages of model architectures, such as neural networks and transformer models (Devlin, Jacob, et al., 2018). Moreover, we aim to evaluate these methods’ performance on sentiment analysis, social media analysis tasks such as abusive behavior analysis (Breitfeller, Luke, et al., 2019; Chowdhury, Arijit Ghosh, et al., 2019), and other NLP tasks.
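As a minimal illustration of the fusion strategies to be studied (not any specific model from the references below), the two standard baselines can be sketched: early fusion, which concatenates modality embeddings into one input vector, and late fusion, which combines per-modality prediction scores. The toy vectors and fusion weights are assumptions; real systems would learn both the embeddings and the weights.

```python
def early_fusion(text_vec, image_vec):
    # Early fusion: concatenate modality embeddings into one input vector
    # that a downstream model consumes jointly.
    return text_vec + image_vec  # list concatenation

def late_fusion(scores, weights):
    # Late fusion: combine independent per-modality predictions
    # with (normally learned) weights.
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * s for w, s in zip(weights, scores))

text_emb = [0.2, 0.7, 0.1]
image_emb = [0.9, 0.3]
fused_input = early_fusion(text_emb, image_emb)
# e.g. text model says 0.8 positive, image model says 0.4 positive.
sentiment = late_fusion(scores=[0.8, 0.4], weights=[0.6, 0.4])
```

Intermediate fusion, injecting one modality inside another model's layers as in M-BERT, sits between these two extremes and is one of the directions this project will explore.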
Abavisani, Mahdi, et al. "Multimodal Categorization of Crisis Events in Social Media." arXiv preprint arXiv:2004.04917 (2020).
Breitfeller, Luke, et al. "Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
Cai, Yitao, Huiyu Cai, and Xiaojun Wan. "Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Castro, Santiago, et al. "Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper)." arXiv preprint arXiv:1906.01815 (2019).
Rahman, Wasifur, et al. "M-BERT: Injecting Multimodal Information in the BERT Structure." arXiv preprint arXiv:1908.05787 (2019).
Chowdhury, Arijit Ghosh, et al. "#YouToo? Detection of Personal Recollections of Sexual Harassment on Social Media." Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
Chu, Chenhui, Mayu Otani, and Yuta Nakashima. "iParaphrasing: Extracting visually grounded paraphrases via an image." arXiv preprint arXiv:1806.04284 (2018).
Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
Kumar, Akshi, and Geetanjali Garg. "Sentiment analysis of multimodal twitter data." Multimedia Tools and Applications 78.17 (2019): 24103-24119.
Tiwari, Akanksha, Christian Von Der Weth, and Mohan S. Kankanhalli. "Multimodal Multiplatform Social Media Event Summarization." ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14.2s (2018): 1-23.
Tsai, Yao-Hung Hubert, et al. "Multimodal transformer for unaligned multimodal language sequences." arXiv preprint arXiv:1906.00295 (2019).
Modern approaches to Natural Language Processing are often developed under the assumption that all textual resources should be static, as coherent and unambiguous as possible, inherently global in meaning (independent of local factors) and ordered. From this perspective, dialogue is a challenging type of input, as it violates most of these assumptions.
As an interactive exchange between a narrow group of individuals, dialogue is highly personalised and contextualised. Its primary task is efficient communication on a small scale, and so the evolution of these communication channels happens rapidly and much more locally when compared to written texts of a given language. Contemporary NLP approaches are not sufficiently flexible to handle this phenomenon. Despite Dialogue Modeling being a self-standing task within NLP, most NLP methods expect correct grammar and logic in the texts they are applied to, and they become inefficient when applied to language typical of dialogue. A paradigm shift in how one approaches interpreting dialogue is required.
Accommodation describes the phenomenon that people will attune their speech to the discourse situation in the most suitable, "optimal" way (Giles and Ogay 2007). It has been extensively studied in the field of socio-linguistics. According to Giles and Ogay (2007), individuals will amend their accent, vocabulary, facial expressions and other verbal and non-verbal features of interaction in order to become similar to their conversational partners. The benefits of doing so can include sounding more convincing, signifying certain attributes (e.g. membership to a specific social group), or minimising the "distance in rank" between speakers. Similarly, Danescu-Niculescu-Mizil and Lee (2011) show that within a conversation, a participant will tend to mimic the linguistic style of other participants. By conducting an experiment on movie scripts they showed that this mimicry is something automatic for humans, a reflex that is not only a tool for "gaining social rank", and therefore is an inherent characteristic of dialogue.
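The style mimicry measured by Danescu-Niculescu-Mizil and Lee (2011) can be sketched with a simplified version of their coordination measure: how much more likely a reply is to contain a stylistic marker when the prompt contained it. The toy utterance pairs and the single marker class below are assumptions; the original work uses families of function-word markers.

```python
def coordination(exchanges, marker_words):
    """Simplified coordination toward one marker class.

    exchanges: list of (prompt, reply) utterance pairs (lowercased strings).
    Returns P(marker in reply | marker in prompt) - P(marker in reply);
    positive values indicate the replier mimics the prompter's style.
    """
    def has_marker(utt):
        return any(w in utt.split() for w in marker_words)

    if not exchanges:
        return 0.0
    triggered = [(p, r) for p, r in exchanges if has_marker(p)]
    if not triggered:
        return 0.0
    p_given = sum(has_marker(r) for _, r in triggered) / len(triggered)
    p_base = sum(has_marker(r) for _, r in exchanges) / len(exchanges)
    return p_given - p_base

pairs = [
    ("i really think so", "i really agree"),
    ("quite certain of it", "really yes"),
    ("see you tomorrow", "goodbye"),
    ("is that really true", "really it is"),
]
c = coordination(pairs, marker_words={"really"})
```

A translation system that preserved accommodation would aim to keep such coordination scores approximately unchanged between source and target dialogue.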
Accommodation is one of the prisms through which we can look at conversational behaviour. It highlights that the addressee and the situation between her and the speaker is crucial to interpreting the exchange. It can be argued that a framework which is centred around these findings may enhance the performance of dialogue interpretation systems.
The aim of this project is to examine how developments in the understanding of conversations can be introduced into various approaches to Natural Language Processing – how we can systematise and generalise them. As a case study, the task of translating subtitles for scripted entertainment from English into other languages will be considered. Only recently have novel approaches to Machine Translation started going beyond translating isolated sentences, by incorporating document-wide aspects into the process. However, document-level Machine Translation applied to dialogue has been underexplored in the literature (Maruf, Saleh and Haffari 2019).
Within the context of MT, if an automatic translator could be made to preserve the accommodation aspects present in dialogues, the resulting translation could be more dynamic and natural, as the contextual information and the mimicry of style between participants would be approximately preserved.
Danescu-Niculescu-Mizil, C. and Lee, L. (2011) 'Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs'. Cognitive Modeling and Computational Linguistics Workshop at ACL 2011, pp. 76-87.
Maruf, S., Saleh, F. and Haffari, G. (2019) 'A Survey on Document-level Machine Translation: Methods and Evaluation', pp. 1-29.
Giles, H. and Ogay, T. (2007) 'Communication Accommodation Theory', in B. B. Whaley & W. Samter (Eds.), Explaining communication: Contemporary theories.
This project is aimed at individuals who have had trauma or surgery to their vocal apparatus, eg laryngectomy, tracheostomy, glossectomy, who are unable to talk in a conventional manner but who have full motor control of the rest of their body. Previous research in 'silent speech recognition' and 'direct speech synthesis' has used a wide variety of specialised/bespoke sensors to generate speech in real-time from residual articulatory movements. However, such solutions are expensive (due to the requirement for specialised/bespoke equipment) and intrusive (due to the need to install the necessary sensors). As an alternative, it would be of great interest to investigate the potential for using a conventional keyboard as a readily-available and cheap alternative to specialised/bespoke sensors, ie a solution based on text-to-speech synthesis (TTS).
Of course, there are fundamental problems with using contemporary TTS as a communications aid:
the conversion from typed input to speech output is delayed rather than real-time,
even a trained touch-typist would be unable to enter text fast enough for a normal conversational speech rate,
the output is typically non-personalised,
it is not possible to control the prosody in real-time, and
it is not possible to control the affect in real-time.
These limitations mean that existing TTS users are unable to intervene in or join a conversation, unable to keep up with the information rate of exchanges, unable to express themselves effectively (ie their individuality and their communicative intent) and suffer a loss of empathic/social relations as a consequence.
So, what is needed is a solution that overcomes these limitations, and it is proposed that they could be addressed by investigating/inventing a novel end-to-end TTS architecture that facilitates:
Simultaneous real-time input/output (ie sound is produced immediately a key is pressed).
Conversational speech rates by embedding a suitable prediction mechanism (ie the spoken equivalent of autocorrect).
Configurable output that allows a user to 'dial-up' appropriate individualised vocal characteristics.
Real-time control of prosody, e.g. using special keys or additional sensors.
Real-time control of affect, e.g. by analysing the acoustics of key-presses or facial expressions from a webcam.
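The prediction mechanism in the second point above can be sketched as frequency-ranked prefix completion, a deliberately naive stand-in for a proper language model: as soon as a typed prefix uniquely suggests a likely word, the synthesiser could begin speaking it. The toy corpus and the `build_predictor`/`predict` names are assumptions for illustration.

```python
from collections import Counter

def build_predictor(corpus):
    """Frequency-ranked prefix completion: the 'spoken autocorrect' idea.

    Given what has been typed so far, propose the most frequent matching
    word so speech can be produced ahead of the full keystroke sequence.
    """
    counts = Counter(corpus.lower().split())

    def predict(prefix):
        candidates = [w for w in counts if w.startswith(prefix)]
        # Fall back to the raw prefix when nothing in the corpus matches.
        return max(candidates, key=lambda w: counts[w]) if candidates else prefix

    return predict

predict = build_predictor(
    "hello there hello again the weather is nice the weekend was nice"
)
```

A usable system would condition on the conversational context rather than raw unigram frequency, but the latency saving comes from the same idea: committing to output before the input is complete.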
Cohort 2 (2020)
Project title: Developing Interpretable and Computationally Efficient Deep Learning Models for NLP
Supervisor: Professor Nikos Aletras
Deep neural network models have achieved state-of-the-art results in different Natural Language Processing tasks in recent years. Their superior performance comes at the cost of understanding how they work and how their outputs can be justified. It is still challenging to interpret the intermediate representations and explain the inner workings of neural network models. The lack of justification for deep neural network models' decisions is considered one of the main reasons that deep learning models are not widely used in some critical domains such as health and law. Therefore, there is a need to design more interpretable and explainable deep neural network models while reducing their computing requirements.
This PhD research project aims to make deep neural network models for NLP more interpretable and explainable. As a first step, we will investigate these models' architecture and how each component works. Next, we will try different approaches to reduce the computational complexity of these models and improve their interpretability while achieving similar accuracy and performance. In addition, the research project will investigate how model interpretations can help in designing more data-efficient models. Furthermore, we will investigate novel methods for evaluating the quality of deep neural network interpretations.
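One simple, model-agnostic interpretability probe of the kind such work might compare against is leave-one-out (occlusion) token importance: score how much the model's output drops when each input token is removed. The stand-in 'model' below, which just counts positive words, and its word list are assumptions purely for illustration.

```python
def occlusion_importance(classify, tokens):
    """Leave-one-out saliency: the drop in the model's score when each
    token is removed. A simple, model-agnostic interpretability probe."""
    base = classify(tokens)
    return {
        tok: base - classify(tokens[:i] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
    }

# Stand-in 'model': scores a sentence by its fraction of positive words.
POSITIVE = {"great", "excellent"}
classify = lambda toks: sum(t in POSITIVE for t in toks) / max(len(toks), 1)

scores = occlusion_importance(classify, ["a", "great", "film"])
```

Note the cost: one forward pass per token, which is exactly the kind of computational overhead the project's efficiency goals must weigh against interpretation quality.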
Which components of deep neural networks affect the model's decisions?
How does the interpretability of deep learning models for NLP vary from task to task?
Can different pre-training approaches be more computationally efficient?
Can interpretability help in designing more data-efficient deep learning models for NLP?
How can model interpretations be evaluated?
Project title: Summarisation of Argumentative Conversation
Supervisor: Professor Rob Gaizauskas
Industry partner: Amazon
My research project seeks to combine a well-established subfield of NLP, summarisation, with the growing area of Argument Mining (AM). In essence, AM involves automatic analysis of texts involving argument (language that uses reasoning to persuade) in order to identify its underlying components. These components can then be used in the generation of a textual summary that allows human readers to quickly grasp important points about a discussion.
As an illustrative problem for argument summarisation, we can think of a media company that wishes to report on recent parliamentary debates around a controversial issue. If the volume of debates is large, a summary would clearly be useful. An ideal system which solved this problem would be able to identify the key opposing stances in the debate and summarise the arguments for each.
The difficulties inherent in this task are numerous. Principal among them is the fact that the key units of analysis in argumentative conversation rarely correspond to surface-level features like words, and inferring these latent structures is a non-trivial problem.
Within a machine learning framework, this can be restated as the problem of choosing the most appropriate output representation for a supervised machine learning model. One commonly-used representation for argumentative text is the Toulmin model, which identifies six distinct Argument Discourse Units (ADUs) and defines relations between them. Many existing systems use only a subset of these units, since distinctions between some different types of unit are difficult to identify in practice.
To give an example of these challenges, one ADU that is commonly annotated in such tasks is the Backing unit, used to give evidence to a Claim. Though the presence of such units may be indicated by the presence of certain words or phrases ("To illustrate this...", "Consider...", "As an example..."), in many cases, such explicit indicators are lacking, and the relation of the Backing unit to a Claim unit may merely be implied by the juxtaposition of Backing and Claim in adjacent sentences (see eg the examples in the literature).
In such cases, it may be impossible to identify the relation between two argument components on the basis of the text alone, and some sort of world knowledge may be required to do this. This could be encoded implicitly, using a pre-trained language model such as BERT, or explicitly, using a knowledge graph.
Since we are also addressing the task of summarising an argument, we expect to have to address the problems involved in the automatic generation of textual summaries. These include diversity, coverage and balance – in other words, producing summaries which do not repeat themselves, attempt to cover as much of the original text as possible, and give equal weight to parts of the text with equal importance.
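A classic way to trade coverage against diversity in extractive selection is Maximal Marginal Relevance (MMR): greedily pick the sentence that is most relevant while least redundant with what has already been selected. The sketch below uses toy relevance scores and a word-overlap redundancy measure as assumptions; a real system would derive both from learned representations.

```python
def mmr_select(candidates, relevance, similarity, k=2, lam=0.5):
    """Maximal Marginal Relevance selection.

    relevance[i]:     how central sentence i is to the source.
    similarity(i, j): redundancy between sentences i and j.
    lam balances relevance against redundancy with already-chosen items.
    """
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((similarity(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]

sents = ["pro argument A", "pro argument A restated", "con argument B"]
rel = [0.9, 0.85, 0.6]

def sim(i, j):
    # Toy redundancy: Jaccard word overlap between sentences.
    a, b = set(sents[i].split()), set(sents[j].split())
    return len(a & b) / len(a | b)

summary = mmr_select(sents, rel, sim, k=2)
```

Here the redundant restatement is skipped in favour of the opposing stance, which is precisely the balance property an argument summariser needs.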
In conclusion, the problem of automatically summarising argumentative text is one which includes many open sub-questions we intend to address, including how to best represent argument components, how to learn them from textual data, and how to use these to generate a useful summary.
Lawrence, John, and Chris Reed. "Argument mining: A survey." Computational Linguistics 45.4 (2020): 765-818. (see section 3.1 on text segmentation).
Toulmin, Stephen E. The uses of argument. Cambridge university press, 2003.
El Baff, Roxanne, et al. "Computational argumentation synthesis as a language modeling task." Proceedings of the 12th International Conference on Natural Language Generation. 2019.
Fromm, Michael, Evgeniy Faerman, and Thomas Seidl. "TACAM: topic and context aware argument mining." 2019 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, 2019.
Li, Liangda, et al. "Enhancing diversity, coverage and balance for summarization through structure learning." Proceedings of the 18th international conference on World Wide Web. 2009.
Project title: Robust Speech Recognition in Complex Acoustic Scenes
Supervisor: Dr Stefan Goetze
Industry partner: Toshiba
Human-machine interaction relying on robust far-field automatic speech recognition (ASR) is a topic of wide interest to both academia and industry. The constantly growing use of digital home/personal assistants is one such example.
In far-field scenarios, in which the recording microphone is at a considerable distance from the user, not only is the target speech signal picked up by the microphone, but also noise, reverberation and possibly competing speakers. While large research efforts have been made to increase ASR performance in acoustically challenging conditions with high noise and reverberation, the conversational aspect adds further challenges to acoustic and language modelling. If more than one speaker is present in the conversation, tracking the target speaker or separating overlapping speech into separate audio sources poses further challenges for ASR. Another layer of complexity is introduced by the recording equipment, such as the number of microphones and their positioning, since these can have a significant influence on the recorded signal used for ASR.
Deep machine learning methods have significantly advanced the state of the art in ASR in recent years. They are widely explored to implement various components of the ASR system, such as front-end signal enhancement, acoustic models or the language model, independently or jointly, in either hybrid or end-to-end (E2E) scenarios.
This project aims to increase ASR robustness in the challenging acoustic far-field scenario described above. For this, signal processing methods for signal enhancement and separation will be applied to improve ASR performance, and novel acoustic and language modelling methods will be developed. In this way, ASR performance will be increased in adverse conditions with noise and reverberation, and particularly in environments with competing speech signals. To achieve this goal, different techniques from the signal processing area, the machine learning area and combinations of both will be applied, with attention to the technical challenges associated with implementing such systems, particularly on smaller, low-power devices such as smart speakers or hearing aids. Additionally, novel techniques involving the use of semantic information to improve performance will be explored.
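One of the simplest enhancement techniques in this space is spectral subtraction: subtract an estimated noise magnitude spectrum from the noisy spectrum and resynthesise with the noisy phase. The sketch below assumes an exactly known noise estimate and synthetic tones, both unrealistic simplifications made purely for illustration; in practice the noise spectrum must be estimated from speech-free segments.

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, floor=0.01):
    """Basic magnitude-domain spectral subtraction.

    Subtracts the estimated noise magnitude spectrum, applies a spectral
    floor to avoid negative magnitudes, and keeps the noisy phase.
    """
    spec = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(noisy))

n = 256
t = np.arange(n)
clean = np.sin(2 * np.pi * 8 * t / n)            # target 'speech' tone
noise = 0.5 * np.sin(2 * np.pi * 60 * t / n)     # stationary interference
enhanced = spectral_subtraction(clean + noise, noise)

err_before = np.mean(((clean + noise) - clean) ** 2)
err_after = np.mean((enhanced - clean) ** 2)
```

Such fixed-rule front-ends are the baseline that the learned, jointly trained enhancement components in this project aim to surpass, particularly for non-stationary interference such as competing speakers.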
Technical documents, such as academic research papers, often contain content that is incomprehensible to a non-specialist audience. This can make it difficult for those without specific knowledge of the subject area to digest and understand ideas that such a piece of work conveys, even at a high-level.
Natural Language Generation (NLG) tasks such as text simplification, text summarisation and style transfer all revolve around the adaptation of their input data (eg a body of text) to fit a particular purpose, whilst preserving the essence of its core content. Methods used for such tasks could feasibly be used to summarise and explain the key ideas from technical documents in a format that is more digestible to a non-specialist audience. However, the style in which such ideas should be presented, and the degree to which technical content should be simplified, is highly dependent on the audience itself. People naturally adapt the manner in which they communicate to better suit their audience, but this is something that, historically, has seldom been taken into account in many NLG tasks. However, some more recent works have experimented with the incorporation of extra-textual information relating to the audience, such as gender [6] or grade-level information [1], with promising results.
This project will involve the research and application of automatic data-to-text NLG techniques that are able to adapt to a given audience. This could, for example, involve leveraging such audience information to provide a personalised output. These are to be used for the transformation of existing technical documents in an effort to make their content more suitable to a different audience or domain. As input data typically used for the aforementioned tasks is often significantly shorter than such technical documents, techniques for handling large documents will also be investigated.
How can we expand on existing research [1, 6] into leveraging audience meta-information when performing the aforementioned data-to-text NLG tasks? How can such information be taken into account to better evaluate the performance of models?
How can we adapt current methods of evaluation for the given NLG tasks to allow them to perform at document-level? Do new methods of evaluation need to be developed?
How can we expand upon alternative approaches to NLG tasks (eg deep reinforcement learning, unsupervised learning) to allow them to perform on entire/multiple documents? Can existing/novel techniques for handling large/multiple documents (eg hierarchical attention mechanisms [2, 5]) be applied to allow for good performance at document-level?
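To make the hierarchical attention idea concrete, here is a minimal plain-Python sketch, not the cited models themselves: toy dot-product attention over made-up vectors, where word vectors are attended into sentence vectors, which are in turn attended into a single document vector.

```python
import math

def softmax(scores):
    """Normalise raw scores into attention weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(vectors, query):
    """Weighted sum of vectors, with weights from dot-product scores against a query."""
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def hierarchical_encode(document, word_query, sentence_query):
    """Two levels of attention: word vectors -> sentence vectors -> document vector."""
    sentence_vectors = [attend(sentence, word_query) for sentence in document]
    return attend(sentence_vectors, sentence_query)
```

In the real models the queries and vectors are learned; here they are fixed toy inputs, but the compositional structure (attend within sentences, then across sentences) is the same.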
[1] Scarton, C., Madhyastha, P., & Specia, L. (2020). Deciding When, How and for Whom to Simplify. Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020).
[2] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical Attention Networks for Document Classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.
[3] Zhang, X., & Lapata, M. (2017). Sentence Simplification with Deep Reinforcement Learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 584–594.
[4] Surya, S., Mishra, A., Laha, A., Jain, P., & Sankaranarayanan, K. (2019). Unsupervised Neural Text Simplification. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2058–2068.
[5] Miculicich, L., Ram, D., Pappas, N., & Henderson, J. (2018). Document-Level Neural Machine Translation with Hierarchical Attention Networks. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
[6] Chen, G., Zheng, Y., & Du, Y. (2020). Listener’s Social Identity Matters in Personalised Response Generation.
Project title: Cross-Domain Idiomatic Multiword Representations for Natural Language Processing
Supervisor: Professor Aline Villavicencio
Computational methods for natural language processing (NLP) require representing sequences of language as vectors, or embeddings. Modern representational approaches generate contextualised word embeddings, where how a word is represented depends on the context in which it appears. The embeddings generated by state-of-the-art language representation models such as BERT and ELMo have pushed performance on a variety of NLP tasks; however, there are still areas in which these embeddings perform poorly.
One of these areas is handling multi-word expressions (MWEs), which are phrases that are semantically or syntactically idiosyncratic.
MWEs are pervasive in human language, estimated to comprise half of a speaker's lexicon, and are thus an important area for research. Examples include "kick the bucket", "by and large" and "car park".
In particular, state-of-the-art contextualised word embeddings have been shown to perform poorly at capturing the idiomaticity of MWEs, where the meaning of the expression is not deducible from the meanings of the individual words (non-compositionality). As an example of this, the meaning of "eager beaver" (an enthusiastic person) is not deducible from the meaning of "beaver".
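One common way to probe this, sketched below with made-up toy vectors rather than real model output, is to compare a phrase's embedding against a composition (here, a simple average) of its component word embeddings; a low cosine similarity suggests non-compositional, ie idiomatic, usage.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def compositionality_score(phrase_vec, word_vecs):
    """Similarity between a phrase embedding and the average of its component
    word embeddings; a low score hints at idiomatic (non-compositional) use."""
    dim = len(phrase_vec)
    composed = [sum(v[i] for v in word_vecs) / len(word_vecs) for i in range(dim)]
    return cosine(phrase_vec, composed)
```

With real contextualised embeddings the same scoring would apply, with "eager beaver" expected to score much lower than "car park".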
This inability is detrimental to the performance of NLP models on downstream tasks. In machine translation, for example, idiomatic phrases may be translated literally, conveying the wrong meaning.
The aim of this project is to investigate methods of developing embeddings that better deal with the idiomaticity of MWEs, particularly in a cross-domain setting, ie across different genres of text (for example medical and legal).
The novel embedding methods developed in this project would be evaluated both intrinsically and extrinsically.
Intrinsically, we wish to demonstrate an improvement of the ability of the model to deal with idiomaticity. This would be done using eg a dataset of sentences labelled as idiomatic or not to test the ability of a model to detect idiomaticity.
We then wish to show that an increase in intrinsic performance leads to an improved performance on downstream NLP tasks in a cross-domain setting (ie extrinsic evaluation). The evaluation method would depend on the task at hand, for example using the BLEU score for evaluation of cross-domain machine translation.
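As a concrete illustration of that extrinsic metric, a minimal single-reference, sentence-level BLEU (modified n-gram precision with a brevity penalty, no smoothing) can be sketched as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Modified precision: clip candidate n-gram counts by reference counts.
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(1, sum(cand.values()))
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)
```

In practice one would use an established implementation with smoothing and multiple references, but the core computation is as above.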
Aside from looking at how we can improve model performance on certain tasks, we also want to investigate how the developed methods can be used to boost the interpretability of models, that is, to help understand why models give the outputs that they do. It is hoped that the explicit treatment of MWEs will allow for a better understanding of the decisions made by models.
Can novel embedding methods be developed that better handle idiomatic multiword expressions, particularly in a cross-domain setting?
How does the use of these novel embedding methods affect the performance of models on downstream NLP tasks?
Can the interpretability of NLP models be improved through the explicit handling of multiword expressions?
Project title: Early Onset Cognitive Impairment Detection from Speech and Language Signals
Supervisor: Professor Heidi Christensen
Cohort 2 Student Representative
The detection of medical conditions using speech processing techniques is a rapidly emerging field, focusing on the development of non-invasive diagnostic strategies for conditions ranging from dementia, to depression and anxiety, to even Covid-19. One of the largest issues facing medical speech classification, and one that seldom impacts other areas of speech technology (at least in English), is the scarcity of available data. Many of the largest corpora looking at dementia, for example, contain only tens of thousands of words. In addition, these datasets usually contain very few speakers. Training a model on a limited pool of speakers creates a risk of overfitting to idiosyncratic, dialectal or accentual features of the participants, which in turn can gravely impact the efficacy of the classifier depending on the language features of the test subject.
Accent variation can have a large impact on speech technologies. The traditional approach to countering this impact is to use either a colossal corpus or accent-dependent models, selected by the user or dictated by geography, which are specifically trained on individuals with a similar accent. Unfortunately, this approach depends on a rich dataset from which to train such models, which, in the case of medical classification systems, simply does not exist.
This project aims to explore approaches for reducing the impact of accent and language variation on onset cognitive impairment detection systems. It will explore the impact of accents both on the construction of cognitive impairment detection classifiers, and on the compilation, initial processing and feature extraction of the datasets. Whilst much of this feature extraction work will take shape during the literature review, one example may be to investigate bilingual features.
Is it possible that dementia has a consistently detrimental impact on second language production that is distinctly different from the broken language you find in an early learner of the language?
For example, we know that individuals make the largest number of speech production errors on phones which are more similar between their L1 and L2, particularly when learning the language. Do we see a loss in the ability to maintain phonetic distinctions as someone's cognitive state declines, and are these features different enough from the inverse process of an L2 language learner to be classifiable? This project aims to develop normalisation strategies and new feature extraction methods for limiting the impact of accents and language variation on medical speech classification systems.
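One long-standing normalisation strategy that such methods could build on is cepstral mean and variance normalisation (CMVN), which standardises each feature dimension across an utterance and thereby removes some speaker- and channel-dependent offsets. A minimal sketch, with plain Python lists standing in for real cepstral feature frames:

```python
import math

def cmvn(frames):
    """Cepstral mean and variance normalisation: per-dimension zero mean and
    unit variance across an utterance's feature frames."""
    dim = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        std = math.sqrt(var)
        stds.append(std if std > 0 else 1.0)  # guard against constant dimensions
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]
```

CMVN alone does not remove accent-specific phonetic variation, which is exactly the gap this project targets, but it illustrates the general shape of a normalisation step applied before classification.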
The importance of this research stems from the growing enthusiasm to implement onset cognitive impairment detection systems in the medical industry. Issues arise where the tools may only be effective for certain demographics, creating significant concern over potential inadvertent segregation caused by the technologies. Tools from facial recognition systems to credit scoring systems have seen substantial criticism, both past and present, for performing poorly on, or adversely impacting, certain groups of people. It remains vital that medical speech technology is non-discriminatory and provides universally stable efficacy across as many demographics as possible.
Project title: Spoken Language Interaction Between Mismatched Partners
Supervisor: Professor Roger K Moore
Spoken-language based interactions between a human being and an artificial device (such as a social robot) have become very popular in recent years. Multi-modal speech-enabled artefacts are seen in many fields, such as entertainment, education and healthcare, but the user experience of spoken interaction with such devices is not very satisfactory. For example, these devices sometimes fail to understand what a command means, or fail to provide a relevant answer to questions; people do not know how to make their commands more understandable to artificial devices; and the spoken interaction is limited to a rigid style and lacks rich interactive behaviours.
Users' dissatisfaction is partly caused by the limited language abilities of social agents, such as inaccurate automatic speech recognition (ASR). Beyond that, it is hypothesised that the dissatisfaction is also caused by the mismatched abilities of humans and artificial devices. Due to limited cognitive abilities, such agents can neither take the perceptions of others into consideration nor react to the situation accordingly, which results in unsuccessful communicative interaction.
Communicative interaction efficiency involves multiple factors across disciplines such as linguistics, psychology and artificial intelligence. However, the roles of these influential factors, and the mechanisms by which they interact with each other, are unknown. This project aims to develop a unified framework which can characterise communicative interaction efficiency. It will investigate factors and strategies that make a spoken-language based interaction effective. Based on an understanding of the nature of spoken language interaction between humans and artificial devices, the next objective is to maximise the affordance of speech-enabled artefacts, and to achieve more effective communicative interaction between a human being and an artificial device.
Preliminary questions so far would appear to be as follows.
Definition of effectiveness: What makes a spoken-language based interaction effective in human-human interaction? Is it the same for interaction between a human and an artificial device? If so, in what way?
Affordance of speech-enabled artefacts: What affects the affordance of speech-enabled artefacts? How would it affect communicative interaction effectiveness? For example, some people may argue that naturalness is about making speech-enabled artefacts more human-like, with human-like voices and appearances. Is naturalness helpful in maximising the usability of speech-enabled artefacts? If a natural voice or appearance is not matched by the artificial device's limited language and cognitive abilities, would it cause an 'uncanny valley' effect?
Short-term and long-term interaction: Do people's expectations of speech-enabled artefacts change over time? How would it change the way artificial devices interact with people?
Modelling: Can the communicative interaction effectiveness be modelled? What are different levels of communicative interaction effectiveness? How could those levels be applied to speech-enabled artefacts?
It is hoped that the results of this project will provide general guidelines for communicative interaction between a human being and an artificial device. It is anticipated that such guidelines will serve as a starting point when people design or improve multi-modal speech-enabled artefacts.
Project title: Hybrid Approaches for Multilingual Speech Recognition
Supervisor: Dr Anton Ragni
Industry partner: Toshiba
The effectiveness of data-driven Automatic Speech Recognition (ASR) depends on the availability of suitable training data. In most cases, this takes the form of labelled data, where speech clips (or a numerical representation of them) are matched to their corresponding words or phonetic symbols. Creating labelled training data is an expensive and time-consuming process. For some languages and varieties, such as native British and American English, there is ample data for building ASR systems. For many other languages, however, a large volume of accessible training data does not exist. How can the quality of ASR for under-resourced languages be improved?
Techniques for conducting ASR typically fall into two categories: parametric and non-parametric. Parametric ASR methods use data to train a model. This can be thought of as a 'black box' function, which takes speech as input and outputs the corresponding text, ideally with a high degree of accuracy. Once trained, the model is used without further reference to the training data. The training data is summarised by the model, but some of the information it contains is lost in the process, and it cannot be reinterpreted in the presence of additional data. Parametric models, such as deep learning-based neural networks, are usually trained from scratch for the specific task they will be used for, which makes them extremely effective, but means they cannot be easily generalised to other languages. They are often not easy to interpret, and they are usually not as effective when data is scarce because they have a lot of parameters, and determining the values of many parameters requires a lot of data.
Non-parametric approaches, such as Gaussian processes, instead use the training data directly. In this case, the labelled training data are available at the decision-making stage. This preserves all of the information in the training data, and requires fewer assumptions to be made when predicting text from speech. It also, at least in principle, allows sensible decision-making even with limited data. The downside is that it is not easily scalable, meaning that the computational power and data storage requirements increase rapidly as the amount of data increases. Large quantities of training data therefore become unfeasible.
The two approaches, parametric and non-parametric, have different strengths. When large amounts of data are available, capturing most of the information from a large data set is better than capturing all of the information from a smaller training data set, so that the parametric approach may perform better. When training data is scarce, the non-parametric approach may perform better, since all of the limited training information is preserved.
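The contrast can be made concrete with the simplest possible non-parametric method: a 1-nearest-neighbour classifier which, like the Gaussian processes mentioned above, keeps all labelled training examples available at decision time rather than summarising them into parameters. This is a toy illustration, not an ASR system:

```python
def nearest_neighbour_label(training_data, query):
    """Non-parametric decision: retain every labelled example and classify a
    query by the label of its closest training point (squared Euclidean
    distance). No training step; all information in the data is preserved."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    _, best_label = min(training_data, key=lambda ex: dist(ex[0], query))
    return best_label
```

The scalability problem described above is visible even here: every prediction scans the entire training set, so cost grows with the data rather than being fixed by a trained model.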
The objective of this project is to explore and evaluate approaches inspired by both parametric and non-parametric methods for multi-lingual automatic speech recognition.
Cohort 3 (2021)
Project title: Disentangling Speech in a Multitalker Environment
Supervisor: Dr Stefan Goetze
Industry partner: Meta
Semantic Word Embedding
End to End ASR
The automated annotation of real-life multi-talker conversations requires systems to perform high-quality speech transcription as well as accurate speaker attribution. These two tasks correspond to two well-studied concepts in the domain of speech technology: automatic speech recognition (ASR) and diarization, respectively. The latter concerns segmenting speech by talker, effectively resolving the problem of ‘Who spoke when?’. Recent advances in ASR systems have shown significant improvements in word error rate (WER) over the past decade, with state-of-the-art systems achieving scores similar to human-level performance on common ASR benchmarks. Despite this, even highly sophisticated systems struggle with the more challenging task of ASR in a multi-talker context, where there is typically a significant amount of overlapping speech. The main issues associated with multi-talker situations are: the irregular distances between the recording device and each talker; the variable, often unknown, number of distinct talkers; the high spontaneity of conversational speech, which renders it grammatically dissimilar to written language; and, most importantly, overlapping speech segments which occur due to talkers speaking simultaneously. The culmination of these complications is why multi-talker ASR is typically considered one of the most challenging problems in the domain today. This is particularly the case for recording configurations which use only a single distant microphone (SDM), since techniques utilising the geometry between microphone arrays and speakers, such as beamforming, cannot be used. For ASR to be freely applied in more challenging real-life scenarios, like general conversations or outside environments, without the need for specific microphone configurations, significant improvements are still required.
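As a small illustration of how 'who spoke when' information is quantified, the hypothetical utility below computes the fraction of speech time during which two or more talkers are simultaneously active, given diarization-style (speaker, start, end) segments in seconds:

```python
def overlap_ratio(segments):
    """Fraction of total speech time (time with at least one active talker)
    during which two or more talkers are active simultaneously."""
    events = []
    for _, start, end in segments:
        events.append((start, 1))   # a talker becomes active
        events.append((end, -1))    # a talker becomes inactive
    events.sort()
    active, last_t = 0, None
    speech, overlap = 0.0, 0.0
    for t, delta in events:
        if last_t is not None and active > 0:
            speech += t - last_t
            if active > 1:
                overlap += t - last_t
        active += delta
        last_t = t
    return overlap / speech if speech else 0.0
```

Real diarization scoring (eg diarization error rate) is more involved, but overlap fractions like this are one reason multi-talker benchmarks are so much harder than single-speaker ones.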
Previous literature has demonstrated that combining the two distinct tasks of ASR and diarization into a single multi-task end-to-end framework can achieve state-of-the-art performance on certain relevant benchmarks whilst requiring only an SDM recording configuration. This is because knowing when an active talker’s turn has ended, and who the talker is, significantly influences the probability distribution of the following word, thus benefiting the ASR component of the system. Likewise, information pertaining to who the active talker is, and when the conversation is likely to transition to a different active talker, is intrinsic to the verbal content of the conversation. It therefore seems intuitive that leveraging this mutually beneficial information, solving the conventionally independent tasks simultaneously, improves the overall system performance.
With the increasing popularity of augmented reality/smart wearable devices and the introduction of audio-visual datasets which include real life events recorded egocentrically, a new avenue of research in the domain has been presented. This project hopes to focus on exploring methods of exploiting the visual component of this multi-modal data in a similar manner to the aforementioned speaker-attributed automatic speech recognition (SA-ASR) multi-task system. For example, the aggregate gaze of conversation participants is often representative of who the overall party is attending to, and therefore who is most likely speaking. This information would likely benefit the diarization component of a SA-ASR system, thus indirectly improving the accuracy of the ASR component. To conclude, the aims of this project are to explore methods of supplementing conventional speaker attribution and ASR systems, which rely exclusively on audio, with pertinent features extracted from corresponding visual data.
Project title: Speech analysis and training methods for transgender speech
Supervisor: Prof Heidi Christensen and Dr Stefan Goetze
Cohort 3 Student Representative
Natural language generation
SLT for social change
An issue present in the transgender community is that of voice-based dysphoria. In other words, a profound, constant state of unease or dissatisfaction with the way one's voice sounds. Individuals looking to alleviate such discomfort benefit from regular speech therapy sessions provided by a trained speech therapist. In the UK, this would usually take place as a part of the NHS at one of seven adult gender identity clinics. However, not only are these sessions not typically long or frequent enough, there are also a number of issues surrounding the accessing of such care, including long wait times.
Through discussions with both speech therapists and members of the transgender community, this project aims to explore how well the current needs of the transgender community are being met with regards to speech/voice therapy, as well as which voice analysis and training methods are currently being used during such sessions and how these methods could be adapted for a digital system.
There are a number of interesting directions in which I hope to take this research. Firstly, the project will investigate various signal analysis metrics and features - or, put simply, how computers store and understand human voices - and look at how to derive meaningful user feedback from this information. In addition, the project will look into related fields such as atypical speech, which investigates support systems and detection methods for vocal impairments, such as those found in people with Parkinson's disease. The project will also investigate existing systems for automatic gender classification from users' speech, and their development over time, including looking at how adapting this work can help alleviate the dysphoria of transgender individuals.
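Fundamental frequency (F0, perceived as pitch) is among the signal features most relevant to voice gender perception and thus to this kind of feedback. A toy autocorrelation-based F0 estimator is sketched below on a synthetic tone; real pitch trackers (eg YIN or RAPT) are far more robust on actual speech:

```python
import math

def estimate_f0(samples, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate fundamental frequency as the autocorrelation peak within a
    plausible pitch range. A toy method for clean, voiced signals only."""
    def autocorr(lag):
        return sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag

# A synthetic 150 Hz tone, 0.1 s at an assumed 8 kHz sample rate
sr = 8000
tone = [math.sin(2 * math.pi * 150 * n / sr) for n in range(sr // 10)]
```

A training system might compare estimates like this against a user's target pitch range and report the difference as feedback.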
Eventually, this work will culminate in the development of a system which will be capable of supporting higher-frequency, professional quality voice training for members of the transgender community.
Project title: Leveraging long-context information to enhance conversation understanding
Supervisor: Dr Anton Ragni
Industry partner: Meta
Information retrieval and recommender systems
Text simplification, and other approaches that improve the accessibility of literature
Open-Domain Question Answering
Controllable Natural Language Generation
Uncertainty and explainability in neural networks
Speech and natural language data will often contain many long-range dependencies, where there is some statistical correlation between words or speech segments that are far apart in the sequence. For example, in conversational data, the interactions between multiple speakers can produce sequences of utterances that are logically connected and reference information from far in the past. As a consequence, systems that have a greater capacity to model and learn these relationships will show greater predictive performance.
Despite this, the majority of current Automatic Speech Recognition (ASR) systems are designed to model speech as an independent set of utterances, where information from surrounding segments of speech is not utilised by the model. Of course, in many real-world scenarios these segments of speech are highly interrelated, and previously processed data can be used to form a useful prior on which to condition future transcription. While previous work has investigated the incorporation of context from surrounding utterances [2, 3], doing so effectively with both linguistic and acoustic data remains an open problem. Indeed, many of the current architectures used for modelling sequential data are limited in their ability to utilise very long-distance dependencies.
This project aims to investigate approaches that enable automatic speech recognition systems to incorporate information from a much broader context. Ideally, in scenarios such as work-related meetings, ASR systems should be able to utilise relevant information from across the entire discourse to aid transcription. Additionally, an element of this work can look at what sorts of long-range dependencies are present in long-format speech datasets, and how well these are modelled by our current speech recognition systems.
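One simple way to probe such dependencies in a corpus, in the spirit of Ebeling and Pöschel's long-range correlation analysis, is to measure how often a token recurs at a fixed distance; in natural text this probability decays slowly with distance, one signature of long-range structure. A minimal sketch:

```python
def repeat_probability(tokens, distance):
    """Probability that the token at position i reappears at position
    i + distance. Comparing this across distances gives a crude picture of
    how far statistical dependencies extend in a sequence."""
    pairs = [(tokens[i], tokens[i + distance]) for i in range(len(tokens) - distance)]
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```

Richer variants (eg pointwise mutual information between positions) follow the same pattern, and could be used to characterise the long-format speech datasets mentioned above.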
[1] Ebeling, W., & Pöschel, T. (1994). Entropy and Long-Range Correlations in Literary English. EPL (Europhysics Letters), 26(4), 241.
[2] Sun, G., Zhang, C., & Woodland, P. C. (2021). Transformer Language Models with LSTM-Based Cross-Utterance Information Representation. ICASSP 2021.
[3] Hori, T., Moritz, N., Hori, C., & Le Roux, J. (2021). Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers. Interspeech 2021.
[4] Khandelwal, U., He, H., Qi, P., & Jurafsky, D. (2018). Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. ACL 2018.
Project title: Speaking to the Edge: IoT microcontrollers and Natural Language Control
Supervisor: Prof Hamish Cunningham
Clinical applications of speech technology
NLP for social media analysis
Online misinformation detection
A new generation of network-connected microcontrollers are fast becoming the ubiquitous equivalents of yesterday’s embedded computation devices. Instead of hiding anonymously in the control systems of TVs, central heating systems and cars, these devices now play an increasing role in user interfaces. Smart speakers like Google Home are dominated by ARM-based microprocessors, yet a multitude of Internet of Things (IoT) applications (driven by tight constraints on power and cost) rely on much smaller computational cores. While the IoT has been gaining increasing attention over the past decade as a term to describe the connection of microcontrollers to the internet, the small size and power of the devices give rise to privacy concerns that cannot be solved using traditional techniques. Simply connecting these devices to a network makes them vulnerable to cyberattacks, and the low power of the devices means cloud services are often relied upon for data aggregation, storage and computation. Increasingly in the media we see stories of privacy concerns surrounding smart speakers, such as security bugs allowing access to voice history from Alexa devices.
This project addresses the extent to which voice control interfaces running on IoT devices can enhance privacy by reducing reliance on the cloud. We will do this by surveying, prototyping and measuring processing capabilities and architectural options for IoT-based voice control in the context of smart homes.
Open source technologies are a prerequisite of verifiable security; migrating IoT infrastructure to open source alternatives would move us towards alleviating privacy concerns through transparency. Secondly, local in-house processing that does not transmit data to the cloud would mitigate a large number of privacy issues. Since speech signals carry rich information relating to the user's gender, emotions, health, intent and characteristics, it is essential that signals received by voice control devices remain private. We will initially deploy an open source Automatic Speech Recognition (ASR) system that processes speech data locally on the device (and/or a gateway or hub), instead of sending it to cloud services. ASR is the task of finding out what has been said, and is a widely studied field with published state-of-the-art benchmarks which we can use to evaluate the relative performance of our ASR system.
Edge computing and federated learning are key fields relating to our aim of eliminating the transmission of sensitive data over networks. Edge computing refers to processing data locally on users' devices, while federated learning trains machine learning models across many devices, with only model updates, rather than raw data, sent to a coordinating server. We will train a model to recognise commands within the domain of smart home control. However, training an ASR model from scratch requires huge amounts of data, and collecting it is a time-consuming and costly process. Instead, we can use transfer learning approaches to fine-tune a pretrained model on a small amount of data collected from our domain. For example, speaker adaptation could be employed to fine-tune the speech recognition model using samples of the user's voice.
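One benefit of a constrained command domain is that even noisy on-device recognition output can be snapped onto the nearest known command. The sketch below is a hypothetical illustration (the command list is invented, not part of the project) using token-level edit distance:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # delete from a
                            curr[j - 1] + 1,        # insert into a
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]

# Hypothetical in-domain command set for illustration only
COMMANDS = ["turn on the lights", "turn off the lights", "set temperature"]

def match_command(transcript, commands=COMMANDS, max_dist=2):
    """Map a (possibly noisy) transcript onto the closest in-domain command,
    rejecting anything too distant from every known command."""
    words = transcript.lower().split()
    best = min(commands, key=lambda c: edit_distance(words, c.split()))
    return best if edit_distance(words, best.split()) <= max_dist else None
```

This kind of post-processing keeps all computation local, in line with the privacy goals above.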
Project title: Towards Dialogue Systems with Social Intelligence
Supervisor: Dr Chenghua Lin
My interests on the speech side of things are emotion detection (generally, as well as mental health applications and effects related to intoxication) and voice cloning (particularly due to the criminal use of deepfakes and exploiting biometric security), whilst on the NLP side of things my primary interest would be mis/dis-information detection.
Whilst chatbots and dialogue systems of the form seen in everyday life (Apple’s Siri or Amazon’s automatic helpdesk, to name a few) are often viewed as task-oriented and transactional, and therefore engaged with solely to achieve an end goal such as returning a product or answering a query, real human communication is rich in social features and unavoidable pragmatic considerations. Consequently, in order to achieve human-like language competency in a conversational setting, these conversational agents require an “understanding” of social intelligence, such as pragmatics, empathy and humour, in order to better integrate into daily life and be interacted with more naturally.
Although social intelligence and pragmatic competence may not be essential for purely task-oriented dialogue systems, there is still potential for these areas to benefit from developments in the field. For example, within the aforementioned customer service helpdesks, improved social intelligence could result in better responses to customers exhibiting a range of emotions (such as frustration when returning an item), as well as build a rapport with the customer base, leading to increased brand loyalty. On the other hand, dialogue agents exhibiting social intelligence have clearer applications in other domains, such as in mental health services for talking therapy, and in the videogames industry for creating more personalised and interactive dialogue trees for narrative progression.
Building upon previous work in empathetic dialogue systems, this project aims to develop a novel high-quality annotated dataset of conversational humour using theories from the domains of linguistics and social/cognitive psychology, to then aid in the development of novel computational models that enable a conversational agent to be aware of context and generate appropriate humorous responses. The aim of these generated responses is, therefore, to incorporate humour, but also to be aware of the conversational pragmatics that govern when is and is not an appropriate time to incorporate said humour in order to avoid the law of diminishing returns or being insensitive. In aiming for such outcomes, the development of anthropomorphic conversational agents within dialogue systems moves away from the focus on omniscience in digital assistants and towards social competence.
Brown, P., & Levinson, S. C. (1987). Some Universals in Language Usage. Cambridge University Press.
Pamulapati, V., Purigilla, G., & Mamidi, R. (2020). A Novel Annotation Schema for Conversational Humour: Capturing the cultural nuances in Kanyasulkam. The 14th Linguistic Annotation Workshop, Barcelona, 34-47. https://aclanthology.org/2020.law-1.4/
Schanke, S., Burtch, G., & Ray, G. (2021). Estimating the Impact of "Humanizing" Customer Service Chatbots. Information Systems Research, 32(3), 736-751. https://doi.org/10.1287/isre.2021.1015
Skantze, G. (2021). Turn-taking in Conversational Systems and Human-Robot Interaction: A Review. Computer Speech & Language, 67, 1-26. https://doi.org/10.1016/j.csl.2020.101178
Veale, T., (2003). Metaphor and Metonymy: The Cognitive Trump-Cards of Linguistic Humour. International Linguistic Cognitive Conference, La Rioja
Xie, Y., Li, J., & Pu, P. (2021). Uncertainty and Surprisal Jointly Deliver the Punchline: Exploiting Incongruity-Based Features for Humour Recognition. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 2, 33-39. https://doi.org/10.18653/v1/2021.acl-short.6
Project title: Factorisation of Speech Embeddings
Supervisor: Prof Thomas Hain
Industry partner: LivePerson
Natural language processing
Speech information processing
Machine learning with applications to the healthcare domain
Observed speech signals are the outcome of a complex process governed by intrinsic and extrinsic factors. The combination of such factors is a highly non-linear process, and most of the factors, as well as the relations between them, are not directly observable. To give an example, the speech signal is normally considered to have content, intonation and speaker attributes, combined with environmental factors. Each of these, however, breaks down into several sub-factors; eg content is influenced by coarticulation, socio-economic factors, speech type, etc. While the learning of factors has been attempted in many ways over the decades, the scope and structure of factorial relationships were usually considered to be simple, and often assessed in a discrete space. Recent years have seen attempts in embedding space that allow much richer factorisations. This is often loosely referred to as disentanglement, as no relationship model is required or included in the factor separation. One of the challenges for factorising speech representations is that the concept of factorisation is not well-defined across domains, such as the image domain and the speech domain.
In this project, the objective is to learn mappings that encode relationships between separated factors. Such relationships may be linear, hierarchical, or encoded in terms of a graph of a formal group of mathematical operations. It is, however, important that such relationship encodings remain simple in structure. Formulating such relationships can serve as a new way to express objective functions, and allow us to extract richer information about the relationships between these factors. One can imagine both supervised and unsupervised learning approaches.
Early examples of such models were published in Hsu et al. (2017), where a simple separation is used to separate speaker and content. Shi and Hain (2021) show a simple separation technique for separating different speaker embeddings; however, that work was restricted to two-speaker scenarios. Increasing the number of sources (e.g. speakers) makes the factorisation process more complex in terms of modelling, and it becomes difficult to keep the relationship between the separated factors simple. The goal of this work is to propose different ways to factorise the speech signal into latent-space factors while maintaining a simple relationship between them, and later to reuse them in various speech technology tasks.
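As a toy illustration of what a "simple relationship between separated factors" might look like, the sketch below assumes a purely additive combination of speaker and content factors (a deliberate oversimplification; real speech combines factors non-linearly). Under that assumption, de-mixing reduces to subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                 # embedding dimension (toy value)
speakers = rng.normal(size=(3, D))    # latent speaker factors
contents = rng.normal(size=(5, D))    # latent content factors

# Observed utterance embeddings under the (illustrative) additive model:
# E[i, j] = speakers[i] + contents[j]
E = speakers[:, None, :] + contents[None, :, :]

# With speaker labels, each speaker factor can be estimated by averaging
# over all contents and removing the shared content bias.
content_mean = contents.mean(axis=0)
est_speakers = E.mean(axis=1) - content_mean

# De-mixing: subtracting the estimated speaker factor recovers the content.
est_content = E[1, 3] - est_speakers[1]
print(np.allclose(est_content, contents[3]))  # True
```

The point of the sketch is only that once a relationship model (here, addition) is made explicit, factor separation becomes a well-posed operation rather than an unconstrained disentanglement.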
Hsu, Wei-Ning et al. "Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data." NIPS (2017).
Jennifer Williams. 2021. End-to-end signal factorization for speech: identity, content, and style. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI'20). Article 746, 5212-5213.
Jennifer Williams, "Learning disentangled speech representations".
Yanpei Shi, "Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks".
Shi, Yanpei and Thomas Hain. "Supervised Speaker Embedding De-Mixing in Two-Speaker Environment." 2021 IEEE Spoken Language Technology Workshop (SLT) (2021): 758-765.
Project title: Enhanced Deep Neural Models to Capture Nuances in Language such as Idiomatic Expressions
Supervisor: Prof Aline Villavicencio
I fell in love with Computational Linguistics thanks to multi-word expressions, but I'm also interested in the practicalities of using SLTs - domain adaptation and robustness, emojis and rapidly-evolving language, fairness, equality and sustainability.
Multi-word expressions (MWEs) - phrases whose meaning as a whole is greater than the sum of their parts - occur frequently in language, and make up around half of the lexicon of native speakers. Idioms in particular present significant challenges for language learners and for computational linguists. While the meaning of a phrase like car thief can be understood from its constituent words, an unfamiliar idiom like one-armed bandit or green thumbs is completely opaque without further explanation.
Deep neural network models, which underpin state-of-the-art natural language processing, construct their representations of phrases by converting individual words to mathematical vectors (known as word embeddings), and then combining these vectors together. This means that they struggle to accurately handle idiomatic expressions which change the meaning of their constituent words - a red herring is not a fish, and no animals are harmed when it is raining cats and dogs.
This limits the usefulness of these models when performing all manner of natural language processing (NLP) tasks; imagine translating “It’s all Greek to me” into Greek one word at a time while preserving its meaning, identifying the sentiment behind a review which describes your product as being as useful as a chocolate teapot, or captioning a picture of a zebra crossing, while unfamiliar with those phrases.
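The compositional failure described above can be sketched with hand-crafted toy vectors (entirely hypothetical, standing in for learned word embeddings): averaging the vectors for red and herring yields something fish-like, far from the idiom's true meaning of a distraction.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy 3-d "embeddings": dimensions loosely read as [colour, fish-ness, misdirection].
red         = (1.0, 0.0, 0.0)
herring     = (0.0, 1.0, 0.0)
fish        = (0.0, 1.0, 0.2)
distraction = (0.0, 0.0, 1.0)

# Naive composition: average the constituent word vectors.
composed = tuple((r + h) / 2 for r, h in zip(red, herring))

# The composed phrase sits near "fish", not near the idiomatic meaning.
print(cosine(composed, fish) > cosine(composed, distraction))  # True
```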
The aim of this project is to enhance the capabilities of neural network models to understand idiomatic expressions by finding ways to provide them with world knowledge, indications of which words go together in a phrase, or the ability to identify when the presence of an unexpected word might indicate an idiomatic usage.
Our jumping-off point will be to explore how various groups of human readers (including native speakers and language learners) process familiar and novel multi-word expressions, whether they are able to work out their meaning and how much contextual information is required to do so.
This will provide insight into the challenges faced by humans dealing with MWEs and the strategies they employ to do so. We can then examine neural network models and understand where the challenges they experience differ from or resemble our own.
This should enable us to take inspiration from human techniques to enhance the structure and training of our models, before evaluating how much our changes improve model performance on NLP tasks. All of this will hopefully bring us closer to our goal of making idioms less of a pain in the neck!
Project title: End-to-end Arithmetic Reasoning in NLP tasks
Supervisor: Dr Nafise Moosavi
Novel and robust SLT models
The field of natural language processing has recently achieved success in areas such as search retrieval, email filtering and text completion. Most people use these tools every day when doing a web search, deleting a spam email or writing a message. Nevertheless, even the newest AI models such as GPT-3 struggle to understand numerical data; this can result in pitfalls such as incorrect predictions or factually inaccurate outputs when the input contains numerical values. This project aims to develop techniques to improve arithmetic reasoning in the interpretation of natural language.
First of all, it is important to distinguish between computers, which can in general perform very elaborate mathematics, and models that focus on written language. Behind powerful mathematical code lies a well-versed mathematician who programmed it. We do not claim that computers cannot do maths, but we want them to be able to use it without a human whispering what to do. For instance, in school, when students understand how to multiply two numbers, say 82 and 85, it is expected that they will be equally good at finding the product of 82 and 109. However, GPT-3 correctly answers the former question but fails at the latter. Clearly, the current tools used when working with natural language are not successful at arithmetic reasoning. Evidence suggests that performance correlates with information seen during training, but the systems are too complex to verify this. In other words, they lack various reasoning skills, including arithmetic ones, and they may not understand numbers as well as they understand common words.
To tackle the problem, the initial stage of the research is to explore existing maths question datasets to identify their limitations. The focus is on worded problems like "Alice and Brinda equally share 8 sweets. Brinda buys 2 and gives 1 to Alice. How many sweets does Brinda now have?". A key step will be to design evaluation sets that measure different aspects of arithmetic reasoning, for example by developing contrast sets, i.e. similar questions with different wording, such as "What is 7 times 6?" and "Give the product of 6 and 7". With worded problems, altering the numbers or the words used could highlight whether the model struggles more with the context of the problem or with the maths itself.
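As an illustration, a minimal contrast-set generator for multiplication questions might look like the following (the templates and perturbation scheme are illustrative, not the project's actual design):

```python
# Surface templates that all express the same multiplication problem.
templates = [
    "What is {a} times {b}?",
    "Give the product of {a} and {b}.",
    "Multiply {a} by {b}.",
]

def contrast_set(a, b):
    """Same maths under different wording, plus one numeric perturbation
    that keeps the wording fixed but changes the answer."""
    reworded = [(t.format(a=a, b=b), a * b) for t in templates]
    perturbed = [(templates[0].format(a=a, b=b + 1), a * (b + 1))]
    return reworded + perturbed

for question, answer in contrast_set(7, 6):
    print(question, "->", answer)
```

A model that answers the reworded variants inconsistently struggles with the language; one that fails only the perturbed variant struggles with the maths.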
A further investigation could be to explore how arithmetic reasoning capabilities are coupled with the underlying understanding of the language. A method to better understand this phenomenon would be to use multilinguality to test whether language is key for arithmetic reasoning in existing models or whether they can perform maths language agnostically. For instance, would German’s use of terms like “Viereck, Fünfeck, …” meaning “four-corner, five-corner, …” for the nomenclature of polygons make the maths problems more understandable in German than other languages?
Moreover, a curriculum learning approach could be employed by having different questions with different levels of difficulty. The easiest set could solely focus on using the most basic operation i.e. addition. More difficult questions may involve operations like division but also more difficult number sets like negatives and decimals.
In addition, we will design optimal prompting techniques for enhancing few-shot arithmetic reasoning. Few-shot learning is the idea of showing the model only a few instances of a specific problem to see if it generalises to unseen examples given previous knowledge. If the model can add integers, could it do the same for decimals given only a few examples?
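A few-shot probe of this kind can be sketched as a simple prompt-construction step (the demonstration format below is a hypothetical example, not a prescribed design): integer-addition demonstrations are followed by a decimal query, to test whether the pattern generalises.

```python
# Hypothetical few-shot demonstrations: integer addition only.
demonstrations = [("12 + 7", "19"), ("3 + 45", "48"), ("20 + 2", "22")]

# The query probes generalisation to decimals the model has not seen.
query = "1.5 + 2.25"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in demonstrations)
prompt += f"\nQ: {query}\nA:"
print(prompt)
```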
Overall, this research aims to improve the arithmetic reasoning of technological tools, based on teaching principles. In turn, it would aid other applications such as fact-checking, summarising tables or even the teaching of maths.
Project title: Green NLP: Data and Resource Efficient Natural Language Processing
Supervisor: Professor Nikos Aletras
Computationally efficient natural language processing
Interpretability in natural language processing models
Clinical applications of speech technology
Current state-of-the-art NLP technologies are underpinned by complex pre-trained neural network architectures (Rogers et al., 2021) that pose two main challenges: (1) they require large and expensive computational resources for training and inference; and (2) they need large amounts of data that usually entails expensive expert annotation for downstream tasks.
Mass mainstream adoption of such systems with large carbon footprints would have serious environmental implications, making this technology unsustainable in the long term. It is estimated that training a large Transformer network (Vaswani et al., 2017) with architecture search produces 5 times more CO2 emissions compared to driving a car for a lifetime (Strubell et al., 2019; Schwartz et al., 2020). Moreover, the cost for developing and experimenting with large deep learning models in NLP introduces inherent inequalities between those that can afford it and those that cannot, both in the academic community and in industry.
Given the importance of these challenges, this project has two main objectives: (1) to develop lightweight pre-trained language models (Sanh et al., 2019; Shen et al., 2021; Yamaguchi et al., 2021) that have substantially smaller computing resource requirements for training and inference; and (2) to investigate data efficiency when fine-tuning pre-trained transformer models (Touvron et al., 2021).
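As a sketch of one standard route to lightweight models, the snippet below implements the temperature-softened knowledge distillation loss used to train small student models such as DistilBERT (Sanh et al., 2019); the logits here are toy values, and a real setup would combine this with the usual task loss.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the student to the teacher's temperature-softened
    distribution, scaled by T^2 as in the standard formulation."""
    p = softmax(teacher_logits, T)   # soft targets from the large teacher
    q = softmax(student_logits, T)
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([4.0, 1.0, 0.5])  # toy logits from a large model
student = np.array([3.5, 1.2, 0.4])  # toy logits from a small model
print(distillation_loss(student, teacher))  # small, non-negative
```

The soft targets carry more information per example than hard labels, which is one reason distilled students retain much of the teacher's accuracy at a fraction of the compute.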
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8:842–866.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.
Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM, 63(12):54–63.
Sheng Shen, Alexei Baevski, Ari Morcos, Kurt Keutzer, Michael Auli, and Douwe Kiela. 2021. Reservoir Transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4294–4309, Online. Association for Computational Linguistics.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina, and Nikolaos Aletras. 2021. Frustratingly Simple Pretraining Alternatives to Masked Language Modeling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3116–3125, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cohort 4 (2022)
Project title: Machine learning driven speech enhancement for the hearing impaired
Supervisor: Prof Jon Barker
BEng Electrical and Electronic Engineering, University of Liverpool
MSc Communications and Signal Processing, Imperial College London
Robustness in speech recognition
Multi-talker speech recognition (speaker diarisation)
Natural language understanding
This project aims to develop new hearing aid algorithms that work well in speech-in-noise conditions. In the UK, one in six individuals has a hearing impairment [1], but only around 40% of those with an impairment use hearing aids [2]. Moreover, even those who own hearing aids do not use them frequently enough, as the devices perform poorly in many situations, particularly in enhancing speech intelligibility in noisy environments. This issue is the primary reason for the low adoption and utilisation of hearing aids [3]. The aim of this project is to explore novel hearing-aid algorithms that target speech enhancement in everyday conversational settings and so improve the adoption of hearing aid devices.
The project will use deep-learning-based source separation approaches to suppress background noise and extract target speech from dynamic acoustic mixtures. Initially, the project will look at algorithms for direction-of-arrival (DOA) estimation using a multi-microphone array. Beamforming approaches will then be applied to steer the array towards the estimated direction of the target speech, while suppressing noise and interference from other directions.
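The delay-and-sum idea behind such beamformers can be sketched as follows, assuming an idealised free-field linear array and a pure-tone source (real hearing-aid processing must also cope with reverberation, HRTFs and head movement): steering to the true DOA aligns the channels so they sum coherently, and the output power peaks there.

```python
import numpy as np

fs, c, d = 16000, 343.0, 0.05          # sample rate, speed of sound, mic spacing (m)
n_mics, n = 4, 1024
t = np.arange(n) / fs
source = np.sin(2 * np.pi * 1500 * t)  # 1.5 kHz stand-in for target speech
theta_true = np.deg2rad(30)            # true direction of arrival

def delay(x, tau):
    """Fractional delay applied as a phase shift in the frequency domain."""
    f = np.fft.rfftfreq(len(x), 1 / fs)
    return np.fft.irfft(np.fft.rfft(x) * np.exp(-2j * np.pi * f * tau), len(x))

# Signals at each mic: progressively delayed copies of the source.
mics = [delay(source, m * d * np.sin(theta_true) / c) for m in range(n_mics)]

def das_power(theta):
    """Delay-and-sum: advance each mic by its steering delay, then sum."""
    y = sum(delay(x, -m * d * np.sin(theta) / c) for m, x in enumerate(mics))
    return float(np.mean(y ** 2))

print(das_power(theta_true) > das_power(np.deg2rad(-60)))  # True: beam steered correctly
```

Scanning `das_power` over a grid of angles is also a naive DOA estimator: the steered-response power is maximised at the true arrival direction.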
This project focuses in particular on speech enhancement for hearing-impaired individuals. The current challenges of multi-channel hearing-aid processing include: uncertainty about the relative positions of the microphones, which leads to errors in DOA estimation; the unique head geometry of each individual, which results in head-related transfer functions (HRTFs) that vary across listeners; and the difficulty of measuring head movements from the imperfect motion sensors of hearing-aid devices. Additionally, to deliver a good user experience, the signal processing must run with very low latency. There is also a lack of appropriate datasets, so the project intends to record new ones; simulations will be used initially to model the scenario and to help develop the data collection methodology. The project is committed to addressing these challenges and enhancing speech intelligibility in noisy conditions for the hearing impaired.
[1] NHS England (2017). Hearing loss and healthy ageing. [pdf] Available at: https://www.england.nhs.uk/wp-content/uploads/2017/09/hearing-loss-what-works-guide-healthy-ageing.pdf
[2] Statista (2018). Share of people with a hearing impairment who use hearing aids in the United Kingdom (UK), by age. [online] Available at: https://www.statista.com/statistics/914468/use-of-hearing-aids-in-the-united-kingdom/
[3] Gallagher, Nicola E. & Woodside, Jayne (2017). Factors Affecting Hearing Aid Adoption and Use: A Qualitative Study. Journal of the American Academy of Audiology, 29. doi:10.3766/jaaa.16148
Project title: Speech to Speech translation through Language Embeddings
Supervisor: Dr Anton Ragni and Prof Thomas Hain
Industry partner: Huawei
MComp Artificial Intelligence and Computer Science, University of Sheffield
Speech-to-speech translation (S2ST) has the potential to advance communication by enabling people to converse in different languages while retaining the speaker's original voice. Despite the rapid advancements in Natural Language Processing (NLP) and Large Language Models (LLMs), achieving seamless live direct speech-to-speech translation (dS2ST) remains a challenge. One significant obstacle is the disparity in available data; while LLMs benefit from vast amounts of textual data, speech data is comparatively scarce. For instance, representing approximately just 1 GB of text requires a staggering 1,093 GB of speech data. As a result, fresh approaches and techniques are needed to make substantial progress in dS2ST, similar to the strides made by LLMs.
This project aims to explore methods for evaluating and improving the quality of synthesized speech in dS2ST models. Currently, the Mean Opinion Score (MOS) serves as a subjective assessment of a speech sample's "humanness." However, relying on human judges to provide MOS scores can be slow and potentially biased. To address these limitations, this project seeks to develop an automated MOS system using sequence-to-sequence models (seq2seq). By creating an evaluation metric that rapidly gauges the humanness of speech, researchers can more efficiently optimize their models and enhance the quality of synthesized speech.
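Setting the seq2seq architecture aside, the core regression idea can be sketched on synthetic data: pooled utterance-level features are mapped to a MOS-like score. Everything below (features, weights, scores) is simulated purely for illustration; a real system would learn the features from audio.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for utterance-level features pooled from an encoder: each
# row is one synthesized utterance. The "true" MOS here is a noisy linear
# function of the features, so a linear fit should recover the trend.
X = rng.normal(size=(200, 6))
true_w = np.array([0.8, -0.5, 0.3, 0.0, 0.2, -0.1])
mos = 3.0 + X @ true_w + 0.05 * rng.normal(size=200)

# Fit a linear MOS predictor by least squares (a stand-in for the
# learned regressor described above).
A = np.c_[np.ones(len(X)), X]        # prepend a bias column
w, *_ = np.linalg.lstsq(A, mos, rcond=None)
pred = A @ w

print(np.corrcoef(pred, mos)[0, 1] > 0.99)  # True: the fit recovers the trend
```

An automated predictor of this kind, once validated against human MOS ratings, can score thousands of samples per minute, which is what makes rapid model iteration feasible.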
An intriguing aspect of this project involves investigating the role of linguistic information in the perceived humanness of speech. By examining the encoder/decoder layers of seq2seq models trained on invented languages like Simlish or Klingon, the research will explore whether linguistic content is essential for speech humanness and how it manifests within different layers of a deep learning model.
Developing automated metrics to assess the quality and humanness of synthetic speech has the potential to significantly advance speech synthesis models. By bridging the data gap and refining dS2ST capabilities, this research can pave the way for innovative applications and improved cross-cultural communication. Ultimately, the results of this project could bring us closer to breaking down language barriers and fostering a more connected world.
Project title: Unsupervised Speech Recognition
Supervisor: Dr Anton Ragni
Speech and text systems; e.g.
ASR and speech synthesis.
(Low resource) Languages.
Languages within their social/cultural context, including endangered languages and the consequences of language convergence.
State-of-the-art machine learning methods. I find it fun to learn about how ML & AI use loose inspiration from neural/cognitive science to break new ground.
Automatic speech recognition is a method of transforming speech to text with computers. State-of-the-art techniques learn this mapping with supervised learning, a process that requires large amounts of transcribed speech data. Transcribing speech is expensive, which limits the applicability of supervised learning to speech technology. This differentiates speech from other domains where labelled data is easier to produce, e.g. classifying pictures of cats and dogs. The difficulty of curating labelled speech data is especially apparent for languages with a low digital footprint, e.g. Xhosa.
These problems motivate solutions that minimise the required labelled data and ideally work in domains where speech data is scarce. Current methods use generative adversarial networks (GANs) to generate text from speech representations. Although GANs for speech recognition have shown initial success, they have known limitations, such as unstable training and an inability to produce diverse samples.
Recent advancements in generative learning provide many advantageous alternatives to GANs. This project aims to learn a speech-to-text mapping in low-resource scenarios with non-adversarial generative models such as diffusion, normalizing flows and energy-based models.
Project title: Adapting large language models to respond to domain changes in content moderation systems
Supervisor: Dr Xingyi Song and Dr Nafise Moosavi
Industry partner: Ofcom
Morphology and Word Segmentation
Tagging, Chunking, Syntax and Parsing
Semantics (of words, sentences, ontologies and lexical semantics)
Social media platforms use content moderation systems to detect posts that are unacceptable according to their Terms of Service. For large platforms these systems usually use a combination of automated and human moderation and often include language models. Due to the fast-evolving nature of social media language, these models require constant updates to respond to new situations such as new slang or spellings (either due to natural changes, or those designed to deliberately fool an existing moderation system), or to scale systems to new functionalities or platforms.
Recent developments in large language models demonstrate powerful language understanding and generation capabilities, including for social media moderation tasks. However, updating large language models demands significant computational power and resources.
In this project we aim to investigate language model compression and knowledge distillation techniques to produce small models that can be rapidly and efficiently fine-tuned to respond to changes in social media topics and language, while reducing unnecessary carbon emissions. We will also explore methods to minimise the need for human-labelled data during model updating, so that the model can be updated more easily to accommodate emerging domain changes, ensuring a more adaptable and efficient content moderation system.
Project title: Singing voice banking and conversion
Supervisor: Prof Guy Brown
Linguistics MA, University of Edinburgh
Clinical applications of speech technology
Utilising speech and language technologies in linguistics research
The voice is a fundamental part of our identity. For transgender people, having a voice that does not fit with their gender identity can be a cause of anxiety. Transgender people can go through a number of different treatments to alter their voice, including surgery, vocal therapy and taking hormones. This project involves working with members of Trans Voices, the UK’s first professional trans+ choir, to improve voice banking methods for singing voice conversion tools. These could then be used creatively, for example, as a way to show the progression of transgender singing voices through the aforementioned treatments. A further application for such technology would be to bank a singer’s voice throughout their lifetime to document how their singing voice changes with age, and to combine their earlier and current singing voices creatively.
In the past there has been a focus on voice banking for speech, for example to create personalised speech synthesisers for people with motor neurone disease and Parkinson’s, to help preserve their vocal identity. Singing differs from speech in a number of ways, most notably phoneme duration, pitch and power. There are also a number of creative ways in which the voice is used in singing that are arguably not so important to mirror in spoken voice conversion, such as breathy voice, and the use of different accents for artistic effect. For these reasons, voice banking methods for speech cannot simply be carried over into the domain of singing.
This PhD project has three main aims. The first is to characterise the impact of some of the aforementioned gender affirming treatments on singing voices. This may be achieved through systematic acoustic analysis of recordings made before and after a given treatment, as well as through surveys which ask, for example, whether there are things that the singers can do with their voice that they could not before, or things they can no longer do but could before. This information could be used to inform singing voice conversion systems. If, for example, it is found that often the singer’s style remains the same or very similar throughout their lifetime, then this could potentially be exploited for efficiency in voice banking for singing voice conversion systems, as it means that the input voice will share a lot of features with the output voice.
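One simple building block for such acoustic analysis is fundamental-frequency (F0) estimation, which would let before- and after-treatment recordings be compared on pitch. A minimal autocorrelation-based sketch on a synthetic voiced frame is shown below; real recordings would additionally need framing, windowing and voicing detection.

```python
import numpy as np

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
frame = np.sin(2 * np.pi * 220 * t)   # synthetic voiced frame at 220 Hz (A3)

def estimate_f0(frame, fs, fmin=80.0, fmax=400.0):
    """Pick the autocorrelation peak inside a plausible pitch range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag range for fmax..fmin
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

print(estimate_f0(frame, fs))  # close to 220
```

Tracking such estimates over time, alongside measures of power and voice quality, is one way the "systematic acoustic analysis" above could quantify how a singing voice changes.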
The second aim of this project is to develop and improve singing voice banking methods for use in singing voice conversion systems, and to work on improving the singing voice conversion systems themselves. As the particular challenges of voice banking depend on the type of system being used, a range of existing systems will be tested for their suitability for singing voice conversion with the needs of the users in mind, in order to find a baseline to build upon.
The final aim of this project is to develop evaluation methods which are ideally suited to a project of this nature, as it is important to demonstrate an advance on the state of the art, as well as to centre the needs and wants of the users. This will most likely involve a combination of objective and subjective methods of evaluation. This project will explore the possibility of adapting existing intelligibility and quality measures to make them suitable for use on singing data. This project also aims to develop evaluation methods which assess a singing voice conversion system’s ability to produce specific vocal effects which are sometimes overlooked in evaluation, such as breathy voice, creaky voice and the use of different accents for artistic effect.
Project title: Data-driven and Discourse-aware Scientific Text Generation
Supervisor: Dr Chenghua Lin
Summarisation and simplification
Interpretability and analysis of models for NLP
The current problem with natural language generation is that we let the neural network infer discourse structure purely from the training data, which may contain biases, skewed representations and a lack of generality. This also makes it difficult to explain why a model picked certain connectives, words and syntax. By researching the underlying structure of these texts, we can enable more coherent and cohesive text generation and better understand how humans formulate text.
With the release of GPT-4 this year, there has been a huge improvement in capabilities, especially in understanding questions, jokes and images, and in passing tests at higher percentiles. However, it still suffers from reliability problems and hallucination, where the generated content may sound plausible but is factually incorrect. In addition, a brief review of current approaches shows that they consistently lack a fine-grained and grounded representation of discourse structure.
We will aim to build a discourse-aware text generation method for scientific papers that generates high-quality long-form text (e.g. a related work section). In the paper "Scaling Laws for Neural Language Models", the consensus for improving LLMs is to increase model size and computational power, but for more niche, domain-specific areas with fewer resources this may not be a reliable route, so by developing a novel method we can move towards a more explainable and robust NLG model.
There has been ongoing research into utilising linguistic theory to support natural language processing and generation. Rhetorical Structure Theory and the Penn Discourse Treebank are the best-known resources, and are well suited to modelling span prediction, nuclearity indication (predicting nucleus or satellite) and relation classification for adjacent sentences and paragraphs. We will investigate whether there is more to discourse than this kind of structure: a writer's intent appears to affect the phrasing used to formulate text, otherwise all scientific papers would be written in exactly the same way. We will investigate what common attributes and structures are generated by different NLG architectures and relate them back to the theory. This could yield new theories about discourse or expand current ones.
OpenAI (2023) GPT-4 Technical report https://arxiv.org/abs/2303.08774
Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020). https://doi.org/10.48550/arXiv.2001.08361
Cho, Woon Sang, et al. "Towards coherent and cohesive long-form text generation." arXiv preprint arXiv:1811.00511 (2018). https://arxiv.org/abs/1811.00511
Zhu, Tianyu, et al. "Summarizing long-form document with rich discourse information." Proceedings of the 30th ACM international conference on information & knowledge management. 2021. https://doi.org/10.1145/3459637.3482396
Gao, Min, et al. "A Novel Hierarchical Discourse Model for Scientific Article and It’s Efficient Top-K Resampling-based Text Classification Approach." 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022. https://doi.org/10.1109/SMC53654.2022.9945306
Yang, Erguang, et al. "Long Text Generation with Topic-aware Discrete Latent Variable Model." Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. https://aclanthology.org/2022.emnlp-main.554/
Project title: Efficient transfer learning for pre-trained language models across many tasks
Supervisor: Prof Nikos Aletras
Integrated Master in Electrical and Computer Engineering, National Technical University of Athens
Machine Learning for NLP
Transfer Learning and Domain Adaptation
Unsupervised & Low-Resource Learning
In recent years, natural language processing (NLP) technologies have made significant advancements thanks to the development of pre-trained language models. These models are trained on massive amounts of text data from various sources, such as websites, news articles, and books. Once trained, these models can be fine-tuned for specific tasks and domains, resulting in improved performance across a wide range of NLP applications. However, there are still many open questions about how these models can be effectively adapted and used for new tasks and domains. This project aims to explore various ideas and techniques related to transfer learning, which is the process of leveraging pre-trained models to improve performance on related tasks.
Some of the research directions we plan to explore include:
Alternatives to fine-tuning: Instead of the traditional fine-tuning approach, we will investigate methods that involve continual learning, intermediate task learning, multi-task learning, and learning with auxiliary losses. These techniques may offer more efficient and effective ways to adapt pre-trained models for new tasks and domains.
Dealing with domain shifts: When the target domain is different from the source domain used for pre-training, performance may degrade. We will explore fine-tuning procedures that can address this issue and help maintain high performance across different domains.
Parameter-efficient transfer learning: Adapting large pre-trained models can be computationally expensive. We will investigate the use of adapter modules to enable more efficient transfer learning. These adapters can be trained on small amounts of task-specific data, making them useful for low-resource settings.
Model combinations: Instead of fine-tuning, we will explore techniques that involve merging pre-trained models. This approach may offer a more efficient way to combine the knowledge of multiple models for downstream tasks.
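The adapter idea in particular can be sketched in a few lines: a small bottleneck with a residual connection sits after a frozen layer, and only the bottleneck weights are trained for each new task (the sizes below are toy values; real adapters live inside transformer blocks and include biases and layer normalisation).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 16, 2            # toy sizes; in practice d_bottleneck << d_model

# Stand-in for the output of a frozen pre-trained layer: a batch of 4 vectors.
hidden = rng.normal(size=(4, d_model))

# Adapter: down-project, non-linearity, up-project, residual connection.
# Only these two small matrices would be trained for a new task.
W_down = rng.normal(size=(d_model, d_bottleneck)) * 0.1
W_up = np.zeros((d_bottleneck, d_model))  # zero init: adapter starts as identity

def adapter(h):
    return h + np.maximum(h @ W_down, 0.0) @ W_up   # ReLU bottleneck + residual

out = adapter(hidden)
print(np.allclose(out, hidden))  # True: identity before any training

# Trainable parameters: 2 * d_model * d_bottleneck, versus d_model * d_model
# for a comparable full layer - a large reduction when the bottleneck is small.
print(2 * d_model * d_bottleneck, "vs", d_model * d_model)
```

Because the pre-trained weights stay frozen, one backbone can serve many tasks, with only a tiny adapter stored per task, which is what makes this attractive for low-resource settings.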
By exploring these research directions, this project aims to advance the state-of-the-art in NLP and contribute to the development of more efficient, adaptable, and powerful language understanding technologies. Through this project, we hope to contribute to the ongoing growth and success of NLP technologies, ultimately leading to improved human communication and decision-making capabilities.
Project title: Speech analysis and training methods for atypical speech
Supervisor: Dr Stefan Goetze
Speech production and perception
Models of speech production
Cognition and brain studies on speech
Speech, voice and hearing disorders
Phonation and voice quality
Pathological speech and language
Speech and audio classification
Acoustic model adaptation
Applications in medical practice
Innovative products and services based on speech technologies
Motor speech disorders (MSDs) are speech disorders of neurological origin that affect the planning, control or execution of speech (Duffy 2019). Dysarthria is a type of MSD that reflects abnormalities in the movements required for speech production (Duffy 2019). Some common neurological causes of dysarthria are Parkinson’s Disease, Multiple Sclerosis, and Cerebral Palsy. Furthermore, the psychosocial impacts of dysarthria (e.g. on identity, self-esteem, social participation and quality of life) are well documented, both for individuals with dysarthria and for their families and carers (Walshe & Miller 2011).
Speech technologies have a fundamental role in the clinical management of atypical speech, and in turn in an individual’s quality of life. Automatic speech recognition (ASR), i.e. the task of transforming audio data into text transcriptions, has important implications for assistive communication devices and home environment systems. Augmentative and Alternative Communication (AAC) is defined as a range of techniques that support or replace spoken communication. The Royal College of Speech and Language Therapists (RCSLT) outlines the use of AAC devices in the treatment of individuals with MSDs (RCSLT 2006), and AAC devices have become standard practice in intervention. Although the accuracy of ASR systems for typical speech has improved significantly (Yue et al. 2022), two challenges have limited dysarthric ASR system development and limit the generalisation of typical-speech ASR systems to dysarthric speech: 1) high variability across speakers with dysarthria, and high variability within an individual dysarthric speaker’s speech, and 2) the limited availability of dysarthric data. Accordingly, studies have focused on i) adapting ASR models trained on typical speech to address the challenge of applying typical-speech models to dysarthric speech and ii) collecting further dysarthric data, although the volume and range of dysarthric data remains limited (Yue et al. 2022). Furthermore, the classification of dysarthria, including measures of speech intelligibility, provides important metrics for the clinical (and social) management of dysarthria, including assessment of its severity and of functional communication (Gurevich & Scamihorn 2017). The RCSLT promotes individually tailored goals in the context of the nature and type of dysarthria, the underlying pathology and specific communication needs (RCSLT 2006).
In current practice, these metrics are based on subjective listening evaluation by expert human listeners (RCSLT 2006), which requires considerable human effort and cost (Janbakhshi et al. 2020). Recent studies have implemented automated methods to classify dysarthric speech, including automatic estimators of speech intelligibility (Janbakhshi et al. 2020).
To advance the application of speech technologies to the clinical management of atypical speech, the current project aims to 1) collect a corpus of dysarthric data to increase the volume of quality dysarthric data available to the research community, 2) improve the performance of dysarthric ASR systems, including investigation of methods of adapting ASR models trained on typical speech, and 3) create automated estimators for the classification of dysarthria.
Gurevich, N. & Scamihorn, L. (2017), ‘SLP use of intelligibility measures in adults with dysarthria’, American Journal of SLP, pp. 873–892.
Janbakhshi, P., Kodrasi, I. & Bourlard, H. (2020), ‘Automatic pathological speech intelligibility assessment exploiting subspace-based analyses’, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 1717–1728.
RCSLT (2006), Communicating Quality 3, Oxon: RCSLT.
Walshe, M. & Miller, N. (2011), ‘Living with acquired dysarthria: the speaker's perspective’, Disability and Rehabilitation 33(3), 195–203.
Yue, Z., Loweimi, E., Christensen, H., Barker, J. & Cvetkovic, Z. (2022), ‘Dysarthric speech recognition from raw waveform with parametric CNNs’.
Project title: What do neural network architectures learn about language? Probing for Semantics and Syntactical Relationships in Large Pretrained Language Models
Supervisor: Prof Aline Villavicencio
Linguistics BA (Lancaster University)
Clinical applications of speech and language technology
Natural language generation
In recent years, pretrained language models (PLMs) built on deep Transformer networks have come to dominate the field of natural language processing (NLP). Deep neural models perform well on many NLP tasks but can be hard to understand and interpret. A body of research has thus arisen to probe such models for a better understanding of the linguistic knowledge they encode and of their inner structure.
This project aims to find out more about what such models learn about language, focusing specifically on semantics. Recent probing methods for lexical semantics rely on word type-level embeddings derived from contextualised representations via vector aggregation techniques; at the same time, many PLMs use weight tying between input and output embeddings. Studies have shown that such word embeddings are liable to representation degeneration: rather than being spread out uniformly in the embedding hyperspace, they collapse into a narrow cone. This is problematic because, in such an anisotropic space, even two random, unrelated words receive a high cosine similarity, so the measure loses its discriminative power. Through a series of linguistically driven probing experiments, it is hoped that a better understanding of PLMs can be obtained, along with insights into resolving vector degeneration. Particular areas of exploration include probing for idiomaticity, complex forms of negation, and perceptual noun injection.
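The degeneration effect can be seen numerically: the average pairwise cosine similarity over a set of word vectors is near zero when the space is isotropic, but approaches one when the vectors collapse into a narrow cone. The sketch below uses synthetic vectors rather than real PLM embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_cosine(E):
    """Average cosine similarity over all distinct pairs of rows of E."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalise rows
    sims = E @ E.T                                     # all pairwise cosines
    n = len(E)
    return (sims.sum() - n) / (n * (n - 1))            # drop the diagonal of 1s

n, d = 200, 64
# Isotropic vectors: spread evenly in all directions of the hyperspace.
isotropic = rng.normal(size=(n, d))
# Degenerate vectors: small noise around one shared direction -> a narrow cone.
cone = 0.1 * rng.normal(size=(n, d)) + rng.normal(size=d)

iso_score = mean_pairwise_cosine(isotropic)    # near 0
cone_score = mean_pairwise_cosine(cone)        # near 1
```

In a degenerate space, the high baseline similarity between arbitrary words is what makes cosine comparisons uninformative.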
Project title: Multimodal Multilingual Framing Bias Detection
Supervisor: Dr Nafise Sadat Moosavi
Cohort 4 Student Representative
BA Linguistic and Cultural Mediation, University of Naples "L'Orientale"
MA Sociolinguistics, University of York
Sociolinguistics and Pragmatics
Social Media Analysis
Social scientists have explored the political significance of frames in mass communication, analysing how they are employed to direct audiences towards specific conclusions by selectively highlighting certain aspects of reality while obscuring others. A frame is thus a perceived reality, imbued with social and cultural values. In fact, the act of framing through linguistic and non-linguistic devices can have a significant impact on how the audience understands and remembers a problem and thus influence their subsequent actions. Think, for example, about the use of the words “freedom fighters” and “terrorists” to report on the same event, and how these two words can have very different inferred meanings. In our example, the choice of one word over the other might suggest the frame of that news article.
In recent years, the Natural Language Processing (NLP) community’s interest in the automatic detection of framing bias has increased, and social scientists too recognise the possible usefulness of computational tools in frame analysis. In fact, manual frame analysis typically involves manually identifying and categorising frames in large amounts of text. This can be a time-consuming and labour-intensive process, particularly for large datasets which, in contrast, could be computationally processed in a quicker and more efficient way. In addition, while manual frame analysis may capture a limited range of frames, automatic frame detection algorithms can analyse a broader range of frames or perspectives.
The very nature of framing makes the task more complicated for NLP than it is for human analysts; even in manual analysis there is a high degree of inter-coder variation. In general, NLP algorithms may not have access to the same level of contextual information as a human analyst, which can make it harder to identify the relevant frames. When dealing with framing bias, what matters is how something is discussed, rather than what is discussed. Recent research has also shown hallucinatory generation of framing bias in large Natural Language Generation (NLG) models, corroborating the importance of tackling the problem.
This project will focus on the task of automatic detection of framing bias in contexts such as news, political discourse, and other forms of media communication. Drawing from social science, linguistics and NLP research, we will address shortcomings in the task of frame categorisation from a multidisciplinary point of view. This research also aims to detect framing bias in multilingual and multimodal settings, considering both the cultural and social aspects imbued into frames and the textual/visual nature of framing devices.
Overall, this project aims to improve frame detection in multilingual and multimodal settings, and to provide a valuable tool for social scientists and NLP researchers to automatically detect framing bias in media communications. Such a tool could have significant applications in political and social sciences, as well as in media studies.
Project title: Detecting communication situations for AI-enabled hearing aids
Supervisor: Prof Jon Barker
Industry partner: WSAudiology
Speech enhancement in hearing aids
Speech enhancement / noise reduction / dereverberation / echo cancelation
Speaker verification and identification
Hearing impairment is a prevalent problem in the UK, affecting approximately 1 in 6 people and making it the second most common disability in the country. However, despite the fact that 6.7 million people could benefit from using a hearing aid, only 2 million people use one, and even those who do often do not find them especially effective.
The basic function of a hearing aid is to amplify specific frequencies to directly compensate for the hearing loss experienced by the user. This is effective in quiet, ideal environments, but the overwhelming majority of typical communication situations present much more challenging environments, with lots of background noise, competing conversations and reverberation. In these scenarios, the amplification also gets applied to these unwanted noises, which renders the desired speech signal inaudible to the user. To mitigate this key issue, new approaches are needed that are able to amplify selected sound sources.
Beamforming is an extremely useful technique for enhancing the desired sound sources; in systems with multiple microphones, such as hearing aids, the signals from each microphone can be combined in such a way that the signal from a specific direction is enhanced. While there has been a lot of progress in this domain, it is difficult to apply in the context of hearing aids; there are significant computational restrictions due to the small form factor of the device, the system must function in real time with low latency (the signal should not be delayed by more than 10 ms), the microphone arrays are constantly moving, and the precise relative positions of the microphones may not be accurately known.
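The simplest beamformer of this kind is delay-and-sum: undo each microphone's arrival delay and average the channels, so the target adds coherently while uncorrelated noise partially cancels. The toy sketch below uses synthetic signals and known integer-sample delays (real hearing-aid arrays involve fractional delays and unknown, moving geometry):

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, n_samples = 4, 4000
target = rng.normal(size=n_samples)      # the desired speech source (toy signal)
shifts = np.array([0, 2, 4, 6])          # per-mic arrival delays in samples (toy geometry)

# Each microphone observes a delayed copy of the target plus independent noise.
mics = np.stack([np.roll(target, s) + rng.normal(size=n_samples) for s in shifts])

def delay_and_sum(signals, shifts):
    """Undo each channel's known delay and average: the target adds
    coherently while independent noise adds incoherently."""
    aligned = np.stack([np.roll(ch, -s) for ch, s in zip(signals, shifts)])
    return aligned.mean(axis=0)

def snr_db(est, ref):
    """Signal-to-noise ratio of `est` against the clean reference, in dB."""
    noise = est - ref
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

out = delay_and_sum(mics, shifts)
single = snr_db(mics[0], target)         # roughly 0 dB for one microphone
beamformed = snr_db(out, target)         # roughly 6 dB better with 4 mics
```

With N microphones the coherent target gains a factor of N in power while the averaged noise gains only sqrt(N) in amplitude, giving the familiar 10·log10(N) dB improvement; the hard part in practice is estimating the delays when the geometry is uncertain.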
While modern beamforming algorithms are very sophisticated, their effect is severely limited when little is known about the acoustic scene: the algorithm needs to know which sources should be amplified and which should be attenuated. The aim of this project is to investigate methods and features that distinguish between common conversational scenes, so that the hearing aid can determine which sound sources are relevant to the listener and which are interfering (as described by the cocktail party effect). This can be especially difficult when the interfering sound is also speech, such as a nearby conversation or a TV show, since there may be little acoustic information to indicate whether it is relevant to the listener. For this reason, it is important to explore machine learning methods that consider both acoustic information and behavioural cues from the listener; head movement, for example, has been shown to provide insight into the scene and into where the listener’s attention may be.
Deafness & Hearing Loss Facts, 2022, https://www.hearinglink.org/your-hearing/about-hearing/facts-about-deafness-hearing-loss/
Lauren V. Hadley and John F. Culling. Timing of head turns to upcoming talkers in triadic conversation: Evidence for prediction of turn ends and interruptions. Frontiers in Psychology, 13, 2022.
Project title: Explainable Evaluation Metrics for Natural Language Generation
Supervisor: Dr Nafise Moosavi and Prof Nikos Aletras
Integrated Master in Electrical and Computer Engineering, National Technical University of Athens
Machine Learning for NLP
Natural Language Generation (Dialogue Systems & Text Summarisation)
Multimodal Social Media Analysis
In recent years, the release of large pre-trained Transformer-based language models has led to major improvements in natural language generation. Modern text generation models produce fluent outputs across a wide variety of tasks, and many automatic metrics have been proposed to evaluate the generated outputs. Although these metrics are widely used in the research community, recent research shows that they correlate poorly with human judgments and also suffer in terms of explainability and robustness. Consequently, it is essential to build more reliable and explainable automatic metrics. This research project focuses on building explainable evaluation metrics (metrics that can justify or explain the scores they assign) for natural language generation tasks, improving on the current ones.
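Correlation with human judgments, the weakness noted above, is typically quantified with Pearson's r between metric scores and human ratings over a set of system outputs. A minimal sketch with invented scores (no real metric or dataset is used):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical automatic-metric scores and human ratings for five generated texts.
metric_scores = [0.71, 0.83, 0.64, 0.90, 0.55]
human_ratings = [3.0, 4.0, 3.5, 4.5, 2.0]

r = pearson_r(metric_scores, human_ratings)   # close to 1 means the metric tracks humans
```

A metric that is explainable as well as well-correlated would additionally justify each score it assigns, rather than reporting a single opaque number.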