Students

Our PhD and MPhil students

Cohort 1 (2019)

Hussein Yusufali

PhD Project title: Continuous End-to-End Streaming TTS as a Communications Aid for Individuals with Speaking Difficulties but Normal Mobility

Supervisor: Prof Roger K Moore

Industry partner: Apple

Project description

This project is aimed at individuals who have had trauma or surgery to their vocal apparatus (e.g. laryngectomy, tracheoectomy or glossectomy) and who are unable to talk in a conventional manner but have full motor control of the rest of their body. Previous research in 'silent speech recognition' and 'direct speech synthesis' has used a wide variety of specialised/bespoke sensors to generate speech in real time from residual articulatory movements. However, such solutions are expensive (due to the requirement for specialised/bespoke equipment) and intrusive (due to the need to install the necessary sensors). It would therefore be of great interest to investigate the potential of a conventional keyboard as a readily available and cheap alternative to specialised/bespoke sensors, i.e. a solution based on text-to-speech synthesis (TTS).

Of course, there are fundamental problems with using contemporary TTS as a communications aid:

the conversion from typed input to speech output is non real-time and delayed,
even a trained touch-typist would be unable to enter text fast enough for a normal conversational speech rate,
the output is typically non-personalised,
it is not possible to control the prosody in real-time, and
it is not possible to control the affect in real-time.

These limitations mean that existing TTS users are unable to intervene in or join a conversation, unable to keep up with the information rate of exchanges, and unable to express themselves effectively (i.e. their individuality and their communicative intent), and they suffer a loss of empathic/social relations as a consequence.

So, what is needed is a solution that overcomes these limitations, and it is proposed that they could be addressed by investigating/inventing a novel end-to-end TTS architecture that facilitates:

Simultaneous real-time input/output (i.e. sound is produced as soon as a key is pressed).
Conversational speech rates by embedding a suitable prediction mechanism (i.e. the spoken equivalent of autocorrect).
Configurable output that allows a user to 'dial-up' appropriate individualised vocal characteristics.
Real-time control of prosody, e.g. using special keys or additional sensors.
Real-time control of affect, e.g. by analysing the acoustics of key-presses or facial expressions from a webcam.
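
As a rough illustration of the prediction mechanism mentioned in the list above, the sketch below ranks candidate completions for a partially typed word by how often the user has typed them before. It is only a toy under stated assumptions: the function name `predict_word`, the frequency-based ranking and the sample history are hypothetical and are not taken from the project.

```python
from collections import Counter
from typing import List, Sequence


def predict_word(prefix: str, history: Sequence[str], k: int = 3) -> List[str]:
    """Return up to k previously typed words that start with `prefix`.

    Candidates are ranked by frequency in `history` (most frequent first),
    with ties broken alphabetically. This is an illustrative assumption,
    not the project's prediction method.
    """
    counts = Counter(word.lower() for word in history)
    prefix = prefix.lower()
    matches = [(word, count) for word, count in counts.items()
               if word.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in matches[:k]]


# Hypothetical typing history for one user.
typed_so_far = ["the", "then", "the", "that", "cat"]
suggestions = predict_word("th", typed_so_far)
```

A real system would presumably need to predict ahead of the typed text and feed the result to the synthesiser incrementally; this sketch only shows the ranking step.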

Cohort 3 (2021)

Jason Clarke

PhD Project title: Disentangling Speech in a Multitalker Environment
Supervisor: Dr Yoshi Gotoh
Industry partner: Meta

Research interests:

Machine Translation
Semantic Word Embedding
Topic Modelling
End to End ASR
Text Classification

Project description

The automated annotation of real-life multi-talker conversations requires systems to perform high-quality speech transcription as well as accurate speaker attribution. These two tasks correspond to two well-studied concepts in the domain of speech technology: automatic speech recognition (ASR) and diarization, respectively. The latter concerns segmenting speech by talker, effectively resolving the problem of ‘Who spoke when?’. ASR systems have shown significant improvements in word error rate (WER) over the past decade, with state-of-the-art systems achieving scores similar to human-level performance on common ASR benchmarks. Despite this, even highly sophisticated systems struggle with the more challenging task of ASR in a multi-talker context, where there is typically a significant amount of overlapping speech. The main issues associated with multi-talker situations are: the irregular distances between the recording device and each talker; the variable, often unknown, number of distinct talkers; the high spontaneity of conversational speech, which renders it grammatically dissimilar to written language; and, most importantly, overlapping speech segments, which occur when talkers speak simultaneously. The combination of these complications is why multi-talker ASR is typically considered one of the most challenging problems in the domain today. This is particularly the case for recording configurations that use only a single distant microphone (SDM), since techniques that exploit the geometry between microphone arrays and speakers, such as beamforming, cannot be used. For ASR to be freely applied in more challenging real-life scenarios, such as general conversations or outdoor environments, without the need for specific microphone configurations, significant improvements are still required.
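
As an aside on the word error rate (WER) metric mentioned above, the following minimal sketch computes it as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; benchmark toolkits typically also apply text normalisation, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
example_wer = word_error_rate("who spoke when", "who spoke then")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.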

Previous literature has demonstrated that combining the two distinct tasks of ASR and diarization into a single multi-task end-to-end framework can achieve state-of-the-art performance on certain relevant benchmarks whilst requiring only an SDM recording configuration. This is because knowing who the active talker is, and when their turn has ended, significantly influences the probability distribution of the following word, thus benefiting the ASR component of the system. Likewise, information about who the active talker is, and when the conversation is likely to transition to a different active talker, is intrinsic to the verbal content of the conversation. It therefore seems intuitive that leveraging this mutually beneficial information, by solving the conventionally independent tasks simultaneously, improves overall system performance.

With the increasing popularity of augmented reality and smart wearable devices, and the introduction of audio-visual datasets that include real-life events recorded egocentrically, a new avenue of research in the domain has opened up. This project aims to explore methods of exploiting the visual component of this multi-modal data in a similar manner to the aforementioned speaker-attributed automatic speech recognition (SA-ASR) multi-task system. For example, the aggregate gaze of conversation participants is often representative of who the overall party is attending to, and therefore who is most likely speaking. This information would likely benefit the diarization component of an SA-ASR system, thus indirectly improving the accuracy of the ASR component. In summary, the aim of this project is to explore methods of supplementing conventional speaker attribution and ASR systems, which rely exclusively on audio, with pertinent features extracted from the corresponding visual data.

Robert Flynn

PhD Project title: Leveraging long-context information to enhance conversation understanding
Supervisor: Dr Anton Ragni
Industry partner: Meta

Research interests:

Continual learning and meta learning
Methods for handling and learning from long sequences of data
Automatic Speech Recognition
Language Modelling

Project description

Speech and natural language data will often contain many long-range dependencies, where there is some statistical correlation between words or speech that are spatially distant [1]. For example, in conversational data, the interactions between multiple speakers can produce sequences of utterances that are logically connected and reference information from far in the past. As a consequence, systems that have a greater capacity to model and learn these relationships will show greater predictive performance.

Despite this, the majority of the current Automatic Speech Recognition (ASR) systems are designed to model speech as an independent set of utterances, where information from surrounding segments of speech is not utilised by the model. Of course, in many real-world scenarios, these segments of speech are highly interrelated, and the previously processed data can be used to form a useful prior on which to condition future transcription. While previous work has investigated the incorporation of context from surrounding utterances [2,3], doing so effectively with both linguistic and acoustic data remains an open problem. Indeed, many of the current architectures used for modelling sequential data are limited in their ability to utilise very long-distance dependencies [3].

This project aims to investigate approaches that enable automatic speech recognition systems to incorporate information from a much broader context. Ideally, in scenarios such as work-related meetings, ASR systems should be able to utilise relevant information from across the entire discourse to aid transcription. Additionally, an element of this work can look at what sorts of long-range dependencies are present in long-format speech datasets, and how well these are modelled by our current speech recognition systems.

[1] Ebeling, Werner, and Thorsten Pöschel. "Entropy and long-range correlations in literary English." EPL (Europhysics Letters) 26, no. 4 (1994): 241.
[2] G. Sun, C. Zhang and P. C. Woodland. "Transformer Language Models with LSTM-Based Cross-Utterance Information Representation." ICASSP (2021).
[3] Hori, Takaaki, Niko Moritz, Chiori Hori and Jonathan Le Roux. "Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers." Interspeech (2021).
[4] Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. "Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context." ACL (2018).

Mary Hewitt

PhD Project title: Speaking to the Edge: IoT microcontrollers and Natural Language Control
Supervisor: Prof Hamish Cunningham

Research interests:

Clinical applications of speech technology
NLP for social media analysis
Online misinformation detection

Project description

A new generation of network-connected microcontrollers is fast becoming the ubiquitous equivalent of yesterday’s embedded computation devices. Instead of hiding anonymously in the control systems of TVs, central heating systems and cars, these devices now play an increasing role in user interfaces. Smart speakers like Google Home are dominated by ARM-based microprocessors, yet a multitude of Internet of Things (IoT) applications (driven by tight constraints on power and cost) rely on much smaller computational cores. While the IoT has gained increasing attention over the past decade as a term describing the connection of microcontrollers to the internet, the small size and low power of the devices give rise to privacy concerns that cannot be solved using traditional techniques. Simply connecting these devices to a network makes them vulnerable to cyberattacks, and their low power means that cloud services are often relied upon for data aggregation, storage and computation. Stories about privacy concerns surrounding smart speakers, such as security bugs allowing access to voice history from Alexa devices, increasingly appear in the media.

This project addresses the extent to which voice control interfaces running on IoT devices can enhance privacy by reducing reliance on the cloud. We will do this by surveying, prototyping and measuring processing capabilities and architectural options for IoT-based voice control in the context of smart homes.

Open source technologies are a prerequisite for verifiable security; migrating IoT infrastructure to open source alternatives would help to alleviate privacy concerns through transparency. In addition, local in-house processing that doesn’t transmit data to the cloud would mitigate a large number of privacy issues. With speech signals carrying rich information about the user’s gender, emotions, health, intent and characteristics, it is essential that signals received by voice control devices remain private. We will initially deploy an open source Automatic Speech Recognition (ASR) system that processes speech data locally on the device (and/or gateway or hub), instead of sending it to cloud services. ASR is the task of finding out what has been said, and is a widely studied field with published state-of-the-art benchmarks that we can use to evaluate the relative performance of our ASR system.

Edge computing and federated learning are key fields relating to our aim of eliminating the transmission of sensitive data over networks. Edge computing refers to processing data locally on users’ devices, and federated learning refers to training machine learning models across those devices, coordinated by a central server, so that the raw data stays on the device. We will train a model to recognise commands within the domain of smart home control; however, training an ASR model from scratch requires huge amounts of data, and collecting it is a time-consuming and costly process. Instead, we can use transfer learning approaches to fine-tune a pretrained model on a small amount of data collected from our domain. For example, speaker adaptation could be employed to fine-tune the speech recognition model using samples of the user’s voice.
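
To make the federated learning idea above more concrete, here is a minimal sketch of the server-side aggregation step used in federated averaging (often called FedAvg), a common scheme that the project description does not itself name. Each device reports only its locally trained weights and the number of local examples it used, so no audio leaves the device. The function name `federated_average` and the two-device example are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[List[float], int]]) -> List[float]:
    """Combine client weights into global weights.

    `updates` holds (weights, num_local_examples) pairs, one per device.
    Each device's weights are weighted by its share of the total data.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one client must report local examples")
    size = len(updates[0][0])
    global_weights = [0.0] * size
    for weights, n in updates:
        if len(weights) != size:
            raise ValueError("all clients must share one model shape")
        for k, w in enumerate(weights):
            global_weights[k] += w * n / total
    return global_weights


# Two hypothetical smart-home hubs; the second saw three times more data.
new_global = federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)])
```

In a full training loop the server would send `new_global` back to the devices and repeat the process for several rounds.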

Tyler Loakman

PhD Project title: Towards Dialogue Systems with Social Intelligence
Supervisor: Prof Rob Gaizauskas

My interests on the speech side of things are emotion detection (generally, as well as mental health applications and effects related to intoxication) and voice cloning (particularly due to the criminal use of deepfakes and exploiting biometric security), whilst on the NLP side of things my primary interest would be mis/dis-information detection.

Project description

Whilst chatbots and dialogue systems of the form seen in everyday life (Apple’s Siri or Amazon’s automatic helpdesk, to name a few) are often viewed as task-oriented and transactional, and therefore engaged with solely to achieve an end goal such as returning a product or answering a query, real human communication is rich in social features and unavoidable pragmatic considerations. Consequently, in order to achieve human-like language competency in a conversational setting, these conversational agents require an “understanding” of social intelligence such as pragmatics, empathy and humour, in order to better integrate into daily life and be interacted with more naturally.

Although social intelligence and pragmatic competence may not be essential for purely task-oriented dialogue systems, there is still potential for these systems to benefit from developments in the field. For example, within the aforementioned customer service helpdesks, improved social intelligence could result in better responses to customers exhibiting a range of emotions (such as frustration when returning an item), as well as building a rapport with the customer base, leading to increased brand loyalty. On the other hand, dialogue agents exhibiting social intelligence have clearer applications in other domains, such as in mental health services for talking therapy, and in the videogames industry for creating more personalised and interactive dialogue trees for narrative progression.

Building upon previous work in empathetic dialogue systems, this project aims to develop a novel high-quality annotated dataset of conversational humour using theories from the domains of linguistics and social/cognitive psychology, to then aid in the development of novel computational models that enable a conversational agent to be aware of context and generate appropriate humorous responses. The aim of these generated responses is, therefore, to incorporate humour, but also to be aware of the conversational pragmatics that govern when is and is not an appropriate time to incorporate said humour in order to avoid the law of diminishing returns or being insensitive. In aiming for such outcomes, the development of anthropomorphic conversational agents within dialogue systems moves away from the focus on omniscience in digital assistants and towards social competence.

Amit Meghanani

PhD Project title: Factorisation of Speech Embeddings
Supervisor: Prof Thomas Hain
Industry partner: LivePerson

Research interests:

Disentangled Speech Representations
Self-supervised Learning
Acoustic Word/Sub-word Embeddings

Project description

Observed speech signals are the outcome of a complex process governed by intrinsic and extrinsic factors. The combination of such factors is a highly non-linear process, and most of the factors, and the relations between them, are not directly observable. For example, the speech signal is normally considered to have content, intonation and speaker attributes combined with environmental factors. Each of these, however, breaks down into several sub-factors, e.g. content is influenced by coarticulation, socio-economic factors, speech type, etc. While the learning of factors has been attempted in many ways over decades, the scope and structure of factorial relationships were usually considered to be simple, and were often assessed in discrete space. Recent years have seen attempts in embedding space that allow much richer factorisations. This is often loosely referred to as disentanglement [1], [2], as no relationship model is required or included in the factor separation. One of the challenges for factorising speech representations is that the concept of factorisation is not well defined across domains such as the image domain and the speech domain [3].

In this project, the objective is to learn mappings that allow relationships between separated factors to be encoded. Such relationships may be linear or hierarchical, or may be encoded in terms of a graph of a formal group of mathematical operations. It is, however, important that such relationship encodings remain simple in structure. Formulating such relationships can serve as a new way to express objective functions and allow us to extract richer information about the relationship between these factors. One can imagine both supervised and unsupervised learning approaches.

Early examples of such models were published in [4], where simple separation is used to separate different values for speaker and content. The work in [5] uses a simple separation technique to separate different speaker embeddings. However, that work was restricted to two-speaker scenarios. Increasing the number of sources (e.g. speakers) will make the factorisation process more complex in terms of modelling, and it will be difficult to keep the relationship between the separated factors simple. The goal of this work is to propose different ways to factorise the speech signal into latent space factors while maintaining a simple relationship between them, and later to reuse them in various speech technology tasks.

[1] Hsu, Wei-Ning et al. "Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data." NIPS (2017).
[2] Jennifer Williams. 2021. End-to-end signal factorization for speech: identity, content, and style. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI'20). Article 746, 5212-5213.
[3] Jennifer Williams, "Learning disentangled speech representations".
[4] Yanpei Shi, "Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks".
[5] Shi, Yanpei and Thomas Hain. "Supervised Speaker Embedding De-Mixing in Two-Speaker Environment." 2021 IEEE Spoken Language Technology Workshop (SLT) (2021): 758-765.

Tom Pickard

PhD Project title: Enhanced Deep Neural Models to Capture Nuances in Language such as Idiomatic Expressions
Supervisor: Dr Carolina Scarton

I fell in love with Computational Linguistics thanks to multi-word expressions, but I'm also interested in the practicalities of using SLTs - domain adaptation and robustness, emojis and rapidly-evolving language, fairness, equality and sustainability.

Project description

Multi-word expressions (MWEs) - phrases whose meaning as a whole is greater than the sum of their parts - occur frequently in language, and make up around half of the lexicon of native speakers. Idioms in particular present significant challenges for language learners and for computational linguists. While the meaning of a phrase like car thief can be understood from its constituent words, an unfamiliar idiom like one-armed bandit or green thumbs is completely opaque without further explanation.

Deep neural network models, which underpin state-of-the-art natural language processing, construct their representations of phrases by converting individual words to mathematical vectors (known as word embeddings), and then combining these vectors together. This means that they struggle to accurately handle idiomatic expressions which change the meaning of their constituent words - a red herring is not a fish, and no animals are harmed when it is raining cats and dogs.
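
The effect described above can be illustrated with a deliberately tiny example. The sketch below averages hand-made three-dimensional word vectors (a crude stand-in for how a model might combine embeddings) and shows that the composed vector for "red herring" ends up close to "fish" and nowhere near "distraction". The vectors, the averaging scheme and all names are assumptions chosen for illustration; real models use learned embeddings and far more sophisticated composition.

```python
import math
from typing import Dict, List

# Hand-made toy vectors; the axes are (colour, animal, deception).
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "red": [1.0, 0.0, 0.0],
    "herring": [0.0, 1.0, 0.0],
    "fish": [0.1, 0.9, 0.0],
    "distraction": [0.0, 0.0, 1.0],
}


def compose(words: List[str]) -> List[float]:
    """Represent a phrase as the element-wise mean of its word vectors."""
    vectors = [TOY_EMBEDDINGS[word] for word in words]
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


phrase = compose(["red", "herring"])
similarity_to_fish = cosine(phrase, TOY_EMBEDDINGS["fish"])
similarity_to_distraction = cosine(phrase, TOY_EMBEDDINGS["distraction"])
```

In this toy setting the literal reading wins by construction, which is exactly the failure mode that idiomatic expressions expose.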

This limits the usefulness of these models when performing all manner of natural language processing (NLP) tasks; imagine translating “It’s all Greek to me” into Greek one word at a time while preserving its meaning, identifying the sentiment behind a review which describes your product as being as useful as a chocolate teapot, or captioning a picture of a zebra crossing, while unfamiliar with those phrases.

The aim of this project is to enhance the capabilities of neural network models to understand idiomatic expressions by finding ways to provide them with world knowledge, indications of which words go together in a phrase, or the ability to identify when the presence of an unexpected word might indicate an idiomatic usage.

Our jumping-off point will be to explore how various groups of human readers (including native speakers and language learners) process familiar and novel multi-word expressions, whether they are able to work out their meaning and how much contextual information is required to do so.

This will provide insight into the challenges faced by humans dealing with MWEs and the strategies they employ to do so. We can then examine neural network models and understand where the challenges they experience differ from or resemble our own.

This should enable us to take inspiration from human techniques to enhance the structure and training of our models, before evaluating how much our changes improve model performance on NLP tasks. All of this will hopefully bring us closer to our goal of making idioms less of a pain in the neck!

Jasivan Sivakumar

PhD Project title: End-to-end Arithmetic Reasoning in NLP tasks
Supervisor: Dr Nafise Sadat Moosavi

Cohort 3 Student Representative

Research interests:

Numerical Reasoning
Tokenisation/Encoding/Decoding of numbers
Evaluation for NLG
NLP for low resource language
Machine Translation

Project description

The field of natural language processing has recently achieved success in various areas such as search retrieval, email filtering and text completion. Most people use these tools every day when doing a web search, deleting a spam email or writing a message. Nevertheless, even the newest artificial intelligence models such as GPT-3 struggle with understanding numerical data; this can result in pitfalls like incorrect predictions or factually inaccurate outputs when the input data contains numerical values. This project aims to develop techniques to improve arithmetic reasoning in the interpretation of natural language.

First of all, it is important to distinguish between computers that, in general, can perform very elaborate mathematics and models that focus on written language. Behind a powerful piece of mathematics code lies a well-versed mathematician who programmed it. We do not claim that computers cannot do maths, but we want them to be able to use it without a human whispering what to do. For instance, in school, when students understand how to multiply two numbers, say 82 and 85, it is expected that they will be equally good at finding the product of 82 and 109. However, GPT-3 correctly answers the former question but fails at the latter. Clearly, the current tools used when working with natural language are not successful at arithmetic reasoning. Evidence suggests that this correlates with information seen during training, but the systems are too complex to verify this. In other words, they lack various reasoning skills, including arithmetic ones, and they may not understand numbers as well as they understand common words.

To tackle the problem, the initial stage of the research is to explore the existing maths question datasets to identify their limitations. The focus is on worded problems like “Alice and Brinda equally share 8 sweets. Brinda buys 2 and gives 1 to Alice. How many sweets does Brinda now have?”. A key step would be to design evaluation sets that measure different aspects of arithmetic reasoning, for example, by developing contrast sets i.e. using similar questions with different wording. For example, “What is 7 times 6?” and “Give the product of 6 and 7”. With worded problems, altering the numbers or the words used in the problem could highlight whether the model struggles more with the context of the problem or the maths itself.
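
A contrast set of the kind described above could be sketched as follows: several wordings of the same multiplication share one expected answer, and a model is scored by the fraction of wordings it answers correctly. The function names, the third wording and the `brittle_model` stand-in (a regular expression, not a language model) are all hypothetical and serve only to show how inconsistent behaviour across wordings would surface.

```python
import re
from typing import Callable, List, Tuple


def multiplication_contrast_set(a: int, b: int) -> List[Tuple[str, int]]:
    """Return differently worded questions that share the answer a * b."""
    expected = a * b
    return [
        (f"What is {a} times {b}?", expected),
        (f"Give the product of {b} and {a}.", expected),
        (f"Multiply {a} by {b}.", expected),
    ]


def consistency(answer_fn: Callable[[str], int],
                items: List[Tuple[str, int]]) -> float:
    """Fraction of wordings for which answer_fn returns the expected value."""
    correct = sum(answer_fn(question) == expected
                  for question, expected in items)
    return correct / len(items)


def brittle_model(question: str) -> int:
    """Toy stand-in for a model: only understands the 'X times Y' wording."""
    match = re.search(r"(\d+) times (\d+)", question)
    if match is None:
        return -1  # signals "no answer"
    return int(match.group(1)) * int(match.group(2))


items = multiplication_contrast_set(7, 6)
score = consistency(brittle_model, items)
```

A low score on a set like this would suggest the model is sensitive to wording rather than struggling with the arithmetic itself.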

A further investigation could be to explore how arithmetic reasoning capabilities are coupled with the underlying understanding of the language. A method to better understand this phenomenon would be to use multilinguality to test whether language is key for arithmetic reasoning in existing models or whether they can perform maths language agnostically. For instance, would German’s use of terms like “Viereck, Fünfeck, …” meaning “four-corner, five-corner, …” for the nomenclature of polygons make the maths problems more understandable in German than other languages?

Moreover, a curriculum learning approach could be employed by having different questions with different levels of difficulty. The easiest set could solely focus on using the most basic operation i.e. addition. More difficult questions may involve operations like division but also more difficult number sets like negatives and decimals.
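
One possible way to order questions for such a curriculum is sketched below. The scoring rule is an assumption made for illustration: addition is treated as the easiest operation and division as the hardest (the relative placement of subtraction and multiplication is a guess), and each negative or non-integer operand adds one point of difficulty.

```python
from typing import List, Tuple, Union

Number = Union[int, float]
Problem = Tuple[Number, str, Number]

# Assumed ordering of operations from easiest to hardest.
OPERATION_RANK = {"+": 0, "-": 1, "*": 2, "/": 3}


def difficulty(a: Number, op: str, b: Number) -> int:
    """Heuristic difficulty: operation rank plus penalties for hard operands."""
    score = OPERATION_RANK[op]
    for operand in (a, b):
        if operand < 0:
            score += 1  # negative number
        if operand != int(operand):
            score += 1  # decimal number
    return score


def curriculum(problems: List[Problem]) -> List[Problem]:
    """Order problems from easiest to hardest (stable for equal scores)."""
    return sorted(problems, key=lambda problem: difficulty(*problem))


ordered = curriculum([(7.5, "/", -2), (3, "+", 4), (6, "*", 7)])
```

Training would then present the problems in `ordered`, starting with the simplest additions.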

In addition, we will design optimal prompting techniques for enhancing few-shot arithmetic reasoning. Few-shot learning is the idea of showing the model only a few instances of a specific problem to see if it generalises to unseen examples given previous knowledge. If the model can add integers, could it do the same for decimals given only a few examples?
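
For readers unfamiliar with few-shot prompting, the sketch below assembles a prompt from a handful of worked examples followed by an unanswered query. The "Q:"/"A:" layout and the function name `build_few_shot_prompt` are illustrative assumptions rather than a format prescribed by the project or by any particular model.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format worked examples, then leave the final answer for the model."""
    lines = []
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


# Two decimal-addition demonstrations, then an unseen decimal question.
prompt = build_few_shot_prompt(
    [("What is 1.5 + 2.25?", "3.75"), ("What is 0.1 + 0.4?", "0.5")],
    "What is 0.5 + 0.25?",
)
```

The resulting string would be passed to a language model, whose completion after the final "A:" is taken as its answer.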

Overall, this research aims at improving arithmetic reasoning of technological tools based on teaching principles. In turn it would aid other applications such as fact-checking, summarising tables or even helping the teaching of maths.

Miles Williams

PhD Project title: Green NLP: Data and Resource Efficient Natural Language Processing
Supervisor: Professor Nikos Aletras

Research interests:

Computationally efficient natural language processing
Interpretability in natural language processing models

Project description

Current state-of-the-art NLP technologies are underpinned by complex pre-trained neural network architectures (Rogers et al., 2021) that pose two main challenges: (1) they require large and expensive computational resources for training and inference; and (2) they need large amounts of data, which usually entails expensive expert annotation for downstream tasks.

Mass mainstream adoption of such systems with large carbon footprints would have serious environmental implications, making this technology unsustainable in the long term. It is estimated that training a large Transformer network (Vaswani et al., 2017) with architecture search produces 5 times more CO2 emissions compared to driving a car for a lifetime (Strubell et al., 2019; Schwartz et al., 2020). Moreover, the cost for developing and experimenting with large deep learning models in NLP introduces inherent inequalities between those that can afford it and those that cannot, both in the academic community and in industry.

Given the importance of these challenges, this project has two main objectives: (1) to develop lightweight pre-trained language models (Sanh et al., 2019; Shen et al., 2021; Yamaguchi et al., 2021) that have substantially smaller computing resource requirements for training and inference; and (2) to investigate data efficiency when fine-tuning pre-trained transformer models (Touvron et al., 2021).

References

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8:842–866.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.
Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Commun. ACM, 63(12):54–63.
Sheng Shen, Alexei Baevski, Ari Morcos, Kurt Keutzer, Michael Auli, and Douwe Kiela. 2021. Reservoir Transformers. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4294–4309, Online. Association for Computational Linguistics.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650, Florence, Italy. Association for Computational Linguistics.
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina, and Nikolaos Aletras. 2021. Frustratingly Simple Pretraining Alternatives to Masked Language Modeling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3116–3125, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Cohort 4 (2022)

Shengchang Cao

MPhil Project title: Machine learning driven speech enhancement for the hearing impaired

Supervisor: Prof Jon Barker

Academic background:

BEng Electrical and Electronic Engineering, University of Liverpool
MSc Communications and Signal Processing, Imperial College London

Research interests:

Robustness in speech recognition
Multi-talker speech recognition (speaker diarisation)
Natural language understanding
Machine Translation

Project description

This project aims to develop new hearing aid algorithms that work well for speech-in-noise conditions. In the UK, one in six people has a hearing impairment [1], but only about 40% of them have hearing aids [2]. Moreover, even those who own hearing aids often do not use them regularly, as the devices do not perform well in many situations, particularly when enhancing speech intelligibility in noisy environments. This issue is the primary reason for the low adoption and utilisation of hearing aids [3]. The aim of this project is to explore novel hearing-aid algorithms that target speech enhancement in everyday conversational settings and thereby increase the adoption of hearing aid devices.

The project will use deep-learning based source separation approaches to suppress background noise and extract target speech from dynamic acoustic mixtures. Initially, the project will look at algorithms for direction-of-arrival (DOA) finding using a multi-microphone array setup. Beamforming approaches will then be applied to steer the array towards the estimated direction of the target speech, whilst suppressing noise and interference from other directions.
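As a rough illustration of DOA finding, the sketch below estimates the time difference of arrival between two microphones by cross-correlation and converts it to an angle. It is a toy under stated assumptions, not the project's method: the 16 kHz sample rate, the 15 cm microphone spacing, the synthetic noise burst and all function names are hypothetical, and a free-field far-field source is assumed.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def estimate_delay(ref, other, max_lag):
    """Return the lag in samples at which `other` best lines up with `ref`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation of the two channels at this candidate lag.
        score = sum(
            ref[n] * other[n + lag]
            for n in range(len(ref))
            if 0 <= n + lag < len(other)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def doa_degrees(delay_samples, sample_rate, mic_spacing):
    """Convert an inter-microphone delay into an angle from broadside."""
    sin_theta = (delay_samples / sample_rate) * SPEED_OF_SOUND / mic_spacing
    sin_theta = max(-1.0, min(1.0, sin_theta))  # guard against rounding
    return math.degrees(math.asin(sin_theta))

# Hypothetical setup: 16 kHz sampling, two microphones 15 cm apart.
SAMPLE_RATE = 16000
MIC_SPACING = 0.15
TRUE_DELAY = 3  # samples by which the second microphone lags the first

rng = random.Random(0)
burst = [rng.gauss(0.0, 1.0) for _ in range(200)] + [0.0] * 8
mic_a = burst
mic_b = [0.0] * TRUE_DELAY + burst[:-TRUE_DELAY]

# The largest physically possible delay bounds the search range.
max_lag = math.ceil(MIC_SPACING / SPEED_OF_SOUND * SAMPLE_RATE)
delay = estimate_delay(mic_a, mic_b, max_lag)
angle = doa_degrees(delay, SAMPLE_RATE, MIC_SPACING)
print(f"estimated delay: {delay} samples, DOA: {angle:.1f} degrees")
```

Real hearing aid arrays add head shadowing, reverberation and movement, which is why a plain correlation peak like this one is only a starting point.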

This project particularly focuses on speech enhancement for hearing-impaired individuals. The current challenges of multi-channel hearing aid processing include uncertainty about the relative positions of the microphones, which results in errors in DOA estimates; the unique head geometry of each individual, which results in head-related transfer functions (HRTFs) that vary across listeners; and the difficulty of measuring head movements from the imperfect information provided by the motion sensors of hearing aid devices. Additionally, in order to deliver a good user experience, it is necessary to implement signal processing with very low latency. It is worth noting that there is a lack of appropriate datasets, and the project intends to record up-to-date datasets. In this regard, simulations will be used initially to model the scenario and assist in developing the data collection methodology. The project is committed to addressing these challenges and enhancing speech intelligibility in noisy conditions for the hearing impaired.

References

[1] NHS England: Hearing loss and healthy ageing. 2017. [pdf] Available at: https://www.england.nhs.uk/wp-content/uploads/2017/09/hearing-loss-what-works-guide-healthy-ageing.pdf

[2] Share of people with a hearing impairment who use hearing aids in the United Kingdom (UK) in 2018, by age. [online] Available at: https://www.statista.com/statistics/914468/use-of-hearing-aids-in-the-united-kingdom/

[3] Gallagher, Nicola E & Woodside, Jayne. (2017). Factors Affecting Hearing Aid Adoption and Use: A Qualitative Study. Journal of the American Academy of Audiology. 29. 10.3766/jaaa.16148.

Shaun Cassini

PhD Project title: Speech to Speech translation through Language Embeddings

Supervisor: Prof Thomas Hain

Industry partner: Huawei

Academic background:

MComp Artificial Intelligence and Computer Science, University of Sheffield

Project description

Speech-to-speech translation (S2ST) has the potential to advance communication by enabling people to converse in different languages while retaining the speaker's original voice. Despite the rapid advancements in Natural Language Processing (NLP) and Large Language Models (LLMs), achieving seamless live direct speech-to-speech translation (dS2ST) remains a challenge. One significant obstacle is the disparity in available data; while LLMs benefit from vast amounts of textual data, speech data is comparatively scarce. For instance, representing the equivalent of just 1 GB of text requires approximately 1,093 GB of speech data. As a result, fresh approaches and techniques are needed to make substantial progress in dS2ST, similar to the strides made by LLMs.

This project aims to explore methods for evaluating and improving the quality of synthesized speech in dS2ST models. Currently, the Mean Opinion Score (MOS) serves as a subjective assessment of a speech sample's "humanness." However, relying on human judges to provide MOS scores can be slow and potentially biased. To address these limitations, this project seeks to develop an automated MOS system using sequence-to-sequence models (seq2seq). By creating an evaluation metric that rapidly gauges the humanness of speech, researchers can more efficiently optimize their models and enhance the quality of synthesized speech.
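For context, a MOS is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The minimal sketch below computes it together with a standard error, which hints at why small listener panels give noisy estimates. The ratings and the function name are hypothetical, not data from this project.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return the MOS and its standard error for one speech sample."""
    mos = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, sem

# Hypothetical ratings from five listeners on a 1 (bad) to 5 (excellent) scale.
ratings = [4.0, 5.0, 3.0, 4.0, 4.0]
mos, sem = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {sem:.2f}")
```

An automated predictor would be trained to output a value like `mos` directly from the audio, without collecting `ratings` each time.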

An intriguing aspect of this project involves investigating the role of linguistic information in the perceived humanness of speech. By examining the encoder/decoder layers of seq2seq models trained on nonsensical languages like Simlish or Klingon, the research will explore whether linguistic content is essential for speech humanness and how it manifests within different layers of a deep learning model.

Developing automated metrics to assess the quality and humanness of synthetic speech has the potential to significantly advance speech synthesis models. By bridging the data gap and refining dS2ST capabilities, this research can pave the way for innovative applications and improved cross-cultural communication. Ultimately, the results of this project could bring us closer to breaking down language barriers and fostering a more connected world.

Mattias Cross

PhD Project title: Unsupervised Speech Recognition

Supervisor: Dr Anton Ragni

Research interests:

Speech and text systems; e.g. ASR and speech synthesis; (Low resource) Languages; and Multimodality.
Languages within their social/cultural context, including endangered languages and the consequences of language convergence.
State-of-the-art machine learning methods. I find it fun to learn about how ML & AI use loose inspiration from neural/cognitive science to break new bounds.

Project description

Automatic speech recognition is a method of transforming speech to text with computers. State-of-the-art techniques yield this mapping with supervised learning, a process that requires a large amount of transcribed speech data. Transcribing speech is expensive, making supervised learning less applicable to speech technology. This differentiates speech technology from other domains where labelled data is easier to produce, e.g. classifying pictures of cats and dogs. The difficulty in curating labelled speech data is especially apparent for languages with a low digital footprint, e.g. Xhosa.

These problems motivate solutions that minimise the required labelled data and ideally work in domains where speech data is scarce. Current methods use generative adversarial networks (GANs) to generate text from speech representations. Although GANs for speech recognition are initially successful, they have known limitations such as unstable training and the inability to produce diverse samples.

Recent advancements in generative learning provide many advantageous alternatives to GANs. This project aims to learn a speech-to-text mapping in low-resource scenarios with non-adversarial generative models such as diffusion, normalizing flows and energy-based models.

Meredith Gibbons

PhD Project title: Adapting large language models to respond to domain changes in content moderation systems

Supervisors: Dr Xingyi Song and Dr Nafise Moosavi

Industry partner: Ofcom

Research interests:

Multilinguality
Morphology and Word Segmentation
Tagging, Chunking, Syntax and Parsing
Semantics (of words, sentences, ontologies and lexical semantics)
Machine translation

Project description

Social media platforms use content moderation systems to detect posts that are unacceptable according to their Terms of Service. For large platforms these systems usually use a combination of automated and human moderation and often include language models. Due to the fast-evolving nature of social media language, these models require constant updates to respond to new situations such as new slang or spellings (either due to natural changes, or those designed to deliberately fool an existing moderation system), or to scale systems to new functionalities or platforms.

Recent developments in large language models demonstrate powerful language understanding and generation capabilities, including on the social media moderation task. However, updating large language models demands significant computational power and resources.

In this project we aim to investigate language model compression and knowledge distillation techniques to produce small models. These models can be rapidly and efficiently fine-tuned to follow changes in social media topics and language domains, and reduce unnecessary carbon emissions. We will also explore methods to minimise the need for human labelled data during model updating, so that the model can be updated more easily to accommodate emerging domain changes, ensuring a more adaptable and efficient content moderation system.
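The project text does not specify a distillation objective. As a minimal sketch of one widely used formulation, the code below computes the soft-target loss in which a small student is trained to match the temperature-softened output distribution of a large teacher. The two-class logits (for example "acceptable" versus "violating") and the temperature of 2.0 are hypothetical values chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for one post over the classes (acceptable, violating).
teacher_logits = [2.0, -1.0]
student_logits = [0.5, 0.0]

loss_before = distillation_loss(student_logits, teacher_logits)
loss_matched = distillation_loss(teacher_logits, teacher_logits)
print(f"loss before training: {loss_before:.4f}, when matched: {loss_matched:.4f}")
```

Because the loss needs only teacher outputs, not human labels, it is one way a small model could be updated from unlabelled posts.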

Joseph James

PhD Project title: Data-driven and Discourse-aware Scientific Text Generation

Supervisor: Dr Nafise Sadat Moosavi

Research interests:

Language generation
Machine translation
Summarisation and simplification
Interpretability and analysis of models for NLP

Project description

The current problem with natural language generation is that we allow the neural network to figure out discourse structure just from the training data which could contain biases, skewed representations and a lack of generality. This also makes it difficult to explain why a model picked certain connectors, words and syntax. By researching the underlying structure of these texts we can allow for more coherent and cohesive text generation to better understand how humans formulate text.

With GPT-4 releasing this year there has been a huge improvement in its capabilities especially in understanding questions, jokes and images, while also being able to pass tests with higher percentiles. However, it still has the limitations of reliability and hallucination where the content generated may sound plausible but is factually incorrect. In addition, with current methods and a brief review of different approaches we can see that there is always a lack of understanding of the discourse structure to a more fine-grained and grounded representation.

We will aim to build a discourse-aware text generation method for scientific papers to generate high quality long form text (e.g. a related work section). In the paper “Scaling Laws for Neural Language models”, the consensus for improving LLMs is to increase the size and computational power, but for more niche and domain specific areas with less resources this may not be a reliable method, so by developing a novel method we can move towards a more explainable and robust NLG model.

There has been ongoing research in utilising linguistic theory to support natural language processing and generation. Rhetorical Structure Theory (RST) and the Penn Discourse Treebank (PDTB) are the most well known discourse frameworks, and are well suited to modelling span prediction, nuclearity indication (predicting nucleus or satellite) and relation classification for adjacent sentences and paragraphs. We will look into whether or not there is more to discourse when it comes to structure, as the writer's intent appears to affect the phrasing used to formulate text; otherwise we would see all scientific papers being written in the exact same way. We will investigate what common attributes and structure are generated between different NLG architectures and relate it back to the theory. This could include new theories about discourse or expand on current ones.

References

OpenAI (2023) GPT-4 Technical report https://arxiv.org/abs/2303.08774
Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020). https://doi.org/10.48550/arXiv.2001.08361
Cho, Woon Sang, et al. "Towards coherent and cohesive long-form text generation." arXiv preprint arXiv:1811.00511 (2018). https://arxiv.org/abs/1811.00511
Zhu, Tianyu, et al. "Summarizing long-form document with rich discourse information." Proceedings of the 30th ACM international conference on information & knowledge management. 2021. https://doi.org/10.1145/3459637.3482396
Gao, Min, et al. "A Novel Hierarchical Discourse Model for Scientific Article and It’s Efficient Top-K Resampling-based Text Classification Approach." 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022. https://doi.org/10.1109/SMC53654.2022.9945306
Yang, Erguang, et al. "Long Text Generation with Topic-aware Discrete Latent Variable Model." Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. https://aclanthology.org/2022.emnlp-main.554/

Constantinos Karouzos

PhD Project title: Efficient transfer learning for pre-train language models across many tasks

Supervisor: Prof Nikos Aletras

Cohort 4 Student Representative

Academic background:

Integrated Master in Electrical and Computer Engineering, National Technical University of Athens

Research interests:

Machine Learning for NLP
NLP Applications
Transfer Learning and Domain Adaptation
Unsupervised & Low-Resource Learning
Legal NLP

Project description

In recent years, natural language processing (NLP) technologies have made significant advancements thanks to the development of pre-trained language models. These models are trained on massive amounts of text data from various sources, such as websites, news articles, and books. Once trained, these models can be fine-tuned for specific tasks and domains, resulting in improved performance across a wide range of NLP applications. However, there are still many open questions about how these models can be effectively adapted and used for new tasks and domains. This project aims to explore various ideas and techniques related to transfer learning, which is the process of leveraging pre-trained models to improve performance on related tasks.

Some of the research directions we plan to explore include:

Alternatives to fine-tuning: Instead of the traditional fine-tuning approach, we will investigate methods that involve continual learning, intermediate task learning, multi-task learning, and learning with auxiliary losses. These techniques may offer more efficient and effective ways to adapt pre-trained models for new tasks and domains.
Dealing with domain shifts: When the target domain is different from the source domain used for pre-training, performance may degrade. We will explore fine-tuning procedures that can address this issue and help maintain high performance across different domains.
Parameter-efficient transfer learning: Adapting large pre-trained models can be computationally expensive. We will investigate the use of adapter modules to enable more efficient transfer learning. These adapters can be trained on small amounts of task-specific data, making them useful for low-resource settings.
Model combinations: Instead of fine-tuning, we will explore techniques that involve merging pre-trained models. This approach may offer a more efficient way to combine the knowledge of multiple models for downstream tasks.
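Of the directions above, the adapter idea is the easiest to sketch. The toy code below shows a bottleneck adapter: the hidden vector is projected down, passed through a ReLU, projected back up and added to the original. Only the two small matrices would be trained while the pre-trained weights stay frozen. The dimensions, weights and function names are hypothetical, and real adapters sit inside Transformer layers rather than on a bare vector.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def adapter(hidden, w_down, w_up):
    """Bottleneck adapter: hidden + W_up * relu(W_down * hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]

# Hypothetical 4-dimensional hidden state and a 2-dimensional bottleneck.
hidden = [0.5, -1.0, 0.25, 2.0]
w_down = [[0.1, 0.2, 0.3, 0.4],
          [-0.4, 0.3, -0.2, 0.1]]

# With a zero up-projection the adapter starts out as the identity function.
w_up_initial = [[0.0, 0.0] for _ in hidden]
unchanged = adapter(hidden, w_down, w_up_initial)

# After (hypothetical) training the adapter nudges the representation.
w_up_trained = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1], [0.0, 0.0]]
adapted = adapter(hidden, w_down, w_up_trained)

trainable = len(w_down) * len(hidden) + len(w_up_trained) * len(w_down)
print(f"trainable adapter weights: {trainable}")
print("adapted hidden state:", adapted)
```

The count of trainable weights grows with the bottleneck width rather than with the size of the full model, which is what makes this style of transfer parameter-efficient.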

By exploring these research directions, this project aims to advance the state-of-the-art in NLP and contribute to the development of more efficient, adaptable, and powerful language understanding technologies. Through this project, we hope to contribute to the ongoing growth and success of NLP technologies, ultimately leading to improved human communication and decision-making capabilities.

Wing-Zin Leung

PhD Project title: Speech analysis and training methods for atypical speech

Supervisor: Prof Heidi Christensen

Research interests:

Speech production and perception
Models of speech production
Cognition and brain studies on speech
Speech, voice and hearing disorders
Phonation and voice quality
Pathological speech and language
Speech and audio classification
Speech intelligibility
Acoustic model adaptation
Applications in medical practice
Innovative products and services based on speech technologies

Project description

Motor speech disorders (MSDs) are speech disorders with a neurological cause that affect the planning, control or execution of speech (Duffy 2019). Dysarthria is a type of MSD that reflects abnormalities in the movements required for speech production (Duffy 2019). Some common neurological causes of dysarthria are Parkinson’s Disease, Multiple Sclerosis, and Cerebral Palsy. Furthermore, the psychosocial impacts (e.g. to identity, self-esteem, and social participation & quality of life) of dysarthria are well documented for individuals with dysarthria, and their family and carers (Walshe & Miller 2011).

Speech technologies have a fundamental role in the clinical management of atypical speech, and the subsequent impact on an individual’s quality of life. Automatic speech recognition (ASR) (i.e. the task of transforming audio data to text transcriptions) has important implications for assistive communication devices and home environment systems. Alternative and Augmentative Communication (AAC) is defined as a range of techniques that support or replace spoken communication. The Royal College of Speech and Language Therapists (RCSLT) outlines the use of AAC devices in the treatment of individuals with MSDs (RCSLT 2006), and AAC devices have become standard practice in intervention. Although the accuracy of ASR systems for typical speech has improved significantly (Yue et al. 2022), there are challenges that have limited dysarthric ASR system development and limit the generalisation of typical speech ASR systems to dysarthric speech, namely: 1) high variability across speakers with dysarthria, and high variability within a dysarthric speaker’s speech, and 2) limited availability of dysarthric data. Accordingly, studies have focused on i) adapting ASR models trained on typical speech data to address the challenge of applying typical speech models to dysarthric speech and ii) collecting further dysarthric data (although the volume and range of dysarthric data remain limited) (Yue et al. 2022). Furthermore, the classification of dysarthria, including measures of speech intelligibility, provides important metrics for the clinical (and social) management of dysarthria, including assessment of the severity of dysarthria and functional communication (Guerevich & Scamihorn 2017). The RCSLT promotes individually-tailored goals in the context of the nature and type of dysarthria, underlying pathology and specific communication needs (RCSLT 2006). In current practice, metrics are based on subjective listening evaluation by expert human listeners (RCSLT 2006), which requires high human effort and cost (Janbakhshi et al. 2020). Recent studies have implemented automated methods to classify dysarthric speech, including automatic estimators of speech intelligibility (Janbakhshi et al. 2020).

To advance the application of speech technologies to the clinical management of atypical speech, the current project aims to 1) collect a corpus of dysarthric data to increase the volume of quality dysarthric data available to the research community, 2) improve the performance of dysarthric ASR systems, including investigation of methods of adapting ASR models trained on typical speech, and 3) create automated estimators for the classification of dysarthria.

References

Guerevich, N. & Scamihorn, L. (2017), ‘SLP use of intelligibility measures in adults with dysarthria’, American Journal of SLP pp. 873–892.
Janbakhshi, P., Kodrasi, I. & Bourlard, H. (2020), ‘Automatic pathological speech intelligibility assessment exploiting subspace-based analyses’, IEEE, 1717–1728.
RCSLT (2006), Communicating Quality 3, Oxon: RCSLT.
Walshe, M. & Miller, N. (2011), ‘Living with acquired dysarthria: the speaker's perspective’, Disability and Rehabilitation 33(3), 195–203.
Yue, Z., Loweimi, E., Christensen, H., Barker, J. & Cvetkovic, Z. (2022), ‘Dysarthric speech recognition from raw waveform with parametric cnns’.

Maggie Mi

PhD Project title: What do neural network architectures learn about language? Probing for Semantics and Syntactical Relationships in Large Pretrained Language Models

Supervisor: Dr Nafise Sadat Moosavi

Academic background:

Linguistics BA (Lancaster University)

Research interests:

Clinical applications of speech and language technology
Computational linguistics
Cognitive modelling
Natural language generation

Project description

In recent years, pretrained language models (PLMs) built using deep Transformer networks have dominated the field of natural language processing (NLP). Deep neural models perform well in many natural language processing-related tasks but can be hard to understand and interpret. A body of research has thus arisen to probe such models for a better understanding of the linguistic knowledge they encode and their inner structure.

This project aims to find out more about what such models learn about language, focusing specifically on semantics. Recent probing methods for lexical semantics rely on word type-level embeddings that are derived from contextualized representations using vector aggregation techniques, and many PLMs use weight tying. Studies have shown that word embeddings are liable to degeneration. When this happens, word embeddings are not uniformly spread out in the embedding hyperspace, but rather occupy a narrow cone. This is problematic because, in such an incorrectly modelled hyperspace, even two random, unrelated words have a high cosine similarity, so the measure no longer discriminates meaningfully between related and unrelated words. Through a series of linguistically-driven experiments, it is hoped that a better understanding of PLMs can be obtained, as well as insights into resolving this issue of vector degeneration. In particular, areas of exploration include computational linguistics, probing for idiomaticity, complex forms of negation and perceptual nouns injection.
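The narrow cone effect can be reproduced with synthetic vectors. In the sketch below, random vectors stand in for word embeddings (an assumption, no real model is involved): when they are spread around the origin their average pairwise cosine similarity is near zero, but adding a shared offset pushes them into a cone where every pair looks similar. Subtracting the mean vector, shown last, is one simple correction and is not necessarily the approach this project will take.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def centre(vectors):
    """Subtract the mean vector from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]

rng = random.Random(0)
DIM, COUNT = 50, 20  # hypothetical embedding size and vocabulary sample

# Unrelated "words": random directions spread around the origin.
spread_out = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(COUNT)]
# Degenerate space: the same vectors shifted by a large common offset.
narrow_cone = [[x + 10.0 for x in v] for v in spread_out]

isotropic_sim = mean_pairwise_cosine(spread_out)
degenerate_sim = mean_pairwise_cosine(narrow_cone)
recentred_sim = mean_pairwise_cosine(centre(narrow_cone))

print(f"spread out:  {isotropic_sim:.3f}")
print(f"narrow cone: {degenerate_sim:.3f}")
print(f"re-centred:  {recentred_sim:.3f}")
```

The vectors carry exactly the same relative information in all three cases, yet only the first and third give cosine values that separate unrelated words from related ones.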

Valeria Pastorino

PhD Project title: Multimodal Multilingual Framing Bias Detection

Supervisor: Dr Nafise Sadat Moosavi

Cohort 4 Student Representative

Academic background:

BA Linguistic and Cultural Mediation, University of Naples "L'Orientale"
MA Sociolinguistics, University of York

Research interests:

Computational Linguistics
Sociolinguistics and Pragmatics
Emotion Detection
Social Media Analysis

Project description

Social scientists have explored the political significance of frames in mass communication, analysing how they are employed to direct audiences towards specific conclusions by selectively highlighting certain aspects of reality while obscuring others. A frame is thus a perceived reality, imbued with social and cultural values. In fact, the act of framing through linguistic and non-linguistic devices can have a significant impact on how the audience understands and remembers a problem and thus influence their subsequent actions. Think, for example, about the use of the words “freedom fighters” and “terrorists” to report on the same event, and how these two words can have very different inferred meanings. In our example, the choice of one word over the other might suggest the frame of that news article.

In recent years, the Natural Language Processing (NLP) community’s interest in the automatic detection of framing bias has increased, and social scientists too recognise the possible usefulness of computational tools in frame analysis. In fact, manual frame analysis typically involves identifying and categorising frames by hand in large amounts of text. This can be a time-consuming and labour-intensive process, particularly for large datasets which, in contrast, could be computationally processed in a quicker and more efficient way. In addition, while manual frame analysis may capture a limited range of frames, automatic frame detection algorithms can analyse a broader range of frames or perspectives.

The very nature of framing makes the task more complicated for NLP than it is for human analysts, since even in manual analysis there is a high degree of inter-coder variation. In general, NLP algorithms may not have access to the same level of contextual information as a human analyst, which can make it harder to identify the relevant frames. When dealing with framing bias what is important is how something is discussed, rather than what is discussed. Recent research has also shown that hallucinatory generation of framing bias has been observed in large Natural Language Generation (NLG) models, corroborating the importance of tackling the problem of framing bias.

This project will focus on the task of automatic detection of framing bias in contexts such as news, political discourse, and other forms of media communication. Drawing from social science, linguistics and NLP research, we will address shortcomings in the task of frame categorisation from a multidisciplinary point of view. This research also aims to detect framing bias in multilingual and multimodal settings, considering both the cultural and social aspects imbued into frames and the textual/visual nature of framing devices.

Overall, this project aims to further improve the task of frame detection in multilingual and multimodal settings, to provide a valuable tool for social scientists and NLP researchers to automatically detect framing bias in media communications. Such a tool could have significant applications in political and social sciences, as well as in media studies.

Robbie Sutherland

PhD Project title: Detecting communication situations for AI-enabled hearing aids

Supervisor: Prof Jon Barker

Industry partner: WSAudiology

Research interests:

Speech enhancement in hearing aids
Speech enhancement / noise reduction / dereverberation / echo cancelation
Speech intelligibility
Speaker verification and identification

Project description

Hearing impairment is a prevalent problem in the UK, affecting approximately 1 in 6 people, which makes it the second most common disability in the country. However, despite the fact that 6.7 million people could benefit from using a hearing aid, only 2 million people use them, and even those that do use them do not find them to be especially effective [1].

The basic function of a hearing aid is to amplify specific frequencies to directly compensate for the hearing loss experienced by the user. This is effective in quiet, ideal environments, but the overwhelming majority of typical communication situations present much more challenging environments, with lots of background noise, competing conversations and reverberation. In these scenarios, the amplification also gets applied to these unwanted noises, which renders the desired speech signal inaudible to the user. To mitigate this key issue, new approaches are needed that are able to amplify selected sound sources.

Beamforming is an extremely useful technique for enhancing the desired sound sources; in systems with multiple microphones, such as hearing aids, the signals from each microphone can be combined in such a way that the signal from a specific direction is enhanced. While there has been a lot of progress in this domain, it is difficult to apply in the context of hearing aids; there are significant computational restrictions due to the small form factor of the device, the system must function in real time with low latency (the signal should not be delayed by more than 10 ms), the microphone arrays are constantly moving, and the precise relative positions of the microphones may not be accurately known.
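The simplest instance of this idea is a delay-and-sum beamformer: each microphone signal is time-aligned for the target direction and the aligned signals are averaged, so the target adds coherently while uncorrelated noise partly cancels. The sketch below is a toy under stated assumptions: a 16 kHz sample rate, four microphones, known integer sample delays and independent noise at each microphone are all hypothetical simplifications.

```python
import math
import random

def delay_and_sum(channels, delays):
    """Advance each channel by its steering delay (in samples), then average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]

def mean_square_error(signal, reference):
    """Average squared difference between a signal and the clean reference."""
    return sum((s - r) ** 2 for s, r in zip(signal, reference)) / len(reference)

SAMPLE_RATE = 16000  # Hz, an assumed rate
# The 10 ms latency limit expressed in samples at the assumed rate.
LATENCY_BUDGET_SAMPLES = SAMPLE_RATE * 10 // 1000

rng = random.Random(0)
target = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(2000)]

# Hypothetical arrival delays of the target at each of four microphones.
delays = [0, 2, 4, 6]
channels = [
    [x + rng.gauss(0.0, 1.0) for x in [0.0] * d + target]
    for d in delays
]

enhanced = delay_and_sum(channels, delays)
single_mic_error = mean_square_error(channels[0], target)
beamformed_error = mean_square_error(enhanced, target)

print(f"latency budget: {LATENCY_BUDGET_SAMPLES} samples")
print(f"noise power, one microphone: {single_mic_error:.3f}")
print(f"noise power, beamformed:     {beamformed_error:.3f}")
```

The sketch relies on knowing `delays` exactly, which is the assumption that moving arrays and uncertain microphone positions break in a real hearing aid.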

While modern beamforming algorithms are very sophisticated, their effect is severely limited when little is known about the acoustic scene; the algorithm needs to be aware of which sources should be amplified, and which should be attenuated. The aim of this project is to investigate methods and features that distinguish between common conversational scenes so the hearing aid may determine which sound sources are relevant to the listener, and which sounds are interfering (similar to what is described by the cocktail party effect). It can be especially difficult when the interfering sound is also speech, such as a nearby conversation or TV show, since there may be little acoustic information which can indicate whether it is relevant to the listener. For this reason, it is important to explore machine learning methods that consider both acoustic information and behavioural cues from the listener; head movement, for example, has been shown to provide insight into the scene and where the listener’s attention may be [2].

References

[1] Deafness & Hearing Loss Facts, 2022, https://www.hearinglink.org/your-hearing/about-hearing/facts-about-deafness-hearing-loss/

[2] Lauren V. Hadley and John F. Culling. Timing of head turns to upcoming talkers in triadic conversation: Evidence for prediction of turn ends and interruptions. Frontiers in Psychology, 13, 2022.

Cohort 5 (2023)

Jason Chan

PhD Project title: Investigating the Reasoning Properties of LLMs

Supervisors: Prof Rob Gaizauskas and Dr Cass Zhao

Academic background:

BA, Philosophy, University College London
Graduate Diploma in Law, BPP University
PgDip (Distinction), Legal Practice, BPP University
MSc, Speech and Language Processing, University of Edinburgh

Research interests:

Natural language understanding and reasoning
Pragmatics
Machine translation
Speech and NLP applications in law
Speech and NLP applications in music

Project description

Reasoning is a key process through which humans acquire new knowledge from existing facts and premises. While recent research has shown that large language models (LLMs) can achieve high performance on a range of tasks that require reasoning abilities, the mechanism by which they do so remains an open question. This project therefore investigates whether and how LLMs go about the process of “reasoning”, and whether any parallels can be drawn between their reasoning process and that of humans. In doing so, we aim to shed light on the strengths and limitations of current models, and propose new methods for improving the reasoning capabilities of LLMs.

According to the theory of mental models [1] in cognitive science, humans do not typically reason over propositions (statements about a particular event or state of the world) by converting them into abstract logical expressions and applying formal rules. Instead, they construct mental models of the situations being described, based on their interpretation of the propositions’ meaning (which is in turn informed by their existing knowledge about the world). These models then form the basis for making inferences, evaluating conclusions, and identifying and resolving inconsistencies among propositions.

A primary focus of this project is therefore to investigate whether LLMs employ a similar method of constructing and using models in their reasoning process. Recent research suggests that LLMs trained only on text data can nonetheless develop structured representations of abstract concepts such as colours, space and cardinal directions. Building on this body of work, we aim to tackle the open question of whether LLMs can (and do in fact) construct and use models that integrate and represent information across various dimensions such as space, time, possibility and causation.

Moreover, a prediction of the model theory is that the logical properties of relations and connectives, such as “if” and “or”, are not defined by formal rules; instead, they emerge through the process of constructing and evaluating mental models in response to specific contexts and propositions. For example, the relation “taller than” is transitive (A is taller than B and B is taller than C. Therefore A is taller than C.) only under certain contexts, but not others such as the following:

Celia was taller than Joe when she was six years old. Joe was taller than Mary when he was six years old.

(Because Celia and Joe may not have been six years old at the same time, we cannot infer that Celia was taller than Mary when Celia was six years old.)
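The contrast can be made concrete with a toy inference routine that applies transitivity only when two facts share the same context. This is purely an illustration of the logical point, with a hypothetical representation of facts as (taller, shorter, context) triples. It is not a model of human reasoning or of how an LLM works.

```python
def infer_taller(facts):
    """Chain 'taller than' facts, but only when their contexts match."""
    inferred = set()
    for taller, middle, context in facts:
        for middle_again, shorter, other_context in facts:
            if middle == middle_again and context == other_context:
                inferred.add((taller, shorter, context))
    return inferred

# Both facts hold in the same context, so transitivity applies.
same_context = [
    ("A", "B", "now"),
    ("B", "C", "now"),
]

# Each fact is anchored to a different person's sixth year of life.
different_contexts = [
    ("Celia", "Joe", "when Celia was six"),
    ("Joe", "Mary", "when Joe was six"),
]

print(infer_taller(same_context))
print(infer_taller(different_contexts))
```

A purely rule-based reasoner that ignored the context field would wrongly conclude that Celia was taller than Mary.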

Consequently, a starting point is to evaluate the sensitivity of LLMs’ outputs to such context variations when reasoning about relations and connectives. Sensitivity of this kind would suggest reasoning mechanisms that go beyond straightforward pattern-matching and rule application, which we would then proceed to uncover through examining and performing causal analyses on the internal representations of LLMs.

Reference

[1] P. N. Johnson-Laird. Mental models. Cambridge University Press, 1983.

Jack Cox

PhD Project title: Decomposition of speech embeddings

Supervisor: Prof Jon Barker

Industry partner: Meta

Academic background:

MSc, Chemical Physics, University of Bristol

Research interests:

I am interested in the use of natural language processing in low-resource contexts, including applications in linguistics and language conservation, greener NLP, and broadening access for marginalised communities.

Project description

Desirable speech representations should disentangle the content (semantics) and the intonation (expression) from potentially interfering variations, such as speaker identity. Applications range from biometrics-free speech models to data augmentation through control of speaker or style attributes, and controlled automatic alignment of data [1, 2].

While there is no formalised definition of disentanglement, the intuition is that a disentangled representation should separate the interpretable factors of variation in real-world data [3].

In this project we will investigate supervised and unsupervised techniques to disentangle speech embeddings. Existing work in this space includes unsupervised learning methods to recover the real-data factors [3], and methods using variational auto-encoders (VAEs) for factorisation [4, 5].

Modelling techniques for disentangling embeddings should be generalisable to other modalities (e.g. audio, text or images), but the focus here will be on the speech modality.

References

[1] Leng et al. (2023). PromptTTS 2: Describing and Generating Voices with Text Prompt. ICLR.
[2] Duquenne et al. (2023). SONAR EXPRESSIVE: Zero-shot Expressive Speech-to-Speech Translation.
[3] Bengio et al. (2013). Representation learning: A review and new perspectives. PAMI.
[4] Hsu et al. (2017). Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data. NeurIPS.
[5] Kim et al. (2018). Disentangling by Factorising. ICML.

Aaron Fletcher

PhD Project title: Fact Checking Scientific Claims

Supervisor: Dr Mark Stevenson

Academic background:

Veterinary Sciences, Royal Veterinary College
Veterinary Sciences, CertAVP, Royal College of Veterinary Surgeons
MSc (Distinction), Artificial Intelligence, University of Hertfordshire

Research interests:

My interest in natural language processing stems from the medical domain, particularly clinical decision tools, knowledge base creation and prediction based on medical records.

Project description

The medical evidence base is expanding at an unprecedented rate, with a staggering 268.93% increase in publications from 2002 to 2022. This growth poses a significant challenge, as it becomes impractical for a single human to thoroughly review all evidence returned from queries in a reasonable timeframe. Nevertheless, this is precisely what medical researchers are expected to do during the peer review process. As this body of knowledge continues to swell, the need for efficient tools that can review documents becomes increasingly urgent.

Technology-assisted review (TAR) aims to address this issue by providing tools to reduce the amount of effort required to screen large collections of documents for relevance. Continuous Active Learning (CAL) is the most widely adopted TAR approach. CAL uses Active Learning to incrementally develop a classifier that ranks documents by relevance, so that relevant evidence is identified as early as possible. CAL systems often rely on relatively simple classifiers, such as logistic regression, although more recent work has begun to explore the use of large language models (LLMs) with mixed results.
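The CAL loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's system: the synthetic two-feature documents, the small logistic-regression classifier, the function names, and the `batch_size` and `budget` parameters are all hypothetical, and the `oracle` function stands in for the human reviewer.

```python
import math
import random

def predict(w, x):
    """Probability of relevance under a logistic-regression model (last weight is the bias)."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))              # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit the classifier with plain stochastic gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                w[j] -= lr * err * xj
            w[-1] -= lr * err
    return w

def continuous_active_learning(docs, oracle, seed_ids, batch_size=5, budget=30):
    """Review documents in classifier-ranked order, retraining after each batch.

    docs: list of feature vectors; oracle(i) -> 0/1 plays the human reviewer.
    Returns the document ids in the order they were reviewed.
    """
    labels = {i: oracle(i) for i in seed_ids}
    order = list(seed_ids)
    limit = min(budget, len(docs))
    while len(labels) < limit:
        ids = list(labels)
        w = train_logreg([docs[i] for i in ids], [labels[i] for i in ids])
        unlabelled = [i for i in range(len(docs)) if i not in labels]
        # Review the highest-scoring unlabelled documents next.
        unlabelled.sort(key=lambda i: predict(w, docs[i]), reverse=True)
        for i in unlabelled[:min(batch_size, limit - len(labels))]:
            labels[i] = oracle(i)
            order.append(i)
    return order

# Synthetic collection (assumed): 10% relevant, first feature informative, second is noise.
rng = random.Random(0)
truth = [1 if i % 10 == 0 else 0 for i in range(200)]
docs = [[rng.uniform(0.6, 1.0) if t else rng.uniform(0.0, 0.4), rng.random()] for t in truth]
order = continuous_active_learning(docs, lambda i: truth[i], seed_ids=[0, 1])
found = sum(truth[i] for i in order)
print(f"reviewed {len(order)} of {len(docs)} documents, found {found} of {sum(truth)} relevant")
```

The design choice that distinguishes CAL from one-shot classification is that the classifier is retrained after every batch of reviewer judgements, so the ranking of the remaining documents keeps improving as the review proceeds.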

The project aims to make use of recent advances in LLMs to develop improved TAR methods for the medical domain. In particular, it will explore (1) the effect of alternative Active Learning strategies (e.g. query by committee), (2) the benefit of integrating LLMs into the TAR process, (3) the effect of prompting techniques, and (4) the benefit of using state-of-the-art LLM techniques such as chain of thought or reflection on search trees.

Paul Gering

PhD Project title: Towards a quantitative measure of the 'quality' of multimodal interaction involving spoken language
Supervisor: Prof Roger K Moore

Cohort 5 Student Representative

Academic background:

BSc (Hons), Psychology, University of Warwick
MSc, Psychological Research Methods with Data Science, University of Sheffield

Research interests:

My research interests focus on the psychological applications of speech and language technology, such as supporting individuals with language impairments.

Project description

Social interactions involve two or more individuals producing communicative behaviours such as verbalisations and gestures. Successful interactions are rewarding as they can strengthen social relationships, facilitate mutual understanding and support individual physical and mental health. Given the benefits of positive interactions, researchers have proposed methods of measuring interaction quality (IQ).

'Quality' is a subjective concept because it depends on an individual's judgement processes. Consequently, researchers have relied on self-report measures of IQ, where participants answer a series of questions at the end of an interaction. These measures are unable to measure IQ dynamically, which would allow interaction partners to change their behaviours in real-time based on the current IQ. Additionally, participant responses to self-report may suffer from recency bias, as they are more likely to remember the most recent elements of the interaction. These limitations could be resolved by developing an objective and dynamic measure of IQ.

The aim of this PhD project is to use statistical methods and signal processing to develop an objective and dynamic measure of interaction quality (IQ). The project will focus on assessing the quality of multimodal interactions with a spoken language component. Multimodal interactions involve verbal and non-verbal communicative behaviours and account for the majority of real-world human interactions. These interactions are densely layered and, thus, present a more interesting challenge for this project. Based on previous research, any of the following factors may be relevant for measuring IQ: (1) information exchange rate, (2) synchronicity (whether interaction partners coordinate their behaviours in time), (3) the attentional focus of the interaction partners, (4) factors related to turn-taking such as turn length and the smoothness of turn switching, and (5) emotive behaviours such as smiling and laughing.
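As a rough illustration of how factor (4) might be quantified, the sketch below computes simple turn-taking features from time-stamped turns. The feature names, the `(speaker, start, end)` input format and the toy data are assumptions made for this example. They are not the measure proposed by the project.

```python
from statistics import mean

def turn_taking_features(turns):
    """Compute simple turn-taking features.

    turns: list of (speaker, start_s, end_s) tuples sorted by start time.
    """
    lengths = [end - start for _, start, end in turns]
    # Gap between consecutive turns by different speakers; negative means overlap.
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(turns, turns[1:]) if cur[0] != nxt[0]]
    return {
        "mean_turn_length": mean(lengths),
        "mean_switch_gap": mean(gaps) if gaps else 0.0,
        "overlap_rate": sum(g < 0 for g in gaps) / len(gaps) if gaps else 0.0,
    }

# Toy two-party interaction (times in seconds); B's second turn overlaps with A's.
turns = [("A", 0.0, 2.0), ("B", 2.5, 4.5), ("A", 4.0, 6.0), ("B", 6.5, 7.5)]
features = turn_taking_features(turns)
print(features)
```

Computing such features over a sliding window, rather than over the whole interaction, would be one way to obtain a value that changes as the interaction unfolds.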

The proposed measure will be tested using an embodied conversational agent, a computational interface with a physical, human-like form, enabling it to produce verbal and non-verbal communicative behaviours. The conversational agent will use the proposed measure to assess IQ during several interactions with a human user. These IQ scores will be compared to baseline scores produced by manually rating the quality of the human-computer interactions. The performance of the proposed measure will also be compared to previously developed IQ measures. Statistical testing will indicate whether any significant difference exists between these IQ scores.

The proposed interaction quality (IQ) measure is expected to produce similar IQ scores to the baseline scores, suggesting that the proposed measure is accurately capturing IQ. In contrast with previous measures, the proposed measure will compute IQ scores dynamically, thus demonstrating the feasibility of a real-time IQ measure. The IQ measure will be made openly available for researchers across disciplines to use. It is hoped that the IQ measure will have applications across varying contexts. For example, it could be used to evaluate the performance of dialogue systems such as Siri or Alexa. Additionally, it could be used by social workers to assess parent-child IQ, thus allowing them to identify families in need of additional support. The measure could also be useful for speech-language pathologists to assess and compare the effects of various disabilities on client engagement.

Anthony Hughes

PhD Project title: Multimodal Large Language Models for Healthcare

Supervisors: Dr Ning Ma, Prof Nikos Aletras, and Dr Steven Wood (Sheffield Teaching Hospitals)

Academic background:

BSc (Hons), Computer Systems Engineering, Nottingham Trent University
MA, Computational Linguistics, University of Wolverhampton

Project description

Artificial intelligence (AI) has exploded into the mainstream in recent years, paving the way for the development of large language models (LLMs) and large multimodal models (MMs). These models are allowing for new and innovative ways of tackling societal problems. The increased use of LLMs and MMs is having a great influence on healthcare in many applications, including prediction of patient outcomes, medical decision making, and the potential to fuse and infer over a multitude of patient signals. Given the critical and high-impact nature of this environment, models need to be able to detect and adapt to adversarial behaviours over time.

Great strides have been made in incorporating LLMs into a healthcare setting. These sophisticated models now possess the ability to surpass human performance in tasks such as medical exams. However, LLM training processes that focus on generalising across a particular data distribution from a particular point in time are a cause for concern. This style of model training is static, consequently paving the way for models that could become misaligned with the environment in which they operate. In this project, we look to learn and build mechanisms for combating a model's misalignment, helping models to continually learn.

The task of updating models, however, does not come without its challenges. Updating the parameters of a trained language model leaves it susceptible to effects such as concept drift, model drift, catastrophic forgetting and interference. In this PhD project, we look to the continual learning and meta-learning literature to address these critical issues. In a medical setting, it is crucial that when realigning a model to its environment, it does not forget the concepts and tasks it was previously taught.

The multimodal nature of medicine requires the development of MMs in order to process and interpret multiple types of data. In collaboration with local healthcare organisations, this PhD project is aimed at harnessing the potential of LLMs and MMs in the medical domain to improve healthcare outcomes, enhance patient care, and assist healthcare professionals. It will in particular focus on how LLMs and MMs can be adopted to handle and interpret the highly granular data in medicine while ensuring the models remain robust to data changes and task changes over periods of time.

Ian Kennedy

PhD Project title: Advancing Deep Learning Techniques for Enhanced Efficiency and Effectiveness in NLP Models

Supervisor: Dr Nafise Moosavi

Academic background:

BSc, Economics and Finance, University College Dublin
MS, Applied Analytics, Columbia University NYC, USA

Research interests:

Natural Language Generation
Machine Learning for NLP
Knowledge Intensive Tasks
NLP Applications
Cross-modal learning

Project description

This project seeks to transform the capabilities of deep neural networks (DNNs) in natural language processing (NLP) by pioneering the development and implementation of innovative activation functions. Traditionally, activation functions like 'tanh' and 'ReLU' have dominated the field; however, this initiative aims to explore and validate the efficacy of non-monotonic and learnable activation functions, potentially offering a substantial advancement over traditional activation functions. The hoped-for benefit is to improve efficiency by reducing data and computational requirements or, conversely, to make neural networks more effective while keeping the data and computation requirements the same.

The research will build on recent advancements exemplified by the Swish function and the SIREN model, which utilises sinusoidal waves, to experiment with similarly novel activation mechanisms that are learnable and non-monotonic. The central hypothesis is that these advanced functions can enhance the network's ability to model complex, non-linear relationships in data more efficiently and effectively.
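For readers unfamiliar with these functions, the sketch below shows Swish with a trainable parameter and a sinusoidal activation of the kind used in SIREN. It is a plain-Python illustration rather than the project's implementation: the function names are hypothetical, and the default frequency scale `w0=30.0` is an assumed value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x). beta can be treated as a learnable parameter."""
    return x * sigmoid(beta * x)

def swish_grad_beta(x, beta=1.0):
    """Derivative of swish with respect to beta, which is what makes beta trainable."""
    s = sigmoid(beta * x)
    return x * x * s * (1.0 - s)

def sine_activation(x, w0=30.0):
    """Sinusoidal activation in the style of SIREN; w0 is an assumed frequency scale."""
    return math.sin(w0 * x)

# Non-monotonic behaviour: the output dips below zero for negative inputs,
# then rises back towards zero as the input becomes more negative.
for x in (-4.0, -2.0, -1.0, 0.0, 1.0):
    print(x, round(swish(x), 4))
```

Unlike ReLU or tanh, Swish is not monotonic for negative inputs, and its shape changes with `beta`, so a network can adapt the activation itself during training by following the gradient in `swish_grad_beta`.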

The investigative focus will aim to improve upon traditional, non-learnable activation functions with novel, learnable ones. A vital aspect of this examination will involve assessing whether networks equipped with learnable functions can simplify overall architecture by encoding complex data relationships more compactly, thereby reducing the depth and size of the network needed. This will aim to improve the trade-off between efficiency and effectiveness. In other words, increased efficiency should correlate with greater effectiveness when using the same resources.

Expected contributions of the research span both theoretical advancements and practical applications. The project aims to deepen the understanding of how novel activation functions affect learning processes and generalisation in DNNs. By establishing practical guidelines for optimal activation function development, this work could lead to significantly more efficient and effective neural networks, achieving robust performance with fewer resources, thus marking a substantial advance in the field of deep learning in NLP.

Fritz Peters

PhD Project title: Using Structured Conversational Prompts in the Diagnosis of Mental Health and Cognitive Conditions

Supervisor: Prof Heidi Christensen

Academic background:

BSc, Psychology and Language Sciences, University College London
MSc, Natural Language Processing, University of Trier, Germany

Project description

Advances in speech and language technologies have brought us closer to healthcare applications that provide a truly meaningful benefit for individuals with various health conditions. Existing applications have focused on either translating standard assessments involving a clinician into an automated tool, or on harnessing the power of conversations as a window into a person’s cognitive health.

Holding a conversation draws on different cognitive processes related to attention, memory, and executive functioning. In addition, it reveals the person’s ability to understand and produce speech. Specific cognitive conditions such as dementia and psychotic disorders cause changes to different aspects of a person’s ability to hold a conversation. For example, an individual suffering from dementia is likely to use simplified grammar while the speech produced by a person suffering from a psychotic disorder is likely to be less coherent. Such language and speech abnormalities can be analysed and used as predictive features in the diagnosis of these conditions.

There are existing tests and frameworks for the automatic detection of a person’s cognitive health based on speech/language features. The majority of these tests use a limited set of different tasks with the most prominent being a picture description task. Past research has pointed out potential shortcomings of these existing tests including potential biases introduced by the standard picture used for the picture description tasks. Moreover, there is a task-dependency of existing analysis frameworks. This means that while the analysis framework yields significant results using the speech elicited by one task, it does not for the speech elicited by another task.

The primary objective of this research is to develop and evaluate a series of structured conversational prompts for the diagnosis of mental health and cognitive conditions, with a focus on dementia and psychotic disorders. These prompts should reliably elicit speech that shows condition-specific abnormalities. We aim to build a robust analysis framework that not only aids the diagnosis of cognitive conditions but also produces a human-interpretable output. To improve the current diagnostic pathway of cognitive conditions, it is necessary that the output is informative to a clinician. Moreover, to increase fairness in healthcare, we aim to assess and minimise potential biases introduced by our conversational prompts and analysis framework. The product of this research should be an automated assessment of dementia and psychotic disorders that truly benefits the current diagnostic process.

Yanyi Pu

PhD Project title: Autonomous LLM Agents for Graph Problem Solving

Supervisor: Professor Nikos Aletras

Academic background:

MPhil, Computer Science, University of Cambridge
MSc(Eng) (Distinction), Data Science, University of Sydney, Australia
CERT HE (Distinction), Graduate Certificate in Data Science, University of Sydney, Australia
BA, Commerce, University of New South Wales, Australia

Boxuan Shan

PhD Project title: Expressive Speech Synthesis

Supervisor: Dr Anton Ragni

Academic background:

BSc (Hons), Artificial Intelligence and Computer Science, University of Sheffield
MSc (Distinction), Advanced Computer Science, University of Sheffield

Research interests:

Speech synthesis and enhancement
Low-resource languages
Multimodality

Project description

Speech synthesis technology has made remarkable progress in generating human-like speech from text inputs. However, human communication extends beyond merely the correct arrangement of words and grammar. For example, you can identify who is speaking from speech without extra hints at the beginning of each utterance, and you can also feel their emotion without them telling you about it.

Current speech synthesis technology, while being able to produce speech that sounds like people, often struggles with the subtleties that characterise natural human speech. These include properties such as intonation, emphasis, and timing, which convey emotions and intent, or speaker characteristics that are distinct from person to person, such as voice texture, pitch, accents, and speech quirks. Some recent systems allow the customisation of speech by accepting a specific age and emotion instruction or accepting a short clip of speech as a styling example. However, more nuanced modelling and control remains a challenging area that needs further development.

This research aims to push the boundaries of expressive speech synthesis by researching and developing novel approaches that go beyond conveying information and infuse synthesised speech with genuine emotions, theatrics, and human quirks like laughter, grunts, pauses, and filler words. The outcome of this project could reshape human-computer interaction, bring synthesised speech closer to the intricacies of genuine human conversation, and impact various domains, including entertainment, virtual assistants, education, and communication aids.

Yao Xiao

PhD Project title: Increasing Robustness of Augmentative and Alternative Communication based on Speech and Language Technology

Supervisor: Prof Heidi Christensen

Academic background:

MA, Linguistics, University of Edinburgh
MSc, Computing, University of Central Lancashire, Cyprus

Project description

Neurological impairments can cause significant speech and language difficulties, for example in individuals affected by conditions such as stroke [1], traumatic brain injury [2], or neurodegenerative diseases like Parkinson's [3, 4] or Alzheimer's [5]. They can cause speech and language disorders such as aphasia and dysarthria, alongside other language-related deficits including semantic, pragmatic and syntactic impairments. Symptoms such as word-searching difficulties and reduced speech intelligibility pose challenges for individuals in efficiently communicating with others, and consequently affect their quality of life.

Assistive technologies involving speech and language technology, ranging from automatic speech recognition (ASR) systems to augmentative and alternative communication (AAC) devices, can provide valuable aid in mitigating communication difficulties. The linguistic and acoustic differences between impaired and unimpaired speech or language have also inspired the automatic assessment of related diseases and disorders, serving as a non-intrusive, scalable, and cost-effective way to facilitate early detection, monitoring, and management of such conditions.

One of the main challenges in training data-driven speech and language technology is the sparsity of impaired speech and language datasets, which limits the availability of representative and diverse samples necessary for robust model training and evaluation. Recent studies have been focusing on data augmentation methods to increase the amount of training data and enhance model robustness. This project approaches this problem by exploring the bridge between textual and speech domains aiming to advance understanding in the transitional area of speech technologies and natural language processing (NLP) related to impairment. Specifically, the approach involves the application of foundation models to analyse impaired speech and language. The goal of the project is to contribute to the development of robust assistive technologies and diagnostic tools, ultimately enhancing the quality of life for affected individuals.

References

[1] Rousseaux, M., Daveluy, W., & Kozlowski, O. (2010). Communication in conversation in stroke patients. Journal of neurology, 257, 1099-1107.
[2] McDonald, S., Code, C., & Togher, L. (2016). Communication disorders following traumatic brain injury. Psychology Press.
[3] Montemurro, S., Mondini, S., Signorini, M., Marchetto, A., Bambini, V., & Arcara, G. (2019). Pragmatic language disorder in Parkinson’s disease and the potential effect of cognitive reserve. Frontiers in psychology, 10.
[4] Ash, S., Jester, C., York, C., Kofman, O. L., Langey, R., Halpin, A., ... & Grossman, M. (2017). Longitudinal decline in speech production in Parkinson's disease spectrum disorders. Brain and language, 171, 42-51.
[5] Taler, V., & Phillips, N. A. (2008). Language performance in Alzheimer's disease and mild cognitive impairment: a comparative review. Journal of clinical and experimental neuropsychology, 30(5), 501-556.

Minghui Zhao

PhD Project title: Physics-Based Speech Synthesis

Supervisor: Dr Anton Ragni

Academic background:

MSc (Distinction), Speech Science, University of Edinburgh
BA, English Language & Linguistics, Ocean University of China

Project description

Speech synthesis technology has attracted significant attention due to its widespread applications in real life. It can not only facilitate communication for individuals with speech impairments but also enhance interaction between humans and machines. Just as studying bird flight informs aircraft design, initial efforts in speech synthesis aimed to draw inspiration from the workings of the human vocal tract. However, since the emergence of neural networks, the speech community has moved away from physics-inspired modelling and instead embraced data-driven statistical approaches.

These statistical methods have demonstrated the ability to produce high-quality speech but come with certain limitations. It's common in machine learning for a successful model architecture in one task to be applied to another, even if the tasks are unrelated. For example, state-of-the-art transformer-based models are used to generate both language and speech. However, it is important to realise that speech, unlike other sequential data such as text, possesses numerous properties rooted in physical facts, which are ignored in general-purpose models. To compensate for the absence of inherent prior information or inductive bias specific to the task, these models usually have complex architectures and a vast number of parameters. However, under limited resource conditions, it's not possible to gather the substantial amount of training data required to fit this abundance of parameters. Moreover, the lack of transparency of the learning process leads to limited interpretability and unpredictable model behaviours.

Modelling physics-inspired, or natural, speech production processes offers a way to address these shortcomings. Traditional methods of emulating the natural speech production process faced two challenges: data scarcity and the complexity of modelling non-linear systems. With the advancement in both areas, efforts should now be made to incorporate physical processes in the modelling of speech signals. Notably, diffusion models have exhibited impressive performance in various machine learning applications, such as speech synthesis. This project explores diffusion models and other physical processes, comparing their efficacy in speech synthesis.
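To make the link between diffusion models and physical processes more concrete, the sketch below shows the forward (noising) step in one common formulation of a diffusion model, where a clean signal is gradually mixed with Gaussian noise. It is an illustration only and not necessarily the formulation the project will use. The linear noise schedule, the number of steps `T`, the function names and the sine wave standing in for a speech frame are all assumptions.

```python
import math
import random

def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): how much of the clean signal survives after t steps."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def diffuse(x0, t, abar, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise for a list of samples x0."""
    a = abar[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]    # linear schedule (assumed)
abar = alpha_bar(betas)

rng = random.Random(0)
wave = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]  # toy stand-in for a speech frame
noisy_early = diffuse(wave, 10, abar, rng)     # still close to the clean signal
noisy_late = diffuse(wave, T - 1, abar, rng)   # almost pure noise

print("signal fraction kept at t=10:", round(math.sqrt(abar[10]), 4))
print("signal fraction kept at t=T-1:", round(math.sqrt(abar[-1]), 4))
```

A generative model of this kind is trained to reverse the noising process step by step, which is where any task-specific structure would enter.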


Sponsored by