Mini-project
Every year, our first-year students undertake a mini-project.
The mini-project is a six-month activity which starts within the first few weeks of joining the CDT. It offers a fantastic opportunity to bond with your fellow cohort members while creating an SLT research prototype system.
The mini-projects are designed to span both speech and language sub-areas so you get exposure to different technologies and approaches right from the start.
Working as a team, you will follow a conventional software engineering project approach, from scoping through to implementation and evaluation. You also get the chance to take on different management roles over the lifespan of the project.
Example topics include speech-to-speech translation, spoken question answering, and call centre analytics.
Cohort five (2023 intake): Drive-through Restaurant Conversational Assistant
A drive-through (or “drive-thru”) restaurant is a type of take-out fast-food provider which allows customers to purchase hot and cold food and / or drinks without leaving their cars. The process involves only short pauses between ordering, paying, and collecting; once the food has been handed to the customer, the customer drives away.
It is common for a digital menu board to be placed at the head of the queue which displays the menu, promotional offers, and a summary of the customer’s order. Since the server is often located some distance away within the restaurant, the digital menu board incorporates a microphone to capture the customer’s speech and a speaker for the customer to hear the server’s responses / questions. This also allows the server to manage multiple ordering queues from the same location (for example, by interleaving ordering across the queues). The server wears a headset with one or more earphones and a boom microphone.
The overall aim of the project is to help the client (an individual drive-through restaurant owner or chain owner) improve the ordering process. The system will support the existing server rather than replace them. This can be achieved in a number of ways (each of increasing complexity):
Realtime confirmation of the customer’s speech / order to reduce the number of clarification interactions [server support]
Realtime interrogation of the menu based on the customer’s speech to check for item availability and / or automatically adding the item to the order [server support] (see the sketch after this list)
Fully automated conversational ordering agent with synthesised speech clarifications and confirmations [server in supervisory role only]
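As an illustration of the second option, the menu-interrogation step could be as simple as fuzzy-matching the recognised phrase against the menu. Below is a minimal sketch using only the Python standard library; the menu items, availability flags and matching strategy are all invented for illustration, and a deployed system would sit behind a real speech recogniser.

# Illustrative menu interrogation: fuzzy-match a recognised phrase against
# the menu to check availability. All names and data here are invented.
from difflib import get_close_matches

MENU = {"cheeseburger": True, "veggie wrap": False, "vanilla milkshake": True}

def check_item(recognised_phrase: str) -> str:
    # Tolerate ASR errors such as split or merged words via fuzzy matching.
    match = get_close_matches(recognised_phrase.lower(), MENU.keys(), n=1, cutoff=0.6)
    if not match:
        return "Sorry, I didn't catch that item."
    item = match[0]
    return f"{item} is {'available' if MENU[item] else 'currently unavailable'}."

print(check_item("cheese burger"))  # -> cheeseburger is available.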
The ordering environment poses a number of acoustic / speech challenges including noise (in-car music; in-car non-driver speech; external speech; environmental noise); low quality microphones and speakers; wide variations in the customer’s conversational approach; customer indecision and / or order changes; variety of accents and speaking rates; speech overlap (customer and server; customer and passenger; passenger and server).
The focus of the project is on research into the speech and language tools necessary to yield an improved version of the service, and the outcome will be tools supported by experimental evidence of their usefulness. Part of the project effort will focus on the design and recording of a representative audio corpus of drive-through restaurant interactions.
Approach
The cohort is split into two teams to add a little competitive excitement.
Cohort four (2022 intake): University Library Spoken Language Information Assistant
The University of Sheffield Library offers a number of information services to support staff, students and others. These include an online chat-with-a-librarian text service and a web-based interface to a database of pre-composed answers to frequently asked questions (FAQs) (see https://libraryhelp.shef.ac.uk).
The FAQ database may be accessed in several ways. It can be browsed, optionally by topic, or it can be searched: the user formulates an arbitrary query, which is then “best-matched” against the FAQ answers, and the answer deemed most likely to address the query is returned.
Additionally, the library has commissioned a number of information kiosks, which are placed in various locations around the University. These feature both a soft keyboard interface and a speech interface that allows users to pose queries to the kiosk and to receive spoken and/or written feedback. Both these interfaces access the FAQ information.
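To make the “best-matched” step concrete, here is a minimal sketch of one standard baseline, TF-IDF retrieval with cosine similarity, using scikit-learn. This is an assumed approach for illustration only, not the Library’s actual matching algorithm, and the FAQ entries are invented.

# Illustrative FAQ matcher: TF-IDF vectors compared by cosine similarity.
# An assumed baseline, not the Library's actual algorithm; FAQs are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faqs = [
    "You can renew your books online through your library account.",
    "The library is open 24 hours a day during semester.",
    "Lost library cards can be replaced at any help desk.",
]

vectoriser = TfidfVectorizer(stop_words="english")
faq_matrix = vectoriser.fit_transform(faqs)

def best_match(query: str) -> str:
    """Return the FAQ answer whose TF-IDF vector is closest to the query."""
    scores = cosine_similarity(vectoriser.transform([query]), faq_matrix)[0]
    return faqs[scores.argmax()]

print(best_match("how do I renew my library books?"))  # -> the renewal answer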
Our students are working with the Library to develop more effective and efficient information services for its users. Their focus is on research into the speech and language tools necessary to yield an improved spoken language interface to the service.
Approach
The cohort is split into two teams to add a little competitive excitement.
Cohort three (2021 intake): Enhanced access to oral history repositories
“Oral History” is the collection and study of recordings in which the spoken memories of people are captured as a record for future generations. Oral History thus focuses on personal and historical experiences, and its production and study involve both public and academic input. Much of the recorded information cannot be found in written sources.
Oral history is an important way to capture the true spirit present at historical events. Stories told by different participants in the same situation give a rich and multifaceted record of complex events that can reveal great insight and allow us to relive the time described.
Our students are working with the UK charity Legasee, which is creating a large repository of video and audio recordings of armed forces veterans with the aim "that future generations can learn about our military history through the personal recollections of the men and women who witnessed it first hand".
The purpose of this project is to apply Speech and Language Technologies to make oral history collections more accessible through automatic processing. There are many challenges when dealing with oral history recordings (especially within a military context): accents, dialects, conversational speech, emotion, specialised terminology, and names (codenames, geographical locations, battles, ships, vehicles, comrades, etc.).
The students will produce a set of tools that automatically process audio-visual archive recordings to produce an enriched version that archive producers and users can then work with efficiently to enhance search. The students have access to a large portion of Legasee’s audio-visual recordings. Some are annotated with transcripts, metadata, and summaries; others are untranscribed.
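As a flavour of what such a tool might look like, the sketch below turns a recording into time-stamped transcript segments that a search index could ingest. The open-source openai-whisper package is an assumed off-the-shelf choice for illustration, not necessarily what the teams used, and the file name is invented.

# Illustrative enrichment step: a time-stamped transcript for search indexing.
# openai-whisper is an assumed choice; the file name below is invented.
import whisper

model = whisper.load_model("base")                  # general-purpose ASR model
result = model.transcribe("veteran_interview.mp3")  # invented file name

# Each segment carries start/end times, useful for jumping into the recording.
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")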
Approach
The cohort is split into two teams to add a little competitive excitement.
Check out the news postings (Jan 2022, June 2022) about this project.
Following completion of the mini-project (April 2022), Martin Bisiker (Founder & Trustee of the Legasee Educational Trust) said: "I convey my sincere thanks for the effort that you've taken to enhance Legasee's search potential. There's no doubt that the charity now has an opportunity to take a big step forward in respect of what we offer."
Cohort two (2020 intake): A reader’s companion
The purpose of this project is to create a ‘reader’s companion’.
Imagine you are reading a bulky novel, or a biography, or indeed anything that has a complicated story to tell. It would be really useful to be able to ask questions like ‘Remind me who Ivan is’, ‘How is Ivan related to Petrula?’ or ‘Was it Ivan who met Molotov in Moscow?’.
The aim of this project is to design, build and evaluate software capable of answering such questions: a reader’s companion.
The reader’s questions will be input by voice, because you don’t want to look away from the book, and you certainly don’t want to type anything. The companion will speak its answers.
The client has specified that there should be dialogues as opposed to isolated questions. For example:
Reader: Was Ivan previously married to Katerina?
Companion: Yes, would you like to know more?
Reader: Yes, where did they live?
Companion: In Stalingrad, and before that in Kharkov.
Reader: What happened to her?
…
The teams have been asked to consider additional functionality such as:
providing a ‘new readers start here’ summary at the start of a session
linking to other books by the same author, which might share characters or settings
speaking passages from the book, perhaps in different voices
allowing the reader to make notes
Approach
Since cohort two was larger than the previous cohort, it was split into two teams, which added a little competitive excitement.
Cohort one (2019 intake): A conversational persona for an AI pop musician
This project focused on the design and development of a spoken dialogue system to act as a persona for an Artificial Intelligence pop musician.
The project was commissioned by the frontman of a successful English rock band (shhh – we can’t tell you who), who worked closely with the student team throughout the project.
The system is capable of conversing with music industry journalists about its ‘life’ as an AI musician in a semi-natural manner. We didn’t want it to sound too natural, otherwise it wouldn’t sound like AI – think audio-only Max Headroom for the 21st century, with a grittier Northern attitude.
Approach
The team studied music interviews and general-purpose interviews by collecting and analysing a corpus of 1266 interviews.
From this analysis, they proposed to establish idiosyncrasy in the system by incorporating characteristic behaviours in four ways:
giving ‘gnomic responses’;
incorporating incoherent and undirected nonsense in a number of responses (waffling);
predicting the emotional state of the conversational agent for each response;
giving the system a Northern accent.
The students undertook a literature review of dialogue systems including dialogue modelling, question answering in a conversational manner, speech recognition, speech synthesis, and emotion prediction.
Their final system combined a variety of technologies, including Kaldi (Automatic Speech Recognition – ASR), LPCNet and Tacotron (speech synthesis), a finite state machine (question answering state), a TF-IDF system (retrieval-based QA), and Keras and PyTorch (waffle generation and emotion prediction).
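To give a flavour of the finite state machine component, here is a minimal sketch of dialogue-state tracking as a transition table. The states and triggers are invented for illustration and do not reproduce the team’s actual design.

# Minimal dialogue-state FSM sketch. States and triggers are invented;
# the team's actual state machine is not reproduced here.
TRANSITIONS = {
    ("greeting", "question_detected"): "answering",
    ("answering", "follow_up"): "answering",
    ("answering", "no_match"): "waffling",      # hand off to waffle generation
    ("waffling", "question_detected"): "answering",
    ("answering", "goodbye"): "closing",
}

def step(state: str, trigger: str) -> str:
    """Move to the next dialogue state; stay put on an unknown trigger."""
    return TRANSITIONS.get((state, trigger), state)

state = "greeting"
for trigger in ("question_detected", "no_match", "question_detected", "goodbye"):
    state = step(state, trigger)
    print(f"{trigger} -> {state}")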
Outcome
The team’s evaluation showed that their chatbot was ranked the same as Cleverbot (a well-known text-based conversational AI) in terms of humanness, hitting our goal of human-like performance (a significant achievement).
The implementation of the waffle machine is completely bespoke to their system and provides an extra element of humanness for systems that aim to emulate real speech.
They also created a bespoke dataset of Northern speech which could potentially be used by anyone looking for Northern-accented speech from a male speaker. It would certainly be large enough to train a Northern-accented speech synthesiser.