Highlights

Ursa: the world's most accurate speech-to-text

March 8th 2023 | Speechmatics Team

I authored the blog introducing Ursa, the world's most accurate speech-to-text system. Ursa delivers unprecedented performance across a diverse range of voices. We observed relative accuracy gains of 22% and 25% versus Microsoft and OpenAI's Whisper, respectively. As Accuracy Team Lead, I led many of the technical projects involved in delivering this system: scaling our language model while keeping GPU throughput acceptable; increasing the amount of supervised English training data by 3x; and running the competitor evaluations.
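As a rough illustration of what a relative accuracy gain means, it is computed from the competitor's word error rate (WER) and Ursa's WER. The numbers below are hypothetical placeholders, not the benchmark figures from the blog:

```python
# Relative accuracy gain from two word error rates (WERs).
# Illustrative values only; see the blog for the real benchmark numbers.
competitor_wer = 0.10   # hypothetical competitor WER (10%)
ursa_wer = 0.078        # hypothetical Ursa WER (7.8%)

relative_gain = (competitor_wer - ursa_wer) / competitor_wer
print(f"Relative accuracy gain: {relative_gain:.0%}")  # -> 22%
```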

Read the blog | Twitter Thread

Image prompt: "Ursa"

RNNT and Subtracting Internal LM Scores

November 10th 2022 | Accuracy Team

Integrating external language models into automatic speech recognition systems can improve accuracy by drawing on text corpora far larger than the transcribed speech available for training. End-to-end models such as the recurrent neural network transducer (RNNT) already contain an internal language model, but it is still common to combine external language models with these systems via shallow fusion. Subtracting the internal language model scores is necessary for optimal accuracy, and this was a project I supervised while at Speechmatics.
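A minimal sketch of the scoring idea, assuming log-domain scores from the RNNT, an external language model, and an estimate of the internal language model. The function name and weights are illustrative, not the Speechmatics implementation:

```python
def fused_score(log_p_rnnt: float,
                log_p_ext_lm: float,
                log_p_ilm: float,
                ext_weight: float = 0.5,
                ilm_weight: float = 0.3) -> float:
    """Shallow fusion with internal LM subtraction.

    The RNNT score already contains an implicit internal LM, so its
    estimated contribution is subtracted before adding the external LM,
    avoiding double-counting the linguistic prior.
    """
    return log_p_rnnt + ext_weight * log_p_ext_lm - ilm_weight * log_p_ilm
```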

Read the blog

Image prompt: "Recurrent Neural Network Transducer and subtracting the internal language model, digital art"

Modelling Pipelines - Project "Aladdin"

July 22nd 2022 | Accuracy Team

I have played a large role in leading improvements to our data preparation and training pipelines for acoustic and language modelling. We can now build languages end-to-end, with automated testing, in days rather than weeks. Recently, this has enabled data engineers in my team to iterate quickly and deliver improvements such as an 18% relative reduction in WER for Canadian French. The project was also pivotal in allowing us to release 14 new languages in 2022.

Canadian French blog | 14 new languages blog

Image prompt: "Cyberpunk Aladdin riding a magic carpet, digital art"

Self Supervised Learning - Project "Hydra"

October 26th 2021 | AML Team

I am proud to have been part of the team at Speechmatics that delivered this groundbreaking technology, which improves speech recognition accuracy not only for African American voices but also across accents, dialects, ages, and other sociodemographic characteristics. On the datasets used in Stanford's "Racial Disparities in Speech Recognition" study, we recorded an overall accuracy of 82.8% for African American voices, compared to 68.6% for both Amazon and Google. We used self-supervised learning to provide rich representations to our acoustic models, which led to a 45% reduction in WER, the equivalent of three words in an average sentence. These representations can also be used for many other downstream tasks or "heads", hence the project name Hydra.
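A minimal sketch of the "many heads on one body" idea, assuming a pretrained self-supervised encoder whose shared representations feed several task-specific heads. Module names, sizes, and the stand-in encoder are illustrative, not the production architecture:

```python
import torch
import torch.nn as nn

class Hydra(nn.Module):
    """One shared self-supervised encoder, multiple downstream task heads."""
    def __init__(self, encoder: nn.Module, feat_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                              # pretrained SSL model
        self.asr_head = nn.Linear(feat_dim, vocab_size)     # speech recognition
        self.lang_id_head = nn.Linear(feat_dim, 16)         # e.g. language ID

    def forward(self, audio_features: torch.Tensor):
        reps = self.encoder(audio_features)                 # rich shared representations
        return self.asr_head(reps), self.lang_id_head(reps.mean(dim=1))

# Toy usage with a stand-in encoder; a real system would load a pretrained one.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
model = Hydra(encoder, feat_dim=256, vocab_size=1000)
asr_logits, lang_logits = model(torch.randn(2, 100, 80))
```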

Read the blog | Whitepaper

Image prompt: "A hydra with 3 heads, digital art"

Talks

Delving into Speech-to-text and where to go beyond WER

October 11th 2022 | Voice 22 in Arlington, Virginia

I discussed the limitations of Word Error Rate (WER) and alternative approaches for evaluating speech-to-text in a talk at Voice 22. The talk also covered topics such as the NER model, large language models (LLMs), few-shot learning, and chain of thought reasoning. Read more about the project "Meaning Error Rate" on the home page.
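For context, here is a minimal sketch of how WER itself is computed (the generic word-level edit distance, not Speechmatics-specific code). It treats every substitution, insertion, and deletion as equally costly regardless of how much it changes the meaning, which is the core limitation the talk explored:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution either way, even though only one of them flips the meaning.
print(wer("do not stop", "do now stop"))  # ~0.33
```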

Interviews

Fireside Chat with Ian Utile

October 11th 2022 | Voice 22 in Arlington, Virginia

Topics: Speechmatics, Self-supervised learning

On the Street Interview with Joti Balani

October 11th 2022 | Voice 22 in Arlington, Virginia

Topics: Speechmatics, inclusivity, language coverage

The AI Journal with Tom Allen

July 28th 2022 | LinkedIn Live

I had a great chat with Tom Allen from The AI Journal about some of the key projects and innovations at Speechmatics over the last six months, including self-supervised learning to reduce bias in ASR, the "Aladdin" project, language model adaptation (LMA), and inverse text normalisation (ITN).

Watch the recording