
How big is the gap between academic and commercial speech recognition systems?


Speech recognition has been making a lot of noise in the last few years, both academically and commercially.

Let’s begin with commercial.

Many big companies have begun launching their own voice assistants, such as Apple Siri, Microsoft Cortana, Amazon Alexa and Google Assistant.

Speech is the most natural form of human communication, but its complexity makes it one of the hardest interfaces to build. Speech as an interface is made possible by the ever-increasing accuracy and speed of speech recognition systems. Speech interfaces are especially important in places with low literacy rates, such as developing countries, where speech may be the only practical way to interact with a device. Or, for example, in China, where typing in pinyin is not convenient or accessible for many people, especially the older generation.

There are now cloud-based speech recognition APIs from the likes of IBM, Microsoft, Google and us, which are making speech as an interface more accessible.

From an academic perspective, Microsoft has recently claimed human parity on a benchmark task, IBM has reported even better results, and text-to-speech systems have also been making considerable progress.

But what are the differences?


In academia, to allow for fair comparison, the datasets are fixed both for training and testing. However, collecting data is quite expensive, so the datasets in use are often quite old, and there is a selection bias towards models that happen to perform well on these specific datasets.

The Switchboard dataset used in Microsoft's and IBM's papers provides about 300 hours of training data, whereas other academic datasets go up to around 2,000 hours.

For commercial use, the only limitations are the cost of collecting the data and training the models. Baidu, for example, uses 40,000 hours. Major vendors typically train their commercial systems on thousands of hours of data, but the data is not disclosed because it is a valuable competitive asset. The test sets they use internally are also kept confidential, as they are tailored to their customers' use cases.

Companies do not publish the accuracy of their commercial systems on an open test set for these reasons, so the only way to discover which system is best for a particular application is to test them all.

Vocabulary size

In the Switchboard testing task, Microsoft used a vocabulary of 30k words and IBM used 85k. Today, large-vocabulary continuous speech recognition (LVCSR) systems commonly use vocabularies of a few hundred thousand words as standard.

However, vocabularies are often restricted for specific use cases. For an embedded speech recognition system, for example, Google used 64k words, while on a popular natural language processing dataset it used a vocabulary of about 800k.

New words appear every year, and commercial systems need to take them into account to keep delivering useful results.

Types of errors

Papers report word error rates (WERs). However, Microsoft's and IBM's papers show that most of the mistakes are made on short functional words such as “and”, “is” and “the”, and these errors are given the same weighting as errors on keywords.
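For concreteness, here is a minimal sketch of how WER is typically computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions and deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# A single dropped "the" is penalised exactly like a dropped keyword:
print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```

Note that every word carries the same weight in this metric, which is exactly the issue described above.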

In a commercial setting, verbatim transcripts are not always necessary. Voice search and voice assistants, content extraction and indexing applications, for example, often ignore short functional words when dealing with a transcript.
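One way to reflect this in scoring is to drop short functional words from both reference and hypothesis before comparing them. This is a sketch with a made-up stop list, not a standard metric:

```python
# Hypothetical stop list of short functional words to ignore when scoring.
STOP_WORDS = {"a", "an", "and", "is", "on", "the", "of", "to", "in"}

def content_words(text: str) -> list[str]:
    """Keep only the words that carry content for search or indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance, computed with a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

ref = "the cat is on the mat"
hyp = "cat on mat"
# Verbatim scoring counts the three dropped function words as three errors;
# after filtering, the two transcripts agree exactly.
print(edit_distance(ref.split(), hyp.split()))                    # 3
print(edit_distance(content_words(ref), content_words(hyp)))      # 0
```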

Languages

Although there are open datasets in many languages, academics tend to focus on English, and open datasets in other languages contain far less data than the English ones. Other languages also raise issues not found in English, such as tone, agglutination and non-Latin scripts.

A provider of speech recognition systems needs to offer the most widely spoken languages in the world. The top 10 languages only cover about 46% of the world's population, and even the top 100 languages still only cover about 85%.

Speed

Academics rarely report the real-time factor (RTF) – the time taken to transcribe the audio divided by the duration of the audio – of their models. The best systems proposed in academic papers are combinations of multiple large models, which makes them unusable commercially because the compute cost is too high.
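The RTF itself is a one-line calculation. A minimal sketch, where `transcribe` stands in for any hypothetical recognition function:

```python
import time

def real_time_factor(transcribe, audio, audio_duration_s: float) -> float:
    """RTF = processing time / audio duration.
    RTF < 1 means the system runs faster than real time."""
    start = time.perf_counter()
    transcribe(audio)
    return (time.perf_counter() - start) / audio_duration_s

# A 10-second clip transcribed in 2 seconds would give RTF = 0.2.
```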

Users often expect a fast turnaround time, which has made real-time systems increasingly important and desirable, especially when embedded on a device.

Noise

Popular academic datasets tend to be too clean, whereas audio in real-life applications is often very noisy. On noisier data, WERs can reach about 40%. It would be good if more real-world data were available to academics, to make their results more commercially viable and applicable.

Additional functionalities

Real-world systems need more than good ASR accuracy. They also need diarisation (who spoke when), punctuation and capitalisation – things a transcriptionist would provide as a matter of course – which make transcripts much easier and nicer to use commercially.

They may also need audio segmentation – knowing what kind of noise or music occurred when.

So, what does this mean for the future of speech recognition systems?

There is a gap between academic and commercial systems, but academic research is important to show where improvement can be made in speech recognition applications. The question then becomes: how can that research be made more efficient?

The academic method relies on dividing the problem into parts that can be improved in isolation. This has resulted in good progress being made on the somewhat artificial tasks we have set ourselves. In contrast, commercial speech recognition needs to take a more holistic approach: it's the performance of the overall system that matters. Companies are continuing to build both types of ASR technology, helping to close the gap between academic and commercial systems.

Rémi Francis, Speechmatics

Unlocking the power of machine learning

Machine learning is the cornerstone of many technologies we use every day; without it we would be interacting with computers as we did 20 years ago – using them to compute things. With the advent of more powerful processors, we can harness this computing power and enable the machines to start to learn for themselves.

To be clear – they do still need input, they still follow patterns and still need to be programmed – they are not sentient machines. They are machines which can find more information than just 2+2=4. What machine learning is very useful for is extracting meaning from large datasets and spotting patterns. We use it here at Speechmatics to train our ASR to learn a new language on significantly less data than would have been possible even 15 years ago.

We are now in a world which is finding more and more uses for machine learning (eventually the machines will find uses for it themselves, the ‘singularity’, but we aren’t there yet!). Your shopping suggestions, banking security and tagging friends on Facebook are all early uses for it, but the sky is the limit. There is no reason why the Jetsons’ flying cars couldn’t eventually be powered by machine learning algorithms, or why I, Robot-style cars couldn’t be controlled by a central hub. Machine learning could also help out humans: assisting air traffic controllers by directing planes into a holding pattern, or helping teachers identify struggling pupils based on test results.

Coupling machine learning with neural networks (large networks of simple processing units that loosely simulate a brain) unlocks even more power. Whilst at Speechmatics we like to think we are changing the world, the reality is that research into deep neural networks and machine learning is starting to unravel the way some of the most vicious illnesses operate. Modelling the mechanisms of HIV and AIDS, as well as simulating flu transmission, can lead to a better understanding of how they work.

The Royal Society, in a recent article on the possibilities of machine learning, called for more research into the field ‘to ensure the UK make the most of opportunities’. The progress made so far is astounding, and with an exciting prospect ahead, we at Speechmatics are continually innovating and researching artificial intelligence.

What is most exciting is to think how things could look in the future. Today your mobile phone has more computing power than NASA used to put man on the moon. The phone you use to combine candies into rows and crush them has nearly a million times the processing power of the machine which landed on the moon. Just as it was hard for the scientists of the 60s to imagine what we could do with more computing power (Angry Birds probably wasn’t in their thoughts), it is just as hard for us to predict what we can do with machine learning backed by today’s computing power.

So, welcome to the future. A future where computers no longer just compute. A future where processors think.

Luke Berry, Speechmatics