Supporting the growth of emerging markets

India is considered one of the largest emerging markets in the world. It has expected growth of 7.2% for 2017 and 7.7% for 2018, as explained by Finance Minister Arun Jaitley, and the second-largest population in the world.

With 521,000,000 people speaking Hindi as their primary language (according to a 2016 report), we decided to build Hindi to support the growth of the Indian market. Not many companies have tackled Hindi for speech-to-text, because of the different punctuation used and the limited amount of data available.

Using our language and acoustic framework, we built Hindi in a matter of days, using minimal data sets. The build was possible thanks to the knowledge we gained from training 28 previous languages, our use of machine learning and our knowledge of automatic speech recognition technology.

You can try out our languages for free here.

Ian Firth, Speechmatics


Four cities and a lot of Sichuan  

An event with DIT is always an exciting prospect, and the recent mission to China was no exception: it truly delivered beyond expectation.

We visited Guangzhou, Shenzhen, Chengdu and Guiyang in the space of a week. It is amazing to see the pace of technology innovation in China and how they are pushing their 2020 innovation agenda.

The first day in Shenzhen was dedicated to meeting Huawei. It began with an interesting presentation from Huawei and their technology team, followed by a chat with the Huawei innovation team, where we discussed the Speechmatics real-time ASR capabilities.

We then headed to Chengdu – the home of Sichuan food and hotpots! We spent the morning meeting several interesting companies and the afternoon at the UK-CHINA BIG DATA COLLABORATION SEMINAR. The big data seminar was an opportunity to meet many more interesting Chinese companies. We then spent the evening with Mr & Mrs Li, who treated us to local cuisine and shows, which were astounding.

Guiyang was next on the agenda. It is a fascinating city and one of the core focus areas for Chinese innovation, and a large research park is being built there. We attended the Big Data Expo event, where we met Chinese press and iFlyTek, who showcased their real-time captioning system and education system and also discussed their translation device. We also met some of the biggest Chinese companies: Alibaba, Baidu and Tencent.

We experienced great warmth and humour throughout the visit. We are excited about the prospect of working with Chinese companies and can’t wait to visit again soon. DIT and CBBC were a great support, running a seamless event that encapsulated why the UK and China should be in collaboration.

We’re excited for London Technology Week 2017 next week, with more DIT-run events strengthening collaboration between the UK and the rest of the world.

Benedikt von Thüngen, Speechmatics

What do AI and machine learning actually mean?

I recently read an article on how language led to the Artificial Intelligence revolution and the evolution of machine learning, and it got me thinking. To start, it’s good to know and understand what we are talking about.

Wikipedia says ‘Artificial Intelligence (AI) is intelligence exhibited by machines. In computer science, the field of AI research defines itself as the study of “intelligent agents”: any device that perceives its environment and takes actions that maximize its chance of success at some goal’. This is a much harder goal to achieve than machine learning, which is ‘the subfield of computer science that, according to Arthur Samuel in 1959, gives “computers the ability to learn without being explicitly programmed”.’ There is much confusion about the perceived ‘buzz words’ of AI and machine learning: many companies say they use AI, whereas in practice they have only used machine learning, which is quite different and does not amount to an ‘intelligent agent’ in the AI sense.

Machine learning has transformed natural language processing (NLP); in fact, the whole area of computational linguistics is that of applying machine learning to NLP. This is a different problem to whether AI needs NLP. It’s perfectly possible to contemplate an AI system that we don’t communicate with in a natural language (it could be a formal language), but natural communication with an AI is going to need natural language.

So, what’s the story of machine learning applied to speech recognition?

The article quotes Rico Malvar, distinguished engineer and chief scientist for Microsoft Research: “speech recognition was one of our first areas of research. We have 25-plus years of experience. In the early 90s, it actually didn’t work”. I think this is potentially misleading about the history of speech recognition. In the early 90s, speech recognition did work for a variety of specific commercial applications, such as command and control or personal dictation (Dragon Dictate, for example).

However, in the 90s there was an interesting dynamic between computing power and dataset size. In the DARPA evaluations we showed that we could build useful large vocabulary speech systems for a variety of natural speech tasks using both the standard hidden Markov models and neural networks. Indeed, my team at the time pioneered the use of recurrent neural networks in speech recognition (which can be considered as the first deep neural networks). The DARPA funding resulted in extensive data collection so that we could build better speech recognition systems.

It was relatively straightforward to apply hidden Markov models to these large data sources (we just bought a lot more computers), but neural networks couldn’t be so easily scaled to more than one computer. As a result, all the good work in neural networks was put on hold until GPUs arrived, when we could train everything on one computer again. To some, such as Malvar, this was viewed as “The deep neural network guys come up and they invent the stuff. Then the speech guys come along and say, ‘what if I use that?’.” But in my opinion speech was the first big task for neural networks, with image and text coming along later (Wikipedia’s view of history).

However you view history, the use of deep neural networks combined with the progression of computing power has drastically improved speech recognition technologies, which are now easily consumable by the masses, with global reach across a multitude of applications and use cases.

Tony Robinson, Speechmatics

How big is the gap between academic and commercial speech recognition systems?


Speech recognition has been making a lot of noise in the last few years, both academically and commercially.

Let’s begin with commercial.

Many big companies have begun launching their own voice assistants, such as Apple Siri, Microsoft Cortana, Amazon Alexa and Google Assistant.

Speech is the most widely used form of human communication, but its complexity makes recognising it one of the hardest challenges to overcome. Speech as an interface is made possible by the ever-increasing accuracy and speed of speech recognition systems. Speech interfaces are especially important in places with low literacy rates, like third world countries where speech is the only form of communication, and in places like China, where typing in pinyin is not convenient or accessible for many people, especially the older generation.

Speech recognition systems are now available as cloud-based APIs from the likes of IBM, Microsoft, Google and us, which is making speech as an interface more accessible.

From an academic perspective, there have been recent claims of human parity from Microsoft and of even better systems from IBM. Text-to-speech systems have also made a lot of progress recently.

But what are the differences?


Training data

In academia, to allow for fair comparison, the datasets are fixed for both training and testing. However, collecting data is quite expensive, so the datasets used are often quite old. As a result, there is a selection bias towards models that happen to perform well on these specific datasets.

The Switchboard datasets used in Microsoft’s and IBM’s papers contain about 300 hours of training data, whereas other academic datasets go up to 2000 hours.

For commercial use, the only limitations are the cost of collecting the data and training the algorithm. For example, Baidu uses 40,000 hours. Major vendors typically use thousands of hours of data for their commercial systems, but the data is not disclosed because of its importance. The test sets they use internally are also kept confidential, as they are tailored to their customers’ use cases.

Companies do not publish the accuracy of their commercial system on an open test set for these reasons, so the only way to discover which system is the best on a particular application is to test them all.

Vocabulary size

In the Switchboard testing task, Microsoft used a vocabulary of 30k words and IBM used 85k. Today, large-vocabulary continuous speech recognition (LVCSR) systems often have a few hundred thousand words in the vocabulary as standard.

However, vocabularies are often restricted for specific use cases. For example, for an embedded speech recognition system, Google used 64k words, while on a popular natural language processing dataset, Google used a vocabulary of about 800k.

New words appear every year, and commercial systems need to take them into account to keep producing accurate results.
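One simple way to see the effect of vocabulary size is to measure the out-of-vocabulary (OOV) rate: the share of spoken words the recogniser cannot output because they are not in its vocabulary. The sketch below is only an illustration. The function name, the tiny vocabulary and the sample sentence are all made up for this example and do not come from any of the systems mentioned above.

```python
def oov_rate(transcript, vocabulary):
    """Return (fraction of words not in the vocabulary, sorted list of those words).

    `transcript` is a plain string and `vocabulary` is a set of lowercase words.
    Both are hypothetical inputs chosen for this illustration.
    """
    words = transcript.lower().split()
    if not words:
        raise ValueError("transcript has no words")
    missing = [w for w in words if w not in vocabulary]
    return len(missing) / len(words), sorted(set(missing))


# A toy vocabulary that predates the word "selfie".
toy_vocabulary = {"we", "took", "a", "photo", "at", "the", "show"}

rate, unknown_words = oov_rate("we took a selfie at the show", toy_vocabulary)
print(f"OOV rate: {rate:.1%}, unknown words: {unknown_words}")
```

Every out-of-vocabulary word is a guaranteed error, however good the acoustic model is, which is why commercial vocabularies have to be refreshed as new words appear.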

Types of errors

In papers, word error rates (WERs) are reported. However, we can see in Microsoft’s and IBM’s papers that most of the mistakes are made on short functional words such as “and”, “is” and “the”, and these errors are given the same weighting as errors on keywords.

In a commercial setting, verbatim transcripts are not always necessary. For example, voice search and voice assistants, content extraction and indexing applications often ignore the short functional words when dealing with a transcript.
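To make the point about functional words concrete, here is a minimal sketch of a word error rate calculation, along with a variant that drops a list of functional words before scoring. It is not the scoring tool used in any of the papers mentioned. The word list, the sentences and the function names are assumptions invented for this illustration.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, ignore=frozenset()):
    """Word error rate, optionally ignoring a set of words on both sides."""
    ref = [w for w in reference.lower().split() if w not in ignore]
    hyp = [w for w in hypothesis.lower().split() if w not in ignore]
    if not ref:
        raise ValueError("reference has no words to score")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical list of short functional words, for illustration only.
FUNCTIONAL_WORDS = frozenset({"and", "is", "the", "a", "of", "to"})

reference = "the weather in cambridge is cold and wet"
missed_small_words = "weather in cambridge cold wet"            # drops "the", "is", "and"
missed_keyword = "the weather in canterbury is cold and wet"    # gets the city wrong

for name, hyp in [("missed small words", missed_small_words),
                  ("missed keyword", missed_keyword)]:
    print(name,
          "WER:", wer(reference, hyp),
          "keyword-only WER:", wer(reference, hyp, ignore=FUNCTIONAL_WORDS))
```

On these toy sentences the first transcript has the worse standard WER (0.375 against 0.125) but a keyword-only error rate of zero, while the second gets the city wrong and scores 0.2 on keywords. For a search or indexing application, the "worse" transcript is the more useful one.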


Languages

Although there are open datasets in many languages, academics tend to focus on English. Open datasets in other languages also don’t have as much data as the English ones. Other languages raise common issues that are not found in English, such as tone, agglutination and different scripts (i.e. non-Latin scripts).

A provider of speech recognition systems needs to cover the most widely spoken languages in the world. The top 10 languages only cover about 46% of the world population, and the top 100 languages still only cover about 85%.


Speed

Academics rarely report the real-time factor (RTF) of their models, that is, the time taken to transcribe the audio divided by the duration of the audio. The best systems proposed in academic papers are a combination of multiple large models, which makes them unusable commercially because the compute cost is too high.
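The real-time factor is simple enough to write down directly. The snippet below is a small sketch of the definition given above. The function names and the example timings are invented for illustration and are not measurements of any real system.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time taken to transcribe the audio / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_live_audio(rtf):
    """A system can only follow a live stream if it is at least as fast as real time."""
    return rtf <= 1.0


# Hypothetical timings for a one-minute recording.
fast_system = real_time_factor(processing_seconds=30.0, audio_seconds=60.0)
slow_system = real_time_factor(processing_seconds=90.0, audio_seconds=60.0)

print(fast_system, keeps_up_with_live_audio(fast_system))
print(slow_system, keeps_up_with_live_audio(slow_system))
```

An RTF of 0.5 means a minute of audio is transcribed in 30 seconds. Anything above 1.0 falls further behind the longer the audio runs, which rules it out for live use.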

Users often expect a fast turnaround time, which has made real-time systems increasingly important and desirable, especially when embedded on a device.


Noise

Popular academic datasets tend to be too clean, whereas audio in real-life applications is often very noisy. On a noisier dataset, the WER is about 40%. It would be good if more real-world data was available to academics, to make the results more commercially viable and applicable.

Additional functionalities

Real-world systems need more than good ASR accuracy. They also need diarisation (who speaks when), punctuation and capitalisation, which are expected from a human transcriptionist and make transcripts much easier and nicer to use commercially.

Another is audio segmentation: knowing what kind of noise or music happened when.

So, what does this mean for the future of speech recognition systems?

There is a gap between academic and commercial systems. However, academic research is important to show that improvements can be made in speech recognition applications. The problem then becomes: how can it be made more efficient?

The academic method relies on dividing the problem into parts that can be improved in isolation. This has resulted in good progress being made on the somewhat artificial tasks we have set ourselves. In contrast, commercial speech recognition needs to take a more holistic approach: it’s the performance of the overall system that matters. Companies are continuing to build both types of ASR technology, helping to close the gap between academic and commercial systems.

Rémi Francis, Speechmatics

Making speech recognition real-time at NAB 2017

This year’s NAB show coincided with a big step forward for us at Speechmatics in terms of creating a productised real-time speech-to-text system. We had intended to simply attend the show, look for customers and observe the state of play in the industry, as NAB is the biggest broadcast and media conference in the world. However, after people glimpsed some snapshots of our real-time capability online, we received such interest that we decided to push for some live integrations at the show.

As such we were able to demonstrate the capabilities of our technology jointly with the likes of Broadcast Bionics, ISID and bitcentral for a variety of use cases.

Broadcast Bionics were already using our cloud-based speech-to-text transcription system to make radio production and audio searchable. At the show, they used the real-time system to demonstrate how to produce immediately viewable output of a live show, as you can see in the video below.

Bitcentral had likewise used our technology to demonstrate how to create searchable transcripts for large swathes of video and audio metadata in an interactive interface.

And finally, ISID had set up a connection to our online real-time cloud system and, despite the habitually flaky internet that is a staple of big conferences, the results were enthralling. Being able to watch yourself talk with subtitles appearing automatically in multiple languages is a surreal yet exciting experience. Considering the difficulty of live capture in such a noisy environment, we were delighted with how both the Broadcast Bionics and ISID real-time integrations worked. The video below shows my colleague Ian and me using the integration at the show, and just how well the demonstration worked.

The other thing that became clear is that speech is a hot topic at the moment, especially real-time ASR. On almost every corner we had another conversation about our ASR and another request to know when productised real-time would be available. Never before has a conference so closely resonated with the tech we are developing, and we are really excited about making an even bigger impact in this space going forward!

Ricardo Herreros-Symons, Speechmatics

Unlocking the power of machine learning

Machine learning is the cornerstone of many technologies we use every day; without it we would be interacting with computers as we did 20 years ago – using them to compute things. With the advent of more powerful processors, we can harness this computing power and enable the machines to start to learn for themselves.

To be clear – they do still need input, they still follow patterns and still need to be programmed – they are not sentient machines. They are machines which can find more information than just 2+2=4. What machine learning is very useful for is extracting meaning from large datasets and spotting patterns. We use it here at Speechmatics to enable us to train our ASR to learn a different language on significantly less data than would have been possible even 15 years ago.

We are now in a world which is starting to find more and more uses for machine learning (eventually the machines will find uses for it themselves, the ‘singularity’, but we aren’t there yet!). Your shopping suggestions, banking security and tagging friends on Facebook are all initial uses for it, but the sky is the limit. There is no reason why eventually the Jetsons’ flying cars couldn’t be powered by machine learning algorithms, or why I, Robot-style cars couldn’t be controlled by a central hub. Machine learning could also be used to help out humans: to assist air traffic controllers by directing planes to a holding pattern, or to help teachers identify struggling pupils based on test results.

Machine learning coupled with neural networks (large networks of processors which begin to simulate a brain) can unlock even more power. Whilst at Speechmatics we like to think we are changing the world, the reality is that research into deep neural networks and machine learning is starting to unravel the way some of the most vicious illnesses operate. Studying the mechanisms of HIV and AIDS, as well as simulating flu transmission, can lead to a better understanding of how these illnesses operate.

In a recent article on the possibilities of machine learning, The Royal Society called for more research into machine learning ‘to ensure the UK make the most of opportunities’. The progress we have made so far is astounding and, with an exciting prospect ahead, we at Speechmatics are continually innovating and researching artificial intelligence.

What is most exciting is to think how things could end up looking in the future. Today your mobile phone has more computing power than NASA used to put man on the moon. The phone which you use to combine candies into rows and crush them has nearly 1 million times the processing power of the machine which landed on the moon. So just as it was hard for the scientists of the 60s to consider what we could do with more computing power (Angry Birds probably wasn’t in their thoughts), it is just as impossible for us to determine what we can do with machine learning backed up by current computing power.

So, welcome to the future. A future where computers no longer just compute. A future where processors think.

Luke Berry, Speechmatics

Keeping it real-time at SXSW


We have attended our fair share of conferences across the world and in numerous different markets. However, it is safe to say that South by Southwest (SXSW, or simply South By) was one of the most eclectic and most enjoyable. Upon arriving we were informed that the mantra for the city is ‘keeping Austin weird’, and that sentiment is embodied in the event.

The city is transformed for this highly interactive conference: houses are converted into tech expo events and open houses, cinemas pop up out of nowhere and live music can be heard around every corner.

The conference is split and partially staggered between music, film and technology, with our focus being the last of these.

In 2016 we won one of 50 finalist places, from a field of 500 companies, to pitch at SXSW. We were competing with four other finalists in the enterprise and data category. The pitch itself was no longer than 2 minutes, followed by a 7-minute Q&A with the judges.

Our pitch went well: we were happy with how the technology was received and with the crowd of interested attendees who surrounded us at the end of the pitch to find out more.

We focussed the pitch on our new real-time capabilities, our continuous development and language creation tools. We even transcribed our pitch live using our offline, real-time capability, on a phone. It was a dangerous environment to demo in due to the acoustic levels – but one that certainly resonated with the judges.

Unfortunately, we didn’t win the overall competition, losing out to Deep 6 AI who did clinical trials for cancer patients using AI technology – a worthy winner with great technology and a compelling pitch.

Nonetheless, we still thoroughly enjoyed taking part and you can watch our pitch below.

Along with the pitch competition SXSW also organised a demo day for all the competing finalists. This is something that we are accustomed to: standing in a busy room, armed with a laptop, a video and demo and telling people about our technology.

For us this was the highlight of the event. Over the 3-hour session there was not a moment when we didn’t have a queue of people wanting to find out more or try out the real-time offline system on our phones. The Japanese, English and Spanish tester demos proved to be especially popular! Footfall and exposure are traditionally difficult at these kinds of events, but SXSW did a fantastic job of ensuring there was a constant stream of relevant people visiting the exhibits.

The combination of the three conferences meant that there was always something to do at SXSW, with a range of exciting attendees and an electric atmosphere: from the likes of tech influencer Robert Scoble to Joe Biden, and from Mick Fleetwood to Buzz Aldrin. SXSW gave us the opportunity to meet, watch and engage with a vast range of industry leaders, experts, celebrities and, ultimately, potential customers.

Ricardo Herreros-Symons, Speechmatics