Not everyone is able to write funky fused operators to make ML models run faster on GPUs using clever quantisation tricks. However, lots of developers work with algorithms that feel like they should be able to leverage the thousands of cores in a GPU to run faster than they do on the dozens of cores in a server CPU. To see what is possible and what is involved, I revisited the first problem I ever considered trying to accelerate with a GPU. What is unusual about my chosen problem is that it is officially pointless, so you ought not to be able to find any library that will accelerate this algorithm, because it isn’t worth writing one! That makes it an interesting proxy for algorithms which aren’t catered for by high-performance libraries written by experts, but can be structured to run thousands of threads in parallel.
Reduce Model Tuning Costs with MuP
As machine learning engineers increasingly adopt the Bitter Lesson and models grow in size, the cost of training them is also rising. A significant portion of the overall compute budget is frequently spent on hyperparameter tuning before launching a final training run. MuP makes it possible to transfer hyperparameters from a much smaller 'toy' model, leading to a substantial reduction in overall training cost.
Building a Radio Translation Streaming Service in Python
At Speechmatics, we wanted to present our real-time translation product in a straightforward yet impactful manner that demonstrates its capabilities. You can experience this firsthand on our website. Beyond showcasing real-time transcription and translation, our live demo addresses diverse user needs. For those who may have hearing impairments or find themselves in environments where audio isn't a viable option, our streaming server provides a text-based alternative, ensuring that no one is left out. Moreover, our service bridges language barriers, making it indispensable in situations where immediate translation is crucial.
Improving Speaker Diarization with Self-supervised Learning
Speaker diarization often complements automatic speech recognition (ASR) by determining "Who spoke when?". One intriguing advancement in the field is the adoption of Self-Supervised Learning (SSL). By harnessing vast amounts of unlabelled audio data, SSL manages to improve multiple downstream tasks, including ASR and diarization, using the same pre-trained model. As we explore in this blog, the synergy between SSL and traditional methods not only boosts ASR accuracy but also improves speaker diarization results.
How to Access Microphones Through the Browser API
Here at Speechmatics, audio is the lifeblood of everything we do, from training our models right through to crafting effective demos of our technology. One of the best examples of this is our Portal translation demo, which allows the user to see their speech translated into a number of languages in real time. However, accessing media devices through the browser isn't straightforward. Browsers require the user to explicitly permit access to the media device, and to make things even more complicated, each browser engine has its own quirks that have to be handled. In this article, I'll walk through how we were able to provide a consistent and straightforward microphone access experience for our demos across all the major browsers and devices.
How to Deploy HuggingFace Translation Models on GPU Servers
Ever since the release of the HuggingFace🤗 Transformers library, it has been incredibly simple to train, finetune and run state-of-the-art Transformer-based translation models. This has also accelerated the development of our recently launched Translation feature. However, deploying these models in a production setting on GPU servers is still not straightforward, so I want to share how we at Speechmatics deployed a performant real-time translation service for more than 30 languages, and open-sourced part of our solution in the process.
Fast and Accurate GPU Quantization for Transformers
As Transformer models increase in size, the computational cost of running inference also grows. Many organisations now face the challenge of deploying state-of-the-art models in a cost-effective way.
How to Accurately Time CUDA Kernels in PyTorch
If we know anything about machine learning in 2023, it is this: bigger is better. Give your model more data, parameters, and compute, and success is (somewhat) guaranteed (Hoffmann et al., 2022).