3 posts tagged with "CUDA"

Sparse All-Reduce in PyTorch

16 March 2024 · 31 min read

Machine Learning Engineer

The All-Reduce collective is ubiquitous in distributed training, but it is currently not supported for sparse CUDA tensors in PyTorch. In the first part of this post we contrast the existing alternatives available in the Gloo/NCCL backends. In the second part we implement our own efficient sparse All-Reduce collective using PyTorch and CUDA.

An Almost Pointless Exercise in GPU Optimization

7 November 2023 · 21 min read

Andrew Innes

Chief Architect

Not everyone is able to write funky fused operators to make ML models run faster on GPUs using clever quantisation tricks. However, lots of developers work with algorithms that feel like they should be able to leverage the thousands of cores in a GPU to run faster than they do on the dozens of cores in a server CPU. To see what is possible and what is involved, I revisited the first problem I ever considered trying to accelerate with a GPU. What is unusual about my chosen problem is that it is officially pointless, so you ought not to be able to find any library that will accelerate this algorithm, because it isn’t worth writing one! That makes it an interesting proxy for algorithms which aren’t catered for by high-performance libraries written by experts, but can be structured to run thousands of threads in parallel.

How to Accurately Time CUDA Kernels in PyTorch

28 March 2023 · 8 min read

Lawrence Atkins

Machine Learning Engineer

David MacLeod

Machine Learning Engineer

If we know anything of machine learning in 2023, it is this: bigger is better. Give your model more data, parameters, and compute and success is (somewhat) guaranteed (Hoffmann et al., 2022).