Sparse All-Reduce in PyTorch

The All-Reduce collective is ubiquitous in distributed training, but it is currently not supported for sparse CUDA tensors in PyTorch. In the first part of this post we contrast the existing alternatives available in the Gloo and NCCL backends. In the second part we implement our own efficient sparse All-Reduce collective using PyTorch and CUDA.
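To make the gap concrete, here is a minimal sketch of the call that does not work today: running `torch.distributed.all_reduce` on a sparse COO CUDA tensor. The setup is hypothetical (a process group launched with `torchrun` using the NCCL backend, one GPU per rank), and the exact error message varies across PyTorch versions, but the collective rejects the sparse layout.

```python
import torch
import torch.distributed as dist

# Hypothetical setup: assumes the script is launched with torchrun so that
# MASTER_ADDR/RANK/WORLD_SIZE are set, with one GPU per rank on a single node.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# A small sparse gradient in COO format, of the kind produced e.g. by
# nn.Embedding(..., sparse=True).
indices = torch.tensor([[0, 3]])
values = torch.tensor([1.0, 2.0])
grad = torch.sparse_coo_tensor(indices, values, size=(10,), device="cuda")

# Dense CUDA tensors reduce fine with NCCL, but this call fails for the
# sparse layout (the exact error text depends on the PyTorch version).
dist.all_reduce(grad)
```

The rest of the post looks at how close the existing Gloo and NCCL paths can get to this ideal, and then builds a sparse All-Reduce that supports it directly.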