
One post tagged with "distributed-training"


13 min read
Timofey Abramski
Eloy de Jong

Self-distillation training involves a student model learning from a teacher model that is maintained as an exponential moving average (EMA) of the student's weights. When scaling this approach across multiple GPUs, the challenge lies in efficiently distributing both networks while respecting their different update mechanisms—the student trains via backpropagation, while the teacher updates through EMA. We examine three distributed training strategies: (1) replicating both models with DDP, which is simple but memory-intensive; (2) sharding only the student with FSDP; and (3) identically sharding both student and teacher with FSDP, making the teacher EMA update purely local with no communication overhead. The key insight is that effective distributed training must align with the algorithm's structure. In this case, identical sharding naturally respects the EMA dependency between networks.
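
To make the third strategy concrete, here is a minimal sketch of the EMA teacher update under identical sharding. It assumes both networks are wrapped with identically configured PyTorch FSDP, so that `teacher.parameters()` and `student.parameters()` yield matching local shards on each rank; the `ema_update` helper and the momentum value are illustrative, not taken from the post.

```python
import torch


@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    """Teacher EMA step: teacher = momentum * teacher + (1 - momentum) * student.

    With student and teacher sharded identically (same FSDP wrapping policy),
    the zipped parameters on each rank are corresponding local shards, so this
    loop updates only rank-local memory and requires no collective communication.
    """
    for p_teacher, p_student in zip(teacher.parameters(), student.parameters()):
        # lerp_(end, w) computes (1 - w) * self + w * end in place.
        p_teacher.lerp_(p_student, 1.0 - momentum)
```

Called once per optimizer step after the student's backward pass and update, this keeps the teacher purely local, in contrast to the DDP or student-only-FSDP variants, where the teacher's weights would need to be replicated or re-gathered across ranks.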