
slow training speed with multi-node torchrun #78

Open
ThisisBillhe opened this issue Dec 29, 2024 · 0 comments
Hi,

I am trying to train an XL model on multiple nodes (2×8 A100 GPUs) without changing the global batch size. However, the training speed (steps per second) is slower with 2 nodes than with a single node (8 GPUs): I get 2.35 steps/sec on 2 nodes versus 2.94 steps/sec on a single node. GPU utilization is close to 100% in both cases, but the GPU power draw is lower when training on two nodes. Any help would be appreciated.
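
A symptom like this (full GPU utilization but lower power and lower throughput) often points at inter-node communication rather than compute. Below is a minimal all-reduce bandwidth sketch you could launch with the same `torchrun` arguments as training to check whether the cross-node NCCL link is the bottleneck. The rendezvous endpoint, tensor size, and script name are placeholders, not values from this issue.

```python
# allreduce_bench.py -- a sketch for measuring inter-node all-reduce bandwidth.
# Example launch (adjust to your own rendezvous settings):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<MASTER_ADDR>:29500 \
#            allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 256 MiB of fp32 as a stand-in for a gradient bucket of an XL-scale model.
x = torch.randn(64 * 1024 * 1024, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.time() - start

if dist.get_rank() == 0:
    gb = x.numel() * x.element_size() * iters / 1e9
    print(f"all_reduce: {elapsed / iters * 1000:.1f} ms/iter, "
          f"~{gb / elapsed:.1f} GB/s effective")

dist.destroy_process_group()
```

If the measured bandwidth across 2 nodes is far below the intra-node number (run the same script with `--nnodes=1`), the slowdown is likely the network fabric (e.g. Ethernet instead of InfiniBand, or NCCL not using the fast interface) rather than the training code itself.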
