I am trying to train an XL model on multiple nodes (2 × 8 A100 GPUs) without changing the global batch size. However, the training speed (steps per second) is lower with 2 nodes than with a single node (8 GPUs). With 2 nodes I get "Train 2.35 Steps/Sec", while with a single node I get 2.94 Steps/Sec. GPU utilization is close to 100% in both cases, but the GPU power draw is lower when training with two nodes. Any help would be appreciated.