
slow training speed with multi-node torchrun #78

Open
ThisisBillhe opened this issue Dec 29, 2024 · 0 comments
Hi,

I am trying to train an XL model on multiple nodes (2×8 A100 GPUs) without changing the global batch size. However, the training speed (steps per second) is slower with 2 nodes than with a single node (8 GPUs): I get 2.35 steps/sec on 2 nodes versus 2.94 steps/sec on a single node. GPU utilization is close to 100% in both cases, but the GPU power draw is lower when training on two nodes. Any help would be appreciated.
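
A symptom like this (full GPU utilization but lower power and lower throughput) often points at inter-node communication rather than compute. Below is a minimal all-reduce bandwidth sketch you could launch with the same `torchrun` arguments as training to check whether the cross-node NCCL link is the bottleneck. The rendezvous endpoint, tensor size, and script name are placeholders, not values from this issue.

```python
# allreduce_bench.py -- a sketch for measuring inter-node all-reduce bandwidth.
# Example launch (adjust to your own rendezvous settings):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<MASTER_ADDR>:29500 \
#            allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 256 MiB of fp32 as a stand-in for a gradient bucket of an XL-scale model.
x = torch.randn(64 * 1024 * 1024, device="cuda")

# Warm up NCCL before timing.
for _ in range(5):
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.time() - start

if dist.get_rank() == 0:
    gb = x.numel() * x.element_size() * iters / 1e9
    print(f"all_reduce: {elapsed / iters * 1000:.1f} ms/iter, "
          f"~{gb / elapsed:.1f} GB/s effective")

dist.destroy_process_group()
```

If the measured bandwidth across 2 nodes is far below the intra-node number (run the same script with `--nnodes=1`), the slowdown is likely the network fabric (e.g. Ethernet instead of InfiniBand, or NCCL not using the fast interface) rather than the training code itself.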
