From 10232bba8245cd1e8d4030abe3b6e56f725158b9 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 15:49:03 +0800
Subject: [PATCH 01/10] seq parallel doc

---
 docs/source/zh-Hans/features/sequence_parallelism.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index 534035cb5abf..c1842aaf1695 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -150,6 +150,13 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 
 
 ### 结论
-在上述序列并行方法中，ring attention对head number没有要求，可训练超长文本，但是由于细分了计算，计算性能会有所下降。TP+SP， DeepSpeed-Ulysses对于head number有要求，需要可被sp group size 整除。这些序列并行都可与其他高性能注意力兼容，如flash attention。sp可与Gemini一起使用训练超大规模模型，也可以与TP，PP，DP等组成4D并行。
+在上述序列并行方法中，ring attn和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
+通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量，复杂度为3*O(Nxd),而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
+内存占用：二者类似。
+模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:head number // (tp group size * sp group size)，而ring attn没有此限制。
+由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
+
+总的来说，对于初学者、中小型企业客户，我们更推荐您使用all_to_all，经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时他的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
+
 
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->

From 3c16520664990904bb2fe7d1df09e85ca726aeea Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:05:42 +0800
Subject: [PATCH 02/10] seq parallel doc

---
 docs/source/zh-Hans/features/sequence_parallelism.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index c1842aaf1695..701d15a79bfd 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -151,9 +151,9 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 
 ### 结论
 在上述序列并行方法中，ring attn和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
-通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量，复杂度为3*O(Nxd),而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
+通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量,而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
 内存占用：二者类似。
-模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:head number // (tp group size * sp group size)，而ring attn没有此限制。
+模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:`head number // (tp group size * sp group size)`，而ring attn没有此限制。
 由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
 
 总的来说，对于初学者、中小型企业客户，我们更推荐您使用all_to_all，经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时他的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。

From 2f56b5ae4a446014d5ded7d03d37698f9f920ac5 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:08:08 +0800
Subject: [PATCH 03/10] seq parallel doc

---
 docs/source/zh-Hans/features/sequence_parallelism.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index 701d15a79bfd..c8ae6b523396 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -151,9 +151,13 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 
 ### 结论
 在上述序列并行方法中，ring attn和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
+
 通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量,而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
+
 内存占用：二者类似。
+
 模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:`head number // (tp group size * sp group size)`，而ring attn没有此限制。
+
 由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
 
 总的来说，对于初学者、中小型企业客户，我们更推荐您使用all_to_all，经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时他的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。

From eb93cf1889c33e91d38fdf7b3f4a771646419e84 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:10:24 +0800
Subject: [PATCH 04/10] seq parallel doc

---
 docs/source/zh-Hans/features/sequence_parallelism.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index c8ae6b523396..bf3f7b10ad09 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -152,11 +152,11 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 ### 结论
 在上述序列并行方法中，ring attn和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
 
-通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量,而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
+    通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量,而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
 
-内存占用：二者类似。
+    内存占用：二者类似。
 
-模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:`head number // (tp group size * sp group size)`，而ring attn没有此限制。
+    模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:`head number // (tp group size * sp group size)`，而ring attn没有此限制。
 
 由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
 

From d7867e2cbbdc28885dc1f9f70095962c04bb44fb Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:14:58 +0800
Subject: [PATCH 05/10] seq parallel doc

---
 docs/source/zh-Hans/features/sequence_parallelism.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index bf3f7b10ad09..accd4d067425 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -101,6 +101,8 @@ plugin = HybridParallelPlugin(
         )
 ```
 
+启动命令参数举例：```--tp 2 --sp 8 --sp_mode split_gather```
+
 #### 使用DeepSpeed-Ulysses
 定义plugin， 在DeepSpeed-Ulysses的序列并行种，tp group与sp group 是正交的，
 ```python
@@ -112,6 +114,7 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="all_to_all",
         )
 ```
+启动命令参数举例：```--tp 2 --sp 8 --sp_mode all_to_all```
 
 #### 使用ring attention
 定义plugin， 在ring attention的序列并行种，tp group与sp group 是正交的，sp_size必须传入准确的并行大小。
@@ -124,6 +127,8 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="ring_attn",
         )
 ```
+启动命令参数举例：```--tp 2 --sp 8 --sp_mode ring_attn```
+
 #### 使用booster
 ```python
 booster = Booster(plugin=plugin)
@@ -160,7 +165,7 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 
 由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
 
-总的来说，对于初学者、中小型企业客户，我们更推荐您使用all_to_all，经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时他的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
+总的来说，我们更推荐您使用Ulysses，只需要在启动时指定```--sp_mode all_to_all```即可。经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时他的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
 
 
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->

From ee368e75b5022cc48be0993983f2016ad24ccc52 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:30:54 +0800
Subject: [PATCH 06/10] seq parallel doc

---
 docs/source/en/features/sequence_parallelism.md      | 12 ++++++++++--
 docs/source/zh-Hans/features/sequence_parallelism.md |  8 ++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/docs/source/en/features/sequence_parallelism.md b/docs/source/en/features/sequence_parallelism.md
index 70fd2eb10970..1a3f97f056eb 100644
--- a/docs/source/en/features/sequence_parallelism.md
+++ b/docs/source/en/features/sequence_parallelism.md
@@ -101,6 +101,7 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="split_gather",
         )
 ```
+Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode split_gather```
 
 #### Using DeepSpeed-Ulysses
 Define the plugin. In the DeepSpeed-Ulysses sequence parallelism, the tp group and sp group are orthogonal.
@@ -113,6 +114,7 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="all_to_all",
         )
 ```
+Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode all_to_all```
 
 #### Using Ring Attention
 Define the plugin. In ring attention sequence parallelism, the tp group and sp group are orthogonal, and sp_size must be set to the correct parallel size.
@@ -125,6 +127,8 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="ring_attn",
         )
 ```
+Example of startup command parameters: ```--tp 2 --sp 8 --sp_mode ring_attn```
+
 #### Using Booster
 ```python
 booster = Booster(plugin=plugin)
@@ -151,6 +155,10 @@ Currently, the `MoeHybridParallelPlugin` only supports DeepSpeed-Ulysses sequenc
 
 
 ### Conclusion
-Among the sequence parallelism methods mentioned, ring attention has no requirements for the number of attention heads and can train ultra-long sequences. However, due to the division of computation, its performance may decrease. TP+SP and DeepSpeed-Ulysses have requirements for the number of attention heads, which must be divisible by the sp group size. These sequence parallelism methods are all compatible with high-performance attention mechanisms like flash attention. Sequence parallelism can also be used with Gemini to train extremely large-scale models, and it can be combined with TP, PP, and DP to form 4D parallelism.
-
+Among the sequence parallelism methods mentioned, both ring attention and Ulysses have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
+    Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands more bandwidth from the hardware.
+    Memory usage: Both are similar in terms of memory consumption.
+    Model structure generalization: Ring attention is better than Ulysses in terms of generalization. Ulysses requires that the model config need to meet ```the head number // (tp group size * sp group size)``` condition, while ring attention has no such restrictions.
+Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
+Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it's easy to train sequences of up to 128k length, and the performance is the best among all sequence parallelism methods. However, if you're aiming for extreme performance optimization or training long texts on a larger scale of machines, you might want to consider using the ring attention.
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->
diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index accd4d067425..0eeca57a62f0 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -155,17 +155,17 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 
 
 ### 结论
-在上述序列并行方法中，ring attn和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
+在上述序列并行方法中，ring attention和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
 
-    通信方面：Ulysses通信量优于ring attn，Ulysess主要包含三次All2All通信量,而ring attn的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
+    通信方面：Ulysses通信量优于ring attention，Ulysess主要包含三次All2All通信量,而ring attention的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
 
     内存占用：二者类似。
 
-    模型结构泛化：ring attn优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足:`head number // (tp group size * sp group size)`，而ring attn没有此限制。
+    模型结构泛化：ring attention优于Ulysses。Ulysses模型泛化性一般，对于head number有要求，需要满足: `head number // (tp group size * sp group size)` ，而ring attention没有此限制。
 
 由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
 
-总的来说，我们更推荐您使用Ulysses，只需要在启动时指定```--sp_mode all_to_all```即可。经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时他的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
+总的来说，我们更推荐您使用Ulysses，只需要在启动时指定```--sp_mode all_to_all```即可。经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时它的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
 
 
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->

From 36aea5472d7f2e787fadbf58e2716cbefc355d89 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:36:53 +0800
Subject: [PATCH 07/10] seq parallel doc

---
 docs/source/en/features/sequence_parallelism.md      | 2 +-
 docs/source/zh-Hans/features/sequence_parallelism.md | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/source/en/features/sequence_parallelism.md b/docs/source/en/features/sequence_parallelism.md
index 1a3f97f056eb..1e01b9d8c2d2 100644
--- a/docs/source/en/features/sequence_parallelism.md
+++ b/docs/source/en/features/sequence_parallelism.md
@@ -160,5 +160,5 @@ Among the sequence parallelism methods mentioned, both ring attention and Ulysse
     Memory usage: Both are similar in terms of memory consumption.
     Model structure generalization: Ring attention is better than Ulysses in terms of generalization. Ulysses requires that the model config need to meet ```the head number // (tp group size * sp group size)``` condition, while ring attention has no such restrictions.
 Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
-Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it's easy to train sequences of up to 128k length, and the performance is the best among all sequence parallelism methods. However, if you're aiming for extreme performance optimization or training long texts on a larger scale of machines, you might want to consider using the ring attention.
+Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it's easy to train sequences of up to 128k length, and the performance is the best among all sequence parallelism methods，can reach approximately 480+ TFLOPS on dual H800s. However, if you're aiming for extreme performance optimization or training long texts on a larger scale of machines, you might want to consider using the ring attention.
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->
diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index 0eeca57a62f0..5280a5813a28 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -101,7 +101,7 @@ plugin = HybridParallelPlugin(
         )
 ```
 
-启动命令参数举例：```--tp 2 --sp 8 --sp_mode split_gather```
+启动参数举例：```--tp 2 --sp 8 --sp_mode split_gather```
 
 #### 使用DeepSpeed-Ulysses
 定义plugin， 在DeepSpeed-Ulysses的序列并行种，tp group与sp group 是正交的，
@@ -114,7 +114,7 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="all_to_all",
         )
 ```
-启动命令参数举例：```--tp 2 --sp 8 --sp_mode all_to_all```
+启动参数举例：```--tp 2 --sp 8 --sp_mode all_to_all```
 
 #### 使用ring attention
 定义plugin， 在ring attention的序列并行种，tp group与sp group 是正交的，sp_size必须传入准确的并行大小。
@@ -127,7 +127,7 @@ plugin = HybridParallelPlugin(
             sequence_parallelism_mode="ring_attn",
         )
 ```
-启动命令参数举例：```--tp 2 --sp 8 --sp_mode ring_attn```
+启动参数举例：```--tp 2 --sp 8 --sp_mode ring_attn```
 
 #### 使用booster
 ```python
@@ -165,7 +165,7 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 
 由于使用简单，对Attention计算不侵入修改，Ulysses目前是序列并行的主流。这些序列并行都可与其他高性能注意力兼容，如flash attention，还可以与ZeRO、TP、PP、DP等多种并行训练策略混合使用。
 
-总的来说，我们更推荐您使用Ulysses，只需要在启动时指定```--sp_mode all_to_all```即可。经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时它的性能表现也是所有序列并行模式中最好的。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
+总的来说，我们更推荐您使用Ulysses，只需要在启动时指定```--sp_mode all_to_all```即可。经过测试，在双机16卡的情况下，使用```--tp 2 --sp 8 --sp_mode all_to_all```的启动参数可以很轻松训练128k长度的序列，同时它的性能表现也是所有序列并行模式中最好的，在双机H800上能够达到约480以上的tflops。但如果追求极致性能优化，或者使用较多机器训练长文本，可以考虑使用ring attention模式的序列并行。
 
 
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->

From ea24e7b9eccde0006b650eaadf32703e6d60f44f Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:38:48 +0800
Subject: [PATCH 08/10] seq parallel doc

---
 docs/source/en/features/sequence_parallelism.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/source/en/features/sequence_parallelism.md b/docs/source/en/features/sequence_parallelism.md
index 1e01b9d8c2d2..46f3e8fef644 100644
--- a/docs/source/en/features/sequence_parallelism.md
+++ b/docs/source/en/features/sequence_parallelism.md
@@ -156,9 +156,14 @@ Currently, the `MoeHybridParallelPlugin` only supports DeepSpeed-Ulysses sequenc
 
 ### Conclusion
 Among the sequence parallelism methods mentioned, both ring attention and Ulysses have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
+
     Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands more bandwidth from the hardware.
+
     Memory usage: Both are similar in terms of memory consumption.
+
     Model structure generalization: Ring attention is better than Ulysses in terms of generalization. Ulysses requires that the model config need to meet ```the head number // (tp group size * sp group size)``` condition, while ring attention has no such restrictions.
+
 Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
+
 Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it's easy to train sequences of up to 128k length, and the performance is the best among all sequence parallelism methods，can reach approximately 480+ TFLOPS on dual H800s. However, if you're aiming for extreme performance optimization or training long texts on a larger scale of machines, you might want to consider using the ring attention.
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->

From 765db38e4806fd7c62eb73bae9a09f22bfa83408 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Fri, 27 Sep 2024 16:40:13 +0800
Subject: [PATCH 09/10] seq parallel doc

---
 docs/source/en/features/sequence_parallelism.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/features/sequence_parallelism.md b/docs/source/en/features/sequence_parallelism.md
index 46f3e8fef644..354b8af59140 100644
--- a/docs/source/en/features/sequence_parallelism.md
+++ b/docs/source/en/features/sequence_parallelism.md
@@ -163,7 +163,7 @@ Among the sequence parallelism methods mentioned, both ring attention and Ulysse
 
     Model structure generalization: Ring attention is better than Ulysses in terms of generalization. Ulysses requires that the model config need to meet ```the head number // (tp group size * sp group size)``` condition, while ring attention has no such restrictions.
 
-Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream method for sequence parallelism. Both methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
+Due to its simplicity and non-intrusive modification to attention calculation, Ulysses is currently the mainstream for sequence parallelism. All sequence parallel methods can be compatible with other high-performance attention methods such as Flash Attention, and can also be combined with other parallel training strategies like ZeRO, TP, PP, and DP.
 
 Overall, we recommend using Ulysses. You only need to specify ```--sp_mode all_to_all``` during startup. Based on testing, in a two-node, 16-GPU setup, using the startup parameters ```--tp 2 --sp 8 --sp_mode all_to_all```, it's easy to train sequences of up to 128k length, and the performance is the best among all sequence parallelism methods，can reach approximately 480+ TFLOPS on dual H800s. However, if you're aiming for extreme performance optimization or training long texts on a larger scale of machines, you might want to consider using the ring attention.
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=4 sequence_parallelism.py  -->

From ceadef35d5891f144c8829e0988f1fdbeaecdc89 Mon Sep 17 00:00:00 2001
From: wangbluo <2538539015@qq.com>
Date: Mon, 7 Oct 2024 10:29:58 +0800
Subject: [PATCH 10/10] fix

---
 docs/source/en/features/sequence_parallelism.md      | 2 +-
 docs/source/zh-Hans/features/sequence_parallelism.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/features/sequence_parallelism.md b/docs/source/en/features/sequence_parallelism.md
index 354b8af59140..e422a24a9c8f 100644
--- a/docs/source/en/features/sequence_parallelism.md
+++ b/docs/source/en/features/sequence_parallelism.md
@@ -157,7 +157,7 @@ Currently, the `MoeHybridParallelPlugin` only supports DeepSpeed-Ulysses sequenc
 ### Conclusion
 Among the sequence parallelism methods mentioned, both ring attention and Ulysses have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
 
-    Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands more bandwidth from the hardware.
+    Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands dense network topologies, e.g. NVLink + NVSwitch, so it doesn't scale well across multiple nodes.
 
     Memory usage: Both are similar in terms of memory consumption.
 
diff --git a/docs/source/zh-Hans/features/sequence_parallelism.md b/docs/source/zh-Hans/features/sequence_parallelism.md
index 5280a5813a28..ddf1ae293d5e 100644
--- a/docs/source/zh-Hans/features/sequence_parallelism.md
+++ b/docs/source/zh-Hans/features/sequence_parallelism.md
@@ -157,7 +157,7 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
 ### 结论
 在上述序列并行方法中，ring attention和Ulysses各有优劣，我们需要根据情况来选择合适的序列并行方法：
 
-    通信方面：Ulysses通信量优于ring attention，Ulysess主要包含三次All2All通信量,而ring attention的通信会随着序列长度增长而平方增长。不过另一方面，all2all对底层硬件的要求也会更高。
+    通信方面：Ulysses通信量优于ring attention，Ulysess主要包含三次All2All通信量,而ring attention的通信会随着序列长度增长而平方增长。不过另一方面，all2all op由于需要更复杂的网络拓扑，例如NVLink和NVSwitch，因此在多机情况时，并不会随着机器数量增加而有较好的性能提升。
 
     内存占用：二者类似。