
[checkpointio]support distributed checkpoint io for model saving. #6181

Open
wants to merge 5 commits into main from dist-ckp

Conversation

flybird11111
Contributor

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@flybird11111 flybird11111 requested a review from a team as a code owner January 16, 2025 10:28
@flybird11111 flybird11111 changed the title [checkpointio]support distribute checkpoint io [checkpointio]support distributed checkpoint io for model saving. Jan 16, 2025

MODEL_META_PREFIX = "pytorch_model-meta-dist-"
MODEL_WEIGHT_PREFIX = "pytorch_model-dist-"
MODEL_SHARD_SUUFIX = ".index.json"
Member

SHARD_META_SUFFIX?

@@ -10,4 +11,5 @@
"GeneralCheckpointIO",
"HybridParallelCheckpointIO",
"MoECheckpointIO",
"DistributedCheckpointIO",
Member

It should not be an independent checkpoint IO class. It should provide utility functions that each existing checkpoint IO class can use.
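A rough sketch of that suggestion (the helper and class names below are hypothetical, not this PR's actual API): expose the distributed-save logic as free functions that every existing CheckpointIO class calls directly.

from typing import Dict


def save_dist_model_shards(model_metadata: Dict, checkpoint: str, dist_id: int) -> None:
    # Hypothetical shared helper: write this rank's weight shards plus a metadata
    # file under `checkpoint`, reusing the common shard writers and save_metadata().
    ...


class HybridParallelCheckpointIOSketch:
    # Stand-in for the existing HybridParallelCheckpointIO.
    def save_sharded_model(self, model, checkpoint: str, dist_id: int = 0) -> None:
        model_metadata = {}  # would be built per parameter from the TP layout
        # Reuse the shared helper instead of routing through a separate
        # DistributedCheckpointIO class.
        save_dist_model_shards(model_metadata, checkpoint, dist_id)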

@Lemon-412

Hi all, please take a look at this. This bug is quite annoying for me.

#6168

@flybird11111 flybird11111 force-pushed the dist-ckp branch 2 times, most recently from e8659ea to 51c208c, January 20, 2025 03:30
@flybird11111
Contributor Author

Hi all, please take a look at this. This bug is quite annoying for me.

#6168

ok

    return destination


def load_state_dict_into_dist_model(
Member

What is this function for? Is it for loading the whole state dict? The default model.load_state_dict() already implements this feature.

Contributor Author

The ParallelModule will perform the tensor gather operation.
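Roughly, the point is that each rank's local shards can be copied into the unwrapped parameters directly, without going through the wrapper's load path (a hand-written sketch, not the PR's implementation; the gather behavior of ColossalAI's ParallelModule hooks is assumed here):

import torch
import torch.nn as nn


@torch.no_grad()
def copy_local_shards_(model: nn.Module, local_state_dict: dict) -> None:
    # Each value in `local_state_dict` is already this rank's local shard, so a
    # plain copy_ suffices. Calling model.load_state_dict() on the wrapped model
    # would instead route through the ParallelModule hooks and trigger gather ops.
    params = dict(model.named_parameters())
    for name, shard in local_state_dict.items():
        params[name].copy_(shard)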

    tp_rank=None,
):
    param_origin_shape = model.param_origin_shape
    model = model.unwrap()
Member

The type hint and the method called here do not match. If you assume the model is wrapped, then the type should be ModuleWrapper.

Contributor Author

updated

tp_partition_dim = search_tp_partition_dim(
    current_shape=param.shape, original_shape=original_shape, tp_size=tp_size
)
model_metadata[prefix + name]["offsets"] = torch.zeros(len(original_shape), dtype=torch.int)
Member

Why not use a list directly?
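For reference, the difference being pointed at (illustrative shape; both forms carry the same information, but the plain list serializes to JSON without a .tolist() round trip):

import torch

original_shape = (1024, 4096)  # example only

offsets_tensor = torch.zeros(len(original_shape), dtype=torch.int)  # current code: a tensor
offsets_list = [0] * len(original_shape)                            # reviewer's suggestion: a plain list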

Comment on lines +91 to +96
def create_model_metadata(
    model: nn.Module,
    prefix: str = "",
    tp_size=None,
    tp_rank=None,
):
Member

It seems that this function is only intended for TP. What about Gemini? If it's only designed for TP, then move it to the hybrid parallel checkpoint IO file.

Contributor Author

DP support can be added in the future.
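To make the TP-only scope concrete, this is roughly what the metadata records for one tensor-parallel parameter (illustrative values; only the offsets/lengths fields from the snippet above are assumed):

# Example: a weight of original shape (4096, 1024), split along dim 0 across tp_size=4.
original_shape = (4096, 1024)
tp_size, tp_rank, tp_partition_dim = 4, 1, 0

offsets = [0] * len(original_shape)
lengths = list(original_shape)
offsets[tp_partition_dim] = original_shape[tp_partition_dim] // tp_size * tp_rank
lengths[tp_partition_dim] = original_shape[tp_partition_dim] // tp_size

entry = {"offsets": offsets, "lengths": lengths}  # {"offsets": [1024, 0], "lengths": [1024, 1024]}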

metadata_dicts["metadata"][name][k] = v
if checkpoint_file is not None:
metadata_dicts["metadata"][name]["file"] = checkpoint_file
metadata_dicts["metadata"][name]["rank"] = dist.get_rank(_get_default_group())
Member

This rank is not contiguous. Is it just for indicating order?

Contributor Author
@flybird11111 flybird11111 Jan 20, 2025

This rank just marks which shard the tensor was split into. It is for internal use and, as I understand it, only needs to be unique.

Comment on lines +260 to +510
            if key not in covered_shards or rank not in covered_shards[key]:
                continue
            if dtype is None:
                dtype = weight.dtype
            covered_shards[key][rank]["weight"] = weight
    state_dict = {}
    for key, shards in covered_shards.items():
        state = assemble_tensor_from_shards_partial(
            shards, model_metadata[key]["offsets"], model_metadata[key]["lengths"], dtype=dtype
        )
        state_dict[key] = state

    if not low_cpu_mem_mode:
        state_dict = create_pinned_state_dict(state_dict, empty=False, num_threads=num_threads)

    load_state_dict_into_dist_model(model=model, state_dict=state_dict)

    # Update master params if mixed-precision training is enabled.
    model_before_wrapping.update_master_params()


def save_dist_sharded_model(
    model: ModelWrapper,
    model_metadata: Dict,
    checkpoint: str,
    prefix: Optional[str] = None,
    size_per_shard: int = 1024,
    use_safetensors: bool = False,
    use_async: bool = False,
    dist_id: int = 0,
    pinned_state_dicts=None,
) -> None:
    """
    Save sharded model checkpoint under the given checkpointing path.
    The following files will be created under the path:
    - An index file (pytorch_model.bin.index.json) containing a map between model params/buffers and file names.
    - Multiple files that store state tensors of models.
    If pipeline parallelism is used, the filenames are in the form of "pytorch_model.<prefix>-stage-000XX-shard-000XX.bin".
    If pipeline parallelism is not used, "pytorch_model.<prefix>-000XX.bin"

    Args:
        model (nn.Module): Model on local device to be saved.
        checkpoint (str): Checkpointing path which should be a directory path.
        gather_dtensor (bool, optional): Whether to gather_dtensor, currently not used. Defaults to True.
        prefix (str, optional): Prefix of file to save. Defaults to None.
        size_per_shard (int, optional): Size per shard in MB. Defaults to 1024.
        use_safetensors (bool, optional): Whether to use safe tensors. Defaults to False.
        use_async (bool, optional): Whether to save the state_dicts of model asynchronously. Defaults to False.
    """

    model = model.unwrap()

    if os.path.isfile(checkpoint):
        logging.error(f"Provided path ({checkpoint}) should be a directory, not a file")
        return

    Path(checkpoint).mkdir(parents=True, exist_ok=True)
    # Devices along the same dp_group share the same copies of model.
    # So only let the device with dp_rank == 0 and sp_rank == 0 save the model.

    if use_async:
        if id(model) not in pinned_state_dicts:
            pinned_state_dicts[id(model)] = {}
        pinned_state_dicts = pinned_state_dicts[id(model)]
    else:
        pinned_state_dicts = None
    state_dict_shard = dist_model_sharder(model, size_per_shard=size_per_shard, pinned_state_dicts=pinned_state_dicts)
    weights_name, _ = get_model_base_filenames(prefix, use_safetensors)
    index_file = CheckpointIndexFile(checkpoint)

    # Manage filenames of sharded weights and index file for each pipeline stage.
    weights_name = weights_name.replace(".bin", f"-dist-{dist_id:05d}-shard.bin")
    weights_name = weights_name.replace(".safetensors", f"-dist-{dist_id:05d}-shard.safetensors")
    metadata_file = os.path.join(checkpoint, f"{MODEL_META_PREFIX}{dist_id:05d}{SHARD_META_SUFFIX}")
    async_writers = []
    if use_async:
        total_size, writers = async_save_state_dict_shards(
            sharded_state_dict=state_dict_shard,
            checkpoint=checkpoint,
            index_file=index_file,
            base_filename=weights_name,
            is_master=True,
            state_preprocess=False,
        )
        async_writers.extend(writers)
    else:
        total_size = save_state_dict_shards(
            sharded_state_dict=state_dict_shard,
            checkpoint=checkpoint,
            index_file=index_file,
            base_filename=weights_name,
            is_master=True,
            use_safetensors=use_safetensors,
            use_pp_format=True,
        )
    for k, _ in model_metadata.items():
        model_metadata[k]["file"] = index_file.get_checkpoint_file(k)

    save_metadata(model_metadata, metadata_file, total_size=total_size)
Member

If it's only designed for hybrid parallelism, then move it to the hybrid parallel checkpoint IO file. Also, there is too much redundant code; please try to reuse some common code snippets.

Contributor Author

The format of metadata is uniform, and save_metadata is generic.
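For context, the per-shard metadata file then looks roughly like this (a hand-written illustration limited to the fields visible in this diff; all values are made up):

example_metadata = {
    "total_size": 13476839424,  # as returned by the shard writers
    "metadata": {
        "model.layers.0.self_attn.q_proj.weight": {
            "offsets": [1024, 0],
            "lengths": [1024, 4096],
            "file": "pytorch_model-dist-00001-shard.safetensors",
            "rank": 1,  # marks which shard of the tensor this is; only needs to be unique
        },
    },
}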

@ver217
Member

ver217 commented Jan 20, 2025

DON'T merge to main. Create a new feature branch on the org repo and merge to it.

Labels: None yet
Projects: None yet
3 participants