[Pretraining] What should I pay attention to in the dataset when pretraining? #6556

leeburt · 2025-01-07T15:02:06Z


Reminder

I have read the README and searched the existing issues.

System Info

LLaMA Factory, version 0.9.2.dev0

Reproduction

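First of all, thank you very much for open-sourcing this project. I ran into the following questions while using it:
1. Question 1
I prepared my pretraining data in the format described in the README. The format is:
[{"text": "document"}, {"text": "document"} ]
The documents in my data range in length from a few hundred to tens of thousands of Chinese characters. However, the model input is truncated at cutoff_len, which is 2048. How should I understand this, and what adjustments do I need to make?
2. Question 2
My dataset is a collection of 100 articles totaling about 200,000 characters. Each article ranges from a few hundred to tens of thousands of characters. What training strategy would be appropriate?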

Others

No response

hiyouga (Maintainer) · 2025-01-07T15:11:43Z

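During training, the data is automatically grouped into blocks of cutoff_len, so you do not need to worry about the original document lengths.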
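To illustrate what grouping by cutoff_len can look like, here is a minimal sketch of the common "concatenate, then chunk" approach used for pretraining data. It is not LLaMA Factory's actual code: the function name `group_into_blocks`, the end-of-document token `eos_id`, and the choice to drop the incomplete tail are all assumptions made for illustration.

```python
from typing import Iterable, List


def group_into_blocks(
    docs: Iterable[List[int]], cutoff_len: int, eos_id: int
) -> List[List[int]]:
    """Concatenate tokenized documents and split them into fixed-size blocks.

    Hypothetical helper for illustration only. Each document is followed by
    an end-of-document token so that boundaries stay visible after packing.
    """
    stream: List[int] = []
    for token_ids in docs:
        stream.extend(token_ids)
        stream.append(eos_id)

    # Assumption: the trailing tokens that do not fill a whole block are dropped.
    usable = (len(stream) // cutoff_len) * cutoff_len
    return [stream[i : i + cutoff_len] for i in range(0, usable, cutoff_len)]


# Three "documents" of very different lengths (token IDs are made up).
docs = [[1, 2, 3], list(range(4, 14)), list(range(14, 19))]
blocks = group_into_blocks(docs, cutoff_len=4, eos_id=0)

for block in blocks:
    print(block)
```

Under these assumptions, a short document shares a block with the start of the next one, and a long document is spread across several consecutive blocks instead of being cut off after the first cutoff_len tokens. Every block handed to the model has exactly cutoff_len tokens.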
