Sdg v0.6.0+ multiple knowledge sources fails to clone #404

Closed
KodieGlosserIBM opened this issue Nov 20, 2024 · 5 comments · Fixed by #416
Labels: bug, jira

Comments

@KodieGlosserIBM

I think I found a potential race condition here, in the context-aware chunking change: #284
If there is more than one knowledge document for git to clone and multiple clones happen within the same second, they all generate the same output dir:
document_output_dir = Path(output_dir) / f"documents-{date_suffix}"
SDG then fails because the directory already exists when git clone runs.

Generating data for a single knowledge document works just fine. The failures only show up once there are multiple knowledge documents.
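
For anyone trying to picture the collision, here is a minimal sketch that reproduces it without git. It assumes `date_suffix` is a timestamp truncated to whole seconds (the exact format used in SDG may differ), and `simulate_clones` is a hypothetical stand-in that only creates the destination directory the way a clone would.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def document_output_dir(output_dir: str, now: datetime) -> Path:
    # Assumed format: a second-resolution timestamp, as in the line quoted above.
    date_suffix = now.replace(microsecond=0).isoformat().replace(":", "_")
    return Path(output_dir) / f"documents-{date_suffix}"


def simulate_clones(output_dir: str, clone_times: list[datetime]) -> list[str]:
    """Create one destination directory per clone; report 'ok' or 'exists'."""
    results = []
    for now in clone_times:
        target = document_output_dir(output_dir, now)
        try:
            # A real clone would fail here too if the directory already existed.
            target.mkdir(parents=True, exist_ok=False)
            results.append("ok")
        except FileExistsError:
            results.append("exists")
    return results


with tempfile.TemporaryDirectory() as tmp:
    # Two clones that start within the same wall-clock second.
    same_second = [
        datetime(2024, 11, 20, 12, 0, 0, 100_000),
        datetime(2024, 11, 20, 12, 0, 0, 900_000),
    ]
    outcome = simulate_clones(tmp, same_second)
    print(outcome)  # ['ok', 'exists']
```

The second entry fails only because both timestamps truncate to the same second, so both clones compute the same path.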

@bbrowning
Contributor

Thanks Kodie! At first glance it looks like you're right, and we just never hit this in any of our testing. I'm not sure the current knowledge document clone directory has any particular semantic meaning that requires a timestamp with only second resolution. We should consider either making this an entirely unique directory or, if we want the directory structure to have semantic meaning, basing it on something like the taxonomy leaf node path plus a timestamp. Looping in @khaledsulayman and @aakankshaduggal here for awareness or differing opinions on how we should address this.
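
A rough sketch of the second option (leaf node path plus timestamp), to make the trade-off concrete. The function name `leaf_scoped_dir`, the slug format, and the example leaf paths are all hypothetical and not taken from the SDG code.

```python
from datetime import datetime
from pathlib import Path


def leaf_scoped_dir(output_dir: str, leaf_node_path: str, now: datetime) -> Path:
    # Flatten the leaf node path into a single directory-name component.
    leaf_slug = leaf_node_path.strip("/").replace("/", "_")
    # Same second-resolution timestamp as before (assumed format).
    date_suffix = now.replace(microsecond=0).isoformat().replace(":", "_")
    return Path(output_dir) / f"documents-{leaf_slug}-{date_suffix}"


now = datetime(2024, 11, 26, 9, 30, 0)
physics = leaf_scoped_dir("out", "knowledge/science/physics", now)
biology = leaf_scoped_dir("out", "knowledge/science/biology", now)
print(physics.name)  # documents-knowledge_science_physics-2024-11-26T09_30_00
print(physics != biology)  # True
```

This keeps the directory name meaningful and separates different leaf nodes cloned in the same second, but the same leaf node processed twice within one second would still collide.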

@bbrowning bbrowning added the jira label Nov 26, 2024
@bbrowning
Contributor

It turns out we were always failing when 2+ knowledge taxonomy leaf nodes were in use. Working on a fix now.

bbrowning added a commit to bbrowning/instructlab-sdg that referenced this issue Nov 27, 2024
Previously, we were attempting to clone multiple knowledge documents
into the same destination directory, leading to failures generating
data for any run that contained 2+ knowledge leaf nodes.

Now, we clone docs into a guaranteed unique (via `tempfile.mkdtemp`)
subdirectory per knowledge leaf node. Just using a subdirectory per
leaf node could still have led to collisions if the user ran data
generation twice within one minute, which is why this goes the extra
step of using `mkdtemp` for guaranteed uniqueness.

Fixes instructlab#404

Signed-off-by: Ben Browning <[email protected]>
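
As an illustration of the approach the commit message describes, a minimal sketch might look like the following. It is not the code from the PR: `unique_clone_dir`, the `documents` directory name, and the `clone-` prefix are hypothetical; only `tempfile.mkdtemp` comes from the commit message.

```python
import tempfile
from pathlib import Path


def unique_clone_dir(output_dir: Path, leaf_node_path: str) -> Path:
    # One parent directory per knowledge leaf node (hypothetical layout).
    leaf_dir = output_dir / "documents" / leaf_node_path.replace("/", "_")
    leaf_dir.mkdir(parents=True, exist_ok=True)
    # mkdtemp creates a new directory with a name that does not already exist.
    return Path(tempfile.mkdtemp(prefix="clone-", dir=leaf_dir))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two clones for the same leaf node, back to back.
    first = unique_clone_dir(root, "knowledge/science/physics")
    second = unique_clone_dir(root, "knowledge/science/physics")
    distinct = first != second
    same_parent = first.parent == second.parent
    both_exist = first.is_dir() and second.is_dir()
    print(distinct, same_parent, both_exist)  # True True True
```

Because `mkdtemp` picks a fresh name on every call, uniqueness no longer depends on timestamps at all.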
@bbrowning
Contributor

The linked PR #416 fixes this. In the interim, I believe the only viable workaround is to generate data for a single taxonomy leaf node at a time. This may require creative use of the --taxonomy-base parameter to ilab data generate and/or working with taxonomies that have only one knowledge leaf node. It's generally not easy to specify only a subset of your taxonomy leaf nodes per generate call. You either have to copy files into a new taxonomy tree containing just that subset, or be very careful in how you layer git commits in the taxonomy repo so that only a single taxonomy leaf node changes per commit.
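
The "copy files into a new taxonomy tree" option could be scripted roughly as below. This is only a sketch: `single_leaf_taxonomy` is a hypothetical helper, and it assumes each leaf node is a directory holding a `qna.yaml` file. The demo builds a throwaway two-leaf taxonomy in a temp directory so it can run anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def single_leaf_taxonomy(taxonomy_root: Path, leaf_rel_path: str, dest_root: Path) -> Path:
    """Copy one leaf node into a fresh tree, preserving its relative path."""
    src = taxonomy_root / leaf_rel_path
    dest = dest_root / leaf_rel_path
    shutil.copytree(src, dest)  # creates missing parent directories of dest
    return dest_root


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder taxonomy with two knowledge leaf nodes (assumed layout).
    full = Path(tmp) / "taxonomy"
    for leaf in ("knowledge/science/physics", "knowledge/science/biology"):
        (full / leaf).mkdir(parents=True)
        (full / leaf / "qna.yaml").write_text("# placeholder\n")

    subset = single_leaf_taxonomy(
        full, "knowledge/science/physics", Path(tmp) / "taxonomy-subset"
    )
    leaves = sorted(
        p.parent.relative_to(subset).as_posix() for p in subset.rglob("qna.yaml")
    )
    print(leaves)  # ['knowledge/science/physics']
```

The resulting tree contains exactly one knowledge leaf node, so a generate run against it would only ever clone one set of knowledge documents.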

@bbrowning bbrowning added the bug Something isn't working label Nov 27, 2024
mergify bot pushed a commit that referenced this issue Nov 27, 2024
(cherry picked from commit 823c279)
@bbrowning
Contributor

This has been released in v0.6.1. Thanks for the report! @KodieGlosserIBM, let us know if you run into any more issues around multiple knowledge doc cloning. Every git clone we do for knowledge documents is now guaranteed to use a unique destination directory, even if we've already cloned that knowledge doc or are cloning knowledge doc repos for multiple leaf nodes in quick succession.

@KodieGlosserIBM
Author

Ahhh great find @bbrowning !! Thanks for the quick fix.
