Sdg v0.6.0+ multiple knowledge sources fails to clone #404
Comments
Thanks Kodie! At first glance it looks like you're right, and we just didn't hit this in any of our testing. I'm not sure the current knowledge document clone directory has any particular semantic meaning that required it to use a timestamp with only second resolution. We should consider making this either an entirely unique directory or, if we want the directory structure to have semantic meaning, basing it on something like the taxonomy leaf node path plus a timestamp (a rough sketch of that idea is below). Looping in @khaledsulayman and @aakankshaduggal here for awareness or differing opinions on how we should address this.
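A rough sketch of the "leaf node path plus timestamp" naming idea mentioned above; the helper name and `leaf_node_path` argument are hypothetical illustrations, not SDG's actual API:

```python
from datetime import datetime
from pathlib import Path


def semantic_document_output_dir(output_dir: str, leaf_node_path: str) -> Path:
    # e.g. "knowledge/foo/bar" -> documents-knowledge_foo_bar-2024-11-04T16_30_05
    timestamp = datetime.now().strftime("%Y-%m-%dT%H_%M_%S")
    safe_leaf = leaf_node_path.strip("/").replace("/", "_")
    return Path(output_dir) / f"documents-{safe_leaf}-{timestamp}"
```

This keeps the directory name meaningful, but note it still only has second resolution, so on its own it would not protect against two runs in the same second.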
It turns out we were always failing when 2+ knowledge taxonomy leaf nodes were in use. Working on a fix now.
Previously, we were attempting to clone multiple knowledge documents into the same destination directory, leading to failures generating data for any run that contained 2+ knowledge leaf nodes. Now, we clone docs into a guaranteed unique (via `tempfile.mkdtemp`) subdirectory per knowledge leaf node. Just using a subdirectory per leaf node could still have led to collisions if the user ran data generation twice within one minute, which is why this goes the extra step of using `mkdtemp` for guaranteed uniqueness. Fixes instructlab#404 Signed-off-by: Ben Browning <[email protected]>
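For reference, a minimal sketch of the `tempfile.mkdtemp` approach described in the fix; the helper name and arguments here are illustrative, not the exact code from PR #416:

```python
import tempfile
from pathlib import Path


def unique_document_output_dir(output_dir: str, leaf_node_name: str) -> Path:
    # mkdtemp both creates the directory and guarantees a unique name,
    # so two leaf nodes (or two runs within the same second) can never collide.
    path = tempfile.mkdtemp(prefix=f"documents-{leaf_node_name}-", dir=output_dir)
    return Path(path)
```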
The linked PR #416 fixes this. In the interim, I believe the only viable workaround is to ensure you only generate data for a single taxonomy leaf node at a time. This may require creative use of the
Previously, we were attempting to clone multiple knowledge documents into the same destination directory, leading to failures generating data for any run that contained 2+ knowledge leaf nodes. Now, we clone docs into a guaranteed unique (via `tempfile.mkdtemp`) subdirectory per knowledge leaf node. Just using a subdirectory per leaf node could still have led to collisions if the user ran data generation twice within one minute, which is why this goes the extra step of using `mkdtemp` for guaranteed uniqueness. Fixes #404 Signed-off-by: Ben Browning <[email protected]> (cherry picked from commit 823c279)
This has been released in v0.6.1 - thanks for the report! @KodieGlosserIBM let us know if you run into any more issues around multiple knowledge doc cloning. Every git clone we do for knowledge documents is now guaranteed to be in a unique destination directory, even if we've already cloned this knowledge doc, are cloning knowledge doc repos for multiple leaf nodes in short succession, and so on.
Ahhh great find @bbrowning !! Thanks for the quick fix.
I think I found a potential race condition specifically here (context aware chunking): #284
Basically, if there is more than one knowledge document for git to clone and it happens to do multiple clones within the same second, it will generate the same output dir:
`document_output_dir = Path(output_dir) / f"documents-{date_suffix}"`
This causes SDG to fail, since the directory already exists when the git clone runs.
Generating data for a single knowledge document, things work just fine. It's when we get to multiple documents that I see failures.
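To make the collision concrete, here is a small standalone reproduction of the naming scheme; the exact format of `date_suffix` is an assumption, but any second-resolution timestamp behaves the same way:

```python
from datetime import datetime
from pathlib import Path

output_dir = "/tmp/sdg-output"  # placeholder path
date_suffix = datetime.now().strftime("%Y-%m-%dT%H_%M_%S")  # assumed format; second resolution

# Two leaf nodes processed within the same second resolve to the same directory,
# so the second `git clone` aborts because its destination already exists.
for leaf_node in ("knowledge/foo", "knowledge/bar"):
    document_output_dir = Path(output_dir) / f"documents-{date_suffix}"
    print(leaf_node, "->", document_output_dir)
```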