Knowledge has an incompatible new v3 file format #1253

Closed
markmc opened this issue Jul 23, 2024 · 5 comments

@markmc
markmc commented Jul 23, 2024

See instructlab/sdg#160

A new v3 knowledge format has been added to InstructLab, with no backwards compatibility for v1 or v2 contributions. This will be released in InstructLab v0.18.0.

Existing knowledge contributions need to be updated, along with any documentation on creating knowledge contributions.

https://github.com/instructlab/instructlab/blob/main/scripts/test-data/e2e-qna-knowledge.yaml is an example of the new format

markmc referenced this issue in instructlab/sdg Jul 24, 2024
This is part of #160

The changes here originated from aakankshaduggal@5baf6df

There are two major changes here.

- When parsing a `qna.yaml` file from a taxonomy tree, adjust for the
  new schema for knowledge. There is no attempt to maintain
  compatibility with prior versions of the schema (v1, v2).

- Change how we translate the taxonomy data into the dataset sent into
  the pipeline as input. Instead of implementing a sliding window
  approach of 3 sample qna pairs at a time over all chunks of the
  document, we now create a row per seed_example (context and
  associated qna pairs) for each chunk of knowledge docs.

Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: shiv <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
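
The second change in the commit message above can be illustrated with a minimal sketch. This is not the code from instructlab/sdg: the function names, the row layout, and the `questions_and_answers` key are assumptions made for illustration only. It contrasts a sliding window of 3 Q&A pairs over every chunk with one row per seed example per chunk.

```python
def sliding_window_rows(chunks, qna_pairs, window=3):
    """Old approach (illustrative): slide a window of `window` Q&A pairs over each chunk."""
    rows = []
    for chunk in chunks:
        for start in range(len(qna_pairs) - window + 1):
            rows.append({"document": chunk, "qna": qna_pairs[start:start + window]})
    return rows


def seed_example_rows(chunks, seed_examples):
    """New approach (illustrative): one row per seed example for each document chunk."""
    rows = []
    for chunk in chunks:
        for seed in seed_examples:
            rows.append({
                "document": chunk,
                "context": seed["context"],
                # Key name is an assumption, not confirmed by this thread.
                "qna": seed["questions_and_answers"],
            })
    return rows
```

With 2 chunks and 5 flat Q&A pairs, the sliding window yields 3 rows per chunk; with 2 chunks and 2 seed examples, the per-seed-example layout yields 4 rows, each carrying its own context.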
@bjhargrave
Contributor

We have existing v1 knowledge in the main branch which needs to be fixed or removed.

@markmc
Author

markmc commented Aug 6, 2024

xref #1260

@markmc
Author

markmc commented Aug 6, 2024

v3 example: #1255

@markmc
Author

markmc commented Aug 6, 2024

From @juliadenham 👍

The YAML file must include a minimum of 5 context fields, each with a minimum of 3 Q&A pairs relating to that context.

Context should be a chunk from the knowledge document(s) being submitted and should be in markdown format.

Length constraints:
- Context: 500 words
- Q&A pairs: 250 words
- Total: 750 words

`document_outline` is a new field that replaces `task_description`. It should describe the information, noting specifics from each context chunk.
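
The rules in this comment could be checked before submitting with something like the sketch below. It is only an illustration, not an official InstructLab validator. It assumes the YAML has already been loaded into a dict, that the key names are `seed_examples`, `context`, `questions_and_answers`, `question`, and `answer` (not confirmed in this thread), and that the word counts are per-context maximums.

```python
# Limits taken from the comment above; treating them as maximums is an assumption.
MIN_CONTEXTS = 5
MIN_QNA_PER_CONTEXT = 3
MAX_CONTEXT_WORDS = 500
MAX_QNA_WORDS = 250
MAX_TOTAL_WORDS = 750


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def check_knowledge(qna):
    """Return a list of problems found in a loaded qna dict (empty list if none)."""
    problems = []
    if not qna.get("document_outline"):
        problems.append("missing document_outline")

    seeds = qna.get("seed_examples", [])  # key name is an assumption
    if len(seeds) < MIN_CONTEXTS:
        problems.append(f"found {len(seeds)} contexts, need at least {MIN_CONTEXTS}")

    for i, seed in enumerate(seeds):
        pairs = seed.get("questions_and_answers", [])  # key name is an assumption
        if len(pairs) < MIN_QNA_PER_CONTEXT:
            problems.append(
                f"context {i}: {len(pairs)} Q&A pairs, need at least {MIN_QNA_PER_CONTEXT}"
            )
        context_words = word_count(seed.get("context", ""))
        qna_words = sum(
            word_count(p.get("question", "")) + word_count(p.get("answer", ""))
            for p in pairs
        )
        if context_words > MAX_CONTEXT_WORDS:
            problems.append(f"context {i}: context has {context_words} words")
        if qna_words > MAX_QNA_WORDS:
            problems.append(f"context {i}: Q&A pairs have {qna_words} words")
        if context_words + qna_words > MAX_TOTAL_WORDS:
            problems.append(f"context {i}: {context_words + qna_words} words in total")
    return problems
```

Calling `check_knowledge(data)` on a parsed `qna.yaml` returns a list of messages rather than raising, so every violation is reported in one pass.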

@juliadenham
Contributor

We now support V3 knowledge, yay!
