Improve EPUB 3 Content Documents chunking #123

Closed
bertfrees opened this issue Apr 10, 2018 · 10 comments · Fixed by #149

Comments

@bertfrees
Member

bertfrees commented Apr 10, 2018

@rdeltour said (March 13, 2018):

Currently the *-to-epub3 scripts chunk content pretty naïvely (based on top-level sections). We should try to improve that:

@bertfrees
Member Author

bertfrees commented Apr 10, 2018

This is how the chunking is currently implemented in:

  • daisy202-to-epub3: No chunking. The EPUB contains as many HTML documents as there are in the DAISY 2.02.
  • daisy3-to-epub3: Uses dtbook-to-zedai and zedai-to-html.
  • dtbook-to-epub3: Uses dtbook-to-zedai and zedai-to-epub3.
  • dtbook-to-html: Uses dtbook-to-zedai and zedai-to-html.
  • epub3-to-epub3: No chunking.
  • html-to-epub3: No chunking. The script accepts multiple input documents.
  • zedai-to-html: Chunks up the HTML with http://www.daisy.org/pipeline/modules/html-utils/html-chunker.xsl (not enabled in the script).
  • zedai-to-epub3: The ZedAI is converted into a single HTML document, which is then chunked up with http://www.daisy.org/pipeline/modules/html-utils/html-chunker.xsl.

This is how html-chunker.xsl currently works:

  • Top-level sections are unwrapped and put in their own chunk. Attributes are moved to the body element.
  • Everything in between two top-level sections is put in its own chunk.
  • Child sections of a top-level "bodymatter" section (with epub:type "bodymatter") are put in their own chunk.
  • Everything in between two child sections of a "bodymatter" section is put in its own chunk.
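
The four rules above can be sketched in Python roughly as follows. This is only an illustration, not the actual XSLT: it assumes un-namespaced HTML element names, ignores loose text nodes, and leaves the body of chunks that come out of a "bodymatter" section without attributes (the description above does not say what happens to them). The helper names `runs`, `new_body` and `chunk_body` are made up for the example.

```python
import xml.etree.ElementTree as ET

EPUB_NS = "http://www.idpf.org/2007/ops"

def is_bodymatter(section):
    # epub:type holds a space-separated list of values
    return "bodymatter" in section.get("{%s}type" % EPUB_NS, "").split()

def runs(parent):
    """Yield each child section on its own, and each run of other children as a list."""
    pending = []
    for child in parent:
        if child.tag == "section":
            if pending:
                yield pending
                pending = []
            yield child
        else:
            pending.append(child)
    if pending:
        yield pending

def new_body(children, attrib=None):
    # Every chunk is represented by a fresh <body> element
    body = ET.Element("body", dict(attrib or {}))
    body.extend(children)
    return body

def chunk_body(body):
    chunks = []
    for item in runs(body):
        if isinstance(item, list):
            # content in between two top-level sections
            chunks.append(new_body(item))
        elif is_bodymatter(item):
            # child sections, and the content in between them, get their own chunk
            for sub in runs(item):
                chunks.append(new_body(sub if isinstance(sub, list) else [sub]))
        else:
            # unwrap the top-level section and move its attributes to the body
            chunks.append(new_body(list(item), item.attrib))
    return chunks

doc = ET.fromstring(
    '<body xmlns:epub="http://www.idpf.org/2007/ops">'
    '<section id="cover"><p>Cover</p></section>'
    '<p>Loose paragraph</p>'
    '<section epub:type="bodymatter" id="main">'
    '<section id="ch1"><h1>One</h1></section>'
    '<p>Interlude</p>'
    '<section id="ch2"><h1>Two</h1></section>'
    '</section>'
    '</body>'
)
chunks = chunk_body(doc)
```

For the sample input this yields five chunks: the unwrapped cover, the loose paragraph, chapter one, the interlude, and chapter two.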

@bertfrees
Member Author

We can probably just enhance the html-chunker step and use it in some more places (e.g. html-to-epub3 and epub3-to-epub3). We should start by wrapping the XSLT in an XProc step. If it's too complicated to implement the improved chunking in XSLT only, we can consider adding some Java.

@bertfrees
Member Author

bertfrees commented Apr 10, 2018

I think it depends on the exact requirements whether we want to keep the main part of the implementation in XSLT, and use Java only to do some size calculations, or whether we want to implement the splitting algorithm in Java and possibly use XSLT to do the actual chunking.

Volume breaking in braille needed to be very advanced and configurable and therefore had to be implemented in Java. HTML chunking probably doesn't need to be that configurable (only a maximum size in kB), but the chunking algorithm might still not be trivial if we need to weigh several variables against each other: preferred break points, maximum size, evenness of chunk sizes, etc.
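
To give an idea of what weighing these variables could look like, here is a small dynamic-programming sketch in Python. It is one possible formulation, not a proposal for the actual algorithm: the cost function (squared unused space per chunk plus a penalty per break) and the names `choose_breaks`, `sizes` and `penalties` are assumptions made for the example.

```python
def choose_breaks(sizes, penalties, max_size):
    """Pick break points for a sequence of blocks.

    sizes[i]     -- size of block i (e.g. in kB)
    penalties[i] -- cost of breaking after block i (0 for a preferred break point)
    Returns the indices of the blocks after which a break is placed.
    """
    n = len(sizes)
    best = [float("inf")] * (n + 1)  # best[j] = minimal cost to chunk blocks[0:j]
    start = [0] * (n + 1)            # start[j] = first block of the last chunk in that solution
    best[0] = 0.0
    for end in range(1, n + 1):
        total = 0
        for begin in range(end - 1, -1, -1):  # candidate chunk = blocks[begin:end]
            total += sizes[begin]
            if total > max_size and begin < end - 1:
                break  # too big; a single oversized block is still allowed on its own
            slack = max(max_size - total, 0) / max_size
            cost = best[begin] + slack ** 2  # squared slack favours evenly filled chunks
            if begin > 0:
                cost += penalties[begin - 1]  # cost of the break just before this chunk
            if cost < best[end]:
                best[end] = cost
                start[end] = begin
    # walk back through the chosen chunk starts to recover the break points
    breaks = []
    end = n
    while start[end] > 0:
        end = start[end]
        breaks.append(end - 1)
    return breaks[::-1]

even = choose_breaks([30, 30, 30, 30], [0.0, 0.0, 0.0], max_size=100)
steered = choose_breaks([30, 30, 30, 30], [0.0, 1.0, 0.1], max_size=100)
```

With no penalties the four blocks are split in the middle (`even` is `[1]`). Making the middle break expensive moves the break to the cheapest alternative (`steered` is `[0]`), at the price of less even chunks.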

An important question is how aware users will be of the chunking. Will users even notice if a section is split at a random place?

@bertfrees
Member Author

I suggest we create a generic px:chunk step that takes a document, a stylesheet URL and some other options, and returns a sequence of documents. The stylesheet is specific to the input format and should contain a set of matchers that specify the possible break points. The benefit of this added complexity is that we can start out with a simple XSLT implementation of the step and easily move to Java if needed, while still keeping some of the "flexibility" of XSLT through the stylesheet. In theory we could even support CSS, similar to how we do it for volume breaking in braille.

px:chunk-html could then be implemented as a px:chunk call with an HTML stylesheet, followed by a cleanup step that does some wrapping and unwrapping of elements.
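
As a rough illustration of the idea (in Python rather than XProc/XSLT), the generic step could behave like the sketch below: it knows nothing about HTML, takes a predicate that plays the role of the matchers in the stylesheet, and returns a sequence of trees that each keep the ancestor structure of the input. The function names and the exact semantics are assumptions for the example; text that sits directly inside an element that gets split is dropped here for brevity.

```python
import copy
import xml.etree.ElementTree as ET

def contains_chunk(el, is_chunk):
    # True if el itself or any of its descendants is matched by the predicate
    return any(is_chunk(d) for d in el.iter())

def chunk(el, is_chunk):
    """Split el into a list of trees, one per chunk, preserving the ancestor structure."""
    if is_chunk(el) or not contains_chunk(el, is_chunk):
        return [copy.deepcopy(el)]  # nothing to split inside this element
    parts, open_part = [], None
    for child in el:
        isolated = contains_chunk(child, is_chunk)
        for piece in chunk(child, is_chunk):
            if isolated or open_part is None:
                # start a new copy of the parent (attributes only, no children)
                open_part = ET.Element(el.tag, dict(el.attrib))
                parts.append(open_part)
            open_part.append(piece)
            if isolated:
                open_part = None  # content after a chunk starts a new part
    return parts

html = ET.fromstring(
    "<html><body>"
    "<p>intro</p>"
    "<section id='a'><p>A</p></section>"
    "<section id='b'><p>B</p></section>"
    "<p>outro</p>"
    "</body></html>"
)
docs = chunk(html, lambda el: el.tag == "section")
```

Here `docs` contains four html/body trees: the intro, section a, section b and the outro. A format-specific cleanup step would then do the wrapping and unwrapping.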

@rdeltour @josteinaj thoughts?

@bertfrees
Member Author

@josteinaj Any specific requirements from NLB?

@josteinaj
Member

josteinaj commented Apr 11, 2018

@bertfrees I'll discuss it a bit internally.

We're working with single-HTML documents as our master format now, and they have (or at least will have) this structure:

<html>
    <head>...</head>
    <body>
        <section><!-- chapter 1 --></section>
        <section><!-- chapter 2 --></section>
        <section><!-- chapter 3 --></section>
    </body>
</html>

My initial thoughts are that each of these top-level sections should become a separate HTML file. At least, that was our intention when using this structure. We might also be interested in splitting on number of bytes or similar, to prevent files that are too big from occurring, but it should be possible to disable such behavior as well and split only on top-level section elements, so that we get predictable output.
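
A minimal Python sketch of that behavior, assuming un-namespaced element names and a hypothetical `max_bytes` option where `None` disables size-based splitting. Each output file is represented as a plain list of elements; re-wrapping the pieces of a split section and carrying over its id is left out.

```python
import xml.etree.ElementTree as ET

def byte_size(el):
    # size of the serialized element in UTF-8 bytes
    return len(ET.tostring(el, encoding="utf-8"))

def split_sections(body, max_bytes=None):
    """Return one list of elements per output file."""
    files = []
    for section in body.findall("section"):
        if max_bytes is None or byte_size(section) <= max_bytes:
            files.append([section])  # predictable case: one file per top-level section
            continue
        # section is too big: fill files greedily with its children
        current, used = [], 0
        for child in section:
            size = byte_size(child)
            if current and used + size > max_bytes:
                files.append(current)
                current, used = [], 0
            current.append(child)
            used += size
        if current:
            files.append(current)
    return files

body = ET.fromstring(
    '<body>'
    '<section id="c1"><p>short</p></section>'
    '<section id="c2"><p>xxxxxxxxxx</p><p>xxxxxxxxxx</p><p>xxxxxxxxxx</p><p>xxxxxxxxxx</p></section>'
    '<section id="c3"><p>short</p></section>'
    '</body>'
)
per_section = split_sections(body)
size_limited = split_sections(body, max_bytes=50)
```

Without a limit the sample gives three files, one per section. With `max_bytes=50` the long second section is split in two, giving four files.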

It would probably make sense to split the files in a way that preserves the HTML5 structural outline. Not sure what implications that would have, if any.

Our structure (with only section elements allowed as top-level elements) is of course not a generic structure, so for a generic splitter, you'd need additional logic (probably by performing the HTML5 outlining algorithm?). In the future we'll try to force EPUBs "from the wild" to conform to our grammar, and maybe this generic chunking mechanism could help us with wrapping generic content into a series of section elements; we'll see.

id attributes should be preserved when splitting.

We might need to split a SMIL file alongside the HTML file in the future, and preferably also the MP3 files. Splitting SMIL and MP3 files would probably be separate steps though, and should be relatively straightforward when the IDs are preserved.
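
To illustrate why preserved IDs make the follow-up steps easy: once the chunks exist, an index from id to chunk file is enough to repair fragment-only references, and the same kind of lookup could serve a SMIL-splitting step. A small Python sketch, with hypothetical file names and helper names:

```python
import xml.etree.ElementTree as ET

def index_ids(chunks):
    """Map every id to the name of the chunk file that now contains it."""
    index = {}
    for name, root in chunks:
        for el in root.iter():
            if el.get("id"):
                index[el.get("id")] = name
    return index

def rewrite_links(chunks, index):
    """Point links of the form #id at the chunk file that holds the target."""
    for name, root in chunks:
        for a in root.iter("a"):
            href = a.get("href", "")
            target = index.get(href[1:]) if href.startswith("#") else None
            if target and target != name:
                a.set("href", target + href)

chunks = [
    ("chunk-1.xhtml", ET.fromstring('<body id="c1"><a href="#note1">1</a><a href="#c1">top</a></body>')),
    ("chunk-2.xhtml", ET.fromstring('<body id="c2"><p id="note1">A note.</p></body>')),
]
rewrite_links(chunks, index_ids(chunks))
```

After the call, the first link points to `chunk-2.xhtml#note1`, while the link to `#c1` stays local because its target is in the same file.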

bertfrees added a commit to daisy/pipeline that referenced this issue Apr 11, 2018
The generic part is just a dumb, HTML-unaware chunker that chunks
based on an (HTML-specific) style sheet and preserves the structure of
the input. The generic step is followed by a "finalize" step that
cleans up the structure of the resulting chunks.

See daisy/pipeline-scripts#123
@bertfrees
Member Author

Thanks. I've pushed a first version so you get an idea of how it works.

@josteinaj
Member

Neat. So we would just provide our own html-chunker-break-points.xsl with our own f:is-chunk?

@bertfrees
Member Author

That's right.

bertfrees added a commit to daisy/pipeline that referenced this issue Apr 17, 2018
bertfrees added a commit to daisy/pipeline-modules-common that referenced this issue May 28, 2018
@bertfrees bertfrees self-assigned this Jun 12, 2018
bertfrees added a commit to daisy/pipeline-modules-common that referenced this issue Oct 1, 2018
@bertfrees bertfrees added this to the v1.12.0 milestone Oct 1, 2018
@bertfrees
Member Author

See PR: #149

bertfrees added a commit to daisy/pipeline-modules-common that referenced this issue Nov 19, 2018
bertfrees added a commit to daisy/pipeline-modules-common that referenced this issue Nov 19, 2018