Improve EPUB 3 Content Documents chunking #123

bertfrees · 2018-04-10T10:10:06Z

@rdeltour said (March 13, 2018):

Currently the *-to-epub3 scripts chunk content pretty naïvely (based on top-level sections). We should try to improve that:

inspect the structure more finely, to chunk based on lower-level sections
possibly chunk based on content size (see "split large text files by size", pipeline#351)

bertfrees · 2018-04-10T11:17:51Z

This is how the chunking is currently implemented in:

daisy202-to-epub3: No chunking. The EPUB contains as many HTML documents as there are in the DAISY 2.02.
daisy3-to-epub3: Uses dtbook-to-zedai and zedai-to-html.
dtbook-to-epub3: Uses dtbook-to-zedai and zedai-to-epub3.
dtbook-to-html: Uses dtbook-to-zedai and zedai-to-html.
epub3-to-epub3: No chunking.
html-to-epub3: No chunking. The script accepts multiple input documents.
zedai-to-html: Chunks up the HTML with http://www.daisy.org/pipeline/modules/html-utils/html-chunker.xsl (not enabled in the script).
zedai-to-epub3: The ZedAI is converted into a single HTML document, which is then chunked up with http://www.daisy.org/pipeline/modules/html-utils/html-chunker.xsl.

This is how html-chunker.xsl currently works:

Top-level sections are unwrapped and put in their own chunk. Attributes are moved to the body element.
Everything in between two top-level sections is put in its own chunk.
Child sections of a top-level "bodymatter" section (with epub:type "bodymatter") are put in their own chunk.
Everything in between two child sections of a "bodymatter" section is put in its own chunk.
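
As a rough illustration of the four rules above, here is a minimal Python sketch. It is not the XSLT itself. The function names (`group_runs`, `make_body`, `chunk_html`) are made up for this example, elements are assumed to be in no namespace, and two details the description leaves open are filled in by assumption: child sections of a "bodymatter" section are unwrapped the same way as top-level sections, and the attributes of the "bodymatter" wrapper itself are dropped. Text directly inside `body` is ignored.

```python
import copy
import xml.etree.ElementTree as ET

# Expanded name of the epub:type attribute as ElementTree exposes it.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def is_section(el):
    return el.tag == "section"


def group_runs(children, is_break):
    """Put every breaking element in a group of its own and collect
    the runs of non-breaking siblings in between into shared groups."""
    groups, run = [], []
    for child in children:
        if is_break(child):
            if run:
                groups.append(run)
                run = []
            groups.append([child])
        else:
            run.append(child)
    if run:
        groups.append(run)
    return groups


def make_body(attrib, nodes):
    """Build a new body element holding copies of the given nodes."""
    body = ET.Element("body", dict(attrib))
    body.extend([copy.deepcopy(node) for node in nodes])
    return body


def chunk_html(parent, split_bodymatter=True):
    """Return a list of body elements, one per chunk."""
    chunks = []
    for group in group_runs(list(parent), is_section):
        first = group[0]
        if not is_section(first):
            # Content in between two sections: its own chunk.
            chunks.append(make_body({}, group))
        elif split_bodymatter and "bodymatter" in first.get(EPUB_TYPE, "").split():
            # "bodymatter" section: chunk its children instead (one level only).
            chunks.extend(chunk_html(first, split_bodymatter=False))
        else:
            # Section: unwrap it and move its attributes to the body.
            chunks.append(make_body(first.attrib, list(first)))
    return chunks
```

Calling `chunk_html` on the `body` element of a parsed document returns the chunks in document order; wrapping each one in a full HTML document is left out here.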

bertfrees · 2018-04-10T11:22:56Z

We can probably just enhance the html-chunker step and use it in some more places (e.g. html-to-epub3 and epub3-to-epub3). We should start by wrapping the XSLT in an XProc step. If it's too complicated to implement the improved chunking in XSLT only, we can consider adding some Java.

bertfrees · 2018-04-10T11:47:46Z

I think it depends on the exact requirements whether we want to keep the main part of the implementation in XSLT, and use Java only to do some size calculations, or whether we want to implement the splitting algorithm in Java and possibly use XSLT to do the actual chunking.

Volume breaking in braille needed to be very advanced and configurable and therefore had to be implemented in Java. HTML chunking probably doesn't need to be that configurable (only a maximum size in kB), but the chunking algorithm might still not be trivial if we need to weigh several variables against each other: preferred break points, maximum size, evenness of chunk sizes, etc.
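
To make the trade-off concrete, here is a minimal sketch of the simplest possible strategy: greedily packing consecutive blocks up to a maximum size. It is only an illustration, not a proposal for the actual algorithm. The function name `pack_by_size` is hypothetical, the sizes are assumed to be the byte counts of the content between consecutive break points, and evenness of chunk sizes and preferred break points are deliberately ignored.

```python
def pack_by_size(sizes, max_size):
    """Greedily group consecutive block indices so that the summed size of
    each group stays within max_size.

    A single block larger than max_size still ends up in a group of its own,
    because this sketch can only break between blocks.
    """
    groups, current, total = [], [], 0
    for index, size in enumerate(sizes):
        if current and total + size > max_size:
            # Adding this block would overflow: close the current group.
            groups.append(current)
            current, total = [], 0
        current.append(index)
        total += size
    if current:
        groups.append(current)
    return groups
```

With block sizes `[40, 30, 50, 10, 90, 20]` and a maximum of 100, this gives the groups `[[0, 1], [2, 3], [4], [5]]`. The last group is much smaller than the others, which is exactly the kind of unevenness a smarter algorithm would have to balance against the other variables.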

An important question is how aware users will be of the chunking. Will users even notice if a section is split at a random place?

bertfrees · 2018-04-10T13:37:16Z

I suggest we create a generic px:chunk step that takes a document, a stylesheet URL and some other options and returns a sequence of documents. The stylesheet is specific to the input format and should contain a set of matchers that specify the break point opportunities. The benefit of this added complexity is that we can start out with a simple XSLT implementation of the step and easily move to Java if needed, while still keeping some of the "flexibility" of XSLT through the stylesheet. In theory we could even support CSS, similar to how we do it for volume breaking in braille.

px:chunk-html could then be implemented as a px:chunk call with an HTML stylesheet, followed by a cleanup step that does some wrapping and unwrapping of elements.
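
The idea of a format-agnostic chunker driven by matchers can be sketched in Python as follows. This is an assumption-laden illustration, not the px:chunk step: the function name `chunk` is made up, the "stylesheet" is reduced to a plain predicate `is_break_point`, and a break is always placed before a matching element. Each resulting document repeats the ancestors of its content, so the structure of the input is preserved and any wrapping or unwrapping is left to a later cleanup step.

```python
import xml.etree.ElementTree as ET


def chunk(root, is_break_point):
    """Split a tree before every element matched by is_break_point.

    Returns a list of new root elements, one per chunk.
    """
    start, end = {}, {}   # element -> first / last chunk number it spans
    counter = 0

    def visit(el):
        nonlocal counter
        if el is not root and is_break_point(el):
            counter += 1
        start[el] = counter
        for child in el:
            visit(child)
        end[el] = counter

    visit(root)

    def copy_for(el, k):
        # Shallow copy; ancestors that span several chunks (and their
        # attributes, including ids) are therefore repeated in each of them.
        new = ET.Element(el.tag, dict(el.attrib))
        if start[el] == k:
            new.text = el.text
        for child in el:
            if start[child] <= k <= end[child]:
                sub = copy_for(child, k)
                if end[child] == k:
                    sub.tail = child.tail
                new.append(sub)
        return new

    return [copy_for(root, k) for k in range(counter + 1)]
```

For example, `chunk(body, lambda el: el.tag == "section")` yields one document for the content before the first section and one per section. Swapping in a different predicate is the equivalent of supplying a different stylesheet.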

@rdeltour @josteinaj thoughts?

bertfrees · 2018-04-11T08:32:45Z

@josteinaj Any specific requirements from NLB?

josteinaj · 2018-04-11T09:09:01Z

@bertfrees I'll discuss it a bit internally.

We're working with single-HTML documents as our master format now, and they have (or at least will have) this structure:

<html>
    <head>...</head>
    <body>
        <section><!-- chapter 1 --></section>
        <section><!-- chapter 2 --></section>
        <section><!-- chapter 3 --></section>
    </body>
</html>

My initial thoughts are that each of these top-level sections should become a separate HTML file. At least, that was our intention when using this structure. We might be interested in splitting on number of bytes or similar, to prevent files from becoming too big, but it should be possible to disable such behavior as well, and only split on top-level section elements, so that we get a predictable output.

It would probably make sense to split the files in a way that preserves the HTML5 structural outline. Not sure what implications that would have, if any.

Our structure (with only section elements allowed as top-level elements) is of course not a generic structure, so for a generic splitter, you'd need additional logic (probably by performing the HTML5 outlining algorithm?). In the future we'll try to force EPUBs "from the wild" to conform to our grammar, and maybe this generic chunking mechanism could help us with wrapping generic content into a series of section elements; we'll see.

id attributes should be preserved when splitting.

We might need to split a SMIL file alongside the HTML file in the future. And preferably also the MP3 files. Splitting SMIL and MP3 files would probably be separate steps though, and should be relatively straightforward when the IDs are preserved.
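
To illustrate why preserved IDs make follow-up steps simpler, here is a small sketch of how references of the form `file#id` could be pointed at the right chunk afterwards. Everything in it is hypothetical: the function names, the `chunk-N.xhtml` naming scheme, and the assumption that each chunk is available as a parsed element tree.

```python
import xml.etree.ElementTree as ET


def index_ids(chunks):
    """Map every id found in the chunks to the (hypothetical) file name
    of the chunk that contains it. The first occurrence wins."""
    targets = {}
    for number, chunk in enumerate(chunks, start=1):
        name = f"chunk-{number}.xhtml"
        for el in chunk.iter():
            if "id" in el.attrib:
                targets.setdefault(el.attrib["id"], name)
    return targets


def retarget(reference, targets):
    """Rewrite 'file#id' so that it points at the chunk now holding the id.
    References without a known fragment are returned unchanged."""
    _, sep, fragment = reference.partition("#")
    if sep and fragment in targets:
        return f"{targets[fragment]}#{fragment}"
    return reference
```

A SMIL splitter could use the same id-to-chunk map to decide which media overlay each text reference belongs to.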

The generic part is just a dumb, HTML-unaware chunker that chunks based on an (HTML-specific) stylesheet and preserves the structure of the input. The generic step is followed by a "finalize" step that cleans up the structure of the resulting chunks. See daisy/pipeline-scripts#123

bertfrees · 2018-04-11T09:19:14Z

Thanks. I've pushed a first version so you get an idea of how it works.

josteinaj · 2018-04-11T11:55:00Z

Neat. So we would just provide our own html-chunker-break-points.xsl with our own f:is-chunk?

bertfrees · 2018-04-11T12:15:00Z

That's right.

bertfrees · 2018-10-02T13:56:30Z

See PR: #149

rdeltour mentioned this issue Apr 10, 2018

Improve EPUB 3 Content Documents chunking daisy/pipeline-tasks#123

Closed

bertfrees added the L label Apr 10, 2018

bertfrees mentioned this issue May 22, 2018

Basic support for combining marks nlbdev/pipeline#21

Closed

bertfrees added eili 2 - In Progress accepted labels Jun 5, 2018

bertfrees self-assigned this Jun 12, 2018

This was referenced Oct 1, 2018

Make html-chunker into a generic step and implement in Java daisy/pipeline-modules-common#113

Merged

Improve EPUB 3 Content Documents chunking #149

Merged

bertfrees added this to the v1.12.0 milestone Oct 1, 2018

bertfrees added 4 - Done and removed 2 - In Progress labels Oct 2, 2018

bertfrees closed this as completed in #149 Nov 20, 2018

bertfrees mentioned this issue Feb 12, 2020

Add "chunk-size" option to more scripts daisy/pipeline-modules#15

Open
