Improve EPUB 3 Content Documents chunking #123
Comments
This is how the chunking is currently implemented in:
This is how html-chunker.xsl currently works:
We can probably just enhance the
I think it depends on the exact requirements whether we want to keep the main part of the implementation in XSLT and use Java only for some size calculations, or whether we want to implement the splitting algorithm in Java and possibly use XSLT to do the actual chunking. Volume breaking in braille needed to be very advanced and configurable and therefore had to be implemented in Java. HTML chunking probably doesn't need to be that configurable (only a maximum size in kB), but the chunking algorithm might still not be trivial if we need to weigh several variables against each other: preferred break points, maximum size, evenness of chunk sizes, etc. An important question is how aware users will be of the chunking. Will users even notice if a section is split at a random place?
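As a rough illustration of how such a trade-off could be scored, here is a hypothetical Java sketch. This is not the actual Pipeline code; the class name, the weighting scheme, and the size constant are all invented. It treats the maximum size as a hard limit and otherwise balances break-point preference against evenness of chunk sizes:

```java
// Hypothetical sketch, not the Pipeline implementation: score a candidate
// split by combining the variables mentioned above. Lower cost = better.
public class ChunkCost {

    static final int MAX_SIZE_KB = 300; // assumed configurable maximum

    // 'sizesKb' are the chunk sizes a candidate split would produce;
    // 'breakPenalty' is 0 for a preferred break point (e.g. a section
    // boundary) and higher for an arbitrary split inside a section.
    static double cost(int[] sizesKb, double breakPenalty) {
        double mean = 0;
        for (int s : sizesKb)
            mean += s;
        mean /= sizesKb.length;
        double unevenness = 0;
        for (int s : sizesKb) {
            if (s > MAX_SIZE_KB)
                return Double.POSITIVE_INFINITY; // hard size limit
            unevenness += Math.abs(s - mean);
        }
        return breakPenalty + unevenness / sizesKb.length;
    }

    public static void main(String[] args) {
        // A split at a section boundary into even chunks beats an
        // arbitrary split into very uneven ones.
        double atSection = cost(new int[]{150, 150}, 0);
        double arbitrary = cost(new int[]{40, 260}, 5);
        System.out.println(atSection < arbitrary); // prints "true"
    }
}
```

How the penalty and the unevenness term should actually be weighted against each other is exactly the open design question raised above.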
I suggest we create a generic
@rdeltour @josteinaj thoughts?
@josteinaj Any specific requirements from NLB?
@bertfrees I'll discuss it a bit internally. We're working with single-HTML documents as our master format now, and they have (or at least will have) this structure:

```html
<html>
  <head>...</head>
  <body>
    <section><!-- chapter 1 --></section>
    <section><!-- chapter 2 --></section>
    <section><!-- chapter 3 --></section>
  </body>
</html>
```

My initial thoughts are that each of these top-level sections should become a separate HTML file. At least, that was our intention when using this structure. We would maybe be interested in splitting on number of bytes or similar, to prevent too big files from occurring, but it should be possible to disable such behavior as well, and only split on top-level section elements, so that we get a predictable output. It would probably make sense to split the files in a way that preserves the HTML5 structural outline. Not sure what implications that would have, if any. Our structure (with only section elements allowed as top-level elements) is of course not a generic structure, so for a generic splitter you'd need additional logic (probably by performing the HTML5 outlining algorithm?). In the future we'll try to force EPUBs "from the wild" to conform to our grammar, and maybe this generic chunking mechanism could help us with wrapping generic content into a series of section elements; we'll see.
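For the simple case described here (splitting only on top-level section elements, with byte-limit splitting disabled), a minimal Java sketch using the JDK's built-in DOM API might look like the following. This is an illustration only, not the Pipeline implementation; the class and method names are invented:

```java
// Hypothetical sketch: split a document with the structure above into one
// document per top-level <section>, copying <head> into each chunk.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SectionSplitter {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // One output document per top-level <section> in <body>.
    static List<Document> split(Document master) {
        Node head = master.getElementsByTagName("head").item(0);
        Element body = (Element) master.getElementsByTagName("body").item(0);
        List<Document> chunks = new ArrayList<>();
        NodeList children = body.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && "section".equals(n.getNodeName())) {
                Document chunk = newDocument();
                Element html = chunk.createElement("html");
                chunk.appendChild(html);
                html.appendChild(chunk.importNode(head, true)); // shared head
                Element newBody = chunk.createElement("body");
                html.appendChild(newBody);
                newBody.appendChild(chunk.importNode(n, true)); // one section
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        Document master = parse(
            "<html><head><title>t</title></head><body>"
            + "<section>chapter 1</section>"
            + "<section>chapter 2</section>"
            + "<section>chapter 3</section>"
            + "</body></html>");
        System.out.println(split(master).size()); // prints "3"
    }
}
```

Because each chunk contains exactly one complete top-level section, the output is predictable and the structural outline of each section is preserved; the harder, generic case (arbitrary content, byte limits) is what the rest of this discussion is about.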
We might need to split a SMIL file alongside the HTML file in the future, and preferably also the MP3 files. Splitting SMIL and MP3 files would probably be separate steps though, and should be relatively straightforward when the IDs are preserved.
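The "straightforward when the IDs are preserved" part could be sketched roughly as follows. This is a hypothetical illustration with invented names, not a real Pipeline step: given the set of IDs that ended up in each HTML chunk, each SMIL par (represented here just by the fragment ID it points to) is assigned to the chunk that contains that ID.

```java
// Hypothetical sketch: partition SMIL pars over HTML chunks by the
// fragment IDs they reference. Real SMIL parsing is omitted.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SmilSplitter {

    // chunkIds.get(i) = IDs present in HTML chunk i;
    // parTargets = fragment IDs referenced by the pars, in document order.
    static List<List<String>> split(List<Set<String>> chunkIds,
                                    List<String> parTargets) {
        List<List<String>> smilChunks = new ArrayList<>();
        for (int i = 0; i < chunkIds.size(); i++)
            smilChunks.add(new ArrayList<>());
        for (String target : parTargets)
            for (int i = 0; i < chunkIds.size(); i++)
                if (chunkIds.get(i).contains(target)) {
                    smilChunks.get(i).add(target); // par follows its ID
                    break;
                }
        return smilChunks;
    }

    public static void main(String[] args) {
        List<Set<String>> chunkIds =
            List.of(Set.of("ch1", "p1"), Set.of("ch2", "p2"));
        List<String> pars = List.of("p1", "ch2", "p2");
        System.out.println(split(chunkIds, pars)); // prints "[[p1], [ch2, p2]]"
    }
}
```

Splitting the MP3 files would then follow the same partition, cutting at the clip boundaries of the first and last par in each SMIL chunk.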
The generic part is just a dumb, HTML-unaware chunker that chunks based on an (HTML-specific) style sheet and preserves the structure of the input. The generic step is followed by a "finalize" step that cleans up the structure of the resulting chunks. See daisy/pipeline-scripts#123
Thanks. I've pushed a first version so you get an idea of how it works.
Neat. So we would just provide our own
That's right.
See PR: #149
@rdeltour said (March 13, 2018):
Currently the *-to-epub3 scripts chunk content pretty naïvely (based on top-level sections). We should try to improve that: