Bug in "Enhancing web page" (jsdom) #177

Open
WolfgangDpunkt opened this issue Jul 17, 2024 · 4 comments

@WolfgangDpunkt

Environment

  • Operating System: Alpine Linux v3.20
  • node --version: v22.3.0
  • npm --version: 10.8.1
  • yarn --version, if using Yarn: 1.22.22
  • percollate --version: 4.2.1

Description

When I try to convert this web article into an epub, percollate breaks at the point "Enhancing web page" with the error message

@click did not match the Name production
[DOMException [InvalidCharacterError]: "@click" did not match the Name production]
Ignoring item. 

This looks to me like a bug in jsdom; similar issues have been reported.
Since the vast majority of websites work without any problems and the error occurs very rarely, it would be helpful if percollate could, in such cases, run the “Enhancing web page” step in some kind of forced mode that ignores or skips the error.

Here is the full debug log:

percollate epub --individual --css 'html { font-size: 14pt; line-height:1.5}  pre{display:inline ;font-family:serif;white-space:pre-line ;margin:auto;}' --output /output-Epubs/ 'https://www.spiegel.de/netzwelt/netzpolitik/donald-trump-freude-in-sozialen-medien-ueber-den-mordversuch-ist-antidemokratisch-kolumne-a-c6c71c99-bf27-4d7f-bb28-925204411a6c' --debug

{
  command: 'epub',
  operands: [
    'https://www.spiegel.de/netzwelt/netzpolitik/donald-trump-freude-in-sozialen-medien-ueber-den-mordversuch-ist-antidemokratisch-kolumne-a-c6c71c99-bf27-4d7f-bb28-925204411a6c'
  ],
  opts: {
    individual: true,
    css: 'html { font-size: 14pt; line-height:1.5}  pre{display:inline ;font-family:serif;white-space:pre-line ;margin:auto;}',
    output: '/output-Perco-Epubs/',
    debug: true
  }
}
Fetching: https://www.spiegel.de/netzwelt/netzpolitik/donald-trump-freude-in-sozialen-medien-ueber-den-mordversuch-ist-antidemokratisch-kolumne-a-c6c71c99-bf27-4d7f-bb28-925204411a6c
(node:189) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
 ✓
Enhancing web page: https://www.spiegel.de/netzwelt/netzpolitik/donald-trump-freude-in-sozialen-medien-ueber-den-mordversuch-ist-antidemokratisch-kolumne-a-c6c71c99-bf27-4d7f-bb28-925204411a6c
https://www.spiegel.de/netzwelt/netzpolitik/donald-trump-freude-in-sozialen-medien-ueber-den-mordversuch-ist-antidemokratisch-kolumne-a-c6c71c99-bf27-4d7f-bb28-925204411a6c: "@click" did not match the Name production
[DOMException [InvalidCharacterError]: "@click" did not match the Name production]
Ignoring item

Thank you very much for your wonderful tool, which I have been using for years with great benefit.

@danburzo
Owner

Thank you for the kind words!

The error you’re seeing actually happens at the Readability step, when the library enumerates the HTML attributes. The spiegel.de website seems to use invalid @click attributes on elements, whose validation failure triggers an error in jsdom when the element’s attributes are enumerated. Unfortunately this crashes Readability so it’s not something that can be ignored.
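To make the failure mode concrete, here is a minimal, self-contained sketch. It uses neither jsdom nor Readability: `FakeElement`, `isValidName` and `copyAttributes` are hypothetical stand-ins, and the regular expression is only an ASCII approximation of the XML Name production. It shows how a single invalid name such as @click can abort a loop that copies attributes one by one.

```javascript
// ASCII-only approximation of the XML "Name" production (an assumption;
// the real production also admits many non-ASCII characters).
const NAME_APPROX = /^[A-Za-z_:][A-Za-z0-9_:.\-]*$/;

function isValidName(name) {
  return NAME_APPROX.test(name);
}

// Hypothetical stand-in for a DOM element whose setAttribute validates names.
class FakeElement {
  constructor() {
    this.attrs = new Map();
  }

  setAttribute(name, value) {
    if (!isValidName(name)) {
      throw new Error(`"${name}" did not match the Name production`);
    }
    this.attrs.set(name, value);
  }
}

// Copying attributes one by one stops at the first invalid name,
// so the whole operation fails even if every other attribute is fine.
function copyAttributes(pairs, target) {
  for (const [name, value] of pairs) {
    target.setAttribute(name, value);
  }
}
```

Calling `copyAttributes([['class', 'a'], ['@click', 'go()']], new FakeElement())` throws on the second pair, which is roughly the situation described above.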

It doesn’t seem entirely safe to disable this validation in the general case, but maybe we could do so with an --unsafe flag for the occasional page misusing HTML attributes?
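As a rough illustration of the opt-in idea (not percollate's or jsdom's actual code), the sketch below keeps name validation strict by default and only skips it when the caller explicitly asks for that. `makeNameCheck` and its `unsafe` option are hypothetical names, and the regular expression is an ASCII-only approximation of the Name production.

```javascript
// Hypothetical helper: returns a name check that is strict by default
// and becomes a no-op only when the caller opts in with { unsafe: true }.
function makeNameCheck({ unsafe = false } = {}) {
  // ASCII-only approximation of the XML Name production (an assumption).
  const approxName = /^[A-Za-z_:][A-Za-z0-9_:.\-]*$/;

  return function checkName(name) {
    if (unsafe) {
      return; // validation explicitly disabled by the caller
    }
    if (!approxName.test(name)) {
      throw new Error(`"${name}" did not match the Name production`);
    }
  };
}
```

With `makeNameCheck()` the name @click is rejected; with `makeNameCheck({ unsafe: true })` it passes through unchecked, which is why such a switch is better left off unless a specific page needs it.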

@danburzo
Owner

Released the --unsafe flag as part of a new percollate release.

@WolfgangDpunkt
Author

Thank you very much for the prompt and immediately successful solution. This is exactly what I was hoping for.
The problem I mentioned has now been solved.
Very nice.

@danielnixon

Looks like this is fixed in Readability itself in mozilla/readability#918 (merged recently), so when mozilla/readability#941 (next release) lands, you may be able to remove --unsafe entirely.
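For illustration only, here is one way a library could tolerate such attributes without disabling validation globally: catch the failure per attribute and carry on. This sketch is not taken from the Readability change referenced above; `makeTarget` and `copyAttributesTolerant` are hypothetical names, and the name check is an ASCII-only approximation.

```javascript
// Hypothetical target whose setAttribute rejects names outside an
// ASCII-only approximation of the XML Name production.
function makeTarget() {
  const attrs = {};
  return {
    attrs,
    setAttribute(name, value) {
      if (!/^[A-Za-z_:][A-Za-z0-9_:.\-]*$/.test(name)) {
        throw new Error(`"${name}" did not match the Name production`);
      }
      attrs[name] = value;
    },
  };
}

// Copy every attribute that the target accepts and return the names
// that were rejected, instead of letting one bad name abort the loop.
function copyAttributesTolerant(pairs, target) {
  const skipped = [];
  for (const [name, value] of pairs) {
    try {
      target.setAttribute(name, value);
    } catch (err) {
      skipped.push(name);
    }
  }
  return skipped;
}
```

The design trade-off is that invalid attributes are silently dropped from the copy, which is usually acceptable for article extraction since attributes like @click carry no readable content.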

See https://github.com/mozilla/readability/blob/main/CHANGELOG.md#unreleased

@danburzo reopened this Jan 5, 2025