This repository has been archived by the owner on Mar 13, 2023. It is now read-only.

Article titles can contain ":" #3

Open
Dinoguy1000 opened this issue Apr 6, 2021 · 2 comments

Comments

Dinoguy1000 commented Apr 6, 2021

Currently, analyze_chunk() removes every title that contains a colon (:), on the assumption that such titles are outside mainspace. However, article titles can contain colons, e.g. Batman v Superman: Dawn of Justice or UTC+03:00 (or, on Simple Wikipedia, UTC+08:00 or Avatar: The Last Airbender). Many titles with a colon are redirects to titles without one, but all redirects have already been removed by this point in the function, so that's immaterial.
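To make the problem concrete, here is a minimal sketch of what a blanket colon rule does. It assumes the rule amounts to a plain substring check; `drop_colon_titles` is a hypothetical stand-in, not the repository's actual code, and "Category:Physics" and "Anarchism" are illustrative titles added for contrast.

```python
def drop_colon_titles(titles):
    """Hypothetical stand-in for the colon rule: keep only titles without ':'."""
    return [t for t in titles if ":" not in t]


titles = [
    "Batman v Superman: Dawn of Justice",  # mainspace article, wrongly dropped
    "UTC+03:00",                           # mainspace article, wrongly dropped
    "Category:Physics",                    # non-mainspace page, correctly dropped
    "Anarchism",                           # mainspace article, kept
]

kept = drop_colon_titles(titles)
print(kept)  # ['Anarchism']
```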

Owner

daveshap commented Apr 6, 2021

That's a good point. I will re-run without that rule and see what we get.

@daveshap
Owner

There are a crazy number of non-articles with a colon in the title, so we'll need to go back to the drawing board on how to filter these out while keeping the articles we want.
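One possible direction, sketched here as an assumption rather than as what the project ended up doing: treat a title as non-mainspace only when the text before the first colon is a known namespace name. The `NAMESPACE_PREFIXES` set and the `is_mainspace_title` helper below are hypothetical, and the list is partial. If the dump being parsed is a MediaWiki XML export, the per-page `<ns>` element (0 for mainspace) would likely be a more robust signal than any title-based rule.

```python
# Assumed, partial list of English Wikipedia namespace names (lowercased).
# A MediaWiki XML dump declares its own namespaces in the <siteinfo> header,
# which would be a more reliable source than this hand-written set.
NAMESPACE_PREFIXES = {
    "talk", "user", "user talk", "wikipedia", "wikipedia talk",
    "file", "file talk", "mediawiki", "mediawiki talk",
    "template", "template talk", "help", "help talk",
    "category", "category talk", "portal", "portal talk",
    "draft", "draft talk", "module", "module talk",
}


def is_mainspace_title(title):
    """Return True unless the text before the first colon is a known namespace."""
    prefix, sep, _rest = title.partition(":")
    if not sep:
        # No colon at all, so there is no namespace prefix to check.
        return True
    normalized = prefix.strip().replace("_", " ").lower()
    return normalized not in NAMESPACE_PREFIXES


titles = [
    "Batman v Superman: Dawn of Justice",
    "UTC+03:00",
    "Avatar: The Last Airbender",
    "Category:Physics",
    "Template:Infobox",
]
articles = [t for t in titles if is_mainspace_title(t)]
print(articles)
```

With this check the three article titles survive and the Category and Template pages are dropped. Any namespace missing from the set would still slip through, which is why the set is only a starting point.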
