feat(lists): add support for wildcard lists using a custom Trie #1233
Conversation
Codecov Report

Attention: …

Additional details and impacted files:

@@            Coverage Diff             @@
##              main    #1233      +/-  ##
==========================================
+ Coverage    93.66%   93.71%   +0.04%
==========================================
  Files           70       72       +2
  Lines         5687     5884     +197
==========================================
+ Hits          5327     5514     +187
- Misses         279      286       +7
- Partials        81       84       +3

☔ View full report in Codecov by Sentry.
This looks great, I'll look into it tomorrow... when I'm less drunk 🫣
In my usage it's amazing so far. I use the Hagezi blocklists, where the plain domains are 900k+ entries while the wildcards are only 280k+. I've also noticed that previously, if blocky was stopped and started again, it used to take the same CPU time to process the lists again, which was about 1 minute 30 seconds for me. After this change, if it is stopped and restarted, the processing time is only about 13 seconds. An amazing improvement so far.
Code looks clean and performs well (tests & benchmarks).
Great work. 👍
If there are any unformatted files, CI fails with a confusing "read-only file system" error.
It also means we save a bit of time by not checking every single entry against all caches.
A couple other Trie implementations were tested but they use more memory and are slower. See PR 0xERR0R#1233 for details.
Thanks for taking a look, both of you! I updated the PR with a comment tweak because I realized that the trie is not a full radix: only terminals are compressed. I think it's fine to keep it like this because memory use is acceptable and we avoid the extra complexity and speed decrease a full radix would bring. So I'll call it a feature.

Out of curiosity, I also ran the benchmarks on the bigger versions of the Hagezi lists (Hagezi Pro, Hagezi Pro++, Hagezi Ultimate). The results are pretty similar to OISD. It seems the trie uses a fair amount of extra memory for this test; I'm not sure why, but it still seems acceptable, so I didn't look into it further.

For the lists committed in the repo, I thought of another possibility: add a script to download the files and put the dir it downloads to in …
Maybe we should print a warning message after loading the lists for the first time when a threshold number of regexes is reached? 🤔 Similar to the one you added to the configuration page. I'm fairly sure people pay more attention to the logs than to the documentation when they have performance issues.
Yeah, I was thinking of tackling that separately because I wanted to refactor how …
I just pushed a minimal patch to get a warning in already, and will tackle the rest separately :)
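For illustration, the kind of warning being discussed here could look roughly like the sketch below. This is a hedged example, not the actual patch: the threshold value, function name, and use of the standard library logger are all assumptions.

```go
// Sketch of a post-load warning when a list contains many regex entries.
// The threshold and all names here are illustrative assumptions.
package lists

import "log"

const regexWarnThreshold = 500 // assumed example threshold

func warnIfManyRegexes(listName string, regexCount int) {
	if regexCount > regexWarnThreshold {
		log.Printf(
			"WARNING: list %q contains %d regex entries; regex matching is slow, consider wildcard entries instead",
			listName, regexCount,
		)
	}
}
```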
This reverts commit ef396f56bca1a18afbbc870d4e95ad0f582e5d26.
Realized I missed a …
Looks very good 👍
I did trie benchmarks and a custom trie seemed like the best approach. I recorded the benchmarks in the commit history so we can test others easily in the future, or even just for reference.
See details below.
There are two trie implementations (in two commits; only one is in the code at a time). The first is in feat(lists): add support for wildcard lists using a custom Trie, followed by refactor(trie): reduce memory use by implementing a radix trie.
The radix tree uses less memory but is a little slower to search. I think the tradeoff is worth it because the speed penalty is unlikely to be noticeable outside benchmarks.
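For illustration only, here is a minimal sketch of the general label-trie idea behind wildcard matching. It is not this PR's implementation (it compresses nothing), and all names are assumptions:

```go
// Minimal sketch of a label-based trie for wildcard domain matching.
// Entries are inserted label by label starting from the TLD, and a query
// matches as soon as it reaches a node marked as the end of an entry,
// so "example.com" covers example.com and every subdomain of it.
package wildcard

import "strings"

type node struct {
	children map[string]*node
	terminal bool // an inserted entry ends here
}

func newNode() *node { return &node{children: map[string]*node{}} }

// Trie holds wildcard entries such as "example.com" (i.e. *.example.com).
type Trie struct{ root *node }

func NewTrie() *Trie { return &Trie{root: newNode()} }

// Insert adds a wildcard entry.
func (t *Trie) Insert(domain string) {
	labels := strings.Split(domain, ".")
	cur := t.root
	for i := len(labels) - 1; i >= 0; i-- { // walk from the TLD towards the host
		child, ok := cur.children[labels[i]]
		if !ok {
			child = newNode()
			cur.children[labels[i]] = child
		}
		cur = child
	}
	cur.terminal = true
}

// Contains reports whether the queried name is covered by any inserted entry.
func (t *Trie) Contains(domain string) bool {
	labels := strings.Split(domain, ".")
	cur := t.root
	for i := len(labels) - 1; i >= 0; i-- {
		child, ok := cur.children[labels[i]]
		if !ok {
			return false
		}
		if child.terminal {
			return true // some entry covers this suffix
		}
		cur = child
	}
	return false
}
```

A radix-style variant would additionally merge chains of nodes (in this PR, only the terminal chains) into single nodes to cut per-node overhead, which is where the memory saving comes from.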
While the trie's absolute memory usage is higher than the plain string cache's (i.e. for the same data it uses more memory), in practice a wildcard list needs fewer entries. So the OISD big wildcard list ends up using about the same amount of memory as the OISD big plain one, despite the trie being less compact.
I made the benchmarks reflect this, but there's a switch to make them run with the same data just for comparison.
After optimizing the Trie, it's even better than the plain string cache in a couple of ways:
There are also a couple of minor fixes included:
And a quick way to see coverage locally: build: generate coverage.html when running tests.
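For reference, the standard Go tooling for producing an HTML coverage report looks roughly like this; the exact flags and file names used by the PR's build target may differ:

```sh
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
```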
Benchmarks
The benchmarks use two versions of the OISD big list because that seemed more realistic, as mentioned above.
The string cache benchmark adds all items from the OISD big plain list, while the others (regex and wildcard) add the ones from the wildcard list.
The querying benchmark populates the caches in the same way, but always searches all entries from the plain list.
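To make that setup concrete, a querying benchmark along those lines could look roughly like the sketch below. This is illustrative only: the file names and loader helper are assumptions, and the cache type reuses the illustrative Trie from the earlier sketch rather than the PR's actual code.

```go
// Rough shape of the querying benchmark described above: populate the cache
// with the wildcard entries once, then repeatedly look up every entry from
// the plain list. File names and helpers are assumptions for illustration.
package wildcard

import (
	"os"
	"strings"
	"testing"
)

func loadTestList(b *testing.B, path string) []string {
	b.Helper()
	data, err := os.ReadFile(path)
	if err != nil {
		b.Fatal(err)
	}
	return strings.Fields(string(data))
}

func BenchmarkTrieQuery(b *testing.B) {
	wildcards := loadTestList(b, "helpertest/data/oisd-big-wildcard.txt") // hypothetical file name
	queries := loadTestList(b, "helpertest/data/oisd-big-plain.txt")      // hypothetical file name

	cache := NewTrie() // the illustrative Trie from the sketch above
	for _, entry := range wildcards {
		cache.Insert(entry)
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		for _, q := range queries {
			cache.Contains(q)
		}
	}
}
```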
The third-party tries I tested were:
I committed the lists in helpertest/data. The files are large-ish, but we shouldn't need to update them regularly, and even then they're text, so diffs will limit the extra weight. So I think it's fine to have them in the repo. Alternatively we could use a submodule, but I didn't think it'd be worth the hassle that brings.
Closes #1090