
feat(lists): add support for wildcard lists using a custom Trie #1233

Merged
merged 10 commits into main from feat/wildcard-lists
Nov 17, 2023

Conversation

ThinkChaos
Collaborator

I did trie benchmarks, and a custom trie seemed like the best approach. I recorded the benchmarks in the commit history so we can easily test other implementations in the future, or just keep them for reference.
See details below.

There are two trie implementations (in two commits; only one is in the code at a time): first in feat(lists): add support for wildcard lists using a custom Trie, then in refactor(trie): reduce memory use by implementing a radix trie.
The radix trie uses less memory but is a little slower to search. I think the tradeoff is worth it because the speed penalty is unlikely to be noticeable outside benchmarks.

While the trie's absolute memory usage is higher than the plain string cache's (i.e. for the same data it uses more memory), in practice a wildcard list needs fewer entries. So the OISD big wildcard list ends up using about the same amount of memory as the OISD big plain one, despite the trie being less compact.
I made the benchmarks reflect this, but there's a switch to make them run with the same data for comparison.

After optimizing the trie, it's even better than the plain string cache in a couple of ways:

  • peak memory usage is lower since we don't use separate storage in the factory and the cache (this could also be changed in the plain string implementation)
  • search is much faster
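As a rough illustration of the approach (my own minimal sketch, not blocky's actual trie code), a wildcard trie can store domain labels reversed, from the TLD inward, with a flag marking where a `*.` entry ends; one assumption here is that `*.example.com` also covers the bare `example.com`:

```go
package main

import (
	"fmt"
	"strings"
)

// node is a minimal (non-radix) trie node keyed by domain label,
// storing labels from the TLD inward so "*.example.com" becomes
// the path com -> example with a wildcard flag on the last node.
type node struct {
	children map[string]*node
	wildcard bool // true if a "*.<path so far>" entry ends here
}

func newNode() *node { return &node{children: map[string]*node{}} }

// Insert adds a wildcard entry like "*.example.com".
func (n *node) Insert(entry string) {
	entry = strings.TrimPrefix(entry, "*.")
	labels := strings.Split(entry, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.wildcard = true
}

// Match reports whether host is covered by any inserted wildcard.
// Note: in this sketch "*.example.com" matches "example.com" itself too.
func (n *node) Match(host string) bool {
	labels := strings.Split(host, ".")
	cur := n
	for i := len(labels) - 1; i >= 0; i-- {
		if cur.wildcard {
			return true
		}
		next, ok := cur.children[labels[i]]
		if !ok {
			return false
		}
		cur = next
	}
	return cur.wildcard
}

func main() {
	t := newNode()
	t.Insert("*.example.com")
	fmt.Println(t.Match("ads.example.com")) // true
	fmt.Println(t.Match("example.org"))     // false
}
```

Search cost is bounded by the number of labels in the queried host rather than the number of list entries, which is why lookups stay fast even for very large lists.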

There are also a couple of minor fixes included.

And a quick way to see coverage locally: build: generate coverage.html when running tests.

Benchmarks

The benchmarks use two versions of the OISD big list because that seemed more realistic, as mentioned above.
The string cache benchmark adds all items from the OISD big plain list, while the others (regex and wildcard) add the ones from the wildcard list.
The querying benchmark populates the caches in the same way, but always searches all entries from the plain list.
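The custom heap columns in the output below (`fact_heap_MB`, `peak_heap_MB`) can be produced with `b.ReportMetric` plus `runtime.ReadMemStats`. Here's a rough sketch under my own assumptions; the names and the toy cache are illustrative, not blocky's actual benchmark code:

```go
package main

import (
	"fmt"
	"runtime"
	"testing"
)

// heapMB returns the current live heap size in MiB.
func heapMB() float64 {
	runtime.GC() // settle the heap so readings are comparable
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / (1024 * 1024)
}

// buildCache stands in for the factory under test.
func buildCache(entries []string) map[string]struct{} {
	c := make(map[string]struct{}, len(entries))
	for _, e := range entries {
		c[e] = struct{}{}
	}
	return c
}

func BenchmarkFactory(b *testing.B) {
	entries := []string{"example.com", "example.org"} // a real list in practice
	base := heapMB()
	var cache map[string]struct{}

	b.ResetTimer() // exclude list loading/setup from the timings
	for i := 0; i < b.N; i++ {
		cache = buildCache(entries)
	}
	b.StopTimer()

	// report heap growth attributable to the built cache
	b.ReportMetric(heapMB()-base, "fact_heap_MB")
	_ = cache
}

func main() {
	r := testing.Benchmark(BenchmarkFactory)
	fmt.Println(r.N > 0)
}
```

Custom metrics reported this way show up as extra columns in the standard `go test -bench` output, alongside ns/op, B/op and allocs/op.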

// --- Cache Building ---
//
// Most memory efficient: Wildcard (blocky/trie    radix) because of peak heap
// Fastest:               Wildcard (blocky/trie original)
//
// BenchmarkRegexFactory-8                1     1 232 170 998 ns/op   430.60 fact_heap_MB   430.60 peak_heap_MB   1 792 669 136 B/op   9 826 987 allocs/op
// BenchmarkStringFactory-8               7       159 934 992 ns/op    11.79 fact_heap_MB    26.91 peak_heap_MB      67 613 644 B/op       1 305 allocs/op
// BenchmarkWildcardFactory-8            18        60 091 687 ns/op    16.61 fact_heap_MB    16.61 peak_heap_MB      26 733 498 B/op      92 213 allocs/op (original)
// BenchmarkWildcardFactory-8            16        69 790 156 ns/op    14.89 fact_heap_MB    14.89 peak_heap_MB      27 987 510 B/op      52 902 allocs/op (radix)
// BenchmarkDGHubbleWildcardFactory-8    13        80 772 887 ns/op    23.65 fact_heap_MB    23.65 peak_heap_MB      34 126 104 B/op     301 831 allocs/op
// BenchmarkPorfirionWildcardFactory-8    4       283 443 974 ns/op   183.30 fact_heap_MB   183.30 peak_heap_MB     200 634 492 B/op     811 260 allocs/op
// --- Cache Querying ---
//
// Most memory efficient: Wildcard (blocky/trie radix)
// Fastest:               Wildcard (blocky/trie original)
//
// BenchmarkStringCache-8                 6       204 754 798 ns/op    15.11 cache_heap_MB              0 B/op          0 allocs/op
// BenchmarkWildcardCache-8              14        76 186 334 ns/op    16.61 cache_heap_MB              0 B/op          0 allocs/op (original)
// BenchmarkWildcardCache-8              12        95 316 121 ns/op    14.91 cache_heap_MB              0 B/op          0 allocs/op (radix)
// BenchmarkDGHubbleWildcardCache-8      14        78 111 098 ns/op    23.65 cache_heap_MB              0 B/op          0 allocs/op
// BenchmarkPorfirionWildcardCache-8      4       304 584 455 ns/op   183.30 cache_heap_MB     26 797 744 B/op    305 718 allocs/op

The third-party tries I tested are the dghubble and porfirion implementations that appear in the benchmark names above.
I committed the lists in helpertest/data. The files are large-ish, but we shouldn't need to update them regularly, and even then they're text, so diffs will limit the extra weight. So I think it's fine to have them in the repo.
Alternatively we could use a submodule, but I didn't think it'd be worth the hassle that brings.


Closes #1090


codecov bot commented Nov 12, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (dc66eff) 93.66% compared to head (79cf5da) 93.71%.
Report is 3 commits behind head on main.

Files Patch % Lines
lists/list_cache.go 60.00% 6 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1233      +/-   ##
==========================================
+ Coverage   93.66%   93.71%   +0.04%     
==========================================
  Files          70       72       +2     
  Lines        5687     5884     +197     
==========================================
+ Hits         5327     5514     +187     
- Misses        279      286       +7     
- Partials       81       84       +3     


@kwitsch
Collaborator

kwitsch commented Nov 12, 2023

This looks great, I'll look into it tomorrow... when I'm less drunk 🫣

@t-e-s-tweb

In my usage it's amazing so far. I use the Hagezi blocklists, where the plain domains number 900k+ while the wildcards are only 280k+. I've also noticed that previously, when blocky was stopped and restarted, processing the lists again took the same CPU time, about 1:30 for me. After this change, a restart's processing time is only 13 seconds.

Amazing improvement so far.

Collaborator

@kwitsch kwitsch left a comment


Code looks clean and performs well (tests & benchmarks).
Great work. 👍

If there are any unformatted files, CI fails with a confusing "read-only file system" error.

This also means we save a bit of time by not checking every single entry with all caches.
ThinkChaos added a commit to ThinkChaos/blocky that referenced this pull request Nov 12, 2023
A couple other Trie implementations were tested but they use more
memory and are slower. See PR 0xERR0R#1233 for details.
@ThinkChaos
Collaborator Author

Thanks to both of you for taking a look!

I updated the PR with a comment tweak because I realized the trie is not a full radix: only terminals are compressed. I think it's fine to keep it like this since memory use is acceptable and we avoid the extra complexity and speed decrease a full radix would bring. So I'll call it a feature.
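To illustrate what terminal-only compression means, here is a minimal sketch under my own assumptions (not the actual blocky trie): interior nodes hold one label each, but the unshared tail of an entry is stored as a single string instead of a chain of single-child nodes, and a compressed tail is expanded one label when a later insert needs to descend into it:

```go
package main

import (
	"fmt"
	"strings"
)

// tnode compresses only terminals: the part of an entry no other entry
// shares is kept as one reversed-label string in `terminals` instead of
// a chain of single-child nodes.
type tnode struct {
	children  map[string]*tnode
	terminals map[string]bool // compressed tails (reversed labels joined by ".")
	end       bool            // an entry's path ends exactly at this node
}

func newTnode() *tnode {
	return &tnode{children: map[string]*tnode{}, terminals: map[string]bool{}}
}

// reverse splits a domain and reverses its labels:
// "a.example.com" -> ["com", "example", "a"].
func reverse(domain string) []string {
	labels := strings.Split(domain, ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return labels
}

func (n *tnode) insert(labels []string) {
	if len(labels) == 0 {
		n.end = true
		return
	}
	head, rest := labels[0], labels[1:]
	if child, ok := n.children[head]; ok {
		child.insert(rest)
		return
	}
	tail := strings.Join(labels, ".")
	for t := range n.terminals {
		// expand a compressed tail that shares its first label with ours
		if t != tail && strings.SplitN(t, ".", 2)[0] == head {
			parts := strings.SplitN(t, ".", 2)
			child := newTnode()
			n.children[head] = child
			if len(parts) == 2 {
				child.terminals[parts[1]] = true
			} else {
				child.end = true
			}
			delete(n.terminals, t)
			child.insert(rest)
			return
		}
	}
	n.terminals[tail] = true
}

func (n *tnode) match(labels []string) bool {
	if n.end {
		return true // a wildcard ends here, covering this whole subtree
	}
	rem := strings.Join(labels, ".")
	for t := range n.terminals {
		if rem == t || strings.HasPrefix(rem, t+".") {
			return true
		}
	}
	if len(labels) == 0 {
		return false
	}
	if child, ok := n.children[labels[0]]; ok {
		return child.match(labels[1:])
	}
	return false
}

func main() {
	root := newTnode()
	root.insert(reverse("example.com"))     // from "*.example.com"
	root.insert(reverse("ads.example.com")) // from "*.ads.example.com"
	fmt.Println(root.match(reverse("x.ads.example.com"))) // true
	fmt.Println(root.match(reverse("other.org")))         // false
}
```

Since most entries in a wildcard list end in a long unshared tail, storing that tail as one string instead of many single-child nodes is where the memory saving comes from.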

I also ran the benchmarks on the bigger versions of the Hagezi lists out of curiosity. Results are pretty similar to OISD.
The second force-push is minor tweaks to the benchmarks I made while doing that (they don't affect results).

For the lists committed in the repo, I thought of another possibility: add a script that downloads the files, and put the directory it downloads to in .gitignore. The script could be run by the Makefile and would do nothing if the files already exist.
If you'd prefer that, I'll make the change. The only downside is that everyone gets different versions of the lists, but benchmarks already aren't comparable from machine to machine anyway.

Hagezi Pro

// Build:
// BenchmarkRegexFactory-8       1   1 670 006 348 ns/op   554.80 fact_heap_MB    554.80 peak_heap_MB    2 317 890 352 B/op   13 068 246 allocs/op
// BenchmarkStringFactory-8      1   2 479 199 024 ns/op    17.59 fact_heap_MB     40.37 peak_heap_MB      104 051 184 B/op        1 528 allocs/op
// BenchmarkWildcardFactory-8   12     101 126 137 ns/op    22.78 fact_heap_MB     22.78 peak_heap_MB       43 899 928 B/op      101 319 allocs/op
//
// Query:
// BenchmarkStringCache-8        3     391 377 809 ns/op    22.78 cache_heap_MB                                      0 B/op            0 allocs/op
// BenchmarkWildcardCache-8     14      86 248 134 ns/op    22.86 cache_heap_MB                                      0 B/op            0 allocs/op

Hagezi Pro++

// Build:
// BenchmarkRegexFactory-8       1   2 273 046 076 ns/op   706.40 fact_heap_MB    706.40 peak_heap_MB    2 949 640 304 B/op   16 842 001 allocs/op
// BenchmarkStringFactory-8      1   3 710 888 341 ns/op    21.35 fact_heap_MB     49.14 peak_heap_MB      127 686 816 B/op        2 060 allocs/op
// BenchmarkWildcardFactory-8    8     132 437 111 ns/op    27.29 fact_heap_MB     27.29 peak_heap_MB       51 432 116 B/op      125 302 allocs/op
//
// Query:
// BenchmarkStringCache-8        2     504 437 782 ns/op    27.79 cache_heap_MB                                      0 B/op            0 allocs/op
// BenchmarkWildcardCache-8     10     112 441 120 ns/op    27.22 cache_heap_MB                                      0 B/op            0 allocs/op

Hagezi Ultimate

// Build:
// BenchmarkRegexFactory-8       1   2 779 399 175 ns/op   890.80 fact_heap_MB    890.80 peak_heap_MB    3 728 279 496 B/op   21 357 076 allocs/op
// BenchmarkStringFactory-8      1   6 609 416 144 ns/op    25.65 fact_heap_MB     58.60 peak_heap_MB      154 634 824 B/op        2 174 allocs/op
// BenchmarkWildcardFactory-8    5     214 580 937 ns/op    52.17 fact_heap_MB     52.17 peak_heap_MB       80 956 003 B/op      154 019 allocs/op
//
// Query:
// BenchmarkStringCache-8        2     614 883 620 ns/op    32.95 cache_heap_MB                                      0 B/op            0 allocs/op
// BenchmarkWildcardCache-8      7     162 815 173 ns/op    52.22 cache_heap_MB                                      0 B/op            0 allocs/op

It seems the trie uses a fair amount of extra memory for this test. I'm not sure why, but it still seems acceptable, so I didn't look into it further.

@kwitsch
Collaborator

kwitsch commented Nov 12, 2023

Maybe we should print a warning message after loading the lists for the first time, if a threshold number of regexes is reached? 🤔

Similar to the one you added to the configuration page.

I'm fairly sure people pay more attention to the logs than the documentation if they have performance issues.

@ThinkChaos
Collaborator Author

ThinkChaos commented Nov 12, 2023

Yeah, I was thinking of tackling that separately because I wanted to refactor how ListCache uses the caches to address a couple of things at the same time:

  • there's no great place to put a warning based on how many regexes are used
  • both the lists and the string caches need rules to detect what kind of entry a string is (/ prefix + suffix for regex, and so on)
  • doing this detection twice is wasteful, especially compiling regexes

I just pushed a minimal patch to get a warning in already, and will tackle the rest separately :)
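The entry-kind detection described in the bullets above could look roughly like this; a hypothetical sketch of the `/` prefix + suffix convention, not blocky's actual code (doing this once in ListCache, instead of in every cache, is what avoids compiling each regex twice):

```go
package main

import (
	"fmt"
	"strings"
)

// entryKind classifies a list entry: "/" prefix and suffix marks a
// regex, a "*." prefix marks a wildcard, anything else is a plain
// domain. Names here are illustrative assumptions.
type entryKind int

const (
	plainEntry entryKind = iota
	wildcardEntry
	regexEntry
)

func classify(entry string) entryKind {
	switch {
	case len(entry) > 1 && strings.HasPrefix(entry, "/") && strings.HasSuffix(entry, "/"):
		return regexEntry
	case strings.HasPrefix(entry, "*."):
		return wildcardEntry
	default:
		return plainEntry
	}
}

func main() {
	fmt.Println(classify(`/ads\d+\..*/`) == regexEntry)   // true
	fmt.Println(classify("*.example.com") == wildcardEntry) // true
	fmt.Println(classify("example.com") == plainEntry)      // true
}
```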

@ThinkChaos
Collaborator Author

Realized I missed a b.ResetTimer() in the factory benchmarks. It doesn't affect the comparison, since the same setup time was added to every implementation. I reran the benchmarks and updated the PR; even the absolute numbers barely changed.
Anyway, letting this sit a bit longer since it's a big change; I'll merge this weekend or when my next PR is ready :)

Owner

@0xERR0R 0xERR0R left a comment


Looks very good 👍

@0xERR0R 0xERR0R added the 🔨 enhancement New feature or request label Nov 17, 2023
@0xERR0R 0xERR0R added this to the v0.23 milestone Nov 17, 2023
@0xERR0R 0xERR0R merged commit b498bc5 into 0xERR0R:main Nov 17, 2023
11 checks passed
@ThinkChaos ThinkChaos deleted the feat/wildcard-lists branch December 12, 2023 15:29