
[Investigation] Reported performance regression when updating wasmtime from 19 to 24. #2058

Closed
Stebalien opened this issue Oct 15, 2024 · 20 comments · Fixed by #2059

@Stebalien
Member

We've seen a reported performance regression that appears to be associated with the wasmtime update that happened in FVM 4.4 (wasmtime 19 -> 24). This was an interesting performance regression because:

  1. It shows up even when updating FVM 2.7 to 2.9, even when FVM 2.9 isn't actually being used, just included. This indicates that the issue is in some common wasmtime dependency.
  2. None of the gas benchmarks in FVM v4 are picking up the regression, but those benchmarks mostly focus on non-wasm things. Importantly, I'm reasonably sure that:
    1. It's not related to hashing.
    2. It's not related to signature verification.
    3. It's not related to calling into wasm contracts (see Re-Price Method Invocation #2057).
    4. It's not related to trapping and running out of gas (we have an explicit benchmark for this because it occasionally rears its head and turns into an $N^2$ problem).
    5. It shouldn't be cranelift. That tends to be version-locked with a specific wasmtime version.

Things I haven't checked:

  1. The cost of wasm instructions in general. We priced wasm instructions long ago and it was a bit of a manual process.
  2. The cost of jumping in and out of wasm. Wasmtime has made some recent changes here; however, I'd expect any issues there to be specific to the wasmtime version in use (observation (1) above appears to conflict with this).

All I can think of now is... rustix? Or something like that?
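For reference, a couple of cargo invocations can help narrow down which dependencies the two wasmtime trees share (the exact version string has to match what's in Cargo.lock):

# List every crate that appears in the build tree at more than one version.
cargo tree -d

# Show all paths that pull in the new wasmtime.
cargo tree -i wasmtime@24.0.0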

@Stebalien
Member Author

My next step is to try to update FVM v2 in forest to see what deps get updated along with it.

@Stebalien
Member Author

We've confirmed that having wasmtime v24 anywhere in the build tree, even if unused, causes this slowdown.
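A minimal sketch of what that looks like in a manifest (hypothetical Cargo.toml fragment, not forest's actual one): the new wasmtime is merely listed, never referenced from code, yet its feature choices still participate in cargo's feature unification for any crates the two trees happen to share.

[dependencies]
# The FVM version actually exercised by the benchmark.
fvm = "2.7"
# Unused in code, but present in the build tree.
wasmtime = "24"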

@Stebalien
Member Author

  • It's not the write_core feature in object as far as I can tell.
  • I'm also seeing some changes in gimli, but v24 uses a different version of that so it shouldn't matter.

Interesting, I'm seeing smallvec now depending on serde. I can't imagine how that might be relevant, but I'll need to check.
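One way to check that kind of feature drift (the package name here is just the example from above) is to ask cargo which features each dependent enables:

# Inverted feature view: every crate that depends on smallvec, plus the
# smallvec features it turns on (e.g. serde).
cargo tree -i smallvec -e features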

@Stebalien
Member Author

I've tracked it down to wasmtime's cranelift feature (does not appear to be feature unification but I may be missing something). I'm now trying with just wasmtime-cranelift.
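Roughly, that narrowing amounts to toggling features on a direct dependency; something along these lines (version and feature list illustrative):

# wasmtime with only the compiler feature enabled; dropping "cranelift" here
# (and relying on pre-compiled modules) keeps cranelift-codegen out of the tree.
wasmtime = { version = "24", default-features = false, features = ["cranelift"] }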

@Stebalien
Member Author

  • wasmtime-environ doesn't reproduce.
  • cranelift-wasm does.

@Stebalien
Member Author

cranelift-codegen reproduces, trying cranelift-control now

@Stebalien
Member Author

cranelift-control does not. Trying cranelift-codegen-meta, cranelift-codegen-shared, and cranelift-isle

@Stebalien
Member Author

Ok, it is cranelift-codegen specifically. Now I'm trying to see if it's because we're not supporting some architecture, by building with CARGO_FEATURE_ALL_ARCH... we'll see if that even works.

@Stebalien
Member Author

Er, trying with reduced features first. Only "std" and "unwind" because those are required.
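In Cargo.toml terms that's something like the following (sketch; the version number is illustrative):

# Direct dependency used purely to reproduce the slowdown, with only the
# features cranelift-codegen can't build without.
cranelift-codegen = { version = "0.109", default-features = false, features = ["std", "unwind"] }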

@Stebalien
Member Author

Ok, still broken with those features. Now I'm trying with "all arch" to see if it's some kind of ISLE issue (I think it is, but I'm not sure if that's the fix).

@Stebalien
Member Author

Enabling all architectures doesn't help, and I can't skip the ISLE build (that option only exists if it's pre-built). Now I'm bisecting cranelift-codegen, starting with 0.109.
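Each bisection step is just a pinned, exact version requirement plus a re-run of the benchmark (hypothetical manifest line):

# Pin one exact cranelift-codegen release for this bisection step.
cranelift-codegen = "=0.109.0"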

@Stebalien
Member Author

Ok, it's cranelift-codegen 0.107.0 exactly. Now I'm testing a build with all the FVM crates fully updated to get us all on a single wasmtime version just in case it's an issue with multiple wasmtime versions.

@Stebalien
Member Author

Ok, it is feature unification: specifically, the trace-log feature getting enabled on regalloc2.
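For confirmation, cargo can show which crate in the tree is turning that feature on (output format varies a bit by cargo version):

# Lists every dependent of regalloc2 together with the regalloc2 features it
# enables; trace-log should show up on the wasmtime 24 / cranelift path.
cargo tree -i regalloc2 -e features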

@Stebalien
Member Author

My hypothesis is that this is actually just a compilation slowdown, not an execution slowdown. We're only running through ~24 epochs here and will have to pause to lazily (IIRC?) compile actors every time we load a new one. This is my guess because I'm noticing that epochs 1 & 4 are taking a while.

Additionally, if we're speeding through multiple network upgrades (especially if we're switching FVM versions), we'll have to re-compile the actors per network version.

@LesnyRumcajs
Contributor

@Stebalien This could be verified by running a longer benchmark without network upgrades. The .env file would need to be modified along the lines of:

LOTUS_IMAGE=ghcr.io/chainsafe/lotus-devnet:2024-10-10-600728e
FOREST_DATA_DIR=/forest_data
LOTUS_DATA_DIR=/lotus_data
FIL_PROOFS_PARAMETER_CACHE=/var/tmp/filecoin-proof-parameters
MINER_ACTOR_ADDRESS=f01000
LOTUS_RPC_PORT=1234
LOTUS_P2P_PORT=1235
MINER_RPC_PORT=2345
FOREST_RPC_PORT=3456
FOREST_OFFLINE_RPC_PORT=3457
F3_RPC_PORT=23456
F3_FINALITY=100000
GENESIS_NETWORK_VERSION=24
SHARK_HEIGHT=-10
HYGGE_HEIGHT=-9
LIGHTNING_HEIGHT=-8
THUNDER_HEIGHT=-7
WATERMELON_HEIGHT=-6
DRAGON_HEIGHT=-5
WAFFLE_HEIGHT=-4
TUKTUK_HEIGHT=-3
TARGET_HEIGHT=200

Note that the timeout would also need to be extended. Then, we compare the timings before and after the FVM upgrade.

That way, we can confirm whether the slowdown comes from re-compiling actors (which would not be a big issue).

@LesnyRumcajs
Contributor

@Stebalien I believe your hypothesis holds. I tried the config above locally (on a machine that previously reported a 50% slowdown).

FVM 4.3.1
________________________________________________________
Executed in  781.81 secs    fish           external
   usr time    5.92 secs    0.00 micros    5.92 secs
   sys time    5.95 secs  696.00 micros    5.95 secs

FVM 4.4
________________________________________________________
Executed in  782.58 secs    fish           external
   usr time    5.81 secs    0.00 micros    5.81 secs
   sys time    6.01 secs  811.00 micros    6.01 secs

So I guess it's "okay-ish" in the sense that the slowdown only occurs under somewhat synthetic conditions. That said, it might make sense to report it to the cranelift maintainers.

@Stebalien
Member Author

Yep, I plan on reporting it upstream. I'm also looking into possibly disabling the offending feature (requires upstream help, but I'm guessing it's only needed for GC which we don't use).

Also, for forest, you can modify the "quick" compilation profile to optimize cranelift-codegen (or maybe just regalloc2). That didn't get it to be quite fast enough to pass the test in the 5m timeout, but it definitely improved the situation.
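That profile tweak would look roughly like this (sketch; "quick" is assumed to be forest's existing custom profile, so only the per-package overrides are shown):

# Optimize just the hot compilation crates, even in an otherwise fast-to-build profile.
[profile.quick.package.cranelift-codegen]
opt-level = 3

[profile.quick.package.regalloc2]
opt-level = 3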

I'm also going to write a quick "load/compile the actors" benchmark.
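A minimal sketch of such a benchmark, assuming a single actor's wasm is available as a plain file on disk (the path, the anyhow dependency, and the direct wasmtime API usage are illustrative, not the FVM's actual loading path):

// Times a full cranelift compile of one module; repeating this per actor in a
// bundle approximates the per-network-version compilation cost.
use std::time::Instant;
use wasmtime::{Engine, Module};

fn main() -> anyhow::Result<()> {
    let wasm = std::fs::read("actor.wasm")?; // hypothetical path
    let engine = Engine::default();

    let start = Instant::now();
    let _module = Module::new(&engine, &wasm)?; // compilation happens here
    println!("compiled in {:?}", start.elapsed());
    Ok(())
}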

@Stebalien
Member Author

Oh, lol, no. This trace-log is literally just for debugging. I don't think this was supposed to get shipped to production.

@Stebalien
Member Author

Stebalien commented Oct 18, 2024

> I don't think this was supposed to get shipped to production.

Nvm:

# Note that this just enables `trace-log` for `clif-util` and doesn't turn it on
# for all of Cranelift, which would be bad.

It looks like they thought this wouldn't have a performance impact.

@Stebalien
Member Author

Ah, that comment is saying that trace-log is only enabled for clif-util.

And... already fixed but not in the version we're using: bytecodealliance/wasmtime#9128. Time to update to v25, I guess.

Stebalien added a commit that referenced this issue Oct 18, 2024
Importantly, this reverts a previous wasmtime change that enabled
trace-logging in regalloc, massively slowing down compilation across all
FVM versions.

fixes #2058