Surface module cache for reuse. #1506

Open: wants to merge 3 commits into main from the reusable-module-cache branch

Conversation

graydon (Contributor) commented Jan 9, 2025

The goals of this PR are:

  1. Surface the ModuleCache from inside soroban to its embedding environment.
  2. Allow a module cache to outlive the soroban Host, and be reused from one Host to the next.
  3. Allow the ModuleCache to be pre-populated before the lifetime of any single Host (eg. by stellar-core at startup, with all the non-archived contracts currently live in the ledger).

These goals are accomplished by the following structural changes:

  1. Factoring the error-handling / error-mapping functionality of Host out into a separate trait ErrorHandler that Host implements, and adding a new trait CompilationContext that extends ErrorHandler + AsBudget. This abstracts the services the ModuleCache needs from the Host, so that the embedding environment can provide its own CompilationContext before a Host exists or after it is gone (both abstractions are sketched just after this list).
  2. Modifying the internal map that stores Modules inside the ModuleCache to have two possible forms: the existing MeteredOrdMap, which is metered, built once, immutable once built, eagerly copied, and only really suited to small numbers of modules; and a new form backed by a refcounted, shared, mutable std::collections::BTreeMap, which can be incrementally and efficiently added to or removed from, and is not metered. When you construct a ModuleCache you now choose which kind you want. The old kind is kept around for compatibility and used any time a "throwaway" ModuleCache is built over a fixed set of contracts (eg. during upload); the new "reusable" kind is available if you want a long-lived ModuleCache that holds lots of contracts and outlives any given Host.
  3. Module storage and arguments that were Rc<> are made Arc<> and the whole ModuleCache is made Send+Sync. This is mainly because we want to work with the reusable ModuleCache across C++ threads, and while Rust has no idea about C++ threads it at least helps us shoot ourselves in the foot less if all the Rust code that works with it preserves the sense of "this would be threadsafe in Rust if Rust was making the threads".
  4. There is also a bit of gratuitous churn where things get renamed from foo => wasmi_foo and Foo => wasmi::Foo. This is because this change was done as part of (and there'll be a followup PR including) work equipping the Vm and ModuleCache with an experimental wasmtime/winch backend. It turns out this PR's work (on just the ModuleCache reuse part) is high enough value, and agnostic enough to which backend it's using, that we want to split this work out, move it up in the schedule and explore getting it into as early a release as possible. But I didn't bother reverting-out the wasmi_-qualification since it's intertwined with other changes here and harmless to keep.
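As a rough illustration of points 1 and 2 above, here is a minimal, self-contained sketch of the shape of these abstractions. Only the MeteredSingleUseMap variant name comes from the diff shown further down; the trait methods, the second variant's name, and the exact synchronization wrapper are paraphrased from this description, not copied from the PR, and the soroban types are stubbed so the sketch compiles in isolation.

    // Sketch only: stand-ins for real soroban-env-host types.
    use std::collections::BTreeMap;
    use std::sync::{Arc, Mutex};

    type Hash = [u8; 32];
    #[derive(Clone)] struct HostError;
    #[derive(Clone)] struct Budget;
    #[derive(Clone)] struct ParsedModule;
    #[derive(Clone)] struct MeteredOrdMap<K, V, B>(Vec<(K, V)>, B);

    // Budget access, already provided by Host.
    trait AsBudget {
        fn as_budget(&self) -> &Budget;
    }

    // Error mapping/escalation factored out of Host (illustrative signature).
    trait ErrorHandler {
        fn map_err<T>(&self, res: Result<T, HostError>) -> Result<T, HostError>;
    }

    // What the ModuleCache needs from its context: either a Host, or an
    // embedder-supplied context used before/after any Host exists.
    trait CompilationContext: ErrorHandler + AsBudget {}

    // The two storage forms the cache can now use.
    #[derive(Clone)]
    enum ModuleCacheMap {
        // Existing form: metered, built once, immutable once built, eagerly copied.
        MeteredSingleUseMap(MeteredOrdMap<Hash, Arc<ParsedModule>, Budget>),
        // New form: refcounted, shared, mutable, unmetered; supports incremental
        // insert/remove and reuse across Hosts. Variant name is a placeholder.
        SharedReusableMap(Arc<Mutex<BTreeMap<Hash, Arc<ParsedModule>>>>),
    }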

I've just finished the initial split and still need to finish splitting out the core side as well, so this PR isn't totally polished yet, but it's good enough for a preliminary look by reviewers. Feedback welcome!

(Obviously I also have to go write a CAP about it)

@graydon graydon requested review from leighmcculloch, dmkozh and sisuresh and removed request for sisuresh, dmkozh and leighmcculloch January 9, 2025 08:00
@graydon graydon mentioned this pull request Jan 9, 2025
@graydon graydon requested a review from jayz22 January 9, 2025 08:17
@graydon graydon marked this pull request as draft January 9, 2025 08:17
@graydon graydon force-pushed the reusable-module-cache branch 2 times, most recently from 0f58a15 to 86ea7ea Compare January 10, 2025 19:03
@graydon graydon force-pushed the reusable-module-cache branch from 86ea7ea to ad0d6d2 Compare January 10, 2025 19:34
dmkozh (Contributor) left a comment

Looks good to me overall; I'm just not sure about the gating of the old behavior (left a comment about that).


#[derive(Clone)]
enum ModuleCacheMap {
    MeteredSingleUseMap(MeteredOrdMap<Hash, Arc<ParsedModule>, Budget>),
Contributor:

If we don't try to preserve p22 compatibility (which we are not going to), then this should not be necessary in production code paths, right? It might make sense for simulation and unit tests though, so maybe it should be gated with testutils/test/recording_mode configurations. The debug budget would also be needed for metering this.

FWIW for simulation it would be nice to also use the reusable cache, but that requires some additional discussion and doesn't belong to this PR.

Contributor:

In either case it'll probably make sense to remove the MeteredSingleUseMap for better maintainability.

If we've decided to make caching overhead (specifically the BTreeMap operations) free in the global case, it most definitely won't hurt to make it free in the single use case (it will have negligible effect if we don't charge for it).

graydon (Author):

I'll see if I can remove it -- we still use it in a few code paths like upload where there's no reuse -- but possibly that doesn't need a cache at all, or can ignore the cache for the sake of measurement since it'd only ever be a 1-element cache that's thrown away anyways.

Concerning simulation: this is an interesting point! We're going to greatly overcharge people (and kinda negate the point of the change entirely) if we don't adjust simulation. I think the correct thing to do is populate a 1-entry cache before the simulation call (on a budget with a high-though-not-DoS-able limit), then pass that populated cache into the host being simulated. @dmkozh WDYT?

Contributor:

I think the blocking requirement is to have reasonably accurate numbers both in tests and in simulation. That means not charging for Wasm parsing unless the upload happens. For simulation, it would be nice to actually use the shared cache for all the simulations. But I think we can live without it initially.

> I think the correct thing to do is populate a 1-entry cache before the simulation call

I'm not sure I understand how that would help. Before the simulation call we have no idea which contracts are going to be called, so we can't pre-populate anything. I think we need to just charge Wasm parsing to shadow budget, unless it is being uploaded. IIUC that should work fine with the new-style mutable cache.
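To make the rule proposed here concrete, a tiny self-contained sketch follows (stub types only, not the real Budget API): the parse cost counts against the real budget only when the Wasm is actually uploaded in the current transaction, and is otherwise routed to the shadow budget that recording mode and simulation use for costs the on-chain apply path would not incur.

    // Stub budget: "real" charges are user-visible, "shadow" charges are
    // bookkeeping only (recording mode / simulation).
    #[derive(Default)]
    struct Budget {
        real_cpu: u64,
        shadow_cpu: u64,
    }

    impl Budget {
        fn charge_wasm_parse(&mut self, cpu: u64, freshly_uploaded: bool) {
            if freshly_uploaded {
                // Upload really parses the Wasm on-chain, so charge for real.
                self.real_cpu += cpu;
            } else {
                // On-chain the module would come from the cache, so only shadow it.
                self.shadow_cpu += cpu;
            }
        }
    }

    fn main() {
        let mut b = Budget::default();
        b.charge_wasm_parse(10_000, true);  // upload-and-invoke: charged
        b.charge_wasm_parse(10_000, false); // invoke of already-cached Wasm: shadowed
        assert_eq!((b.real_cpu, b.shadow_cpu), (10_000, 10_000));
    }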

graydon (Author):

@dmkozh So... this seems to more or less work as you describe -- I can mostly rip out all the weird immutable 1-tx-cache logic, as well as all the ghastly emulation of its lifecycle we had to put in for recording mode -- but there is a small wrinkle: observations change. For two reasons:

  1. we're not charging the (small) costs of MeteredOrdMap operations on the cache anymore at all
  2. we're no longer able to distinguish, in unit tests that use recording mode, between:
    • "upload and invoke" cases which we should charge for and
    • "invoke something that would have been in the cache", which we want to suppress and direct the charge to the shadow budget for

We cannot distinguish case 2 -- again, only in unit tests -- because the distinguishing happens in frame::instantiate_vm and works by looking at "stuff that was in the storage snapshot", and in unit tests the storage snapshot is always empty. IOW unit tests actually are usually more like "upload and invoke" transactions; we've just been (somewhat inaccurately) observing some of them, in recording mode, as though they're recording things-that-were-just-uploaded.

I attempted to compensate for this by adding a new call to the testutils routine that registers a contract, to automatically build a module cache that contains the registered contracts, so we get module-cache hits in that case. But I'm not sure that's a good idea either -- recording mode in tests is a weird case where it's not really trying to be "like simulation" nor is it trying to be "like real apply"; it's just a sort of convenience for tests to not need footprints. Idk, open to suggestions here!

I've uploaded a version of this that has these changes. Will keep poking at them tomorrow.

Contributor:

I'll take a look at the code, and in any case maybe it would make more sense to leave the testing/simulation improvements for the followups. That said, here is how I conceptually see the budget accounting for tests:

  • A new test host is instantiated together with an empty mutable module cache
  • Any wasm upload (be it a test utility, or a host fn call) charges the Wasm parse and puts the module into cache
  • Any future call to the cached contract within this host reads the contract from the cache

It seems like this is as close as it gets to the on-chain behavior, unless I'm missing something. Basically, if ledger snapshots are not involved, then every test builds the storage from scratch anyway, so the scenario of 'invoking something that would have been in the cache' doesn't seem possible to me at all.

> recording mode in tests is a weird case where it's not really trying to be "like simulation" nor is it trying to be "like real apply", it's just a sort of convenience for tests to not need footprints

FWIW recording mode is a de-facto standard for the unit tests (I'm not even sure if the SDK provides a way to specify the footprint - it really wouldn't cover anything interesting about the contract logic). And we also want the metering to be sane. For sure, we'll miss XDR parsing costs here and there, and testing with a compiled-in contract is less accurate than testing with Wasms. However, I think we should strive to keep the main cost components correct (like charging/not charging for Wasm parsing, given a Wasm contract).

graydon (Author):

@dmkozh Ok, I've done this now (and fixed all the test fallout). One interesting thing is I had to bump up the margin of error on simulations fairly significantly. I think this is mostly due to the dominant-shared-cost of parsing modules going away. But I'm not 100% certain.

@graydon graydon marked this pull request as ready for review January 11, 2025 01:14
}

/// Instantiates a VM given the arguments provided in [`Self::new`],
/// or [`Self::new_from_module_cache`]
Contributor:

Nit: this comment is outdated.
Self::new_from_module_cache ==> Self::from_parsed_module

graydon (Author):

Fixed.

graydon (Author):

fixed / made-obsolete


@@ -336,6 +336,44 @@ pub fn invoke_host_function_with_trace_hook<T: AsRef<[u8]>, I: ExactSizeIterator
    base_prng_seed: T,
    diagnostic_events: &mut Vec<DiagnosticEvent>,
    trace_hook: Option<TraceHook>,
) -> Result<InvokeHostFunctionResult, HostError> {
Contributor:

Do we still need this version? Seems like it's sufficient to keep the _and_module_cache version with the optional module cache.
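For illustration, here is a self-contained sketch of the consolidation being suggested (names mirror the discussion, but the types and signatures are stubs, not the crate's real API): the older entry point can survive as a thin shim that forwards to the newer one with no pre-built module cache.

    // Stub types standing in for the real host types.
    struct ModuleCache;
    struct HostError;
    struct InvokeHostFunctionResult;

    // Newer entry point: optionally reuses a caller-provided module cache.
    fn invoke_host_function_with_module_cache(
        wasm: &[u8],
        cache: Option<&ModuleCache>,
    ) -> Result<InvokeHostFunctionResult, HostError> {
        // Real work elided: reuse `cache` if given, otherwise build a throwaway one.
        let _ = (wasm, cache);
        Ok(InvokeHostFunctionResult)
    }

    // Older entry point kept for compatibility: same behaviour, no cache supplied.
    fn invoke_host_function(wasm: &[u8]) -> Result<InvokeHostFunctionResult, HostError> {
        invoke_host_function_with_module_cache(wasm, None)
    }

    fn main() {
        let _ = invoke_host_function(b"\0asm");
    }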

graydon (Author):

I thought I'd keep it to avoid gratuitous breakage of clients (it is after all the API most are likely to call). I can remove it if you strongly prefer. @dmkozh this is kinda "your API", do you have a preference?

Contributor:

AFAIK only Core uses this as of today. Clippy won't allow us to break this though, unless we bump the version or use unstable-next-api feature. I think breaking this in p23 is very low risk (if someone has been using it, they must be advanced enough to migrate to the module cache interface as well). But we don't have to do that in this PR.

Contributor:

Ok makes sense, I don't feel strongly so feel free to keep it for backward compatibility.

)
}

pub fn parse_and_cache_module<Ctx: CompilationContext>(
Contributor:

Nit: this can be pub(crate) fn, and so can add_stored_contracts

graydon (Author):

No, a major point of this change is that the module cache can now be (publicly) operated by stellar-core before and after a host exists.
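To illustrate the workflow this enables, here is a minimal, self-contained sketch of how an embedder like stellar-core might use it (all names and types are simplified stand-ins, not the crate's actual API): the cache is created and populated outside any Host and then handed to each Host in turn.

    use std::collections::BTreeMap;
    use std::sync::{Arc, Mutex};

    type Hash = [u8; 32];
    struct ParsedModule;

    // Stand-in for the reusable, shareable ModuleCache.
    #[derive(Clone, Default)]
    struct ModuleCache {
        map: Arc<Mutex<BTreeMap<Hash, Arc<ParsedModule>>>>,
    }

    impl ModuleCache {
        // Analogue of parse_and_cache_module: compile once, insert by code hash.
        fn parse_and_cache_module(&self, hash: Hash, _wasm: &[u8]) {
            self.map.lock().unwrap().insert(hash, Arc::new(ParsedModule));
        }
    }

    fn main() {
        // At node startup, before any Host exists: pre-populate from the ledger.
        let cache = ModuleCache::default();
        for (hash, wasm) in [([0u8; 32], &b"\0asm\x01\0\0\0"[..])] {
            cache.parse_and_cache_module(hash, wasm);
        }

        // Later, each transaction's Host is given (a cheap clone of) the shared
        // cache; the cache itself outlives every individual Host.
        for _tx in 0..3 {
            let _per_host_cache = cache.clone();
            // let host = Host::with_module_cache(_per_host_cache, ...); // illustrative only
        }
    }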

@graydon graydon requested a review from a team as a code owner January 16, 2025 07:30
@graydon graydon force-pushed the reusable-module-cache branch from 60ed680 to cae6fff Compare January 16, 2025 07:33
@graydon graydon force-pushed the reusable-module-cache branch from 692d056 to 9515643 Compare January 18, 2025 07:45