Surface module cache for reuse. #1506

Open: wants to merge 3 commits into main from the reusable-module-cache branch

Conversation

graydon (Contributor) commented Jan 9, 2025

The goals of this PR are:

  1. Surface the ModuleCache from inside soroban to its embedding environment.
  2. Allow a module cache to outlive the soroban Host, and be reused from one Host to the next.
  3. Allow the ModuleCache to be pre-populated before the lifetime of any single Host (eg. by stellar-core at startup, with all the non-archived contracts currently live in the ledger).

These goals are accomplished by the following structural changes:

  1. Factoring the error-handling / error-mapping functionality of Host out into a separate trait ErrorHandler that Host implements, and adding a new trait CompilationContext that extends ErrorHandler + AsBudget. This abstracts the services the ModuleCache needs from the Host, so that the embedding environment can provide its own CompilationContext before a Host exists or after it is gone (both abstractions are sketched just after this list).
  2. Modifying the internal map that stores Modules inside the ModuleCache to have two possible forms: the existing MeteredOrdMap, which is metered, built once, immutable once built, eagerly copied, and only really suited to small numbers of modules; and a new form backed by a refcounted, shared, mutable std::collections::BTreeMap, which can be incrementally and efficiently added to or removed from, and is not metered. When you construct a ModuleCache you now choose which kind you want. The old kind is kept around for compatibility and used any time a "throwaway" ModuleCache is built over a fixed set of contracts (eg. during upload); the new "reusable" kind is available if you want a long-lived ModuleCache that holds lots of contracts and outlives any given Host.
  3. Module storage and arguments that were Rc<> are made Arc<> and the whole ModuleCache is made Send+Sync. This is mainly because we want to work with the reusable ModuleCache across C++ threads, and while Rust has no idea about C++ threads it at least helps us shoot ourselves in the foot less if all the Rust code that works with it preserves the sense of "this would be threadsafe in Rust if Rust was making the threads".
  4. There is also a bit of gratuitous churn where things get renamed from foo => wasmi_foo and Foo => wasmi::Foo. This is because this change was done as part of (and there'll be a followup PR including) work equipping the Vm and ModuleCache with an experimental wasmtime/winch backend. It turns out this PR's work (on just the ModuleCache reuse part) is high enough value, and agnostic enough to which backend it's using, that we want to split this work out, move it up in the schedule and explore getting it into as early a release as possible. But I didn't bother reverting-out the wasmi_-qualification since it's intertwined with other changes here and harmless to keep.
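As a rough illustration of points 1 and 2 above, here is a minimal, self-contained sketch of the shape of these abstractions. Only the MeteredSingleUseMap variant name comes from the diff shown further down; the trait methods, the second variant's name, and the exact synchronization wrapper are paraphrased from this description, not copied from the PR, and the soroban types are stubbed so the sketch compiles in isolation.

    // Sketch only: stand-ins for real soroban-env-host types.
    use std::collections::BTreeMap;
    use std::sync::{Arc, Mutex};

    type Hash = [u8; 32];
    #[derive(Clone)] struct HostError;
    #[derive(Clone)] struct Budget;
    #[derive(Clone)] struct ParsedModule;
    #[derive(Clone)] struct MeteredOrdMap<K, V, B>(Vec<(K, V)>, B);

    // Budget access, already provided by Host.
    trait AsBudget {
        fn as_budget(&self) -> &Budget;
    }

    // Error mapping/escalation factored out of Host (illustrative signature).
    trait ErrorHandler {
        fn map_err<T>(&self, res: Result<T, HostError>) -> Result<T, HostError>;
    }

    // What the ModuleCache needs from its context: either a Host, or an
    // embedder-supplied context used before/after any Host exists.
    trait CompilationContext: ErrorHandler + AsBudget {}

    // The two storage forms the cache can now use.
    #[derive(Clone)]
    enum ModuleCacheMap {
        // Existing form: metered, built once, immutable once built, eagerly copied.
        MeteredSingleUseMap(MeteredOrdMap<Hash, Arc<ParsedModule>, Budget>),
        // New form: refcounted, shared, mutable, unmetered; supports incremental
        // insert/remove and reuse across Hosts. Variant name is a placeholder.
        SharedReusableMap(Arc<Mutex<BTreeMap<Hash, Arc<ParsedModule>>>>),
    }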

I've just finished the initial split and still need to finish splitting out the core side as well, so this PR isn't totally polished yet, but it's good enough for a preliminary look by reviewers. Feedback welcome!

(Obviously I also have to go write a CAP about it)

@graydon graydon requested review from leighmcculloch, dmkozh and sisuresh and removed request for sisuresh, dmkozh and leighmcculloch January 9, 2025 08:00
@graydon graydon mentioned this pull request Jan 9, 2025
@graydon graydon requested a review from jayz22 January 9, 2025 08:17
@graydon graydon marked this pull request as draft January 9, 2025 08:17
@graydon graydon force-pushed the reusable-module-cache branch 2 times, most recently from 0f58a15 to 86ea7ea Compare January 10, 2025 19:03
@graydon graydon force-pushed the reusable-module-cache branch from 86ea7ea to ad0d6d2 Compare January 10, 2025 19:34
dmkozh (Contributor) left a comment

Looks good to me overall; I'm just not sure about the gating of the old behavior (left a comment about that).


#[derive(Clone)]
enum ModuleCacheMap {
    MeteredSingleUseMap(MeteredOrdMap<Hash, Arc<ParsedModule>, Budget>),
Contributor:

If we don't try to preserve p22 compatibility (which we are not going to), then this should not be necessary in production code paths, right? It might make sense for simulation and unit tests though, so maybe it should be gated with testutils/test/recording_mode configurations. The debug budget would also be needed for metering this.

FWIW for simulation it would be nice to also use the reusable cache, but that requires some additional discussion and doesn't belong to this PR.

Contributor:

In either case it'll probably make sense to remove the MeteredSingleUseMap for better maintainability.

If we've decided to make caching overhead (specifically the BTreeMap operations) free in the global case, it most definitely won't hurt to make it free in the single use case (it will have negligible effect if we don't charge for it).

graydon (Author):

I'll see if I can remove it -- we still use it in a few code paths like upload where there's no reuse -- but possibly that doesn't need a cache at all, or can ignore the cache for the sake of measurement since it'd only ever be a 1-element cache that's thrown away anyways.

Concerning simulation: this is an interesting point! We're going to greatly overcharge people (and kinda negate the point of the change entirely) if we don't adjust simulation. I think the correct thing to do is populate a 1-entry cache before the simulation call (on a budget with a high-though-not-DoS-able limit), then pass that populated cache into the host being simulated. @dmkozh WDYT?

Contributor:

I think the blocking requirement is to have reasonably accurate numbers both in tests and in simulation. That means not charging for Wasm parsing unless the upload happens. For simulation, it would be nice to actually use the shared cache for all the simulations. But I think we can live without it initially.

> I think the correct thing to do is populate a 1-entry cache before the simulation call

I'm not sure I understand how that would help. Before the simulation call we have no idea which contracts are going to be called, so we can't pre-populate anything. I think we need to just charge Wasm parsing to shadow budget, unless it is being uploaded. IIUC that should work fine with the new-style mutable cache.
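To make the rule proposed here concrete, a tiny self-contained sketch follows (stub types only, not the real Budget API): the parse cost counts against the real budget only when the Wasm is actually uploaded in the current transaction, and is otherwise routed to the shadow budget that recording mode and simulation use for costs the on-chain apply path would not incur.

    // Stub budget: "real" charges are user-visible, "shadow" charges are
    // bookkeeping only (recording mode / simulation).
    #[derive(Default)]
    struct Budget {
        real_cpu: u64,
        shadow_cpu: u64,
    }

    impl Budget {
        fn charge_wasm_parse(&mut self, cpu: u64, freshly_uploaded: bool) {
            if freshly_uploaded {
                // Upload really parses the Wasm on-chain, so charge for real.
                self.real_cpu += cpu;
            } else {
                // On-chain the module would come from the cache, so only shadow it.
                self.shadow_cpu += cpu;
            }
        }
    }

    fn main() {
        let mut b = Budget::default();
        b.charge_wasm_parse(10_000, true);  // upload-and-invoke: charged
        b.charge_wasm_parse(10_000, false); // invoke of already-cached Wasm: shadowed
        assert_eq!((b.real_cpu, b.shadow_cpu), (10_000, 10_000));
    }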

graydon (Author):

@dmkozh So... this seems to more or less work as you describe -- I can mostly rip out all the weird immutable 1-tx-cache logic, as well as all the ghastly emulation of its lifecycle we had to put in for recording mode -- but there is a small wrinkle: observations change. For two reasons:

  1. we're not charging the (small) costs of MeteredOrdMap operations on the cache anymore at all
  2. we're no longer able to distinguish, in unit tests that use recording mode, between:
    • "upload and invoke" cases which we should charge for and
    • "invoke something that would have been in the cache", which we want to suppress and direct the charge to the shadow budget for

We cannot distinguish case 2 -- again, only in unit tests -- because the distinguishing happens in frame::instantiate_vm and works by looking at "stuff that was in the storage snapshot", and in unit tests the storage snapshot is always empty. IOW unit tests actually are usually more like "upload and invoke" transactions; we've just been (somewhat inaccurately) observing some of them, in recording mode, as though they're recording things-that-were-just-uploaded.

I attempted to compensate for this by adding a new call to the testutils routine that registers a contract, to automatically build a module cache that contains the registered contracts, so we get module-cache hits in that case. But I'm not sure that's a good idea either -- recording mode in tests is a weird case where it's not really trying to be "like simulation" nor is it trying to be "like real apply"; it's just a sort of convenience for tests to not need footprints. Idk, open to suggestions here!

I've uploaded a version of this that has these changes. Will keep poking at them tomorrow.

Contributor:

I'll take a look at the code, and in any case maybe it would make more sense to leave the testing/simulation improvements for the followups. That said, here is how I conceptually see the budget accounting for tests:

  • A new test host is instantiated together with an empty mutable module cache
  • Any wasm upload (be it a test utility, or a host fn call) charges the Wasm parse and puts the module into cache
  • Any future call to the cached contract within this host reads the contract from the cache

It seems like this is as close as it gets to the on-chain behavior, unless I'm missing something. Basically, if ledger snapshots are not involved, then every test builds the storage from scratch anyway, so the scenario of 'invoking something that would have been in the cache' doesn't seem possible to me at all.

> recording mode in tests is a weird case where it's not really trying to be "like simulation" nor is it trying to be "like real apply", it's just a sort of convenience for tests to not need footprints

FWIW recording mode is a de-facto standard for the unit tests (I'm not even sure if the SDK provides a way to specify the footprint - it really wouldn't cover anything interesting about the contract logic). And we also want the metering to be sane. For sure, we'll miss XDR parsing costs here and there, and testing with a compiled-in contract is less accurate than testing with Wasms. However, I think we should strive to keep the main cost components correct (like charging/not charging for Wasm parsing, given a Wasm contract).

graydon (Author):

@dmkozh Ok, I've done this now (and fixed all the test fallout). One interesting thing is I had to bump up the margin of error on simulations fairly significantly. I think this is mostly due to the dominant-shared-cost of parsing modules going away. But I'm not 100% certain.

@graydon graydon marked this pull request as ready for review January 11, 2025 01:14
}

/// Instantiates a VM given the arguments provided in [`Self::new`],
/// or [`Self::new_from_module_cache`]
Contributor:

Nit: this comment is outdated.
Self::new_from_module_cache ==> Self::from_parsed_module

graydon (Author):

Fixed.

graydon (Author):

fixed / made-obsolete


@@ -336,6 +336,44 @@ pub fn invoke_host_function_with_trace_hook<T: AsRef<[u8]>, I: ExactSizeIterator
    base_prng_seed: T,
    diagnostic_events: &mut Vec<DiagnosticEvent>,
    trace_hook: Option<TraceHook>,
) -> Result<InvokeHostFunctionResult, HostError> {
Contributor:

Do we still need this version? Seems like it's sufficient to keep the _and_module_cache version with the optional module cache.
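For illustration, here is a self-contained sketch of the consolidation being suggested (names mirror the discussion, but the types and signatures are stubs, not the crate's real API): the older entry point can survive as a thin shim that forwards to the newer one with no pre-built module cache.

    // Stub types standing in for the real host types.
    struct ModuleCache;
    struct HostError;
    struct InvokeHostFunctionResult;

    // Newer entry point: optionally reuses a caller-provided module cache.
    fn invoke_host_function_with_module_cache(
        wasm: &[u8],
        cache: Option<&ModuleCache>,
    ) -> Result<InvokeHostFunctionResult, HostError> {
        // Real work elided: reuse `cache` if given, otherwise build a throwaway one.
        let _ = (wasm, cache);
        Ok(InvokeHostFunctionResult)
    }

    // Older entry point kept for compatibility: same behaviour, no cache supplied.
    fn invoke_host_function(wasm: &[u8]) -> Result<InvokeHostFunctionResult, HostError> {
        invoke_host_function_with_module_cache(wasm, None)
    }

    fn main() {
        let _ = invoke_host_function(b"\0asm");
    }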

graydon (Author):

I thought I'd keep it to avoid gratuitous breakage of clients (it is after all the API most are likely to call). I can remove it if you strongly prefer. @dmkozh this is kinda "your API", do you have a preference?

Contributor:

AFAIK only Core uses this as of today. Clippy won't allow us to break this though, unless we bump the version or use unstable-next-api feature. I think breaking this in p23 is very low risk (if someone has been using it, they must be advanced enough to migrate to the module cache interface as well). But we don't have to do that in this PR.

Contributor:

Ok makes sense, I don't feel strongly so feel free to keep it for backward compatibility.

)
}

pub fn parse_and_cache_module<Ctx: CompilationContext>(
Contributor:

Nit: this can be pub(crate) fn, and so can add_stored_contracts

graydon (Author):

No, a major point of this change is that the module cache can now be (publicly) operated by stellar-core before and after a host exists.
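To illustrate the workflow this enables, here is a minimal, self-contained sketch of how an embedder like stellar-core might use it (all names and types are simplified stand-ins, not the crate's actual API): the cache is created and populated outside any Host and then handed to each Host in turn.

    use std::collections::BTreeMap;
    use std::sync::{Arc, Mutex};

    type Hash = [u8; 32];
    struct ParsedModule;

    // Stand-in for the reusable, shareable ModuleCache.
    #[derive(Clone, Default)]
    struct ModuleCache {
        map: Arc<Mutex<BTreeMap<Hash, Arc<ParsedModule>>>>,
    }

    impl ModuleCache {
        // Analogue of parse_and_cache_module: compile once, insert by code hash.
        fn parse_and_cache_module(&self, hash: Hash, _wasm: &[u8]) {
            self.map.lock().unwrap().insert(hash, Arc::new(ParsedModule));
        }
    }

    fn main() {
        // At node startup, before any Host exists: pre-populate from the ledger.
        let cache = ModuleCache::default();
        for (hash, wasm) in [([0u8; 32], &b"\0asm\x01\0\0\0"[..])] {
            cache.parse_and_cache_module(hash, wasm);
        }

        // Later, each transaction's Host is given (a cheap clone of) the shared
        // cache; the cache itself outlives every individual Host.
        for _tx in 0..3 {
            let _per_host_cache = cache.clone();
            // let host = Host::with_module_cache(_per_host_cache, ...); // illustrative only
        }
    }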

@graydon graydon requested a review from a team as a code owner January 16, 2025 07:30
@graydon graydon force-pushed the reusable-module-cache branch from 60ed680 to cae6fff Compare January 16, 2025 07:33
@graydon graydon force-pushed the reusable-module-cache branch from 692d056 to 9515643 Compare January 18, 2025 07:45