
Embed public and private shares into DkgPrivateBegin and DkgEndBegin #98

Open

xoloki wants to merge 1 commit into main from embed-public-private-shares

Conversation

Collaborator

@xoloki xoloki commented Nov 22, 2024

This change addresses out-of-order messages that can arise when WSTS is used in a p2p network. It adds a config option to the Coordinator and Signer state machines that, when enabled, embeds the DkgPublicShares and DkgPrivateShares messages into DkgPrivateBegin and DkgEndBegin.
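
A rough sketch of how the option is meant to be used; the `embed_public_private_shares` field name matches the diff further down, but the config construction and coordinator constructor shown here are hypothetical stand-ins, not the crate's exact API:

```rust
// Illustrative only: the flag name comes from this PR's diff; the helper and
// the constructor call are stand-ins.
let mut config = make_config(); // hypothetical helper returning a WSTS Config
config.embed_public_private_shares = true;

// With the flag set, the coordinator copies the DkgPublicShares it gathered
// into DkgPrivateBegin and the DkgPrivateShares into DkgEndBegin, so a signer
// that missed the original broadcasts can still complete DKG.
let mut coordinator = fire::Coordinator::new(config); // hypothetical constructor
```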

@xoloki xoloki force-pushed the embed-public-private-shares branch from c90232f to 38cf8ee on December 2, 2024 16:06
@xoloki xoloki changed the title embed public and private shares into DkgPrivateBegin and DkgEndBegin Embed public and private shares into DkgPrivateBegin and DkgEndBegin Dec 2, 2024
@xoloki xoloki requested a review from djordon December 2, 2024 18:24
@xoloki xoloki marked this pull request as ready for review December 2, 2024 18:56
@xoloki xoloki force-pushed the embed-public-private-shares branch 2 times, most recently from e78101f to a4c33be on December 6, 2024 16:12
Contributor

@djordon djordon left a comment

Not done yet. I started this review a while ago and figured I'd post what I had written so far rather than wait until I finish the review.

src/state_machine/coordinator/fire.rs (resolved)
src/state_machine/coordinator/fire.rs (resolved)
@@ -891,7 +921,9 @@ impl<SignerType: SignerTrait> StateMachine<State, Error> for Signer<SignerType>
}
#[cfg(test)]
/// Test module for signer functionality
pub mod test {
mod test {
use hashbrown::HashMap;
Contributor

Can we use the standard library version of HashMap, generally speaking?

Collaborator Author

Originally, there was a goal to provide no_std support, since this is a relatively common use case (see #60) in Rust code generally, and in crypto code specifically. Hence the use of hashbrown for HashMap and HashSet, rand_core instead of rand, core for fmt, etc.

Now that the fire::Coordinator needs to do timeouts, though, this is no longer possible for the crate as a whole: there is no replacement for std::time::Instant in core or alloc. We could, however, still make the non-state-machine code no_std.

Contributor

Oh right.

We use Vec as well. I would think that the simplest path forward to support no_std is to go through the alloc crate, which has "replacements" for Vec but not for HashMap (I say replacements, but I think std re-exports the alloc types). alloc does provide BTreeMap, which might be a good move too. We've moved to using BTreeMap during serialization in the sbtc signers so that everything is deterministic; that might be worthwhile here as well.

But yeah, if the no_std version of things is something we want to do, we shouldn't use the std version of HashMap here; maybe switching to BTreeMap is the way to go. Thoughts?
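
A standalone illustration (not WSTS code) of why BTreeMap gives deterministic serialization where HashMap doesn't:

```rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    let pairs = [(3u32, "c"), (1, "a"), (2, "b")];

    // BTreeMap iterates in sorted key order, so any serialization that walks
    // the map produces the same bytes on every run and every machine.
    let ordered: BTreeMap<_, _> = pairs.iter().copied().collect();
    assert_eq!(ordered.keys().copied().collect::<Vec<_>>(), vec![1, 2, 3]);

    // HashMap's iteration order is unspecified (and randomized per process
    // with the default hasher), so naive serialization is not reproducible.
    let unordered: HashMap<_, _> = pairs.iter().copied().collect();
    println!("{:?}", unordered.keys().collect::<Vec<_>>());
}
```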

Collaborator Author

@xoloki xoloki Dec 8, 2024

I'm definitely on board with switching to BTreeMap where it makes sense. It's more space efficient anyway. Where are you currently running into this? The state machines are like half HashMap half BTreeMap, but the rest of the code is pure HashMap.

Thinking more about no_std, I looked over p256k1 and realized that it's chock full of std, over 1k LOC. So until we can move to k256, I think we should stop worrying about no_std and just do what's right.

Contributor

I'm definitely on board with switching to BTreeMap where it makes sense. It's more space efficient anyway. Where are you currently running into this?

It's not an issue that we're running into with WSTS. But we serialize our messages as protobufs before broadcasting them on the network. The protobuf spec doesn't have a canonical serialization, so we've picked one serialization implementation so that things are deterministic and so on, and converting maps to BTreeMaps before serialization is part of that process. Well, we haven't proved that our specific implementation is deterministic yet (stacks-network/sbtc#1001), but that's the goal.

Thinking more about no_std, I looked over p256k1 and realized that it's chock full of std, over 1k LOC. So until we can move to k256, I think we should stop worrying about no_std and just do what's right.

Yeah this makes sense to me.

I'm not sure where I'd land on the BTreeMap vs std HashMaps here. My guess is that users would usually have a small number of signers (maybe hundreds or thousands) but perhaps a large number of keys (thousands or maybe tens of thousands). With just that I'd lean BTreeMap but I suppose either works. Not sure if it's worth the effort to benchmark though.


No need for any of these changes in this PR, I was just asking. I think I had asked before and forgot what the resolution was.


Is moving to k256 a goal?

Collaborator Author

Yeah the BTreeMap discussion is def orthogonal to this PR.

k256 is pure Rust, so there are definitely some advantages to moving towards it; p256k1 has bindgen and FFI that make upgrading the underlying secp256k1 library difficult. But the performance benefits of p256k1 are significant, especially with the multiexponentiation support.

Contributor

Oh, is the one added in RustCrypto/elliptic-curves#955 just super slow compared to libsecp256k1?

Collaborator Author

It's 3x slower. Which is not a deal breaker for sBTC, and maybe not even for Nakamoto (when they get their coordinator issues sorted out). It was definitely a dealbreaker when DKG was taking 90 minutes, but those days are long gone.

Collaborator Author

@xoloki xoloki Dec 9, 2024

AFAICT k256 is just using precomputed lookup tables for lincomb, whereas secp256k1 is doing a full Pippenger algo which gives a logarithmic speedup.

https://github.com/RustCrypto/elliptic-curves/blob/master/k256/src/arithmetic/mul.rs#L294

https://github.com/Trust-Machines/p256k1/blob/master/p256k1/_secp256k1/src/ecmult_impl.h#L498

Which makes sense, because p256k1 ecmult (which uses precomputed tables) is basically the same performance as k256 lincomb.
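
For context, the standard asymptotics behind that "logarithmic speedup" (textbook multi-scalar-multiplication costs, not measurements from either codebase):

```latex
\underbrace{O(n \cdot b)}_{\text{per-point double-and-add / tables}}
\quad\text{vs.}\quad
\underbrace{O\!\left(\frac{n \cdot b}{\log n}\right)}_{\text{Pippenger bucket method}}
\qquad\text{for } \sum_{i=1}^{n} k_i P_i \text{ with } b\text{-bit scalars } k_i
```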

@xoloki xoloki force-pushed the embed-public-private-shares branch from 3a6289d to 95fe283 on December 19, 2024 13:48
@xoloki xoloki self-assigned this Dec 19, 2024
Contributor

@djordon djordon left a comment

Honestly, I'm not entirely sure that this is the way forward to solve the underlying issue. And since we don't have a pressing need for an immediate solution, I'd like us to think through what the "right" solution really is. It might be this but I'm not so sure since I haven't taken any time to think it through.

BTW, it's a good idea for the sbtc signers to cap the message size, and I now think that they do so by default, whereas before I thought they didn't.

src/net.rs (resolved)
TryFromInt,
}

impl From<AesGcmError> for Error {
Contributor

I think you can replace this with AesGcm(#[from] AesGcmError) above, no?

Contributor

Yes, we should.

Collaborator Author

No, we can't:

83 |     AesGcm(#[from] AesGcmError),
   |              ^^^^ method cannot be called on `&Error` due to unsatisfied trait bounds
   |
  ::: /Users/axoloki/.cargo/registry/src/index.crates.io-6f17d22bba15001f/aead-0.5.2/src/lib.rs:79:1
   |
79 | pub struct Error;
   | ---------------- doesn't satisfy `aes_gcm::Error: AsDynError<'_>` or `aes_gcm::Error: StdError`
   |
   = note: the following trait bounds were not satisfied:
           `aes_gcm::Error: StdError`
           which is required by `aes_gcm::Error: AsDynError<'_>`
           `&aes_gcm::Error: StdError`
           which is required by `&aes_gcm::Error: AsDynError<'_>`

thiserror is awesome, but sadly we can't do much about upstream errors that don't satisfy the trait bounds.

Collaborator Author

Fortunately, the AES DOS fix required me to define an EncryptionError type, which takes care of wrapping aes_gcm::Error. So we only have to do the manual chaining once.
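
For reference, a minimal sketch of the manual chaining (names mirror the diff; the PR's actual fix routes this through its new EncryptionError type rather than exactly this code):

```rust
use thiserror::Error;

// `aes_gcm::Error` comes from the `aead` crate (see the compiler output above)
// and does not satisfy the `std::error::Error` bound here, so `#[from]` /
// `#[source]` can't be used on it.
use aes_gcm::Error as AesGcmError;

#[derive(Debug, Error)]
pub enum Error {
    /// Plain wrapper variant: Display is built from the inner error's Debug output.
    #[error("AES-GCM error: {0:?}")]
    AesGcm(AesGcmError),
}

// The conversion thiserror would normally derive has to be written by hand.
impl From<AesGcmError> for Error {
    fn from(e: AesGcmError) -> Self {
        Error::AesGcm(e)
    }
}
```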

if !self.embed_public_private_shares {
    self.dkg_public_share(dkg_public_shares)
} else {
    Ok(vec![])
Contributor

@jferrant jferrant Dec 20, 2024

Would this warrant a warning if we are getting a DkgPublicShares message? EDIT: Oh, we still expect them to send the DkgPublicShares and DkgPrivateShares, but we just wait for DkgEndBegin to actually do anything with them? Just making sure I understand this right.

Collaborator Author

Yeah, the signers will still see the public and private shares coming from other signers; they just ignore them if configured to wait for the coordinator messages that embed them.

Contributor

@jferrant jferrant left a comment

Just a question for my own understanding, but looks good.

Collaborator Author

xoloki commented Dec 21, 2024

I'd like us to think through what the "right" solution really is. It might be this but I'm not so sure since I haven't taken any time to think it through.

Not sure when this comment was made relative to the last time we discussed it in standup. Are you still on the fence about the proper way to do this?

Given sBTC networking characteristics, I don't see any other real option. The current band aid of sprinkling sleeps throughout the code is extremely suboptimal, and will invariably cause problems.

BTW, it's a good idea for the sbtc signers to cap the message size, and I now think that they do so by default, whereas before I thought they didn't.

Even with 100 signers/key shares, the size needed will be

    100 * 100 * 32 ~ 320k

That will comfortably fit under a 1MB cap, much less the 2MB which was last mentioned.
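
Spelling that estimate out (assuming one 32-byte secp256k1 scalar per private share):

```latex
100 \text{ signers} \times 100 \text{ key shares} \times 32 \text{ bytes/share}
= 320{,}000 \text{ bytes} \approx 0.32\ \text{MB}
```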

@xoloki xoloki force-pushed the embed-public-private-shares branch from 95fe283 to 252b2a0 on December 21, 2024 16:42
@@ -229,10 +229,18 @@ impl<Aggregator: AggregatorTrait> Coordinator<Aggregator> {
            dkg_id = %self.current_dkg_id,
            "Starting Private Share Distribution"
        );
        let dkg_public_shares = if self.config.embed_public_private_shares {
            (0..self.config.num_signers)
                .map(|id| (id, self.dkg_public_shares[&id].clone()))
Contributor

Let's refrain from direct array access.

Collaborator Author

The frost::Coordinator requires full participation at every stage of DKG and Sign operations, so the direct array access here is safe.

But let's be pedantic anyway :)

Contributor

Yeah, I think the issue is that we need to be 100% confident that there is no way for this code to panic. This might be true now, but when we modify some part of this code, that may no longer be the case. And if that happens I'd rather return an error than panic.
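
A minimal sketch of the non-panicking lookup (the error variant used here is illustrative; the commit below adds a MissingPrivateShares error for the analogous private-share case):

```rust
// Sketch only: replaces the panicking `self.dkg_public_shares[&id]` index
// with a lookup that surfaces a missing signer as an error instead.
let dkg_public_shares = (0..self.config.num_signers)
    .map(|id| {
        self.dkg_public_shares
            .get(&id)
            .cloned()
            .map(|shares| (id, shares))
            .ok_or(Error::MissingPublicShares(id)) // illustrative variant
    })
    .collect::<Result<Vec<_>, Error>>()?;
```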

Collaborator Author

Done.

@@ -251,10 +259,18 @@ impl<Aggregator: AggregatorTrait> Coordinator<Aggregator> {
            dkg_id = %self.current_dkg_id,
            "Starting DKG End Distribution"
        );
        let dkg_private_shares = if self.config.embed_public_private_shares {
            (0..self.config.num_signers)
                .map(|id| (id, self.dkg_private_shares[&id].clone()))
Contributor

Same here

Collaborator Author

Done.


Contributor

djordon commented Dec 21, 2024

Not sure when this comment was made relative to the last time we discussed it in standup. Are you still on the fence about the proper way to do this?

It was a day or so before we spoke about it at stand-up yesterday. I haven't given it enough thought though. I think the proposed approach works and fixes our issue, but I just need some more time to give it some thought.

Given sBTC networking characteristics, I don't see any other real option. The current band aid of sprinkling sleeps throughout the code is extremely suboptimal, and will invariably cause problems.

There are other options, but yeah, many of them are worse than the solution proposed here. I'll give it some more thought; I expect to arrive at the same place that you're at.

…messages

make embedding of public and private shares configurable

reenable start_private_shares test now that embedding is configurable

inc major semver for backwards incompatible API changes

test modules do not need to be public unless someone is using them externally

allow a static mut ref to an atomic variable used for test log initialization; allow many arguments to Signer::new

allow many arguments to setup_with_timeouts

fmt and code change for rng PR rebase

rebase fixes

add TryFromInt to signer and coordinator errors so we can remove more unwrap calls

remove unwrap in signer state machine

remove unnecessary AesGcm error since we now have EncryptionError; remove bracket access from frost coordinator

add MissingPrivateShares error; remove bracket access in frost Coordinator
@xoloki xoloki force-pushed the embed-public-private-shares branch from 3335f25 to 89a0811 on December 21, 2024 17:14