feat(ckbtc): bump limit on concurrent withdrawals #4804
Thanks for the PR @mducroux, overall LGTM! Just one minor comment from my side.
nit: I would add a failure testcase here
Good idea, I've added it in the unit tests in rs/bitcoin/ckbtc/minter/src/updates/tests.rs
Thanks @mducroux for this PR! Mostly minor comments; the main one concerns which buckets we need to measure the latency of tECDSA.
@@ -328,6 +328,9 @@ pub async fn sign_with_ecdsa(
 sign_with_ecdsa, EcdsaCurve, EcdsaKeyId, SignWithEcdsaArgument,
 };

+// Record start time of method execution for metrics
nit: I would remove that comment as I don't think it brings much (one can just check where that variable is used)
 thread_local! {
 pub static GET_UTXOS_CLIENT_CALLS: Cell<u64> = Cell::default();
 pub static GET_UTXOS_MINTER_CALLS: Cell<u64> = Cell::default();
 pub static UPDATE_CALL_LATENCY: RefCell<BTreeMap<NumUtxoPages,LatencyHistogram>> = RefCell::default();
 pub static GET_UTXOS_CALL_LATENCY: RefCell<BTreeMap<(NumUtxoPages, CallSource),LatencyHistogram>> = RefCell::default();
 pub static GET_UTXOS_RESULT_SIZE: RefCell<BTreeMap<CallSource,NumUtxosHistogram>> = RefCell::default();
+pub static SIGN_WITH_ECDSA_LATENCY: RefCell<BTreeMap<MetricsResult, LatencyHistogram>> = RefCell::default();
Note that this will use the same values for BUCKETS_MS as for UPDATE_CALL_LATENCY; however, the situation is different:
- UPDATE_CALL_LATENCY is used to measure the latency inside update_balance, which may involve several cross-net calls to the bitcoin canister, hence the expected latency here could be quite high.
- SIGN_WITH_ECDSA_LATENCY measures the latency of sign_with_ecdsa, which is on the same subnet (since our canisters are on the fiduciary subnet), so that call does not involve cross-net calls and the latency could be much lower.
I think it would be best to ask #eng-crypto what they think reasonable buckets would look like for end-to-end latency of tECDSA on the fiduciary subnet (cc @andreacerulli).
The signature latency distribution from this dashboard shows values distributed roughly as follows: min: 0.5s, avg: 2s, max: 15s. This is for signatures coming from pre-generated values. Based on this, I would suggest 8 buckets with roughly exponential values: [1, 2, 4, 6, 8, 12, 20, inf]. WDYT @gregorydemay?
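To make the bucket suggestion above concrete, here is a minimal sketch of how a latency observation would be assigned to one of the eight proposed buckets. It assumes the suggested values are upper bounds in seconds; the type `LatencyHistogramSketch` and the constant `BUCKETS_S` are hypothetical names for illustration and are not the minter's actual `LatencyHistogram` or `BUCKETS_MS`.

```rust
/// Upper bounds (inclusive) of the finite buckets, ASSUMED to be in seconds,
/// taken from the suggestion [1, 2, 4, 6, 8, 12, 20, inf].
/// The final "inf" bucket is implicit.
const BUCKETS_S: [u64; 7] = [1, 2, 4, 6, 8, 12, 20];

/// Hypothetical histogram type, for illustration only.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct LatencyHistogramSketch {
    /// One counter per finite bucket, plus one for the +Inf bucket.
    counts: [u64; 8],
    /// Sum of all observed latencies, in milliseconds.
    sum_ms: u64,
}

impl LatencyHistogramSketch {
    /// Records one observation in the first bucket whose bound is >= the latency.
    pub fn observe_ms(&mut self, latency_ms: u64) {
        let idx = BUCKETS_S
            .iter()
            .position(|&bound_s| latency_ms <= bound_s * 1000)
            .unwrap_or(BUCKETS_S.len()); // falls into the +Inf bucket
        self.counts[idx] += 1;
        self.sum_ms += latency_ms;
    }

    pub fn counts(&self) -> &[u64; 8] {
        &self.counts
    }

    pub fn sum_ms(&self) -> u64 {
        self.sum_ms
    }
}
```

With these bounds, the reported minimum (0.5s), average (2s) and maximum (15s) would land in the first, second and seventh buckets respectively, leaving the +Inf bucket for outliers above 20s.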
@@ -22,7 +22,7 @@ use icrc_ledger_types::icrc1::transfer::{TransferArg, TransferError};
 use icrc_ledger_types::icrc2::transfer_from::{TransferFromArgs, TransferFromError};
 use num_traits::cast::ToPrimitive;

-const MAX_CONCURRENT_PENDING_REQUESTS: usize = 1000;
+const MAX_CONCURRENT_PENDING_REQUESTS: usize = 5000;
What is the reason to bump this? Did we observe that the previous limit (1000) was reached, or close to being reached?
If I'm not mistaken, the canister message queue has a hard limit of 500. If each pending request represents a pending inter-canister call which takes up one spot in the queue (reservation for the return result), then it is already impossible to reach 1000.
Turns out that I was mistaken... the requests are actually batched with a max batch size of 100, and batches are signed and submitted every 5 seconds. So it is unlikely to hit the message queue limit, unless each ECDSA sign call takes too long to complete (and there is another limit on the total number of outstanding signature requests imposed by the threshold signing protocol).
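As a rough back-of-the-envelope check of the batching described above, the following sketch computes how many batches a backlog needs and how many seconds of 5-second ticks that corresponds to. It assumes one full batch is submitted per tick, no new requests arrive, and signing never delays a tick; the function names are hypothetical and only the two constants come from the comment.

```rust
/// Maximum number of requests per batch, as stated in the review comment.
const MAX_BATCH_SIZE: u64 = 100;
/// Interval between batch submissions in seconds, as stated in the review comment.
const BATCH_INTERVAL_SECS: u64 = 5;

/// Number of batches needed to cover `pending` requests (ceiling division).
pub fn batches_needed(pending: u64) -> u64 {
    (pending + MAX_BATCH_SIZE - 1) / MAX_BATCH_SIZE
}

/// Idealized number of seconds of ticks needed to submit `pending` requests,
/// ASSUMING one full batch per tick and no new arrivals.
pub fn ticks_secs(pending: u64) -> u64 {
    batches_needed(pending) * BATCH_INTERVAL_SECS
}
```

Under these assumptions, a full backlog at the old limit of 1000 takes 10 batches (50 seconds of ticks), while a full backlog at the new limit of 5000 takes 50 batches (250 seconds of ticks).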
Just another thought, maybe add some metrics to track failure to sign, and failure to send? It'll be interesting to see if we actually hit the limit of making signatures.
Failure to sign should already be there, but I can indeed add a failure-to-send metric.
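One possible shape for such failure metrics, sketched in the same `thread_local!` plus `Cell<u64>` style as the existing call counters in the diff. The names `SIGN_FAILURES`, `SEND_FAILURES`, `SubmitStage`, `record_failure` and `failure_counts` are hypothetical and are not taken from the minter's code.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical counters, mirroring the style of GET_UTXOS_CLIENT_CALLS.
    static SIGN_FAILURES: Cell<u64> = Cell::default();
    static SEND_FAILURES: Cell<u64> = Cell::default();
}

/// Stage of the withdrawal pipeline at which a failure occurred (illustrative).
pub enum SubmitStage {
    Sign,
    Send,
}

/// Increments the counter for the stage that failed.
pub fn record_failure(stage: SubmitStage) {
    match stage {
        SubmitStage::Sign => SIGN_FAILURES.with(|c| c.set(c.get() + 1)),
        SubmitStage::Send => SEND_FAILURES.with(|c| c.set(c.get() + 1)),
    }
}

/// Returns `(sign_failures, send_failures)` for export to a metrics endpoint.
pub fn failure_counts() -> (u64, u64) {
    (
        SIGN_FAILURES.with(|c| c.get()),
        SEND_FAILURES.with(|c| c.get()),
    )
}
```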
XC-322 (ckBTC): Bump limit on concurrent withdrawals (1,000 -> 5,000).
These changes allow the ckBTC minter to process 5,000 concurrent BTC withdrawal requests, compared to 1,000 before. Additionally, this PR adds latency metrics for the sign_with_ecdsa management call (for both the success and failure cases) used in the withdrawal process. This will help track the behavior of tECDSA in case of load spikes.
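The idea of tracking latency separately for the success and failure cases can be sketched as a map keyed by call outcome, similar in spirit to the `BTreeMap<MetricsResult, LatencyHistogram>` in the diff. This is an illustrative sketch, not the PR's implementation: `CallOutcome`, `observe_call` and `observations` are hypothetical names, a plain `Vec<u64>` stands in for the histogram, and timestamps are passed in explicitly rather than read from the canister's clock.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Hypothetical outcome label, standing in for the PR's MetricsResult.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum CallOutcome {
    Success,
    Failure,
}

thread_local! {
    // Raw latencies in nanoseconds per outcome; a real metric would use a histogram.
    static LATENCIES_NS: RefCell<BTreeMap<CallOutcome, Vec<u64>>> = RefCell::default();
}

/// Records the elapsed time of a call under the label matching its result.
pub fn observe_call<T, E>(start_ns: u64, end_ns: u64, result: &Result<T, E>) {
    let outcome = if result.is_ok() {
        CallOutcome::Success
    } else {
        CallOutcome::Failure
    };
    LATENCIES_NS.with(|m| {
        m.borrow_mut()
            .entry(outcome)
            .or_default()
            .push(end_ns.saturating_sub(start_ns))
    });
}

/// Returns a copy of the latencies recorded so far for the given outcome.
pub fn observations(outcome: CallOutcome) -> Vec<u64> {
    LATENCIES_NS.with(|m| m.borrow().get(&outcome).cloned().unwrap_or_default())
}
```

Keeping the two outcomes in separate series means slow failures (for example, calls that are rejected after a long wait) do not distort the latency distribution of successful signatures.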