Validator uptime is a hard requirement for a node operator, both in terms of SLA and the validator's reputation
And failover is an important part of running and maintaining a blockchain validator
We tried using a Fullnode (FN) as a hot failover backup node, but it didn't work as expected
Looking through Discord, I found some useful discussions about the best way to do a failover, possible proposals, what to avoid, etc.
I'll put a tl;dr of what I could gather from the discussion here, so we can track this feature better (the discussion is too important to get lost in Discord noise)
It's not safe to do the failover mid-epoch: the backup validator will try to rebuild checkpoints that the primary node had already built, causing a mismatch and a crash (which we have faced in the past)
There was some discussion about having a foundation-funded validator node for the sole purpose of taking regular snapshots of the consensus_db and making them publicly available for others to use during failover
But this approach wouldn't work, because a db generated with one set of keys would mismatch when plugged in with another operator's keys
For a mid-epoch migration, there was a proposal for a read-only replica system: the backup validator would run as a read-only replica and keep syncing
There was also a mention of the validator producing db checkpoints that could be used for backup and restore - it could be a low-effort, short-term solution, but it needs to be checked whether it would even work
A possible solution we were thinking of is the ability of the validator node to take point-in-time database snapshots (not only at epoch boundaries), so that the operator could restore the snapshot on the backup node and fail over mid-epoch
I guess this aligns with the db checkpoints approach
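To make the idea concrete, here is a minimal sketch of the snapshot/restore workflow, assuming the consensus state lives in a plain directory. The paths, function names, and `consensus_db` layout are my assumptions for illustration, not the actual Sui implementation; a real implementation would use an engine-level checkpoint (e.g. RocksDB's Checkpoint API) so the copy is consistent while the node keeps running:

```python
import shutil
from pathlib import Path

def take_snapshot(db_dir: str, snapshot_dir: str) -> None:
    """Copy the (stopped or quiesced) database directory to a
    snapshot location, keeping only the latest snapshot."""
    src, dst = Path(db_dir), Path(snapshot_dir)
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)

def restore_snapshot(snapshot_dir: str, db_dir: str) -> None:
    """Replace the backup node's database with the snapshot
    before starting it with the validator keys."""
    src, dst = Path(snapshot_dir), Path(db_dir)
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
```

The point is only that the operator-facing workflow is two steps: snapshot on (or near) the primary, restore on the backup, then start the backup with the keys.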
For now, it seems the only safe way is to do the failover at epoch boundaries. Since there is no automatic slashing, it would not result in any loss of rewards (unless other validators report you for downtime, so make sure to announce it on Discord)
But it would be great to have a well-tested failover strategy that allows node operators to better maintain their nodes and improves the network's resiliency long term
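The epoch-boundary approach above can be sketched as a simple wait loop: poll the current epoch and only trigger the switch once it advances. The `get_epoch` and `do_failover` callables are assumptions for illustration (e.g. a wrapper around the node's RPC and the operator's stop/start scripts), not an actual Sui API:

```python
import time
from typing import Callable

def failover_at_epoch_boundary(
    get_epoch: Callable[[], int],
    do_failover: Callable[[], None],
    poll_seconds: float = 5.0,
) -> int:
    """Block until the epoch advances, then run the failover steps
    (stop primary, start backup with the validator keys).
    Returns the epoch in which the backup takes over."""
    start_epoch = get_epoch()
    while True:
        current = get_epoch()
        if current > start_epoch:  # boundary crossed: safe to switch
            do_failover()
            return current
        time.sleep(poll_seconds)
```

Waiting for the boundary avoids the checkpoint-mismatch crash described above, at the cost of up to one epoch of downtime.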
Thanks a lot for the detailed writeup and summary!
Downloading consensus and checkpoint data from other Sui validators is supported, including downloading a validator's own blocks and checkpoints. But Sui does not handle dropping data from an unfinished epoch gracefully, because it is tricky to sort out how to avoid equivocation. When Sui implements parallel execution, we may be able to consider allowing some host failures without bringing down the whole cluster.
For mid-epoch failovers, is it only necessary when losing the validator's machine, or would it be used more frequently for planned migrations with graceful shutdown? And how often would you perform mid-epoch migrations?
For planned migrations, we could manage with graceful shutdowns at epoch boundaries
But the mid-epoch failover is most critical for hardware failures, where we expect the primary node to be unreachable, and the only way to bring back the validator is to start the backup node with the keys
Adding another point: there can be times (and we have seen this before) when we're not sure whether the primary validator could come back up after the failover
I guess this is already taken care of by the on-chain network addresses (with DNS or bare IPs), but it's something to keep in mind: the possibility of two validator nodes running with the same keys - unintentionally, of course
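One simple operator-side safeguard against that scenario is to refuse to start the backup while the primary still answers a liveness probe. This is a hypothetical sketch (the host/port and helper names are my own, and a network partition can still fool a TCP probe, so it reduces rather than eliminates the risk and is no substitute for proper fencing):

```python
import socket

def primary_looks_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Best-effort liveness probe: can we open a TCP connection to
    the primary's health/metrics port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def safe_to_start_backup(host: str, port: int) -> bool:
    # Only start the backup validator if the primary is unreachable.
    return not primary_looks_alive(host, port)
```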
Here is the link to the discord thread: https://discord.com/channels/916379725201563759/1233812546117435403