Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raftstore: gc abnormal snapshots and destroy peer if failed to apply snapshots. #16992

Open
wants to merge 20 commits into
base: master
Choose a base branch
from

Conversation

LykxSassinator
Copy link
Contributor

@LykxSassinator LykxSassinator commented May 10, 2024

What is changed and how it works?

Issue Number: Close #15292

What's Changed:

Previously, there were pending tasks to address the scenario where TiKV would panic if applying snapshots failed due to abnormal conditions such as IO errors or unexpected issues.

This pull request resolves the issue by introducing additional traits tombstone: bool to SnapshotApplied, indicating whether the failure occurred due to abnormal snapshots.
Additionally, the abnormal peer will send ExtraMessageType::MsgForcelyRemovePeerRequest to related leader of this region, trigger a new ConfChange with RemoveNode to remove the associated peer. Finally, this peer will be destroyed later to ensure the cluster will add one new peer by sending a fresh snapshot to the affected node.

Replace `SnapshotApplied` with `SnapshotApplied { peer_id: u64, tombstone: bool}`. And if `tombstone` == true, the
relative peer will be automatically GCed.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None.

…a snapshot.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Copy link
Contributor

ti-chi-bot bot commented May 10, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

Copy link
Contributor

ti-chi-bot bot commented May 10, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@glorv
Copy link
Contributor

glorv commented May 14, 2024

I have 2 questions about this PR:

  1. Is it possible to directly retry applying the snapshot if the failure is caused by unexpected(maybe some IO layer) error?
  2. Why need to tombstone the peer after applying snapshot failed? Why not just switch the peer status to normal and so the leader should start a new snapshot automatically.

@LykxSassinator
Copy link
Contributor Author

I have 2 questions about this PR:

  1. Is it possible to directly retry applying the snapshot if the failure is caused by unexpected(maybe some IO layer) error?
  2. Why need to tombstone the peer after applying snapshot failed? Why not just switch the peer status to normal and so the leader should start a new snapshot automatically.
  1. Nope, it's caused by loading abnormal blocks. According to the logs in the issue tikv panic repeatedly after this tikv recover from io hang #15292 and tikv panic repeatedly with “\"[region 16697056] 19604003 applying snapshot failed\"” after down this tikv for 20mins and recover #16958, it shows that the given snapshot has abnormal file blocks, causing the failure of the applying.
  2. For safety. Tombstone the peer and destroy it will assure that this TiKV node do not have any other remains data / meta data about this peer.

@LykxSassinator
Copy link
Contributor Author

LykxSassinator commented May 14, 2024

By the way, as for the point 2, I agree what u mentioned before. But for safety, this pr takes current implementation.

Why not just switch the peer status to normal and so the leader should start a new snapshot automatically.

I'll have a try for the point 2 u mentioned and do some extra tests on it to find the more appropriate way.

@overvenus
Copy link
Member

I don't think we can directly mark it as Tombstone or Normal, because both options violate the current raft state machine protocol.

  • Tombstone indicates that the peer has been fully removed, including its Raft membership and data. However, in this case, the peer remains a valid member.
  • Normal means that the peer has all data up to its last commit index. But, in this case, it does not, as its last commit index has been updated to the snapshot index.

There are two ways to fix the panic:

  1. Have PD remove this peer via a confchange, which I believe is the simplest solution.
  2. Introduce a new RPC to instruct the leader to resend the snapshot, which may change lots of code.

@glorv
Copy link
Contributor

glorv commented May 14, 2024

Introduce a new RPC to instruct the leader to resend the snapshot, which may change lots of code.

Why need this extra RPC? At the leader side, it will switch the peer's state to normal after finishing send the snapshot. At the follower side, when apply snapshot failed, it is also doable to restore the raft state to its previous state before this snapshot. Thus, If I understand correctly, the leader should start another snapshot without any extra operation?

@overvenus
Copy link
Member

overvenus commented May 14, 2024

At the follower side, when apply snapshot failed, it is also doable to restore the raft state to its previous state before this snapshot.

Do you mean persisting the previous state so it can be restored even after restarting TiKV?

That's doable (without introducing a new RPC), but it does add extra complexity to raftstore (and we'll need to review every code path related to snapshot handling).

@hbisheng
Copy link
Contributor

Out of curiosity and just trying to learn, what are the typical scenarios where applying snapshots may fail? In those cases, would it help if we retry a few times? If we destroy the peer and add a new one, it's possible that the new peer may hit the same issue again, right?

@LykxSassinator
Copy link
Contributor Author

LykxSassinator commented May 24, 2024

Out of curiosity and just trying to learn, what are the typical scenarios where applying snapshots may fail? In those cases, would it help if we retry a few times? If we destroy the peer and add a new one, it's possible that the new peer may hit the same issue again, right?

Yep. One typical case is that loading snapshots encounters IO errors. In this case, the issue can be attributed to physical errors on the disk or system errors, causing the TiKV panic.
And as for whether the retrying is success or not, it depends on the root cause of the IO error. If it's due to physical errors on the disk, retries may not always be successful. But if it's a system error or bugs, retries can have a higher chance of success.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@@ -244,6 +244,8 @@ procinfo = { git = "https://github.com/tikv/procinfo-rs", rev = "7693954bd1dd86e
# After the PR to rust-rocksdb is merged, remember to comment this out and run `cargo update -p rocksdb`.
# [patch.'https://github.com/tikv/rust-rocksdb']
# rocksdb = { git = "https://github.com/your_github_id/rust-rocksdb", branch = "your_branch" }
[patch.'https://github.com/pingcap/kvproto']
kvproto = { git = "https://github.com/LykxSassinator/kvproto", branch = "tombstone_peer" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be updated with latest if this pr is approved.

@@ -99,3 +99,7 @@ exceptions = [
unknown-git = "deny"
unknown-registry = "deny"
allow-org = { github = ["tikv", "pingcap", "rust-lang"] }
# Allow list.
allow-git = [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be cleared when this pr is approved.

self.fsm.has_ready = true;
// If failed on applying snapshot, it should tomebstone the peer.
self.fsm.peer.should_tombstone =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not send msg here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent the loss of messages in case of sending failures due to unexpected issues. Furthermore, in the event of message loss, this abnormal peer will be left in a dangled state and unable to process any messages from the leader.

The flag should_tombstone is used to ensure that the abnormal peer continues to send the TombStone message to the leader until it is removed from the raft group.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still can send it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the abnormal peer will stop sending messages once it is destroyed, which is triggered by receiving a RaftMessage with the flag RaftMessage::is_tombstone set to true.

Mark `unavailable == true` if the abnormal peer failed to apply snapshot and
could not be normarlly removed by ConfChange.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
components/raftstore/src/store/fsm/peer.rs Outdated Show resolved Hide resolved
// If failed on applying snapshot, send ConfChange to the leader to make the
// region tombstone the peer.
if self.fsm.peer.should_tombstone {
self.fsm.peer.send_tombstone_peer_msg(self.ctx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how to avoid sending too many msgs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the abnormal peer is destoryed, it will not send message anymore.

self.fsm.has_ready = true;
// If failed on applying snapshot, it should tomebstone the peer.
self.fsm.peer.should_tombstone =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still can send it here

components/raftstore/src/store/fsm/peer.rs Outdated Show resolved Hide resolved
This reverts commit cad382d.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@LykxSassinator
Copy link
Contributor Author

This pr needs extra support from PD side to make the implementation more valid.

Hold until tikv/pd#8266 is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tikv panic repeatedly after this tikv recover from io hang
5 participants