raftstore: introduce a precheck process before snapshot generation #17019

hbisheng · 2024-05-15T07:11:57Z

What is changed and how it works?

Issue Number: Close #15972

What's Changed:

This commit implements a snapshot precheck process that leverages the snapshot
concurrency limiter introduced in #17015. The leader proceeds to generate the
snapshot only after it receives a precheck success message from the follower.
The precheck request and response are sent via the raft `ExtraMessage`, for
which two new message types are added in kvproto.
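
As a rough illustration of the handshake described above, here is a minimal sketch. It is not the code in this PR: `PrecheckMsg`, `Follower`, `Leader`, and all of their fields and methods are hypothetical names, and the follower's limiter is reduced to a plain counter rather than the limiter from #17015.

```rust
/// Hypothetical stand-ins for the two new `ExtraMessage` types; the real
/// kvproto names and fields are not shown in this description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PrecheckMsg {
    Request { region_id: u64 },
    Response { region_id: u64, passed: bool },
}

/// Follower side: grants a precheck only while it has spare capacity.
/// (Assumption: a plain counter stands in for the concurrency limiter.)
struct Follower {
    max_concurrent: usize,
    in_flight: usize,
}

impl Follower {
    fn new(max_concurrent: usize) -> Self {
        Follower { max_concurrent, in_flight: 0 }
    }

    /// Answers a precheck request; any other message is ignored.
    fn on_message(&mut self, msg: PrecheckMsg) -> Option<PrecheckMsg> {
        match msg {
            PrecheckMsg::Request { region_id } => {
                let passed = self.in_flight < self.max_concurrent;
                if passed {
                    // Reserve a slot for the snapshot that is expected to follow.
                    self.in_flight += 1;
                }
                Some(PrecheckMsg::Response { region_id, passed })
            }
            PrecheckMsg::Response { .. } => None,
        }
    }
}

/// Leader side: generates the snapshot only after a successful precheck.
struct Leader {
    region_id: u64,
    snapshot_generated: bool,
}

impl Leader {
    fn new(region_id: u64) -> Self {
        Leader { region_id, snapshot_generated: false }
    }

    /// Builds the precheck request that is sent before any snapshot work.
    fn precheck_request(&self) -> PrecheckMsg {
        PrecheckMsg::Request { region_id: self.region_id }
    }

    /// Starts snapshot generation only on a passing response for this region.
    fn on_message(&mut self, msg: PrecheckMsg) {
        if let PrecheckMsg::Response { region_id, passed: true } = msg {
            if region_id == self.region_id {
                self.snapshot_generated = true;
            }
        }
    }
}
```

With a follower created as `Follower::new(1)`, the first leader to send `precheck_request()` gets a passing response and generates its snapshot, while a second leader is refused and generates nothing.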

Related changes

PR to update pingcap/docs/pingcap/docs-cn:
Need to cherry-pick to the release branch

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Release note

None

ti-chi-bot · 2024-05-15T07:11:59Z

[REVIEW NOTIFICATION]

This pull request has been approved by:

Connor1996

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

ti-chi-bot · 2024-05-15T07:12:06Z

Hi @hbisheng. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This commit implements a snapshot precheck process that leverages the snapshot concurrency limiter introduced in tikv#17015. The leader proceeds to generate the snapshot only after it receives a precheck success message from the follower. The precheck request and response are sent via the raft `ExtraMessage`, for which two new message types are added in kvproto.

TODO:
1. Handle the case where the follower is on an old version without the snapshot precheck API.
2. Throttle the snapshot precheck requests, if necessary.

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

hbisheng · 2024-05-22T05:37:08Z

I just realized a problem with the current concurrency limiter. If a single region leader keeps sending precheck requests to the same receiver (this may happen if the precheck response is somehow lost on the network), it would consume all reservations, thereby blocking other snapshots. I believe the concurrency limiter needs to be updated to deduplicate requests based on region_id.

Created #17051 for this.
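
The deduplication idea in the comment above can be sketched as follows. This is only an illustration under stated assumptions: `DedupLimiter` and its methods are hypothetical names, not the limiter from #17015 or the change tracked in #17051, and the sketch ignores any expiry of reservations.

```rust
use std::collections::HashSet;

/// Hypothetical limiter that keys reservations by region id, so repeated
/// precheck requests from the same region occupy a single slot.
struct DedupLimiter {
    capacity: usize,
    reserved: HashSet<u64>,
}

impl DedupLimiter {
    fn new(capacity: usize) -> Self {
        DedupLimiter { capacity, reserved: HashSet::new() }
    }

    /// Returns true if `region_id` holds a reservation after the call.
    /// A retry from a region that already holds one consumes no extra slot.
    fn try_reserve(&mut self, region_id: u64) -> bool {
        if self.reserved.contains(&region_id) {
            return true;
        }
        if self.reserved.len() < self.capacity {
            self.reserved.insert(region_id);
            true
        } else {
            false
        }
    }

    /// Frees the slot held by `region_id`; returns false if it held none.
    fn release(&mut self, region_id: u64) -> bool {
        self.reserved.remove(&region_id)
    }

    /// Number of slots currently taken.
    fn in_use(&self) -> usize {
        self.reserved.len()
    }
}
```

With a capacity of 2, a leader for region 1 that retries `try_reserve(1)` several times still occupies one slot, so region 2 can reserve the other. A counter-based limiter would have been exhausted by the retries alone.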

hbisheng · 2024-05-23T11:13:07Z

/cc @overvenus for review

ti-chi-bot · 2024-05-23T11:13:11Z

@hbisheng: GitHub didn't allow me to request PR reviews from the following users: for, review.

Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @overvenus for review

overvenus

Rest LGTM

overvenus · 2024-05-30T07:14:01Z

tests/failpoints/cases/test_snap.rs

+    // Wait for the first snapshot to be received and paused.
+    let (tx, rx) = mpsc::channel();
+    let tx = Mutex::new(tx);
+    fail::cfg_callback("receiving_snapshot_callback", move || {


We can reuse `receiving_snapshot_net_error`, since the two failpoints are in the same position.

hbisheng · 2024-05-31

I'm actually using both failpoints. I want to trigger a callback AND pause the thread, and I don't know of a way to do both with a single failpoint.

fail_point!("receiving_snapshot_callback");
fail_point!("receiving_snapshot_net_error");

hbisheng · 2024-06-03T03:58:48Z

The PR is ready for another review @overvenus @Connor1996

wuhuizuo · 2024-06-04T12:18:24Z

/check-dco

ti-chi-bot · 2024-06-04T12:18:35Z

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

baaf635 Merge branch 'master' into snap-send-precheck-pr

Connor1996 · 2024-06-06T08:38:43Z

/ok-to-test

ti-chi-bot · 2024-06-06T08:40:47Z

@hbisheng: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test	`211f5ee`	link	true	`/test pull-unit-test`

Connor1996

LGTM

ti-chi-bot bot added do-not-merge/release-note-label-needed contribution Type: PR - From contributors first-time-contributor needs-ok-to-test labels May 15, 2024

hbisheng marked this pull request as draft May 15, 2024 07:12

ti-chi-bot bot added do-not-merge/work-in-progress size/XL release-note-none and removed do-not-merge/release-note-label-needed labels May 15, 2024

hbisheng force-pushed the snap-send-precheck-pr branch from c977c70 to 89fa491 Compare May 21, 2024 07:04

ti-chi-bot bot added size/L and removed size/XL labels May 21, 2024

hbisheng force-pushed the snap-send-precheck-pr branch from 89fa491 to 5001cad Compare May 21, 2024 07:58

hbisheng marked this pull request as ready for review May 21, 2024 07:59

ti-chi-bot bot removed the do-not-merge/work-in-progress label May 21, 2024

Connor1996 reviewed May 21, 2024

tests/failpoints/cases/test_snap.rs (outdated, resolved)

use featuregate; define a handle_snapshot_send func

2ef7574

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

hbisheng marked this pull request as draft May 22, 2024 05:37

ti-chi-bot bot added the do-not-merge/work-in-progress label May 22, 2024

hbisheng added 2 commits May 23, 2024 10:37

Merge branch 'master' into snap-send-precheck-pr

baaf635

refactor

0e9dc08

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

hbisheng marked this pull request as ready for review May 23, 2024 04:08

ti-chi-bot bot removed the do-not-merge/work-in-progress label May 23, 2024

remove noise

ead9dda

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

hbisheng requested a review from Connor1996 May 23, 2024 05:59

Connor1996 reviewed May 23, 2024

components/raftstore/src/store/peer.rs (outdated, resolved)

components/raftstore/src/store/fsm/peer.rs (outdated, resolved)

address comment: start gen_task in on_extra_message, define a constant for precheck interval

339e291

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

ti-chi-bot bot requested a review from overvenus May 23, 2024 11:13

overvenus reviewed May 30, 2024

hbisheng and others added 2 commits May 31, 2024 15:53

Apply suggestions from code review

c6508c1

Co-authored-by: Neil Shen <overvenus@gmail.com> Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

don't generate if peer is not found

3ad6587

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

ti-chi-bot bot added size/XL and removed size/L labels May 31, 2024

fix var name

4a0c123

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

Connor1996 reviewed May 31, 2024

src/server/snap.rs (resolved)

hbisheng added 2 commits May 31, 2024 16:47

release reservation after applying snapshot

8bd17a2

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

Introduce some randomness in the time interval between sending precheck requests

89b1bca

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

Connor1996 reviewed Jun 3, 2024

components/raftstore/src/store/worker/region.rs (outdated, resolved)

hbisheng added 2 commits June 4, 2024 16:36

Account for the number of pending applies in the snapshot concurrency limiter

3cf1cf7

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

revert: release reservation after applying snapshot

79f5ab4

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

ti-chi-bot bot added the dco-signoff: no label Jun 4, 2024

Connor1996 reviewed Jun 4, 2024

components/raftstore/src/store/peer.rs (outdated, resolved)

revert to a iteration-count-based approach

d3a8a4e

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

Use better variable names (iteration -> tick)

211f5ee

Signed-off-by: Bisheng Huang <hbisheng@gmail.com>

ti-chi-bot bot added ok-to-test and removed needs-ok-to-test labels Jun 6, 2024

Connor1996 approved these changes Jun 6, 2024

ti-chi-bot bot added the status/LGT1 Status: PR - There is already 1 approval label Jun 6, 2024
