upgradecluster: retry until cluster stable with unavailable nodes #124288

Open
wants to merge 1 commit into master
Conversation

@kvoli (Collaborator) commented May 16, 2024

Nodes can be transiently unavailable (failing a heartbeat), in which case the upgrade manager will error out. Retry `UntilClusterStable` up to 10 times when there are unavailable nodes before returning an error.

Resolves: #120521
Resolves: #121069
Resolves: #119696
Release note: None
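
For illustration only, here is a minimal standalone sketch of the retry policy described above, assuming a hypothetical `checkStable` callback and `errUnavailable` sentinel in place of the real `UntilClusterStable` loop and liveness check:

	package upgradesketch

	import (
		"errors"
		"fmt"
		"time"
	)

	// errUnavailable is a hypothetical sentinel standing in for the
	// "unavailable node(s)" error produced by the liveness check.
	var errUnavailable = errors.New("unavailable node(s)")

	// retryUntilStable re-runs checkStable up to maxRetries additional times,
	// but only while the failure is transient node unavailability; any other
	// error is returned immediately.
	func retryUntilStable(checkStable func() error, maxRetries int, backoff time.Duration) error {
		var err error
		for attempt := 0; attempt <= maxRetries; attempt++ {
			if err = checkStable(); err == nil {
				return nil
			}
			if !errors.Is(err, errUnavailable) {
				// Non-transient failure: surface it without retrying.
				return err
			}
			time.Sleep(backoff)
		}
		return fmt.Errorf("cluster not stable after %d retries: %w", maxRetries, err)
	}

With `maxRetries = 10`, a caller gives transiently unavailable nodes ten extra chances to come back before the upgrade manager reports the error.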

@cockroach-teamcity (Member) commented

This change is Reviewable

@kvoli self-assigned this May 16, 2024
@rafiss (Collaborator) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


pkg/upgrade/upgradecluster/cluster.go line 79 at r1 (raw file):

	unavailableRetries := 0

	for {

nit: would a retry ctx make sense here?

	retryOpts := base.DefaultRetryOptions()
	retryOpts.MaxRetries = 10
	for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
		// ...
	}

pkg/upgrade/upgradecluster/cluster.go line 111 at r1 (raw file):

	}
	if len(unavailable) > 0 {
		return 0, errors.Newf("unavailable node(s): %v", unavailable)

is there any concern that this would create a new error scenario that didn't exist before (asking since we'd want to backport this)? or is it just that this problem could already happen, but returning the error here is more clear?
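
For context on the nit above, a hedged sketch (not code from this PR) of how the suggested retry options could wrap the unavailable-node check from the quoted excerpts; `checkNodes` is an assumed placeholder for the real liveness bookkeeping:

	package upgradecluster

	import (
		"context"

		"github.com/cockroachdb/cockroach/pkg/base"
		"github.com/cockroachdb/cockroach/pkg/roachpb"
		"github.com/cockroachdb/cockroach/pkg/util/retry"
		"github.com/cockroachdb/errors"
	)

	// untilNoUnavailable re-runs the hypothetical checkNodes callback under the
	// suggested retry options, stopping early once no nodes are unavailable.
	func untilNoUnavailable(
		ctx context.Context,
		checkNodes func(context.Context) ([]roachpb.NodeID, error),
	) error {
		retryOpts := base.DefaultRetryOptions()
		retryOpts.MaxRetries = 10

		var unavailable []roachpb.NodeID
		for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
			var err error
			unavailable, err = checkNodes(ctx)
			if err != nil {
				return err
			}
			if len(unavailable) == 0 {
				return nil
			}
		}
		if err := ctx.Err(); err != nil {
			// The loop also exits when the context is canceled.
			return err
		}
		return errors.Newf("unavailable node(s): %v", unavailable)
	}

Compared with a bare `unavailableRetries` counter, `retry.StartWithCtx` adds backoff between attempts and stops early when the context is canceled, which appears to be the motivation behind the nit.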
