upgradecluster: retry until cluster stable with unavailable nodes #124288

Open
wants to merge 1 commit into master
Conversation

@kvoli (Collaborator) commented May 16, 2024

Nodes can be transiently unavailable (failing a heartbeat), in which case the upgrade manager will error out. Retry `UntilClusterStable` up to 10 times when there are unavailable nodes before returning an error.

Resolves: #120521
Resolves: #121069
Resolves: #119696
Release note: None
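
For illustration only, here is a minimal standalone sketch of the retry policy described above, assuming a hypothetical `checkStable` callback and `errUnavailable` sentinel in place of the real `UntilClusterStable` loop and liveness check:

	package upgradesketch

	import (
		"errors"
		"fmt"
		"time"
	)

	// errUnavailable is a hypothetical sentinel standing in for the
	// "unavailable node(s)" error produced by the liveness check.
	var errUnavailable = errors.New("unavailable node(s)")

	// retryUntilStable re-runs checkStable up to maxRetries additional times,
	// but only while the failure is transient node unavailability; any other
	// error is returned immediately.
	func retryUntilStable(checkStable func() error, maxRetries int, backoff time.Duration) error {
		var err error
		for attempt := 0; attempt <= maxRetries; attempt++ {
			if err = checkStable(); err == nil {
				return nil
			}
			if !errors.Is(err, errUnavailable) {
				// Non-transient failure: surface it without retrying.
				return err
			}
			time.Sleep(backoff)
		}
		return fmt.Errorf("cluster not stable after %d retries: %w", maxRetries, err)
	}

With `maxRetries = 10`, a caller gives transiently unavailable nodes ten extra chances to come back before the upgrade manager reports the error.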

@cockroach-teamcity (Member) commented

This change is Reviewable

@kvoli self-assigned this May 16, 2024
@rafiss (Collaborator) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


pkg/upgrade/upgradecluster/cluster.go line 79 at r1 (raw file):

	unavailableRetries := 0

	for {

nit: would a retry ctx make sense here?

	retryOpts := base.DefaultRetryOptions()
	retryOpts.MaxRetries = 10
	for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
		// ...
	}

pkg/upgrade/upgradecluster/cluster.go line 111 at r1 (raw file):

	}
	if len(unavailable) > 0 {
		return 0, errors.Newf("unavailable node(s): %v", unavailable)

is there any concern that this would create a new error scenario that didn't exist before (asking since we'd want to backport this)? or is it just that this problem could already happen, but returning the error here is more clear?
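
For context on the nit above, a hedged sketch (not code from this PR) of how the suggested retry options could wrap the unavailable-node check from the quoted excerpts; `checkNodes` is an assumed placeholder for the real liveness bookkeeping:

	package upgradecluster

	import (
		"context"

		"github.com/cockroachdb/cockroach/pkg/base"
		"github.com/cockroachdb/cockroach/pkg/roachpb"
		"github.com/cockroachdb/cockroach/pkg/util/retry"
		"github.com/cockroachdb/errors"
	)

	// untilNoUnavailable re-runs the hypothetical checkNodes callback under the
	// suggested retry options, stopping early once no nodes are unavailable.
	func untilNoUnavailable(
		ctx context.Context,
		checkNodes func(context.Context) ([]roachpb.NodeID, error),
	) error {
		retryOpts := base.DefaultRetryOptions()
		retryOpts.MaxRetries = 10

		var unavailable []roachpb.NodeID
		for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
			var err error
			unavailable, err = checkNodes(ctx)
			if err != nil {
				return err
			}
			if len(unavailable) == 0 {
				return nil
			}
		}
		if err := ctx.Err(); err != nil {
			// The loop also exits when the context is canceled.
			return err
		}
		return errors.Newf("unavailable node(s): %v", unavailable)
	}

Compared with a bare `unavailableRetries` counter, `retry.StartWithCtx` adds backoff between attempts and stops early when the context is canceled, which appears to be the motivation behind the nit.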
