An interrupted node-left operation can leave a node (even if it rejoins later) on a list of faultyNodes that do not receive cluster state updates until a LagDetector timeout #108690

Open
DiannaHohensee opened this issue May 15, 2024 · 3 comments
Labels
>bug
:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection)
Team:Distributed (Meta label for distributed team)

Comments

@DiannaHohensee
Contributor

DiannaHohensee commented May 15, 2024

Related to the test failure in #91447. We believe the circumstances of the failure are rare: they were created by a NullPointerException that has since been fixed, so the remaining scenario is hypothetical.

A node-left task can be interrupted before it removes the node from the master's list of faultyNodes. Nodes on the faultyNodes list do not receive cluster state updates, and are eventually removed. Later, when the node attempts to rejoin after the test's network disruptions have ceased, the node-join request can succeed, but the node does not receive the resulting cluster state update. It therefore treats the node-join as a failure and keeps resending node-join requests until the LagDetector removes the node from the faultyNodes list.
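
The sequence above can be walked through with a small toy model. This is an illustrative sketch only, not Elasticsearch code: `ToyMaster`, its methods, and the `interrupted` flag are hypothetical names. It also assumes that the interrupted node-left changes nothing, so the node stays both in the cluster state and in faultyNodes.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model of the scenario described above. NOT Elasticsearch code:
// every class and method name here is a hypothetical stand-in.
final class FaultyNodeStallSketch {

    static final class ToyMaster {
        final Set<String> clusterNodes = new TreeSet<>();
        final Set<String> faultyNodes = new TreeSet<>();

        // Nodes that would be sent the next cluster state: members minus faulty ones.
        Set<String> publishTargets() {
            Set<String> targets = new TreeSet<>(clusterNodes);
            targets.removeAll(faultyNodes);
            return targets;
        }

        // Fault detection flags the node; a node-left task is then expected to clean up.
        void markFaulty(String node) {
            faultyNodes.add(node);
        }

        // Assumption: an interrupted node-left changes nothing, so the node stays
        // both in the cluster state and in faultyNodes.
        void nodeLeft(String node, boolean interrupted) {
            if (interrupted) {
                return;
            }
            clusterNodes.remove(node);
            faultyNodes.remove(node);
        }

        // The join is accepted, but nothing here clears faultyNodes.
        void nodeJoin(String node) {
            clusterNodes.add(node);
        }

        // Stand-in for the LagDetector timeout mentioned in the issue.
        void lagDetectorTimeout(String node) {
            faultyNodes.remove(node);
        }
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();
        master.clusterNodes.add("node-1");
        master.clusterNodes.add("node-2");

        master.markFaulty("node-2");
        master.nodeLeft("node-2", true);   // interrupted before any cleanup
        master.nodeJoin("node-2");         // the rejoin is accepted
        System.out.println("after rejoin:  " + master.publishTargets());

        master.lagDetectorTimeout("node-2");
        System.out.println("after timeout: " + master.publishTargets());
    }
}
```

In this model, node-2 is missing from the publish targets after its join is accepted and reappears only once the timeout stand-in clears it from faultyNodes.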

One solution would be for a node-join request to first run a new node-left request if the node is seen to still be present in the cluster state, and to complete that node-left operation before the node-join proceeds. This would ensure that all of the node-left logic runs successfully, including removing the node from the list of faultyNodes, so that there is clean state on which to apply the node-join request. A comment on the test failure has further details on this suggestion.
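
A minimal sketch of that ordering, again as a toy model rather than Elasticsearch code. `ToyMaster`, `completeNodeLeft`, `handleJoin`, and `receivesPublications` are hypothetical names chosen for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy sketch of the proposal above. NOT Elasticsearch code; all names are hypothetical.
final class JoinRunsNodeLeftFirstSketch {

    static final class ToyMaster {
        final Set<String> clusterNodes = new TreeSet<>();
        final Set<String> faultyNodes = new TreeSet<>();

        // A node-left that runs to completion: drops the node and its faulty marker.
        void completeNodeLeft(String node) {
            clusterNodes.remove(node);
            faultyNodes.remove(node);
        }

        // Proposed join handling: if the node still appears in the cluster state,
        // finish a node-left for it first, then apply the join on clean state.
        void handleJoin(String node) {
            if (clusterNodes.contains(node)) {
                completeNodeLeft(node);
            }
            clusterNodes.add(node);
        }

        // A node gets cluster state updates only if it is a member and not faulty.
        boolean receivesPublications(String node) {
            return clusterNodes.contains(node) && !faultyNodes.contains(node);
        }
    }

    public static void main(String[] args) {
        ToyMaster master = new ToyMaster();
        // State left behind by an interrupted node-left: still a member, still faulty.
        master.clusterNodes.add("node-2");
        master.faultyNodes.add("node-2");

        master.handleJoin("node-2");
        System.out.println("receives publications: " + master.receivesPublications("node-2"));
    }
}
```

The point of the sketch is only the ordering: the stale faulty marker is cleared as a side effect of the completed node-left, before the join is applied.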

DiannaHohensee added the >bug, :Distributed/Cluster Coordination, and Team:Distributed labels May 15, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

DiannaHohensee added a commit to DiannaHohensee/elasticsearch that referenced this issue May 15, 2024
It's possible for a node-left task to get interrupted prior to removing
the node from the master's list of faultyNodes. Nodes on the faultyNodes
list do not receive cluster state updates, and are eventually removed.

Subsequently, when the node attempts to rejoin, after test network
disruptions have ceased, the node-join request can succeed, but the
node will never receive the cluster state update, consider the node-join
a failure, and will resend node-join requests until the LagDetector
removes the node from the faultyNodes list.
elastic#108690 will address the
node-join issue.
@DaveCTurner
Contributor

> if the node is seen to still be present in the cluster state

... and is considered faulty. Non-faulty nodes will rejoin the cluster on (e.g.) every master election and we don't want to drop them (and all their shards) from the cluster first.

@DiannaHohensee
Contributor Author

Hmm. You're right: multiple node-join requests could be sent and received out of order, and we don't want to run node-left every time.

I was hoping to avoid directly checking for the node in the faultyNodes list: if that's the only thing that can go wrong on a node-left operation, then it seems like it would be better to directly remove the node from that list and skip the rest of the node-left logic.
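
The options raised in this thread can be compared side by side in a toy model. This is a sketch under stated assumptions, not Elasticsearch code: the `JoinPolicy` enum, `ToyMaster`, and the per-node shard counter are hypothetical, and the model assumes that a full node-left drops all of the node's shards, as the comment above warns.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy comparison of the join-handling options discussed in this thread.
// NOT Elasticsearch code; names and the shard bookkeeping are hypothetical.
final class JoinPolicySketch {

    enum JoinPolicy { ALWAYS_NODE_LEFT, NODE_LEFT_IF_FAULTY, CLEAR_FAULTY_ONLY }

    static final class ToyMaster {
        final Set<String> clusterNodes = new HashSet<>();
        final Set<String> faultyNodes = new HashSet<>();
        final Map<String, Integer> shardCounts = new HashMap<>();

        // Full node-left: the node and (in this model) all of its shards are dropped.
        void completeNodeLeft(String node) {
            clusterNodes.remove(node);
            faultyNodes.remove(node);
            shardCounts.remove(node);
        }

        void handleJoin(String node, JoinPolicy policy) {
            boolean present = clusterNodes.contains(node);
            boolean faulty = faultyNodes.contains(node);
            switch (policy) {
                case ALWAYS_NODE_LEFT:      // original proposal
                    if (present) completeNodeLeft(node);
                    break;
                case NODE_LEFT_IF_FAULTY:   // only when the node is also considered faulty
                    if (present && faulty) completeNodeLeft(node);
                    break;
                case CLEAR_FAULTY_ONLY:     // remove the marker, skip the rest of node-left
                    faultyNodes.remove(node);
                    break;
            }
            clusterNodes.add(node);
        }

        int shardsOn(String node) {
            return shardCounts.getOrDefault(node, 0);
        }
    }

    // Builds a master that knows one node holding three shards.
    static ToyMaster masterWith(String node, boolean faulty) {
        ToyMaster master = new ToyMaster();
        master.clusterNodes.add(node);
        master.shardCounts.put(node, 3);
        if (faulty) master.faultyNodes.add(node);
        return master;
    }

    public static void main(String[] args) {
        for (JoinPolicy policy : JoinPolicy.values()) {
            ToyMaster healthy = masterWith("node-2", false);
            healthy.handleJoin("node-2", policy);
            ToyMaster stuck = masterWith("node-2", true);
            stuck.handleJoin("node-2", policy);
            System.out.println(policy
                + ": healthy rejoin keeps " + healthy.shardsOn("node-2") + " shards"
                + ", stuck node still faulty = " + stuck.faultyNodes.contains("node-2"));
        }
    }
}
```

In this model all three policies clear the stale faulty marker, but only the unconditional node-left costs a healthy rejoining node its shards. Whether skipping the rest of node-left is safe depends on what else node-left is responsible for, which the thread leaves open.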

DiannaHohensee added a commit that referenced this issue May 15, 2024

Closes #91447
parkertimmins pushed a commit to parkertimmins/elasticsearch that referenced this issue May 17, 2024
…ic#108691)
