Inconsistent QQ state (between mnesia & ra state machine) when removing a node under high CPU load #11029
Replies: 5 comments 9 replies
-
The guide on Upgrades explicitly recommends upgrading when the system is not under stress. QQ and stream membership changes update two places, and under close to peak CPU (or disk I/O) load, one of them can hit a timeout. Khepri won't change things dramatically either: while it is much closer to quorum queues and streams in terms of the algorithm used, the higher the load, the higher the risk of hitting a timeout. Ra supports timeouts for specific state machine operations, and many CLI tool commands accept a timeout option.
But even with higher timeouts, as long as the list of members is stored in two places, you will always run this risk. This is why #8218 introduces a periodic repair operation. Still, the recommendation not to upgrade clusters under close to peak load stands and always will. This simply doesn't come up often enough, as about six years of real-world quorum queue experience suggests, perhaps because most RabbitMQ clusters are upgraded outside of their peak load periods.
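The periodic repair idea from #8218 can be sketched roughly as follows: compare the member list recorded in the queue metadata with what the Ra cluster itself reports, and reconcile any drift. This is a minimal illustration, not the actual implementation; the `get_metadata_members/1` and `update_metadata_members/2` helpers are hypothetical, while `ra:members/1` is the real Ra API mentioned later in this thread.

```erlang
%% Sketch only: periodically reconcile the metadata-store member list
%% with the Ra cluster's own view. Helper functions marked below are
%% hypothetical placeholders, not the rabbit_quorum_queue API.
repair_membership(RaName) ->
    case ra:members({RaName, node()}) of
        {ok, RaMembers, _Leader} ->
            MetaMembers = get_metadata_members(RaName),  %% hypothetical helper
            case lists:sort(RaMembers) =:= lists:sort(MetaMembers) of
                true ->
                    ok;
                false ->
                    %% the Ra state machine is the source of truth here
                    update_metadata_members(RaName, RaMembers)  %% hypothetical helper
            end;
        {error, _} = Err ->
            %% e.g. a timeout under load; leave things alone and retry later
            Err
    end.
```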
-
@michaelklishin Thank you for the quick response! I strongly agree with the recommendation not to upgrade or terminate instances under peak load. This report comes out of a series of tests we're doing to validate the stability of QQs under different edge cases. Regarding mitigation, as a user, I would expect either …
-
I think I can see a problem here. If we look at the error that was returned:
we can see that it is an aggregate error. When we try to remove a member, we try a list of members (obtained from mnesia), and if any of them errors we try the next one, and so on. We can see here that one of the members returned …. I think we need to make a change where we evaluate the error properly after trying each member, if we receive ….

I do think this particular error must have occurred after we'd already failed to update mnesia in another test; it would be good to see errors from the first failed run also.

It appears we have a function for making the amqqueue record match the truth (`ra:members/1`), but it isn't called automatically from anywhere. Perhaps we should do this periodically, like we do for the leader pid.

Also related: #7863
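The change suggested above could take roughly this shape: instead of treating every failure as "try the next member", inspect the error reason after each attempt and stop early when the answer is definitive. This is a sketch under assumptions; the error atoms matched below (`noproc` as "retryable", everything else as "definitive") are placeholders for illustration, not the exact terms Ra returns, though `ra:remove_member/2` itself is the real API.

```erlang
%% Sketch: evaluate the error after each member rather than blindly
%% trying the next one. The classification of reasons is illustrative.
try_remove(_Member, []) ->
    {error, all_members_failed};
try_remove(Member, [Server | Rest]) ->
    case ra:remove_member(Server, Member) of
        {ok, _IdxTerm, _Leader} ->
            ok;
        {error, noproc} ->
            %% this server is unreachable; another member may still answer
            try_remove(Member, Rest);
        {error, _Reason} = Err ->
            %% a definitive answer (e.g. the member is already removed);
            %% trying the remaining servers would only mask it
            Err
    end.
```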
-
@kjnilsson We could add it to auto reconcile, but I feel it should perhaps be done by some other periodic process (as auto reconcile might not be turned on by most users).
-
I have created a PR to Ra which I think could be part of improving the behaviour here: rabbitmq/ra#433. Before, all errors would result in ….

Combined with this change, we could improve the code in rabbitmq-server/deps/rabbit/src/rabbit_quorum_queue.erl (lines 1281 to 1282 in a8bcf4a) such that we still update the mnesia record when it returns ….
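The follow-up described here might look something like the sketch below: after rabbitmq/ra#433, some results could safely be treated as "the member is already gone", so the metadata record can still be brought in line. The specific result term matched (`{error, not_member}`) and the `update_record/2` helper are assumptions for illustration only; the thread does not name the actual terms.

```erlang
%% Sketch: update the mnesia/metadata record not only on success but
%% also when Ra reports the member as already absent. The matched
%% error term and helper below are hypothetical.
handle_remove_result(Q, Member, Result) ->
    case Result of
        {ok, _IdxTerm, _Leader} ->
            update_record(Q, Member);   %% hypothetical helper
        {error, not_member} ->
            %% Ra says the member is no longer part of the cluster,
            %% so the record should reflect that too
            update_record(Q, Member);
        Other ->
            Other
    end.
```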
-
Description
When a cluster broker is under high CPU load, removing a node via the `forget_cluster_node` command can cause QQs to get into an inconsistent state between mnesia metadata and the ra state machine. Once this happens, `delete_member` fails, similar to #6511 (comment). The mitigation mentioned in that thread is to enable `quorum_queue.continuous_membership_reconciliation.auto_remove = true`. However, @SimonUnge said that likely won't work, because auto reconciliation calls `delete_member` underneath.

Please let me know if I can provide more information. I'm able to reproduce this quite easily in my test stack.
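For reference, the reconciliation setting discussed above sits alongside its enable flag in `rabbitmq.conf`. This fragment only restates the key quoted in this report (the `enabled` key is the standard companion setting); per the thread, `auto_remove` is not expected to help here, since it calls `delete_member` underneath.

```ini
# rabbitmq.conf — membership reconciliation settings discussed above.
quorum_queue.continuous_membership_reconciliation.enabled = true
quorum_queue.continuous_membership_reconciliation.auto_remove = true
```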
Reproduction
Run `forget_cluster_node` on one node.

Example logs and command outputs
Here is a sample queue `qq-140` after a few rounds of instance replacements.

Failure to remove member `rabbit@ip-10-0-5-25.ap-southeast-2.compute.internal`:

`quorum_status` still shows `rabbit@ip-10-0-5-25.ap-southeast-2.compute.internal`:

< NOTE: there is a time gap between the previous quorum_status command and the following 2 commands >

Mnesia shows `rabbit@ip-10-0-5-25.ap-southeast-2.compute.internal` as a member:

But `ra:members` doesn't have it:

Cannot run `delete_member`: