raft topology: a node finishes bootstrapping and "completes initialization" while other nodes still didn't learn that it's normal #18678
Comments
If node 3 is pending according to erm, why would coordinator require 4 nodes? In that topology version there are 2 normal replicas + 1 pending. |
Ok I see, it's because CL=3 is supposed to fail during bootstrap of the third node, unlike RF=3 which works fine. In the test, you could wait on node2 for the topology to quiesce. I added an API for that in 190bdc3 (not merged yet). |
I believe this should be fixed. As I have stated on earlier occasions, ScyllaDB should actively cooperate with clients during bootstrap by reporting its bootstrap status, and it should only begin serving queries after all reasonable efforts have been made to get up to date with the cluster and to make the cluster aware of this node. Indeed, it's impossible to guarantee in all cases that a starting node achieves the desired state, but the goal is to avoid errors like this one in a healthy network with healthy nodes. Issuing an extra read barrier therefore seems like a reasonable effort -- it should not be mandatory for becoming live, and it could perhaps time out with a warning in the log after 5 seconds. |
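A minimal sketch of the behavior proposed above -- an extra read barrier at startup that is not mandatory for becoming live and only warns on timeout. All names here are hypothetical; `do_barrier` stands in for whatever coroutine performs the actual global read barrier:

```python
import asyncio
import logging

log = logging.getLogger("startup")

async def best_effort_read_barrier(do_barrier, timeout_s: float = 5.0) -> bool:
    """Issue a read barrier during startup, but never block becoming live:
    on timeout, log a warning and let initialization proceed anyway."""
    try:
        await asyncio.wait_for(do_barrier(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # Startup proceeds; operators can investigate the warning.
        log.warning("read barrier did not complete within %.1fs; "
                    "other nodes may still see this node as pending", timeout_s)
        return False
```

The key design point is that the barrier is best-effort: a slow or partitioned peer delays nothing beyond the timeout.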
@kostja Sounds reasonable to me. Maybe except the 5 sec timeout, which may not be enough in our CI environment. It could be a regular barrier; we do several of them already, so one more won't make us more vulnerable to unavailability. |
Why would CL=ALL work? |
Sorry, I think I made a mistake. With RF=3 + CL=ALL, we would calculate that 3 natural replicas are needed (the calculation doesn't take into account how many token owners there are in the ring; it looks at the keyspace configuration instead). Therefore, during bootstrap, when hitting a pending range, we'd require 3 natural + 1 pending, 4 in total, so CL=ALL should also fail. Will edit the post and remove that part (edit: done)
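The calculation described above can be sketched as follows. This is a simplified illustration, not ScyllaDB's actual code: the required count comes from the consistency level derived from the keyspace configuration, and each pending (bootstrapping) replica in the hit range bumps it by one, regardless of how many token owners actually exist in the ring:

```python
def required_alive_replicas(cl_replicas: int, pending_replicas: int) -> int:
    """Number of alive nodes the coordinator demands for a query:
    the CL-derived replica count plus one per pending replica in the
    range being hit (simplified model of the check discussed above)."""
    return cl_replicas + pending_replicas

# RF=3 cluster with a bootstrapping third node, query hits a pending range:
# CL=THREE (and CL=ALL, which also resolves to 3 from the keyspace config)
# requires 3 + 1 = 4 alive nodes -- but only 3 exist, hence unavailable.
assert required_alive_replicas(3, 1) == 4
# Without a pending replica, CL=THREE needs the plain 3.
assert required_alive_replicas(3, 0) == 3
```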
It could work like that:
Or do you have something else (maybe simpler) in mind @tgrabiec? |
Looks suspiciously similar to "wait for gossip to settle"... which we disabled... |
A potential alternative solution to introducing barriers etc. would be to bring back "wait for gossip to settle" at the end of the node startup procedure, before it says "initialization completed". If we want to speed things up, we could introduce a "finish early" condition to this wait. For example, we could send RPCs to all nodes to check whether they already see us as UP and NORMAL, and if all of those RPCs finish, we finish the wait, even if gossip is theoretically not settled yet. In a healthy network this would finish almost instantaneously. |
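The "finish early" condition above could look roughly like this. Everything here is a hypothetical sketch: `am_i_up_and_normal` stands in for the assumed RPC that asks a peer whether it sees this node as UP and NORMAL, and the wait ends as soon as every peer has answered yes, or when the overall deadline expires:

```python
import asyncio

async def wait_until_seen_as_normal(peers, am_i_up_and_normal,
                                    max_wait_s: float = 30.0,
                                    poll_s: float = 0.5) -> bool:
    """Poll all peers until each reports seeing this node as UP and NORMAL.
    Returns True if everyone confirmed before the deadline, False otherwise."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    remaining = set(peers)
    while remaining and loop.time() < deadline:
        pending = list(remaining)  # snapshot for a stable iteration order
        results = await asyncio.gather(
            *(am_i_up_and_normal(p) for p in pending))
        remaining = {p for p, ok in zip(pending, results) if not ok}
        if remaining:
            await asyncio.sleep(poll_s)  # retry peers that haven't confirmed
    return not remaining
```

In a healthy network the first round of RPCs succeeds and the function returns almost instantly, which is the point of the proposal.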
I am totally OK with an RPC, which is very similar to the happy path of the read barrier; I think we should avoid going back to gossip. |
I actually don't know if this is a bug and whether it should be considered an issue. But we can use this issue to discuss it and potentially decide that it is, and fix it.
A dtest failed in debug mode:
update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_add_two_nodes_in_parallel[case_1]
attaching logs:
dtest-gw2.log
node1.log
node2.log
node3.log
In short, the test bootstrapped 3 nodes. After confirming that they all saw each other as alive and all printed "initialization completed", it connected to node2 (with an exclusive CQL connection) and tried to do a query with CL=THREE.
The test failed because the query returned an unavailable exception:
First of all, if I understand correctly how this exception is constructed, the printed numbers are actually wrong if there is a bootstrapping node and we hit a key that is replicated to a pending replica. If there is a bootstrapping node, then in the case of CL=THREE it should say "Requires 4", because we increase the CL by 1.
I haven't confirmed 100% that my explanation below is the cause of this failure, but it is something that could happen, it plausibly explains the failure, and I have no other hypotheses -- so that's probably it.
The explanation is that even though node 3 printed "initialization completed", node 2 still thinks it is bootstrapping, so its token_metadata contains the tokens of node 3 as pending. This can happen because we don't do a global read barrier between leaving write_both_read_new and the bootstrapping node printing "initialization completed", so the last commands that update topology state might not have reached node 2 yet. Therefore, at the moment the test tries to do a CL=THREE query through node 2, that node thinks there is a pending replica, so it requires 4 alive nodes for the query.
Also note that recently we removed "wait for gossip to settle" before printing "initialization completed" and opening the CQL port (65cfb9b) -- that wait would actually have prevented this failure. (And from what I see, this test only started failing recently, so the removal could be the cause.)
The question is whether we should address this (e.g. with a barrier around use_new) when writing down the spec for consistent topology changes.
@kostja @gleb-cloudius @tgrabiec -- opinions?