-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TestUpdateClusterLayout.test_change_node_ip fails (when getting host-id from nodetool) #18693
Comments
Could it be another manifestation of #17857? |
#17857 sounds very plausible (cc @bhalevy). In this test we have a node which changes IP address. Once it restarts, other nodes see it in gossip, but before performing Idea to reproduce: start two nodes A and B, then inject a sleep at the beginning of |
Actually it cannot be #17857 and what I wrote above doesn't make sense. #17857 was observed in a test run #16902 in gossip-topology mode. This issue was observed in dtest run with raft-topology and there we don't call In raft-topology mode, the function responsible for reacting to new endpoints appearing in gossip is However I analyzed the code and I don't see how the bug can happen. The function doesn't do anything if it sees that the host ID is missing: future<>
on_endpoint_change(gms::inet_address endpoint, gms::endpoint_state_ptr ep_state, gms::permit_id permit_id, const char* ev) {
auto app_state_ptr = ep_state->get_application_state_ptr(gms::application_state::HOST_ID);
if (!app_state_ptr) {
co_return;
}
raft::server_id id(utils::UUID(app_state_ptr->value())); and when it updates Something fishy is going on here. Perhaps we still have some other place which is incorrectly updating |
This is what the nodes saw when the test executed nodetool:
|
BTW there is a slight difference in how Scylla nodetool behaves compared to Cassandra nodetool: if host ID is missing in hostID = hostIDMap.get(endpoint);
new nodetool will print endpoint_host_id.contains(ep) ? endpoint_host_id.at(ep) : "?" cc @denesb (probably doesn't matter) |
Yes, I know about this difference, and I thought "?" makes more sense than "null". Note that other columns in the status output also use "?" for when there is no data available (e.g. tokens, load). |
Ok I see how the issue can happen It's just an artifact of how nodetool is implemented. It performs two REST API endpoint calls: one to get "token to endpoint map", another to get "host ID map". The IP change notification inside Scylla can be handled in-between those calls. New nodetool gets the host ID map first, and the token-to-endpoint map second: const auto endpoint_host_id = rjson_to_map<sstring>(client.get("/storage_service/host_id"));
...
const auto tokens_endpoint_res = client.get("/storage_service/tokens_endpoint", std::move(tokens_endpoint_params)); then it iterates over endpoints found in So if IP change notification is handled in-between, then Curiously it seems that old nodetool has the reverse problem: Map<String, String> tokensToEndpoints = probe.getTokenToEndpointMap();
liveNodes = probe.getLiveNodes();
unreachableNodes = probe.getUnreachableNodes();
hostIDMap = probe.getHostIdMap(); so it would sometimes observe the old IP without host ID assigned. But the way the test logic is written, it won't catch this scenario, because it only looks for new IP, and if new IP is found, then it verifies its ID: hostid3 = node3.hostid()
# Verify all nodes in the cluster see node3 is using the new ip address
for node in cluster.nodelist():
status = nodetool_status(node)
logger.debug(f"nodetool status from {node.name}: {status}")
for n in status["nodes"]:
if n["address"] == ip3:
assert n["host id"] == hostid3 So summarizing, this failure is a small artifact of then new nodetool implementation, but old nodetool had a "reverse" version of this bug, the test just wouldn't catch it due to how it's written. |
Can be easily "fixed" (in reality, masked, just to shut the test up) by reversing the order of rest endpoint calls so it's the same as in old nodetool. @denesb |
https://jenkins.scylladb.com/job/scylla-master/job/gating-dtest-release-with-consistent-topology-changes/1589/testReport/junit/update_cluster_layout_tests/TestUpdateClusterLayout/FullDtest___full_split001___test_change_node_ip/
test was re-enabled recently by https://github.com/bhalevy/scylla-dtest/commit/73fd1b731a74ad4a6c00ef51a17a50b08464dbce (from https://github.com/scylladb/scylla-dtest/issues/4273)
EDIT from @kbr-scylla: explanation here: #18693 (comment)
The text was updated successfully, but these errors were encountered: