Disrupted bootstrap causes errors in CDC generation publisher fiber #18682

Open
margdoc opened this issue May 15, 2024 · 2 comments
Labels
P4 Low Priority

Comments

@margdoc
Contributor

margdoc commented May 15, 2024

When running

./test.py test_ip_mappings

on margdoc@e4b4a62, the first node gets an error:

ERROR 2024-05-15 09:50:28,549 [shard 0:strm] raft_topology - CDC generation publisher fiber got error exceptions::unavailable_exception (Cannot achieve consistency level for cl ONE. Requires 1, alive 0)
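The "Requires 1, alive 0" part of the message is the outcome of an availability check: the number of live replicas for the written partition is compared with the number the consistency level demands. A minimal sketch of that comparison follows. The names consistency_level, required_replicas, unavailable_error and check_availability are hypothetical stand-ins for illustration, not Scylla's actual types or functions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical, simplified consistency levels (not Scylla's db::consistency_level).
enum class consistency_level { ONE, QUORUM };

// Number of replicas that must be alive for the given level and replication factor.
inline std::size_t required_replicas(consistency_level cl, std::size_t replication_factor) {
    return cl == consistency_level::ONE ? 1 : replication_factor / 2 + 1;
}

// Hypothetical stand-in for exceptions::unavailable_exception.
struct unavailable_error : std::runtime_error {
    unavailable_error(std::size_t required, std::size_t alive)
        : std::runtime_error("Requires " + std::to_string(required) +
                             ", alive " + std::to_string(alive)) {}
};

// Throws if fewer replicas are alive than the consistency level requires.
inline void check_availability(consistency_level cl, std::size_t replication_factor,
                               std::size_t alive) {
    const std::size_t required = required_replicas(cl, replication_factor);
    if (alive < required) {
        throw unavailable_error(required, alive);
    }
}
```

With these assumptions, check_availability(consistency_level::ONE, 1, 0) throws with the same "Requires 1, alive 0" wording as the log line, while the same call with one live replica returns normally.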

The scenario of this test is as follows:

  • the first node bootstraps correctly
  • the second node starts bootstrapping
  • both nodes crash just before the first node saves the IP address of the second node in the system.peers table (error injection "crash-before-bootstrapping-node-added"). By that point, the bootstrap tokens of the second node have already been saved in topology_coordinator::handle_topology_transition.
  • the first node is restarted and queried with inserts and selects
  • the first node gets this error after entering the commit_cdc_generation transition state

The CDC generation publisher fiber retries this operation until it succeeds, which happens once the first node enters the left_token_ring transition state. This doesn't break the node, but it should be investigated.
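The retry behaviour described above can be pictured as a plain retry-until-success loop. The sketch below is illustrative only: retry_until_success is a hypothetical helper, and the max_attempts cap exists just to keep the example finite. The real fiber is a Seastar coroutine that presumably waits between attempts rather than spinning.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical helper: call op until it stops throwing and return the
// number of failed attempts that preceded the success.
inline std::size_t retry_until_success(const std::function<void()>& op,
                                       std::size_t max_attempts) {
    std::size_t failures = 0;
    for (std::size_t attempt = 0; attempt < max_attempts; ++attempt) {
        try {
            op();
            return failures;
        } catch (const std::exception&) {
            // This is where an ERROR line like the one in the issue would be logged.
            ++failures;
        }
    }
    throw std::runtime_error("operation did not succeed within max_attempts");
}
```

In the scenario from this issue, op would be the write that fails with unavailable_exception, and each failed attempt would produce one ERROR line until the topology state changes.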

topology_experimental_raft.test_ip_mappings.1.log
scylla-1.log
scylla-2.log

@kbr-scylla
Contributor

@margdoc please investigate at least exactly which read is failing with this error. If it's a local read (from a system table), then it's suspicious that it's failing.

(BTW I was hoping you'd introduce this info when opening the issue, as we agreed...)

@margdoc
Contributor Author

margdoc commented May 16, 2024

The failing operation is a write, not a read. It comes from system_distributed_keyspace -> system_keyspace::create_cdc_desc:

co_await max_concurrent_for_each(ms, 20, [&] (mutation& m) -> future<> {
    // We use the storage_proxy::mutate API since CQL is not the best for handling large batches.
    co_await _sp.mutate(
        { std::move(m) },
        quorum_if_many(ctx.num_token_owners),
        db::timeout_clock::now() + 30s,
        nullptr, // trace_state
        empty_service_permit(),
        db::allow_per_partition_rate_limit::no,
        false // raw_counters
    );
});
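The consistency level for this write comes from quorum_if_many(ctx.num_token_owners). Its definition is not quoted in this thread. Assuming it behaves as its name suggests, the "cl ONE" in the error message would be consistent with the helper having seen at most one token owner. A hedged sketch of that assumed behaviour, using the hypothetical names cl_sketch and quorum_if_many_sketch:

```cpp
#include <cstddef>

// Hypothetical stand-in for db::consistency_level.
enum class cl_sketch { ONE, QUORUM };

// Assumed behaviour, inferred only from the helper's name:
// use QUORUM when there is more than one token owner, otherwise ONE.
inline cl_sketch quorum_if_many_sketch(std::size_t num_token_owners) {
    return num_token_owners > 1 ? cl_sketch::QUORUM : cl_sketch::ONE;
}
```

Under this assumption a single-owner cluster writes at ONE, so the write fails as soon as the one replica chosen for a partition is considered dead.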

kbr-scylla added the P3 Medium Priority label May 17, 2024
kbr-scylla added P4 Low Priority and removed P3 Medium Priority labels Jun 6, 2024