
test: test_topology_ops: reenable background writes in tablets mode #17589

Open
patjed41 opened this issue Feb 29, 2024 · 11 comments · May be fixed by #18707

@patjed41
Contributor

#17585 reenables background writes in test_topology_ops, but only when the test is run without tablets. The test fails with tablets because of #17025. Once that issue is fixed, we should reenable background writes with tablets in test_topology_ops.

Ideally, before sending and merging the fix, we should investigate whether test_topology_ops with tablets and background writes is flaky. Before #17585, the test without tablets but with background writes was flaky in CI (1-2 failures per 1000 runs in dev mode).
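
For context, a background writer like `start_writes` keeps a few tasks inserting rows while the topology operations run, and returns a coroutine that stops them and verifies the acknowledged writes. The real helper lives in the test suite; the sketch below only illustrates the idea, and the keyspace/table names, the RF, and the `run_async` wrapper style are assumptions:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def start_writes(cql, concurrency: int = 3):
    """Minimal sketch: start background INSERTs and return a finisher that
    stops them and checks that the acknowledged writes are readable."""
    # ks/t are placeholder names; RF=3 matches the 4-node cluster in the test.
    await cql.run_async("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                        "{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
    await cql.run_async("CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, v int)")

    stop = asyncio.Event()

    async def do_writes(worker: int) -> None:
        key = worker
        while not stop.is_set():
            await cql.run_async(f"INSERT INTO ks.t (pk, v) VALUES ({key}, {key})")
            key += concurrency  # each worker writes a disjoint set of keys

    tasks = [asyncio.create_task(do_writes(i)) for i in range(concurrency)]

    async def finish_writes() -> None:
        stop.set()
        await asyncio.gather(*tasks)
        # Verify the data survived the topology operations.
        rows = await cql.run_async("SELECT COUNT(*) FROM ks.t")
        logger.info("background writes done, %s rows readable", rows[0].count)

    return finish_writes
```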

@patjed41
Contributor Author

#17025 has been fixed, so this issue can be worked on.

@bhalevy
Member

bhalevy commented Mar 25, 2024

@tchaikov can you please look into this issue?

@tchaikov
Contributor

tchaikov commented Mar 26, 2024

After applying the following patch:

```diff
modified   test/topology_experimental_raft/test_topology_ops.py
@@ -40,9 +40,7 @@ async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bo
     await wait_for_cql_and_get_hosts(manager.cql, servers, time.time() + 60)
     cql = await reconnect_driver(manager)

-    # FIXME: we disable background writes with tablets enabled because the test fails due to #17025.
-    # We need to re-enable them once this issue is fixed (by removing `if` here and above `await finish_writes()`).
-    finish_writes = await start_writes(cql) if not tablets_enabled else None
+    finish_writes = await start_writes(cql)

     logger.info(f"Decommissioning node {servers[0]}")
     await manager.decommission_node(servers[0].server_id)
@@ -72,8 +70,7 @@ async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bo
     servers = servers[1:]

     logger.info("Checking results of the background writes")
-    if not tablets_enabled:
-        await finish_writes()
+    await finish_writes()
     for server in servers:
         await check_node_log_for_failed_mutations(manager, server)
```

I get the following test failure:

```
    @pytest.mark.asyncio
    #@pytest.mark.parametrize("tablets_enabled", ["true", "false"])
    async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bool = True):
        """Test basic topology operations using the topology coordinator."""
        cfg = {'experimental_features': ['consistent-topology-changes']}
        if tablets_enabled:
            cfg['experimental_features'].append('tablets')

        logger.info("Bootstrapping first node")
        servers = [await manager.server_add(config=cfg)]

        logger.info(f"Restarting node {servers[0]}")
        await manager.server_stop_gracefully(servers[0].server_id)
        await manager.server_start(servers[0].server_id)

        logger.info("Bootstrapping other nodes")
        servers += await manager.servers_add(3, config=cfg)

        await wait_for_cql_and_get_hosts(manager.cql, servers, time.time() + 60)
        cql = await reconnect_driver(manager)
        finish_writes = await start_writes(cql)

        logger.info(f"Decommissioning node {servers[0]}")
        await manager.decommission_node(servers[0].server_id)
        await check_token_ring_and_group0_consistency(manager)
        servers = servers[1:]

        logger.info(f"Restarting node {servers[0]} when other nodes have bootstrapped")
        await manager.server_stop_gracefully(servers[0].server_id)
        await manager.server_start(servers[0].server_id)

        logger.info(f"Stopping node {servers[0]}")
        await manager.server_stop_gracefully(servers[0].server_id)
        await check_node_log_for_failed_mutations(manager, servers[0])

        logger.info(f"Replacing node {servers[0]}")
        replace_cfg = ReplaceConfig(replaced_id = servers[0].server_id, reuse_ip_addr = False, use_host_id = False)
        servers = servers[1:] + [await manager.server_add(replace_cfg)]
        await check_token_ring_and_group0_consistency(manager)

        logger.info(f"Stopping node {servers[0]}")
        await manager.server_stop_gracefully(servers[0].server_id)
        await check_node_log_for_failed_mutations(manager, servers[0])

        logger.info(f"Removing node {servers[0]} using {servers[1]}")
>       await manager.remove_node(servers[1].server_id, servers[0].server_id)
```
...
```
self = <test.pylib.rest_client.UnixRESTClient object at 0x7f31a33b7ad0>
method = 'PUT', resource = '/cluster/remove-node/4', response_type = None
host = None, port = None, params = None
json = {'expected_error': None, 'ignore_dead': [], 'server_id': 3}
timeout = 1000

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")

        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/remove-node/4, params: None, json: {'server_id': 3, 'ignore_dead': [], 'expected_error': None}, body:
E               removenode failed (initiator: ScyllaServer(4, 127.198.48.4, 53a4cc60-298c-4194-a139-c037309e32ba), to_remove: ScyllaServer(3, 127.198.48.3, 565cc314-62fa-484c-b587-49135a51ee89), ignore_dead: []), check log file at /home/kefu/dev/scylladb/testlog/release/scylla-4.log, error: "HTTP error 500, uri: http://127.198.48.4:10000/storage_service/remove_node, params: {'host_id': '565cc314-62fa-484c-b587-49135a51ee89', 'ignore_nodes': ''}, json: None, body:
E               {"message": "std::runtime_error (Removenode failed. See earlier errors (Rolled back: Failed to drain tablets: std::runtime_error (Unable to find new replica for tablet 7d477460-eb2b-11ee-ba7a-b34fcafc9b0e:3 on 565cc314-62fa-484c-b587-49135a51ee89:1 when draining {565cc314-62fa-484c-b587-49135a51ee89})))", "code": 500}"
```

when running:

```
$ ./test.py --verbose --mode release topology_experimental_raft/test_topology_ops
```

@tchaikov
Contributor

tchaikov commented Mar 26, 2024

I am taking a closer look to understand the root cause.

@tgrabiec
Contributor

There are 4 nodes with RF=3, and we decommission and then removenode, so the node count falls below RF. This is not supported with tablets.
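
A toy model of this counting argument (not Scylla's actual tablet load balancer; all names below are invented): draining a node must move each of its tablet replicas to a node that does not already hold a replica of the same tablet, and with 3 nodes left and RF=3 no such node exists.

```python
# Toy illustration of the counting argument, not Scylla's load balancer.
def find_new_replica(nodes: set[str], replicas: set[str]) -> str | None:
    """Pick a node that does not already hold a replica of the tablet."""
    candidates = nodes - replicas
    return next(iter(candidates), None)

nodes = {"n1", "n2", "n3", "n4"}   # 4-node cluster
nodes.discard("n1")                # decommission: 3 nodes remain
replicas = {"n2", "n3", "n4"}      # one tablet's replicas, RF=3

# removenode("n4") must drain its replica, but every remaining node already
# holds one, mirroring "Unable to find new replica for tablet ... when draining".
print(find_new_replica(nodes - {"n4"}, replicas - {"n4"}))  # -> None
```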

@kbr-scylla
Contributor

We could adjust the test to add one more node in tablets mode at the beginning.
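
Hypothetically, the adjustment could be as small as this (a sketch only; the actual fix is in the linked PR):

```python
# Hypothetical sketch: bootstrap one extra node when tablets are enabled, so
# that decommission + removenode never drops the node count below RF=3.
n_other = 4 if tablets_enabled else 3
servers += await manager.servers_add(n_other, config=cfg)
```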

@tchaikov
Contributor

tchaikov commented Mar 28, 2024

Thank you for your insights, gentlemen. I will add one more node to appease tablets.

@kbr-scylla
Contributor

Ping, guys, please don't forget about this issue.

@tchaikov
Contributor

tchaikov commented May 15, 2024

@kbr-scylla I have a WIP locally and will send it tomorrow. Thanks for the ping.

@tgrabiec tgrabiec added this to the 6.0 milestone May 16, 2024
tchaikov added a commit to tchaikov/scylladb that referenced this issue May 16, 2024
In e7d4e08, we reenabled the background writes in this test, but when running with tablets enabled they stayed disabled because of scylladb#17025, which was fixed last week. So we can now enable background writes with tablets as well.

In this change:

* background writes are enabled with tablets;
* the number of nodes is increased by 1 so that we have enough nodes to fulfill the needs of tablets, which require that the number of replicas always satisfies RF;
* rf is passed to `start_writes()` explicitly (a sketch of this shape follows below), so the test has fewer magic numbers and its data dependencies are more obvious.

Fixes scylladb#17589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
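
The last bullet could look roughly like this (a hypothetical sketch of the shape, not the actual patch):

```python
# Hypothetical sketch of passing rf explicitly instead of hard-coding it.
async def start_writes(cql, rf: int, concurrency: int = 3):
    await cql.run_async(
        "CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', 'replication_factor': {rf}}}")
    ...  # start the writer tasks as before

# Call site in test_topology_ops, with the cluster sized to match rf:
finish_writes = await start_writes(cql, rf=3)
```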
@tchaikov tchaikov linked a pull request May 16, 2024 that will close this issue
tchaikov added a commit to tchaikov/scylladb that referenced this issue May 20, 2024
@bhalevy bhalevy modified the milestones: 6.0, 6.1 May 22, 2024
@kbr-scylla
Contributor

@bhalevy I believe it's a very important, basic test that we should have, and the milestone should be 6.0.

If this test detects issues but we disable it, we will see the same issues later down the line in SCT runs.

It's much easier to debug and fix those issues based on test.py output, and then enable the test to prevent regressions, than to debug SCT longevity runs later.

@bhalevy
Member

bhalevy commented May 23, 2024

OK, let's restore the milestone to 6.0 then, to indicate the fix should be backported there.

@bhalevy bhalevy modified the milestones: 6.1, 6.0 May 23, 2024
@mykaul mykaul modified the milestones: 6.0, 6.1 May 30, 2024
tchaikov added two commits to tchaikov/scylladb that referenced this issue Jun 3, 2024