
test: test_topology_ops: reenable background writes in tablets mode #17589

Open
patjed41 opened this issue Feb 29, 2024 · 11 comments · May be fixed by #18707

@patjed41
Contributor

#17585 reenables background writes in test_topology_ops, but only when the test is run without tablets. The test fails with tablets because of #17025. Once that issue is fixed, we should reenable background writes with tablets in test_topology_ops.

Ideally, before sending and merging the fix, we should investigate whether test_topology_ops with tablets and background writes is flaky. Before #17585, the test without tablets but with background writes was flaky in CI (1-2 failures per 1000 runs in dev mode).
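
For context, a background writer like `start_writes` keeps a few tasks inserting rows while the topology operations run, and returns a coroutine that stops them and verifies the acknowledged writes. The real helper lives in the test suite; the sketch below only illustrates the idea, and the keyspace/table names, the RF, and the `run_async` wrapper style are assumptions:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def start_writes(cql, concurrency: int = 3):
    """Minimal sketch: start background INSERTs and return a finisher that
    stops them and checks that the acknowledged writes are readable."""
    # ks/t are placeholder names; RF=3 matches the 4-node cluster in the test.
    await cql.run_async("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                        "{'class': 'NetworkTopologyStrategy', 'replication_factor': 3}")
    await cql.run_async("CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, v int)")

    stop = asyncio.Event()

    async def do_writes(worker: int) -> None:
        key = worker
        while not stop.is_set():
            await cql.run_async(f"INSERT INTO ks.t (pk, v) VALUES ({key}, {key})")
            key += concurrency  # each worker writes a disjoint set of keys

    tasks = [asyncio.create_task(do_writes(i)) for i in range(concurrency)]

    async def finish_writes() -> None:
        stop.set()
        await asyncio.gather(*tasks)
        # Verify the data survived the topology operations.
        rows = await cql.run_async("SELECT COUNT(*) FROM ks.t")
        logger.info("background writes done, %s rows readable", rows[0].count)

    return finish_writes
```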

@patjed41
Contributor Author

#17025 has been fixed, so this issue can be worked on.

@bhalevy
Member

bhalevy commented Mar 25, 2024

@tchaikov can you please look into this issue?

@tchaikov
Contributor

tchaikov commented Mar 26, 2024

After applying the following patch:

```diff
modified   test/topology_experimental_raft/test_topology_ops.py
@@ -40,9 +40,7 @@ async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bo
     await wait_for_cql_and_get_hosts(manager.cql, servers, time.time() + 60)
     cql = await reconnect_driver(manager)

-    # FIXME: we disable background writes with tablets enabled because the test fails due to #17025.
-    # We need to re-enable them once this issue is fixed (by removing `if` here and above `await finish_writes()`).
-    finish_writes = await start_writes(cql) if not tablets_enabled else None
+    finish_writes = await start_writes(cql)

     logger.info(f"Decommissioning node {servers[0]}")
     await manager.decommission_node(servers[0].server_id)
@@ -72,8 +70,7 @@ async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bo
     servers = servers[1:]

     logger.info("Checking results of the background writes")
-    if not tablets_enabled:
-        await finish_writes()
+    await finish_writes()
     for server in servers:
         await check_node_log_for_failed_mutations(manager, server)
```

I get the following test failure:

```
    @pytest.mark.asyncio
    #@pytest.mark.parametrize("tablets_enabled", ["true", "false"])
    async def test_topology_ops(request, manager: ManagerClient, tablets_enabled: bool = True):
        """Test basic topology operations using the topology coordinator."""
        cfg = {'experimental_features': ['consistent-topology-changes']}
        if tablets_enabled:
            cfg['experimental_features'].append('tablets')

        logger.info("Bootstrapping first node")
        servers = [await manager.server_add(config=cfg)]

        logger.info(f"Restarting node {servers[0]}")
        await manager.server_stop_gracefully(servers[0].server_id)
        await manager.server_start(servers[0].server_id)

        logger.info("Bootstrapping other nodes")
        servers += await manager.servers_add(3, config=cfg)

        await wait_for_cql_and_get_hosts(manager.cql, servers, time.time() + 60)
        cql = await reconnect_driver(manager)
        finish_writes = await start_writes(cql)

        logger.info(f"Decommissioning node {servers[0]}")
        await manager.decommission_node(servers[0].server_id)
        await check_token_ring_and_group0_consistency(manager)
        servers = servers[1:]

        logger.info(f"Restarting node {servers[0]} when other nodes have bootstrapped")
        await manager.server_stop_gracefully(servers[0].server_id)
        await manager.server_start(servers[0].server_id)

        logger.info(f"Stopping node {servers[0]}")
        await manager.server_stop_gracefully(servers[0].server_id)
        await check_node_log_for_failed_mutations(manager, servers[0])

        logger.info(f"Replacing node {servers[0]}")
        replace_cfg = ReplaceConfig(replaced_id = servers[0].server_id, reuse_ip_addr = False, use_host_id = False)
        servers = servers[1:] + [await manager.server_add(replace_cfg)]
        await check_token_ring_and_group0_consistency(manager)

        logger.info(f"Stopping node {servers[0]}")
        await manager.server_stop_gracefully(servers[0].server_id)
        await check_node_log_for_failed_mutations(manager, servers[0])

        logger.info(f"Removing node {servers[0]} using {servers[1]}")
>       await manager.remove_node(servers[1].server_id, servers[0].server_id)
```
...
```
self = <test.pylib.rest_client.UnixRESTClient object at 0x7f31a33b7ad0>
method = 'PUT', resource = '/cluster/remove-node/4', response_type = None
host = None, port = None, params = None
json = {'expected_error': None, 'ignore_dead': [], 'server_id': 3}
timeout = 1000

    async def _fetch(self, method: str, resource: str, response_type: Optional[str] = None,
                     host: Optional[str] = None, port: Optional[int] = None,
                     params: Optional[Mapping[str, str]] = None,
                     json: Optional[Mapping] = None, timeout: Optional[float] = None) -> Any:
        # Can raise exception. See https://docs.aiohttp.org/en/latest/web_exceptions.html
        assert method in ["GET", "POST", "PUT", "DELETE"], f"Invalid HTTP request method {method}"
        assert response_type is None or response_type in ["text", "json"], \
                f"Invalid response type requested {response_type} (expected 'text' or 'json')"
        # Build the URI
        port = port if port else self.default_port if hasattr(self, "default_port") else None
        port_str = f":{port}" if port else ""
        assert host is not None or hasattr(self, "default_host"), "_fetch: missing host for " \
                "{method} {resource}"
        host_str = host if host is not None else self.default_host
        uri = self.uri_scheme + "://" + host_str + port_str + resource
        logging.debug(f"RESTClient fetching {method} {uri}")

        client_timeout = ClientTimeout(total = timeout if timeout is not None else 300)
        async with request(method, uri,
                           connector = self.connector if hasattr(self, "connector") else None,
                           params = params, json = json, timeout = client_timeout) as resp:
            if resp.status != 200:
                text = await resp.text()
>               raise HTTPError(uri, resp.status, params, json, text)
E               test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/remove-node/4, params: None, json: {'server_id': 3, 'ignore_dead': [], 'expected_error': None}, body:
E               removenode failed (initiator: ScyllaServer(4, 127.198.48.4, 53a4cc60-298c-4194-a139-c037309e32ba), to_remove: ScyllaServer(3, 127.198.48.3, 565cc314-62fa-484c-b587-49135a51ee89), ignore_dead: []), check log file at /home/kefu/dev/scylladb/testlog/release/scylla-4.log, error: "HTTP error 500, uri: http://127.198.48.4:10000/storage_service/remove_node, params: {'host_id': '565cc314-62fa-484c-b587-49135a51ee89', 'ignore_nodes': ''}, json: None, body:
E               {"message": "std::runtime_error (Removenode failed. See earlier errors (Rolled back: Failed to drain tablets: std::runtime_error (Unable to find new replica for tablet 7d477460-eb2b-11ee-ba7a-b34fcafc9b0e:3 on 565cc314-62fa-484c-b587-49135a51ee89:1 when draining {565cc314-62fa-484c-b587-49135a51ee89})))", "code": 500}"
```

when running:

```
$ ./test.py --verbose --mode release topology_experimental_raft/test_topology_ops
```

@tchaikov
Contributor

tchaikov commented Mar 26, 2024

I am taking a closer look to understand the root cause.

@tgrabiec
Contributor

There are 4 nodes with RF=3, and we decommission and then removenode, so the node count falls below RF. This is not supported with tablets.
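
A toy model of this counting argument (not Scylla's actual tablet load balancer; all names below are invented): draining a node must move each of its tablet replicas to a node that does not already hold a replica of the same tablet, and with 3 nodes left and RF=3 no such node exists.

```python
# Toy illustration of the counting argument, not Scylla's load balancer.
def find_new_replica(nodes: set[str], replicas: set[str]) -> str | None:
    """Pick a node that does not already hold a replica of the tablet."""
    candidates = nodes - replicas
    return next(iter(candidates), None)

nodes = {"n1", "n2", "n3", "n4"}   # 4-node cluster
nodes.discard("n1")                # decommission: 3 nodes remain
replicas = {"n2", "n3", "n4"}      # one tablet's replicas, RF=3

# removenode("n4") must drain its replica, but every remaining node already
# holds one, mirroring "Unable to find new replica for tablet ... when draining".
print(find_new_replica(nodes - {"n4"}, replicas - {"n4"}))  # -> None
```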

@kbr-scylla
Contributor

We could adjust the test to add one more node in tablets mode at the beginning.
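
Hypothetically, the adjustment could be as small as this (a sketch only; the actual fix is in the linked PR):

```python
# Hypothetical sketch: bootstrap one extra node when tablets are enabled, so
# that decommission + removenode never drops the node count below RF=3.
n_other = 4 if tablets_enabled else 3
servers += await manager.servers_add(n_other, config=cfg)
```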

@tchaikov
Contributor

tchaikov commented Mar 28, 2024

Thank you for your insights, gentlemen. I will add one more node to appease tablets.

@kbr-scylla
Contributor

Ping, guys, please don't forget about this issue.

@tchaikov
Contributor

tchaikov commented May 15, 2024

@kbr-scylla I have a WIP locally and will send it tomorrow. Thanks for the ping.

@tgrabiec tgrabiec added this to the 6.0 milestone May 16, 2024
tchaikov added a commit to tchaikov/scylladb that referenced this issue May 16, 2024
In e7d4e08, we reenabled the background writes in this test, but when running with tablets enabled they stayed disabled because of scylladb#17025, which was fixed last week. So we can now enable background writes with tablets as well.

In this change:

* background writes are enabled with tablets;
* the number of nodes is increased by 1 so that we have enough nodes to fulfill the needs of tablets, which require that the number of replicas always satisfies RF;
* rf is passed to `start_writes()` explicitly (a sketch of this shape follows below), so the test has fewer magic numbers and its data dependencies are more obvious.

Fixes scylladb#17589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
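
The last bullet could look roughly like this (a hypothetical sketch of the shape, not the actual patch):

```python
# Hypothetical sketch of passing rf explicitly instead of hard-coding it.
async def start_writes(cql, rf: int, concurrency: int = 3):
    await cql.run_async(
        "CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', 'replication_factor': {rf}}}")
    ...  # start the writer tasks as before

# Call site in test_topology_ops, with the cluster sized to match rf:
finish_writes = await start_writes(cql, rf=3)
```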
@tchaikov tchaikov linked a pull request May 16, 2024 that will close this issue
tchaikov added a commit to tchaikov/scylladb that referenced this issue May 20, 2024
@bhalevy bhalevy modified the milestones: 6.0, 6.1 May 22, 2024
@kbr-scylla
Contributor

@bhalevy I believe it's a very important, basic test that we should have, and the milestone should be 6.0.

If this test detects issues but we disable it, we will see the same issues later down the line in SCT runs.

It's much easier to debug and fix those issues based on test.py output, and then enable the test to prevent regressions, than to debug SCT longevity runs later.

@bhalevy
Member

bhalevy commented May 23, 2024

OK, let's restore the milestone to 6.0 then, to indicate the fix should be backported there.

@bhalevy bhalevy modified the milestones: 6.1, 6.0 May 23, 2024
@mykaul mykaul modified the milestones: 6.0, 6.1 May 30, 2024
tchaikov added two commits to tchaikov/scylladb that referenced this issue Jun 3, 2024