tests: relax check in AutomaticLeadershipBalancingTest #18497

ztlpn · 2024-05-15T15:20:18Z

Relax the shard leader count check because the leader balancer may not be able to achieve balanced counts, due to the interplay between the topic-aware and total-count objectives (see https://github.com/redpanda-data/core-internal/issues/1282).

Fixes #17150

Also mute just-restarted nodes in leader_balancer: their health reports can have incomplete partition info, and they are probably busy recovering partitions anyway.

Backports Required

Release Notes

Improvements

Don't try to transfer leadership to just-restarted nodes when balancing leaders.

vbotbuildovich · 2024-05-15T17:28:59Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d17-82bc-48cf-b433-0c9851414504

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d1f-a86b-476f-af6b-18eede55586a

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49162#018f7d1f-a86e-4f3a-bf44-258891b862ca

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f

bharathv · 2024-05-16T02:17:51Z

tests/rptest/tests/leadership_transfer_test.py

        for s, count in shard2leaders.items():
-            expected_min = math.floor(expected_on_shard * 0.8)
+            # Check with a lot of slack because leader balancer may not be able to achieve


I wonder if we should mark this as ok_to_fail instead, so we don't lose track of tightening the check once the underlying issue is fixed.

Well, if it is marked ok_to_fail we will surely lose track :) and I don't think we can mark individual assertions ok_to_fail... Also, even in this form the check is somewhat useful.
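For readers following the thread, here is a minimal sketch of what a relaxed per-shard lower bound could look like. It is not the code from this PR: the function name `check_shard_leader_counts` and the `slack` parameter are hypothetical, and the relaxed factor actually chosen in the change is not visible in the diff excerpt above (only the old `0.8` factor is).

```python
import math


def check_shard_leader_counts(shard2leaders, expected_on_shard, slack=0.5):
    """Return the shards whose leader count is below a relaxed lower bound.

    `slack` is a hypothetical knob for illustration only. The old check
    used a factor of 0.8; the relaxed value used in the PR is not shown
    in the excerpt, so 0.5 here is an assumption.
    """
    expected_min = math.floor(expected_on_shard * slack)
    # Keep only the shards that violate the lower bound.
    return {s: count for s, count in shard2leaders.items() if count < expected_min}


# Example: with 10 leaders expected per shard and slack=0.5, only shards
# holding fewer than 5 leaders are reported.
violations = check_shard_leader_counts(
    {"n1/0": 10, "n1/1": 6, "n2/0": 3}, expected_on_shard=10, slack=0.5
)
```

A looser bound like this still catches grossly unbalanced shards, which matches the point made in the reply that the check stays somewhat useful even in relaxed form.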

Just-restarted nodes may have incomplete health reports because not all partitions have started yet. Also, right after a restart the node is probably busy catching up and replicating data that was produced in its absence. For these two reasons, just-restarted nodes are bad candidates for leadership transfers, so mute them.

Fixes redpanda-data#17150
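The muting idea in the commit message above can be illustrated with a small sketch. This is only a Python model of the reasoning, not the leader_balancer implementation: `NodeState`, `muted_nodes`, `transfer_candidates`, and the `min_uptime_sec` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    node_id: int
    uptime_sec: float  # seconds since the node process last started


def muted_nodes(nodes, min_uptime_sec=120.0):
    """Return ids of nodes that restarted too recently to be transfer targets.

    The 120-second threshold is an assumption made for this sketch.
    """
    return {n.node_id for n in nodes if n.uptime_sec < min_uptime_sec}


def transfer_candidates(nodes, min_uptime_sec=120.0):
    """Return ids of nodes that may receive leadership, in input order."""
    muted = muted_nodes(nodes, min_uptime_sec)
    return [n.node_id for n in nodes if n.node_id not in muted]


nodes = [NodeState(1, 3600.0), NodeState(2, 15.0), NodeState(3, 500.0)]
candidates = transfer_candidates(nodes)  # node 2 restarted 15 s ago, so it is skipped
```

Filtering on uptime sidesteps both problems named in the commit message: the balancer never relies on a possibly incomplete health report from a fresh node, and it does not add leadership load to a node that is still catching up.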

vbotbuildovich · 2024-05-16T16:55:30Z

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f86-4639-be96-7e0e03f9e76b:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c581-409c-b052-78e58d78c3c4:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f81-40f2-8711-9c4910f8634f:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8226-1f89-46ed-b36f-d8f40d5f346a:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c585-4e68-89d5-245de545bb40:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c583-4ebe-a3d7-96108f6a4b42:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8215-c57e-4a43-97a7-8423e98bb3c6:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49228#018f8300-c364-4b33-9d07-7b67d4bb629d:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbe-433f-9aa9-bf906ba7c3c4:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fbc-46e2-94ab-2fea83f5e43d:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9da7-7fc2-4294-9018-8ace8d69812e:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e3-4246-a9f1-0e9ff37873de:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_and_segment_metadata"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e6-4789-bc86-964fc89cb749:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_and_segment_metadata"
"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.S3.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68e9-47b9-9923-9e87cf7f343c:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=check_manifest_existence"

new failures in https://buildkite.com/redpanda/redpanda/builds/49396#018f9daf-68ec-4efa-96e8-a7e7a72f486f:

"rptest.tests.topic_recovery_test.TopicRecoveryTest.test_many_partitions.cloud_storage_type=CloudStorageType.ABS.check_mode=no_check"

bharathv · 2024-05-21T17:28:36Z

@ztlpn is this ready for review? Lots of failures, so unsure if they are related or not.

ztlpn · 2024-05-21T17:32:06Z

@bharathv they are related, though this is more of a test problem. Currently discussing with the storage team how to fix the test.

ztlpn · 2024-05-21T23:07:55Z

merged #18603, retrying ci...

ztlpn · 2024-05-21T23:08:09Z

/ci-repeat

ztlpn requested review from bharathv, bashtanov and mmaslankaprv May 15, 2024 15:20

github-actions bot added the area/redpanda label May 15, 2024

bharathv reviewed May 16, 2024

ztlpn added 2 commits May 16, 2024 15:36

tests: relax check in AutomaticLeadershipBalancingTest

e68016b

Fixes redpanda-data#17150

ztlpn force-pushed the fix-17150 branch from ca483a4 to e68016b Compare May 16, 2024 14:26

ztlpn requested a review from bharathv May 16, 2024 14:35
