feat: Sharding allocation strategy based on slice ranges #32418
base: main
Conversation
* for the database sharding we want to reduce the number of db connections when using many Akka nodes
* allocate entity sharding by slice ranges, which is also used by the database sharding
* thereby the db connections from one Akka node will go to one database, instead of to all
val regionsByMbr = regionsByMember(currentShardAllocations.keySet)
val regions = regionsByMbr.keysIterator.toIndexedSeq.sorted(Member.ageOrdering).map(regionsByMbr(_))
val rangeSize = NumberOfSlices / regions.size
val i = slice / rangeSize
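The naive range computation above can be sketched as a standalone function; this is a hypothetical reconstruction (the name `regionIndexFor` is mine, not from the diff):

```scala
// Hypothetical standalone sketch of the naive first approach: the 1024 slices
// are split into equal contiguous ranges, one per region, regions sorted by age.
val NumberOfSlices = 1024

def regionIndexFor(slice: Int, numberOfRegions: Int): Int = {
  val rangeSize = NumberOfSlices / numberOfRegions
  // clamp so the rounding remainder of the last range stays on the last region
  math.min(slice / rangeSize, numberOfRegions - 1)
}
```

For example, with 3 regions the rangeSize is 341, so slice 1023 would index past the last region without the clamp.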
A problem with this first naive approach is that adding/removing nodes will reshuffle many shards. Consistent hashing would be nice, but since it's (dynamic) ranges I don't see how that can be used. I have an idea that it can instead find existing adjacent shards and prefer allocation to same region.
Haven't tracked what the updated algorithm is yet, but feels like this issue could always be there in some form.
The optimal allocation (however it ends up there) will have all the shards for a slice range together. When these need to be reallocated, the optimal is to allocate as many as possible to one node again, but that either leads to reshuffles, or to accepting unbalanced distributions, or the shards need to be redistributed over the other nodes, creating fragmentation of the slice range.
But maybe in practice the find-neighbours approach naturally ends up balancing the tradeoff over time?
* to avoid too much reshuffling when adding/removing members
@johanandren @pvlugter I have something with decent results for the simulations. Can you take a look before I continue with tuning and real tests.
Looking good so far.
// This covers the rounding case for the last region, which we just distribute over all regions.
// May also happen if member for that region has been removed, but that should be a rare case.
Suggested change ("last region" → "last slice" in the first line):
// This covers the rounding case for the last slice, which we just distribute over all regions.
// May also happen if member for that region has been removed, but that should be a rare case.
val overfill = 2
val maxShards = (NumberOfSlices / currentShardAllocations.size) + overfill
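The cap above is simple enough to sketch in isolation (assuming `currentShardAllocations.size` is the number of regions):

```scala
// Sketch of the allocation cap: each region takes at most its fair share of the
// 1024 slices plus a small overfill, so that the neighbor preference cannot
// keep piling shards onto the same node.
val NumberOfSlices = 1024

def maxShards(numberOfRegions: Int, overfill: Int = 2): Int =
  NumberOfSlices / numberOfRegions + overfill
```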
// FIXME take a look at ShardSuitabilityOrdering for member status and appVersion preference
This seems quite important, or I think it will try to stick to allocating to old nodes until quite a number of cluster nodes have rolled, as it prefers neighbours. Maybe I'm missing something with the maxShards protecting against that?
  emptyRebalanceResult
} else {
  // this is the number of slices per region that we are aiming for
  val overfill = 1
Should it align with the overfill on allocation? (just the 1 in diff but seems strange)
val overfill = 1
val targetSize = NumberOfSlices / sortedRegionEntries.size + overfill
val selected = Vector.newBuilder[ShardId]
// FIXME ShardSuitabilityOrdering isn't used, but it seems better to use most shards first, combine them?
Missing the handling of leaving regions and app version from there because there if we don't want to use it
Lol, that sentence was not parseable, just wanted to note that the ShardSuitabilityOrdering looks at member leaving and app version, so if we don't want to use it we should probably do something around those properties as well.
This reverts commit 1d5d05b.
findRegionWithNeighbor(slice, sortedRegionEntries) match {
  case Some(regionWithNeighbor) =>
    Future.successful(regionWithNeighbor)
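A much-simplified, hypothetical version of the neighbor lookup (the real `findRegionWithNeighbor` works on sorted region entries; here `allocations` simply maps region names to their allocated slices):

```scala
// Simplified sketch of the neighbor preference: search outward from the slice
// at increasing distance and prefer the region owning the first allocated
// neighbor slice found. Returns None when no neighbor is allocated yet.
def findRegionWithNeighbor(
    slice: Int,
    maxDistance: Int,
    allocations: Map[String, Set[Int]]): Option[String] = {
  def regionOf(s: Int): Option[String] =
    allocations.collectFirst { case (region, slices) if slices.contains(s) => region }

  (1 to maxDistance).iterator
    .flatMap(d => Iterator(slice - d, slice + d))
    .filter(s => s >= 0 && s < 1024)
    .flatMap(s => regionOf(s).iterator)
    .nextOption()
}
```

Note that lower neighbors are tried before upper ones at each distance; the real implementation may break that tie differently.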
I had an idea that I thought would be good, and implemented it in 1d5d05b. Reverted because "try distributions" didn't show an improvement in the reduction of connections. The idea was to also look at the already allocated range from min to max slice in that region, and if the slice is outside of the optimal range, try to find a region with a lower/upper slice neighbor instead.
I'm still puzzled that this didn't work, maybe I made some mistake in the implementation. I might debug it a little more.
// These are not real tests, but can be useful for exploring the algorithm and tuning
"SliceRangeShardAllocationStrategy simulations" must {

  "try distributions" ignore {
A few examples of the result:
total of 73 connections from 20 nodes to 8 backend ranges, reduction by 87
total of 88 connections from 20 nodes to 16 backend ranges, reduction by 232
total of 84 connections from 50 nodes to 8 backend ranges, reduction by 316
total of 98 connections from 50 nodes to 16 backend ranges, reduction by 702
total of 125 connections from 100 nodes to 8 backend ranges, reduction by 675
total of 141 connections from 100 nodes to 16 backend ranges, reduction by 1459
And "connection" here should be read as a db connection pool, so multiply by a factor of 10, or whatever the pool size is.
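Reading the numbers above: the reduction is relative to the worst case where every node connects to every slice range (e.g. 20 nodes × 8 ranges = 160 links, down to 73). That arithmetic in a small sketch:

```scala
// The simulation counts node-to-range links. Without any slice affinity every
// node links to every backend range; actual db connections are links × pool size.
def worstCaseLinks(nodes: Int, ranges: Int): Int = nodes * ranges
def totalDbConnections(links: Int, poolSize: Int): Int = links * poolSize
```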
Ok, so for the numbers we've been testing with (around 25 nodes and 8 slice ranges) it's a decent improvement, but still over 3x the optimal number of connections. And for higher number of nodes it's easier to get affinity.
Let's say a connection pool of max 20, then 73*20/8 ≈ 183 connections per db should still be fine.
I'd say that it's more important for the larger clusters, so good that it looks better for that.
One idea that I have been pondering is if we should have a pre-allocation phase where it would allocate all shards in order 0-1023. Then it would be near perfect affinity. Would have to be triggered from outside the allocation strategy itself, in the coordinator, or by something asking the coordinator for shard homes. Could be triggered when reaching min-nr-of-members.
However, it would only help for the initial allocation and not for rebalance.
I made a small change that had a significant improvement for the smaller clusters: I adjust how far to look for neighbors depending on the optimal range size, where previously it was hardcoded to 10. bc9f86d
New results:
total of 46 connections from 20 nodes to 8 backend ranges, reduction by 114
total of 67 connections from 20 nodes to 16 backend ranges, reduction by 253
total of 84 connections from 50 nodes to 8 backend ranges, reduction by 316
total of 98 connections from 50 nodes to 16 backend ranges, reduction by 702
total of 124 connections from 100 nodes to 8 backend ranges, reduction by 676
total of 138 connections from 100 nodes to 16 backend ranges, reduction by 1462
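The change in bc9f86d ties the neighbor search distance to the optimal range size instead of a fixed 10; the exact formula below is an assumption for illustration, not taken from the commit:

```scala
// Hypothetical sketch: derive the neighbor search distance from the optimal
// range size (1024 / number of regions). Assumption: search half the optimal
// range in each direction, with a floor of 1.
val NumberOfSlices = 1024

def neighborSearchDistance(numberOfRegions: Int): Int =
  math.max(1, NumberOfSlices / numberOfRegions / 2)
```

With 8 regions this searches much further (64) than the old fixed 10, which matches the bigger improvement seen for the smaller clusters.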
Cool. Makes sense that it has more opportunity to coalesce with looking further.
Looking pretty good, esp for the bigger cluster sizes when running the "try distributions" case.
override def entityId(envelope: ShardingEnvelope[M]): String = envelope.entityId

override def shardId(entityId: String): String = {
  // FIXME shall we have the Persistence extension dependency here, or re-implement sliceForPersistenceId?
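If re-implementing rather than depending on the Persistence extension, the slice function is small; this sketch mirrors (as far as I know) Akka's definition, so treat it as an assumption to verify against the extension:

```scala
// Sketch of re-implementing sliceForPersistenceId instead of taking a
// dependency on the Persistence extension. Assumed to mirror Akka's
// definition: abs(persistenceId.hashCode % 1024).
def sliceForPersistenceId(persistenceId: String): Int =
  math.abs(persistenceId.hashCode % 1024)

// The extractor's shardId is then simply the slice rendered as a string.
def shardId(entityId: String): String = sliceForPersistenceId(entityId).toString
```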
Optional dependency on persistence seems fine to me.
I was thinking that this is only useful together with database sharding, i.e. persistence included. I was even thinking that it should only be documented in https://doc.akka.io/docs/akka-persistence-r2dbc/current/data-partition.html
I agree that makes sense
if (slice >= numberOfSlices)
  throw new IllegalArgumentException("slice must be between 0 and 1023. Use `ShardBySliceMessageExtractor`.")

val sortedRegionEntries = regionEntriesFor(currentShardAllocations).toVector.sorted(shardSuitabilityOrdering)
👍 nice, covers leaving and app version without repeating that logic
yes, had to make a parameter for using least shards or not in the ordering
// prefer the node with the least allocated shards
JInteger.compare(allocatedShardsX.size, allocatedShardsY.size)
} else if (x.member.upNumber != y.member.upNumber) {
  // prefer older
  Member.ageOrdering.compare(x.member, y.member)
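Condensed into a hypothetical standalone form (in Akka a lower `upNumber` means an older member), the ordering being discussed amounts to:

```scala
// Hypothetical condensed form of the suitability ordering: fewest allocated
// shards first, then oldest member (lowest upNumber) as a deterministic tie-break.
final case class RegionEntry(upNumber: Int, allocatedShards: Int)

val shardSuitabilityOrdering: Ordering[RegionEntry] =
  Ordering.by(e => (e.allocatedShards, e.upNumber))
```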
Why prefer older? Not saying younger or ignoring age would be better, but want to understand the rationale. (I understand this is the case where nodes have same version so not rolling)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong preference, mostly just wanted to have a deterministic order. Possibly that a younger could be more "cold" and shouldn't be overloaded by many new shards immediately.
val currentNumberOfShards = sortedRegionEntries.map(_.shardIds.size).sum
val limitedResult = result.take(limit(currentNumberOfShards)).toSet
previousRebalance = previousRebalance.union(limitedResult)
if (previousRebalance.size >= numberOfSlices / 4)
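A pure sketch of the guard being discussed, with the state passed explicitly (in the diff `previousRebalance` is a field and the threshold is numberOfSlices / 4 = 256):

```scala
// Hypothetical pure form of the rebalance guard: cap how many shards one round
// may move, and move nothing once a quarter of the 1024 slices has been
// rebalanced recently, to avoid rebalance loops.
val numberOfSlices = 1024

def rebalanceRound(
    candidates: Vector[String],
    limit: Int,
    previousRebalance: Set[String]): (Set[String], Set[String]) =
  if (previousRebalance.size >= numberOfSlices / 4) (Set.empty, previousRebalance)
  else {
    val limited = candidates.take(limit).toSet
    (limited, previousRebalance union limited)
  }
```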
Did it trigger rebalance-loop with the previous lower value (100 I think)?
No exact science around that choice. In some simulations I could see loops with 100, which looked better with 256.
Some API docs and maybe some cleanup remaining, but marking as ready.
LGTM. Good starting point to improve things for number of connections 👍🏼
The tension between distribution and affinity feels fundamental to the problem here, and there will likely always be some tradeoffs to make. Maybe it's useful to have an option that triggers more reshuffling to get closer to optimal affinity, for clusters where membership changes are not expected to happen often and more movement during changes is acceptable before settling again? Or to retain the optimal affinity allocation after a rolling update (as opposed to cluster size changes).
Some extra thoughts (not for now): wonder if there's potential for this to become a general affinity-based allocation strategy, where some affinity function is provided. And thinking about whether there's a hashing + affinity/preference approach that would be useful here (compared with searching for neighbours / high affinity shards).
LGTM!
Early draft so far.