
Allow balancing weights to be set per tier #126091


Open
nicktindall wants to merge 34 commits into base: main

Conversation

@nicktindall (Contributor) commented Apr 2, 2025:

@henningandersen made a good point on my first attempt at this: that approach made some assumptions about the way the allocation deciders would govern shard movement. This iteration attempts to make those assumptions explicit by introducing actual partitioning of the nodes.

Approach

Instead of assuming the BalancedShardsAllocator applies to the entire cluster, I've added the concept of "partitions" to the balancing. The partitions must be mutually disjoint subsets of the shards and nodes, i.e. the shards in a partition are only ever allocated to the nodes in that same partition, as is the case in serverless. WeightFunctions and NodeSorters are scoped to partitions.

The status-quo behaviour is defined by the GlobalPartitionedClusterFactory, which produces a single global partition. The serverless version is called TieredPartitionedClusterFactory and defines a partition for each of the search and indexing tiers.

The serverless code is in here for now; it will be moved to serverless if we decide this is a sound approach. The sketch below shows the rough shape of the idea.
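Roughly (made-up types for illustration only, not the actual interfaces in this PR):

// Hypothetical sketch of the partitioning idea; these are stand-in types, not the PR's classes.
import java.util.List;

record Partition(List<String> nodeIds, List<String> shardIds) {}

interface PartitionedClusterFactory {
    // Partitions must be mutually disjoint: shards in a partition are only ever allocated to nodes in that partition.
    List<Partition> partitions(List<String> allNodeIds, List<String> allShardIds);
}

// Status-quo behaviour: a single global partition containing every node and shard.
final class GlobalPartitionSketch implements PartitionedClusterFactory {
    @Override
    public List<Partition> partitions(List<String> allNodeIds, List<String> allShardIds) {
        return List.of(new Partition(allNodeIds, allShardIds));
    }
}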

Alternatives

It would be nice if we could simply partition the RoutingAllocation and run a Balancer for each of the partitions (search and indexing), but the contents of e.g. Metadata and RoutingNodes are so heavily cross-referenced and aggregated that they might be tricky to pull apart.

A better approach might be to do the partitioning at the point where e.g. the RoutingNodes is generated from the GlobalRoutingTable, but that seems like a more impactful refactor.

) {
this.writeLoadForecaster = writeLoadForecaster;
this.allocation = allocation;
this.routingNodes = allocation.routingNodes();
this.metadata = allocation.metadata();
this.weightFunction = weightFunction;
this.threshold = threshold;
avgShardsPerNode = WeightFunction.avgShardPerNode(metadata, routingNodes);
avgWriteLoadPerNode = WeightFunction.avgWriteLoadPerNode(writeLoadForecaster, metadata, routingNodes);
avgDiskUsageInBytesPerNode = WeightFunction.avgDiskUsageInBytesPerNode(allocation.clusterInfo(), metadata, routingNodes);
@nicktindall (Contributor, Author) commented:

I left the averages calculated globally. I experimented with making them local to the partition, but I'm not sure of the benefit: it might be more costly to filter the nodes and shards by partition to calculate these numbers. It's definitely something we can pursue if we think it's worthwhile.
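For illustration, a per-partition version might look something like this (made-up names, just a sketch):

// Hypothetical sketch of a per-partition average; the PR keeps these averages global.
import java.util.Collection;

final class PartitionAverages {
    // Per-partition variant: divide the partition's shard count by the partition's node count,
    // instead of dividing global totals the way WeightFunction.avgShardPerNode does today.
    static double avgShardsPerNode(long shardsInPartition, Collection<String> nodeIdsInPartition) {
        return nodeIdsInPartition.isEmpty() ? 0.0 : (double) shardsInPartition / nodeIdsInPartition.size();
    }
}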

@nicktindall (Contributor, Author) commented:

I wonder if the split would be significant if we had substantially differently-sized search and indexing tiers.

A contributor commented:

Maybe this wouldn't be costly, just more complex, if we pre-calculated primary vs. replica counts when setting up ProjectMetadata#totalNumberOfShards? Similarly for the other values. But that is quite a bit of fiddly code to maintain.

I wonder, though, whether this code could already be doing weird things to our balance calculations in serverless. For example, hypothetically, if the search tier has 3x the number of shards -- say 3 replica copies per shard -- then the index tier nodes are going to increase the shard weight component (balancing-factor-constant x difference-from-average).

@nicktindall (Contributor, Author) commented Apr 14, 2025:

I think we would probably want to base the counting on ShardRole, because there might be a future where primaries live in the search tier for read-only indices (wild speculation, but you never know). That would keep the logic for deciding "which shards are search/indexing shards" in one place: the StatelessShardRoutingRoleStrategy. Something along the lines of the sketch below.
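(A sketch with stand-in types, not the real ShardRole or strategy classes:)

// Hypothetical sketch of counting shards per tier by role.
import java.util.List;

final class TierShardCounts {
    enum TierRole { INDEXING, SEARCH }                 // stand-in for whatever role the routing strategy assigns
    record ShardView(String shardId, TierRole role) {}

    // Counting by role keeps the "which tier does this shard belong to" decision in one place
    // (the routing-role strategy) rather than inferring it from primary/replica status.
    static long countForTier(List<ShardView> shards, TierRole tier) {
        return shards.stream().filter(s -> s.role() == tier).count();
    }
}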

I wonder, though, whether this code could already be doing weird things to our balance calculations in serverless.

Yes, it's a good point. Even if the indexing and search tiers are just different sizes, some weights could never be balanced: e.g. with 5 indexing nodes and 1 search node, and an index with 5 shards and 1 replica, the average shard count would be 10/6 ≈ 1.67, so the indexing nodes would all be under-weight and the search node massively overweight in terms of shard count (I think).
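To make that arithmetic concrete, a toy calculation (assuming, as I understand it, that the shard-count component of a node's weight scales with its shard count minus the cluster-wide average; not the real weight function):

final class ShardCountSkewExample {
    public static void main(String[] args) {
        int totalShards = 10;                                  // 5 primaries (indexing) + 5 replicas (search)
        int totalNodes = 6;                                    // 5 indexing nodes + 1 search node
        double globalAvg = (double) totalShards / totalNodes;  // 10 / 6 ≈ 1.67

        double indexingNodeDelta = 1 - globalAvg;              // each indexing node holds 1 shard: under-weight
        double searchNodeDelta = 5 - globalAvg;                // the single search node holds 5 shards: over-weight

        System.out.printf("avg=%.2f indexing=%.2f search=%.2f%n", globalAvg, indexingNodeDelta, searchNodeDelta);
        // avg=1.67 indexing=-0.67 search=3.33; neither delta can ever reach zero,
        // so the shard-count component of the weight stays permanently skewed.
    }
}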

// Balance each partition
for (NodeSorter nodeSorter : partitionedNodeSorter.allNodeSorters()) {
balanceByWeights(nodeSorter);
}
@nicktindall (Contributor, Author) commented:

We balance once for each NodeSorter (i.e. once for each partition).

A member commented:

I wonder if it would be more clear to use the node sorter in a "partition" class/holder so we're going through the partitions, rather than the sorter. It might also be a naming thing.

@nicktindall (Contributor, Author) commented:

Yeah, I know what you mean. I've now made NodeSorters an Iterable<NodeSorter>. The "partition" terminology has faded from the implementation somewhat, but I think it's an important concept.

I added some javadoc to NodeSorter in b9403bb to make it clear it's scoped to a partition. I did have the partition terminology more front-and-centre in earlier versions of the PR but it also didn't look quite right.

Also, we don't always need NodeSorters. In NodeAllocationStatsAndWeightsCalculator we use the BalancingWeightsFactory to create the BalancingWeights but never call createNodeSorters, so refactoring to return e.g. a PartitionedCluster containing Partitions that each hold a NodeSorter and a WeightFunction, though conceptually cleanest, wouldn't fit the way we use the API.
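For reference, the rough shape this implies is something like the sketch below (apart from createNodeSorters, the names are placeholders, not the actual signatures):

import java.util.Iterator;
import java.util.List;

interface NodeSorterSketch {}                                  // stands in for BalancedShardsAllocator.NodeSorter

// Iterable so the balancer can loop straight over the per-partition sorters without a separate Partition type.
record NodeSortersSketch(List<NodeSorterSketch> perPartition) implements Iterable<NodeSorterSketch> {
    @Override
    public Iterator<NodeSorterSketch> iterator() {
        return perPartition.iterator();
    }
}

interface BalancingWeightsSketch {}                            // per-partition weight functions live behind this

interface BalancingWeightsFactorySketch {
    BalancingWeightsSketch createBalancingWeights();           // all the stats calculator needs
    NodeSortersSketch createNodeSorters();                     // only needed when actually balancing
}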

@nicktindall nicktindall added :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. labels Apr 8, 2025
@elasticsearchmachine (Collaborator) commented:
Hi @nicktindall, I've created a changelog YAML for you.

@nicktindall nicktindall marked this pull request as ready for review April 8, 2025 06:12
@nicktindall nicktindall requested a review from a team as a code owner April 8, 2025 06:12
@elasticsearchmachine (Collaborator) commented:
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Apr 8, 2025
@@ -1323,7 +1299,7 @@ public boolean containsShard(ShardRouting shard) {
         }
     }

-    static final class NodeSorter extends IntroSorter {
+    public static final class NodeSorter extends IntroSorter {
@nicktindall (Contributor, Author) commented:

Annoyingly, these need to become public so that the StatelessBalancingWeightsFactory can see/instantiate them.

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/ClusterModule.java
#	server/src/main/java/org/elasticsearch/cluster/routing/allocation/NodeAllocationStatsAndWeightsCalculator.java
#	server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
#	server/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancerSettings.java
#	server/src/test/java/org/elasticsearch/cluster/routing/allocation/AllocationStatsServiceTests.java
#	server/src/test/java/org/elasticsearch/cluster/routing/allocation/BalanceConfigurationTests.java
#	server/src/test/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocatorTests.java
#	test/framework/src/main/java/org/elasticsearch/cluster/ESAllocationTestCase.java
@pxsalehi (Member) left a comment:

LGTM! Great work!

@henningandersen (Contributor) left a comment:

LGTM.

Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement serverless-linked Added by automation, don't add manually Team:Distributed Coordination Meta label for Distributed Coordination team v9.1.0
6 participants