Issue
When a shard holds a large number of actors, rebalancing does not complete properly during a rolling restart of the nodes.
When the nodes are restarted one by one, some shards are not rebalanced to another node.
Other cluster sharding operations also stop working after the sequential restart, even though the nodes are still connected to each other in the cluster.
Cluster Setup
Using Akka-Cluster 2.5.32. The cluster is formed of 3 nodes.
Each node hosts 2 shards, and each shard holds 600 entity actors.
In total: 6 shards → 3600 entity actors in the cluster.
Enabled
akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []
akka.remote.artery {
  enabled = on
  transport = tcp
}
At Start
- Node 1 Shard 4,3 with [ 600,600 ] entities
- Node 2 Shard 1,5 with [ 600,600 ] entities
- Node 3 Shard 0,2 with [ 600,600 ] entities
Restart & Rebalance Flow
On restart, a node leaves the cluster by invoking cluster.leave(cluster.selfAddress());
i) Node 1 Restarts
14:59:45:024
Node 1 Leaves the Cluster.
14:59:48:800
In Node 1 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
14:59:48:800
In Node 1 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:00:48:805
In Node 3 : logs Rebalance shard [Shard_4] done [false]
and Rebalance shard [Shard_3] done [false]
15:00:51:935
In Node 2 : Shard_3 is rebalanced and all 600 entity actors are recreated.
15:00:48:980
In Node 3 : Shard_4 is rebalanced and all 600 entity actors are recreated.
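To make the gap in the Node 1 log explicit, here is a minimal sketch that computes the time between the HandOffStopper start and the "done [false]" line. The class and method names are hypothetical, and the timestamp layout (hours:minutes:seconds:millis) is assumed from the excerpts above. The result is roughly 60 seconds, which may be worth comparing against the configured akka.cluster.sharding.handoff-timeout (60 s by default, as far as I know).

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Akka: measures gaps between log timestamps.
final class HandoffTiming {
    // Assumed layout of the timestamps in the log excerpts: HH:mm:ss:SSS.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss:SSS");

    // Milliseconds elapsed between two timestamps on the same day.
    static long elapsedMillis(String from, String to) {
        LocalTime start = LocalTime.parse(from, LOG_TIME);
        LocalTime end = LocalTime.parse(to, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // HandOffStopper started on Node 1 -> "Rebalance shard ... done [false]" on Node 3.
        long gap = elapsedMillis("14:59:48:800", "15:00:48:805");
        System.out.println("Hand-off start to 'done [false]': " + gap + " ms");
    }
}
```

The same helper can be applied to the timestamps of the later restarts to check whether the gap is the same each time.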
State after the Node 1 restart
- Node 1 Down
- Node 2 Shard 1,5,3 with [ 600,600,600 ] entities
- Node 3 Shard 0,2,4 with [ 600,600,600 ] entities
ii) Node 2 Restarts
15:01:23:209
Node 1 Rejoins the Cluster.
15:01:24:052
Node 2 Leaves the Cluster.
15:01:26:804
In Node 2 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
15:01:26:804
In Node 2 : Starting HandOffStopper for shard Shard_5 to terminate 600 entities.
15:01:26:804
In Node 2 : Starting HandOffStopper for shard Shard_1 to terminate 600 entities.
15:02:26:794
In Node 3 : logs Rebalance shard [Shard_3] done [false],
Rebalance shard [Shard_5] done [false]
and Rebalance shard [Shard_1] done [false]
15:02:27:028
In Node 1 : Shard_5 is rebalanced but only 580 entity actors are recreated.
15:02:28:237
In Node 1 : Shard_1 is rebalanced but only 61 entity actors are recreated.
15:02:32:967
In Node 1 : Shard_3 is rebalanced but only 35 entity actors are recreated.
State after the Node 2 restart
- Node 1 Shard 1,5,3 with [ 61,580,35 ] entities
- Node 2 Down
- Node 3 Shard 0,2,4 with [ 600,600,600 ] entities
iii) Node 3 Restarts
15:02:51:133
Node 2 Rejoins the Cluster.
15:02:51:799
Node 3 Leaves the Cluster.
15:02:55:621
In Node 3 : Starting HandOffStopper for shard Shard_0 to terminate 600 entities.
15:02:55:621
In Node 3 : Starting HandOffStopper for shard Shard_2 to terminate 600 entities.
15:02:55:622
In Node 3 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:03:55:772
In Node 2 : Shard_0 is rebalanced but only 116 entity actors are recreated.
Shard_2 and Shard_4 are not rebalanced to Node 2.
15:04:27:943
Node 3 Rejoins the Cluster
State after the Node 3 restart
- Node 1 Shard 1,5,3 with [ 61,580,35 ] entities
- Node 2 Shard 0 with [ 116 ] entities
- Node 3 No Shards.
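To quantify the loss, here is a small sketch that adds up the entity counts reported in the logs after the full rolling restart and compares them with the expected 6 × 600. The class and method names are hypothetical. The per-shard numbers are the ones from the "Only N entity Actors recreated" log lines, and Shard_2 and Shard_4 are counted as 0 because they were never rebalanced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: tallies entity actors per shard from the figures in the logs.
final class EntityTally {
    // Entity counts per shard after all three nodes were restarted.
    static Map<String, Integer> afterRestart() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Shard_0", 116);
        counts.put("Shard_1", 61);
        counts.put("Shard_2", 0);   // not rebalanced to any node
        counts.put("Shard_3", 35);
        counts.put("Shard_4", 0);   // not rebalanced to any node
        counts.put("Shard_5", 580);
        return counts;
    }

    // Sum of all entity actors across the given shards.
    static int total(Map<String, Integer> entitiesPerShard) {
        return entitiesPerShard.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        int expected = 6 * 600;                  // 6 shards with 600 entities each
        int actual = total(afterRestart());
        System.out.println("Expected " + expected + ", running " + actual
                + ", missing " + (expected - actual));
    }
}
```

With the figures above, 792 of the 3600 entity actors are running after the restart, so 2808 are missing.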
We have a singleton actor in the cluster that logs the current cluster member status at a fixed interval. It also logs the number of actors in the cluster by invoking
GetClusterShardingStats getStats = new ShardRegion.GetClusterShardingStats(FiniteDuration.create(5000, TimeUnit.MILLISECONDS));
Future<Object> ack = Patterns.ask(region, getStats, timeout).toCompletableFuture();
ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);
With these cluster stats, the singleton logs the active actor count with a per-shard breakdown.
Before the sequential restart it logs
Active Actor Count 3600 & Members [ Node1:Up,Node2:Up,Node3:Up ]
After the sequential restart it logs
Members [ Node1:Up,Node2:Up,Node3:Up ]
and it is unable to fetch the active actor count from
ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);
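For reference, here is a minimal sketch of how the blocking get could be wrapped so that the singleton logs why the stats are missing instead of just failing. It uses only java.util.concurrent, the class and method names are hypothetical, and it assumes the failure shows up either as a timeout or as a failed future; which one actually happens in this cluster has not been confirmed.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper around the blocking get used for the stats reply.
final class StatsQuery {
    // Waits for the reply and returns it, or an empty Optional if the wait fails.
    static <T> Optional<T> awaitOrEmpty(CompletableFuture<T> reply, long timeoutMillis) {
        try {
            return Optional.ofNullable(reply.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // No reply arrived within the timeout.
            System.err.println("No sharding stats within " + timeoutMillis + " ms");
            return Optional.empty();
        } catch (ExecutionException e) {
            // The future completed with a failure; log the underlying cause.
            System.err.println("Stats query failed: " + e.getCause());
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a reply that never arrives.
        Optional<String> stats = awaitOrEmpty(new CompletableFuture<String>(), 100);
        System.out.println("Stats present: " + stats.isPresent());
    }
}
```

In the singleton, the future returned by Patterns.ask(...).toCompletableFuture() would take the place of the stand-in future used in main.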
Queries
- In my use case the nodes will be restarted frequently like this.
- Why is the cluster sharding rebalance not completing properly here? Any clue as to what the problem may be?
- Is any more information needed to debug this issue further?