Intermittently we are receiving following errors
2022-05-25 08:32:30,691 ERROR app=abc a.c.s.DDataShardCoordinator - The ShardCoordinator was unable to update a distributed state within ‘updating-state-timeout’: 2000 millis (retrying). Perhaps the ShardRegion has not started on all active nodes yet? event=ShardRegionRegistered(Actor[akka://application@10.52.174.4:25520/system/sharding/abcapp#-1665332307])
2022-05-25 08:32:31,348 WARN app=abc a.c.s.ShardRegion - abcapp: Trying to register to coordinator at [ActorSelection[Anchor(akka://application@10.52.103.132:25520/), Path(/system/sharding/abcappCoordinator/singleton/coordinator)]], but no acknowledgement. Total [22] buffered messages. [Coordinator [Member(address = akka://application@10.52.103.132:25520, status = Up)] is reachable.]
While we check cluster members by using /cluster/members we got “10.52.174.4:25520” this as
{
“node”: “akka://application@10.52.252.4:25520”,
“nodeUid”: “7353086881718190138”,
“roles”: [
“dc-default”
],
“status”: “Up”
},
Which says its healthy but problem resolves while we remove this node from the cluster using
/cluster/members/{address} (leave operation to remove 10.52.252.4 from cluster, once it’s removed cluster will create new pod and rebalance.
Need help to understand the best way of handling this error.
Thanks