"The ShardCoordinator was unable to get an initial state" during rolling update

Hi there.

We are facing a problem when doing rolling updates in our open-source project, Eclipse Ditto: https://github.com/eclipse/ditto
We are using Akka 2.5.17 and rely on Akka Persistence together with Sharding.

During rolling updates we see many (> 150) WARN messages from the DDataShardCoordinator saying:

The ShardCoordinator was unable to get an initial state within 'waiting-for-state-timeout': 5000 millis (retrying). Has ClusterSharding been started on all nodes?
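
For context, the shard regions are started roughly like this on each node (a simplified sketch with a placeholder type name, role and extractor - not the actual Ditto code); the 5000 millis in the warning corresponds to akka.cluster.sharding.waiting-for-state-timeout:

```java
import java.util.Optional;

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.sharding.ClusterSharding;
import akka.cluster.sharding.ClusterShardingSettings;
import akka.cluster.sharding.ShardRegion;

public final class ShardingStartup {

    // Placeholder extractor; the real one would derive the entity id from the message.
    private static final ShardRegion.MessageExtractor EXTRACTOR =
            new ShardRegion.HashCodeMessageExtractor(30) {
                @Override
                public String entityId(final Object message) {
                    return String.valueOf(message.hashCode());
                }
            };

    // Called on every node that carries the sharding role.
    public static ActorRef startThingRegion(final ActorSystem system, final Props thingProps) {
        return ClusterSharding.get(system).start(
                "thing",                                                    // type name (placeholder)
                thingProps,
                ClusterShardingSettings.create(system).withRole("things"), // role (placeholder)
                EXTRACTOR);
    }

    // Nodes without the role only start a proxy to reach the shard region.
    public static ActorRef startThingProxy(final ActorSystem system) {
        return ClusterSharding.get(system).startProxy("thing", Optional.of("things"), EXTRACTOR);
    }
}
```

As far as I understand, the coordinator state is then only replicated among the nodes that carry the role.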

We have about 40 service instances running when doing the rolling update - the service types for which this fails have different numbers of instances:

  • one has 4 instances
  • one has 3
  • two have only 2 instances

I assume that having only 2 instances of a cluster-sharding role and restarting 1 of them during the rolling update would cause that error message, as the 1 remaining instance has no majority.
But what about the cluster-sharding roles with 4 and 3 instances? If we roll out the update 1 instance at a time, the remaining instances should still have a majority, right?
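
As a quick sanity check of that arithmetic (assuming a plain floor(n/2) + 1 majority over the nodes carrying the role; just a throwaway snippet):

```java
public final class MajorityCheck {

    // Naive majority: more than half of the nodes carrying the role.
    static int majority(final int roleSize) {
        return roleSize / 2 + 1;
    }

    public static void main(final String[] args) {
        for (final int n : new int[] {4, 3, 2}) {
            final int remaining = n - 1; // one instance restarting during the rolling update
            System.out.printf("role size %d -> majority %d, %d remaining -> %s%n",
                    n, majority(n), remaining,
                    remaining >= majority(n) ? "still a majority" : "no majority");
        }
    }
}
```

With 4 or 3 instances the remaining nodes still form a majority; only with 2 instances they don't.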

I just stumbled upon akka.cluster.min-nr-of-members, which we currently don't have set. We do, however, set the cluster role: https://github.com/eclipse/ditto/blob/master/services/things/starter/src/main/resources/things.conf#L112
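
For reference, a sketch of how those settings could look (the role name "things" and the numbers are placeholders, not what we actually deploy):

```hocon
akka.cluster {
  # The leader only moves joining members to "Up" once this many members have joined.
  min-nr-of-members = 2

  # Per-role variant of the same setting.
  role {
    things.min-nr-of-members = 2
  }
}
```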

We also see WARN messages from other services which try to send messages to the affected shard regions:

Retry request for shard [52] homes from coordinator at [Actor[akka.tcp://ditto-cluster@192.168.11.213:2552/system/sharding/thingCoordinator/singleton/coordinator#-1696191213]]. [1] buffered messages.

Could we have misconfigured something, or do we need a different number of instances per role in order to do rolling updates correctly?

Thanks in advance and best regards
Thomas

That sounds strange. Rolling updates should work with all of these instance counts.

How do you shut down the nodes? Best is to leave the cluster gracefully via CoordinatedShutdown, e.g. triggered by SIGTERM.

If the leaving isn't successful, what downing strategy do you use to remove the shut-down node from the cluster membership?

Also, it is best to start the rolling update with the youngest node and leave the oldest until last.

The nodes are shut down via SIGTERM (triggered by Docker when it stops a container), so CoordinatedShutdown should be used. We also hook into CoordinatedShutdown in order to stop HTTP routes, etc.
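
The hook is along these lines (a minimal sketch, not the actual Ditto code; binding is assumed to be an existing akka-http ServerBinding):

```java
import akka.actor.ActorSystem;
import akka.actor.CoordinatedShutdown;
import akka.http.javadsl.ServerBinding;

public final class HttpShutdownHook {

    public static void register(final ActorSystem system, final ServerBinding binding) {
        // Runs in the "service-unbind" phase of CoordinatedShutdown, which is
        // triggered on SIGTERM via the default run-by-jvm-shutdown-hook = on.
        CoordinatedShutdown.get(system).addTask(
                CoordinatedShutdown.PhaseServiceUnbind(), "shutdown-http-routes",
                () -> binding.unbind());
    }
}
```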

Now that you mention it: we don't really have a strategy for removing a node whose leaving was not successful. As a downing strategy we have a simple "majority" mechanism which - in case of a network partition - downs the minority. But that wouldn't be triggered during a rolling update, and I just noticed that it only runs its minority/majority check every 30 seconds:
https://github.com/eclipse/ditto/blob/master/services/utils/cluster/src/main/java/org/eclipse/ditto/services/utils/cluster/ClusterMemberAwareActor.java

So I guess this is where we would have to improve.
How can we find out if leaving wasn’t successful?

"Also, it is best to start the rolling update with the youngest node and leave the oldest until last."

Is this possible in a Kubernetes-managed cluster?

It would be unreachable, but still have status Up.
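
A check for members that are unreachable but still Up could then look roughly like this (a minimal sketch; downing such a member automatically is shown only as an option and needs care to avoid acting during a real network partition):

```java
import akka.actor.ActorSystem;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.Member;
import akka.cluster.MemberStatus;

public final class StuckMemberCheck {

    // Reports members that are unreachable but still have status Up, i.e. nodes
    // whose graceful leave apparently did not complete.
    public static void reportStuckMembers(final ActorSystem system, final boolean downThem) {
        final Cluster cluster = Cluster.get(system);
        final ClusterEvent.CurrentClusterState state = cluster.state();
        for (final Member member : state.getUnreachable()) {
            if (member.status().equals(MemberStatus.up())) {
                system.log().warning("Member {} is unreachable but still Up - leaving was not successful", member);
                if (downThem) {
                    cluster.down(member.address());
                }
            }
        }
    }
}
```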

That is the order for Kubernetes Deployments.