We are using persistent actors to store actor state in PostgreSQL. We enabled the rememberEntities feature (with persistence as the state store for maintaining shard data) to keep all actors in memory. After enabling this feature, the shard coordinator pod was terminated due to a memory spike. Since then, the cluster forms, but the shard region is unable to register with the coordinator.
We are getting the following errors continuously, and no events are getting processed:

```
WARNING: Trying to register to coordinator at [ActorSelection[Anchor(akka://actor-system/), Path(/system/sharding/ActorSystemCoordinator/singleton/coordinator)]], but no acknowledgement. Total [3] buffered messages. [Coordinator [Member(address = akka://actor-system@ip:port, status = Up)] is reachable.]
ERROR: Exception in receiveRecover when replaying event type [akka.cluster.sharding.ShardCoordinator$Internal$ShardHomeDeallocated] with sequence number [12980] for persistenceId [/sharding/DeviceActorCoordinator].
Shard [-20] not allocated: State(Map())
```
For now, we have disabled rememberEntities and the shard state store to keep the cluster stable.
Akka version used: 2.6.6
split brain resolver: akka.cluster.sbr.SplitBrainResolverProvider
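For reference, the setup described above corresponds to sharding configuration along these lines (a sketch using key names from the Akka 2.6 reference configuration; verify against your own `application.conf`):

```hocon
akka.cluster.sharding {
  # Keep entity actors alive: restart remembered entities after a
  # rebalance or a crash instead of waiting for the next message
  remember-entities = on
  # Store coordinator and remembered-entities state via Akka Persistence
  # (backed here by the PostgreSQL journal)
  state-store-mode = persistence
}
```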
It is not possible to enable remember entities in a rolling upgrade, so maybe trying to do that caused the problem?
It is probably best to stop the cluster, clear out the cluster sharding state from your database, and start the cluster anew with remember entities enabled.
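Clearing the sharding state means deleting the coordinator's events and snapshots, whose persistence IDs start with `/sharding/` (as seen in the error above). A hedged sketch, assuming the default akka-persistence-jdbc table names `journal` and `snapshot` (these are assumptions; adjust to the schema your persistence plugin actually uses, and run only with the cluster fully stopped):

```sql
-- Remove cluster sharding coordinator state (events and snapshots).
-- Table and column names are assumptions based on the default
-- akka-persistence-jdbc schema; verify before running.
DELETE FROM journal  WHERE persistence_id LIKE '/sharding/%';
DELETE FROM snapshot WHERE persistence_id LIKE '/sharding/%';
```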
@johanandren In our case, rememberEntities works fine with rolling updates. The state gets corrupted when a pod does not shut down gracefully.
> It is probably best to stop cluster, clear out the cluster sharding state from your database and start the cluster anew with remember entities enabled.
We cannot do this every time we face this issue, right? And if we do this, the entities created before the restart will not be remembered until we get requests for those entities.
Is there a way to make cluster self-heal in these scenarios?
If you can reproduce this during normal operations (not while enabling remember entities in a rolling upgrade, but after remember entities was enabled following a full cluster stop), then it is a bug, and we are interested in the steps to reproduce it and, if possible, a minimal reproducer project.
If it happens during, or because of, a rolling upgrade in which the new version enables remember entities, that is not expected to work safely.
Terminate shard-coordinator pod (not graceful shutdown) immediately after sending the request
Repeat steps 2 and 3 until the "Trying to register to coordinator" logs appear
I noticed now that you are running Akka 2.6.6, which is quite old; in 2.6.7 we did a considerable rework of the remember entities implementation. Can you try with the latest Akka (2.6.14) and see if you can still reproduce the problem?
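Since 2.6.7, the remembered-entities store is configured separately from the coordinator state store. A sketch based on the 2.6.7+ reference configuration (verify key names against the documentation for the version you upgrade to):

```hocon
akka.cluster.sharding {
  remember-entities = on
  # New in Akka 2.6.7: where remembered entity IDs are kept,
  # either "eventsourced" (via Akka Persistence) or "ddata"
  remember-entities-store = eventsourced
  # Coordinator state store ("persistence" or "ddata")
  state-store-mode = persistence
}
```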