Akka cluster in Kubernetes gets into an inconsistent state

We recently introduced the first Akka cluster in our system, running on EKS.
We have a Custom Pod Autoscaler that adds/removes pods based on the number of active requests (minimum 5 pods).
The cluster gets into an inconsistent state a couple of times per week.
What we see is many warnings like the following coming from the same node:

Jan 31, 2023 @ 06:05:33.315 sample-variants-server production sample-variants-server-5568495897-2xqvj WARN akka.stream.Materializer 2ww3BoYBLYVeWOb1XK68 [outbound connection to [akka://sample-variants-server@10.12.31.201:25520], message stream] Upstream failed, cause: Association$OutboundStreamStopQuarantinedSignal$:

Jan 31, 2023 @ 06:05:40.310 sample-variants-server production sample-variants-server-5568495897-2xqvj WARN akka.cluster.sharding.DDataShardCoordinator kgw3BoYBLYVeWOb14eGZ SampleVariants: The ShardCoordinator was unable to update a distributed state within ‘updating-state-timeout’: 10000 millis (retrying). Attempt 1. Perhaps the ShardRegion has not started on all active nodes yet? event=ShardHomeDeallocated(169)

What’s confusing is that the node that can’t be contacted (10.12.31.201) had already left the cluster:

Jan 31, 2023 @ 05:58:43.255 sample-variants-server sample-variants-server-5568495897-fnrzt INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:43.255 sample-variants-server sample-variants-server-5568495897-fnrzt INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.738 sample-variants-server sample-variants-server-5568495897-7rzbf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.738 sample-variants-server sample-variants-server-5568495897-7rzbf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.705 sample-variants-server sample-variants-server-5568495897-n99pf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.705 sample-variants-server sample-variants-server-5568495897-n99pf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.526 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.525 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.365 sample-variants-server sample-variants-server-5568495897-h7vbr INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.365 sample-variants-server sample-variants-server-5568495897-h7vbr INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:34.969 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.967 sample-variants-server sample-variants-server-5568495897-7rzbf INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-h7vbr INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-n99pf INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-fnrzt INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.340 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Leader is removing confirmed Exiting node [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:33.938 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.30.49:25520] - Exiting confirmed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:33.937 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Exiting confirmed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:32.300 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Leader is moving node [akka://sample-variants-server@10.12.31.201:25520] to [Exiting]
Jan 31, 2023 @ 05:58:29.240 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Leader is moving node [akka://sample-variants-server@10.12.31.201:25520] to [Up]

We are using Akka 2.7.0.

Did it get stuck like that? The shard coordinator will retry that update, so if it failed because of unfortunate timing it should succeed on a subsequent retry.

It may be worth looking at tuning akka.cluster.sharding.coordinator-state.write-majority-plus according to your cluster size. The default is 3, which means that all nodes in a 5-node cluster need to accept a write, or it will retry.
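
To make the arithmetic concrete, here is roughly where that default lives in application.conf. The majority-plus interpretation in the comments is my understanding of the setting, so verify it against the docs for your Akka version:

    # application.conf -- default value shown for illustration
    akka.cluster.sharding.coordinator-state {
      # Coordinator state updates are stored with Distributed Data using
      # "write majority plus N": a majority of nodes plus this many extra
      # nodes must acknowledge the write, capped at the cluster size.
      # On 5 nodes: majority (3) + 3 = 6, capped to 5, i.e. every node must
      # ack, so one slow or leaving node can make the update time out and
      # retry (the warning you are seeing).
      write-majority-plus = 3
    }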

One thing to be aware of is Kubernetes interacting with sharding/singletons in an unfortunate way since a few Kubernetes versions back, regarding the order in which it rolls and scales down the nodes: Starting from k8s 1.22 ReplicaSets are not scaled down youngest first any more · Issue #31383 · akka/akka · GitHub

This seems like a big deal. What is the impact on a standard Akka Cluster setup in Kubernetes?

If you use sharding or singletons with the default autoscaling behaviour, the worst case is a rolling upgrade or a scale-down where the oldest pod is stopped first: each shutdown then causes the singleton (and the shard coordinator) to move to the next-oldest node, over and over, until the roll has completed.
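
One Kubernetes-side knob that is sometimes used to influence the scale-down order (it is not something discussed in this thread, so treat it as a hypothetical mitigation to evaluate) is the controller.kubernetes.io/pod-deletion-cost annotation: the ReplicaSet controller prefers to delete pods with a lower cost first. For example:

    # Hypothetical sketch: give the oldest pod (which likely hosts the
    # singleton and shard coordinator) a higher deletion cost so younger
    # pods are removed first on scale-down. The annotation is beta since
    # Kubernetes 1.22; the pod name is just taken from the logs above.
    kubectl annotate pod sample-variants-server-5568495897-2xqvj \
      controller.kubernetes.io/pod-deletion-cost=100 --overwrite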

It got stuck, which forced us to scale down to 0 and then scale back up.

I will say that the problem has (for now) been alleviated by these steps:

  1. Set akka.cluster.sharding.coordinator-state.write-majority-plus = 1 (see the config sketch below).
  2. Tweak the Custom Pod Autoscaler to scale nodes one by one.
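
For reference, step 1 as a config sketch (the 4-of-5 arithmetic is my reading of the setting, same caveat as in the reply above):

    # application.conf -- the value we ended up with on a 5-node minimum
    akka.cluster.sharding.coordinator-state {
      # majority (3) + 1 = 4 of 5 nodes must ack a coordinator state write,
      # so a single unreachable or leaving node no longer blocks updates
      # until the updating-state-timeout.
      write-majority-plus = 1
    }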