We recently introduced the first Akka Cluster on EKS in our system.
We have a Custom Pod Autoscaler that adds and removes pods based on the number of active requests (minimum 5 pods).
The cluster gets into an inconsistent state a couple of times per week.
What we see is many warnings coming from the same node:
Jan 31, 2023 @ 06:05:33.315 sample-variants-server production sample-variants-server-5568495897-2xqvj WARN akka.stream.Materializer 2ww3BoYBLYVeWOb1XK68 [outbound connection to [akka://sample-variants-server@10.12.31.201:25520], message stream] Upstream failed, cause: Association$OutboundStreamStopQuarantinedSignal$:
Jan 31, 2023 @ 06:05:40.310 sample-variants-server production sample-variants-server-5568495897-2xqvj WARN akka.cluster.sharding.DDataShardCoordinator kgw3BoYBLYVeWOb14eGZ SampleVariants: The ShardCoordinator was unable to update a distributed state within ‘updating-state-timeout’: 10000 millis (retrying). Attempt 1. Perhaps the ShardRegion has not started on all active nodes yet? event=ShardHomeDeallocated(169)
What's confusing is that the node that can't be contacted (10.12.31.201) had already left the cluster about seven minutes earlier (log lines below are newest first):
Jan 31, 2023 @ 05:58:43.255 sample-variants-server sample-variants-server-5568495897-fnrzt INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:43.255 sample-variants-server sample-variants-server-5568495897-fnrzt INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.738 sample-variants-server sample-variants-server-5568495897-7rzbf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.738 sample-variants-server sample-variants-server-5568495897-7rzbf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.705 sample-variants-server sample-variants-server-5568495897-n99pf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.705 sample-variants-server sample-variants-server-5568495897-n99pf INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.526 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.525 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.365 sample-variants-server sample-variants-server-5568495897-h7vbr INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:42.365 sample-variants-server sample-variants-server-5568495897-h7vbr INFO akka.cluster.singleton.ClusterSingletonManager Member removed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:34.969 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.967 sample-variants-server sample-variants-server-5568495897-7rzbf INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-h7vbr INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-n99pf INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.965 sample-variants-server sample-variants-server-5568495897-fnrzt INFO akka.remote.artery.Association Association to [akka://sample-variants-server@10.12.31.201:25520] having UID [-6604413684070633292] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
Jan 31, 2023 @ 05:58:34.340 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Leader is removing confirmed Exiting node [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:33.938 sample-variants-server sample-variants-server-5568495897-g79jd INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.30.49:25520] - Exiting confirmed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:33.937 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Exiting confirmed [akka://sample-variants-server@10.12.31.201:25520]
Jan 31, 2023 @ 05:58:32.300 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Leader is moving node [akka://sample-variants-server@10.12.31.201:25520] to [Exiting]
Jan 31, 2023 @ 05:58:29.240 sample-variants-server sample-variants-server-5568495897-2xqvj INFO akka.cluster.Cluster Cluster Node [akka://sample-variants-server@10.12.16.54:25520] - Leader is moving node [akka://sample-variants-server@10.12.31.201:25520] to [Up]
We are using Akka 2.7.0.