I'm using Akka 2.6.15, Akka Management 1.0.9 and Akka HTTP 10.1.11 to spin up a 4-node cluster with 4 shards, using Kubernetes discovery. The configuration is given below:
akka {
  cluster {
    log-info-verbose = on
    shutdown-after-unsuccessful-join-seed-nodes = 60s
    downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
    split-brain-resolver {
      active-strategy = keep-oldest
    }
    min-nr-of-members = 4
    sharding {
      remember-entities = on
      passivate-idle-entity-after = off
      distributed-data.durable.keys = []
      number-of-shards = 4
      least-shard-allocation-strategy {
        rebalance-absolute-limit = 4
      }
    }
  }
  management {
    cluster.bootstrap {
      contact-point-discovery {
        service-name = "the-app"
        discovery-method = kubernetes-api
        required-contact-point-nr = 1
      }
    }
  }
  discovery {
    method = kubernetes-api
    kubernetes-api {
      pod-namespace = "some-namespace"
      pod-label-selector = "app=the-app"
      pod-port-name = "management"
    }
  }
  coordinated-shutdown.exit-jvm = on
}
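For reference, management and bootstrap are started at application startup roughly like this (a minimal sketch; Behaviors.empty() stands in for the real root behavior):

import akka.actor.typed.ActorSystem;
import akka.actor.typed.javadsl.Adapter;
import akka.actor.typed.javadsl.Behaviors;
import akka.management.cluster.bootstrap.ClusterBootstrap;
import akka.management.javadsl.AkkaManagement;

public class Main {
  public static void main(String[] args) {
    // "the-app" matches the service-name above; Behaviors.empty() stands in
    // for the real root behavior.
    ActorSystem<Void> system = ActorSystem.create(Behaviors.empty(), "the-app");
    // Start the management HTTP endpoint, then cluster bootstrap, which
    // discovers peers via the kubernetes-api method configured above.
    AkkaManagement.get(Adapter.toClassic(system)).start();
    ClusterBootstrap.get(Adapter.toClassic(system)).start();
  }
}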
I can confirm that the cluster forms correctly by listening to Member events and logging MemberUp events: all 4 nodes come up and are running.
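The check is essentially a small subscriber along these lines (a sketch; the listener class is illustrative):

import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;
import akka.cluster.ClusterEvent;
import akka.cluster.typed.Cluster;
import akka.cluster.typed.Subscribe;

public class ClusterListener {
  // Subscribes to MemberUp events and logs every node that reaches Up.
  public static Behavior<ClusterEvent.MemberUp> create() {
    return Behaviors.setup(
        context -> {
          Cluster cluster = Cluster.get(context.getSystem());
          cluster
              .subscriptions()
              .tell(Subscribe.create(context.getSelf(), ClusterEvent.MemberUp.class));
          return Behaviors.receiveMessage(
              event -> {
                context.getLog().info("MemberUp: {}", event.member());
                return Behaviors.same();
              });
        });
  }
}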
The application starts 4 different shards, which end up on only two of the nodes (2 shards per node). Note that each shard spins up tens of persistent child reactors, all with unique names and persistence ids. Those child actors register themselves with the receptionist (the registration code is shown further down).
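The entities are initialized roughly like this (a sketch; Reactor and the type-key name are placeholders for the actual entity):

import akka.actor.typed.ActorSystem;
import akka.cluster.sharding.typed.javadsl.ClusterSharding;
import akka.cluster.sharding.typed.javadsl.Entity;
import akka.cluster.sharding.typed.javadsl.EntityTypeKey;

public class ShardingSetup {
  // One entity type; with number-of-shards = 4 the entities map onto 4 shards.
  public static final EntityTypeKey<Reactor.Command> TYPE_KEY =
      EntityTypeKey.create(Reactor.Command.class, "Reactor");

  public static void init(ActorSystem<?> system) {
    // Each Reactor spawns its persistent children, which register with the receptionist.
    ClusterSharding.get(system)
        .init(Entity.of(TYPE_KEY, entityContext -> Reactor.create(entityContext.getEntityId())));
  }
}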
In the logs, I see a lot of dead letters like the one shown below:
DeadLetter: [akka.cluster.ddata.Replicator$ModifyFailure] recipient: 'Actor[akka://the-app/deadLetters]' sender: 'Actor[akka://the-app/system/clusterReceptionist/replicator#-1681777846]' message: ModifyFailure [ReceptionistKey_0]: Update failed: missing value for ServiceKey[com.app.SomeMessage$Interface](the_unique_id_for_the_actor)
In addition to that, there are some errors like this:
Couldn't process DeltaPropagation from [UniqueAddress(akka://the-app@172.20.4.68:25520,-3105484348826649608)] due to java.lang.IllegalStateException: missing value for ServiceKey[com.app.SomeMessage$Interface](the_unique_id_for_the_actor)
To provide more context, I sometimes (though not every time I start the cluster) see the cluster-formation log twice:
Cluster Node [akka://the-app@172.20.4.68:25520] - Node [akka://the-app@172.20.4.68:25520] is JOINING itself (with roles [dc-default], version [0.0.0]) and forming new cluster
Cluster Node [akka://the-app@172.20.5.221:25520] - Node [akka://the-app@172.20.5.221:25520] is JOINING itself (with roles [dc-default], version [0.0.0]) and forming new cluster
I believe these errors are causing the receptionist state to not be replicated properly, so some actors cannot be found. My question is:
- Why is this happening? (I explored the issue "Cluster receptionist looses service keys on new node joining" #26284, but it doesn't seem very related.)
Additional info worth mentioning: SomeMessage.Interface is the base message type of a message adapter. Each actor registers a message adapter to receive common messages (like Tick messages) without defining them over and over again. The way each actor registers its own message adapter is given below; Interface is the same class for all adapters.
import akka.actor.typed.ActorRef;
import akka.actor.typed.receptionist.Receptionist;
import akka.actor.typed.receptionist.ServiceKey;

// Adapt common messages (e.g. Tick) to this reactor's own protocol.
ActorRef<Interface> adapter =
    reactor.getContext().messageAdapter(Interface.class, this::onCommonMessage);
// One ServiceKey per reactor, derived from its unique path.
ServiceKey<Interface> commonMessagesServiceKey =
    CommonMessages.serviceKeyForPath(unique_reactor_path);
// Register the adapter with the cluster receptionist under that key.
reactor
    .getContext()
    .getSystem()
    .receptionist()
    .tell(Receptionist.register(commonMessagesServiceKey, adapter));
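For completeness, the lookup side asks the receptionist for the adapter under the same per-reactor key, roughly like this (a sketch; the enclosing behavior is simplified):

import akka.actor.typed.ActorRef;
import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;
import akka.actor.typed.receptionist.Receptionist;
import akka.actor.typed.receptionist.ServiceKey;
import java.util.Set;

public class AdapterLookup {
  public static Behavior<Receptionist.Listing> create(String uniqueReactorPath) {
    ServiceKey<Interface> key = CommonMessages.serviceKeyForPath(uniqueReactorPath);
    return Behaviors.setup(
        context -> {
          // Ask the receptionist for whatever is registered under this key.
          context.getSystem().receptionist().tell(Receptionist.find(key, context.getSelf()));
          return Behaviors.receiveMessage(
              listing -> {
                // This set comes back empty on the nodes that never received the updates.
                Set<ActorRef<Interface>> adapters = listing.getServiceInstances(key);
                context
                    .getLog()
                    .info("Found {} adapter(s) for {}", adapters.size(), uniqueReactorPath);
                return Behaviors.same();
              });
        });
  }
}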
In addition to that, I tried taking down one pod, which led to that pod being restarted; all its actors were re-spawned and re-registered with the receptionist, but the other pods didn't receive the updates from the receptionist.
Thanks in advance