Dear list members,
My company runs a large-scale deployment (hundreds of JVMs) based on Akka, spread across different regions globally. Some of the services communicate with each other using Akka Remoting (TCP, not Artery).
As it goes, global cloud deployments suffer from occasional disruptions between regions, ranging from severe packet loss to total disconnection. We expect things to be shaky while a network disruption is in progress, but we also expect everything to go back to normal once the storm passes.
Looking at the logs, we see many instances of the following:
Tried to associate with unreachable remote address [akka.tcp://systemName@192.168.236.12:2558]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
AssociationError [akka.tcp://com-company-resource-sip@192.168.222.36:2558] -> [akka.tcp://systemName@192.168.236.11:2558]: Error [Invalid address: akka.tcp://systemName@192.168.236.11:2558] [ akka.remote.InvalidAssociation: Invalid address: akka.tcp://systemName@192.168.236.11:2558 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted. ]
From reading the Akka Remoting documentation, I understand these errors to mean that the two remote actor systems in question will never be able to communicate with each other again unless the systems are restarted.
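For context, the 5000 ms gate in the first log line appears to correspond to the classic remoting setting retry-gate-closed-for. Below is a minimal sketch of the settings that, as far as I understand, influence gating and quarantine. It assumes Akka 2.5-style configuration paths (newer versions move some classic settings under akka.remote.classic), and the values shown are what I believe to be the library defaults rather than our production configuration, so they should be checked against the reference.conf of the version in use.

```hocon
# Sketch only: paths assume Akka 2.5-style classic remoting config,
# values are believed defaults and not taken from our deployment.
akka.remote {
  # How long an address stays gated after a failed association attempt.
  # Matches the "gated for 5000 ms" message in the log above.
  retry-gate-closed-for = 5 s

  # How long the quarantine marker for a remote system is remembered.
  prune-quarantine-marker-after = 5 d

  # Failure detector used for remote death watch. When it marks a watched
  # remote system as unreachable, that system can end up quarantined.
  watch-failure-detector {
    acceptable-heartbeat-pause = 10 s
  }
}
```

If this reading is right, a longer acceptable-heartbeat-pause would make quarantine less likely during short network disruptions, at the cost of slower detection of nodes that have really failed. It would not lift a quarantine that has already happened.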
What is the proper, expected way of recovering from such situations? It does not seem reasonable to me that I should need to restart all nodes of the system every time a network disconnection occurs. What am I missing here?
Thanks in advance for your replies.
Regards,
Dima Gutzeit