Hi,
We are trying to implement a system in which a set of Akka actor systems (nodes) form a peer-to-peer network. Every node knows the maximum number of nodes and their addresses in advance. Nodes can join and leave at any time. Note that we don’t use Akka Cluster, only Artery remoting.
Each node has a dedicated NodeObserver actor that periodically checks whether the other nodes are reachable, using an ActorSelection with the path to the remote node’s NodeObserver actor. Once a NodeObserver detects that a remote NodeObserver is reachable, it stops polling that node and death-watches the remote actor instead. If a watched remote NodeObserver terminates, the watching NodeObserver falls back to the periodic ActorSelection-based reachability check described above.
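The observer logic described above can be sketched roughly as follows. This is a simplified illustration assuming Akka 2.6 classic actors; the `Probe` message, the 5-second interval, and the example path are placeholders, not taken from the actual implementation.

```scala
import akka.actor.{Actor, ActorIdentity, ActorRef, Identify, Terminated, Timers}
import scala.concurrent.duration._

object NodeObserver {
  private case object Probe      // timer tick: probe peers that are not watched yet
  private case object ProbeTimer // timer key
}

// peerPaths: full remote paths of the other nodes' observers, for example
// "akka://p2p@10.0.0.2:25520/user/nodeObserver" (hypothetical address)
class NodeObserver(peerPaths: Set[String]) extends Actor with Timers {
  import NodeObserver._

  // Peers currently under death watch, keyed by the path used to find them
  private var watched = Map.empty[String, ActorRef]

  override def preStart(): Unit =
    timers.startTimerWithFixedDelay(ProbeTimer, Probe, 5.seconds)

  def receive: Receive = {
    case Probe =>
      // Only probe peers that are not being watched yet
      (peerPaths -- watched.keySet).foreach { path =>
        context.actorSelection(path) ! Identify(path)
      }

    case ActorIdentity(path: String, Some(ref)) if !watched.contains(path) =>
      // Peer answered: switch from polling to death watch
      context.watch(ref)
      watched += (path -> ref)

    case ActorIdentity(_, _) =>
      // No reply target (peer not reachable yet) or peer already watched

    case Terminated(ref) =>
      // Drop the peer from the watched set so the next Probe polls it again
      watched = watched.filterNot { case (_, r) => r == ref }
  }
}
```

With this shape, a `Terminated` message is the only trigger for returning to polling, which is why the behavior after a quarantine matters so much for us.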
If we shut down or start a node, all other running nodes react as expected. So this approach works great… until a network partition occurs, for example when we disable the network adapter. In that case, the now unreachable nodes become quarantined, and only a restart of all(!) nodes fixes the problem. From what I have read, restarting seems to be the only way to recover from the quarantined state.
For us this is the worst case: we chose Akka in order to get a resilient solution, and this behavior is the opposite of that.
Is there a way to make our current implementation work by adjusting configuration parameters? I played around with some parameters related to the quarantining mechanism, with no success. Please help, what can we do? (I hope we don’t need to switch to a completely different solution, because our release date is close.)