How to avoid nodes being quarantined in Akka Cluster?

joymufeng · August 24, 2018, 10:08am

The doc of Akka Remoting (Artery) describes quarantine as follows:

Nodes will be quarantined when the remote failure detector triggers in Akka Remote, but not in Akka Cluster.

The doc of Akka Cluster describes quarantine as follows:

Nodes will be quarantined when there are too many unacknowledged system messages, but will not be quarantined when the failure detector triggers.

So I have 3 questions:

1. In Akka Cluster, failure detector triggers don’t cause nodes to be quarantined, so what is the failure detector’s role?
2. “Too many unacknowledged system messages”: how many is “too many”? Can we adjust it?
3. How can Akka Cluster be configured so that nodes are not quarantined, or are at least harder to quarantine?

Thanks in advance.

patriknw · August 24, 2018, 1:36pm

To detect network problems and crashed nodes. If the heartbeat messages (request-reply) can’t get through (lost or delayed), the failure detector marks the node as Unreachable. When heartbeats can get through again, the node is marked as reachable again. This doesn’t mean that nodes are removed from the cluster membership.

For an Unreachable node to be removed from the cluster, it has to be Downed. That is done by a downing provider, such as Lightbend’s Split Brain Resolver, or manually with the Cluster management tool.

Some cluster tools, such as Cluster aware routers, use the reachability information to avoid routing messages to unreachable nodes.

Too many unacknowledged system messages should be rare, but, for example, if many actors stop at the same time and there are watchers of these actors on other nodes, there may be a storm of Terminated messages that are sent more quickly than they can be delivered, thereby filling up the buffers.

The default size of the system message buffer is 20000, and it can be increased with the configuration property akka.remote.artery.advanced.system-message-buffer-size. There is no drawback to increasing it apart from possible memory consumption. The buffer is an ArrayDeque, so it grows as needed but doesn’t shrink.

There is also another queue for outgoing control (system) messages and the max size of that is configured with akka.remote.artery.advanced.outbound-control-queue-size. The default is 3072. This is a LinkedBlockingQueue so it’s also ok to increase. I think we should increase the default of this, by the way.
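To illustrate the difference between the two JDK collections named above, here is a small sketch using only the standard library. It is not Akka's internal code; the class and method names (QueueCapacityDemo, fillArrayDeque, fillBoundedQueue) are made up for this example. It only shows that an ArrayDeque keeps growing past its initial capacity, while a LinkedBlockingQueue created with a capacity rejects further offers once that capacity is reached.

```java
import java.util.ArrayDeque;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not part of Akka.
public class QueueCapacityDemo {

    // Adds `count` elements to an ArrayDeque created with a small initial
    // capacity and returns how many elements it holds afterwards.
    static int fillArrayDeque(int initialCapacity, int count) {
        ArrayDeque<Integer> buffer = new ArrayDeque<>(initialCapacity);
        for (int i = 0; i < count; i++) {
            buffer.addLast(i); // the deque resizes internally, it never rejects
        }
        return buffer.size();
    }

    // Offers `count` elements to a LinkedBlockingQueue bounded at `capacity`
    // and returns how many offers were accepted.
    static int fillBoundedQueue(int capacity, int count) {
        LinkedBlockingQueue<Integer> queue = new LinkedBlockingQueue<>(capacity);
        int accepted = 0;
        for (int i = 0; i < count; i++) {
            if (queue.offer(i)) { // offer() returns false when the queue is full
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println(fillArrayDeque(16, 20000));    // all elements fit
        System.out.println(fillBoundedQueue(3072, 5000)); // stops at the bound
    }
}
```

In this toy model the bound on the second queue is what a setting like outbound-control-queue-size would control; how Akka reacts when the real queue is full is not modelled here.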

The answer to the previous question covers this as well.
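As a rough sketch, the two settings named in this reply could be raised in application.conf along these lines. The property names and the defaults in the comments are taken from the reply; the new values are illustrative assumptions, not recommendations, and should be sized against the memory available on each node.

```hocon
# application.conf (sketch; values are illustrative assumptions)
akka.remote.artery.advanced {
  # system message buffer, default 20000 according to the reply above
  system-message-buffer-size = 40000

  # outgoing control (system) message queue, default 3072 according to the reply above
  outbound-control-queue-size = 8192
}
```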

joymufeng · August 25, 2018, 1:04am

Great answer! Thanks very much!
