To detect network problems and crashed nodes. If the heartbeat messages (request-reply) to a node can’t get through (lost or delayed), that node is marked as Unreachable. When heartbeats can get through again, the node is marked as Reachable again. Being marked as Unreachable doesn’t mean that the node is removed from the cluster membership.
Before an Unreachable node can be removed from the cluster, it has to be Downed. That is done by a downing provider, such as Lightbend’s Split Brain Resolver, or manually with a Cluster management tool.
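As a rough illustration of what enabling a downing provider can look like, here is a minimal configuration sketch. It assumes open-source Akka 2.6.6 or later, where the Split Brain Resolver ships with akka-cluster under the class name shown; older commercial Lightbend versions used a different class name. The strategy and timeout are illustrative choices, not recommendations.

```hocon
akka.cluster {
  # Assumption: Akka 2.6.6+ where the Split Brain Resolver is part of akka-cluster
  downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"

  split-brain-resolver {
    # Illustrative choice: the side holding the majority of nodes survives
    active-strategy = keep-majority
    # Illustrative value: how long membership must be stable before a downing decision
    stable-after = 20s
  }
}
```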
Some cluster tools, such as Cluster aware routers, use the reachability information to avoid routing messages to unreachable nodes.
That should be rare, but for example, if many actors stop at the same time and there are watchers of these actors on other nodes, there may be a storm of Terminated messages sent more quickly than they can be delivered, thereby filling up the buffers.
The default size of the system message buffer is 20000, and it can be increased with the configuration property akka.remote.artery.advanced.system-message-buffer-size. There is no drawback to increasing this apart from possible memory consumption. The buffer is an ArrayDeque, so it grows as needed but doesn’t shrink.
There is also another queue for outgoing control (system) messages and the max size of that is configured with akka.remote.artery.advanced.outbound-control-queue-size. The default is 3072. This is a LinkedBlockingQueue so it’s also ok to increase. I think we should increase the default of this, by the way.
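For reference, both settings mentioned above can be overridden together in application.conf. The property names and defaults are the ones quoted above; the raised values are only example numbers and would need to be sized for the actual workload.

```hocon
akka.remote.artery.advanced {
  # Default is 20000; example value only, costs memory if the buffer actually fills
  system-message-buffer-size = 40000

  # Default is 3072; example value only
  outbound-control-queue-size = 20000
}
```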