Hello,
We’re using akka version 2.5.11
In a 3-node cluster setup, upon killing and restarting the first seed node (as listed in the akka.conf), we sometimes see that the node does not or is not allowed to rejoin the cluster. After the seed node timeout, which we have set to 12s, it joins itself and forms a cluster island. This is automated in a robot test environment.
As an example, we have node1 (first seed node - address 10.30.170.94), node2 (10.30.170.26), and node3 (10.30.170.84). Initially all the nodes are started and connected and the cluster is working correctly. Then a kill -9 is issued to node1. Both node2 and node3 lose connection and mark node1 as unreachable as expected. On node2, we see:
2018-07-16T11:17:14,609 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | ReliableDeliverySupervisor | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]
2018-07-16T11:17:18,229 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.94:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.84:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.94:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.94:2550 Up seen=false]
On node3:
2018-07-16T11:18:14,859 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ReliableDeliverySupervisor | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]
After restarting node1, we see that it sends InitJoin messages to node2 and node3 and actually reports an InitJoinAck from node2 but eventually joins itself after 22s:
2018-07-16T11:18:20,251 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.26:2550/system/cluster/core/daemon#-1596937295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]
2018-07-16T11:18:42,585 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] is JOINING, roles [member-1, dc-default]
2018-07-16T11:18:42,614 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] to [Up]
On node2, we see InitJoin messages incoming and InitJoinAcks outgoing and that it noticed that node1 was restarted with a new incarnation, marked it as Down, and stated it was removing the previous unreachable incarnation:
2018-07-16T11:18:20,105 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]
2018-07-16T11:18:20,106 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]]
2018-07-16T11:18:20,291 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.94:2550, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
2018-07-16T11:18:20,292 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] as [Down]
2018-07-16T11:18:21,226 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]
2018-07-16T11:18:30,564 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]
2018-07-16T11:18:30,565 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]]
followed by several more InitJoin/InitJoinAck sequences…
node3 shows similar InitJoin/InitJoinAck sequences.
I can’t see any reason from this output why node1 was not allowed to rejoin. Clearly there was network communication between node1 and node2/node3 after the restart and node2 and node3 remained connected (ie there were no disconnect/disassociation log messages reported). So the old incarnation of node1 should be removed (as was reported in the logs) and the new incarnation should be allowed to join.
Note this happens sporadically - most of the time it works as expected.
Also note that this only seems to happen when the node is killed - we haven’t seen it when the node is gracefully shutdown.
Thanks,
Tom