My understanding is that Cluster.get(context().system()).state() should return a cluster state in which down/unreachable members appear only in the state.getUnreachable() set and not in the state.getMembers() set.
This is true for all clusters we have except one.
What I am trying to find out is which configuration/setting could cause this.
Most likely that setting's value does not match the other clusters, but I could not find any difference in any of the settings between the clusters - any help or pointers would be highly appreciated.
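For reference, this is roughly how we read the state - a minimal sketch, with the actor name and the plain println logging only there for illustration:

```java
import akka.actor.AbstractActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent.CurrentClusterState;
import akka.cluster.Member;

// Illustrative actor that dumps the local node's view of the cluster on request.
public class ClusterStateDumper extends AbstractActor {
  @Override
  public Receive createReceive() {
    return receiveBuilder()
        .matchEquals("dump", msg -> {
          CurrentClusterState state = Cluster.get(context().system()).state();
          // All members this node currently knows about, regardless of status
          for (Member m : state.getMembers()) {
            System.out.println("member " + m.address() + " status=" + m.status());
          }
          // Members this node currently considers unreachable
          for (Member m : state.getUnreachable()) {
            System.out.println("unreachable " + m.address() + " status=" + m.status());
          }
        })
        .build();
  }
}
```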
getMembers() can contain both unreachable and reachable nodes. When a node is downed or leaves the cluster gracefully, it becomes removed, and after that it is not in getMembers() anymore.
If a node is unreachable, it will also end up in the getUnreachable() set.
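So if you only want the members that are both still in the ring and reachable from the local node, you have to subtract the unreachable set yourself. A minimal sketch (the helper class name is just for illustration):

```java
import akka.actor.ActorSystem;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent.CurrentClusterState;
import akka.cluster.Member;

import java.util.HashSet;
import java.util.Set;

// Illustrative helper: members that are in the ring AND reachable,
// from the local node's point of view.
public final class ReachableMembers {
  public static Set<Member> of(ActorSystem system) {
    CurrentClusterState state = Cluster.get(system).state();
    Set<Member> unreachable = state.getUnreachable();
    Set<Member> reachable = new HashSet<>();
    for (Member m : state.getMembers()) {
      if (!unreachable.contains(m)) {
        reachable.add(m);
      }
    }
    return reachable;
  }
}
```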
Note that Cluster.state is the node's own view of the cluster, so if there is a network partition, for example, the view on the different nodes will be different until the partition heals (or one side of the partition is downed). It is also driven by the cluster gossip, meaning that it is eventually consistent: there is no guarantee that, at a given point in time, all nodes will perceive exactly the same state.
This should not be affected by any settings; when a node is part of the cluster it will get information about all the other members of the cluster. It can, however, be affected by a “split brain”: if you use auto-downing, for example, you may end up with the cluster split into two clusters that both think they are the real cluster and that the other side was shut down.
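Since the view is per-node and gossip-driven, one way to see when the local node's view actually changes is to subscribe to the membership events instead of polling state(). A rough sketch, with the listener class name and the println logging just for illustration:

```java
import akka.actor.AbstractActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.ClusterEvent.MemberRemoved;
import akka.cluster.ClusterEvent.ReachableMember;
import akka.cluster.ClusterEvent.UnreachableMember;

// Illustrative listener: logs how the *local* view evolves as gossip arrives.
public class MembershipListener extends AbstractActor {
  private final Cluster cluster = Cluster.get(context().system());

  @Override
  public void preStart() {
    cluster.subscribe(self(), ClusterEvent.initialStateAsEvents(),
        UnreachableMember.class, ReachableMember.class, MemberRemoved.class);
  }

  @Override
  public void postStop() {
    cluster.unsubscribe(self());
  }

  @Override
  public Receive createReceive() {
    return receiveBuilder()
        .match(UnreachableMember.class,
            e -> System.out.println("unreachable: " + e.member()))
        .match(ReachableMember.class,
            e -> System.out.println("reachable again: " + e.member()))
        .match(MemberRemoved.class,
            e -> System.out.println("removed: " + e.member()
                + " previous status " + e.previousStatus()))
        .build();
  }
}
```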
Thanks for the details - I agree that Cluster.state is not guaranteed to be accurate at any given point in time but is only eventually consistent, and I also agree about the split-brain problem.
In the scenario I mentioned, the down node shows up in the unreachable set and its member.status() still shows as “Up”.
Is there any configuration/setting that can delay the removal of a down node from the cluster?
Is there any configuration/setting that can cause a down node to still show the status “Up” for a long time?
I see that in an older cluster, for a down node, member.status() would show as “Down” and that node would not be present in the getMembers() set. So one of the developers must have changed some setting in the new cluster that caused this behavior to change, which is what I am trying to figure out!
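To rule out the downing-related settings specifically, I am planning to dump what is actually in effect on each node and diff that between the old and the new cluster. A rough sketch (the helper class name is mine, and the keys are the classic Akka Cluster ones, guarded with hasPath since they may not exist in every Akka version):

```java
import akka.actor.ActorSystem;
import com.typesafe.config.Config;

// Illustrative helper: print the downing-related settings in effect on this node,
// so the values can be diffed between the clusters.
public final class DowningSettingsDump {
  public static void dump(ActorSystem system) {
    Config conf = system.settings().config();
    String[] keys = {
        "akka.cluster.auto-down-unreachable-after",  // classic auto-downing
        "akka.cluster.downing-provider-class"        // custom downing provider, if any
    };
    for (String key : keys) {
      if (conf.hasPath(key)) {
        System.out.println(key + " = " + conf.getValue(key).unwrapped());
      } else {
        System.out.println(key + " is not present in this Akka version");
      }
    }
  }
}
```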