Singletons are used for “master/slave” setups with services that don’t support real, proper sharding.
When I start my cluster, everything is fine. But if I kill -9 my leader (simulating a machine that goes down suddenly), no new leader is discovered automatically when I set akka.cluster.auto-down-unreachable-after=off
With a setting like akka.cluster.auto-down-unreachable-after=10s, everything is fine: the old leader is removed and a new one takes the lead.
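For reference, the two configurations being compared could be set up as below. This is only a minimal sketch for Akka 2.5: the system name `ClusterSystem` and the helper `configWithAutoDown` are made up for illustration, and the rest of the cluster settings (provider, seed nodes, remoting) are assumed to come from application.conf.

```scala
import akka.actor.ActorSystem
import com.typesafe.config.{Config, ConfigFactory}

object AutoDownSettings extends App {
  // Hypothetical helper: overrides only the auto-down setting and keeps
  // everything else (provider, seed nodes, remoting) from application.conf.
  def configWithAutoDown(value: String): Config =
    ConfigFactory
      .parseString(s"akka.cluster.auto-down-unreachable-after = $value")
      .withFallback(ConfigFactory.load())

  // "off": an unreachable node is never downed automatically,
  // so a killed node stays in the membership until someone downs it.
  val manualDowning: Config = configWithAutoDown("off")

  // "10s": an unreachable node is downed automatically after 10 seconds.
  val autoDowning: Config = configWithAutoDown("10s")

  // Start the node with the variant under test.
  val system = ActorSystem("ClusterSystem", manualDowning)
}
```

Swapping `manualDowning` for `autoDowning` in the last line gives the second behaviour described above.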
But this parameter is discouraged for production environments.
And if I try to implement my own version, I will definitely do a far worse job than you.
So, my question is: I understand the risk as the documentation describes it (https://doc.akka.io/docs/akka/current/cluster-usage.html), but is there a way to do “better” without ringing someone in the middle of the night when a machine shuts down?
In the normal case (scaling down the cluster, rolling out an upgrade, etc.) you should strive for nodes gracefully leaving the cluster rather than abruptly killing the machines. This works pretty much out of the box in 2.5 through the graceful shutdown, which is triggered by a JVM shutdown hook.
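If you want to trigger the same graceful leave from your own code instead of relying on the shutdown hook, a minimal sketch could look like this. The system name `ClusterSystem` and the object name are placeholders, and the node is assumed to be configured as a cluster member in application.conf.

```scala
import akka.actor.ActorSystem
import akka.cluster.Cluster

object GracefulLeave extends App {
  val system = ActorSystem("ClusterSystem")
  val cluster = Cluster(system)

  // Callback that runs once this node has been removed from the cluster.
  cluster.registerOnMemberRemoved {
    println(s"${cluster.selfAddress} has been removed from the cluster")
  }

  // Ask the cluster to move this node through Leaving -> Exiting -> Removed.
  // Unlike kill -9, the other nodes see a clean exit and no downing is needed.
  cluster.leave(cluster.selfAddress)
}
```

A plain `kill` (SIGTERM) lets the JVM shutdown hook run, whereas `kill -9` skips it, which is why the hard-kill case ends up as an unreachable node instead.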
While interesting, I think Nikos’ talk doesn’t actually mention how it is done.
One option, which might be surprising, is to have ops monitor the cluster for unreachability and make manual decisions about whether a part of the cluster should be downed when a partition occurs. It depends a bit on what kind of infrastructure you are running on; if it is the cloud, this may be less of an option.
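To give ops something to monitor, one possible approach is a small actor that subscribes to reachability events and logs them, so that an alert can be built on top of the log output. The sketch below uses the cluster event subscription API; the actor name `UnreachabilityReporter` and the log messages are made up for illustration.

```scala
import akka.actor.{Actor, ActorLogging, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, ReachableMember, UnreachableMember}

// Hypothetical actor: reports reachability changes, decides nothing itself.
class UnreachabilityReporter extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  // Subscribe on start; the current state is replayed as events.
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember], classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case UnreachableMember(member) =>
      // An operator has to decide whether this node is really gone.
      log.warning("Member unreachable, operator decision needed: {}", member.address)
    case ReachableMember(member) =>
      log.info("Member reachable again: {}", member.address)
  }
}

object UnreachabilityReporter {
  def props: Props = Props(new UnreachabilityReporter)
}
```

Start it on each node with `system.actorOf(UnreachabilityReporter.props)`. Once an operator has confirmed that the machine is really dead, the node can be downed from any surviving member with `Cluster(system).down(address)`.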