I’m experiencing this type of error, originating from random nodes in our cluster:
Scheduled sending of heartbeat was delayed. Previous heartbeat was sent
[15799] ms ago, expected interval is [1000] ms. This may cause failure
detection to mark members as unreachable. The reason can be thread
starvation, e.g. by running blocking tasks on the default dispatcher, CPU
overload, or GC.
These messages appear even though I am not feeding the cluster any data to process (CPU < 1.0%).
To my knowledge, we do not run any blocking tasks on the default dispatcher,
but, just in case, I’ve tried to isolate the gossip-related tasks with a dedicated
dispatcher, as described in the documentation:
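Something along these lines (a minimal sketch only; the dispatcher name cluster-dispatcher and the pool sizes are placeholders rather than our actual values, and the settings would normally live in application.conf instead of being inlined):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ClusterDispatcherSketch extends App {
  // Dedicated dispatcher for the cluster's internal actors (gossip, heartbeats, ...).
  // Name and pool sizes are placeholders.
  val config = ConfigFactory.parseString(
    """
      |cluster-dispatcher {
      |  type = Dispatcher
      |  executor = "fork-join-executor"
      |  fork-join-executor {
      |    parallelism-min = 2
      |    parallelism-max = 4
      |  }
      |}
      |
      |# Run Akka Cluster's internal actors on the dedicated dispatcher
      |akka.cluster.use-dispatcher = cluster-dispatcher
      |""".stripMargin
  ).withFallback(ConfigFactory.load())

  val system = ActorSystem("ClusterSystem", config)
}
```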
We use physical machines (no VMs).
I will double-check GC and blocked threads.
Regarding blocked threads, I was expecting that specifying another dispatcher in akka.cluster.use-dispatcher would rule out this hypothesis.
Perhaps the scheduler in charge of emitting heartbeats is still using the default dispatcher?
No GC pauses, and no blocked threads.
Furthermore, I checked the Akka codebase and, if I understand it correctly, the dispatcher used by the cluster’s scheduler is specified by akka.cluster.use-dispatcher.
So using a cluster-specific dispatcher should have prevented such warning messages; or am I wrong?
PS: I’m using PersistentActor and Persistence Query with the Cassandra plugin, and Kamon to collect some business metrics.
Sounds very strange, and it’s a very long delay.
The Scheduler has one thread that triggers all scheduled tasks, like this one. The task runs on another dispatcher, so it shouldn’t block the Scheduler.
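To illustrate the separation (just a sketch, not the actual cluster internals; cluster-dispatcher is a hypothetical dispatcher id assumed to be defined in configuration, and scheduleWithFixedDelay assumes the Akka 2.6 Scheduler API, older versions use scheduler.schedule):

```scala
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import akka.actor.ActorSystem

object SchedulerSketch extends App {
  val system = ActorSystem("example")

  // The ExecutionContext the scheduled task will run on, looked up by a
  // (hypothetical) dispatcher id that must exist in the configuration.
  implicit val ec: ExecutionContext = system.dispatchers.lookup("cluster-dispatcher")

  // The Scheduler's single thread only *triggers* this task every second;
  // the body runs on `ec`. A slow task therefore doesn't block the Scheduler,
  // but a starved or parked dispatcher can still delay the work itself.
  system.scheduler.scheduleWithFixedDelay(1.second, 1.second) { () =>
    println(s"tick on ${Thread.currentThread().getName}")
  }
}
```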
Might have to hook up a profiler to get more insight into what is going on.
Can you reproduce on other machines? How often does it occur?
There were 4 suspicious points in time where lots of Java Thread Park events were
registered simultaneously for Akka threads (actors & remoting),
and all of them correlate with heartbeat issues:
Around 07:05:39 there were no “heartbeat was delayed” logs, but there was this one:
07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis
No correlation with halt events or blocked threads was found during the
Java Flight Recorder session; only two Safepoint Begin events were registered
in proximity to the delays.
CFS throttling
The application’s CPU usage is low, so we thought it could be related to how K8s schedules CPU for our application.
But turning off CPU limits hasn’t improved things much,
though the kubernetes.cpu.cfs.throttled.second metric disappeared.
Separate dispatcher
Using a separate dispatcher seems to be unnecessary, since delays happen even when
there is no load. We also built a test application, similar to our own, which
does nothing but heartbeats, and it still experiences these delays.
K8s cluster
From our observations it happens far more frequently on a couple of K8s nodes in
a large K8s cluster shared with many other apps, when our application isn’t loaded much.
A separate, dedicated K8s cluster where our app is load-tested has almost no issues with
heartbeat delays.
I am also facing the delayed-heartbeat issue. It is very strange because it happens even without load. I also tried using a custom dispatcher for the cluster, with no luck. I created a repo which you can use to reproduce it.
I commented on Stack Overflow as well, but I was unable to reproduce it from your repro. Are you saying it is happening even without load? I ask because I thought perhaps the missing piece was the load generator.
I tried on minishift rather than minikube, but that seems unlikely to be the difference.
I tried running the Kubernetes cluster on AWS and it works perfectly, without the heartbeat delays. I guess something is wrong with the configuration of my local minikube setup.
If request.cpu is not specified and only the limit is, how can you use the auto-scaling features of Kubernetes? Because in this case, the CPU resource request on the pods is not specified. E.g.:
resource cpu on pods (as a percentage of request): / 60%