I’m experiencing this type of error, originating from random nodes in our cluster:
Scheduled sending of heartbeat was delayed. Previous heartbeat was sent
[15799] ms ago, expected interval is [1000] ms. This may cause failure
detection to mark members as unreachable. The reason can be thread
starvation, e.g. by running blocking tasks on the default dispatcher, CPU
overload, or GC.
These messages appear even though I am not feeding the cluster any data to process (CPU < 1.0%).
To my knowledge, we do not run any blocking tasks on the default dispatcher,
but, just in case, I’ve tried to isolate the gossip-related tasks with a dedicated
dispatcher, as described in the documentation:
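Something along these lines (a minimal sketch only; the dispatcher name cluster-dispatcher and the pool sizes are placeholders rather than our actual values, and the settings would normally live in application.conf instead of being inlined):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object ClusterDispatcherSketch extends App {
  // Dedicated dispatcher for the cluster's internal actors (gossip, heartbeats, ...).
  // Name and pool sizes are placeholders.
  val config = ConfigFactory.parseString(
    """
      |cluster-dispatcher {
      |  type = Dispatcher
      |  executor = "fork-join-executor"
      |  fork-join-executor {
      |    parallelism-min = 2
      |    parallelism-max = 4
      |  }
      |}
      |
      |# Run Akka Cluster's internal actors on the dedicated dispatcher
      |akka.cluster.use-dispatcher = cluster-dispatcher
      |""".stripMargin
  ).withFallback(ConfigFactory.load())

  val system = ActorSystem("ClusterSystem", config)
}
```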
We use physical machines (no VMs).
I will double-check GC and blocked threads.
Regarding blocked threads, I was expecting that specifying another dispatcher in akka.cluster.use-dispatcher would rule out this hypothesis.
Perhaps the scheduler in charge of emitting heartbeats is still using the default dispatcher?
No GC pauses, and no blocked threads.
Furthermore, I checked the Akka codebase and, if I understand it correctly, the dispatcher used by the cluster’s scheduler is specified by akka.cluster.use-dispatcher.
So using a cluster-specific dispatcher should have prevented such warning messages; or am I wrong?
PS: I’m using PersistentActor and Persistence Query with the Cassandra plugin, and Kamon to collect some business metrics.
Sounds very strange, and it’s a very long delay.
The Scheduler has one thread that triggers all scheduled tasks, like this one. The task runs on another dispatcher, so it shouldn’t block the Scheduler.
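To illustrate the separation (just a sketch, not the actual cluster internals; cluster-dispatcher is a hypothetical dispatcher id assumed to be defined in configuration, and scheduleWithFixedDelay assumes the Akka 2.6 Scheduler API, older versions use scheduler.schedule):

```scala
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import akka.actor.ActorSystem

object SchedulerSketch extends App {
  val system = ActorSystem("example")

  // The ExecutionContext the scheduled task will run on, looked up by a
  // (hypothetical) dispatcher id that must exist in the configuration.
  implicit val ec: ExecutionContext = system.dispatchers.lookup("cluster-dispatcher")

  // The Scheduler's single thread only *triggers* this task every second;
  // the body runs on `ec`. A slow task therefore doesn't block the Scheduler,
  // but a starved or parked dispatcher can still delay the work itself.
  system.scheduler.scheduleWithFixedDelay(1.second, 1.second) { () =>
    println(s"tick on ${Thread.currentThread().getName}")
  }
}
```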
Might have to hook up a profiler to get more insight into what is going on.
Can you reproduce on other machines? How often does it occur?
There were 4 suspicious points in time where lots of Java Thread Park events were
registered simultaneously for Akka threads (actors & remoting),
and all of them correlate with heartbeat issues:
Around 07:05:39 there were no “heartbeat was delayed” logs, but there was this one:
07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis
No correlation with halt events or blocked threads was found during the
Java Flight Recorder session; only two Safepoint Begin events were registered
in proximity to the delays.
CFS throttling
The application’s CPU usage is low, so we thought it could be related to how K8s schedules CPU for our application.
But turning off CPU limits hasn’t improved things much,
though the kubernetes.cpu.cfs.throttled.second metric disappeared.
Separate dispatcher
Using a separate dispatcher seems to be unnecessary, since delays happen even when
there is no load. We also built a test application, similar to our own, which
does nothing but heartbeats, and it still experiences these delays.
K8s cluster
From our observations it happens far more frequently on a couple of K8s nodes in
a large K8s cluster shared with many other apps, when our application isn’t loaded much.
A separate, dedicated K8s cluster where our app is load-tested has almost no issues with
heartbeat delays.
I am also facing the delayed-heartbeat issue. It is very strange because it happens even without load. I also tried using a custom dispatcher for the cluster, with no luck. I created a repo which you can use to reproduce it.
I commented on Stack Overflow as well, but I was unable to reproduce it from your repro. Are you saying it is happening even without load? I ask because I thought perhaps the missing piece was the load generator.
I tried on minishift rather than minikube, but that seems unlikely to be the difference.
I tried running the Kubernetes cluster on AWS and it works perfectly, without the heartbeat delays. I guess something is wrong with the configuration of my local minikube setup.
If request.cpu is not specified and only the limit is, how can you use the auto-scaling features of Kubernetes? Because in this case, the CPU resource request on the pods is not specified. E.g.:
resource cpu on pods (as a percentage of request): / 60%