I use Akka with persistence (akka-persistence-jdbc 5.0.0), Akka Cluster 2.6.10, and Scala 2.13.
After two days of running a 2-node cluster on Raspberry Pi Compute Modules, the system becomes extremely slow, i.e. nothing really happens anymore.
I get the following warn log: [WARN ] [com.zaxxer.hikari.pool.HikariPool] [] - slick.db - Thread starvation or clock leap detected (housekeeper delta=58s277ms542µs362ns).
The worst delta is: [WARN ] [com.zaxxer.hikari.pool.HikariPool] [] - slick.db - Thread starvation or clock leap detected (housekeeper delta=6h51m20s16ms852µs738ns).
Also, even after I restart the cluster, the Akka scheduler often starts triggering events too late within only one day.
Since the problem persists until I restart the cluster, I rule out a clock leap: it is very unlikely to have so many clock leaps, and I checked the clock multiple times.
I monitored CPU and memory of the OS, and both look fine (less than 50% total CPU and still enough available memory).
I also analyzed the logs extensively but could not find anything pointing to the root cause of a thread starvation problem.
So next, I took thread dumps of the jvm and analyzed them using jstack.review.
Since I’m not an expert at analyzing thread dumps or Akka dispatchers, I cannot tell whether it is normal to have such high numbers at the end of the dispatcher thread names, or whether the overview indicates any problem(s):
Here is the configuration in application.conf file for the custom dispatcher:
# Applied to actors doing blocking input/output operations.
akka.blocking-io-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 32
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 1
}
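For context, here is a minimal sketch (not from my actual code) of how an actor can be assigned to this dispatcher, assuming classic actors; the actor class and system name are placeholders:

import akka.actor.{Actor, ActorSystem, Props}

// Hypothetical actor that performs blocking I/O; the real actor is not shown here.
class BlockingIoActor extends Actor {
  def receive: Receive = {
    case _ => () // placeholder: the blocking call would happen here
  }
}

object Main extends App {
  val system = ActorSystem("SELServer")

  // Assign the actor to the custom dispatcher; the dispatcher id is the full
  // config path "akka.blocking-io-dispatcher" as defined in application.conf.
  val blockingIo = system.actorOf(
    Props[BlockingIoActor]().withDispatcher("akka.blocking-io-dispatcher"),
    "blocking-io"
  )
}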
Questions
Is it ok to have a blocking-io-dispatcher-84632 or does this point to a thread starvation problem?
Does the overview look fine or is this problematic for an akka application?
What else could I do to approach the problem described above?
I’m afraid it is not really possible to say much from the shared information. It does sound like it could be starvation.
A few hints about what to investigate:
look at the used heap of the JVM in addition to the system memory; if the heap is close to full, your app can in the worst case spend most of its CPU cycles desperately trying to find some memory to free.
take a thread dump from the JVM and go through what each thread is doing, especially if you have threads in BLOCKED state - make sure those are not running on the default dispatcher, as that will starve your application actors and streams of any chance to do work (for example completing a query and returning the connection to the connection pool). This part of the docs is a good read to understand why this is a problem: Dispatchers • Akka Documentation. A sketch of keeping blocking calls off the default dispatcher is shown below.
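To make the second point concrete, here is a minimal sketch of running a blocking call on a dedicated dispatcher instead of the default one; the blockingCall function is a made-up placeholder, and the dispatcher id is taken from the config shown earlier:

import akka.actor.ActorSystem
import scala.concurrent.{ExecutionContext, Future}

object BlockingOffloadExample {
  val system = ActorSystem("SELServer")

  // Look up the dedicated dispatcher defined in application.conf and use it
  // as the ExecutionContext for blocking work only.
  val blockingEc: ExecutionContext = system.dispatchers.lookup("akka.blocking-io-dispatcher")

  // Hypothetical blocking operation (e.g. a synchronous driver call).
  def blockingCall(): String = { Thread.sleep(1000); "done" }

  // The Future runs on the blocking dispatcher, so the default dispatcher
  // threads stay available for actors and streams.
  def runOffloaded(): Future[String] = Future(blockingCall())(blockingEc)
}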
Thank you for the inputs. I looked at the used heap of the JVM, and the analyzed heap dump shows that in the problematic state of the application only a total of 16.4 MB is used. I run the JVM with -Xmx256m, so this should not be the problem.
I will go through what each thread is doing in the thread dump and share my insights.
Just to mention: there are no threads in BLOCKED state at all (especially true for the default-dispatcher).
Is the huge number 84632 after blocking-io-dispatcher to be expected?
I went through the threads in the thread dump: there are no threads in BLOCKED state. I found some LinkedBlockingQueue calls in the thread stacks of the blocking-io-dispatcher-x threads. They are all in thread state TIMED_WAITING (parking). Here is such a stack:
"SELServer-akka.blocking-io-dispatcher-83968" #90435 prio=5 os_prio=0 cpu=244.86ms elapsed=1988.75s tid=0x61a5ac00 nid=0x31d1 waiting on condition [0x5c1be000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.2-BellSoft/Native Method)
- parking to wait for <0x69ce81e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.2-BellSoft/LockSupport.java:234)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.2-BellSoft/AbstractQueuedSynchronizer.java:2123)
at java.util.concurrent.LinkedBlockingQueue.poll(java.base@11.0.2-BellSoft/LinkedBlockingQueue.java:458)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.2-BellSoft/ThreadPoolExecutor.java:1053)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.2-BellSoft/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.2-BellSoft/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.2-BellSoft/Thread.java:834)
In addition, there are some ManagedArrayBlockingQueue calls in the thread stacks of the slick.db-x threads; they are all parked in a waiting state. Here is such a stack:
"slick.db-1" #60 daemon prio=5 os_prio=0 cpu=38654757.29ms elapsed=262856.04s tid=0x00922400 nid=0x184e waiting on condition [0x5d7a0000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.2-BellSoft/Native Method)
- parking to wait for <0x69fa9510> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.2-BellSoft/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.2-BellSoft/AbstractQueuedSynchronizer.java:2081)
at slick.util.ManagedArrayBlockingQueue.$anonfun$take$1(ManagedArrayBlockingQueue.scala:116)
at slick.util.ManagedArrayBlockingQueue$$Lambda$1894/0x5db03a28.apply(Unknown Source)
at slick.util.ManagedArrayBlockingQueue.lockedInterruptibly(ManagedArrayBlockingQueue.scala:206)
at slick.util.ManagedArrayBlockingQueue.take(ManagedArrayBlockingQueue.scala:115)
at slick.util.ManagedArrayBlockingQueue.take(ManagedArrayBlockingQueue.scala:17)
at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.2-BellSoft/ThreadPoolExecutor.java:1054)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.2-BellSoft/ThreadPoolExecutor.java:1114)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.2-BellSoft/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11.0.2-BellSoft/Thread.java:834)
The reason for the huge numbers like 84632 after blocking-io-dispatcher was the use of the non-async MqttClient (Eclipse Paho). I implemented an async version and moved the corresponding actor to the default-dispatcher. Now the huge numbers are gone and the system runs without the weird behaviour described above.
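For illustration, here is a rough sketch (not my actual code) of bridging Paho's callback-based async publish into a Scala Future, so the actor on the default dispatcher never blocks; the broker URL, client id, and error handling are simplified assumptions:

import org.eclipse.paho.client.mqttv3.{IMqttActionListener, IMqttToken, MqttAsyncClient, MqttMessage}
import scala.concurrent.{Future, Promise}

object AsyncMqttPublish {
  // Made-up broker URL and client id; the client must already be connected
  // (connect() on MqttAsyncClient is asynchronous as well).
  val client = new MqttAsyncClient("tcp://localhost:1883", "sel-server")

  // Bridge the callback-based async publish into a Scala Future: the calling
  // thread returns immediately instead of waiting for the broker.
  def publish(topic: String, payload: Array[Byte]): Future[Unit] = {
    val done = Promise[Unit]()
    client.publish(topic, new MqttMessage(payload), null, new IMqttActionListener {
      override def onSuccess(token: IMqttToken): Unit = done.success(())
      override def onFailure(token: IMqttToken, cause: Throwable): Unit = done.failure(cause)
    })
    done.future
  }
}

The resulting Future can then, for example, be piped back to the actor with pipeTo, so the actor itself only ever handles messages and never blocks a dispatcher thread.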
As already mentioned in the Akka docs, avoid blocking operations wherever possible. Really do it.
Still, I don’t understand why that blocking-io-dispatcher caused so much trouble and how it could affect the rest of the system.