Inquiry about Akka Cluster Malfunction Issue

jinjunwoo · October 16, 2024, 1:39am

Akka Version

-  Implementation-Version: 2.5.17
-rw-r--r-- 1 add add   47695 May  5  2017 com.enragedginger.akka-quartz-scheduler_2.12-1.6.1-akka-2.5.x.jar
-rw-r--r-- 1 add add   96013 Nov 12  2016 com.github.romix.akka.akka-kryo-serialization_2.12-0.5.2.jar
-rw-r--r-- 1 add add  113947 Oct  5  2018 com.safety-data.akka-persistence-redis_2.12-0.4.1.jar
-rw-r--r-- 1 add add 3434517 Sep 27  2018 com.typesafe.akka.akka-actor_2.12-2.5.17.jar
-rw-r--r-- 1 add add  303955 Sep 27  2018 com.typesafe.akka.akka-cluster-metrics_2.12-2.5.17.jar
-rw-r--r-- 1 add add  544173 Sep 27  2018 com.typesafe.akka.akka-cluster-sharding_2.12-2.5.17.jar
-rw-r--r-- 1 add add  592532 Sep 27  2018 com.typesafe.akka.akka-cluster-tools_2.12-2.5.17.jar
-rw-r--r-- 1 add add  967458 Sep 27  2018 com.typesafe.akka.akka-cluster_2.12-2.5.17.jar
-rw-r--r-- 1 add add 1153010 Sep 27  2018 com.typesafe.akka.akka-distributed-data_2.12-2.5.17.jar
-rw-r--r-- 1 add add 3356002 Dec  1  2017 com.typesafe.akka.akka-http-core_2.12-10.0.11.jar
-rw-r--r-- 1 add add  386165 Sep 27  2018 com.typesafe.akka.akka-multi-node-testkit_2.12-2.5.17.jar
-rw-r--r-- 1 add add  970741 Dec  1  2017 com.typesafe.akka.akka-parsing_2.12-10.0.11.jar
-rw-r--r-- 1 add add  594556 Jan 11  2018 com.typesafe.akka.akka-persistence-cassandra_2.12-0.80.jar
-rw-r--r-- 1 add add  103918 Sep 27  2018 com.typesafe.akka.akka-persistence-query_2.12-2.5.17.jar
-rw-r--r-- 1 add add  905484 Sep 27  2018 com.typesafe.akka.akka-persistence_2.12-2.5.17.jar
-rw-r--r-- 1 add add  483677 Sep 27  2018 com.typesafe.akka.akka-protobuf_2.12-2.5.17.jar
-rw-r--r-- 1 add add 2369344 Sep 27  2018 com.typesafe.akka.akka-remote_2.12-2.5.17.jar
-rw-r--r-- 1 add add   16054 Sep 27  2018 com.typesafe.akka.akka-slf4j_2.12-2.5.17.jar
-rw-r--r-- 1 add add 4057596 Sep 27  2018 com.typesafe.akka.akka-stream_2.12-2.5.17.jar
-rw-r--r-- 1 add add  265034 Sep 27  2018 com.typesafe.akka.akka-testkit_2.12-2.5.17.jar
-rw-r--r-- 1 add add   95503 Jan 12  2018 com.typesafe.play.play-akka-http-server_2.12-2.6.11.jar
-rw-r--r-- 1 add add   23187 Jun 15  2018 pl.immutables.akka-reasonable-downing_2.12-1.1.0.jar

Environment

Akka cluster configured with node1 to node3

node1 (172.10.50.4), node2 (172.10.50.5), node3 (172.10.50.6)
OS: CentOS 7.2

Using Akka Cluster with the "All Act" process


*JVM Settings*

Options prefixed with -J are passed directly to the JVM with the -J prefix stripped (for example, -J-Xms32768m becomes -Xms32768m):

-J-Xms32768m -J-Xmx32768m
-J-XX:+UseG1GC -J-XX:MetaspaceSize=256m -J-XX:MaxMetaspaceSize=256m -J-XX:G1HeapRegionSize=2m -J-XX:+UseStringDeduplication
-J-XX:G1RSetUpdatingPauseTimePercent=5 -J-XX:MaxGCPauseMillis=500 -J-XX:+UseLargePagesInMetaspace
-J-XX:+PrintGCDetails -J-verbosegc -J-XX:+PrintGCDateStamps -J-XX:+PrintHeapAtGC -J-XX:+PrintTenuringDistribution
-J-XX:+PrintGCApplicationStoppedTime -J-XX:+PrintPromotionFailure -J-XX:PrintFLSStatistics=1
-J-Xloggc:/home/test/test/test/gclog/test_gc.log
-J-XX:+UseGCLogFileRotation -J-XX:NumberOfGCLogFiles=2 -J-XX:GCLogFileSize=100M
-J-XX:+HeapDumpOnOutOfMemoryError -J-XX:HeapDumpPath=/home/test/test/test/gclog/heapdump_test.hprof
-Dpname=test
-Dpidfile.path=/dev/null
-Dconfig.file=/home/test/CFG/test/aad/conf/application.conf
-Dmetric.file=/home/test/CFG/test/aad/conf/metric.conf
-Dlogger.file=/home/test/CFG/test/aad/conf/logback.xml
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=5012
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dakka.cluster.seed-nodes.0=akka.tcp://172.10.50.4:5150
-Dakka.cluster.seed-nodes.1=akka.tcp://172.10.50.5:5150
-Dakka.cluster.seed-nodes.2=akka.tcp://172.10.50.6:5150


*Issue*

When I check the cluster status on each node (node1 to node3), every node reports the other two members as "unreachable": true. The cluster appears to be broken: when I restart the processes that use Akka, nodes either fail to join the cluster or do not leave it cleanly. Because the cluster is broken, simply stopping a process results in an error, as it cannot find the necessary information. Is there a way to resolve this issue?

My current suspicion is that garbage collection (GC) pauses are exceeding the Akka heartbeat check time, causing each node to be marked "unreachable": true in the cluster.
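
One way to test the GC hypothesis is to scan the GC log for stop-the-world pauses longer than the failure detector tolerates. The sketch below assumes the JDK 8 line format written by -XX:+PrintGCApplicationStoppedTime (which is enabled in the JVM settings above) and assumes akka.cluster.failure-detector.acceptable-heartbeat-pause is still at its default of 3 s; check application.conf for the real value. The function name and the sample lines are illustrative and are not taken from the actual test_gc.log.

```python
import re

# Matches the JDK 8 safepoint line printed by -XX:+PrintGCApplicationStoppedTime.
STOPPED = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)


def long_pauses(lines, threshold_s=3.0):
    """Return (line_number, pause_seconds) for every pause above threshold_s."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = STOPPED.search(line)
        if match and float(match.group(1)) > threshold_s:
            hits.append((number, float(match.group(1))))
    return hits


# Illustrative sample lines (assumption), not copied from the real GC log.
sample = [
    "2024-10-16T01:00:00.000+0900: 10.000: Total time for which application "
    "threads were stopped: 0.0123456 seconds, Stopping threads took: 0.0000500 seconds",
    "2024-10-16T01:00:05.000+0900: 15.000: Total time for which application "
    "threads were stopped: 4.5000000 seconds, Stopping threads took: 0.0001000 seconds",
]

if __name__ == "__main__":
    # For a real run, pass the lines of the GC log file instead of `sample`.
    for number, pause in long_pauses(sample):
        print(f"line {number}: {pause:.3f} s pause")
```

If nothing in the real log comes close to the threshold, GC pauses alone are an unlikely explanation for all three nodes losing sight of each other.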

node1 Cluster
{
  "leader": "akka.tcp://172.10.50.4:5150",
  "members": [
    {
      "address": "akka.tcp://172.10.50.4:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": false
    },
    {
      "address": "akka.tcp://172.10.50.5:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": true
    },
    {
      "address": "akka.tcp://172.10.50.6:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": true
    }
  ]
}

node2 Cluster
{
  "leader": "akka.tcp://172.10.50.4:5150",
  "members": [
    {
      "address": "akka.tcp://172.10.50.4:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": true
    },
    {
      "address": "akka.tcp://172.10.50.5:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": false
    },
    {
      "address": "akka.tcp://172.10.50.6:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": true
    }
  ]
}

node3 Cluster
{
  "leader": "akka.tcp://172.10.50.4:5150",
  "members": [
    {
      "address": "akka.tcp://172.10.50.4:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": true
    },
    {
      "address": "akka.tcp://172.10.50.5:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": true
    },
    {
      "address": "akka.tcp://172.10.50.6:5150",
      "roles": [
        "dc-default"
      ],
      "status": "Up",
      "unreachable": false
    }
  ]
}
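
The three views above are symmetric: each node reports itself as reachable and both peers as unreachable, while all three still name node1 as leader. A small script can make that check explicit. This is only a sketch: the function names are hypothetical, the sample is abbreviated from the node1 output (roles omitted), and the quote normalization is there in case the output was copied from a page that rendered typographic quotes.

```python
import json


def normalize(text):
    """Replace typographic double quotes with plain ones so the text parses as JSON."""
    return text.replace("\u201c", '"').replace("\u201d", '"')


def reachable_peers(view_json):
    """Return the member addresses that this view does not mark unreachable."""
    view = json.loads(normalize(view_json))
    return {m["address"] for m in view["members"] if not m["unreachable"]}


def is_isolated(own_address, view_json):
    """True when the node's view reports only the node itself as reachable."""
    return reachable_peers(view_json) == {own_address}


# Abbreviated copy of the node1 view from the post (roles omitted).
node1_view = """
{ "leader": "akka.tcp://172.10.50.4:5150",
  "members": [
    { "address": "akka.tcp://172.10.50.4:5150", "status": "Up", "unreachable": false },
    { "address": "akka.tcp://172.10.50.5:5150", "status": "Up", "unreachable": true },
    { "address": "akka.tcp://172.10.50.6:5150", "status": "Up", "unreachable": true }
  ] }
"""

if __name__ == "__main__":
    print(is_isolated("akka.tcp://172.10.50.4:5150", node1_view))
```

Running the same check against the node2 and node3 views with their own addresses gives the same answer, which means no node can currently see any other node.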

johanandren · October 16, 2024, 1:01pm

The Akka 2.5 release line reached EOL back in November 2020. I recommend that you upgrade to a recent version and see whether the problem reappears.

davidogren · October 17, 2024, 12:52pm

Similarly, CentOS 7.2 hasn’t received patches in 7 years. That’s probably not related to these issues, but it isn’t even remotely safe from a security perspective.

I doubt that this is about GC issues; it’s almost certainly a more serious networking config issue, which is why getting your OS to a supported release is also critical.
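
A quick first check for the networking theory is whether each node can open a TCP connection to the remoting port of its peers. The sketch below uses only the Python standard library; the addresses and port come from the seed-node settings in the original post, and the helper name is hypothetical. An open port only shows basic TCP reachability; it does not prove that the Akka remoting handshake succeeds.

```python
import socket


def can_connect(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Addresses and port taken from the seed-node settings in the original post.
PEERS = ["172.10.50.4", "172.10.50.5", "172.10.50.6"]
REMOTING_PORT = 5150

if __name__ == "__main__":
    # Run this on each node in turn to see which peers it can reach.
    for host in PEERS:
        state = "open" if can_connect(host, REMOTING_PORT) else "unreachable"
        print(f"{host}:{REMOTING_PORT} {state}")
```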
