Inquiry about Akka Cluster Malfunction Issue

Akka Version

-  Implementation-Version: 2.5.17
-rw-r--r-- 1 add add   47695 May  5  2017 com.enragedginger.akka-quartz-scheduler_2.12-1.6.1-akka-2.5.x.jar
-rw-r--r-- 1 add add   96013 Nov 12  2016 com.github.romix.akka.akka-kryo-serialization_2.12-0.5.2.jar
-rw-r--r-- 1 add add  113947 Oct  5  2018 com.safety-data.akka-persistence-redis_2.12-0.4.1.jar
-rw-r--r-- 1 add add 3434517 Sep 27  2018 com.typesafe.akka.akka-actor_2.12-2.5.17.jar
-rw-r--r-- 1 add add  303955 Sep 27  2018 com.typesafe.akka.akka-cluster-metrics_2.12-2.5.17.jar
-rw-r--r-- 1 add add  544173 Sep 27  2018 com.typesafe.akka.akka-cluster-sharding_2.12-2.5.17.jar
-rw-r--r-- 1 add add  592532 Sep 27  2018 com.typesafe.akka.akka-cluster-tools_2.12-2.5.17.jar
-rw-r--r-- 1 add add  967458 Sep 27  2018 com.typesafe.akka.akka-cluster_2.12-2.5.17.jar
-rw-r--r-- 1 add add 1153010 Sep 27  2018 com.typesafe.akka.akka-distributed-data_2.12-2.5.17.jar
-rw-r--r-- 1 add add 3356002 Dec  1  2017 com.typesafe.akka.akka-http-core_2.12-10.0.11.jar
-rw-r--r-- 1 add add  386165 Sep 27  2018 com.typesafe.akka.akka-multi-node-testkit_2.12-2.5.17.jar
-rw-r--r-- 1 add add  970741 Dec  1  2017 com.typesafe.akka.akka-parsing_2.12-10.0.11.jar
-rw-r--r-- 1 add add  594556 Jan 11  2018 com.typesafe.akka.akka-persistence-cassandra_2.12-0.80.jar
-rw-r--r-- 1 add add  103918 Sep 27  2018 com.typesafe.akka.akka-persistence-query_2.12-2.5.17.jar
-rw-r--r-- 1 add add  905484 Sep 27  2018 com.typesafe.akka.akka-persistence_2.12-2.5.17.jar
-rw-r--r-- 1 add add  483677 Sep 27  2018 com.typesafe.akka.akka-protobuf_2.12-2.5.17.jar
-rw-r--r-- 1 add add 2369344 Sep 27  2018 com.typesafe.akka.akka-remote_2.12-2.5.17.jar
-rw-r--r-- 1 add add   16054 Sep 27  2018 com.typesafe.akka.akka-slf4j_2.12-2.5.17.jar
-rw-r--r-- 1 add add 4057596 Sep 27  2018 com.typesafe.akka.akka-stream_2.12-2.5.17.jar
-rw-r--r-- 1 add add  265034 Sep 27  2018 com.typesafe.akka.akka-testkit_2.12-2.5.17.jar
-rw-r--r-- 1 add add   95503 Jan 12  2018 com.typesafe.play.play-akka-http-server_2.12-2.6.11.jar
-rw-r--r-- 1 add add   23187 Jun 15  2018 pl.immutables.akka-reasonable-downing_2.12-1.1.0.jar

Environment

Akka cluster configured with node1 to node3

node1 (172.10.50.4), node2 (172.10.50.5), node3 (172.10.50.6)
OS: CentOS 7.2

Using Akka Cluster with the "All Act" process


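*JVM Settings*

The options below are listed with the launcher's -J prefix. The -J is stripped and the remaining option is passed directly to the JVM (for example, -J-Xms32768m is passed as -Xms32768m).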

-J-Xms32768m -J-Xmx32768m
-J-XX:+UseG1GC -J-XX:MetaspaceSize=256m -J-XX:MaxMetaspaceSize=256m -J-XX:G1HeapRegionSize=2m -J-XX:+UseStringDeduplication
-J-XX:G1RSetUpdatingPauseTimePercent=5 -J-XX:MaxGCPauseMillis=500 -J-XX:+UseLargePagesInMetaspace
-J-XX:+PrintGCDetails -J-verbosegc -J-XX:+PrintGCDateStamps -J-XX:+PrintHeapAtGC -J-XX:+PrintTenuringDistribution
-J-XX:+PrintGCApplicationStoppedTime -J-XX:+PrintPromotionFailure -J-XX:PrintFLSStatistics=1
-J-Xloggc:/home/test/test/test/gclog/test_gc.log
-J-XX:+UseGCLogFileRotation -J-XX:NumberOfGCLogFiles=2 -J-XX:GCLogFileSize=100M
-J-XX:+HeapDumpOnOutOfMemoryError -J-XX:HeapDumpPath=/home/test/test/test/gclog/heapdump_test.hprof
-Dpname=test
-Dpidfile.path=/dev/null
-Dconfig.file=/home/test/CFG/test/aad/conf/application.conf
-Dmetric.file=/home/test/CFG/test/aad/conf/metric.conf
-Dlogger.file=/home/test/CFG/test/aad/conf/logback.xml
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=5012
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dakka.cluster.seed-nodes.0=akka.tcp://172.10.50.4:5150
-Dakka.cluster.seed-nodes.1=akka.tcp://172.10.50.5:5150
-Dakka.cluster.seed-nodes.2=akka.tcp://172.10.50.6:5150
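
Because -XX:+PrintGCApplicationStoppedTime is enabled in the settings above, the GC log records every stop-the-world pause, so the GC theory can be checked against the log itself. The Python sketch below is only an illustration. It assumes the JDK 8 log line format ("Total time for which application threads were stopped: N seconds"), and its 3-second default threshold is an assumption standing in for akka.cluster.failure-detector.acceptable-heartbeat-pause (3 s is, as far as I know, the Akka 2.5 default; use whatever application.conf actually sets). The function names are made up for this example.

```python
import re

# Lines written by -XX:+PrintGCApplicationStoppedTime (JDK 8 format assumed).
# Some locales print a decimal comma, so both "." and "," are accepted.
STOPPED = re.compile(
    r"^(?P<prefix>.*?)Total time for which application threads were stopped: "
    r"(?P<seconds>\d+[.,]\d+) seconds"
)


def stopped_pauses(lines):
    """Yield (timestamp_prefix, pause_seconds) for every stopped-time line."""
    for line in lines:
        match = STOPPED.search(line)
        if match:
            seconds = float(match.group("seconds").replace(",", "."))
            # The prefix holds the date stamp and/or uptime printed before the message.
            yield match.group("prefix").strip().rstrip(":"), seconds


def long_pauses(lines, threshold_seconds=3.0):
    """Return the pauses that lasted at least threshold_seconds.

    threshold_seconds is an assumed stand-in for the cluster failure
    detector's acceptable heartbeat pause; it is not read from Akka config.
    """
    return [
        (stamp, seconds)
        for stamp, seconds in stopped_pauses(lines)
        if seconds >= threshold_seconds
    ]
```

Calling long_pauses(open(path)) on /home/test/test/test/gclog/test_gc.log (and on its rotated siblings) returns the pauses that reached the threshold. An empty result would argue against the GC explanation, at least for the period the logs cover.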


*Issue*

When I check the cluster status on each node (node1 to node3), every node reports the other two nodes as “Unreachable: true”. The cluster appears to be broken: when the processes that use Akka are restarted, nodes either fail to join the cluster or do not leave it cleanly, and simply stopping a process produces an error because it cannot find the information it needs. Is there a way to resolve this issue? (The current suspicion is that garbage collection (GC) pauses exceed the Akka heartbeat check time, so each node marks the others as “Unreachable: true”.)

  • node1 Cluster
    {
      "leader": "akka.tcp://172.10.50.4:5150",
      "members": [
        {
          "address": "akka.tcp://172.10.50.4:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": false
        },
        {
          "address": "akka.tcp://172.10.50.5:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": true
        },
        {
          "address": "akka.tcp://172.10.50.6:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": true
        }
      ]
    }

  • node2 Cluster
    {
      "leader": "akka.tcp://172.10.50.4:5150",
      "members": [
        {
          "address": "akka.tcp://172.10.50.4:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": true
        },
        {
          "address": "akka.tcp://172.10.50.5:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": false
        },
        {
          "address": "akka.tcp://172.10.50.6:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": true
        }
      ]
    }

  • node3 Cluster
    {
      "leader": "akka.tcp://172.10.50.4:5150",
      "members": [
        {
          "address": "akka.tcp://172.10.50.4:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": true
        },
        {
          "address": "akka.tcp://172.10.50.5:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": true
        },
        {
          "address": "akka.tcp://172.10.50.6:5150",
          "roles": [
            "dc-default"
          ],
          "status": "Up",
          "unreachable": false
        }
      ]
    }
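
The three views above can also be compared mechanically. The Python sketch below is only an illustration with made-up helper names: posted_view rebuilds a condensed stand-in for each posted document (every member "Up", only the observing node reachable) instead of re-parsing the JSON, and isolated_observers lists the nodes whose own view contains no reachable member other than themselves.

```python
ADDRESSES = [
    "akka.tcp://172.10.50.4:5150",
    "akka.tcp://172.10.50.5:5150",
    "akka.tcp://172.10.50.6:5150",
]


def posted_view(observer):
    """Condensed stand-in for the view posted by one node."""
    return {
        "leader": ADDRESSES[0],
        "members": [
            {
                "address": address,
                "roles": ["dc-default"],
                "status": "Up",
                # In the posted output each node marks only itself as reachable.
                "unreachable": address != observer,
            }
            for address in ADDRESSES
        ],
    }


def reachable_members(view):
    """Addresses that this view does not mark as unreachable."""
    return {m["address"] for m in view["members"] if not m["unreachable"]}


def isolated_observers(views):
    """Observers that see no member other than themselves as reachable."""
    return sorted(
        observer
        for observer, view in views.items()
        if reachable_members(view) <= {observer}
    )


views = {address: posted_view(address) for address in ADDRESSES}

if __name__ == "__main__":
    for observer in isolated_observers(views):
        print(f"{observer} sees no other member as reachable")
```

For the posted data every node ends up in the isolated list while all three still agree on the same leader, so the views describe three nodes that have each lost contact with both peers, not one node dropping out of an otherwise healthy cluster.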

The Akka 2.5 release line reached EOL back in November 2020. I recommend that you upgrade to a recent version and see whether the problem reappears.

Similarly, CentOS 7.2 hasn’t received patches in 7 years. That’s probably not related to these issues, but it isn’t even remotely safe from a security perspective.

I doubt that this is about GC. It is almost certainly a more serious networking configuration issue, which is why getting your OS to a supported release is also critical.
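
As a first check on the networking side, something like the following Python sketch could be run from each node against the other two. It is a rough illustration with made-up function names, and it only tests whether a plain TCP connection to the remoting port (5150 in the posted settings) can be opened. It says nothing about whether the hostname and port each node advertises match what its peers connect to, which would still need to be checked in the Akka configuration.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def check_peers(peers, timeout=2.0):
    """Map "host:port" to True/False for each (host, port) pair in peers."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in peers
    }
```

For example, check_peers([("172.10.50.5", 5150), ("172.10.50.6", 5150)]) run on node1 shows whether node1 can open connections to node2 and node3. A False here while the remote process is running would point at firewall or routing problems, not at GC.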

I was also facing this problem.