Hi folks, I maintain an Akka Cluster Sharding service that reads events from an SQS queue, performs external requests to about 4 other services for validation, aggregation, etc., keeps all of that in memory, and publishes updates on that aggregated data downstream. Service throughput is therefore directly tied to these external calls.
The cluster received an unusually large backlog of events, and on top of high GC counts and times we also saw the error below in the logs (usually on a single pod). It seems to indicate that an Akka (HTTP?) actor terminated while trying to request an external service and was never able to perform calls again for the rest of the service uptime, requiring the entire cluster to be restarted.
akka.stream.StreamTcpException: The connection actor has terminated. Stopping now.
Killing the pod just seems to move the problem to another pod after some time, likely following a specific shard entity.
What I want to know is:
- Is this log message correlated to garbage collection of Shard Entities somehow?
- Is there a way to restart these terminated actors without having to restart the cluster?
Thanks in advance and let me know if I need to provide further context.
The error itself does not explain why it is happening. The underlying Akka Stream TCP implementation in turn uses the actor-based TCP implementation, and the only thing the error tells us is that the actor representing that TCP connection terminated for some reason. I’d expect there to be more log entries at warning or error level saying why that actor terminated.
The TCP stream operator is used in remoting and HTTP, but possibly also in Alpakka connectors, third-party libraries, or directly in a project, so it is not possible to know what is going on from that single log entry.
If you can reproduce the problem and there are no other hints in your logs about what is going on, enabling debug logging for akka.io could possibly help (but will likely produce quite a high volume of log entries).
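As a rough sketch, assuming an SLF4J/Logback setup (Akka’s own log filter has to let DEBUG through, and the logging backend then narrows the noise to the TCP internals):

```
# application.conf
akka.loglevel = "DEBUG"
```

```xml
<!-- logback.xml: only raise the akka.io (TCP) loggers to DEBUG -->
<logger name="akka.io" level="DEBUG"/>
```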
It does not sound like this is related to sharding, unless your sharded entities are talking TCP with some other service.
If the entities are in a broken state after this happens, you should make sure to crash them; that way a broken entity will be restarted on the next message sent to it over sharding. How to do that is quite specific to your entity implementation, but if the entity owns a streaming TCP connection you can, for example, use watchTermination to have the stream message the entity actor when it terminates, so that the entity can act on the stream failure.
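A minimal sketch of that pattern, assuming a typed entity; the Command protocol and the monitored helper are illustrative names, not an existing API:

```scala
import akka.NotUsed
import akka.actor.typed.scaladsl.ActorContext
import akka.stream.scaladsl.Flow
import akka.util.ByteString

import scala.util.{Failure, Success}

// Hypothetical entity protocol
sealed trait Command
case object StreamCompleted extends Command
final case class StreamFailed(cause: Throwable) extends Command

// Wrap the TCP flow the entity owns so that stream termination is reported
// back to the entity actor. Call this from within the entity's message
// handling when materializing the stream.
def monitored(
    connectionFlow: Flow[ByteString, ByteString, NotUsed],
    context: ActorContext[Command]): Flow[ByteString, ByteString, NotUsed] =
  connectionFlow.watchTermination() { (mat, done) =>
    context.pipeToSelf(done) {
      case Success(_)  => StreamCompleted
      case Failure(ex) => StreamFailed(ex)
    }
    mat
  }
```

On StreamFailed the entity can then throw (or stop itself), so that sharding recreates it on the next message routed to its entity id.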
Thanks for the reply @johanandren!
For further context, I identified two occurrences in the logs:

1. The error message was caused by an OOM and logged by a dispatcher we use to process HTTP responses; this triggers a shutdown due to jvm-exit-on-fatal-error=on.
2. The same error message and same dispatcher, but logged by the EntityActor for the same types of calls. This suggests to me the system was close to an OOM (GC counts/times were quite high, but not high enough to cause an OOM) with everything heavily slowed down. In this situation no system shutdown is triggered and the shard keeps going until it is marked as unreachable, which took from 20 minutes to 1 hour on some pods.
Forgot to mention: entity passivation was set to 24h (now 4h), in-flight requests per shard can go up to 6k, and the dispatcher in question is used by all 4 external HTTP clients for body serialization (circe).
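For reference, the settings mentioned above look roughly like this (config paths assume the standard Akka 2.6 names):

```
# application.conf
akka {
  jvm-exit-on-fatal-error = on                       # turns the OOM in point 1 into a node shutdown
  cluster.sharding.passivate-idle-entity-after = 4h  # entity passivation, previously 24h
}
```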
I suspect the slow system might have had a knock-on effect that indirectly caused the connection actor to terminate.
Since StreamTcpException is used in several places and can carry several different messages, using watchTermination wouldn’t make much difference compared to the current code, which already captures and logs the error in situation 2.
I’m interested in determining whether I should a) terminate the entire system so it gets restarted, or b) terminate only the entity actor. Given that situation 2 is indicative of a fully unhealthy system, I should likely go with a), but let me know if I’m missing anything.
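In code the two options would look roughly like this (the shutdown reason and the StreamFailed handling are illustrative, not our actual code):

```scala
import akka.Done
import akka.actor.CoordinatedShutdown
import akka.actor.typed.ActorSystem

import scala.concurrent.Future

// Option a): shut the whole node down in an orderly way; the orchestrator
// (e.g. Kubernetes) is expected to start a replacement pod.
def shutDownNode(system: ActorSystem[_]): Future[Done] =
  CoordinatedShutdown(system.classicSystem).run(CoordinatedShutdown.UnknownReason)

// Option b): crash only the broken entity, e.g. when it learns its
// connection is gone, so sharding recreates it on the next message:
//   case StreamFailed(cause) =>
//     throw new IllegalStateException("connection actor terminated", cause)
```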