Current behavior:
When Cassandra goes down for some reason and is restarting, any Service that tries to connect to it during the downtime gets "Cassandra DB not found" errors and the request is aborted.
We then have to restart the Service manually, after which it is able to connect to Cassandra.
(This issue is not observed during container startup, since we handle that case using the Init_Container feature of Kubernetes.
We see it only when Cassandra goes down intermittently for some reason and brings itself back up.)
Expected/Preferred behavior:
If Cassandra is not available, the Service should keep checking (or wait) until it is up, and then reconnect to it.
This will provide a graceful reconnection mechanism.
Can you please let us know if there is any built-in Lagom feature that would enable this behavior, or whether we should write our own retry mechanism?
We observed that this reconnection mechanism already exists for Kafka. Whenever Kafka goes down and comes back up, the services reconnect to it automatically.
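If a hand-written retry does turn out to be necessary, a minimal sketch in plain Java could look like the following. It is not a Lagom or Cassandra driver API: the class `CassandraRetry`, the method `retryWithBackoff`, and the `"connected"` placeholder are hypothetical names used only for illustration, and the action passed in stands for whatever call opens or checks the Cassandra session in your service.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Lagom or the Cassandra driver.
final class CassandraRetry {
    private CassandraRetry() {}

    /**
     * Calls the action until it succeeds or maxAttempts is reached.
     * The wait between attempts starts at initialDelay and doubles
     * after every failure, capped at maxDelay.
     */
    static <T> T retryWithBackoff(Callable<T> action,
                                  int maxAttempts,
                                  Duration initialDelay,
                                  Duration maxDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(delay.toMillis());
                Duration doubled = delay.multipliedBy(2);
                delay = doubled.compareTo(maxDelay) > 0 ? maxDelay : doubled;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real connection attempt: fails twice, then succeeds.
        int[] calls = {0};
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("Cassandra not reachable");
            }
            return "connected";
        }, 5, Duration.ofMillis(100), Duration.ofSeconds(2));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that this version blocks the calling thread with `Thread.sleep`, which is usually not what you want inside an asynchronous service call. Akka ships its own retry and backoff helpers in `akka.pattern`, which may be a better fit in a Lagom service.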
We are seeing the same behavior in Kubernetes. If the Cassandra StatefulSet restarts a pod, the pod's IP address changes, but Lagom/akka-persistence continues to use the old addresses. Note that we are binding to the Cassandra service, but akka-persistence is caching the initial IP addresses from the DNS service lookup.
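To illustrate the difference between resolving the service name once and resolving it on every reconnect attempt, here is a small sketch using only the JDK. It does not change how akka-persistence-cassandra or the Cassandra driver behave. `ContactPointResolver` is a hypothetical helper, and the host name you pass in (for example the DNS name of your Cassandra service) is an assumption about your setup.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lagom, Akka, or the Cassandra driver.
final class ContactPointResolver {
    private ContactPointResolver() {}

    /**
     * Looks up the current addresses behind a DNS name. Calling this on
     * every reconnect attempt, instead of once at startup, avoids holding
     * on to addresses of pods that no longer exist.
     */
    static List<String> resolve(String serviceName) throws UnknownHostException {
        List<String> addresses = new ArrayList<>();
        for (InetAddress address : InetAddress.getAllByName(serviceName)) {
            addresses.add(address.getHostAddress());
        }
        return addresses;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pass the Cassandra service DNS name as the first argument.
        String name = args.length > 0 ? args[0] : "localhost";
        System.out.println(name + " -> " + resolve(name));
    }
}
```

Keep in mind that the JVM also caches successful lookups for a while. That cache is controlled by the `networkaddress.cache.ttl` security property, so a fresh lookup may still return the old addresses until the cached entry expires.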
I’m not sure if you are experiencing the same issue, but Cassandra endpoint discovery and access mainly depend on the ServiceLocator implementation in use.
Depending on the implementation, the ServiceLocator is responsible for querying endpoints, caching and load-balancing.
Once the initial discovery has completed and the driver has connected to the contact points, the connection pool in the Cassandra driver should reconnect by itself. I can see that this could be a problem if the entire Cassandra cluster is restarted with new IP addresses. I think there is a recently added issue about this: https://github.com/akka/akka-persistence-cassandra/issues/445
We are using the following configuration for Cassandra persistence.
When Cassandra goes down, the service goes down as well and does not try to reconnect once Cassandra is up and running again.