Continue buffering new messages when the underlying persistence plugin throws an exception

I have an EventSourcedBehavior set up in a cluster, and I'm using the R2DBC plugin with Postgres for both the journal and the snapshot store.

When I simulate a failure of Postgres (by switching it off completely),

I first get this exception:

persistenceId=TradeProcessingEventSourced|3, akkaSource=akka://CandleCalculator/system/sharding/TradeProcessingEventSourced/51/3, sourceActorSystem=CandleCalculator}
akka.persistence.typed.internal.JournalFailureException: Failed to persist event type [com.okcoin.sharded.candle.engine.demo.eventsourced.SecAgg$Event$Trade] with sequence number [11117] for persistenceId [TradeProcessingEventSourced|3]

and then this exception:

Exception during recovery from snapshot. PersistenceId [TradeProcessingEventSourced|3]. Connection validation failed MDC: {persistencePhase=load-snap, akkaAddress=akka://CandleCalculator@127.0.0.1:2554, persistenceId=TradeProcessingEventSourced|3, akkaSource=akka://CandleCalculator/system/sharding/TradeProcessingEventSourced/51/3, sourceActorSystem=CandleCalculator}
akka.persistence.typed.internal.JournalFailureException: Exception during recovery from snapshot. PersistenceId [TradeProcessingEventSourced|3]. Connection validation failed

Apparently, once persistence fails, the actor switches to recovery mode. Recovery also fails because the database is down, so the actor gets stuck in recovery mode until a recovery attempt succeeds.

Based on my observation, the messages sent to the cluster during that time are completely lost, because every actor in the cluster is in a continuous loop of "attempt to recover, fail to recover".

I'm under the impression that new messages should still be buffered while recovery is being attempted. Am I doing something wrong that causes the messages to be lost, or is this the expected behaviour?

Thanks in advance.

Persistence will stop the actor on a journal failure. Sharding will start an entity that is not running when it receives a message for it, and starting a stopped persistent actor while the database is unavailable will lead to replay failing. Once the database is available again, recovery will succeed.

In other words, what you see is the expected behaviour.
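As a side note, the default of stopping the entity when the journal throws during persist can be replaced with a backoff restart via `onPersistFailure`. A minimal sketch, with placeholder command/event/state types and no-op handlers (not your actual protocol):

```scala
import scala.concurrent.duration._
import akka.actor.typed.SupervisorStrategy
import akka.persistence.typed.PersistenceId
import akka.persistence.typed.scaladsl.{ Effect, EventSourcedBehavior }

// Placeholder protocol types, for illustration only
sealed trait TradeCommand
sealed trait TradeEvent
final case class TradeState(count: Long)

def behavior(entityId: String): EventSourcedBehavior[TradeCommand, TradeEvent, TradeState] =
  EventSourcedBehavior[TradeCommand, TradeEvent, TradeState](
    persistenceId = PersistenceId("TradeProcessingEventSourced", entityId),
    emptyState = TradeState(0),
    commandHandler = (_, _) => Effect.none, // real command handling goes here
    eventHandler = (state, _) => state)     // real event handling goes here
    // Restart with backoff instead of stopping when persisting fails,
    // e.g. while Postgres is down
    .onPersistFailure(
      SupervisorStrategy.restartWithBackoff(
        minBackoff = 1.second,
        maxBackoff = 30.seconds,
        randomFactor = 0.2))
```

That only changes how the entity reacts when the journal throws while persisting; it does not by itself guarantee delivery of messages sent while the database is down.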

If you want retries or other ways to get a better guarantee of delivery in the face of infrastructure failure, that is best implemented on the sending side. Note that if you do this at the edge of the Akka cluster, you also have to take into account the scenario of the cluster node handling the request going away (because of failures or because of a rolling upgrade) and retry in that case as well.

If you want to do it inside the Akka cluster, there are a few tools in the toolbox that you can look into: Reliable delivery • Akka Documentation or Futures patterns • Akka Documentation.
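For the Futures-patterns route, here is a minimal sketch of retrying an ask to a sharded entity with `akka.pattern.retry`. The `RecordTrade` message, the entity type key name and `recordWithRetry` are made up for illustration, not taken from your code:

```scala
import scala.concurrent.{ ExecutionContext, Future }
import scala.concurrent.duration._
import akka.Done
import akka.actor.typed.{ ActorRef, ActorSystem }
import akka.cluster.sharding.typed.scaladsl.{ ClusterSharding, EntityTypeKey }
import akka.pattern.retry
import akka.util.Timeout

object TradeRetry {
  // Hypothetical message protocol, for illustration only
  final case class RecordTrade(trade: String, replyTo: ActorRef[Done])
  val TypeKey: EntityTypeKey[RecordTrade] = EntityTypeKey[RecordTrade]("TradeProcessingEventSourced")

  def recordWithRetry(system: ActorSystem[_], entityId: String, trade: String): Future[Done] = {
    implicit val ec: ExecutionContext = system.executionContext
    // akka.pattern.retry needs the classic scheduler
    implicit val scheduler: akka.actor.Scheduler = system.classicSystem.scheduler
    implicit val timeout: Timeout = 3.seconds

    val entityRef = ClusterSharding(system).entityRefFor(TypeKey, entityId)

    // Each failed or timed-out ask triggers another attempt, up to 5 attempts 2 seconds apart.
    // Sharding restarts a stopped entity on the incoming message, so once the database
    // is back a later attempt can succeed.
    retry(() => entityRef.ask[Done](replyTo => RecordTrade(trade, replyTo)), 5, 2.seconds)
  }
}
```

Reliable delivery goes further than plain retries and also covers resending with confirmations between producer and consumer inside the cluster.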