Akka Recovery Timed Out

Hey guys! Hope you're all fine. We have a microservice using Akka Classic with persistent actors backed by Mongo (using ReactiveMongo).

Yesterday we saw

akka.persistence.RecoveryTimedOut: Recovery timed out, didn't get snapshot within 30000 milliseconds

during a deploy. During the deploy we estimate that each node was trying to recover ~6k actors.

I’ve been thinking about two parameters:

  • akka.persistence.max-concurrent-recoveries

  • the size of the connection pool to our Mongo datastore (this is controlled by nbChannelsPerNode)

Should those values be equal (or at least similar)? In our case we have them set to 250
concurrent recoveries and 70 connections per node.
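
To make the two numbers above concrete, here is a minimal sketch in Scala that reads them with Typesafe Config. `akka.persistence.max-concurrent-recoveries` is the Akka setting mentioned above; `app.mongo.nb-channels-per-node` is a hypothetical application-level key that I'm assuming gets passed on to ReactiveMongo's nbChannelsPerNode. The ratio is only printed to show how the two values relate; it is not a recommendation for what they should be.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object RecoverySettings {
  // Values from the post: 250 concurrent recoveries, 70 connections per node.
  val config: Config = ConfigFactory.parseString(
    """
      |akka.persistence.max-concurrent-recoveries = 250
      |
      |# Hypothetical key (not part of Akka or ReactiveMongo): assumed to be
      |# forwarded by the application to ReactiveMongo's nbChannelsPerNode.
      |app.mongo.nb-channels-per-node = 70
      |""".stripMargin)

  val maxConcurrentRecoveries: Int =
    config.getInt("akka.persistence.max-concurrent-recoveries")

  val channelsPerNode: Int =
    config.getInt("app.mongo.nb-channels-per-node")

  // How many actors may be recovering at once for each pooled connection.
  val recoveriesPerChannel: Double =
    maxConcurrentRecoveries.toDouble / channelsPerNode
}

object RecoverySettingsApp extends App {
  import RecoverySettings._
  println(f"$maxConcurrentRecoveries recoveries over $channelsPerNode channels " +
    f"= $recoveriesPerChannel%.2f recoveries per channel")
}
```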

PS: We’re aware of this config, but 30 seconds is a long time for our SLA, so we need recovery to be considerably faster.

I’d make sure to understand where the bottleneck is before starting to tweak config.

For example:
Look into how much time each actor takes to recover - perhaps a different/more frequent snapshot scheme can make recoveries faster.
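
As a rough sketch of both ideas (measuring recovery time and snapshotting more often), assuming Akka Classic persistence: the `CounterActor`, its `Add`/`Added` messages, and the `snapshotEvery` value of 100 are made up for illustration and are not taken from your service. The measured time starts at actor construction, so it also includes any time spent waiting for a recovery permit.

```scala
import akka.actor.ActorLogging
import akka.persistence.{PersistentActor, RecoveryCompleted, SaveSnapshotFailure, SaveSnapshotSuccess, SnapshotOffer}

object CounterActor {
  // Hypothetical command and event, for illustration only.
  final case class Add(amount: Long)
  final case class Added(amount: Long)

  // Snapshot on every `every`-th persisted event.
  def shouldSnapshot(sequenceNr: Long, every: Long): Boolean =
    sequenceNr > 0 && sequenceNr % every == 0
}

class CounterActor(id: String, snapshotEvery: Long = 100L)
    extends PersistentActor with ActorLogging {
  import CounterActor._

  override def persistenceId: String = s"counter-$id"

  private var total = 0L
  private var replayedEvents = 0L
  // Taken at construction, so it covers queueing for a recovery permit,
  // the snapshot load, and the event replay.
  private val createdAtNanos = System.nanoTime()

  override def receiveRecover: Receive = {
    case SnapshotOffer(_, snapshot: Long) =>
      total = snapshot
    case Added(amount) =>
      total += amount
      replayedEvents += 1
    case RecoveryCompleted =>
      val tookMs = (System.nanoTime() - createdAtNanos) / 1000000
      log.info("Recovered {} in {} ms, replayed {} events",
        persistenceId, tookMs, replayedEvents)
  }

  override def receiveCommand: Receive = {
    case Add(amount) =>
      persist(Added(amount)) { event =>
        total += event.amount
        // A smaller snapshotEvery means fewer events to replay on recovery,
        // at the cost of more snapshot writes.
        if (shouldSnapshot(lastSequenceNr, snapshotEvery)) saveSnapshot(total)
      }
    case SaveSnapshotSuccess(metadata) =>
      log.debug("Snapshot saved at sequence number {}", metadata.sequenceNr)
    case SaveSnapshotFailure(metadata, cause) =>
      log.warning("Snapshot at sequence number {} failed: {}",
        metadata.sequenceNr, cause.getMessage)
  }
}
```

If the logged times are small but the replayed event counts are high (or the other way around), that already tells you something about whether the snapshot scheme or the datastore is the more likely place to look.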
Probably also good to have a gut feeling for the throughput limits of the database, to know whether that is the bottleneck you are hitting - perhaps more resources on the db side are the solution.