Context: my application (Querki) is one of the older production Akka applications (once upon a time, I was Lightbend’s smallest customer), and is currently running on an unforgivably ancient version of the Akka stack. I’m starting to upgrade that, but in the meantime I just had a major production crash – the system is completely down, so I’m trying to puzzle out how to get it back up again.
The errors in the logs manifest in two forms, one like this:
[akka.tcp://querki-server-2@10.64.6.107:10007/system/cassandra-journal/$f] Invalid replayed event [sequenceNr=5126, writerUUID=aa42b54e-1ee0-4f6f-b753-e504ec1cea19] from a new writer. An older writer already sent an event [sequenceNr=5126, writerUUID=5ad395b7-6928-425c-a091-3dc774555921] whose sequence number was equal or greater for the same persistenceId [/sharding/IdentityCacheCoordinator]. Perhaps, the new writer journaled the event out of sequence, or duplicate persistentId for different entities?
and the other like this:
[akka.tcp://querki-server-2@10.64.6.107:10007/system/sharding/IdentityCacheCoordinator/singleton/coordinator] Exception in receiveRecover when replaying event type [akka.cluster.sharding.ShardCoordinator$Internal$ShardHomeDeallocated] with sequence number [5128] for persistenceId [/sharding/IdentityCacheCoordinator].
From digging around, it sounds like the problem has to have been a split-brain that corrupted the histories of the Coordinators for four of my sharding regions. (Querki is heavily cluster sharded, with a bunch of different entity types.)
I have no idea how it got that split-brain (my homebrew system tends to be over-conservative specifically to avoid that), but that’s arguably a lesser concern: once I get the stack up to modern snuff, I’ll be switching over to the Split Brain Resolver (SBR), now that it has been open-sourced. For now, I’m just trying to get things back up and running.
Based on comments here, I get the impression that it is reasonable, while the system is down, to simply blow away the corrupted histories – that Coordinator histories don’t really matter across full shutdowns. Is this correct? And if so, is that probably the easiest way to get things up and running again?
Thanks in advance for any insights you might be able to provide.