Corrupted Event Journal in Akka Persistence

Dear Akka Community,

We’ve encountered a JournalFailureException during entity recoveries in our production environment, and we suspect it is linked to an outage at our Cassandra provider, DataStax. We’re not sure whether this behavior is possible in Akka, so we’re seeking clarification.

During a recent outage on March 1, 2024, we observed an Akka persistent entity write an event with a sequence number that was already in use, i.e. a duplicate entry in the journal. It appears that during the outage the entity moved ahead of itself, persisting new events even though it had recovered only partial or outdated data because of the data center outage. This is not the classic network-partition scenario, where two instances of the same entity persist the same data twice. Here, a single entity tried to recover from its past events, failed to see all of them, and then carried on persisting new events, causing a sequence number collision.

We don’t believe this is a network partition issue: the old data was originally persisted two years ago, and the entity was reactivated to persist new events. It failed to recover the old events and went on to persist new events starting from sequence number 1. To illustrate, here is a screenshot of such a collision from the event journal’s “messages” table (certain fields are masked for confidentiality):

As seen in the screenshot, the first event was persisted on Sep 15, 2021, and the new event was persisted on Mar 1, 2024, both with the same sequence number.
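Below is a rough sketch of how one could scan an entity for further collisions of this kind. It assumes the default akka-persistence-cassandra 1.x messages schema, where rows are keyed by ((persistence_id, partition_nr), sequence_nr, timestamp), so a re-used sequence number shows up as an extra row rather than an overwrite; the keyspace name and persistence id below are placeholders, not our real values.

```scala
import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

object FindSeqNrCollisions extends App {
  // Placeholder values -- point these at the real keyspace and entity.
  val keyspace      = "akka"
  val persistenceId = "ShoppingCart|cart-42"

  val session = CqlSession.builder().build()

  // With the default partition size, low sequence numbers live in partition_nr 0,
  // so restricting to that partition is enough to inspect the events around seq nr 1.
  val rows = session
    .execute(
      s"SELECT sequence_nr, timestamp, writer_uuid FROM $keyspace.messages " +
        "WHERE persistence_id = ? AND partition_nr = 0",
      persistenceId)
    .all()
    .asScala

  // Two or more rows sharing a sequence_nr is exactly the collision shown above.
  rows.groupBy(_.getLong("sequence_nr")).filter(_._2.size > 1).foreach {
    case (seqNr, dups) =>
      val writers = dups
        .map(r => s"${r.getUuid("timestamp")} by ${r.getString("writer_uuid")}")
        .mkString(", ")
      println(s"sequence_nr $seqNr written ${dups.size} times: $writers")
  }

  session.close()
}
```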

Our question is: under what circumstances can Akka corrupt the event journal during a database outage? We would expect Akka not to proceed with persisting new events while the data center is unable to return the full set of existing events.

We appreciate any insights or guidance on this matter.

What consistency settings were in use in your Akka Persistence Cassandra config? And for Astra, is the Cassandra cluster spread across regions, or even availability zones (different clouds differ on whether AZs count as separate datacenters)?
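(If it helps narrow things down: in akka-persistence-cassandra 1.x the journal’s requests go through a driver execution profile, so one quick way to see which consistency level is actually in effect, assuming the default profile name and no per-statement overrides, is a sketch like the one below. Older 0.x plugin versions configured this via cassandra-journal.write-consistency / read-consistency instead.)

```scala
import com.typesafe.config.ConfigFactory

// Sketch: print the consistency level the Cassandra journal profile resolves to.
// Assumes the default profile name used by akka-persistence-cassandra 1.x.
object ShowJournalConsistency extends App {
  val config = ConfigFactory.load()
  val path =
    "datastax-java-driver.profiles.akka-persistence-cassandra-profile.basic.request.consistency"
  val consistency =
    if (config.hasPath(path)) config.getString(path) else "(driver default)"
  println(s"journal consistency level: $consistency")
}
```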

Specifically, if LOCAL_QUORUM is used and Cassandra is spread across datacenters, the docs note:

The risk of using LOCAL_QUORUM is that in the event of a datacenter outage events that have been confirmed… may not have been replicated to the other datacenters. If a persistent actor for which this has happened is started in another datacenter it may not see the [events which weren’t replicated] (the verbiage in the docs talks about latest events, but that’s not strictly true –LR). If the Cassandra data in the datacenter with the outage is recovered then the event that was not replicated will eventually be replicated to all datacenters resulting in a duplicate sequence number.

Generally, if the query for events since the last snapshot (or from sequence_nr 1 if no snapshot was found) succeeds but returns no results, Akka Persistence will start from sequence number 1. With Cassandra this can happen, for instance, if a weaker-than-QUORUM consistency level is used (e.g. consistency ONE, where the first answer received is “nothing”), and there are also scenarios where it could happen on a poorly operated cluster (unlikely in this case: presumably Datastax knows what they’re doing operationally).
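To make the failure mode concrete, here is a minimal, hypothetical entity sketch (nothing in it is specific to your system): if the replay query for such an entity “succeeds” but returns no rows, recovery completes against emptyState with a highest sequence number of 0, so the next persist is written as sequence_nr 1 and lands next to the rows already sitting in the table.

```scala
import akka.actor.typed.Behavior
import akka.persistence.typed.PersistenceId
import akka.persistence.typed.scaladsl.{Effect, EventSourcedBehavior}

// Hypothetical counter entity. If the journal returns no events during recovery,
// the behavior starts from emptyState and its next persisted event gets sequence_nr 1,
// even if events for this persistence id already exist but were not returned.
object Counter {
  sealed trait Command
  case object Increment extends Command

  final case class Incremented(newTotal: Int)
  final case class State(total: Int)

  def apply(id: String): Behavior[Command] =
    EventSourcedBehavior[Command, Incremented, State](
      persistenceId = PersistenceId.ofUniqueId(id),
      emptyState = State(0), // what recovery falls back to when replay yields nothing
      commandHandler = (state, cmd) => Effect.persist(Incremented(state.total + 1)),
      eventHandler = (state, event) => State(event.newTotal))
}
```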

Thank you for shedding light on this, @leviramsey!!

We confirmed that our consistency settings have always been QUORUM, not LOCAL_QUORUM, in production. I’m leaning towards a poorly managed cluster on Datastax’s side. So we looked into the RCA from Datastax:

Root Cause:

Disk Space issues across multiple racks caused the database data service layer to be unstable. There was an issue where Kubernetes VMs crashed which caused part of our data service to go down for the database; we have engaged with Azure as to why this occurred. While the data service was recovering and restoring the replica counts, the sstable metadata on the remaining live data service components filled to 100%, resulting in the disk running out of space and writes going into an unavailable state.

Resolution:

● Additional space was added to the pods, and a node-by-node restart was performed to address the issue.
● We have fixed the alerting mechanism to trigger the disk utilization alert at 85%.

When a database vendor tells you they ran out of disk space, that tends to indicate something really bad happened during the incident…

In situations where Cassandra sstables are lost (including due to node unavailability), it is possible to observe that sort of time travel (data is there, it isn’t there, then it is again). It’s also far from unheard of in my experience for losing a Cassandra node to result in substantial spikes in disk usage on the other nodes.