Hello Guys,
We recently had a big production issue. All the nodes in our cassandra cluster were restarted at the same time.
When the cassandra nodes recovered, there is a very strange behaviour that happened in our lagom applications. Most (if not all) of our active readside processors had the column timeuuidoffset
reset back from the beginning. We verified this by seeing old events in our logs, and also checking the offsetstore table of each of the lagom service keyspaces:
SELECT toTimestamp(timeuuidoffset), eventprocessorid, timeuuidoffset FROM <keyspace>.offsetstore;
We have verified that the time went to as old as 2019.
A lot of our events got replayed, and many of the processors have side effects (sending emails).
Anyway, here’s the summary of what happened:
- lagom services are online
- production C* went down
- revived prouction C*
- production C* had a lot of commitlogs, processing took a lot of time
- production C* is up
- offsettores are reset
We managed to replicate this issue in our staging enviroment:
- lagom services are online
- mounted prod disk snapshots to staging cluster
- started staging C*
- staging C* had a lot of commitlogs, processing took a lot of time
- staging C* is up
- offsettores are reset
Initially, we thought that it was purely just a cassandra issue and we were so concerned about the integrity of our messages table.
However when we did the following:
- lagom services are taken offline
- mounted prod disk snapshots to staging cluster (same snapshot)
- started staging C*
- staging C* had a lot of commitlogs, processing took a lot of time
- staging C* is up
- offsettores are not reset
- lagom services are revived
- offsettores are not reset
Sadly, the akka logs are set to log level WARN
so we weren’t able to see the information when the application has reconnected to the offsetstore, or what it did during that time. We will do that as a follow up (redeploy our apps with the appropriate log level, and do a restoration). However, this is not something we can do right away.
In this case, is it possible to get pointers on how to debug this?
I created a bare minimum project to try and replicate the issue, but to no avail.
Here it is:
We’re running on:
lagom 1.6.4
cassandra 3.11.3
I tried publishing many messages, but I think my dataset is limited. Any insights on how to replicate this issue with minimal code is highly appreciated.
Thanks!
Clyde