I’m running an event-sourced system with ~300 million entities, and I’m persisting events using the R2DBC plugin. In the log, I’m getting frequent messages that indicate unexpectedly long DB operations, e.g.
insert [Variant|1_d91439d5261ba186ddc0ddf2e3351291_9502fb2b89cb53c3ef24aed51ca09994] - getConnection took [464194] µs
insert [Variant|1_cff35e4413aceb4de82fe113f755451c_b42d35106898db1d2ecd262649c69f23] - DB call completed [2023-08-07T22:02:15.866209Z] in [1213377] µs
select eventsByPersistenceId [Parent|1_35bd1a5323e49022a8cca07e8e21e1f2] - Selected [2] rows in [1310022] µs
select highest seqNr [Parent|1_b2b064e73a2b764b895b5718a93745d1] - Selected [1] rows in [780619] µs
The high latency doesn’t seem to be coming from the DB server itself: its p99 latency is below 1 ms.
So, this has to be client-side, right? What exactly is happening? Is it that the (completed) Futures containing the results from the DB aren’t being scheduled to run on the CPU soon enough?
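To make the mechanism I have in mind concrete, here's a tiny, self-contained sketch (plain Scala, no Akka involved; the object and value names are just mine). Even a Future that is already completed has its map/onComplete callbacks submitted to the ExecutionContext, so if every thread of the dispatcher is busy, the callback just sits in the queue and the measured "DB call" time grows even though the database answered quickly:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object CompletedFutureScheduling extends App {
  // A deliberately tiny "dispatcher": a single thread, standing in for a
  // fully saturated pool.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

  // Occupy the only thread for ~2 seconds, simulating CPU-bound work.
  Future {
    val deadline = System.nanoTime() + 2.seconds.toNanos
    while (System.nanoTime() < deadline) ()
  }

  // This future is ALREADY completed, but its callback still has to be
  // scheduled on the (busy) execution context before it can run.
  val submitted = System.nanoTime()
  val callbackDelay = Future.successful("row").map { _ =>
    (System.nanoTime() - submitted).nanos.toMillis
  }

  // Prints roughly 2000 ms: the delay is scheduling, not the "DB call".
  println(s"callback ran after ${Await.result(callbackDelay, 5.seconds)} ms")
}
```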
I’m running the Starvation Detector on all of the relevant dispatchers, and I do get the occasional warning, e.g.
Exceedingly long scheduling time on ExecutionContext Dispatcher[command-hub.dispatchers.persistence]. Starvation detector task was only executed after 499 ms which is longer than the warning threshold of 100 milliseconds. This might be a sign of thread, CPU, or dispatcher starvation. This can be caused by blocking calls, CPU overload, GC, or overlong work units.
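To tell "blocking calls" apart from "CPU overload" concretely, I've been thinking of sampling the states of the dispatcher's threads while the warnings fire. A rough, hand-rolled sketch of what I mean (this is not the Starvation Detector's API; it assumes Akka's usual convention of including the dispatcher id in thread names):

```scala
import java.lang.management.ManagementFactory

object PersistenceDispatcherThreadStates extends App {
  // Akka names dispatcher threads after the dispatcher's config path, so a
  // substring match should pick up the right threads (assumption on my part).
  val dispatcherId = "command-hub.dispatchers.persistence"

  val byState = ManagementFactory.getThreadMXBean
    .dumpAllThreads(false, false)
    .filter(_.getThreadName.contains(dispatcherId))
    .groupBy(_.getThreadState)

  // Mostly RUNNABLE           -> threads are busy with (or waiting for) CPU
  // Mostly BLOCKED / WAITING  -> something on the dispatcher is blocking
  byState.foreach { case (state, threads) => println(s"$state: ${threads.length}") }
}
```

In practice I'd run this inside the app (or just take a thread dump) at the moment a warning shows up.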
Of the four causes listed in the warning, this seems like CPU overload, because I’ve been pretty careful to avoid the other three. In situations like these, is the solution either to a) throw more compute resources at the problem, or b) decrease parallelism to a level that the current compute resources can handle?
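For concreteness, option (b) for me would look roughly like the sketch below: cap the persistence dispatcher at a fixed pool at or below the available cores, instead of letting it accept more work than the CPUs can drain. The pool size is a placeholder; the dispatcher path is the one from my config that shows up in the logs above.

```scala
import com.typesafe.config.ConfigFactory

object CappedPersistenceDispatcher extends App {
  // Placeholder size: at or below the number of cores available to the node.
  val tuned = ConfigFactory.parseString(
    """
      |command-hub.dispatchers.persistence {
      |  type = Dispatcher
      |  executor = "thread-pool-executor"
      |  thread-pool-executor.fixed-pool-size = 8
      |}
      |""".stripMargin)
    .withFallback(ConfigFactory.load())

  // This is the Config I'd hand to the ActorSystem instead of the default.
  println(tuned.getString("command-hub.dispatchers.persistence.executor"))
}
```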
I guess I’m using this post as a sounding board for my own conclusion. But also, maybe I’m missing something or am flat-out wrong. Any insight is appreciated.