Hi again,
This is a follow-up question to my previous post regarding scaling up SSE streams on Play 2.7.x (with Akka HTTP 10.1 under the hood).
Our max-connections issue has been resolved, but we are now looking at how connections are closed on the server side when the client either shuts down or invokes the close() method on the JS EventSource object.
It is probably worth mentioning that we have an F5 load balancer between our clients and the Play server, which means it is actually the F5 that talks to the Play server rather than the clients directly.
Given the way we have constructed the event stream, client-side closure is detected through the attempts to send keep-alive SSE messages: when such an attempt fails, the Akka stream completes, the source actor is terminated, and that termination is picked up by our session actor, which unsubscribes from the various cluster topics and stops itself (context.stop(self)).
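For reference, the cleanup path looks roughly like this. This is a minimal sketch with classic actors, not our exact code; SessionActor is an illustrative name and the unsubscribe step is just a placeholder comment:

```scala
import akka.actor.{Actor, ActorRef, Terminated}

// Simplified session actor: it watches the prematerialized source actor that
// backs the SSE stream. When the stream completes or fails (e.g. a keep-alive
// write hits "broken pipe"), the source actor stops, we receive Terminated,
// and the cleanup described above runs.
class SessionActor(sourceActor: ActorRef) extends Actor {

  override def preStart(): Unit = context.watch(sourceActor)

  def receive: Receive = {
    // events from the internal event publisher are pushed downstream
    case event: String =>
      sourceActor ! event

    case Terminated(`sourceActor`) =>
      // unsubscribe from the cluster topics here (placeholder for our real cleanup)
      context.stop(self)
  }
}
```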
We have been testing in three different ways to validate the proper closure of connections and the termination of the supporting actors:
- Using a curl command that directly opens the event stream on the server
- As part of the load tests we are performing with the Gatling SSE functionality
- Using the regular web client of our solution: closing here means that we simply close the browser tab. We recently also added a window.beforeunload handler in our React app that invokes the close() method on the JS EventSource.
What are we observing?
- For the curl case, after pressing Ctrl-C to interrupt the connection, we see the connection on the server side terminated as expected, with a “connection reset by peer” in our logs just before the cleanup starts. The cleanup is triggered by the next attempt to deliver a keep-alive SSE message.
- For the Gatling test we have a strange observation: for the majority of the connections everything works as expected (the only difference with test case 1 is that this time a “broken pipe” triggers the cleanup). A small subset, however, does not show the same behavior: for these connections the server keeps sending keep-alives, which means we have a resource leak that can grow over time. It looks as if the F5 keeps the connection alive and our server believes there is still a peer to receive the messages, so the stream never completes and the cleanup we hope for never starts.
- When using our regular web client we never see anything cleaned up. Closing the browser tab, or calling close() on the JS EventSource object, seems to have no effect: the connection appears to stay open, the keep-alive SSE messages continue to be generated (our logging shows this), and the Akka stream is not terminated. We therefore have a resource leak, because over time a growing number of actors keep running without a purpose.
Our main suspicion is the F5 load balancer, although we have not found hard evidence for this yet and our F5 administrator still has to join the troubleshooting, so we cannot exclude that the problem lies elsewhere.
Our environment:
Play 2.7.4
Akka HTTP 10.1.12
SSE stream using Akka Streams (a prematerialized Source actor, watched by a session actor that subscribes to the internal event publisher of our system)
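For completeness, the stream is wired up roughly like this. Again a simplified sketch rather than our real controller: the buffer size, overflow strategy, keep-alive interval and payload are placeholders, and it reuses the SessionActor from the sketch above:

```scala
import akka.actor.{ActorSystem, Props}
import akka.stream.{Materializer, OverflowStrategy}
import akka.stream.scaladsl.Source
import play.api.http.ContentTypes
import play.api.libs.EventSource
import play.api.mvc.{AbstractController, Action, AnyContent, ControllerComponents}
import scala.concurrent.duration._

class EventsController(cc: ControllerComponents)
                      (implicit system: ActorSystem, mat: Materializer)
    extends AbstractController(cc) {

  def stream: Action[AnyContent] = Action {
    // Prematerialize so we get hold of the ActorRef backing the stream up front
    val (sourceActor, source) =
      Source.actorRef[String](bufferSize = 64, overflowStrategy = OverflowStrategy.dropHead)
        .preMaterialize()

    // the session actor (see the sketch above) watches this ref
    system.actorOf(Props(new SessionActor(sourceActor)))

    Ok.chunked(
      source
        // periodic keep-alive; a failed write here is what should complete
        // the stream and stop the source actor
        .keepAlive(15.seconds, () => "keep-alive")
        .via(EventSource.flow)
    ).as(ContentTypes.EVENT_STREAM)
  }
}
```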
I would like to know whether:
- there are known issues with this in the versions we are using, and whether an upgrade is required
- other people have seen similar issues when a load balancer sits between the clients and the server