Akka-http random connections close

Hello,
We would like to get some advice on tracing akka-http.

We believe we have an issue with random connections being closed, presumably by the akka-http server. It has existed for more than a year already, but it only started to actually affect us a couple of months ago.
What is happening:
We have an Amazon Classic Load Balancer that forwards HTTP requests to our proxy server, which is built on top of akka-http. Sometimes (roughly once per ~800 K requests, which results in ~100-150 occurrences per day) the ELB returns a 504 GATEWAY_TIMEOUT error with almost zero backend processing time and minimal info (as if the connection was closed prematurely). We don’t see any requests in our proxy logs at the moment of such errors; it looks like our async request handler is not getting called when this happens.
We couldn’t find any correlation between request URL/verb/payload size and their “cancellation”. Most likely the issue is caused by a “previous” request, not the one that got the 504, and we have no idea how to track that.
We upgraded akka-http several times, from 10.0.3 on Scala 2.11 up to the latest 10.1.6 on 2.12; the problem persisted across all these changes.

Our current guesses are:

  1. The backend server closes the connection with the ELB prematurely by sending a RST.
  2. The backend sends a response without a “Connection: close” header but closes the connection anyway, causing the ELB to send a 504 for some subsequent request.
  3. The backend closes the connection without sending a FIN or RST packet (?)

We can collect any logs and tcpdumps on the client side and on the host machine. What info/logs/metrics can we gather from akka-http that might help us track down the issue? Any suggestions?

Hi @Discipe,

thanks a lot for this comprehensive problem report.

The most likely non-bug reason would be that the idle timeout in Akka HTTP triggered and closed a connection (hard to say if a backend connection would idle given your request rate). Actually, that’s just regular behavior that any HTTP server usually implements (because clients like to keep persistent connections but the server wants to guard against too many open but idle connections). This behavior introduces a race condition where a client (or reverse proxy) tries to run another request on a persistent connection right at the moment the server decides to close the connection, leading to the request failing with some error (or just not being responded to). As this is regular HTTP/1.1 server behavior, reverse proxies (and other clients) usually retry requests that didn’t make it to the server. However, that is only safe to do for idempotent and safe requests (i.e. no POST or PUT, etc. requests).

Is the idle-timeout a possibility in your case? You should see warnings when it happens. You said there’s no correlation to requests, so is it correct that the issue happens for GET requests as well? Can you try to identify why the Amazon ELB doesn’t retry requests in case of a backend connection being closed by the server?

Could you try changing akka.http.server.idle-timeout, e.g. to infinite, which might be safe if your proxy service is only exposed to the ELB?
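For example, something along these lines (just a sketch of overriding the setting programmatically; the system name is illustrative, and you could equally set the value in your application.conf):

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Sketch: disable the server-side idle timeout for the proxy's ActorSystem.
// Only consider this if the proxy is reachable exclusively by the ELB.
val config = ConfigFactory
  .parseString("akka.http.server.idle-timeout = infinite")
  .withFallback(ConfigFactory.load())

implicit val system: ActorSystem = ActorSystem("proxy", config)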

As explained above, both of these can happen when the idle timeout triggers. Either the ELB sends the request while the connection is still just barely alive, so on the TCP level the request makes it to the server but the server doesn’t read it any more. Or the ELB is just a bit later, i.e. it hasn’t seen the FIN from the server yet but tries to send some data, in which case the server might have completely closed the connection already and a RST is sent back by the OS.

Hope this helps tracking down that issue.

Johannes

Hello, @jrudolph. Thank you for your reply!

We’ll try setting idle-timeout to infinite in one of our future experiments (it is currently set to 60 s for the server and 30 s for the host connection pool).

Here’s our current (but not final) status of the investigation:
We are now able to reproduce the issue pretty reliably (about 1 in several thousand requests) using a small HTTP client that sends requests in a loop. We observed RSTs for very short-lived connections (the time span between SYN and RST is less than 500 ms, see screenshot below).
However, there could be problems with my interpretation of the tcp dump and with how Wireshark shows packets (I’m not used to debugging the TCP stack :slight_smile:).

Can you try to identify why the Amazon ELB doesn’t retry requests in case of a backend connection being closed by the server?

We were surprised by this too. The idea to look for RSTs was proposed by AWS support; I guess they just didn’t retry requests for some reason (because they were able to fully send the request body? or they just don’t do that at all?).

Here’s a screenshot from the tcp dump with a filtered TCP “stream” which ended with a 504 on the ELB.

For all 504 errors we managed to reproduce and catch, we see the same pattern:

  1. Got a request with transfer-encoding: chunked (however, almost all requests are chunked in this test)
  2. Sent a response to it
  3. Received another incoming request (non-chunked; yes, even GET requests are interrupted)
  4. Connection closed (RST). Often it looks like multiple RSTs are sent for the same connection at the end.

Still, I’m not sure whether this behavior is abnormal (it’s hard to process GBs of dumps; maybe the same pattern occurs elsewhere without ending in a 504).

We’ll try to change our test setup and play with “chunked” requests. Maybe we would be able to tune our tool to reproduce the issue more reliably, or even write a test in akka-http.

However, we would very much appreciate any help in producing something similar to a “tcp dump” on the akka-http side, simply to compare the Linux-level dump against the application level.

Just looking at all those materializations, flows and implicits makes it hard to track down the place where a RST (or connection close) could be sent. I haven’t analyzed the akka-http sources thoroughly yet… Any hints for places (classes/functions) where this could happen? Maybe there are some logs we can turn on or exceptions we could catch?
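One candidate we found while looking through the reference configuration (assuming we understand the setting correctly) is the wire-level logging of network bytes; a sketch of how we would enable it:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Sketch (assumption: the setting behaves as we expect): log up to 100 bytes of the
// data flowing over each server connection at DEBUG level, to compare against tcpdump.
val config = ConfigFactory
  .parseString(
    """
    akka.loglevel = DEBUG
    akka.http.server.log-unencrypted-network-bytes = 100
    """)
  .withFallback(ConfigFactory.load())

implicit val system: ActorSystem = ActorSystem("proxy", config)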

Hi @Discipe,

interesting. Can you post the tcpdump data for that request/response cycle somewhere or send it to me by email?

What I can see here is:

  • A POST request is received
  • A response is immediately sent out (does it set a Connection: close header? can’t see that from the screendump)
  • The next packet from the server after the response is a FIN packet, so the server decided to close the connection
  • The client continues sending data on the connection
  • The server answers each further packet with RST (probably because the socket was fully closed on the server)

A few things are weird in this dump:

  • The client seems to send data (72801, 72803) before the first request (might be a wireshark weirdness?). What’s in those two data packets from the client before the POST?
  • The client sends further data (72988, 72990) before it has read the response (or the server sends out the response early before it has read the complete request).
  • The client tries to send another request after the previous response closed the connection (though, I don’t know whether the response actually contained the Connection: close header or not).
  • The server doesn’t wait for the client’s FIN, at least for a while, which would prevent ACKs from running into a completely closed socket.
  • It seems a client-server round trip is about 0.5 ms. The client should know quite early that the connection is broken but still keeps sending data for a while. (I don’t know, but on a busy client like the ELB it could be that these kinds of notifications are queued somewhere and aren’t processed for a while?)

So, there’s still a bit of uncertainty about what is happening.

  • Can you check first if the response contains a Connection: close header or not? If not, then there’s some kind of bug in akka-http.
  • Can you also check what the contents are of the intermediate data frames from the client that are not labelled by Wireshark?

Johannes

Hi @jrudolph,

In short: you are right, akka doesn’t send the “connection: close” header. Everything else is just quirks of Wireshark.

In details:

  1. Wireshark “glues” together separate TCP packets and infers higher-level protocols. Packets 72801, 72803 and 72805 are all part of a single HTTP request. Wireshark shows it as HTTP at the last packet of the sequence. Here are the details of packet 72805:
    image

  2. Packets 72988 and 72990 are, again, parts of a single incoming request. Since the “connection: close” header wasn’t sent, the ELB simply routed one of the requests to that connection, assuming it was still open.

  3. I guess the missing header also explains the lack of a response to the FIN.

Wireshark shows the whole session from the screenshot as 3 HTTP requests, where the last request is received but never answered:

Turning off “chunked” seems to help in our situation (but we still cannot guarantee anything at this point).
To make sure, we’ll try to create a minimal reproduction example to track the issue down in akka/akka-http. If you have any suggestions for reproducing/testing the issue or for how we can help with fixing it, please share your thoughts.

Thanks for the clarifications.

It seems I found a potential reproducer.

It has two parts:

  • Your response contains a Connection: keep-alive header which wasn’t added by Akka HTTP (in HTTP/1.1 there’s no need to add it). It’s probably forwarded by your proxy logic. In fact, one reason why we don’t yet provide proxying out of the box is that it is not trivial to get right. Some headers like Connection should not be forwarded because they only refer to the current connection and not to the request/response cycle.
  • You probably sent out the response before reading the complete request. In that case Akka HTTP logs the warning "Sending an 2xx 'early' response before end of request for ... received", adds a Connection: close header and closes the connection after the response has been sent out. It seems that if the HttpResponse object already contains a Connection header, Akka HTTP “forgets” to add the Connection: close header.

Can you check if you see those early-response warnings in your logs? If yes, can you check whether you actually consume request bodies completely before sending out a response? Can you also make sure not to include Connection: keep-alive headers in your response and see if that fixes the issue?

I’ll create an issue to fix the early-response closing behavior so that it includes a Connection: close header in any case.

Thanks again for the report.

Johannes

Added https://github.com/akka/akka-http/issues/2403 for the issue.

Thanks a lot for the info! You are right on both parts here.

We are indeed copying request & response headers back and forth, and this has already been a source of problems. It seems we somehow overlooked the “Connection” header, and the only place it blew up was with “chunked” requests. I’ll remove the “Connection” header from the response and test it out. Thank you.
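Roughly what we plan to do on the response side (just a sketch; whether mapHeaders is the right place for this is our own assumption):

import akka.http.scaladsl.model.HttpResponse

// Sketch: drop hop-by-hop headers (at least Connection) from the backend response
// before returning it from our proxy handler.
def stripHopByHopHeaders(response: HttpResponse): HttpResponse =
  response.mapHeaders(_.filterNot(h => h.is("connection") || h.is("keep-alive")))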

About the warnings: indeed, we have them in our proxy logs (Sending an 2xx 'early' response before end of request for ... received).
But I’m not sure how to check whether we “fully consume request body”… First, some details. Here’s the way we start the akka server:

  // An implicit ActorSystem and Materializer are assumed to be in scope here;
  // processRequest(request) returns a Future[HttpResponse].
  def apply(): Unit = {
    val (host, port) = (Configuration.appServerInterface, Configuration.appServerPort)
    val serverSource = Http().bind(interface = host, port = port)

    serverSource.to(Sink.foreach { connection =>
      connection.handleWithAsyncHandler { request =>
        try {
          processRequest(request)
        } catch {
          case e: Exception => Future.successful(exceptionToResponse(e, request))
        }
      }
    }).run()
  }

Our workflow looks like this:

  1. Receive the incoming request
  2. Select a proxy route based on the request headers
  3. Create a new request based on the route: (originalRequest.copy(_.withHeaders(...))). Copy the headers from the incoming request (filter out the TimeoutAccess header and add our own headers). Send the request using the Http().singleRequest function.
  4. The result of the “singleRequest” call (Future[HttpResponse]) is the return value for handleWithAsyncHandler, i.e. we pass the response through as-is (see the sketch below).
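A simplified sketch of that forwarding step (targetUri and extraHeaders stand in for our real routing logic; an implicit ActorSystem, Materializer and ExecutionContext are assumed to be in scope):

import scala.collection.immutable
import scala.concurrent.Future
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpHeader, HttpRequest, HttpResponse, Uri}

// Simplified sketch of steps 3/4: copy the incoming headers minus the synthetic
// Timeout-Access header and the hop-by-hop Connection header, add our own headers,
// and forward the request to the backend as-is.
def forward(request: HttpRequest, targetUri: Uri,
            extraHeaders: immutable.Seq[HttpHeader]): Future[HttpResponse] = {
  val forwardedHeaders =
    request.headers.filterNot(h => h.is("timeout-access") || h.is("connection")) ++ extraHeaders
  Http().singleRequest(request.withUri(targetUri).withHeaders(forwardedHeaders))
}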

I don’t quite understand how it is possible to get these warnings. In my opinion we don’t need to call methods like “.toStrict” on the incoming request with our approach to proxying. We don’t have any errors that would imply an incomplete request or response on either end of the proxy. Is it possible that we receive that warning because a new (unexpected) incoming request arrives on the closing channel?

I’d say in a proxy it can only happen if the downstream server has the same “problem”. Early responses are under-specified in HTTP/1.1 (but explicitly allowed in HTTP/2), so it’s hard to say how client / server software should behave exactly. In Akka HTTP, we chose to support it somewhat but only as an exceptional case.

To solve it in your code, you would have to hold back the response until the whole request has been sent over to the backend. You can do this by attaching a watchTermination stage to the entity of the request and then waiting for that completion before sending the response further.

Here’s a sketch (untested):

import akka.Done
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.scaladsl.Flow
import akka.util.ByteString
import scala.concurrent.Promise
// an implicit ActorSystem, Materializer and ExecutionContext are assumed to be in scope

val request: HttpRequest = ??? // the incoming request to forward
val completionPromise = Promise[Done]()
// watchTermination completes once the request entity has been fully consumed,
// i.e. completely sent over to the backend
val finalRequest =
  request.transformEntityDataBytes(Flow[ByteString].watchTermination()(
    (_, completionFuture) =>
      completionPromise.completeWith(completionFuture)
  ))

Http().singleRequest(finalRequest).flatMap { response =>
  // wait for both futures to complete before sending the response on
  completionPromise.future.map { _ =>
    response
  }
}

@jrudolph
Thank you again for all your help!
It seems like filtering “Connection” headers helped!
We still need to investigate the “early responses” using the sketch you kindly proposed. It’s pretty confusing to have that issue while our backends are built on top of Play :slight_smile:


I’m sorry to still be bothering you, as this is probably not related to the original discussion, but maybe it will ring a bell about Akka behavior. While testing our changes we did get a single 504 error from the ELB, but an unusual one. Here are the details:

The proxy got a request and forwarded it to a service which failed to respond in time (the idle timeout is 60 sec, the same as for the ELB). The proxy, in turn, failed to recognize the connection being closed by the ELB.

HTTP perspective:
The proxy (#1) receives a request (packets 801367-801370 on the screenshot), doesn’t get a timely response and sends out its own HTTP response with status code 504 and a “Connection: close” header (packet number 812904).

TCP perspective:
While waiting for a response from the end server, it got a “FIN, ACK” packet from the ELB (812899) and responded with an “ACK”. However, after 500 ms it started to send out its own 504 response followed by a “FIN, ACK” packet (812905). As a result: 2 RST packets from the ELB, as if the connection was already closed.

Could it be some kind of bad timing since both timeouts are the same (even with 500ms response delay from akka…)?

Could it be some kind of bad timing since both timeouts are the same (even with 500ms response delay from akka…)?

Yes, I’d say it’s kind of a race condition. In any case, as the ELB made the first step, nothing the proxy could have done would have changed the result.

I guess there might be some room for improvement here in akka-http. Given the 500 ms delay, it looks like akka-http should have seen the connection closure earlier; I guess since it didn’t expect it, it might not have handled it promptly for some reason.

Thank you for all your help @jrudolph!