I have a question about the Alpakka S3 connector; basically I just want to make sure I understand the problem we are currently facing. We have clients that upload hundreds of files to S3 and then send a download link to whoever needs to view them. The download link gathers all the files from S3 (or they can be downloaded individually), zips them, and streams the download to the user. We use Play to serve the download requests.
The problem is that max-open-requests seems to overflow, so users can't download the files and get a Network Error. We increased max-open-requests to 128, which solves it unless 2 users try to download at the same time (whether they download the same or different files doesn't seem to matter).
I've tracked the download request to here: https://github.com/akka/alpakka/blob/master/s3/src/main/scala/akka/stream/alpakka/s3/impl/S3Stream.scala#L447. I think what is happening is that this connector uses the cachedHostConnectionPool, which means that no matter how many different incoming connections are made, it will limit the outgoing connections to 1? So even if we increase the host-connection-pool max-connections, it will not make a difference? Are these assumptions correct, and is there anything we can do to work around this issue?
Libraries used:
Play 2.6.11
"com.lightbend.akka" %% "akka-stream-alpakka-s3" % "0.18"
One gotcha when using Http().singleRequest(...) is that it is very easy to overflow the underlying connection pool and get the infamous max-open-requests exceeded exception. This usually happens when Http().singleRequest(...) is called repeatedly without waiting on the returned Future, so nothing slows the caller down.
Ideally you should model the logic that downloads files from S3 as an Akka Stream and use mapAsync(parallelism = ...) when calling the S3 Alpakka APIs (or use the stream APIs from the beginning if those are available). That way you control the maximum number of outgoing requests in flight at any given time by adjusting the parallelism parameter.
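A minimal sketch of that idea, assuming Akka 2.5 (the version Play 2.6 ships with). `fetchObject` is a hypothetical stand-in for whatever call actually retrieves one object from S3; it is not an Alpakka API, and the parallelism of 4 is only an example value.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.collection.immutable
import scala.concurrent.Future

object BoundedDownloads {
  implicit val system: ActorSystem = ActorSystem("bounded-downloads")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Hypothetical placeholder: replace with the real S3 download call.
  def fetchObject(key: String): Future[ByteString] =
    Future.successful(ByteString(s"contents of $key"))

  // At most 4 calls to fetchObject are in flight at any time;
  // results are emitted in the same order as the input keys.
  def downloadAll(keys: immutable.Seq[String]): Future[immutable.Seq[ByteString]] =
    Source(keys)
      .mapAsync(parallelism = 4)(fetchObject)
      .runWith(Sink.seq)
}
```

Calling `BoundedDownloads.downloadAll(keys)` once per incoming request bounds the outgoing requests per stream, not across all streams, so concurrent callers still add up.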
@2m I think I understand what you are saying. I just want to understand why that would make an impact on this because I still see a bottleneck on the connection pool.
Let's say we have the following scenario (which is common for us):
100 Documents in an attachment.
6 Users start downloading documents at the same time.
In that scenario we have mapAsync(parallelism = 32), max-open-requests = 128 and max-connections = 16. So each incoming user request would produce at most 32 outgoing requests. With 6 users that is 32 * 6 = 192 > 128, so all the requests beyond 128 would fail, right? Should we just bump max-open-requests because max-connections is not a respected configuration variable? Do we decrease parallelism much further (to 4, maybe)? Would changing the Alpakka connector to use connection-level requests help? I understand that establishing a connection is expensive, but it is less expensive than losing requests entirely. I could be looking at this entirely wrong as well; I would love to hear your thoughts.
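For reference, the numbers in this scenario would look roughly like the fragment below. The two keys are the standard Akka HTTP host-connection-pool settings; where they go (for example Play's application.conf) is an assumption about the setup.

```
akka.http.host-connection-pool {
  # Connections the pool may open to a single host.
  max-connections = 16

  # Requests that may be open (queued or in flight) per pool.
  # Must be a power of 2.
  max-open-requests = 128
}

# Worst case with one stream per user:
#   6 users * mapAsync(parallelism = 32) = 192 requests > 128
```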
My suggestion would be to use the same stream, with mapAsync(parallelism = 32), across all users. Then it is guaranteed that the stream will never exceed the max-open-requests limit. Also, if more users come and try to download, their requests will all queue up in the stream and be backpressured until there are free spots in the mapAsync operator.
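One way to share a single stream between all users is to feed it through a queue, roughly as sketched below (again assuming Akka 2.5). `fetchObject` is a hypothetical placeholder for the real S3 call, and the buffer size of 1024 is an arbitrary example. This sketch uses OverflowStrategy.dropNew, so once the buffer is full new requests are rejected instead of waiting.

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy, QueueOfferResult}
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.util.ByteString
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

object SharedDownloader {
  implicit val system: ActorSystem = ActorSystem("shared-downloader")
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical placeholder: replace with the real S3 download call.
  def fetchObject(key: String): Future[ByteString] =
    Future.successful(ByteString(s"contents of $key"))

  // One long-running stream for the whole application. Whatever the
  // number of users, at most 32 fetchObject calls are in flight.
  private val queue =
    Source
      .queue[(String, Promise[ByteString])](1024, OverflowStrategy.dropNew)
      .mapAsync(parallelism = 32) { case (key, promise) =>
        // Complete the caller's promise; never fail the shared stream.
        fetchObject(key)
          .map(bytes => promise.success(bytes))
          .recover { case NonFatal(e) => promise.failure(e) }
      }
      .toMat(Sink.ignore)(Keep.left)
      .run()

  // Called from every request handler; the result arrives via the promise.
  def download(key: String): Future[ByteString] = {
    val promise = Promise[ByteString]()
    queue.offer(key -> promise).flatMap {
      case QueueOfferResult.Enqueued => promise.future
      case other =>
        Future.failed(new RuntimeException(s"Could not enqueue $key: $other"))
    }
  }
}
```

Each request handler then calls `SharedDownloader.download(key)` instead of starting its own stream, so the limit of 32 applies to the whole application.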