Ich have a big stream of strings (1M,10M,100M)
and want to make a reduce.
If I use following very easy solution, then I need much memory but it is quite fast (1M needs 100ms on my machine)
def simpleLoop:String = {
val sb = new StringBuilder
for (i <- 1 to end){
if (sb.isEmpty){
sb.append(i.toString)
}
else{
sb.append("." +i.toString)
}
}
sb.toString()
}
If I use a very simple solution with akka streams, then it is very slow.
the question is why you would like to use streams for that particular use case. This seems like a pretty basic operation that wouldn’t gain anything from using streams. Streams introduce some overhead and if the operations itself are very simple that overhead might be significant. In particular, operations that go from memory to memory will likely not see any benefit from using streams.
Also the comparison is somewhat unfair because you use a single mutable StringBuilder (with amortized linear append cost and memory allocations) in the simple loop but use immutable String in the streams versions (probably more like O(n^2) append costs and memory allocations because of all the string copies involved).
A use case would be to write big files (csv or so) but also a line could be very big.
Let say you have 10M incoming entries. And you have to group a subset of them, so in output you have only 1M but some of them contains a whole group of maybe 10k per line.
The question is: is ist better to create one line as a synchronous process (just a string builder) or with a stream (just a fold).
The main question is if your workload is a streaming workload, i.e. can you process the input one by one, so that you can benefit from not loading all of the data into memory. In that case, akka-stream probably applies.
Concatenating data in memory is not a streaming use case
Writing data to a file sequentially is a streaming use case
Grouping data one by one element without doing aggregation per group and writing these out somewhere is a streaming use case
Grouping data but then doing an aggregation that needs all data per group requires all data in memory, so streaming won’t help a lot.
Sorting data is generally not a streaming use case
Hard to say from your description if your grouping pattern would benefit from streaming or not. It sounds as if it could work if you write one file per group?