Get a fast reduce function of a big stream

sigurd · February 5, 2020, 2:25pm

Hey guys,

Ich have a big stream of strings (1M,10M,100M)
and want to make a reduce.

If I use following very easy solution, then I need much memory but it is quite fast (1M needs 100ms on my machine)

 def simpleLoop:String = {
      val sb = new StringBuilder
      for (i <- 1 to end){
        if (sb.isEmpty){
          sb.append(i.toString)
        }
        else{
          sb.append("." +i.toString)
        }
      }
      sb.toString()
  }

If I use a very simple solution with akka streams, then it is very slow.

 Source(List.range( 1,1_000_000))
      .map(x => x.toString)
      .runFold("")((x,y) => concat(x,y,"."))

  def concat(s1:String,s2:String,d:String) = if (s1 == "") s2 else s1 + d + s2

Of course I can group that, then I need about 3s for 1M on my machine.

 Source(List.range( 1,1_000_000))
      .map(x => x.toString)
      .grouped(1000)
      .mapAsync(1000)(l => Source(l).runFold("")((x,y) => concat(x,y,".")))
      .runFold("")((x,y) => concat(x,y,"."))

  def concat(s1:String,s2:String,d:String) = if (s1 == "") s2 else s1 + d + s2

But again thats 30x slower. Is there any trick to make it better?

Best
Sigurd

jrudolph · February 6, 2020, 10:45am

Hi @sigurd,

the question is why you would like to use streams for that particular use case. This seems like a pretty basic operation that wouldn’t gain anything from using streams. Streams introduce some overhead and if the operations itself are very simple that overhead might be significant. In particular, operations that go from memory to memory will likely not see any benefit from using streams.

Also the comparison is somewhat unfair because you use a single mutable StringBuilder (with amortized linear append cost and memory allocations) in the simple loop but use immutable String in the streams versions (probably more like O(n^2) append costs and memory allocations because of all the string copies involved).

Hope that explains what you are seeing.

sigurd · February 6, 2020, 11:07am

A use case would be to write big files (csv or so) but also a line could be very big.
Let say you have 10M incoming entries. And you have to group a subset of them, so in output you have only 1M but some of them contains a whole group of maybe 10k per line.
The question is: is ist better to create one line as a synchronous process (just a string builder) or with a stream (just a fold).

jrudolph · February 6, 2020, 12:29pm

The main question is if your workload is a streaming workload, i.e. can you process the input one by one, so that you can benefit from not loading all of the data into memory. In that case, akka-stream probably applies.

Concatenating data in memory is not a streaming use case
Writing data to a file sequentially is a streaming use case
Grouping data one by one element without doing aggregation per group and writing these out somewhere is a streaming use case
Grouping data but then doing an aggregation that needs all data per group requires all data in memory, so streaming won’t help a lot.
Sorting data is generally not a streaming use case

Hard to say from your description if your grouping pattern would benefit from streaming or not. It sounds as if it could work if you write one file per group?

sigurd · February 10, 2020, 10:03am

Thank you very much for this answer.
I want to know whether there is a better solution
but it makes totally sense what you explain.

In the end I need one file (however it is stored). It is possible to store it in chunks and read it chunk by chunk if its requested (by a HTTP call).

Topic		Replies	Views
Most efficient way to process data with akka-steams Akka Streams & Alpakka akka	9	2352	March 16, 2018
GroupWhile on Akka Streams? Akka Streams & Alpakka streams	7	2627	January 9, 2020
Aggregating a million records using Akka Streams Akka Streams & Alpakka	1	1013	August 7, 2019
Executing a flow on each element in a list as part of a batch operation Akka Libraries kafka , alpakka , streams	0	414	October 27, 2021
How Do I Pause consumption from kafka source when using groupBy/groupedWithin Akka Streams & Alpakka	4	663	July 15, 2023

Get a fast reduce function of a big stream

Related topics