Now we need to transform the given source into a Source[Map[String, Seq[String]], NotUsed], where the key of the map is the first value of each line of the CSV file. Once we have this, we need to save the values of each key-value pair in a separate file (actually it needs to be saved to S3), where the name of the file needs to be in the following format:
This is not as trivial as it should be in an ideal world.
So you want to (the first two steps are sketched right after this list):
parse the CSV into some data representation
group the lines by the first field
write each group to a separate CSV file
name the files with a naming convention
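A rough sketch of the first two steps, assuming the Alpakka CSV connector's CsvParsing.lineScanner and a made-up input.csv; the 200 is just an assumed cap on the number of distinct keys:

import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.alpakka.csv.scaladsl.CsvParsing
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("csv-split") // reused by the later sketches

// one List[ByteString] per CSV line, columns already split
val lines: Source[List[ByteString], _] =
  FileIO.fromPath(Paths.get("input.csv"))
    .via(CsvParsing.lineScanner())

// group by the first column; 200 caps the number of concurrently open substreams
val grouped = lines.groupBy(200, _.head.utf8String)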
The problems are:
you can't write to a file inside a stream with a name given by the data (or at least it's not easy)
if you want to name the file after the number of lines, then you either need to buffer all the lines in memory (which kills the whole streaming idea), or you need to name the file after you have finished writing to it.
Instead of mergeSubstreams I would close all the branches with a file sink using a randomly generated filename, roughly like the sketch below.
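One possible shape for that first stream (a sketch only, reusing the lines source and implicit system from the earlier snippet; Sink.lazyFutureSink is the Akka 2.6+ API, and /tmp/csv-groups is a made-up working directory):

import java.nio.file.{Files, Paths}
import java.util.UUID
import scala.concurrent.Future
import akka.stream.scaladsl.{FileIO, Sink}
import akka.util.ByteString

val tmpDir = Files.createDirectories(Paths.get("/tmp/csv-groups"))

val firstStream =
  lines                                   // parsed CSV lines from the earlier sketch
    .groupBy(200, _.head.utf8String)
    .map(cols => ByteString(cols.map(_.utf8String).mkString(",") + "\n"))
    .to(Sink.lazyFutureSink { () =>
      // the factory runs once per substream, so every group gets its own random file
      Future.successful(FileIO.toPath(tmpDir.resolve(UUID.randomUUID().toString)))
    })

firstStream.run() // runs until the upstream file is fully read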
After the first stream has finished I would start a new stream.
second stream (file naming)
get the filenames from the directory that you used as the output dir of the previous stream, as a Source[String]
open each file
get the first line's first field and count the lines (statefulMapConcat could do this)
rename the files with the gathered information (see the sketch after this list)
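A sketch of this second pass, assuming the Alpakka file connector's Directory.ls, the tmpDir and implicit system from the sketches above, and a made-up "<key>-<lineCount>.csv" target name standing in for the real naming format; it uses a plain runFold per file instead of statefulMapConcat for brevity:

import java.nio.file.Files
import akka.stream.alpakka.file.scaladsl.Directory
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString
import system.dispatcher // ExecutionContext for mapping over the per-file Futures

val secondStream =
  Directory.ls(tmpDir)                 // Source[Path, NotUsed] of the randomly named files
    .mapAsync(parallelism = 4) { path =>
      FileIO.fromPath(path)
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 4096, allowTruncation = true))
        .runFold((Option.empty[String], 0)) { case ((key, count), line) =>
          // remember the first line's first field, count every line
          (key.orElse(Some(line.utf8String.split(",").head)), count + 1)
        }
        .map { case (key, count) =>
          Files.move(path, path.resolveSibling(s"${key.getOrElse("empty")}-$count.csv"))
        }
    }
    .runWith(Sink.ignore)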
The problem with this method is that when you receive the "stream ended" signal in the first stream, the file descriptors have not necessarily been released at the end of the stream, so the next phase does not necessarily see the complete files… You need to wait a bit between the end of the first stream and the start of the second. (This is why I said it's the hacky solution.)
If you have this solution you could try a better one. The main problem in this scenario is that we don't know when the file sinks have finished. You could duct-tape a solution with a custom file sink which is not a "sink" but a flow that passes its materialized value downstream when the upstream has finished. If you could build this stage, then you could use the pre-built second stream with a mergeSubstreams and a map, and you could easily call the uploadToS3 at the end. This is "duct tape" because you need to place this class in the right package hierarchy to reach internal functions, but it is totally doable. Maybe worth a PR to the main library or to the Alpakka file connector.
For the file naming, one trick you could potentially use is to combine groupBy with a lazy sink. It's a bit messy and relies on a deprecated factory method with no replacement, but I think it should work:
import java.nio.file.Paths
import scala.concurrent.Future
import akka.NotUsed
import akka.stream.IOResult
import akka.stream.scaladsl.{FileIO, Flow, Sink}
import akka.util.ByteString

case class Entry(group: String, field1: String, field2: String)

val sink: Sink[Entry, NotUsed] = Flow[Entry]
  .groupBy(200, entry => entry.group)
  .to(Sink.lazyInit(
    first =>
      Future.successful(
        Flow[Entry]
          .map(entry => ByteString(entry.toString)) // make bytes out of it
          .to(FileIO.toPath(Paths.get("/somewhere", first.group)))
      ),
    () => Future.failed[IOResult](new RuntimeException("")) // won't be used so doesn't really matter
  ))
What is the alternative for doing a similar operation in Akka 2.6+? I see the documentation refers to using Sink.lazyFutureSink in combination with Flow.prefixAndTail(1), but I can't figure out how to achieve this.
The idea is to use prefixAndTail(1) to get access to the first element, extract what you need to create the sink, and put the element back into the stream to be written. For example:
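A minimal sketch of that pattern, assuming the Entry case class and the /somewhere directory from the snippet above; re-prepending the peeked element with Source(head).concat(tail) and writing each substream to its file via alsoTo is just one way to realize the idea in Akka 2.6+:

import java.nio.file.Paths
import akka.NotUsed
import akka.stream.scaladsl.{FileIO, Flow, Sink, Source}
import akka.util.ByteString

val sink26: Sink[Entry, NotUsed] =
  Flow[Entry]
    .groupBy(200, _.group)
    .prefixAndTail(1)                    // each substream becomes one (firstElements, remainder) pair
    .flatMapConcat { case (head, tail) =>
      val first = head.head              // safe: a groupBy substream always starts with an element
      Source(head)
        .concat(tail)                    // put the peeked element back in front of the rest
        .map(entry => ByteString(entry.toString)) // make bytes out of it, as above
        .alsoTo(FileIO.toPath(Paths.get("/somewhere", first.group))) // write this group's file
    }
    .to(Sink.ignore)                     // drain what alsoTo passes through

Running a Source[Entry, _] into sink26 then writes each group to /somewhere/<group>.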