I am performing a groupBy and then a mergeSubstreams after some processing, and naturally the elements coming out of mergeSubstreams can be out of order due to the parallel processing of the substreams. Is it possible to merge the substreams while retaining the original ordering of events?
I guess it’s difficult since 1 event could be mapped to 0, 1, or n events, but any pointers would be helpful.
Yes, you are right. It is a bit tricky, and you could tackle it from multiple fronts. I am not sure exactly what you are trying to achieve, but you could switch to a form of parallelism that is a bit more under control, for example mapAsync. Its parallelism parameter controls how many Futures can be in flight at any given moment, and the emission order is managed for you.
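To illustrate: a minimal sketch of the mapAsync approach (untested; assumes Akka Streams on the classpath, and `process` is a hypothetical stand-in for whatever per-element work the substreams were doing):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object MapAsyncExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")
  import system.dispatcher

  // Hypothetical per-element work; runs on the dispatcher's thread pool.
  def process(n: Int): Future[Int] = Future { n * 2 }

  // Up to 4 Futures run concurrently, but mapAsync emits results
  // in the original upstream order regardless of completion order.
  Source(1 to 100)
    .mapAsync(parallelism = 4)(process)
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())
}
```

If ordering does not matter for some stage, mapAsyncUnordered trades the ordering guarantee for potentially better throughput.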
If that is too constraining, you could take a look at the very new, and still changing, Source/Flow with context feature, which lets you attach a context to each element that flows along behind the scenes, even through operators that duplicate or drop elements.
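A small sketch of the with-context idea (untested; `Record` and its `offset` field are hypothetical, standing in for e.g. a Kafka offset you want carried alongside each element):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object ContextExample extends App {
  implicit val system: ActorSystem = ActorSystem("example")

  // Hypothetical record carrying an offset we want to keep attached.
  final case class Record(offset: Long, payload: String)

  Source(List(Record(0L, "a"), Record(1L, "b")))
    .asSourceWithContext(_.offset) // the offset travels with each element
    .map(_.payload.toUpperCase)    // operators see only the payload...
    .asSource                      // ...and the context is reattached here
    .runWith(Sink.foreach { case (payload, offset) =>
      println(s"$payload @ $offset")
    })
}
```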
Thanks for the info. Using mapAsync is where I ended up, but I did give the Source with context a try.
Essentially what I’m trying to do is consume from Kafka (a partitioned source from alpakka-kafka), groupBy within one of these partitioned sources, and write to HDFS using alpakka-hdfs across many substreams, then merge these substreams to form a stream of OutgoingMessages. The reason I want in-order delivery is to store all the rotation events in a file with their offsets in order (for recovery).
So I’ve ended up keeping alpakka-kafka but dropping alpakka-hdfs in favour of my own writer, which is a shame.
An approach I was trying was to modify SubFlowImpl.MergeBack to use my own OrderedFlattenMerge, and to zipWithIndex on the flow before the groupBy. The OrderedFlattenMerge would only push an element if its index is lastSeen + 1, or, once all the substreams have an element buffered, push the one with the smallest index. What do you think?
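The "push only when the index is lastSeen + 1" rule can also be expressed downstream of a plain mergeSubstreams, without touching SubFlowImpl. A rough, untested sketch (the name `reorderByIndex` is mine): it assumes indices are contiguous, as zipWithIndex produces them, so it breaks down if the substreams map one event to 0 or n events, and unlike your "all substreams buffered" variant it buffers unboundedly if one index is delayed for a long time:

```scala
import akka.NotUsed
import akka.stream.scaladsl.Flow
import scala.collection.mutable

object ReorderByIndex {
  // Re-sorts an already-merged stream of (element, index) pairs back into
  // index order, emitting each element once all lower indices have arrived.
  def reorderByIndex[T]: Flow[(T, Long), T, NotUsed] =
    Flow[(T, Long)].statefulMapConcat { () =>
      var next = 0L                              // the index we expect next
      val pending = mutable.Map.empty[Long, T]   // out-of-order arrivals

      { case (elem, idx) =>
          pending(idx) = elem
          val out = mutable.ListBuffer.empty[T]
          // Drain every consecutive index starting at `next`.
          while (pending.contains(next)) {
            out += pending.remove(next).get
            next += 1
          }
          out.toList
      }
    }
}
```

Your "smallest buffered index once every substream has one" rule bounds memory at one element per substream, which this sketch does not; implementing that properly would need a custom GraphStage with access to all the inlets, which is essentially what your OrderedFlattenMerge idea describes.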