I’m working on a Kafka consumer application using Akka Kafka connector.
I would like to consumer to process messages parallelly.
which consumer group should I use https://doc.akka.io/docs/alpakka-kafka/current/consumer.html?
how can I configure the parallelism on the consumer side?
You can achieve parallelism a number of different ways with Alpakka Kafka. The standard way is to run multiple applications that use the same Kafka consumer group id and topic subscription.
You may also achieve parallelism within each application by using the mapAsync operator or the “partitioned” Sources. mapAsync will give you a “threadpool”-like ability to process messages you consume. For CPU-bound processing you can should the parallelism argument of mapAsync to the number of cores available to the app, for IO-bound operations you should choose a value that works best for you (i.e. max number of connections you want to make to a networked service).
The partitioned Sources provide a way of processing messages from different partitions separately, instead of funnelling messages of all partitions into one stream. You may use the various sub stream operators here to achieve parallelism.
As for what Source to choose, it’s probably best to start with plainSource or committableSource. Use plainSource if you don’t need to commit offsets back to Kafka.
@seglo one more question.
My application is IO bound and Kafka topic is only one partition.
is it possible to parallelize the consumer to process the message process concurrently?
Alternatively, I did partitioned topic and create the same number of consumer instances to process parallelly but was wondering if I can do similar when there is only one partition?
In a 1 partition scenario consumer groups wouldn’t serve much purpose, except as fail over, because a partition can only be assigned to one consumer group member at a time. You may process many messages from a single partition in parallel if you like. Since it’s IO-bound you could try using larger numbers for the parallelism argument of mapAsync.
mapAsync will ensure that transformed results are in the same order as input messages, which can add extra delay (i.e. message A is still processing when message B is already complete, message B will be blocked by message A). When you want to ensure strict ordering then this is good, but if you don’t care then you could use mapAsyncUnordered which will immediately push results downstream regardless of order. Keep in mind that if you commit offsets then this could cause you to skip offsets and break at least once semantics on failure/restarts, so in those cases it’s best to maintain the ordering.