I’m using the work pulling mechanism to implement a dynamic, distributed pool of workers. I’ve read the documentation and the relevant sample code, but while testing some failure modes I’ve encountered an unexpected behavior.
I’ve reduced my context to the smallest possible one:
- One node
- One worker: simplest possible one. It doesn’t use a child executor like the one in akka-sample-distributed-workers-scala.
- One job
I’ve modified my worker so that it always fails by throwing a RuntimeException
. Since it’s configured with the restart supervision strategy, the worker actor will always restart when it fails.
Reading the documentation this is what I expect:
- System starts
- The producer receives a
RequestNext
message for the only active worker. - I send a job to the producer, so that it’s sent to the idle worker.
- The worker receives the job but fails and doesn’t send the confirm message to the
ConsumerController
. - Since the
WorkPullingProducerController
has tracked that job as unconfirmed, it’s informed by the infrastracture that the worker it has sent that job to has been restarted, and it doesn’t have other workers to send that message to, it re-sends the unconfirmed message to the restarted worker.
And this is precisely what happens. However I’d have also expected that during this process the producer wouldn’t receive any further RequestNext
message, since there is only one worker and it’s working on something already. Instead, I see this behavior:
- Producer: receives a
RequestNext
- Producer: uses
RequestNext.sendNextTo
to send the job to the worker. - Producer: almost immediately received another
RequestNext
- Consumer: fails, then restarts
- Producer: receives another
RequestNext
- Consumer: receives the same job, fails and restarts again
- Producer: receives another
RequestNext
The producers received more and more RequestNext
messages even if there is only one worker and it does use the sendNextTo
ActorRef
only of the first one. Why is that?
For the time being I’ve modified my code so that the worker never fails, always replies with the ConsumerController.Confirmed
message and tracks the successful or failed execution elsewhere. However I’d like to really understand how the work pull pattern works in the above scenario.
Edit: I’ve also noticed that, since I keep track of job statuses (pending, in progress and completed) in an external database, and every time a job is submitted its status in the database is updated, often I see more in progress jobs than available workers (like: 40 workers and 63 in progress jobs). I think this might be related to the above behavior.