[zeromq-dev] Handling network failures in parallel pipeline ventilator?

Joe Planisky joe.planisky at temboo.com
Thu Aug 23 23:01:55 CEST 2012

I'm a little stumped about how to handle network failures in a system that uses PUSH/PULL sockets as in the Parallel Pipeline case in chapter 2 of The Guide.

As in the Guide, suppose my ventilator is pushing tasks to 3 workers.  It doesn't matter which task gets pushed to which worker, but it's very important that all tasks eventually get sent to a worker.  

Everything is working fine; tasks are being load balanced to workers, workers are doing their thing and sending the results on to a sink.  Now suppose there's a network failure between the ventilator and one of the workers. Suppose the ethernet cable to one of the worker machines is unplugged.

Based on what we've seen in practice, the ventilator socket will still attempt to push some number of tasks to the now disconnected worker before realizing there's a problem.  Tasks intended for that worker start backing up, presumably in ZMQ buffers and/or in buffers in the underlying OS (Ubuntu 10.04 in our case).  Eventually, the PUSH socket figures out that something is wrong and stops trying to send additional tasks to that worker. All new tasks are then load balanced to the remaining workers.  

However, the tasks that are queued up for the disconnected worker are stuck and are never sent anywhere unless or until the original worker comes back online.  If the original worker never comes back, those tasks never get executed. (If it does come back, it gets a burst of all the backed up tasks and the PUSH socket resumes load balancing new tasks to all 3 workers.)

We'd like to prevent this backup from happening or at least minimize the number of tasks that get stuck.  We've tried setting high water marks, send and receive timeouts, and send and receive buffer sizes in ZMQ to small values (e.g. 1) hoping that it would cause the PUSH socket to notice the problem sooner, but at best we still get several dozen task messages backed up before the socket notices the problem and stops trying.  (Our task messages are small, about 520 bytes each.)

If we have to, we can deal with the same task getting sent to more than one worker on an occasional basis, but we'd like to avoid that if possible.

We're using ZMQ 2.2.0, but are also investigating the 3.2.0 release candidate.  If it matters, we're accessing ZMQ with Java using the jzmq bindings.  The underlying OS is Ubuntu 10.04.

Any suggestions for how to deal with this?


More information about the zeromq-dev mailing list