[zeromq-dev] Load balancing and fault tolerance

Brian Granger ellisonbg at gmail.com
Tue Jul 13 09:41:20 CEST 2010


Hi,

One of the core parts of a system I am building with zeromq is a task
distribution system.  The system has 1 master process and N workers.
I would like to use XREP/XREQ sockets for this, but am running into
two problems:

1.  The load balancing algorithm (round robin) of the XREQ socket is
not efficient for certain types of loads.  I think the recent
discussions on other load balancing algorithms could help in this
regard.

2.  Fault tolerance.  If a worker goes down, I need to know about it
and be able to requeue any tasks that went down with the worker.  To
monitor the health of workers, we have implemented a heartbeat
mechanism that works great.  BUT, with the current interface, I don't
have any way of discovering which tasks were on the worker that went
down.  This is because the routing information (which client gets a
message) is not exposed in the API.

Because of this second limitation, we are having to look at crazy
things like using a PAIR socket for each worker.  This is quite a pain
because I can't use the 0MQ load balancing and I have to manage all of
those PAIR sockets (1 for each worker).  With lots of workers this
doesn't scale very well.

Any thought on how to make XREQ/XREP more fault tolerant?

Cheers,

Brian

-- 
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu
ellisonbg at gmail.com



More information about the zeromq-dev mailing list