[zeromq-dev] Load balancing and fault tolerance
Brian Granger
ellisonbg at gmail.com
Tue Jul 13 09:41:20 CEST 2010
Hi,
One of the core parts of a system I am building with zeromq is a task
distribution system. The system has 1 master process and N workers.
I would like to use XREP/XREQ sockets for this, but am running into
two problems:
1. The load balancing algorithm (round robin) of the XREQ socket is
not efficient for certain types of loads. I think the recent
discussions on other load balancing algorithms could help in this
regard.
2. Fault tolerance. If a worker goes down, I need to know about it
and be able to requeue any tasks that went down with the worker. To
monitor the health of workers, we have implemented a heartbeat
mechanism that works great. BUT, with the current interface, I don't
have any way of discovering which tasks were on the worker that went
down. This is because the routing information (which client gets a
message) is not exposed in the API.
Because of this second limitation, we are having to look at crazy
things like using a PAIR socket for each worker. This is quite a pain
because I can't use the 0MQ load balancing and I have to manage all of
those PAIR sockets (1 for each worker). With lots of workers this
doesn't scale very well.
Any thought on how to make XREQ/XREP more fault tolerant?
Cheers,
Brian
--
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu
ellisonbg at gmail.com
More information about the zeromq-dev
mailing list