[zeromq-dev] error handler for load-balancing global exchange

Martin Sustrik sustrik at fastmq.com
Thu Mar 12 22:32:52 CET 2009


Hi Aamir,

> We have a compute cluster that solves a certain mathematical problem.
> We send work request messages to the cluster nodes and get results
> back. I was thinking of using a load-balancing 0MQ exchange in order
> to distribute the work requests across the cluster.
> 
> This works without knowing how many nodes are connected to the
> load-balancing exchange, but of course we do need to know how many
> nodes are in our HPC cluster since that determines how fast we can get
> results out of the system. And when a node goes down, it is important
> for us to know which node failed so that we can try to revive it.
> That's why I was hoping that the load-balancing global exchange itself
> could tell me how many nodes are connected, who the nodes are (IP
> address), and also which node failed in the event of a disconnection.

I would rather use a watchdog process to restart a particular node of the 
cluster. The exchange cannot really be aware of whether a node is running 
or not - it can only know whether the connection to the node is alive or 
broken. Basing the restart policy on network availability can cause some 
really nasty problems, so I would rather avoid it.
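A watchdog along those lines can be sketched in Python (a hypothetical stand-in, not 0MQ code): supervise the node as a local child process and restart it whenever it exits, so the restart policy depends on process liveness rather than on network availability.

```python
import subprocess
import sys
import time

def watchdog(cmd, max_restarts, delay=0.1):
    """Run `cmd` and restart it each time it exits, up to `max_restarts` runs."""
    runs = 0
    while runs < max_restarts:
        proc = subprocess.Popen(cmd)
        runs += 1
        proc.wait()        # blocks until the node process dies
        time.sleep(delay)  # brief back-off before reviving it
    return runs

# Simulate a node that exits immediately; the watchdog revives it three times.
runs = watchdog([sys.executable, "-c", "pass"], max_restarts=3)
print(runs)  # 3
```

In a real deployment `cmd` would launch the cluster-node binary, and the watchdog would run on the same machine as the node it supervises.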

> But I have found the load-balancing exchange does not really work very
> well. My calculation was much slower as soon as I switched to the
> load-balancing 0MQ exchange (instead of doing round-robin load
> balancing by myself). I noticed that every once in a while the
> load-balancing exchange would send messages to only ONE queue (and
> always the SAME queue). This means that every few seconds, the
> load on our cluster becomes completely unbalanced as all work is sent
> to only one server while the rest remain idle. I don't know if this is
> a bug in the library or a problem with how we are using it (all of the
> nodes receive messages on local queues using a blocking call to the
> 0MQ API, so it is surprising that this is happening). I will try to
> find some way to reproduce this problem in sample code that can be
> posted here.
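For comparison, the manual round-robin distribution Aamir describes doing himself can be sketched as follows (illustrative only; the names are hypothetical and this does not use the 0MQ API). Each task goes to the next node in a fixed rotation, so no node ever receives all the work while the others sit idle:

```python
from itertools import cycle

def round_robin_assign(tasks, nodes):
    """Assign each task to the next node in a fixed rotation."""
    rotation = cycle(nodes)
    return [(task, next(rotation)) for task in tasks]

# Four tasks over two nodes alternate strictly.
print(round_robin_assign(["t1", "t2", "t3", "t4"], ["A", "B"]))
# [('t1', 'A'), ('t2', 'B'), ('t3', 'A'), ('t4', 'B')]
```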

Aamir, this is a pretty serious bug. We'll run some tests ourselves, but 
any help with reproducing it would be highly appreciated. Once we are 
able to reproduce it, we'll fix it as a matter of priority.

Thanks in advance!
Martin
