[zeromq-dev] error handler for load-balancing global exchange
Aamir M
aamirjvm at gmail.com
Thu Mar 12 21:41:39 CET 2009
Hi Martin,
We have a compute cluster that solves a certain mathematical problem.
We send work request messages to the cluster nodes and get results
back. I was thinking of using a load-balancing 0MQ exchange in order
to distribute the work requests across the cluster.
This works without knowing how many nodes are connected to the
load-balancing exchange, but of course we do need to know how many
nodes are in our HPC cluster since that determines how fast we can get
results out of the system. And when a node goes down, it is important
for us to know which node failed so that we can try to revive it.
That's why I was hoping that the load-balancing global exchange itself
could tell me how many nodes are connected, who the nodes are (IP
address), and also which node failed in the event of a disconnection.
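
As a rough illustration of the bookkeeping described above (how many
nodes are connected, who they are, and which one failed), here is a
minimal sketch in plain Python. It does not use any 0MQ API. It assumes
each node periodically reports in with a heartbeat carrying its IP
address, which is an assumption added for illustration and not something
the exchange provides. The names NodeRegistry, heartbeat, connected and
reap_failed are hypothetical.

```python
import time


class NodeRegistry:
    """Tracks worker nodes by address using heartbeats (hypothetical helper)."""

    def __init__(self, timeout_s=5.0):
        # Assumption: a node is considered failed after timeout_s of silence.
        self.timeout_s = timeout_s
        self.last_seen = {}  # address -> time of the most recent heartbeat

    def heartbeat(self, address, now=None):
        """Record that the node at `address` is alive."""
        self.last_seen[address] = time.monotonic() if now is None else now

    def connected(self):
        """Return the sorted addresses of nodes currently considered alive."""
        return sorted(self.last_seen)

    def reap_failed(self, now=None):
        """Remove and return the addresses whose heartbeat has timed out."""
        now = time.monotonic() if now is None else now
        failed = sorted(
            addr for addr, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
        for addr in failed:
            del self.last_seen[addr]
        return failed
```

With something like this, len(registry.connected()) gives the node
count, and the list returned by reap_failed() names the nodes to revive.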
But I have found the load-balancing exchange does not really work very
well. My calculation was much slower as soon as I switched to the
load-balancing 0MQ exchange (instead of doing round-robin load
balancing by myself). I noticed that every once in a while the
load-balancing exchange would send messages to only ONE queue (and
always the SAME queue). This means that every few seconds, the
load on our cluster becomes completely unbalanced as all work is sent
to only one server while the rest remain idle. I don't know if this is
a bug in the library or a problem with how we are using it (all of the
nodes receive messages on local queues using a blocking call to the
0MQ API, so it is surprising that this is happening). I will try to
find some way to reproduce this problem in sample code that can be posted
here.
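
For reference, the round-robin balancing mentioned above, together with
a simple check for the kind of skew described, can be sketched as
follows. This is plain Python under stated assumptions, not the actual
cluster code and not the 0MQ exchange's algorithm. The function names
round_robin_assign and most_loaded_share are hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin_assign(requests, nodes):
    """Pair each request with a node, cycling through the nodes in order."""
    return list(zip(requests, cycle(nodes)))


def most_loaded_share(assignments):
    """Return (node, fraction) for the node that received the most requests."""
    if not assignments:
        raise ValueError("no assignments to inspect")
    counts = Counter(node for _, node in assignments)
    node, count = counts.most_common(1)[0]
    return node, count / len(assignments)
```

With N nodes and even distribution the reported fraction stays close to
1/N. A fraction near 1.0 over a window of recent messages would indicate
that all work is going to a single queue.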
Thanks,
Aamir
On Thu, Mar 12, 2009 at 10:06 AM, Martin Sustrik <sustrik at fastmq.com> wrote:
> Hi,
>
>> I use a global load-balancing exchange to send data to a cluster of
>> client systems. So a single server process creates a global
>> load-balancing exchange, and multiple client processes bind local
>> queues to the exchange. I am assuming that this is how it is intended
>> to be used. The biggest issue I'm experiencing with this setup is that
>> if a client process disconnects, the server process has no way of
>> knowing WHICH client process disconnected. The server process can
>> register an error handler, but the error handler only gets the name of
>> the local object (the global exchange) and knows nothing about which
>> client failed. A less serious issue is that (without adding additional
>> ZeroMQ wirings) the server process also has no way of knowing how many
>> clients are currently connected to the load-balancing global exchange
>> ... how does one keep track of the resources that are being
>> load-balanced without such features?
>
> Once again, it would be useful to understand your use case. One of the main
> features of messaging middleware is that message producers are decoupled
> from message consumers (think of email - you don't care who of the billions
> of email users is connected at the moment - what you care about is getting
> messages addressed to you). Thus, the application actually SHOULDN'T know
> whether the peer(s) are connecting/disconnecting or how many peers there are
> connected at the moment.
>
> Martin
>