[zeromq-dev] I'm losing messages doing 2 way heartbeat checks

MinRK benjaminrk at gmail.com
Mon May 23 06:08:57 CEST 2011


I think you might be getting caught by the fact that zmq FD events
are edge-triggered.  When you have a handler registered for a POLLIN
event, it must handle *all* available incoming messages each time it
fires.  Otherwise, if several incoming messages are ready at the same
time, you will handle only one of them, and your inbox will grow each
time this happens.
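
To illustrate what "handle all available messages" can look like with a
plain socket, here is a minimal sketch.  The drain() and demo() names, the
PAIR sockets, and the inproc endpoint are made up for the example, and it
assumes a recent pyzmq where zmq.Again exists (on older versions you would
catch zmq.ZMQError and check for errno zmq.EAGAIN instead):

```python
import zmq


def drain(sock, handle):
    """Receive until the socket has nothing left; return how many were handled."""
    count = 0
    while True:
        try:
            # Non-blocking receive: raises zmq.Again once the inbox is empty.
            frames = sock.recv_multipart(zmq.NOBLOCK)
        except zmq.Again:
            return count
        handle(frames)
        count += 1


def demo():
    # Hypothetical setup: two PAIR sockets over inproc, just for illustration.
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PAIR)
    receiver.bind("inproc://drain-demo")
    sender = ctx.socket(zmq.PAIR)
    sender.connect("inproc://drain-demo")

    # Several messages arrive before the receiver gets a chance to look.
    for i in range(3):
        sender.send_multipart([b"heartbeat", str(i).encode()])

    receiver.poll(1000)  # wait (up to 1s) until the socket is readable
    seen = []
    handled = drain(receiver, seen.append)

    sender.close()
    receiver.close()
    return handled, seen
```

A POLLIN handler that calls something like drain() empties the inbox on
every wakeup, instead of reading a single message and leaving the rest
behind.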

When you are using pyzmq's Tornado eventloop, it's easier to use the
ZMQStream wrapper instead of plain sockets.  A ZMQStream has an on_recv
method for registering a callback to be run with each incoming message,
so you don't have to write the receive loop yourself.

short doc here: http://zeromq.github.com/pyzmq/eventloop.html
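
As a rough sketch of the on_recv pattern (this is not the tweaked code
mentioned below, only an illustration): the run_demo name, the PAIR
sockets, and the inproc endpoint are invented for the example, and it
assumes a recent pyzmq running on top of tornado's asyncio-based loop.

```python
import asyncio

import zmq
from zmq.eventloop.zmqstream import ZMQStream


async def run_demo(n=3):
    # Hypothetical setup: two PAIR sockets over inproc, just for illustration.
    ctx = zmq.Context.instance()
    receiver = ctx.socket(zmq.PAIR)
    receiver.bind("inproc://zmqstream-demo")
    sender = ctx.socket(zmq.PAIR)
    sender.connect("inproc://zmqstream-demo")

    received = []
    done = asyncio.Event()

    def on_message(frames):
        # Called once per incoming multipart message, with a list of frames.
        received.append(frames)
        if len(received) == n:
            done.set()

    # Created inside the running loop, so the stream attaches to it.
    stream = ZMQStream(receiver)
    stream.on_recv(on_message)

    # Send several messages back to back; each one should reach the callback.
    for i in range(n):
        sender.send_multipart([b"pong", str(i).encode()])

    await asyncio.wait_for(done.wait(), timeout=5)

    stream.close()  # also closes the wrapped receiver socket
    sender.close()
    return received
```

Running it with asyncio.run(run_demo()) should deliver every message to
the callback, even though they were all sent at once.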

I've tweaked your code to use the ZMQStream, and it looks like there
aren't any missing heartbeats anymore:


On Sun, May 22, 2011 at 20:30, Joseph Bowman <bowman.joseph at gmail.com> wrote:
> Hello,
> I'm new to ZeroMQ. After reading about it and going through the guide, I had
> the idea of building a self configuring load balancing application using the
> Least Recently Used methodology demonstrated throughout the guide.
> I've started work, and I've run into my first problem that I've beat on for
> a while and I think I must be missing something, but I can't figure out what
> it is.
> Basic information:
> ZeroMQ v 2.1.7
> pyzmq v 2.1.7
> Python 2.6.1
> Operating systems tested:
> OSX on a 13in MBP (core 2 duo)
> Rackspace Ubuntu 9.10 server, 4 cores.
> Results were the same on both servers.
> The high level description is I've got 2 applications. One is the broker,
> the other is the worker. I haven't gotten to getting multiple workers
> talking to the broker yet.
> Direct links to code are:
> github repo - https://github.com/joerussbowman/Scale0
> broker code
> worker code
> To run it start scale0.py, then start test_worker.py. Each server keeps a
> list of pings/heartbeats it sends. The pings/heartbeats include timestamps.
> When they get a response they delete that timestamp from the list. If you
> run them you will see that the lists start growing on both ends. It's like
> it switches off with one heartbeat working for a while, then the other.
> The only other thing I can think of to note is I am using the eventloop.
> I've done my best to add in some comments to the code to make it easier to
> understand what's going on, though I'm sure it could be commented better.
> More detailed description follows.
> Broker has 1 XREP socket open.
> Worker has 1 XREP socket and 1 XREQ socket open.
> Both send heartbeat messages to each other, expecting a response to validate
> the other is alive. Broker sends ping, expects pong. Worker sends heartbeat,
> expects heartbeatreply.
> Workers connect to the Broker and send a ready. The Broker adds the worker
> to the LRU. The Worker also starts sending heartbeat requests. The Broker
> every second goes through its LRU queue and sends a ping to each worker in
> the queue. This is done by creating a socket and sending it to the
> connection the worker informed the broker it has available.
> If anyone could get me pointed in the right direction to not get the
> heartbeat misses, I'd be grateful.
