[zeromq-dev] zmq_poll ignoring some (types of) events when busy?

John D. Mitchell jdmitchell at gmail.com
Wed May 18 09:44:40 CEST 2011

Hi folks,

As you can tell, I'm investigating zmq and so I've been going through the Guide and the myriad examples to learn. I'm testing this on a MBP running OS X 10.6.7 using the latest zmq v2.1.7 from github.

In http://zguide.zeromq.org/page:all#Node-Coordination , it talks about the syncpub & syncsub. At the end it mentions:
A more robust model could be:
	• Publisher opens PUB socket and starts sending "Hello" messages (not data).
	• Subscribers connect SUB socket and when they receive a Hello message they tell the publisher via a REQ/REP socket pair.
	• When the publisher has had all the necessary confirmations, it starts to send real data.

So I made various versions of those examples to try out different features. E.g., I was able to add in the above using the blocking send/recv. So far so good.

Then, I made my way up to hacking in using zmq_poll() in syncpub to explore the non-blocking support. I purposefully kept the syncsub code using the blocking send/recv. The code is: https://gist.github.com/978103

The code basically works. If you start up to 10 subscribers, they will do the dance, ingest the 1M messages, and everybody will die peacefully.

However, if I start more than 10 subscribers, I start to see surprising behavior... If I start, say, 15 all together, they can behave as expected. But if I start new subscribers after the first batch has started receiving the 1M messages, the latecomers end up hung, waiting for the publisher to notice they exist at all (in the recv at syncsub4.c, line 23).

Steps to reproduce:
in terminal 1:
% syncpub3
in terminal 2:
% syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; 
% syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; 
...wait for the publisher to print "Switching to Spew mode..." and then start up some more subscribers...
% syncsub4 &; syncsub4 &; syncsub4 &;
% syncsub4 &; syncsub4 &; syncsub4 &;

The later a subscriber comes in, the more likely it is to hang in that first recv() call.

I've tried various experiments (and fixed a few bugs in my code/understanding :-), but at the moment it comes down to this: by adding the call to usleep() at line 92 of syncpub3.c, I can tune how easy or hard the problem is to reproduce. That is, with no usleep() there at all, the problem always happens; as I lengthen the delay, it shows up less and less. With the usleep(10) that's in the gist, I see it saturate all 8 cores on this box and then reliably hang everybody who joins after that.

The observed behavior seems to be that new connections are *not* completed while the publisher is busy spewing as fast as it can, so the latecomers are essentially ignored until/unless there's a gap for them to sneak through.

Is this something missing in how zmq handles its internal scheduling fairness, is this an issue in the kernel itself, or am I missing something blatantly obvious?

