[zeromq-dev] Help with PUB/SUB reconnect after PUB host reboot

Pawel Osiczko p.osiczko at tetrapyloctomy.org
Wed May 21 20:26:13 CEST 2014


Dear all,

I’m a ZMQ newbie trying to learn the ropes. Having gone over the Guide, I’m trying to test the recovery behavior of a very simple PUB/SUB setup, with the publisher hosted on one machine and the subscriber on another.

In fact, the example I’m working with is almost identical to the ‘weather station’ scenario from the Guide, with minor tweaks to allow observation of all the incoming data. I observe that shutting down and restarting the publisher process results in the subscriber resuming receipt of the data sent by the PUB process. If, however, I reboot the host where the publisher lives, I get very inconsistent behavior. The sequence of events is as follows:

1. Start the subscriber on the command line on the subscriber host.
2. Start the publisher on the command line on the publisher host.
3. Observe everything working correctly, with the subscriber receiving the data.
4. Reboot the host where the publisher is running.
5. Observe that the subscriber does not receive data.
6. Verify that the publisher process is running on the newly rebooted host.
7. Observe that the subscriber still does not receive data.

The publisher is started from an init script with 'nohup $EXEC > $LOGFILE 2>&1 &', where EXEC points to the publisher executable. The firewall on the publisher machine is turned off.

The confusing part is that the subscriber fails to receive data after roughly 9 out of 10 reboots. Occasionally the subscriber does receive data post-reboot, i.e. recovery successfully takes place.

Running the publisher post-reboot from the command line gives slightly better statistics, i.e. the subscriber fails to receive data in 6 out of 10 reboots or so. Naturally, restarting the subscriber fixes everything, so one could conceivably implement a heartbeat that reconnects the subscriber to ‘fix’ this issue. That does not answer the question of why a host reboot breaks SUB recovery while a restart of the SUB/PUB process(es) recovers just fine. Any ideas why that would be the case? Should I be implementing a heartbeat thread on the SUB host that verifies the PUB host is reachable?
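
To make the heartbeat idea concrete, here is a minimal sketch of what I have in mind, written with pyzmq rather than taken from the pastebin code. It treats prolonged silence as a lost publisher and replaces the SUB socket. It assumes the publisher sends at least one message within each timeout window (true for the weather-station publisher, which sends continuously); the names SilenceWatchdog and run_subscriber and the default timeouts are made up for illustration.

```python
import time


class SilenceWatchdog:
    """Tracks how long it has been since the last message arrived."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def message_received(self):
        # Call this every time the subscriber gets a message.
        self.last_seen = self.clock()

    def expired(self):
        # True once the silence has lasted longer than the timeout.
        return self.clock() - self.last_seen > self.timeout_s


def run_subscriber(endpoint, timeout_s=5.0, poll_ms=1000):
    """Receive forever; recreate the SUB socket after prolonged silence."""
    import zmq  # pyzmq; imported here so the watchdog above works without it

    ctx = zmq.Context.instance()

    def connect():
        sock = ctx.socket(zmq.SUB)
        sock.setsockopt(zmq.LINGER, 0)       # do not block on close
        sock.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
        sock.connect(endpoint)
        return sock

    sock = connect()
    watchdog = SilenceWatchdog(timeout_s)
    while True:
        if sock.poll(poll_ms):  # nonzero when a message is ready
            print(sock.recv())
            watchdog.message_received()
        elif watchdog.expired():
            # Assumption: silence means the publisher connection is dead,
            # so drop the old socket and connect a fresh one.
            sock.close()
            sock = connect()
            watchdog = SilenceWatchdog(timeout_s)
```

It would be called as run_subscriber("tcp://publisher-host:5556"), where the host name and port are placeholders. This only works around the symptom; it does not explain why the reboot case behaves differently from a process restart.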

Subscriber code available here: http://pastebin.com/Z79ckVS1
Publisher code available here: http://pastebin.com/fKDZQ4qV
ZMQ version: 4.0.4
OS: CentOS 6.5 (Pub) / OS X (Sub)

Thank you!

Pawel



