[zeromq-dev] Router socket reconnection failure

Andre Caron andre.l.caron at gmail.com
Tue Dec 16 16:14:45 CET 2014

Hi all,

I'm experimenting with a router-router setup and I'm getting a strange issue when peers reconnect.

Basically, I have three nodes, which I'll call D, P1 and P2.  The idea is that D (the directory) has a known TCP endpoint and socket identity.  P1 and P2 connect to D, register their TCP endpoint and identity, and then discover each other through D.  At this point, one of them connects to the other and they become peers.  Through heartbeating, they can successfully detect connections and disconnections of the other peer.  Because the topology is dynamic and volatile, peers explicitly disconnect when they detect that one of their peers is unresponsive.
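
The heartbeat-based detection described above could be tracked with something like the following.  This is only a sketch, not the prototype's code: the `PeerMonitor` class, its method names and the timeout are hypothetical, and no ZMQ calls are involved.

```python
import time


class PeerMonitor:
    """Track when each peer was last heard from (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # silence longer than this means "unresponsive"
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # peer identity -> time of last heartbeat

    def heartbeat(self, peer_id):
        """Record a heartbeat; return True if the peer was not known before."""
        is_new = peer_id not in self.last_seen
        self.last_seen[peer_id] = self.clock()
        return is_new

    def expire(self):
        """Forget and return the peers that have been silent for too long."""
        now = self.clock()
        dead = [p for p, t in self.last_seen.items() if now - t > self.timeout_s]
        for p in dead:
            del self.last_seen[p]
        return dead
```

A peer would call `heartbeat()` for every message it receives and `expire()` periodically, explicitly disconnecting from each identity that `expire()` returns.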

So far, my prototype implementations of programs for D and P* are working as intended.

The issue I'm having is with this sequence:
- P1 and P2 discover each other through D;
- P1 connects to P2 and P2 waits for a connection from P1 (direction is determined by lexicographical ordering of identities, which both peers have prior to connecting);
- Peers exchange heartbeats for a while;
- I forcibly crash P2;
- P1 eventually detects that P2 is unresponsive and explicitly disconnects;
- after this happens, I restart P2;
- P1 and P2 discover each other through D again;
- P1 tries to connect to P2 and P2 expects a connection from P1;
- both peers send heartbeats, but neither peer receives the other's messages and it appears the connection is never established.
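
The direction rule used in the sequence above can be sketched as follows.  The function name is hypothetical, and the choice that the lower identity is the one that connects is an assumption (it is consistent with P1 connecting to P2, but the rule could just as well be the reverse).

```python
def should_connect(my_id: bytes, peer_id: bytes) -> bool:
    """Return True if this peer should connect, False if it should wait.

    Assumption: the peer whose identity sorts first connects to the other.
    Python compares bytes lexicographically, byte by byte.
    """
    if my_id == peer_id:
        raise ValueError("identities must be distinct")
    return my_id < peer_id
```

Because both peers evaluate the same comparison with the arguments swapped, exactly one of them connects and the other waits.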

Also note that after this has happened, context termination hangs despite closing the (only) socket and setting the linger to 1 second.

If I crash P1 instead of P2, the reconnection is successful.  Also, if after the error sequence above I crash P1, peers reconnect successfully.

As far as I can tell, the problem seems to be that a sequence of zmq_connect(), zmq_disconnect() and zmq_connect() on the same router socket and with the same endpoint corrupts the router socket.
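
A possible starting point for isolating that sequence with PyZMQ is sketched below.  It is a simplification, not the failing program: both sockets live in one process, P2 is never crashed or restarted, and the helper names, payloads and timeouts are made up for illustration.  It only checks whether a message still gets through after connect, disconnect, connect on the same ROUTER socket and endpoint.

```python
import time
import zmq


def wait_for_delivery(sender, receiver, peer_id, payload, timeout_s):
    """Resend payload until the receiver gets it or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # A ROUTER silently drops messages addressed to an identity it does
        # not know yet, so keep resending while the connection is set up.
        sender.send_multipart([peer_id, payload])
        if receiver.poll(100):  # timeout in milliseconds
            frames = receiver.recv_multipart()
            if frames[-1] == payload:  # skip stale messages from earlier
                return True
    return False


def reconnect_probe(timeout_s=1.0):
    """Return (delivered_before, delivered_after) around a reconnect."""
    ctx = zmq.Context()
    p1 = ctx.socket(zmq.ROUTER)  # connecting side
    p2 = ctx.socket(zmq.ROUTER)  # accepting side
    try:
        for sock, ident in ((p1, b"P1"), (p2, b"P2")):
            sock.setsockopt(zmq.IDENTITY, ident)  # must precede bind/connect
            sock.setsockopt(zmq.LINGER, 1000)     # 1 second, as in the report
        port = p2.bind_to_random_port("tcp://127.0.0.1")
        endpoint = "tcp://127.0.0.1:%d" % port

        p1.connect(endpoint)
        before = wait_for_delivery(p1, p2, b"P2", b"before", timeout_s)
        p1.disconnect(endpoint)
        p1.connect(endpoint)  # same socket, same endpoint
        after = wait_for_delivery(p1, p2, b"P2", b"after", timeout_s)
        return before, after
    finally:
        p1.close()
        p2.close()
        ctx.term()  # the report describes this call hanging after the failure
```

Running `reconnect_probe()` and comparing the two booleans shows whether the reconnect alone is enough to lose messages; if it is not, the crash and restart of P2 would have to be added to the repro.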

Has anyone encountered this issue before?  I'm using ZMQ 4.1.0 via the PyZMQ bindings.

I can try to put together a minimal repro if necessary.


