[zeromq-dev] Router socket reconnection failure

Justin Karneges justin at affinix.com
Tue Dec 16 23:17:14 CET 2014


Hi Andre,

On Tue, Dec 16, 2014, at 07:14 AM, Andre Caron wrote:
> The issue I'm having is with this sequence:
> - P1 and P2 discover each other through D;
> - P1 connects to P2 and P2 waits for a connection from P1 (direction is
> determined by lexicographical ordering of identities, which both peers
> have prior to connecting);
> - Peers exchange heartbeats for a while;
> - I forcibly crash P2;
> - P1 eventually detects that P2 is unresponsive and explicitly
> disconnects;
> - after this happens, I restart P2;
> - P1 and P2 discover each other through D again;
> - P1 tries to connect to P2 and P2 expects a connection from P1;
> - both peers send heartbeats, but neither peer receives the other's
> messages and it appears the connection is never established.
> 
> Also note that after this has happened, context termination hangs despite
> closing the (only) socket and setting the linger to 1 second.
> 
> If I crash P1 instead of P2, the reconnection is successful.  Also, if
> after the error sequence above I crash P1, peers reconnect successfully.

This is a known issue, and I reported it earlier this year:
http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025202.html

I believe the problem is that once a connector's queue learns the
identity of the peer at a remote address, that binding sticks for life.
Restarting P1 works because connectors maintain queues even when there
are no connections, but binders don't, so P2 (the binder) holds no
leftover state when P1 comes back.
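
As a rough illustration of that asymmetry, here is a toy model in plain
Python. It is not libzmq code and makes no claim about how libzmq is
actually structured: the Connector and Binder classes, their methods and
the endpoint address are all made up, and they only mimic the behaviour
I described above.

```python
# Toy model of the hypothesis above. NOT libzmq code: the classes, methods
# and the address are hypothetical and only mimic the described behaviour.

class Connector:
    """Connecting side: keeps one queue per remote address for life."""

    def __init__(self):
        self.queues = {}  # address -> {"identity": ..., "connected": ...}

    def connect(self, address):
        # An existing queue (and its learned identity) is reused as-is.
        queue = self.queues.setdefault(
            address, {"identity": None, "connected": False})
        queue["connected"] = True

    def learn_identity(self, address, identity):
        queue = self.queues[address]
        if queue["identity"] is None:  # the first identity learned sticks
            queue["identity"] = identity

    def connection_lost(self, address):
        # Only the connection goes away; queue and identity remain.
        self.queues[address]["connected"] = False


class Binder:
    """Binding side: a queue exists only while its connection exists."""

    def __init__(self):
        self.queues = {}  # peer identity -> {"connected": True}

    def accept(self, identity):
        self.queues[identity] = {"connected": True}

    def connection_lost(self, identity):
        self.queues.pop(identity, None)  # queue is dropped with the connection


ADDR = "tcp://p2.example:5555"  # hypothetical endpoint of P2

# P1 connects to P2, then the connection goes away on both sides.
p1, p2 = Connector(), Binder()
p1.connect(ADDR)
p1.learn_identity(ADDR, "P2")
p2.accept("P1")
p1.connection_lost(ADDR)
p2.connection_lost("P1")

# If P2 (binder) was the one that crashed, P1 keeps running with this state:
print(p1.queues)  # {'tcp://p2.example:5555': {'identity': 'P2', 'connected': False}}
# If P1 (connector) was the one that crashed, P2 keeps running with this state:
print(p2.queues)  # {}
```

In this model a restarted P2 meets a P1 that still carries the old queue
and identity binding, while a restarted P1 meets a P2 with nothing left
over, which matches the behaviour you're seeing.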

Unfortunately I haven't had time yet to look at a fix.

Justin


