[zeromq-dev] Router socket reconnection failure

André Caron andre.l.caron at gmail.com
Wed Dec 17 02:09:55 CET 2014


Hi Justin,

Thanks for the info :-)

Just read that thread, but the case seems slightly different: all my nodes
use a persistent identity, which I set immediately after creating the
socket and thus before any bind or connect operation.  However, I just
tried having P2 restart with a new identity and I get the same problem.

I'm really confused by the answers from Laurent near the end of the
thread.  It seems to me like the whole point of the identity socket option
is to send the string to the peer so that it can "resume" a session across
multiple TCP connections and/or process executions.  It also seems to me
that if it doesn't work in this scenario, then the identity's only purpose
would be debugging.  In addition, nothing I've seen so far
explains the fact that this scenario causes zmq_term() to hang forever
despite closing all sockets and setting a non-zero linger value, which is
clearly a bug.
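
A minimal sketch of the socket setup described above, written in Python
with pyzmq as an assumption (the original code is not shown in this
thread).  The identity "P1" and the endpoint are hypothetical placeholders.
The identity and linger are set right after socket creation and before
connect, and Context.term() is pyzmq's counterpart of zmq_term().  With no
peer and no queued messages this returns promptly, so it illustrates the
call order only and does not reproduce the hang.

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.ROUTER)

# Options are set immediately after creating the socket,
# before any bind or connect operation.
sock.setsockopt(zmq.IDENTITY, b"P1")   # hypothetical persistent identity
sock.setsockopt(zmq.LINGER, 1000)      # linger period in milliseconds (1 s)

# connect() is asynchronous: it returns even if nothing is listening yet.
sock.connect("tcp://127.0.0.1:5555")   # hypothetical endpoint

# Shutdown order: close every socket first, then terminate the context.
sock.close()
ctx.term()   # the step reported to hang in the failure scenario
```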

I tried playing around with my code a bit more.  Using ZMQ 4.0.5, I get the
reconnection failure.  If I switch to ZMQ 4.1.0, the peers reconnect, but
zmq_term() still hangs as soon as P1 reconnects to P2.  I don't know what
was fixed between those two releases, but something almost fixed the problem!

If it's of any help, setting the ZMQ_ROUTER_HANDOVER option to 1 doesn't
prevent zmq_term() from hanging.  This option doesn't exist in 4.0
releases, so I can't try it out there.
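
For reference, a sketch of how the option is typically enabled, again
assuming pyzmq built against libzmq 4.1 or later.  ZMQ_ROUTER_HANDOVER is
set on the ROUTER socket that accepts connections, before binding, and
tells it to hand an identity over to a new connection that presents the
same identity.  The random local port is only there to keep the example
self-contained.

```python
import zmq

ctx = zmq.Context()
router = ctx.socket(zmq.ROUTER)
router.setsockopt(zmq.LINGER, 0)

# Requires libzmq >= 4.1; the option does not exist in 4.0 releases.
router.setsockopt(zmq.ROUTER_HANDOVER, 1)

# Bind to an arbitrary free port on localhost (illustration only).
port = router.bind_to_random_port("tcp://127.0.0.1")

router.close()
ctx.term()
```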

André

On Tue, Dec 16, 2014 at 5:17 PM, Justin Karneges <justin at affinix.com> wrote:
>
> Hi Andre,
>
> On Tue, Dec 16, 2014, at 07:14 AM, Andre Caron wrote:
> > The issue I'm having is with this sequence:
> > - P1 and P2 discover each other through D;
> > - P1 connects to P2 and P2 waits for a connection from P1 (direction is
> > determined by lexicographical ordering of identities, which both peers
> > have prior to connecting);
> > - Peers exchange heartbeats for a while;
> > - I forcibly crash P2;
> > - P1 eventually detects that P2 is unresponsive and explicitly
> > disconnects;
> > - after this happens, I restart P2;
> > - P1 and P2 discover each other through D again;
> > - P1 tries to connect to P2 and P2 expects a connection from P1;
> > - both peers send heartbeats, but neither peer receives the other's
> > messages and it appears the connection is never established.
> >
> > Also note that after this has happened, context termination hangs despite
> > closing the (only) socket and setting the linger to 1 second.
> >
> > If I crash P1 instead of P2, the reconnection is successful.  Also, if
> > after the error sequence above I crash P1, peers reconnect successfully.
>
> This is a known issue, and I reported it earlier this year:
> http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025202.html
>
> I believe the problem is that once a connector queue learns the ID of a
> remote address, this binding sticks for life. The reason that you can
> restart P1 and things work is because connectors maintain queues even if
> there are no connections, but binders don't.
>
> Unfortunately I haven't had time yet to look at a fix.
>
> Justin
>
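
The connection-direction rule and the heartbeat-based failure detection in
the quoted sequence can be sketched roughly as follows.  This is an
illustration under stated assumptions, not the code from the thread: the
thread only says that lexicographical ordering of identities decides the
direction, so "smaller identity connects" is an assumption, and the names
should_connect and PeerLiveness are hypothetical.

```python
import time
from typing import Optional


def should_connect(my_id: bytes, peer_id: bytes) -> bool:
    """Return True if this peer should initiate the connection.

    Assumption: the peer with the lexicographically smaller identity
    connects and the other one waits for the connection.
    """
    if my_id == peer_id:
        raise ValueError("peers must have distinct identities")
    return my_id < peer_id


class PeerLiveness:
    """Track the last heartbeat from a peer and flag it as unresponsive."""

    def __init__(self, timeout: float, now: Optional[float] = None):
        self.timeout = timeout
        self.last_seen = time.monotonic() if now is None else now

    def heartbeat(self, now: Optional[float] = None) -> None:
        # Record that a heartbeat was just received from the peer.
        self.last_seen = time.monotonic() if now is None else now

    def is_unresponsive(self, now: Optional[float] = None) -> bool:
        # The peer is unresponsive once the silence exceeds the timeout.
        current = time.monotonic() if now is None else now
        return current - self.last_seen > self.timeout
```

Both peers evaluate should_connect() with the arguments swapped, so exactly
one of them connects.  When is_unresponsive() becomes true, the connecting
side would explicitly disconnect and wait for rediscovery.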