[zeromq-dev] router socket hangs on write (was detecting dead MDP workers)

Luca Boccassi luca.boccassi at gmail.com
Sat Feb 18 19:32:45 CET 2017


On Fri, 2017-02-17 at 10:53 +0100, Gyorgy Szekely wrote:
> Hi,
> Sorry for spamming the list :( I will rate limit myself.
> 
> I reviewed the docs for ZMQ_ROUTER_MANDATORY and it's clear now that the
> router socket may block if the message can be routed but HWM is reached and
> ZMQ_DONTWAIT is not specified. This is the exact code path my application
> blocks in.
> 
> The problem is that HWM is not reached in my case. zmq::router_t::xsend()
> checks HWM with zmq::pipe_t::check_write(), which returns false, not because
> HWM is reached but because the pipe state is
> zmq::pipe_t::waiting_for_delimiter.
> 
> Summary:
> I don't think it's reasonable for zmq::router_t::xsend() to return -1
> EAGAIN when the corresponding pipe is being terminated. It's obvious that
> the message can't be sent in the future, so there's no point in retrying.
> 
> (For the time being, as a workaround I specify ZMQ_DONTWAIT on the send,
> and I consider the worker dead on either EHOSTUNREACH or EAGAIN.)
> 
> What's your opinion on this?
> 
> 
> Regards,
>   Gyorgy

Is the pipe terminated when the underlying socket is disconnected? I
can't remember and I'd have to double check, but if that's the case
then it could come back, so EAGAIN would be appropriate, right?

Also the check_write just returns true/false, and given it's in the hot
path I'd be wary of overloading it to cater for a single corner case.
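
To make the trade-off concrete, here is a minimal model of the situation
described in the quoted mails. It is a sketch, not libzmq code: ModelPipe,
PipeState and mandatory_send_errno are made-up names, and the HWM check is
reduced to a simple counter comparison. It only shows that a boolean
check_write() gives the caller the same answer for "queue full" and "pipe
being terminated", so both end up as EAGAIN.

```cpp
#include <cerrno>
#include <cstddef>

// Simplified stand-in for a pipe; NOT libzmq's zmq::pipe_t.
enum class PipeState { Active, WaitingForDelimiter };

struct ModelPipe {
    PipeState state = PipeState::Active;
    std::size_t queued = 0;   // messages currently queued (assumed counter)
    std::size_t hwm = 1000;   // high water mark (assumed value)

    // Answers only yes/no: the caller cannot tell "full" from "terminating".
    bool check_write() const {
        if (state != PipeState::Active)
            return false;
        return queued < hwm;
    }
};

// Models the decision of a mandatory router send. Returns 0 if the message
// could be written, otherwise the errno value the caller would observe.
int mandatory_send_errno(const ModelPipe* pipe) {
    if (pipe == nullptr)
        return EHOSTUNREACH;  // no pipe known for this identity
    if (!pipe->check_write())
        return EAGAIN;        // full OR terminating, indistinguishable here
    return 0;
}
```

Telling the two cases apart would need either a richer return value from the
check or a second state query on the failure path only, which would keep the
extra work out of the hot path.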

> On Thu, Feb 16, 2017 at 10:44 PM, Gyorgy Szekely <hoditohod at gmail.com>
> wrote:
> 
> > Hi,
> > I dug a bit deeper, here are my findings:
> > - removing the on/off switching of the ZMQ_ROUTER_MANDATORY flag and
> > enabling it before the router socket bind makes no difference
> > - removing the monitor trigger and heartbeating the workers periodically
> > (every 2.5 sec) drastically reduces the occurrence rate: the program hangs
> > after 3-4 hours instead of seconds (in the background a worker
> > connects/disconnects with a 4 second period)
> > 
> > From this I suspect the issue appears in a small time frame close to the
> > monitor event, but is otherwise hard to hit.
> > 
> > With GDB I see the following:
> > - in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
> > should not happen since ZMQ_DONTWAIT is not specified.
> > - because ZMQ_DONTWAIT is not specified, the function does not return -1
> > but blocks (see trace in prev mail).
> > 
> > - inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
> > check_write() on it returns false
> > - the if(mandatory) check in this block (router.cpp:218) returns -1 with
> > EAGAIN
> > - a similar block 10 lines below returns -1 with EHOSTUNREACH
> > 
> > Should both if(mandatory) checks return EHOSTUNREACH? There's also a
> > comment in the header for bool mandatory saying that it will report EAGAIN,
> > but this contradicts the documentation.
> > 
> > Can you help to clarify?
> > 
> > 
> > Regards,
> >   Gyorgy
> > 
> > 
> > > On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely
> > > <hoditohod at gmail.com> wrote:
> > 
> > > Hi,
> > > > Continuing my journey on detecting dead workers, I reduced the design
> > > > to the minimum and eliminated the messy file descriptors.
> > > I only have:
> > > - a router socket, with some number of peers
> > > - a monitor socket attached to the router socket
> > > 
> > > When the monitor detects a disconnect on the router socket:
> > > - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> > > - send heartbeat message to every known peer
> > > - if EHOSTUNREACH returned: remove the peer
> > > - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
> > > 
> > > > What happens: _my app regularly hangs_ in zmq_msg_send(), in roughly
> > > > 20% of the invocations. The call never returns; I have to kill the
> > > > application.
> > > > 
> > > > What am I doing wrong??? According to the RFCs, router sockets should
> > > > never block.
> > > > I attached a full stacktrace with info locals and args for each
> > > > relevant frame (sorry for the machine readable format).
> > > 
> > > Env:
> > > libzmq 4.2.1 stable, debug build
> > > Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
> > > 
> > > Regards,
> > >   Gyorgy
> > > 
> > > 
> 
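
For reference, the dead-worker sweep from the oldest quoted mail, combined
with the ZMQ_DONTWAIT workaround from the most recent one, could look roughly
like the sketch below. The names sweep_dead_peers and TrySend are hypothetical
and not part of any library. The send itself is injected as a callback that is
assumed to perform one non-blocking heartbeat send on a ROUTER socket with
ZMQ_ROUTER_MANDATORY set, and to return 0 on success or the errno value on
failure.

```cpp
#include <cerrno>
#include <functional>
#include <string>
#include <vector>

// Hypothetical hook: tries a non-blocking heartbeat send to one peer and
// returns 0 on success or the errno value of the failed send.
using TrySend = std::function<int(const std::string& peer_id)>;

// Drops every peer whose heartbeat send fails with EHOSTUNREACH or EAGAIN.
// Returns the identities that were removed; `peers` keeps the survivors.
std::vector<std::string> sweep_dead_peers(std::vector<std::string>& peers,
                                          const TrySend& try_send) {
    std::vector<std::string> removed;
    std::vector<std::string> alive;
    for (const auto& id : peers) {
        const int err = try_send(id);
        if (err == EHOSTUNREACH || err == EAGAIN)
            removed.push_back(id);  // treated as dead, per the workaround
        else
            alive.push_back(id);    // success or an unrelated error
    }
    peers.swap(alive);
    return removed;
}
```

Note that treating EAGAIN as "dead" also drops a peer that is alive but has
genuinely reached its HWM, so this is only safe when heartbeats are small and
infrequent enough that a full queue is itself a sign of trouble.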