[zeromq-dev] router socket hangs on write (was detecting dead MDP workers)

Gyorgy Szekely hoditohod at gmail.com
Sun Feb 19 12:13:16 CET 2017


Hi Luca,
Unfortunately I'm not familiar with libzmq internals, so I can't judge
whether or not EAGAIN is appropriate. But as a library user I expect either:
sending a message on a mandatory router socket not to block while the queue
is below HWM
-or-
the documentation to state explicitly that ZMQ_ROUTER_MANDATORY must always
be used with ZMQ_DONTWAIT, because the socket may occasionally block
(independent of HWM).
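
A minimal sketch of what the second option means for calling code, assuming
every send on the mandatory router socket is made with ZMQ_DONTWAIT. It only
shows the errno classification used in the workaround quoted further down
(worker treated as dead on EHOSTUNREACH or EAGAIN). send_outcome and
classify_send are hypothetical names, not libzmq API; rc and err stand for
the return value of zmq_msg_send() and the errno value read right after it.

```cpp
#include <cerrno>

// Hypothetical helper types, not part of libzmq.
enum class send_outcome
{
    sent,      // message was queued
    peer_gone, // treat the worker as dead and drop it
    fatal      // any other error, left to the caller
};

// rc:  return value of zmq_msg_send (..., ZMQ_DONTWAIT)
// err: errno value observed immediately after the call
send_outcome classify_send (int rc, int err)
{
    if (rc >= 0)
        return send_outcome::sent;

    // EHOSTUNREACH: the identity is not routable (mandatory router).
    // EAGAIN: the pipe exists but is not writable. With ZMQ_DONTWAIT this
    // is also what a full queue (HWM reached) reports, so the workaround
    // cannot tell a slow worker from a disconnecting one.
    if (err == EHOSTUNREACH || err == EAGAIN)
        return send_outcome::peer_gone;

    return send_outcome::fatal;
}
```

The trade-off is that a worker whose queue is genuinely full gets dropped as
well, which is acceptable for heartbeats but may not be for other traffic.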

I attached a code example to demonstrate the problem. The send on the router
socket blocks although HWM is not reached (only 1 message in the queue), and
the socket never recovers because the pipe never returns to a state where it
accepts messages.

I agree that this is a corner case: the timeframe in which the socket may
block is really short (sending at exactly the moment the peer disconnects),
but the operation still can't be called non-blocking. The attached example
triggers the issue with a high occurrence rate by sending the message when
the monitor reports the peer disconnect, but I could also reproduce the issue
without the monitor event (at a much lower occurrence rate, of course).

With my very limited knowledge of the library internals, I would replace the
condition:
router.cpp:213    if (it != outpipes.end ())
with something like this:
router.cpp:213    if (it != outpipes.end ()
                      && it->second.pipe->check_active ())  // (out_active && state == active)
but it's probably not that simple. :)
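
To make the effect of that condition concrete, here is a toy model, not
libzmq code. toy_pipe, route_result and route are hypothetical names; the
model assumes a mandatory router and keeps only the pipe fields named in this
thread. check_active() is the helper proposed above, and check_write() follows
the behaviour described in the quoted mail below (false when HWM is reached,
but also false when the pipe is not in the active state).

```cpp
#include <map>
#include <string>

// Toy stand-in for zmq::pipe_t, reduced to what this thread discusses.
enum class pipe_state { active, waiting_for_delimiter };

struct toy_pipe
{
    pipe_state state;
    bool out_active;
    bool hwm_reached;

    // Proposed helper: is the pipe still usable at all?
    bool check_active () const
    {
        return out_active && state == pipe_state::active;
    }

    // False on HWM, but also false while the pipe is terminating.
    bool check_write () const { return check_active () && !hwm_reached; }
};

enum class route_result { sent, would_block, host_unreachable };

// Models the lookup in a mandatory router's send path.
// use_active_check = false: current condition (pipe found in the map).
// use_active_check = true:  proposed condition (found and still active).
route_result route (const std::map<std::string, toy_pipe> &outpipes,
                    const std::string &identity,
                    bool use_active_check)
{
    auto it = outpipes.find (identity);
    bool found = it != outpipes.end ();
    if (found && use_active_check)
        found = it->second.check_active ();

    if (!found)
        return route_result::host_unreachable; // EHOSTUNREACH
    if (!it->second.check_write ())
        return route_result::would_block; // EAGAIN, blocks without DONTWAIT
    return route_result::sent;
}
```

In this model a pipe in waiting_for_delimiter yields would_block with the
current condition and host_unreachable with the proposed one, while a pipe
that is active but at HWM yields would_block in both cases.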

Regards,
  Gyorgy

On Sat, Feb 18, 2017 at 7:32 PM, Luca Boccassi <luca.boccassi at gmail.com>
wrote:

> On Fri, 2017-02-17 at 10:53 +0100, Gyorgy Szekely wrote:
> > Hi,
> > Sorry for spamming the list :( I will rate limit myself.
> >
> > I reviewed the docs for ZMQ_ROUTER_MANDATORY and it's clear now that the
> > router socket may block if the message can be routed but HWM is reached
> > and ZMQ_DONTWAIT is not specified. This is the exact code path my
> > application blocks in.
> >
> > The problem is that HWM is not reached in my case.
> > zmq::router_t::xsend() checks HWM with zmq::pipe_t::check_write(), which
> > returns false, not because HWM is reached, but because the pipe state is
> > zmq::pipe_t::waiting_for_delimiter.
> >
> > Summary:
> > I don't think it's reasonable for zmq::router_t::xsend() to return -1
> > EAGAIN when the corresponding pipe is being terminated. It's obvious that
> > the message can't be sent in the future, so there's no point in retrying.
> >
> > (For the time being, as a workaround I specify ZMQ_DONTWAIT on the send,
> > and I consider the worker dead on either EHOSTUNREACH or EAGAIN.)
> >
> > What's your opinion on this?
> >
> >
> > Regards,
> >   Gyorgy
>
> Is the pipe terminated when the underlying socket is disconnected? I
> can't remember and I'd have to double check, but if that's the case
> then it could come back, so EAGAIN would be appropriate, right?
>
> Also the check_write just returns true/false, and given it's in the hot
> path I'd be wary of overloading it to cater for a single corner case.
>
> > On Thu, Feb 16, 2017 at 10:44 PM, Gyorgy Szekely <hoditohod at gmail.com>
> > wrote:
> >
> > > Hi,
> > > I dug a bit deeper, here are my findings:
> > > - removing the on/off switching of the ZMQ_ROUTER_MANDATORY flag and
> > >   enabling it before the router socket bind makes no difference
> > > - removing the monitor trigger and heartbeating the workers periodically
> > >   (every 2.5 sec) drastically reduces the occurrence rate: the program
> > >   hangs after 3-4 hours instead of seconds (in the background a worker
> > >   connects/disconnects with a 4 second period)
> > >
> > > From this I suspect the issue appears in a small timeframe close to the
> > > monitor event, but is otherwise hard to hit.
> > >
> > > With GDB I see the following:
> > > - in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
> > >   should not happen since ZMQ_DONTWAIT is not specified.
> > > - ZMQ_DONTWAIT is not specified, so the function won't return -1, but
> > >   blocks (see trace in prev mail).
> > >
> > > - inside zmq::router_t::xsend() the pipe is found in the outpipes map,
> > >   but the check_write() on it returns false
> > > - the if(mandatory) check in this block (router.cpp:218) returns with -1,
> > >   EAGAIN
> > > - a similar block 10 lines below returns with -1, EHOSTUNREACH
> > >
> > > Should both if(mandatory) checks return EHOSTUNREACH? There's also a
> > > comment in the header for bool mandatory saying that it will report
> > > EAGAIN, but this contradicts the documentation.
> > >
> > > Can you help to clarify?
> > >
> > >
> > > Regards,
> > >   Gyorgy
> > >
> > >
> > > On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely <hoditohod at gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > > Continuing my journey on detecting dead workers, I reduced the design
> > > > to the minimum and eliminated the messy file descriptors.
> > > > I only have:
> > > > - a router socket, with some number of peers
> > > > - a monitor socket attached to the router socket
> > > >
> > > > When the monitor detects a disconnect on the router socket:
> > > > - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> > > > - send heartbeat message to every known peer
> > > > - if EHOSTUNREACH returned: remove the peer
> > > > - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
> > > >
> > > > What happens: _my app regularly hangs_ in zmq_msg_send(), in roughly
> > > > 20% of the invocations. The call never returns, I have to kill the
> > > > application.
> > > >
> > > > What am I doing wrong??? According to the RFCs, router sockets should
> > > > never block.
> > > > I attached a full stacktrace with info locals and args for each
> > > > relevant frame (sorry for the machine readable format).
> > > >
> > > > Env:
> > > > libzmq 4.2.1 stable, debug build
> > > > Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
> > > >
> > > > Regards,
> > > >   Gyorgy
> > > >
> > > >
> >
> > _______________________________________________
> > zeromq-dev mailing list
> > zeromq-dev at lists.zeromq.org
> > https://lists.zeromq.org/mailman/listinfo/zeromq-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: repro.cpp
Type: text/x-c++src
Size: 3326 bytes
Desc: not available
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20170219/fef6add3/attachment.cpp>