[zeromq-dev] router socket hangs on write (was detecting dead MDP workers)

Gyorgy Szekely hoditohod at gmail.com
Thu Feb 16 22:44:04 CET 2017


Hi,
I dug a bit deeper, here are my findings:
- removing the on/off switching of the ZMQ_ROUTER_MANDATORY flag and
enabling it once before the router socket bind makes no difference
- removing the monitor trigger and heartbeating the workers periodically
(every 2.5 sec) drastically reduces the occurrence rate: the program hangs
after 3-4 hours instead of seconds (in the background a worker connects and
disconnects with a 4 second period)

From this I suspect the issue appears in a small timeframe which is close
to the monitor event, but otherwise hard to hit.

With GDB I see the following:
- in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
should not happen, since ZMQ_DONTWAIT is not specified.
- because ZMQ_DONTWAIT is not specified, send() does not return -1 but
blocks (see the trace in my previous mail).

- inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
the check_write() on it returns false
- the if(mandatory) check in this block (router.cpp:218) returns with -1,
EAGAIN
- a similar block 10 lines below returns with -1, EHOSTUNREACH

Should both if(mandatory) checks return EHOSTUNREACH? There's also a
comment in the header for bool mandatory saying that it will report EAGAIN,
but this contradicts the documentation.

Can you help to clarify?


Regards,
  Gyorgy



On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely <hoditohod at gmail.com>
wrote:

> Hi,
> Continuing my journey on detecting dead workers, I reduced the design to
> the minimum and eliminated the messy file descriptors.
> I only have:
> - a router socket, with some number of peers
> - a monitor socket attached to the router socket
>
> When the monitor detects a disconnect on the router socket:
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> - send heartbeat message to every known peer
> - if EHOSTUNREACH returned: remove the peer
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
>
> What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
> the invocations. The call never returns, I have to kill the application.
>
> What am I doing wrong? According to the RFCs, router sockets should
> never block.
> I attached a full stacktrace with info locals and args for each relevant
> frame (sorry for the machine readable format).
>
> Env:
> libzmq 4.2.1 stable, debug build
> Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
>
> Regards,
>   Gyorgy
>
>
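
One way to keep the heartbeat sweep from the quoted mail from hanging would
be to send with ZMQ_DONTWAIT and decide per peer from the result. The sketch
below only covers that decision, using the standard errno values. The names
peer_action and classify_heartbeat_send are hypothetical, and treating EAGAIN
as "retry later" rather than as a dead peer is an assumption, not something
the documentation states.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome of one heartbeat send attempt. */
enum peer_action { PEER_KEEP, PEER_REMOVE, PEER_RETRY_LATER, PEER_ERROR };

/* rc and err are the return value and errno of a non-blocking send,
 * e.g. zmq_msg_send(&msg, router, ZMQ_DONTWAIT) with
 * ZMQ_ROUTER_MANDATORY enabled on the router socket. */
static enum peer_action classify_heartbeat_send(int rc, int err)
{
    assert(rc >= -1);               /* the send returns a byte count or -1 */

    if (rc >= 0)
        return PEER_KEEP;           /* message was queued for the peer */
    if (err == EHOSTUNREACH)
        return PEER_REMOVE;         /* peer unknown to the router */
    if (err == EAGAIN)
        return PEER_RETRY_LATER;    /* assumption: pipe full or closing */
    return PEER_ERROR;              /* anything else is left to the caller */
}
```

With this split the sweep never blocks, and a peer is only removed on
EHOSTUNREACH, as in the original procedure.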