[zeromq-dev] router socket hangs on write (was detecting dead MDP workers)

Gyorgy Szekely hoditohod at gmail.com
Thu Feb 16 22:44:04 CET 2017

I dug a bit deeper, here are my findings:
- removing the on/off switching for the ZMQ_ROUTER_MANDATORY flag, and
enabling it before the router socket bind: makes no difference
- removing the monitor trigger and heartbeating the workers periodically
(every 2.5 sec) drastically reduces the occurrence rate: the program hangs
after 3-4 hours instead of seconds. (In the background a worker
connects/disconnects with a 4-second period.)

From this I suspect the issue appears in a small timeframe which is close
to the monitor event, but otherwise hard to hit.

With GDB I see the following:
- in zmq::socket_base_t::send() the call to xsend() returns -1 with EAGAIN.
This should not happen since ZMQ_DONTWAIT is not specified.
- because ZMQ_DONTWAIT is not specified, send() does not return -1 to the
caller but blocks (see trace in prev mail).

- inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
the check_write() on it returns false
- the if(mandatory) check in this block (router.cpp:218) returns with -1, EAGAIN
- a similar block 10 lines below returns with -1, EHOSTUNREACH
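
To make the two branches easier to reason about, here is a minimal model of the behaviour described above. It is not libzmq code: peer_state_t, model_xsend(), model_send() and the -2 "still blocked" return value are names and conventions made up for this illustration, and the retry loop is bounded only so that the model terminates. It shows why EAGAIN from the first if(mandatory) block turns into a hang when ZMQ_DONTWAIT is not set, while EHOSTUNREACH is reported to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for one routing-table lookup; not a libzmq type. */
typedef struct {
    bool found;     /* identity is present in the outpipes map */
    bool writable;  /* what check_write() would report for the pipe */
} peer_state_t;

/* Models the routing decision only. Returns 0 when the message would be
   queued or silently dropped, or -1 with errno set. */
static int model_xsend(peer_state_t peer, bool mandatory)
{
    if (peer.found) {
        if (!peer.writable && mandatory) {
            errno = EAGAIN;          /* first if(mandatory) block */
            return -1;
        }
        return 0;                    /* queued, or dropped if not writable */
    }
    if (mandatory) {
        errno = EHOSTUNREACH;        /* second if(mandatory) block */
        return -1;
    }
    return 0;                        /* unknown peer: silently dropped */
}

/* Models the caller: without dontwait, EAGAIN means "wait and retry".
   max_retries bounds the loop so the model terminates; the function
   returns -2 when a real blocking send would still be waiting. */
static int model_send(peer_state_t peer, bool mandatory, bool dontwait,
                      int max_retries)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        int rc = model_xsend(peer, mandatory);
        if (rc == 0)
            return 0;
        if (errno != EAGAIN || dontwait)
            return -1;               /* error is reported to the caller */
        /* A real socket would block here until the pipe changes state. */
    }
    return -2;                       /* still blocked: the observed hang */
}
```

In this model, a peer that is found but not writable gives -2 from model_send() when mandatory is on and dontwait is off, and -1 with EAGAIN when dontwait is on. A peer that is not found gives -1 with EHOSTUNREACH in both cases.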

Should both if(mandatory) checks return EHOSTUNREACH? There's also a
comment in the header for bool mandatory saying that it will report EAGAIN,
but this contradicts the documentation.

Can you help to clarify?



On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely <hoditohod at gmail.com> wrote:

> Hi,
> Continuing my journey on detecting dead workers, I reduced the design to
> the minimum and eliminated the messy file descriptors.
> I only have:
> - a router socket, with some number of peers
> - a monitor socket attached to the router socket
> When the monitor detects a disconnect on the router socket:
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> - send heartbeat message to every known peer
> - if EHOSTUNREACH returned: remove the peer
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
> What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
> the invocations. The call never returns, I have to kill the application.
> What am I doing wrong? According to the RFCs, router sockets should
> never block.
> I attached a full stacktrace with info locals and args for each relevant
> frame (sorry for the machine readable format).
> Env:
> libzmq 4.2.1 stable, debug build
> Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
> Regards,
>   Gyorgy
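
The peer cleanup step from the quoted message can be sketched independently of the socket code. This is only an illustration under assumptions: heartbeat_fn, prune_unreachable() and fake_heartbeat() are hypothetical names, the callback is assumed to follow the zmq_send() convention of returning -1 with errno set, and keeping peers that fail with an error other than EHOSTUNREACH (for example EAGAIN) is a choice made for this sketch, not something settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: sends one heartbeat to `identity` and returns 0,
   or -1 with errno set. */
typedef int (*heartbeat_fn)(const char *identity, void *user);

/* Sends a heartbeat to every peer and compacts the array in place,
   dropping peers whose send failed with EHOSTUNREACH. Peers that fail
   with any other error are kept. Returns the number of peers left. */
static size_t prune_unreachable(const char **peers, size_t count,
                                heartbeat_fn send_heartbeat, void *user)
{
    assert(peers != NULL || count == 0);
    assert(send_heartbeat != NULL);

    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        errno = 0;
        int rc = send_heartbeat(peers[i], user);
        if (rc == -1 && errno == EHOSTUNREACH)
            continue;                /* peer is gone: drop it */
        peers[kept++] = peers[i];    /* reachable or undetermined: keep */
    }
    return kept;
}

/* Illustrative stand-in for the real send: identities that start with
   "dead" are reported as unreachable. */
static int fake_heartbeat(const char *identity, void *user)
{
    (void)user;
    if (strncmp(identity, "dead", 4) == 0) {
        errno = EHOSTUNREACH;
        return -1;
    }
    return 0;
}
```

With the peers "w1", "dead-w2" and "w3" and fake_heartbeat as the callback, prune_unreachable() leaves "w1" and "w3" in the array and returns 2. In a real application the callback would wrap the actual send on the router socket.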