[zeromq-dev] Destroying 0MQ context gets indefinitely stuck/hangs despite linger=0

Tomas Krajca tomas at repositpower.com
Mon May 8 03:08:09 CEST 2017


Hi all,

I believe I have come across a nasty bug.

I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6 and
CentOS 7.

The application is a Celery worker that runs 16 worker threads. Each
worker thread instantiates a 0MQ-based client, gets data and then closes
this client. The 0MQ-based client creates its own 0MQ context and
terminates it on exit. Nothing is shared between the threads or clients;
every client processes only one request and is then fully terminated.

The client itself is a REQ socket which uses CURVE authentication to
authenticate with a ROUTER socket on the server side. The REQ socket has
linger=0. Almost always, the client sends its request, gets back the
response, closes the socket and destroys its context, and all is good.
Once every one or two days though, the REQ socket times out while waiting
for the response from the ROUTER server. It then successfully closes the
socket but hangs indefinitely when it goes on to destroy the context.
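
For reference, the client lifecycle described above can be sketched
roughly as follows. This is a simplified illustration, not the actual
application code: the function name request_once, its parameters and the
optional curve key tuple are assumptions made for this example. Only the
overall sequence (own context, REQ socket, linger=0, 20s timeout, close
the socket, terminate the context) comes from the description.

```python
import zmq


def request_once(endpoint, payload, timeout_ms=20000, curve=None):
    """One-shot REQ client: own context, one request, full teardown.

    `curve` is a hypothetical optional tuple of Z85-encoded keys:
    (server_public, client_public, client_secret).
    Returns the reply bytes, or None if no reply arrived in time.
    """
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REQ)
    try:
        # Discard any unsent messages as soon as the socket is closed.
        sock.setsockopt(zmq.LINGER, 0)
        if curve is not None:
            server_public, client_public, client_secret = curve
            sock.setsockopt(zmq.CURVE_SERVERKEY, server_public)
            sock.setsockopt(zmq.CURVE_PUBLICKEY, client_public)
            sock.setsockopt(zmq.CURVE_SECRETKEY, client_secret)
        sock.connect(endpoint)
        sock.send(payload)
        # Wait up to timeout_ms for the reply to become readable.
        if sock.poll(timeout_ms, zmq.POLLIN):
            return sock.recv()
        return None  # timed out
    finally:
        sock.close()
        # Context termination: the step that never returns in the bad case.
        ctx.term()
```

In the failing runs, the equivalent of sock.close() returns normally and
the equivalent of ctx.term() is the call that blocks forever.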

This runs in a data center on a 1Gb/s LAN, so the responses usually arrive
in under a second; the timeout is 20s. My theory is that the socket gets
into a weird state, and that is why it times out and then blocks the
context termination.

I ran tcpdump and it turns out that the REQ client successfully
authenticates with the ROUTER server but then goes completely silent
for those 20-odd seconds.

Here is a tcpdump capture of a stuck REQ client - 
https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a normal 
communication - https://pastebin.com/qCi1jTp0. This is a full backtrace 
(after SIGABRT signal to the stuck application) - 
https://pastebin.com/jHdZS4VU

Here are the limits for the process:

[root@auhwbesap001 tomask]# cat /proc/311/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31141                31141                processes
Max open files            8196                 8196                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31141                31141                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


The application doesn't seem to exceed any of these limits; it usually
hovers between 100 and 200 open file handles.
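
A quick way to watch this from inside the process is sketched below. It
assumes Linux with /proc mounted (as on CentOS), and the helper name
fd_usage is made up for this example.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    # Each entry in /proc/self/fd is one open file descriptor.
    open_fds = len(os.listdir("/proc/self/fd"))
    # The same values as the "Max open files" row of /proc/<pid>/limits.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    print("open fds: %d (soft limit %s, hard limit %s)" % (open_fds, soft, hard))
```

Logging this just before each context is destroyed would show whether a
hang coincides with an unusually high descriptor count.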

I tried swapping the REQ socket for a DEALER socket but that didn't
help; the context eventually hung as well.

I also tried setting ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to 100ms,
but the context still eventually hung.
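
In pyzmq terms, those two settings look roughly like the sketch below.
The helper name make_client_socket is an assumption for this example,
and it presumes a libzmq/pyzmq build that exposes both options.
ZMQ_BLOCKY is a context option, while ZMQ_HANDSHAKE_IVL is set per
socket.

```python
import zmq


def make_client_socket(handshake_ivl_ms=100):
    """Create a context and a REQ socket with both options applied."""
    ctx = zmq.Context()
    # Context option: it affects sockets created after this call,
    # so it is set before the socket is created.
    ctx.set(zmq.BLOCKY, 0)
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)
    # Maximum time allowed for the connection handshake, in milliseconds.
    sock.setsockopt(zmq.HANDSHAKE_IVL, handshake_ivl_ms)
    return ctx, sock


if __name__ == "__main__":
    ctx, sock = make_client_socket()
    print("handshake_ivl:", sock.getsockopt(zmq.HANDSHAKE_IVL))
    sock.close()
    ctx.term()
```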

I looked into the C++ code of libzmq but would need some guidance to
troubleshoot this, as I am primarily a Python programmer.

I think we had a similar issue back in 2014 -
https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html
From memory, the tcpdump capture then also showed the client/REQ going
silent after the successful initial CURVE authentication, but at that
time the server/ROUTER application was crashing with an assertion.

I am happy to do any more debugging.

Thanks in advance for any help/pointers.
-- 
<http://www.repositpower.com/>

*Tomas Krajca *
Software architect
m.  02 6162 0277
e.   tomas at repositpower.com


