[zeromq-dev] Destroying 0MQ context gets indefinitely stuck/hangs despite linger=0

Tomas Krajca tomas at repositpower.com
Wed May 10 07:21:44 CEST 2017


Hi Luca, and thanks for your reply.

 > Note that these are two well-known anti-patterns. The context is
 > intended to be shared and be unique in an application, and live for as
 > long as the process does, and the sockets are meant to be long lived as
 > well.
 >
 > I would recommend refactoring and, at the very least, use a single
 > context for the duration of your application.
 >

I always thought that having separate contexts was safer. I will 
refactor the application to use one context for all the clients/sockets 
and see if it makes any difference.
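
Concretely, I'm thinking of something along these lines (a minimal 
sketch; zmq.Context.instance() is pyzmq's process-wide singleton):

    import zmq

    # One context for the whole process; Context.instance() returns
    # the same singleton on every call, from any thread.
    ctx = zmq.Context.instance()

    def make_client_socket():
        # Sockets are still created per client/thread (sockets are
        # not thread-safe), but they all hang off the shared context.
        sock = ctx.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)
        return sock

    # The context would then be terminated once, at process shutdown,
    # instead of once per task.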

I wonder if that's going to eliminate the initial problem though. If the 
sockets really do get stuck/into an inconsistent state somehow, then I 
imagine they will just "leak" and stay in that context forever, possibly 
preventing the app from terminating properly.

The client is usually long lived, for as long as the app lives, but 
this particular app is a bit special in that its separate tasks just 
use the clients to fetch some data in a standardized way, do their 
computation and exit. These tasks are periodically spawned by Celery.
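
For reference, each task currently does roughly this (a simplified 
sketch; the endpoint is a placeholder and in the real app the keys 
come from config rather than being generated on the fly):

    import zmq

    SERVER_PUBLIC_KEY = b"<server-public-z85-key>"   # placeholder

    def fetch_data(request):
        # Current (anti-pattern) lifecycle: fresh context and socket
        # per task.
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)   # drop pending msgs on close

        # CURVE client setup; throwaway keypair just for this sketch.
        public, secret = zmq.curve_keypair()
        sock.curve_publickey = public
        sock.curve_secretkey = secret
        sock.curve_serverkey = SERVER_PUBLIC_KEY
        sock.connect("tcp://server:5555")            # placeholder

        sock.send(request)
        reply = None
        if sock.poll(20000):             # the 20s timeout
            reply = sock.recv()
        sock.close()                     # always succeeds
        ctx.term()                       # this is what occasionally hangs
        return reply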

> Date: Mon, 08 May 2017 11:58:42 +0100
> From: Luca Boccassi <luca.boccassi at gmail.com>
> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> Cc: "developers at repositpower.com" <developers at repositpower.com>
> Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
> 	stuck/hangs despite linger=0
>
> On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
>> Hi all,
>>
>> I have come across a weird/bad bug, I believe.
>>
>> I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6
>> and CentOS 7.
>>
>> The application is a celery worker that runs 16 worker threads. Each
>> worker thread instantiates a 0MQ-based client, gets data and then
>> closes this client. The 0MQ-based client creates its own 0MQ context
>> and terminates it on exit. Nothing is shared between the threads or
>> clients; every client processes only one request and then it's fully
>> terminated.
>>
>> The client itself is a REQ socket which uses CURVE authentication to
>> authenticate with a ROUTER socket on the server side. The REQ socket
>> has linger=0. Almost always, the REQ socket issues a request, gets
>> back a response, closes the socket and destroys its context, and all
>> is good. Once every one or two days though, the REQ socket times out
>> waiting for the response from the ROUTER server; it then successfully
>> closes the socket but hangs indefinitely when it goes on to destroy
>> the context.
>
> Note that these are two well-known anti-patterns. The context is
> intended to be shared and be unique in an application, and live for as
> long as the process does, and the sockets are meant to be long lived as
> well.
>
> I would recommend refactoring and, at the very least, use a single
> context for the duration of your application.
>
>> This runs in a data center on a 1Gb/s LAN, so the responses usually
>> finish in under a second; the timeout is 20s. My theory is that the
>> socket gets into a weird state and that's why it times out and
>> blocks the context termination.
>>
>> I ran a tcpdump and it turns out that the REQ client successfully
>> authenticates with the ROUTER server but then goes completely
>> silent for those 20-odd seconds.
>>
>> Here is a tcpdump capture of a stuck REQ client -
>> https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a normal
>> communication - https://pastebin.com/qCi1jTp0. This is a full
>> backtrace (after a SIGABRT signal to the stuck application) -
>> https://pastebin.com/jHdZS4VU
>>
>> Here is ulimit:
>>
>> [root@auhwbesap001 tomask]# cat /proc/311/limits
>> Limit                     Soft Limit           Hard Limit           Units
>> Max cpu time              unlimited            unlimited            seconds
>> Max file size             unlimited            unlimited            bytes
>> Max data size             unlimited            unlimited            bytes
>> Max stack size            8388608              unlimited            bytes
>> Max core file size        0                    unlimited            bytes
>> Max resident set          unlimited            unlimited            bytes
>> Max processes             31141                31141                processes
>> Max open files            8196                 8196                 files
>> Max locked memory         65536                65536                bytes
>> Max address space         unlimited            unlimited            bytes
>> Max file locks            unlimited            unlimited            locks
>> Max pending signals       31141                31141                signals
>> Max msgqueue size         819200               819200               bytes
>> Max nice priority         0                    0
>> Max realtime priority     0                    0
>> Max realtime timeout      unlimited            unlimited            us
>>
>>
>> The application doesn't seem to hit any of the limits; it usually
>> hovers between 100 and 200 open file handles.
>>
>> I tried to swap the REQ socket for a DEALER socket but that didn't
>> help; the context eventually hung as well.
>>
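
(For reference, the DEALER variant looks roughly like this - a sketch 
with the CURVE options omitted; a DEALER talking to a ROUTER that 
expects REQ-style framing has to add and strip the empty delimiter 
frame that REQ manages implicitly:)

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)             # instead of zmq.REQ
    sock.setsockopt(zmq.LINGER, 0)
    sock.connect("tcp://server:5555")         # placeholder endpoint

    sock.send_multipart([b"", b"request"])    # empty delimiter + payload
    if sock.poll(20000):
        _empty, reply = sock.recv_multipart() # strip the delimiter
    sock.close()
    ctx.term()                                # still hung eventually
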
>> I also tried to set ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to
>> 100ms but the context still eventually hung.
>>
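
(For completeness, that looks roughly like this in pyzmq - ZMQ_BLOCKY 
is a context option (libzmq 4.2+, if I read the docs right) and 
ZMQ_HANDSHAKE_IVL a per-socket option:)

    import zmq

    ctx = zmq.Context()
    # BLOCKY=0 makes ctx.term() behave as if every socket had linger=0.
    ctx.set(zmq.BLOCKY, 0)

    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)
    # Abort an unfinished (e.g. CURVE) handshake after 100ms instead
    # of the 30s default.
    sock.setsockopt(zmq.HANDSHAKE_IVL, 100)
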
>> I looked into the C++ code of libzmq but would need some guidance to
>> troubleshoot this as I am primarily a python programmer.
>>
>> I think we had a similar issue back in 2014 -
>> https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html
>> From memory, the tcpdump capture also showed the client/REQ going
>> silent after the successful initial CURVE authentication, but at
>> that time the server/ROUTER application was crashing with an
>> assertion.
>>
>> I am happy to do any more debugging.
>>
>> Thanks in advance for any help/pointers.


Tomas Krajca
Software architect
m.  02 6162 0277
e.   tomas at repositpower.com


