[zeromq-dev] zeromq-dev Digest, Vol 14, Issue 7

Tomas Krajca tomas at repositpower.com
Mon May 15 03:57:32 CEST 2017


Hi Luca,

Having a single/shared context didn't help. As soon as the REQ client
timed out, 0MQ seemed to get confused and started leaking file handles.
It ended up with hundreds of open [eventfd] file descriptors.

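For the record, I count those descriptors with something like this - a
minimal Linux-only sketch reading /proc, nothing pyzmq-specific:

    import os

    def count_eventfds(pid="self"):
        """Count open eventfd file descriptors of a process (Linux only)."""
        fd_dir = "/proc/{0}/fd".format(pid)
        count = 0
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd went away between listdir() and readlink()
            # leaked descriptors show up as "anon_inode:[eventfd]"
            if "eventfd" in target:
                count += 1
        return count

    print(count_eventfds())  # or pass a worker's PID instead of "self"
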
I am not sure if it's an issue with the reaper. My feeling is that the
core issue is the REQ client going silent after successfully
establishing the CURVE authentication. I have no idea whether 0MQ hits
some system limit or whether there is a bug of some sort, but that's
the odd part: a successful CURVE handshake/authentication followed by
complete silence.

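One thing I might try on the client: ZMQ_REQ_RELAXED plus
ZMQ_REQ_CORRELATE, so a timed-out socket can retry instead of staying
wedged. A minimal sketch (both options exist since libzmq 4.0, so 4.1.6
has them; the endpoint is made up):

    import zmq

    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.REQ_RELAXED, 1)    # allow resending after a lost reply
    sock.setsockopt(zmq.REQ_CORRELATE, 1)  # drop stale replies to old requests
    sock.setsockopt(zmq.RCVTIMEO, 20000)   # 20s receive timeout
    sock.setsockopt(zmq.LINGER, 0)
    sock.connect("tcp://server:5555")      # made-up endpoint

    sock.send(b"request")
    try:
        reply = sock.recv()
    except zmq.Again:
        # the peer went silent; with REQ_RELAXED the socket can simply
        # send again instead of having to be closed and recreated
        sock.send(b"request")
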
For now, I've got a cron job that restarts stuck workers, so it's not
that urgent/critical. Anyway, I have some time to do a bit more digging
and testing, but I don't quite know where to start.

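One idea for the digging: guard the termination itself, so a worker can
at least detect that it is stuck rather than relying on cron. A small
sketch (the helper name is my own, assuming pyzmq):

    import threading
    import zmq

    def term_with_timeout(ctx, timeout=5.0):
        """Run ctx.term() in a daemon thread; return True if it finished."""
        t = threading.Thread(target=ctx.term)
        t.daemon = True   # a stuck term() won't keep the process alive
        t.start()
        t.join(timeout)
        return not t.is_alive()

    ctx = zmq.Context.instance()
    # ... do the work ...
    if not term_with_timeout(ctx):
        print("context termination is stuck; flag this worker for restart")
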
Thanks,
Tomas

> Date: Thu, 11 May 2017 11:38:35 +0100
> From: Luca Boccassi <luca.boccassi at gmail.com>
> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely stuck/hangs despite linger=0
>
> On Wed, 2017-05-10 at 15:21 +1000, Tomas Krajca wrote:
>> Hi Luca and thanks for your reply.
>>
>>   > Note that these are two well-known anti-patterns. The context is
>>   > intended to be shared and be unique in an application, and live
>>   > for as long as the process does, and the sockets are meant to be
>>   > long lived as well.
>>   >
>>   > I would recommend refactoring and, at the very least, use a single
>>   > context for the duration of your application.
>>   >
>>
>> I always thought that having a separate context per client was safer.
>> I will refactor the application to use one context for all the
>> clients/sockets and see if it makes any difference.
>>
>> I wonder if that's going to eliminate the initial problem though. If
>> the sockets really do get stuck in an inconsistent state, then I
>> imagine they will just "leak" and stay in that context forever,
>> possibly preventing the app from terminating properly.
>
> There could be an unknown race with the reaper. It should help in that
> case.
>
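For reference, this is roughly the single-context pattern I'm
refactoring towards (a minimal sketch assuming pyzmq; the endpoint and
the function name are made up):

    import zmq

    # One process-wide context, shared by every worker thread;
    # zmq.Context.instance() lazily creates a singleton.
    ctx = zmq.Context.instance()

    def fetch(endpoint, request, timeout_ms=20000):
        """Short-lived REQ socket per task, on the shared context."""
        sock = ctx.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)             # don't block close on pending msgs
        sock.setsockopt(zmq.RCVTIMEO, timeout_ms)  # recv raises zmq.Again on timeout
        sock.connect(endpoint)
        try:
            sock.send(request)
            return sock.recv()
        finally:
            sock.close()

    print(fetch("tcp://server:5555", b"get-data"))  # made-up endpoint

Only the sockets come and go; the context lives as long as the process.
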
>> The client is usually long lived, for as long as the app lives, but
>> this particular app is a bit special: the separate tasks just use the
>> clients to fetch some data in a standardized way, do their
>> computation and exit. These tasks are periodically spawned by Celery.
>>
>>> Date: Mon, 08 May 2017 11:58:42 +0100
>>> From: Luca Boccassi <luca.boccassi at gmail.com>
>>> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
>>> Cc: "developers at repositpower.com" <developers at repositpower.com>
>>> Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely stuck/hangs despite linger=0
>>>
>>> On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
>>>> Hi all,
>>>>
>>>> I believe I have come across a weird/bad bug.
>>>>
>>>> I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6
>>>> and CentOS 7.
>>>>
>>>> The application is a Celery worker that runs 16 worker threads.
>>>> Each worker thread instantiates a 0MQ-based client, gets data and
>>>> then closes this client. The 0MQ-based client creates its own 0MQ
>>>> context and terminates it on exit. Nothing is shared between the
>>>> threads or clients; every client processes only one request and is
>>>> then fully terminated.
>>>>
>>>> The client itself is a REQ socket which uses CURVE authentication
>>>> to authenticate with a ROUTER socket on the server side. The REQ
>>>> socket has linger=0. Almost always, the REQ socket issues a
>>>> request, gets back a response, closes the socket, destroys its
>>>> context, and all is good. Once every one or two days though, the
>>>> REQ socket times out while waiting for the response from the
>>>> ROUTER server; it then successfully closes the socket but hangs
>>>> indefinitely when it goes on to destroy the context.
>>>
>>> Note that these are two well-known anti-patterns. The context is
>>> intended to be shared and be unique in an application, and live for
>>> as long as the process does, and the sockets are meant to be long
>>> lived as well.
>>>
>>> I would recommend refactoring and, at the very least, use a single
>>> context for the duration of your application.
>>>
>>>> This runs in a data center on a 1Gb/s LAN, so the responses
>>>> usually finish in under a second; the timeout is 20s. My theory is
>>>> that the socket gets into a weird state and that's why it times
>>>> out and blocks the context termination.
>>>>
>>>> I ran a tcpdump and it turns out that the REQ client successfully
>>>> authenticates with the ROUTER server but then goes completely
>>>> silent for those 20-odd seconds.
>>>>
>>>> Here is a tcpdump capture of a stuck REQ client:
>>>> https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a
>>>> normal communication: https://pastebin.com/qCi1jTp0. And here is a
>>>> full backtrace (after a SIGABRT signal to the stuck application):
>>>> https://pastebin.com/jHdZS4VU
>>>>
>>>> Here is ulimit:
>>>>
>>>> [root at auhwbesap001 tomask]# cat /proc/311/limits
>>>> Limit                     Soft Limit           Hard Limit           Units
>>>> Max cpu time              unlimited            unlimited            seconds
>>>> Max file size             unlimited            unlimited            bytes
>>>> Max data size             unlimited            unlimited            bytes
>>>> Max stack size            8388608              unlimited            bytes
>>>> Max core file size        0                    unlimited            bytes
>>>> Max resident set          unlimited            unlimited            bytes
>>>> Max processes             31141                31141                processes
>>>> Max open files            8196                 8196                 files
>>>> Max locked memory         65536                65536                bytes
>>>> Max address space         unlimited            unlimited            bytes
>>>> Max file locks            unlimited            unlimited            locks
>>>> Max pending signals       31141                31141                signals
>>>> Max msgqueue size         819200               819200               bytes
>>>> Max nice priority         0                    0
>>>> Max realtime priority     0                    0
>>>> Max realtime timeout      unlimited            unlimited            us
>>>>
>>>>
>>>> The application doesn't seem to exceed any of the limits; it
>>>> usually hovers between 100 and 200 open file handles.
>>>>
>>>> I tried to swap the REQ socket for a DEALER socket but that didn't
>>>> help; the context eventually hung as well.
>>>>
>>>> I also tried to set ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to
>>>> 100ms, but the context still eventually hung.
>>>>
>>>> I looked into the C++ code of libzmq but would need some guidance
>>>> to troubleshoot this, as I am primarily a Python programmer.
>>>>
>>>> I think we had a similar issue back in 2014:
>>>> https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html
>>>> From memory, the tcpdump capture also showed the client/REQ going
>>>> silent after the successful initial CURVE authentication, but at
>>>> that time the server/ROUTER application was crashing with an
>>>> assertion.
>>>>
>>>> I am happy to do any more debugging.
>>>>
>>>> Thanks in advance for any help/pointers.

-- 
Tomas Krajca
Software architect
m. 02 6162 0277
e. tomas at repositpower.com


