[zeromq-dev] Destroying 0MQ context gets indefinitely stuck/hangs despite linger=0

Luca Boccassi luca.boccassi at gmail.com
Thu May 11 12:38:35 CEST 2017


On Wed, 2017-05-10 at 15:21 +1000, Tomas Krajca wrote:
> Hi Luca and thanks for your reply.
> 
>  > Note that these are two well-known anti-patterns. The context is
>  > intended to be shared and unique in an application, and to live
>  > for as long as the process does, and the sockets are meant to be
>  > long lived as well.
>  >
>  > I would recommend refactoring and, at the very least, using a
>  > single context for the duration of your application.
>  >
> 
> I always thought that having a separate context was safer. I will
> refactor the application to use one context for all the
> clients/sockets and see if it makes any difference.
> 
> I wonder if that's going to eliminate the initial problem, though. If
> the sockets really do get stuck in an inconsistent state somehow, then
> I imagine they will just "leak" and stay in that context forever,
> possibly preventing the app from terminating properly.

There could be an unknown race with the reaper thread. Using a single
context should help in that case.
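
For instance, a rough pyzmq sketch of the single-context pattern (the
names and endpoint here are just placeholders, not your actual code):

    import zmq

    # One context for the whole process, shared by every worker thread.
    # zmq.Context.instance() returns a process-wide singleton.
    CTX = zmq.Context.instance()

    def fetch(endpoint, request, timeout_ms=20000):
        # Sockets are cheap to create per task; the context is not.
        sock = CTX.socket(zmq.REQ)
        sock.setsockopt(zmq.LINGER, 0)
        sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
        try:
            sock.connect(endpoint)
            sock.send(request)           # request is bytes
            return sock.recv()           # raises zmq.Again on timeout
        finally:
            sock.close()

    # Terminate the shared context exactly once, at process exit:
    # CTX.term()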

> The client usually is long lived, for as long as the app lives, but
> this particular app is a bit more special in that the separate tasks
> just use the clients to fetch some data in a standardized way, do
> their computation and exit. These tasks are periodically spawned by
> celery.
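
A shared context works fine with that model: each celery task can still
open and close its own socket, it just takes it from the shared context.
And since a request can apparently get stuck, the usual approach is the
"Lazy Pirate" pattern from the guide: poll with a timeout and, on expiry,
throw the socket away and retry with a fresh one rather than reusing it.
A rough sketch, again with hypothetical names:

    import zmq

    ctx = zmq.Context.instance()

    def reliable_request(endpoint, request, retries=3, timeout_ms=20000):
        # Lazy Pirate: on timeout, discard the socket and retry anew.
        for attempt in range(retries):
            sock = ctx.socket(zmq.REQ)
            sock.setsockopt(zmq.LINGER, 0)
            sock.connect(endpoint)
            sock.send(request)
            if sock.poll(timeout_ms, zmq.POLLIN):
                reply = sock.recv()
                sock.close()
                return reply
            # Timed out: the socket may be in a bad state, so drop it.
            sock.close()
        raise RuntimeError("no reply after %d attempts" % retries)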
> 
> > 
> > On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
> > > Hi all,
> > > 
> > > I have come across a weird/bad bug, I believe.
> > > 
> > > I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6
> > > and CentOS 7.
> > > 
> > > The application is a celery worker that runs 16 worker threads.
> > > Each worker thread instantiates a 0MQ-based client, gets data and
> > > then closes this client. The 0MQ-based client creates its own 0MQ
> > > context and terminates it on exit. Nothing is shared between the
> > > threads or clients; every client processes only one request and
> > > then it is fully terminated.
> > > 
> > > The client itself is a REQ socket which uses CURVE authentication
> > > to authenticate with a ROUTER socket on the server side. The REQ
> > > socket has linger=0. Almost always, the REQ socket issues a
> > > request, gets back a response, closes the socket, destroys its
> > > context, and all is good. Once every one or two days, though, the
> > > REQ socket times out waiting for the response from the ROUTER
> > > server; it then successfully closes the socket but hangs
> > > indefinitely when it goes on to destroy the context.
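
For reference, a CURVE REQ client as described is set up roughly like
this in pyzmq; the keys and endpoint below are placeholders, not the
real ones:

    import zmq

    ctx = zmq.Context.instance()
    # Placeholder: in reality you would load the server's published
    # public key, not generate one.
    server_pub, _ = zmq.curve_keypair()
    client_pub, client_sec = zmq.curve_keypair()

    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)
    sock.setsockopt(zmq.CURVE_SERVERKEY, server_pub)
    sock.setsockopt(zmq.CURVE_PUBLICKEY, client_pub)
    sock.setsockopt(zmq.CURVE_SECRETKEY, client_sec)
    sock.connect("tcp://server.example:5555")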
> > 
> > Note that these are two well-known anti-patterns. The context is
> > intended to be shared and unique in an application, and to live for
> > as long as the process does, and the sockets are meant to be long
> > lived as well.
> > 
> > I would recommend refactoring and, at the very least, using a single
> > context for the duration of your application.
> > 
> > > This runs in a data center on a 1Gb/s LAN, so the responses
> > > usually finish in under a second; the timeout is 20s. My theory is
> > > that the socket gets into a weird state and that is why it times
> > > out and blocks the context termination.
> > > 
> > > I ran a tcpdump and it turns out that the REQ client successfully
> > > authenticates with the ROUTER server but then goes completely
> > > silent for those 20-odd seconds.
> > > 
> > > Here is a tcpdump capture of a stuck REQ client -
> > > https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a
> > > normal communication - https://pastebin.com/qCi1jTp0. This is a
> > > full backtrace (after SIGABRT signal to the stuck application) -
> > > https://pastebin.com/jHdZS4VU
> > > 
> > > Here is ulimit:
> > > 
> > > [root@auhwbesap001 tomask]# cat /proc/311/limits
> > > Limit                     Soft Limit           Hard Limit           Units
> > > Max cpu time              unlimited            unlimited            seconds
> > > Max file size             unlimited            unlimited            bytes
> > > Max data size             unlimited            unlimited            bytes
> > > Max stack size            8388608              unlimited            bytes
> > > Max core file size        0                    unlimited            bytes
> > > Max resident set          unlimited            unlimited            bytes
> > > Max processes             31141                31141                processes
> > > Max open files            8196                 8196                 files
> > > Max locked memory         65536                65536                bytes
> > > Max address space         unlimited            unlimited            bytes
> > > Max file locks            unlimited            unlimited            locks
> > > Max pending signals       31141                31141                signals
> > > Max msgqueue size         819200               819200               bytes
> > > Max nice priority         0                    0
> > > Max realtime priority     0                    0
> > > Max realtime timeout      unlimited            unlimited            us
> > > 
> > > 
> > > The application doesn't seem to exceed any of the limits; it
> > > usually hovers between 100 and 200 open file handles.
> > > 
> > > I tried to swap the REQ socket for a DEALER socket, but that
> > > didn't help; the context eventually hung as well.
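
(Side note on that swap: DEALER does not add REQ's empty delimiter
frame, so when a DEALER talks to a server written for REQ peers you have
to handle that frame yourself. A minimal sketch, with placeholder
endpoint and payload:

    import zmq

    ctx = zmq.Context.instance()
    request = b"get-data"  # placeholder payload

    sock = ctx.socket(zmq.DEALER)
    sock.setsockopt(zmq.LINGER, 0)
    sock.connect("tcp://server.example:5555")
    sock.send_multipart([b"", request])    # empty delimiter + payload
    empty, reply = sock.recv_multipart()   # strip the delimiter again

Not related to the hang, just something to watch for with DEALER.)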
> > > 
> > > I also tried setting ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to
> > > 100ms, but the context still eventually hung.
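
For anyone following along, those two options are set like this in pyzmq
(assuming your pyzmq build exposes the constants; I believe ZMQ_BLOCKY
only appeared in libzmq 4.2, so it may be a no-op on 4.1.6):

    import zmq

    ctx = zmq.Context.instance()
    # Context option: don't block in term() waiting for pending messages.
    ctx.set(zmq.BLOCKY, 0)

    sock = ctx.socket(zmq.REQ)
    # Socket option: give up on the protocol/security handshake after
    # 100 ms instead of waiting indefinitely.
    sock.setsockopt(zmq.HANDSHAKE_IVL, 100)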
> > > 
> > > I looked into the C++ code of libzmq but would need some guidance
> > > to troubleshoot this, as I am primarily a Python programmer.
> > > 
> > > I think we had a similar issue back in 2014 -
> > > https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html.
> > > From memory, the tcpdump capture also showed the client/REQ going
> > > silent after the successful initial CURVE authentication, but at
> > > that time the server/ROUTER application was crashing with an
> > > assertion.
> > > 
> > > I am happy to do any more debugging.
> > > 
> > > Thanks in advance for any help/pointers.
> > 
> 
> Tomas Krajca
> Software architect
> m. 02 6162 0277
> e. tomas at repositpower.com