[zeromq-dev] zeromq-dev Digest, Vol 14, Issue 7

Luca Boccassi luca.boccassi at gmail.com
Mon May 15 14:39:26 CEST 2017


On Mon, 2017-05-15 at 11:57 +1000, Tomas Krajca wrote:
> Hi Luca,
> 
> Having a single/shared context didn't help. As soon as the REQ client
> timed out, 0MQ seemed to get confused and started leaking file handles.
> It ended up with 100s of those [eventfd] open file descriptors.
> 
> I am not sure if it's an issue with the reaper. My feeling is that the
> core issue is the REQ client going silent after successfully
> establishing the CURVE authentication. I have no idea if 0MQ hits some
> system limit or if there is a bug of some sort but that's the odd thing
> for me - successful CURVE handshake/authentication and then silence.
> 
> For now, I've got a cron job that restarts stuck workers so it's not 
> that urgent/critical. Anyway, I've got some time to do a bit more 
> digging or testing but I don't quite know where to start.
> 
> Thanks,
> Tomas

Ok, thanks for confirming this.
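
To check whether the [eventfd] descriptors keep growing between runs,
something along these lines could be pointed at the worker's pid. It is
only a sketch: count_fd_targets is a made-up helper name, it is
Linux-only, and it assumes the leaked descriptors show up as
anon_inode:[eventfd] links under /proc/<pid>/fd. As far as I know libzmq
uses an eventfd for its internal mailboxes on Linux, so a steadily
growing count would suggest sockets or contexts that are never fully
torn down.

```python
import os
import re
from collections import Counter

# Matches the inode suffix of targets such as "socket:[12345]" or "pipe:[678]".
_INODE_SUFFIX = re.compile(r":\[\d+\]$")


def count_fd_targets(pid="self"):
    """Count the open file descriptors of a process, grouped by link target.

    Hypothetical helper, Linux-only: it reads the symlinks in /proc/<pid>/fd.
    Numeric inode suffixes are stripped so sockets and pipes group by kind;
    targets such as "anon_inode:[eventfd]" are kept as they are.
    """
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return counts  # no /proc here, or not allowed to inspect this pid
    for name in names:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        counts[_INODE_SUFFIX.sub("", target)] += 1
    return counts


if __name__ == "__main__":
    # Print the most common descriptor kinds for the current process.
    for kind, n in count_fd_targets().most_common():
        print(f"{n:5d}  {kind}")
```

Running it periodically against the stuck worker (passing its pid
instead of "self") and comparing the anon_inode:[eventfd] count over
time should show whether the leak starts right after the first timeout.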

I would recommend the following two steps:

1) Try with the latest libzmq master and see if the problem still
happens
2) If it does, try to put together a minimal test case that reproduces
the issue with just libzmq - removing the layers of bindings helps a lot
when trying to reproduce a problem.

If I understand correctly, the pattern is:

1) ROUTER binds over TCP and enables CURVE
2) REQ connects over TCP with CURVE
3) REQ sends a message
4) REQ waits for a response that never arrives

What is the timeout value, and how is it checked (poll, socket option,
etc)?
Can you tell if the ROUTER receives the request and sends a reply?
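
For reference, this is roughly how I picture that pattern, written with
pyzmq since that is what the application uses; a plain libzmq reproducer
would follow the same four steps. Treat it as a sketch under
assumptions: the keys are throwaway pairs from zmq.curve_keypair(), the
endpoint is a random localhost port, curve_req_roundtrip is a made-up
name, the ROUTER side is a stand-in that answers once from the same
thread, and the timeout is assumed to be checked with poll, which is one
of the things I am asking about. It also needs a libzmq built with CURVE
support.

```python
import zmq


def curve_req_roundtrip(timeout_ms=20000):
    """One CURVE-protected REQ -> ROUTER round trip on a shared context.

    Returns the reply bytes, or None if no reply arrived within timeout_ms.
    """
    ctx = zmq.Context.instance()  # one context for the whole process

    # Throwaway key pairs, generated only for this sketch.
    server_public, server_secret = zmq.curve_keypair()
    client_public, client_secret = zmq.curve_keypair()

    # 1) ROUTER binds over TCP and enables CURVE
    router = ctx.socket(zmq.ROUTER)
    router.linger = 0
    router.curve_server = True
    router.curve_secretkey = server_secret
    port = router.bind_to_random_port("tcp://127.0.0.1")

    # 2) REQ connects over TCP with CURVE
    req = ctx.socket(zmq.REQ)
    req.linger = 0
    req.curve_serverkey = server_public
    req.curve_publickey = client_public
    req.curve_secretkey = client_secret
    req.connect(f"tcp://127.0.0.1:{port}")

    try:
        # 3) REQ sends a message
        req.send(b"request")

        # Stand-in for the real server: answer one request, if one arrives.
        if router.poll(timeout_ms, zmq.POLLIN):
            identity, empty, body = router.recv_multipart()
            router.send_multipart([identity, empty, b"reply to " + body])

        # 4) REQ waits for a response, with the timeout checked via poll
        if req.poll(timeout_ms, zmq.POLLIN):
            return req.recv()
        return None
    finally:
        # Close the sockets but keep the shared context alive.
        req.close()
        router.close()


if __name__ == "__main__":
    print(curve_req_roundtrip(timeout_ms=5000))
```

If the real client gets None back here while the ROUTER side never sees
the request, that would match the "handshake, then silence" capture and
narrow things down to the client.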

> > Date: Thu, 11 May 2017 11:38:35 +0100
> > From: Luca Boccassi <luca.boccassi at gmail.com>
> > To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> > Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
> > 	stuck/hangs despite linger=0
> > Message-ID: <1494499115.4886.3.camel at gmail.com>
> > Content-Type: text/plain; charset="utf-8"
> > 
> > On Wed, 2017-05-10 at 15:21 +1000, Tomas Krajca wrote:
> > > Hi Luca and thanks for your reply.
> > > 
> > >   > Note that these are two well-known anti-patterns. The context is
> > >   > intended to be shared and be unique in an application, and live
> > >   > for as long as the process does, and the sockets are meant to be
> > >   > long lived as well.
> > >   >
> > >   > I would recommend refactoring and, at the very least, use a single
> > >   > context for the duration of your application.
> > >   >
> > > 
> > > I always thought that having separate contexts was safer. I will
> > > refactor the application to use one context for all the
> > > clients/sockets and see if it makes any difference.
> > > 
> > > I wonder if that's going to eliminate the initial problem though. If
> > > the sockets really get somehow stuck/into an inconsistent state, then
> > > I imagine they will just "leak" and stay in that context forever,
> > > possibly preventing the app from a proper termination.
> > 
> > There could be an unknown race with the reaper. A single shared
> > context should help in that case.
> > 
> > > The client usually is long lived for as long as the app lives but in
> > > this particular app, it's a bit more special in that the separate
> > > tasks just use the clients to fetch some data in a standardized way,
> > > do their computation and exit. These tasks are periodically spawned
> > > by celery.
> > > 
> > > > Message: 1
> > > > Date: Mon, 08 May 2017 11:58:42 +0100
> > > > From: Luca Boccassi <luca.boccassi at gmail.com>
> > > > To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> > > > Cc: "developers at repositpower.com" <developers at repositpower.com>
> > > > Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
> > > > 	stuck/hangs despite linger=0
> > > > Message-ID: <1494241122.11089.5.camel at gmail.com>
> > > > Content-Type: text/plain; charset="utf-8"
> > > > 
> > > > On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
> > > > > Hi all,
> > > > > 
> > > > > I have come across a weird/bad bug, I believe.
> > > > > 
> > > > > I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6
> > > > > and CentOS 7.
> > > > > 
> > > > > The application is a celery worker that runs 16 worker threads.
> > > > > Each worker thread instantiates a 0MQ-based client, gets data and
> > > > > then closes this client. The 0MQ-based client creates its own 0MQ
> > > > > context and terminates it on exit. Nothing is shared between the
> > > > > threads or clients, every client processes only one request and
> > > > > then it's fully terminated.
> > > > > 
> > > > > The client itself is a REQ socket which uses CURVE authentication
> > > > > to authenticate with a ROUTER socket on the server side. The REQ
> > > > > socket has linger=0. Almost always, the REQ socket issues request,
> > > > > gets back response, closes the socket, destroys its context, all
> > > > > is good. Once every one or two days though, the REQ socket times
> > > > > out when waiting for the response from the ROUTER server, it then
> > > > > successfully closes the socket but indefinitely hangs when it goes
> > > > > on to destroy the context.
> > > > 
> > > > Note that these are two well-known anti-patterns. The context is
> > > > intended to be shared and be unique in an application, and live for
> > > > as long as the process does, and the sockets are meant to be long
> > > > lived as well.
> > > > 
> > > > I would recommend refactoring and, at the very least, use a single
> > > > context for the duration of your application.
> > > > 
> > > > > This runs in a data center on 1Gb/s LAN so the responses usually
> > > > > finish in under a second, the timeout is 20s. My theory is that
> > > > > the socket gets into a weird state and that's why it times out and
> > > > > blocks the context termination.
> > > > > 
> > > > > I ran a tcpdump and it turns out that the REQ client successfully
> > > > > authenticates with the ROUTER server but then it goes completely
> > > > > silent for those 20 odd seconds.
> > > > > 
> > > > > Here is a tcpdump capture of a stuck REQ client -
> > > > > https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a
> > > > > normal communication - https://pastebin.com/qCi1jTp0. This is a
> > > > > full backtrace (after SIGABRT signal to the stuck application) -
> > > > > https://pastebin.com/jHdZS4VU
> > > > > 
> > > > > Here is ulimit:
> > > > > 
> > > > > [root at auhwbesap001 tomask]# cat /proc/311/limits
> > > > > Limit                     Soft Limit           Hard Limit           Units
> > > > > Max cpu time              unlimited            unlimited            seconds
> > > > > Max file size             unlimited            unlimited            bytes
> > > > > Max data size             unlimited            unlimited            bytes
> > > > > Max stack size            8388608              unlimited            bytes
> > > > > Max core file size        0                    unlimited            bytes
> > > > > Max resident set          unlimited            unlimited            bytes
> > > > > Max processes             31141                31141                processes
> > > > > Max open files            8196                 8196                 files
> > > > > Max locked memory         65536                65536                bytes
> > > > > Max address space         unlimited            unlimited            bytes
> > > > > Max file locks            unlimited            unlimited            locks
> > > > > Max pending signals       31141                31141                signals
> > > > > Max msgqueue size         819200               819200               bytes
> > > > > Max nice priority         0                    0
> > > > > Max realtime priority     0                    0
> > > > > Max realtime timeout      unlimited            unlimited            us
> > > > > 
> > > > > 
> > > > > The application doesn't seem to get over any of the limits, it
> > > > > usually hovers between 100 and 200 open file handles.
> > > > > 
> > > > > I tried to swap the REQ socket for a DEALER socket but that didn't
> > > > > help, the context eventually hung as well.
> > > > > 
> > > > > I also tried to set ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to
> > > > > 100ms but the context still eventually hung.
> > > > > 
> > > > > I looked into the C++ code of libzmq but would need some guidance
> > > > > to troubleshoot this as I am primarily a python programmer.
> > > > > 
> > > > > I think we had a similar issue back in 2014 -
> > > > > https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html.
> > > > > From memory, the tcpdump capture also showed the client/REQ going
> > > > > silent after the successful initial CURVE authentication but at
> > > > > that time the server/ROUTER application was crashing with an
> > > > > assertion.
> > > > > 
> > > > > I am happy to do any more debugging.
> > > > > 
> > > > > Thanks in advance for any help/pointers.
> > > > 
> > > 
> > > <http://www.repositpower.com/>
> > > 
> > > *Tomas Krajca *
> > > Software architect
> > > m. 02 6162 0277
> > > e.  tomas at repositpower.com
> > > <https://twitter.com/RepositPower>
> > > <https://www.facebook.com/Reposit-Power-1423585874607903/>
> > > <https://www.linkedin.com/company/reposit-power>
> > > _______________________________________________
> > > zeromq-dev mailing list
> > > zeromq-dev at lists.zeromq.org
> > > https://lists.zeromq.org/mailman/listinfo/zeromq-dev