[zeromq-dev] C Based ZeroMQ Aggregation Server Problems...

Henry Geddes hgeddes at zynga.com
Mon Oct 24 20:49:21 CEST 2011


We are still seeing this issue. I have spent some time in gdb trying to identify where the process stalls.

gdb output:

(gdb) thread apply all bt

Thread 3 (Thread 0x42da6940 (LWP 26456)):
#0  0x00000030104d4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b6e8cb8a338 in zmq::epoll_t::loop (this=0x12f95f60) at epoll.cpp:142
#2  0x00002b6e8cb9d697 in thread_routine (arg_=0x12f95fd0) at thread.cpp:75
#3  0x0000003010c06617 in start_thread () from /lib64/libpthread.so.0
#4  0x00000030104d3c2d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x423a5940 (LWP 26455)):
#0  0x00000030104d4018 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b6e8cb8a338 in zmq::epoll_t::loop (this=0x12f95980) at epoll.cpp:142
#2  0x00002b6e8cb9d697 in thread_routine (arg_=0x12f959f0) at thread.cpp:75
#3  0x0000003010c06617 in start_thread () from /lib64/libpthread.so.0
#4  0x00000030104d3c2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b6e8cdb5560 (LWP 26452)):
#0  0x00000030104cae46 in poll () from /lib64/libc.so.6
#1  0x00002b6e8cb97aa0 in zmq::signaler_t::wait (this=<value optimized out>, timeout_=-1) at signaler.cpp:145
#2  0x00002b6e8cb8d182 in zmq::mailbox_t::recv (this=0x12f968f0, cmd_=0x7fff7ce4b180, timeout_=-1) at mailbox.cpp:69
#3  0x00002b6e8cb980bb in zmq::socket_base_t::process_commands (this=0x12f96810, timeout_=-1, throttle_=false) at socket_base.cpp:713
#4  0x00002b6e8cb982e4 in zmq::socket_base_t::recv (this=0x12f96810, msg_=0x2aab34000910, flags_=0) at socket_base.cpp:618
#5  0x00002b6e8c96d689 in zframe_recv (socket=0x12f96810) at zframe.c:103
#6  0x00002b6e8c971080 in zmsg_recv (socket=0x12f96810) at zmsg.c:98
#7  0x0000000000400b4a in main ()

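This is the backtrace from a stalled process, which was running with the new czmq library. The main thread (thread 1) is blocked in zmsg_recv, waiting in poll() with timeout_=-1, while both I/O threads sit in epoll_wait.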
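
For reference, thread 1 in the backtrace is in poll() with timeout_=-1 beneath zmsg_recv, so it will wait indefinitely if nothing arrives. The sketch below is not libzmq or czmq code. It uses plain POSIX poll() on an ordinary file descriptor to show how a bounded timeout lets a receive loop notice a stall instead of blocking forever. The names wait_readable and read_with_timeout and the -2 return code are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_ms elapses.
   Returns 1 if readable (or the peer hung up), 0 on timeout, -1 on error.
   Passing timeout_ms = -1 would wait forever, as in the backtrace. */
int wait_readable (int fd, int timeout_ms)
{
    struct pollfd item = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    /* Retry if a signal interrupts the wait. */
    do {
        rc = poll (&item, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    /* POLLHUP counts as readable so the caller sees EOF from read(). */
    return (item.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}

/* Hypothetical helper: read with a deadline.
   Returns the number of bytes read, 0 on EOF, -1 on error,
   or -2 if nothing arrived within timeout_ms. */
ssize_t read_with_timeout (int fd, void *buf, size_t len, int timeout_ms)
{
    assert (buf != NULL && timeout_ms >= 0);

    int ready = wait_readable (fd, timeout_ms);
    if (ready == 0)
        return -2;              /* stalled: caller can log and retry */
    if (ready < 0)
        return -1;
    return read (fd, buf, len);
}
```

Calling read_with_timeout (fd, buf, sizeof buf, 1000) in a loop and logging every -2 return would give a record of when traffic stopped, without changing what happens once data arrives.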

Henry

-----Original Message-----
From: zeromq-dev-bounces at lists.zeromq.org [mailto:zeromq-dev-bounces at lists.zeromq.org] On Behalf Of Henry Geddes
Sent: Tuesday, October 18, 2011 5:11 PM
To: ZeroMQ development list
Subject: Re: [zeromq-dev] C Based ZeroMQ Aggregation Server Problems...

A quick update on progress. We have upgraded all boxes to 2.1.10 and cleaned up the code to remove anything that would not compile against the new version. We are now running the sink to see whether the problem recurs overnight. I am also trying to familiarize myself with the inner workings of zmq. We did try rolling the sink back to the previous zmq version, but it did not appear to keep up with the traffic. We are also trying to determine whether a client connecting to the sink may be having an effect.

Hopefully it will remain stable through the night.

-----Original Message-----
From: zeromq-dev-bounces at lists.zeromq.org [mailto:zeromq-dev-bounces at lists.zeromq.org] On Behalf Of Henry Geddes
Sent: Monday, October 17, 2011 5:34 PM
To: ZeroMQ development list
Subject: Re: [zeromq-dev] C Based ZeroMQ Aggregation Server Problems...

We saw the recv in strace; it is the system call.

I am adding all the information we have to the bug right now. Tomorrow we will upgrade all programs to 2.1.8 and make sure every version matches. I am not sure what to add as a test case yet, since we cannot reproduce this on demand.

We will loop back around tomorrow.

-----Original Message-----
From: zeromq-dev-bounces at lists.zeromq.org [mailto:zeromq-dev-bounces at lists.zeromq.org] On Behalf Of Pieter Hintjens
Sent: Monday, October 17, 2011 5:09 PM
To: ZeroMQ development list
Subject: Re: [zeromq-dev] C Based ZeroMQ Aggregation Server Problems...

On Mon, Oct 17, 2011 at 6:45 PM, Henry Geddes <hgeddes at zynga.com> wrote:

> Are there any known issues when passing messages between 2.1.4 and 2.1.8?  We are just wondering if it may be an incompatibility between versions?

All 2.x versions are compatible, in theory. But indeed this might be
due to a malformed message landing and confusing things. In that case
it should be easy to reproduce.

Ideally we can get this to a reproducible state with a minimal test case.

Did you attach a debugger to the blocked process to see where it's
waiting? There isn't a recvfrom call anywhere in the codebase afaics.

-Pieter
_______________________________________________
zeromq-dev mailing list
zeromq-dev at lists.zeromq.org
http://lists.zeromq.org/mailman/listinfo/zeromq-dev