[zeromq-dev] Process eating 100 % of one core

Emmanuel TAUREL taurel at esrf.fr
Fri Nov 7 11:20:22 CET 2014


Hello all,

We are using ZMQ (still release 3.2.4), mainly on Linux boxes, with the
PUB/SUB model.
Our system runs 24/7. From time to time, one of our PUB processes
starts eating 100 % of one CPU core.
We don't know yet what exactly triggers this phenomenon, so we are not
able to reproduce it. It does not happen often (roughly once every
three to six months). Nevertheless, we did some analysis the last time
it happened.

Here is the output of "strace" on the PUB process:

2889  10:53:18.021013 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889  10:53:18.021041 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889  10:53:18.021068 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889  10:53:18.021096 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889  10:53:18.021123 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889  10:53:18.021151 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889  10:53:18.021178 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889  10:53:18.021206 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889  10:53:18.021233 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889  10:53:18.021260 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889  10:53:18.021288 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1

From the number of epoll_wait()/epoll_ctl() pairs and their period
(judging from the timestamps, a new pair roughly every 55 us), it is
clear that this thread is the one eating the CPU.
From the flags returned by epoll_wait() (EPOLLERR|EPOLLHUP), it seems
that something has gone wrong on one of the file descriptors (number 49,
judging from the epoll_ctl() argument). This is confirmed by the output
of "lsof" on the same PUB process:

Starter 2863 dserver   49u  sock                0,6      0t0 7902 can't identify protocol
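
To convince myself of the epoll semantics at play, I wrote the small
standalone test below (my own sketch, not our production code and not
libzmq code). EPOLLERR and EPOLLHUP are always reported by epoll_wait(),
whatever event mask was requested through epoll_ctl(), so a
level-triggered loop that reacts to them by merely clearing EPOLLIN with
EPOLL_CTL_MOD, as in the strace above, wakes up again immediately:

/* Standalone sketch: a dead socket registered with an EMPTY event
 * mask still wakes up epoll_wait() with EPOLLHUP, and EPOLL_CTL_MOD
 * cannot silence it, because ERR/HUP are implicit and unmaskable. */
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    close(sv[1]);                     /* peer gone -> HUP pending on sv[0] */

    int ep = epoll_create1(0);
    struct epoll_event ev = { 0 };    /* events == 0: ask for NO events */
    ev.data.fd = sv[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, sv[0], &ev);

    struct epoll_event out;
    for (int i = 0; i < 3; ++i) {
        epoll_wait(ep, &out, 1, -1);  /* returns at once despite the mask */
        printf("events = %#x\n", (unsigned) out.events); /* 0x10 = EPOLLHUP */
        epoll_ctl(ep, EPOLL_CTL_MOD, sv[0], &ev); /* same MOD as the strace */
    }
    return 0;
}

This should print EPOLLHUP (0x10) three times without ever blocking,
which matches the wait/ctl pattern in the trace.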

If I attach to the PUB process with gdb and ask for that thread's stack
trace, I get:

#0  0x00007fb65d3205ca in epoll_ctl () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb65e23c298 in zmq::epoll_t::reset_pollin (this=<optimized out>, handle_=<optimized out>) at epoll.cpp:101
#2  0x00007fb65e253da1 in zmq::stream_engine_t::in_event (this=0x7fb6509d8c10) at stream_engine.cpp:216
#3  0x00007fb65e23c46b in zmq::epoll_t::loop (this=0x7fb6611c5b70) at epoll.cpp:154
#4  0x00007fb65e257de6 in thread_routine (arg_=0x7fb6611c5be0) at thread.cpp:83
#5  0x00007fb65de0d0a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fb65d32004d in clone () from /lib/x86_64-linux-gnu/libc.so.6

If I read the trace correctly, in_event() reacts to the
EPOLLERR|EPOLLHUP condition by calling reset_pollin(), which only clears
EPOLLIN from the registered mask; since EPOLLERR and EPOLLHUP are
reported regardless of the mask, the thread wakes up again immediately.
Even if something has gone wrong on the socket associated with fd 49,
I think ZMQ should not enter such a busy loop.
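
Naively, I would expect the I/O thread to stop watching such a fd
altogether, something along the lines of the sketch below (just my
expectation written as code, not what libzmq actually does):

#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical handler: what I would expect on EPOLLERR|EPOLLHUP.
 * Since those flags cannot be masked out, the only way to stop the
 * wakeups is to remove the fd from the epoll set (or close it),
 * rather than modifying its event mask. */
static void on_broken_fd(int epoll_fd, int fd)
{
    epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL);
    /* ...then let the engine owning this fd terminate or reconnect. */
}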
Is it a known issue?
Is there something we could do to prevent this from happening again?

Thanks in advance for your help

Emmanuel





