[zeromq-dev] Process eating 100 % of one core
Emmanuel TAUREL
taurel at esrf.fr
Fri Nov 7 11:20:22 CET 2014
Hello all,
We are using ZMQ (still release 3.2.4), mainly on Linux boxes, with the
PUB/SUB model.
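For reference, our publisher side follows the classic PUB pattern. Here
is a minimal sketch (not our production code; the endpoint and payload
are made up, and error handling is trimmed), against the libzmq 3.2 C API:

#include <zmq.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
    void *ctx = zmq_ctx_new ();
    void *pub = zmq_socket (ctx, ZMQ_PUB);
    zmq_bind (pub, "tcp://*:5556");         /* hypothetical endpoint */

    for (;;) {
        const char *msg = "status/update";  /* hypothetical topic+payload */
        zmq_send (pub, msg, strlen (msg), 0);
        sleep (1);
    }

    zmq_close (pub);                        /* never reached in this sketch */
    zmq_ctx_destroy (ctx);
    return 0;
}

Subscribers simply connect and set ZMQ_SUBSCRIBE filters; nothing exotic
on our side.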
Our system runs 24/7. From time to time, some of our PUB processes start
eating 100 % of one CPU core.
We don't know yet what exactly triggers this phenomenon, so we are not
able to reproduce it. It does not happen often (once every 3 to 6
months!).
Nevertheless, we did some analysis the last time it happened.
Here are the results of running "strace" on the PUB process:
2889 10:53:18.021013 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889 10:53:18.021041 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889 10:53:18.021068 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889 10:53:18.021096 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889 10:53:18.021123 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889 10:53:18.021151 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889 10:53:18.021178 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889 10:53:18.021206 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889 10:53:18.021233 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
2889 10:53:18.021260 epoll_ctl(19, EPOLL_CTL_MOD, 49, {0, {u32=335547808, u64=140097873776032}}) = 0
2889 10:53:18.021288 epoll_wait(19, {{EPOLLERR|EPOLLHUP, {u32=335547808, u64=140097873776032}}}, 256, 4294967295) = 1
From the number of epoll_wait()/epoll_ctl() pairs and their rate (about
two pairs every 100 us, i.e. tens of thousands of system calls per
second), it is clear that this is the thread eating the CPU.
From the flags returned by epoll_wait() (EPOLLERR|EPOLLHUP), it seems
that something wrong has happened on one of the file descriptors (number
49, judging from the epoll_ctl() argument). This is confirmed by the
output of "lsof" on the same PUB process:
Starter 2863 dserver 49u sock 0,6 0t0 7902 can't identify protocol
If I attach gdb to the PUB process and ask for this thread's stack
trace, I get:
#0 0x00007fb65d3205ca in epoll_ctl () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fb65e23c298 in zmq::epoll_t::reset_pollin (this=<optimized out>, handle_=<optimized out>) at epoll.cpp:101
#2 0x00007fb65e253da1 in zmq::stream_engine_t::in_event (this=0x7fb6509d8c10) at stream_engine.cpp:216
#3 0x00007fb65e23c46b in zmq::epoll_t::loop (this=0x7fb6611c5b70) at epoll.cpp:154
#4 0x00007fb65e257de6 in thread_routine (arg_=0x7fb6611c5be0) at thread.cpp:83
#5 0x00007fb65de0d0a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6 0x00007fb65d32004d in clone () from /lib/x86_64-linux-gnu/libc.so.6
Even if something wrong has happened on the socket associated with fd
49, I think ZMQ should not enter such a busy loop.
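If I read the epoll_ctl(2) man page correctly, the kernel always reports
EPOLLERR and EPOLLHUP, whether or not they were requested in the event
mask. So the epoll_ctl(..., EPOLL_CTL_MOD, 49, {0, ...}) calls in the
trace above, which clear the interest mask, cannot silence the dead
socket, and epoll_wait() keeps returning immediately. A small standalone
test of mine (my own code, not libzmq's) seems to confirm this:

#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main (void)
{
    int sv[2];
    socketpair (AF_UNIX, SOCK_STREAM, 0, sv);
    close (sv[1]);                            /* peer gone -> HUP pending on sv[0] */

    int ep = epoll_create1 (0);
    struct epoll_event ev = { .events = 0 };  /* request no events at all */
    epoll_ctl (ep, EPOLL_CTL_ADD, sv[0], &ev);

    struct epoll_event out;
    int n = epoll_wait (ep, &out, 1, -1);     /* wakes up immediately anyway */
    printf ("n=%d events=%s%s\n", n,
            (out.events & EPOLLHUP) ? "EPOLLHUP " : "",
            (out.events & EPOLLERR) ? "EPOLLERR " : "");
    return 0;
}

This should print "n=1 events=EPOLLHUP" right away, even though no
events were requested, which would explain why resetting the pollin flag
does not stop the loop.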
Is this a known issue?
Is there something we could do to prevent it from happening again?
Thanks in advance for your help
Emmanuel