[zeromq-dev] zeromq EPGM PUB threads eating 100% of each core

21848863 21848863 at qq.com
Thu Nov 16 04:09:10 CET 2017


Hi,
     We are using zeromq (3.2.0) on linux Redhat 6.4-6.7 and widnows 2008-2012 X64. We are 
using the EPGM PUB/SUB model. We use one master node(linux) and two slave node(linux + windows) and one monitor node(windows).
     On each node,we use three same pub ports,and each sub the same multicast ips and ports.two pub ports for business process and one pub port  for heart beat.   If the monitor find something wrong(heart beat not receive in 10 seconds) in one node,it will issue instruction to  control the destination node to close the heart beat pub port and  reopen it.


     From time to time, we have all of our PUB threads eating 100% of each core.   We don't know yet what exactly triggers this phenomenon and therefore we can't reproduce it.


     Here are some callstacks of the three PUB threads.


Thread 1 (process 42443):
#0  0x00007f70b2986b55 in pgm_rate_remaining2 () from libzmq.so.3
#1  0x00007f70b296d43a in pgm_getsockopt () from libzmq.so.3
#2  0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from libzmq.so.3
#3  0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#5  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6  0x00007f70b294747c in thread_routine () from libzmq.so.3
#7  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 1 (process 42443):
#0  0x0000003bbb07a8f3 in malloc () from /lib64/libc.so.6
#1  0x0000003bc94bd09d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#2  0x00007f70b29390a4 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> > >::_M_insert(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> const&) () from libzmq.so.3
#3  0x00007f70b2939165 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> > >::insert_equal(std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> const&) () from libzmq.so.3
#4  0x00007f70b2938baf in zmq::poller_base_t::add_timer(int, zmq::i_poll_events*, int) () from libzmq.so.3
#5  0x00007f70b29333fd in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#6  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#7  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#8  0x00007f70b294747c in thread_routine () from libzmq.so.3
#9  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 2 (process 42448):
#0  0x00007f70b2920a82 in zmq::clock_t::rdtsc() () from libzmq.so.3
#1  0x00007f70b2920b69 in zmq::clock_t::now_ms() () from libzmq.so.3
#2  0x00007f70b2938b90 in zmq::poller_base_t::add_timer(int, zmq::i_poll_events*, int) () from libzmq.so.3
#3  0x00007f70b29333fd in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#5  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6  0x00007f70b294747c in thread_routine () from libzmq.so.3
#7  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 2 (process 42448):
#0  0x00007f70b297820b in pgm_send () from libzmq.so.3
#1  0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned long) () from libzmq.so.3
#2  0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#3  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#4  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#5  0x00007f70b294747c in thread_routine () from libzmq.so.3
#6  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 2 (process 42448):
#0  0x00007f70b296bf87 in pgm_rwlock_reader_unlock () from libzmq.so.3
#1  0x00007f70b296de98 in pgm_getsockopt () from libzmq.so.3
#2  0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from libzmq.so.3
#3  0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#5  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6  0x00007f70b294747c in thread_routine () from libzmq.so.3
#7  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 2 (process 42448):
#0  0x0000003bbb40a659 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1  0x00007f70b297560d in pgm_mutex_unlock () from libzmq.so.3
#2  0x00007f70b29783e6 in pgm_send () from libzmq.so.3
#3  0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned long) () from libzmq.so.3
#4  0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#5  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#6  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#7  0x00007f70b294747c in thread_routine () from libzmq.so.3
#8  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 2 (process 42448):
#0  0x00007f70b2972cae in pgm_atomic_inc32 () from libzmq.so.3
#1  0x00007f70b297846e in pgm_rwspinlock_reader_trylock () from libzmq.so.3
#2  0x00007f70b2978450 in pgm_rwlock_reader_trylock () from libzmq.so.3
#3  0x00007f70b29782d1 in pgm_send () from libzmq.so.3
#4  0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned long) () from libzmq.so.3
#5  0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#6  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#7  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#8  0x00007f70b294747c in thread_routine () from libzmq.so.3
#9  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 3 (process 27985):
#0  0x00007f70b296d4a8 in pgm_getsockopt () from libzmq.so.3
#1  0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from libzmq.so.3
#2  0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#3  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#4  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#5  0x00007f70b294747c in thread_routine () from libzmq.so.3
#6  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#7  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 3 (process 27985):
#0  0x0000003bbb07a8ad in malloc () from /lib64/libc.so.6
#1  0x0000003bc94bd09d in operator new(unsigned long) () from /usr/lib64/libstdc++.so.6
#2  0x00007f70b29390a4 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> > >::_M_insert(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> const&) () from libzmq.so.3
#3  0x00007f70b2939165 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> > >::insert_equal(std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> const&) () from libzmq.so.3
#4  0x00007f70b2938baf in zmq::poller_base_t::add_timer(int, zmq::i_poll_events*, int) () from libzmq.so.3
#5  0x00007f70b29333fd in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#6  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#7  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#8  0x00007f70b294747c in thread_routine () from libzmq.so.3
#9  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 3 (process 27985):
#0  0x00007f70b297676f in send_odata_copy () from libzmq.so.3
#1  0x00007f70b29783d6 in pgm_send () from libzmq.so.3
#2  0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned long) () from libzmq.so.3
#3  0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#5  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6  0x00007f70b294747c in thread_routine () from libzmq.so.3
#7  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


Thread 3 (process 27985):
#0  0x00007f70b2958313 in pgm_tsc_update () from libzmq.so.3
#1  0x00007f70b2986b04 in pgm_rate_remaining2 () from libzmq.so.3
#2  0x00007f70b296d43a in pgm_getsockopt () from libzmq.so.3
#3  0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from libzmq.so.3
#4  0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#5  0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from libzmq.so.3
#6  0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#7  0x00007f70b294747c in thread_routine () from libzmq.so.3
#8  0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003bbb0e890d in clone () from /lib64/libc.so.6


we got the recent zeromq source code from github and found some bug fixes might be relate to the phenomenon.


version: 3de116336907323822266aed603a18633e315228
author: Claudio Freire <klaussfreire at gmail.com>
date: 2014/11/11 0:22:54
info:Fix busy-polling bug in pgm_sender causing it to consume CPU cycles
modify: src/pgm_sender.cpp


version: 92bee5a6dbb150115058c494e83702097aa58747
author: Ricardo Catalinas Jiménez <jimenezrick at gmail.com>
date: 2012/7/28 17:04:03
info:Fix LIBZMQ-404: zmq_term not truly re-entrant
zmq_term can not safely be re-entered with pgm transport.
Fix proposed by Steven McCoy.
modify: src/zmq.cpp


     If our monitor node found something wrong., it may issue instruction to  control the destination node to close the heart beat pub port and reopen it,and the destination node use zmq_term to close the heart beat pub port.
 
      But we are not sure if these bug fixes can solve our problem and we don't know how to reproduce the  phenomenon  of these bug fixes.


Thanks.


dongfang zhang
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20171116/46c94c53/attachment.htm>


More information about the zeromq-dev mailing list