[zeromq-dev] ZMQ DEALER/ROUTER socket recv and send hang

huang jun hjwsm1989 at gmail.com
Mon Jul 25 15:07:38 CEST 2016


After debug using tcpdump

The stucked machine's log:
1612    2016-07-25 10:09:51.765676    192.168.3.102    192.168.3.101
 TCP    68    59640 → 5556 [PSH, ACK] Seq=11 Ack=13 Win=29312 Len=2
TSval=488153723 TSecr=1036563812 (content is: 0103)
1613    2016-07-25 10:09:51.765683    192.168.3.102    192.168.3.101
 TCP    284    59640 → 5556 [PSH, ACK] Seq=13 Ack=13 Win=29312 Len=218
TSval=488153723 TSecr=1036563812
1614    2016-07-25 10:09:51.765759    192.168.3.101    192.168.3.102
 TCP    66    5556 → 59640 [ACK] Seq=13 Ack=231 Win=30080 Len=0
TSval=1036563812 TSecr=488153723
1615    2016-07-25 10:09:51.765792    192.168.3.101    192.168.3.102
 TCP    68    5556 → 59640 [PSH, ACK] Seq=13 Ack=231 Win=30080 Len=2
TSval=1036563812 TSecr=488153723 (content is: 0000)
1616    2016-07-25 10:09:51.805911    192.168.3.102    192.168.3.101
 TCP    66    59640 → 5556 [ACK] Seq=231 Ack=15 Win=29312 Len=0
TSval=488153764 TSecr=1036563812
it never send the reply message

And the below is the un-stucked machine's log:
24    2016-07-25 12:24:17.696372    192.168.3.102    192.168.3.103
TCP    60    59367 → 5556 [PSH, ACK] Seq=11 Ack=13 Win=29312 Len=2
(content: 0103)
25    2016-07-25 12:24:17.696402    192.168.3.103    192.168.3.102
TCP    56    5556 → 59367 [PSH, ACK] Seq=13 Ack=13 Win=14848 Len=2
(content: 0000)
26    2016-07-25 12:24:17.696614    192.168.3.102    192.168.3.103
TCP    275    59367 → 5556 [PSH, ACK] Seq=13 Ack=13 Win=29312 Len=221
27    2016-07-25 12:24:17.735664    192.168.3.102    192.168.3.103
TCP    60    59367 → 5556 [ACK] Seq=234 Ack=15 Win=29312 Len=0
28    2016-07-25 12:24:17.736489    192.168.3.103    192.168.3.102
TCP    54    5556 → 59367 [ACK] Seq=15 Ack=234 Win=15872 Len=0
29    2016-07-25 12:24:24.487902    192.168.3.103    192.168.3.102
TCP    77    5556 → 59367 [PSH, ACK] Seq=15 Ack=234 Win=15872 Len=23
(***the reply message****)
30    2016-07-25 12:24:24.488116    192.168.3.102    192.168.3.103
TCP    60    59367 → 5556 [ACK] Seq=234 Ack=38 Win=29312 Len=0
31    2016-07-25 12:24:24.488948    192.168.3.102    192.168.3.103
TCP    60    59367 → 5556 [FIN, ACK] Seq=234 Ack=38 Win=29312 Len=0
32    2016-07-25 12:24:24.489029    192.168.3.103    192.168.3.102
TCP    54    5556 → 59367 [FIN, ACK] Seq=38 Ack=235 Win=15872 Len=0
33    2016-07-25 12:24:24.489134    192.168.3.102    192.168.3.103
TCP    60    59367 → 5556 [ACK] Seq=235 Ack=39 Win=29312 Len=0

Compared the two logs, the stucked log seems the zmq send the wrong
reply sequence.
Any ideas?

2016-07-21 21:15 GMT+08:00 huang jun <hjwsm1989 at gmail.com>:
> I use DEALER/ROUTER for message communication,
> we use MsgrTcpServer for message handling
>
> int MsgrTcpServer::start()
> {
>     // Prepare our context and sockets
>     zmq::context_t context (1);
>     zmq::socket_t clients (context, ZMQ_ROUTER);
>
>     dout(10) << " bind_addr " << bind_addr
>        << " bind_port " << bind_port << dendl;
>     string tcp_addr = get_tcp_addr();
>     try {
>          clients.bind (tcp_addr.c_str());
>           dout(10) << "tcp bind to " << tcp_addr << " succeed" << dendl;
>     } catch (zmq::error_t &e) {
>        dout(1) << "tcp bind to " << tcp_addr << " failed, ret=" <<
> e.num() << dendl;
>         return e.num();
>     }
>     zmq::socket_t workers (context, ZMQ_DEALER);
>     workers.bind ("inproc://workers");
>
>     // Launch pool of worker threads
>     for (int thread_nbr = 0; thread_nbr < thread_num; thread_nbr++) {
>        workerlist.push_back(new MsgrTcpServerWorker((void *)&context, msgr));
>     }
>     list<MsgrTcpServerWorker *>::iterator i = workerlist.begin();
>     for ( ;i != workerlist.end(); i++)
>     {
>         (*i)->start();
>     }
>
>     // Connect work threads to client threads via a queue
>     zmq::proxy (clients, workers, NULL);
>     return 0;
> }
>
> and the MsgrTcpServerWorker is the job worker thread
>
> void *MsgrTcpServerWorker::entry ()
> {
>   zmq::context_t *context = (zmq::context_t *) arg;
>   zmq::socket_t socket (*context, ZMQ_REP);
>   socket.connect ("inproc://workers");
>
> while (true) {
>      //  Wait for next request from client
>      zmq::message_t request;
>      dout(0) << "going to recv msg from client" << dendl;
>      socket.recv (&request);
>      dout(0) << "recved msg from client" << dendl;
>      //print_memusage();
>      Handler *handler = Handler::get_handler(&socket,
> string((char*)request.data()));
>      try
>      {
>           handler->handle_message(string((char*)request.data()), msgr);
>      }
>      catch(exception &e)
>      {
>         //dout(10)<<"catch error:"<<e.what()<<dendl;
>         ostringstream oss;
>         BackTrace bt(1);
>         bt.print(oss);
>         dout(0) << oss.str() << dendl;
>      }
>     //dout(10) << "goto release handler" << dendl;
>     if (handler)
>          delete handler;
>      handler = NULL;
>      //dout(10) << "after release handler" << dendl;
>      //print_memusage();
>   }
>   return ((void *)NULL);
> }
>
> The problem is : program can not recv message, but it's alive.
> here is the main process's log
> [2016-07-21 15:53:47.642555] 10 reply message done
> [2016-07-21 15:53:47.642662] 0 going to recv msg from client
> it never print "recved msg from client", so think it's stucked in recv function.
>
> what did i do so far,
> 1. we use a test program to send message to this process, here is the log:
> Connecting to tcp://0.0.0.0:5556
> do send request...
>
> yes, we can connect to the listened socket, but can not send data to it.
>
> 2. the fd used by the process
> cmon    1451579 root   23u  0000     0,9        0     5154 anon_inode
> cmon    1451579 root   24u  IPv4 2687562      0t0      TCP *:freeciv (LISTEN)
> cmon    1451579 root   25u  0000     0,9        0     5154 anon_inode
> cmon    1451579 root   26u  0000     0,9        0     5154 anon_inode
> cmon    1451579 root   27u  0000     0,9        0     5154 anon_inode
> cmon    1451579 root   28u  IPv4 2687563      0t0      TCP
> 192.168.3.103:freeciv->192.168.3.102:35568 (ESTABLISHED)
> cmon    1451579 root   29u  IPv4 2687564      0t0      TCP
> 192.168.3.103:freeciv->192.168.3.102:35575 (ESTABLISHED)
> cmon    1451579 root   30u  IPv4 2688180      0t0      TCP
> 192.168.3.103:freeciv->192.168.3.102:42789 (ESTABLISHED)
>
> 3.we also use strace to trace the process, the log is
> 1451582 write(4, "[2016-07-21 15:53:47.642662] 0 g"..., 61 <unfinished ...>
> 1451579 poll([{fd=25, events=POLLIN}], 1, 0 <unfinished ...>
> 1451582 <... write resumed> )           = 61
> 1451579 <... poll resumed> )            = 1 ([{fd=25, revents=POLLIN}])
> 1451582 poll([{fd=27, events=POLLIN}], 1, -1 <unfinished ...>
> 1451579 read(25, "\1\0\0\0\0\0\0\0", 8) = 8
> 1451579 poll([{fd=25, events=POLLIN}], 1, 0) = 0 (Timeout)
> 1451579 poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout)
> 1451579 write(21, "\1\0\0\0\0\0\0\0", 8) = 8
> 1451579 poll([{fd=23, events=POLLIN}, {fd=25, events=POLLIN}], 2, 0) =
> 0 (Timeout)
> 1451579 poll([{fd=23, events=POLLIN}], 1, 0) = 0 (Timeout)
> 1451579 poll([{fd=25, events=POLLIN}], 1, 0) = 0 (Timeout)
> 1451579 poll([{fd=23, events=POLLIN}, {fd=25, events=POLLIN}], 2, -1
> it's that the poll of zmq_proxy failed?
>
> 4. the socket stat
> [root at lab207 src]# netstat -n|grep 5556
> tcp        0      0 192.168.3.103:5556          192.168.3.102:35568
>      ESTABLISHED
> tcp       10      0 127.0.0.1:5556              127.0.0.1:39115
>      ESTABLISHED
> tcp        0      0 127.0.0.1:39115             127.0.0.1:5556
>      ESTABLISHED
> tcp        0      0 192.168.3.103:5556          192.168.3.102:42789
>      ESTABLISHED
> tcp        0      0 192.168.3.103:5556          192.168.3.102:35575
>      ESTABLISHED
>
> 5. The stack of the process
> [root at lab207 rpms]# pstack 1451579
> Thread 4 (Thread 0x7f91fb597700 (LWP 1451580)):
> #0  0x0000003b33ce8f33 in epoll_wait () from /lib64/libc.so.6
> #1  0x00007f91fba2fdfd in zmq::epoll_t::loop() () from
> /usr/local/lib/libzmq.so.3
> #2  0x00007f91fba4855b in thread_routine () from /usr/local/lib/libzmq.so.3
> #3  0x0000003b34407a51 in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003b33ce893d in clone () from /lib64/libc.so.6
> Thread 3 (Thread 0x7f91fa565700 (LWP 1451582)):
> #0  0x0000003b33cdf113 in poll () from /lib64/libc.so.6
> #1  0x00007f91fba3e1a1 in zmq::signaler_t::wait(int) () from
> /usr/local/lib/libzmq.so.3
> #2  0x00007f91fba3398a in zmq::mailbox_t::recv(zmq::command_t*, int)
> () from /usr/local/lib/libzmq.so.3
> #3  0x00007f91fba3efa4 in zmq::socket_base_t::process_commands(int,
> bool) () from /usr/local/lib/libzmq.so.3
> #4  0x00007f91fba3f1ef in zmq::socket_base_t::recv(zmq::msg_t*, int)
> () from /usr/local/lib/libzmq.so.3
> #5  0x00007f91fba4f029 in s_recvmsg(zmq::socket_base_t*, zmq_msg_t*,
> int) () from /usr/local/lib/libzmq.so.3
> #6  0x00000000005b8507 in zmq::socket_t::recv(zmq::message_t*, int) ()
> #7  0x00000000005b5f9c in MsgrTcpServerWorker::entry() ()
> #8  0x00000000004a85c6 in Thread::entry_wrapper() ()
> #9  0x00000000004a855c in Thread::_entry_func(void*) ()
> #10 0x0000003b34407a51 in start_thread () from /lib64/libpthread.so.0
> #11 0x0000003b33ce893d in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x7f91f9d64700 (LWP 1451583)):
> #0  0x0000003b33cdf113 in poll () from /lib64/libc.so.6
> #1  0x00007f91fba3e1a1 in zmq::signaler_t::wait(int) () from
> /usr/local/lib/libzmq.so.3
> #2  0x00007f91fba3398a in zmq::mailbox_t::recv(zmq::command_t*, int)
> () from /usr/local/lib/libzmq.so.3
> #3  0x00007f91fba3efa4 in zmq::socket_base_t::process_commands(int,
> bool) () from /usr/local/lib/libzmq.so.3
> #4  0x00007f91fba3f1ef in zmq::socket_base_t::recv(zmq::msg_t*, int)
> () from /usr/local/lib/libzmq.so.3
> #5  0x00007f91fba4f029 in s_recvmsg(zmq::socket_base_t*, zmq_msg_t*,
> int) () from /usr/local/lib/libzmq.so.3
> #6  0x00000000005b8507 in zmq::socket_t::recv(zmq::message_t*, int) ()
> #7  0x00000000005b5f9c in MsgrTcpServerWorker::entry() ()
> #8  0x00000000004a85c6 in Thread::entry_wrapper() ()
> #9  0x00000000004a855c in Thread::_entry_func(void*) ()
> #10 0x0000003b34407a51 in start_thread () from /lib64/libpthread.so.0
> #11 0x0000003b33ce893d in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x7f91fb5997e0 (LWP 1451579)):
> #0  0x0000003b33cdf113 in poll () from /lib64/libc.so.6
> #1  0x00007f91fba4fba8 in zmq_poll () from /usr/local/lib/libzmq.so.3
> #2  0x00007f91fba3b101 in zmq::proxy(zmq::socket_base_t*,
> zmq::socket_base_t*, zmq::socket_base_t*) () from
> /usr/local/lib/libzmq.so.3
> #3  0x00000000005b8183 in zmq::proxy(void*, void*, void*) ()
> #4  0x00000000005b6a30 in MsgrTcpServer::start() ()
> #5  0x00000000005b726c in Msgr::start() ()
> #6  0x00000000004870ba in main ()
>
> 6. the stack in proc filesystem
> [root at lab207 src]# cat /proc/1451579/stack
> [<ffffffff8118e409>] poll_schedule_timeout+0x49/0x70
> [<ffffffff8118e9a0>] do_poll+0x250/0x2b0
> [<ffffffff8118f3df>] do_sys_poll+0x1bf/0x240
> [<ffffffff8118f59c>] do_restart_poll+0x3c/0x60
> [<ffffffff81066fcf>] sys_restart_syscall+0x1f/0x30
> [<ffffffff8155f1d9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I have no idea why this happen, what should i do to dignose this problem.
> Any tips or advices are welcome, as i can not think other way to
> locate the problem.
>
> I also found that, when i killed the process, the 'netstat -n|grep 5556' shows
> the socket still in ESTABLISHED state, is that normal?
>
>
> --
> Thank you!
> HuangJun



-- 
Thank you!
HuangJun



More information about the zeromq-dev mailing list