[zeromq-dev] ZMQ sporadic problem with REQ/REP sockets
Francesco
francesco.montorsi at gmail.com
Wed Feb 21 13:29:07 CET 2018
Hi all,
I'm using ZMQ 4.2.3 in a sort of large project. This project has
automatic integration tests running every night that start/stop
processes, send debug commands to a daemon processing data.
I have built a debug interface to some of the C++ processes I have
using ZMQ REQ/REP pattern.
Through this interface I just send text commands like "show
statistics" to retrieve some debugging counters from the live, running
daemon I want to debug.
To send the commands I built a small utility (named "debug shell")
that opens a REQ socket, sends the text message, waits for the reply
and prints the received output on stdout.
Note that both REQ/REP sockets use 127.0.0.1 and TCP transport.
Sporadically the integration tests fail because the daemon stops
replying to the debugging text commands on its REP socket.
When this happens that daemon enters into some weird state where until
I restart it, it will NEVER reply anymore on its REP socket.
So the bug happens sporadically but when it manifests, it remains till
the daemon is restarted.
So far I could do some basic debugging (against a daemon built in
release mode unfortunately!!) and I found that:
1) adding a ZMQ monitor socket on the REQ socket inside the "debug
shell" I discovered that I get ZMQ_EVENT_CONNECTED notification on the
REQ socket: so IMHO the ZMQ background threads have completed their
TCP handshake.
2) I run a "strace -p" on the thread of the daemon that does
zmq_poll() (to wait for ZMQ_POLLIN events) on the REP socket.
When the debug shell is down I see a regular pattern of:
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 250) = 0 (Timeout)
...
indeed that thread does the zmq_poll() with a timeout of 250ms, finds
no message and retunr.s
I found that when I start the debug shell the following system calls happen:
...
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 250) = 1 ([{fd=35, revents=POLLIN}])
poll([{fd=35, events=POLLIN}], 1, 0) = 1 ([{fd=35, revents=POLLIN}])
read(35, "\1\0\0\0\0\0\0\0", 8) = 8
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 162) = 1 ([{fd=35, revents=POLLIN}])
poll([{fd=35, events=POLLIN}], 1, 0) = 1 ([{fd=35, revents=POLLIN}])
read(35, "\1\0\0\0\0\0\0\0", 8) = 8
write(11, "\1\0\0\0\0\0\0\0", 8) = 8
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 161) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
...
I believe this is the ZMQ background thread handshake going on.
The problem is that then I send a text command from the debug shell
and I see nothing happens, just the endless flow of:
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=35, events=POLLIN}], 1, 250) = 0 (Timeout)
...
However note that the zmq_send() on the REQ socket on the debug shell
returns success.
If I restart the daemon process and then repeat this test and I send a
message from the debug shell I see this pattern:
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=73, events=POLLIN}], 1, 250) = 1 ([{fd=73, revents=POLLIN}])
poll([{fd=73, events=POLLIN}], 1, 0) = 1 ([{fd=73, revents=POLLIN}])
read(73, "\1\0\0\0\0\0\0\0", 8) = 8
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
open("/proc/2933/status", O_RDONLY) = 70
fstat(70, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7fcd87392000
read(70, "Name:\thmstatemachine\nUmask:\t0000"..., 1024) = 1024
close(70) = 0
munmap(0x7fcd87392000, 4096) = 0
poll([{fd=17, events=POLLIN}], 1, 0) = 0 (Timeout)
write(18, "\1\0\0\0\0\0\0\0", 8) = 8
poll([{fd=17, events=POLLIN}], 1, -1) = 1 ([{fd=17, revents=POLLIN}])
read(17, "\1\0\0\0\0\0\0\0", 8) = 8
poll([{fd=17, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
write(13, "\1\0\0\0\0\0\0\0", 8) = 8
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=73, events=POLLIN}], 1, 250) = 0 (Timeout)
poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout)
and the daemon process is able to reply to the text command successfully.
Can you suggest me some way to debug this problem?
How can I know if the ZMQ background threads of the daemon and the
debug shell have successfully connected? (I'm using a monitor socket
to understand that but maybe there-s a better way)
How can I know if the problem lies on the debug shell (REQ socket) or
on the daemon (REP socket)?
(everything seems to indicate the problem is in the daemon: only
restarting it I get back to a working condition!!)
Thanks,
Francesco
More information about the zeromq-dev
mailing list