[zeromq-dev] ZMQ (3.1.0) losing first part in multi-part message
Emmanuel TAUREL
taurel at esrf.fr
Mon Mar 25 09:32:20 CET 2013
Hello all,
Because of the wire-protocol incompatibility between ZMQ 3.1 and 3.2, we are
still using 3.1 in our production environment.
We are using the PUB/SUB pattern. The subscriber is a GUI given to our
users, and several instances of this GUI are running. There are around
150 publishers running on different hosts. Every publisher process
publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB
pattern) these heartbeat messages to every running subscriber (GUI).
These heartbeat messages are multi-part messages with 3 parts. Our
problem is that, from time to time, the first part of the multi-part
message is not sent by ZMQ to some registered subscriber!
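For reference, here is a stripped-down sketch of the publishing pattern we
use (3.1 C API; the endpoint, the payload frames and the error handling are
placeholders, not our real code):

#include <zmq.h>
#include <string.h>
#include <unistd.h>

int main (void)
{
    void *ctx = zmq_init (1);
    void *pub = zmq_socket (ctx, ZMQ_PUB);
    zmq_bind (pub, "tcp://*:5556");            /* placeholder endpoint */

    /* First frame: the event name, also used as subscription topic */
    char topic [] =
        "tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat";
    char part2 [] = "...";                     /* placeholder payload frames */
    char part3 [] = "...";

    while (1) {
        /* Frames 1 and 2 are flagged ZMQ_SNDMORE, the last one is not */
        zmq_send (pub, topic, strlen (topic), ZMQ_SNDMORE);
        zmq_send (pub, part2, strlen (part2), ZMQ_SNDMORE);
        zmq_send (pub, part3, strlen (part3), 0);
        sleep (9);                             /* heartbeat period */
    }

    zmq_close (pub);
    zmq_term (ctx);
    return 0;
}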
I have recorded all the network packets sent by some of our publishers
using the strace command (-e trace=network -o <file>).
Here are some lines extracted from the strace-generated file where the
problem is clear.
First, all heartbeat messages sent by the publisher on its socket 34
around 07:41 (remember that they are sent every 9 seconds):
4848 07:41:00.873685 send(34,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:09.875210 send(34,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:18.873082 send(34,
"\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26
4848 07:41:27.875234 send(34,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:36.874466 send(34,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:45.873737 send(34,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:54.873934 send(34,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
Then, all heartbeat messages sent by the same publisher on its socket 60
around 07:41:
4848 07:41:00.872882 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:09.873237 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:18.873612 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0) = 89
4848 07:41:27.874226 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0 <unfinished ...>
4848 07:41:36.873452 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0 <unfinished ...>
4848 07:41:45.872701 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0 <unfinished ...>
4848 07:41:54.872940 send(60,
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
89, 0 <unfinished ...>
As you can see, the packet sent on socket 34 at 07:41:18 is not correct:
the first message part (the string ending with .heartbeat) is not
transmitted. But on socket 60, at the same time (07:41:18), the message
is correctly sent. To me, it seems that when ZMQ sends a message to all
the connected subscribers, it occasionally forgets to send the first part
of the message to one of them.
This happens rarely. On this specific publisher, it happened only once
during a 64-hour recording session!
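On the subscriber side, the GUI reads the heartbeat part by part and uses
ZMQ_RCVMORE to detect the end of the multi-part message, roughly like the
following simplified sketch (again with placeholder endpoint and no real
error handling, not our actual GUI code):

#include <zmq.h>
#include <stdio.h>
#include <string.h>

int main (void)
{
    void *ctx = zmq_init (1);
    void *sub = zmq_socket (ctx, ZMQ_SUB);
    zmq_connect (sub, "tcp://l-c32-1:5556");   /* placeholder endpoint */

    char filter [] =
        "tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat";
    zmq_setsockopt (sub, ZMQ_SUBSCRIBE, filter, strlen (filter));

    while (1) {
        char buf [256];
        int more = 0;
        size_t more_size = sizeof more;
        int parts = 0;

        /* Read every frame of the incoming multi-part message */
        do {
            if (zmq_recv (sub, buf, sizeof buf, 0) < 0)
                return 1;                      /* error handling elided */
            parts++;
            zmq_getsockopt (sub, ZMQ_RCVMORE, &more, &more_size);
        } while (more);

        if (parts != 3)
            printf ("malformed heartbeat: %d frame(s) instead of 3\n", parts);
    }
}

(I assume that, since subscription matching is done against the first frame,
a heartbeat arriving without its topic frame is either dropped by the filter
or delivered malformed; in both cases the GUI misses that heartbeat.)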
Are you aware of this kind of problem?
Is there a chance that this problem will disappear when we upgrade our
production system to 3.2?
Due to the complexity of the system set-up (150 publishers running on
150 different hosts) and to the rare occurrence of the problem, we are
not (yet?) able to provide a simple test case which reproduces the problem.
Thanks for your answers and help,
Emmanuel