[zeromq-dev] ZMQ (3.1.0) losing first message part in multi-part message

Emmanuel TAUREL taurel at esrf.fr
Mon Mar 25 09:32:20 CET 2013


Hello all,

Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still
using 3.1 in our production environment.
We are using the PUB/SUB pattern. The subscriber is a GUI given to our
users, and several instances of this GUI are running. There are around
150 publishers running on different hosts. Every publisher process
publishes a heartbeat message every 9 seconds, and ZMQ propagates these
heartbeat messages to every running subscriber (GUI). The heartbeat
messages are multi-part messages with 3 parts. Our problem is that,
from time to time, the first part of the multi-part message is not sent
by ZMQ to one of the registered subscribers!

I have recorded all the network system calls made by some of our
publishers using the strace command (-e trace=network -o <file>).
Here are some lines extracted from the strace output file where the
problem is clearly visible.

All heartbeat messages sent by the publisher to its socket 34 around
07:41 (as a reminder, they are sent every 9 seconds):

4848  07:41:00.873685 send(34, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:09.875210 send(34, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:18.873082 send(34, 
"\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26
4848  07:41:27.875234 send(34, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:36.874466 send(34, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:45.873737 send(34, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:54.873934 send(34, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89

All heartbeat messages sent by the same publisher to its socket 60 
around 07:41
4848  07:41:00.872882 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:09.873237 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:18.873612 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0) = 89
4848  07:41:27.874226 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0 <unfinished ...>
4848  07:41:36.873452 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0 <unfinished ...>
4848  07:41:45.872701 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0 <unfinished ...>
4848  07:41:54.872940 send(60, 
">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 
89, 0 <unfinished ...>

As you can see, for socket 34, the data sent at 07:41:18 is not
correct: the first message part (the string ending with .heartbeat)
is not transmitted.
But for socket 60, at the same time (07:41:18), it is correctly sent. To
me, it seems that when ZMQ sends its messages to all the connected
subscribers, it occasionally forgets to send the first part of the
message to one subscriber.
This happens rarely. On this specific publisher, it happened only once
during a 64-hour recording session!
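
The frame boundaries in these payloads can be checked mechanically. The
following Python sketch is only an illustration. It assumes the framing
that the trace itself suggests: one length byte that counts the flags
byte plus the body, followed by a flags byte whose low bit means "more
parts follow". The helper name parse_frames is made up for this example,
and frames longer than 254 bytes are not handled.

```python
# Payloads copied from the strace lines above (strace prints raw bytes
# as octal escapes, which Python bytes literals also understand).
TOPIC = b"tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat"

# Second and third message parts, as seen in every send() above.
TAIL = b"\2\1\1" + b"\26\0" + b"\1\0\0\0" + b"y\0\0\0" + b"\1" + b"\0" * 12

GOOD = b">\1" + TOPIC + TAIL   # the usual 89-byte send
SHORT = TAIL                   # the 26-byte send at 07:41:18 on socket 34


def parse_frames(payload):
    """Split a payload into (more_flag, body) tuples.

    Assumed framing: <length> <flags> <body>, where length counts the
    flags byte plus the body and bit 0 of flags means "more parts follow".
    """
    frames = []
    pos = 0
    while pos < len(payload):
        length = payload[pos]
        if length == 0 or length == 0xFF:
            raise ValueError("unsupported length byte at offset %d" % pos)
        end = pos + 1 + length
        if end > len(payload):
            raise ValueError("truncated frame at offset %d" % pos)
        flags = payload[pos + 1]
        body = payload[pos + 2:end]
        frames.append((bool(flags & 1), body))
        pos = end
    return frames


for name, payload in (("89-byte send", GOOD), ("26-byte send", SHORT)):
    print(name, [(more, len(body)) for more, body in parse_frames(payload)])
```

Under these assumptions the 89-byte payload decodes into three parts
(61, 1 and 21 body bytes), while the 26-byte payload decodes into only
the last two, which matches a send that starts at the second part.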

Are you aware of this kind of problem?
Is there a chance that this problem will disappear when we upgrade
our production system to 3.2?

Due to the complexity of the system set-up (150 publishers running on 
150 different hosts) and to the rare occurrence of the problem, we are
not able (yet?) to provide a simple test case which reproduces the problem.
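
Because the problem is so rare, finding it in a long strace recording is
easier with a script than by eye. The Python sketch below is one
possible way to do it, not a tested tool. It assumes that each call
occupies one line in the strace output file (the line breaks in this
mail come from the mail client), that the full payload was recorded,
and that every heartbeat send should contain the topic suffix
".heartbeat". The names SEND_RE, unescape and suspicious_sends are
invented for this example.

```python
import re

# Assumed layout, one call per line:
#   <pid>  <time> send(<fd>, "<payload>", <len>, <flags>) = <result>
SEND_RE = re.compile(r'^(\d+)\s+(\S+)\s+send\((\d+), "((?:[^"\\]|\\.)*)"')


def unescape(text):
    """Turn strace's C-style escapes (octal, \\n, \\t, ...) back into bytes."""
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def suspicious_sends(lines, marker=b".heartbeat"):
    """Return (time, fd, payload_length) for send() calls lacking the marker."""
    hits = []
    for line in lines:
        match = SEND_RE.match(line)
        if match is None:
            continue  # not a send() line
        _pid, when, fd, raw = match.groups()
        payload = unescape(raw)
        if marker not in payload:
            hits.append((when, int(fd), len(payload)))
    return hits


# Two lines taken from the trace in this mail, each joined back onto one line.
sample = [
    r'4848  07:41:09.875210 send(34, ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 89, 0) = 89',
    r'4848  07:41:18.873082 send(34, "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26',
]

for when, fd, size in suspicious_sends(sample):
    print("send without topic part: time=%s fd=%d bytes=%d" % (when, fd, size))
```

To scan a real recording, pass an open file instead of the sample list,
for example suspicious_sends(open(path)). A hit is only a candidate:
several messages may be batched into one send() call, or one message may
be split across calls, so each hit still needs to be checked by hand.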

Thanks for your answers and help

Emmanuel



