[zeromq-dev] ZMQ (3.1.0) loosing first part message in multi-part message

Pieter Hintjens ph at imatix.com
Mon Mar 25 10:30:19 CET 2013


Hi Emmanuel,

There were a number of issues with PUB sockets and multipart, e.g
commit eb9bc1 from two years ago. But these changes are all in 3.1.

So most likely this issue is still present in the current master. It
may be related to the HWM, so you could try raising this on the
production PUB side, and using a lower HWM when you try to reproduce
the problem.

-Pieter


On Mon, Mar 25, 2013 at 9:32 AM, Emmanuel TAUREL <taurel at esrf.fr> wrote:
> Hello all,
>
> Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still
> using 3.1 in our production environment.
> We are using the PUB/SUB pattern. The subscriber is a GUI given to our
> users. There are several instances of this GUI running. There are around
> 150 publishers running on different hosts. Every publisher process
> publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB
> pattern) these heartbeat messages to
> every running subscribers (GUI). These heartbeat messages are multi-part
> messages with 3 parts. Our problem is that "from time to time", the first
> part of the multi-part message is not sent by ZMQ to some registered
> subscriber!
>
> I have recorded all the network packets sent by some of our publishers
> using the strace command (- e trace=network -o <file>).
> Here are some lines extracted from this strace generated file where the
> problem is clear.
>
> All heartbeat messages sent by the publisher to its socket 34 around
> 07:41. I remind you that they are sent every 9 seconds.
>
> 4848  07:41:00.873685 send(34,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:09.875210 send(34,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:18.873082 send(34,
> "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26
> 4848  07:41:27.875234 send(34,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:36.874466 send(34,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:45.873737 send(34,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:54.873934 send(34,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
>
> All heartbeat messages sent by the same publisher to its socket 60
> around 07:41
> 4848  07:41:00.872882 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:09.873237 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:18.873612 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0) = 89
> 4848  07:41:27.874226 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0 <unfinished ...>
> 4848  07:41:36.873452 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0 <unfinished ...>
> 4848  07:41:45.872701 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0 <unfinished ...>
> 4848  07:41:54.872940 send(60,
> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
> 89, 0 <unfinished ...>
>
> As you can see, for socket 34, the packet sent at 07:41:18 is not
> correct. The first message part (the string finishing with .heartbeat)
> is not transmitted.
> But for socket 60, at the same date (07:41:18), it is correctly sent. To
> me, it seems that when ZMQ sends its messages to all the connected
> subscribers, it
> from time to time forget to send the first part of the message to one
> subscriber.
> This happens rarely. On this specific publisher, it happens only once
> during a 64 hours recording session!
>
> Are you aware of this kind of problem?
> Is there a chance that this problem will disappear when we will upgrade
> our production system to 3.2?
>
> Due to the complexity of the system set-up (150 publishers running on
> 150 different hosts) and to the rare ocuurence of the problem, we are
> not able (yet?) to provide a simple test case which reproduces the problem.
>
> Thank's for your answers and help
>
> Emmanuel
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev



More information about the zeromq-dev mailing list