[zeromq-dev] ZMQ (3.1.0) loosing first part message in multi-part message

Emmanuel TAUREL taurel at esrf.fr
Tue Mar 26 14:02:49 CET 2013


Hi Pieter,

I have done more investigation on this problem.
It seems that the problem always happens after a subscriber shutdown. On 
the publisher, I see packet containing "" arriving from the killed 
subscriber
and the next multi-part message sent by ZMQ is wrong. Not always but 
very often.
 From this remark, I was able to re-produce the problem in a small 
development environment.
Upgrading the development environment to ZMQ 3.2.2 solves the problem 
which is a good news.

Another reason to move our production system to ZMQ 3.2.2.

Thank's for your answer.

Emmanuel

On 25/03/2013 10:30, Pieter Hintjens wrote:
> Hi Emmanuel,
>
> There were a number of issues with PUB sockets and multipart, e.g
> commit eb9bc1 from two years ago. But these changes are all in 3.1.
>
> So most likely this issue is still present in the current master. It
> may be related to the HWM, so you could try raising this on the
> production PUB side, and using a lower HWM when you try to reproduce
> the problem.
>
> -Pieter
>
>
> On Mon, Mar 25, 2013 at 9:32 AM, Emmanuel TAUREL <taurel at esrf.fr> wrote:
>> Hello all,
>>
>> Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still
>> using 3.1 in our production environment.
>> We are using the PUB/SUB pattern. The subscriber is a GUI given to our
>> users. There are several instances of this GUI running. There are around
>> 150 publishers running on different hosts. Every publisher process
>> publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB
>> pattern) these heartbeat messages to
>> every running subscribers (GUI). These heartbeat messages are multi-part
>> messages with 3 parts. Our problem is that "from time to time", the first
>> part of the multi-part message is not sent by ZMQ to some registered
>> subscriber!
>>
>> I have recorded all the network packets sent by some of our publishers
>> using the strace command (- e trace=network -o <file>).
>> Here are some lines extracted from this strace generated file where the
>> problem is clear.
>>
>> All heartbeat messages sent by the publisher to its socket 34 around
>> 07:41. I remind you that they are sent every 9 seconds.
>>
>> 4848  07:41:00.873685 send(34,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:09.875210 send(34,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:18.873082 send(34,
>> "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26
>> 4848  07:41:27.875234 send(34,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:36.874466 send(34,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:45.873737 send(34,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:54.873934 send(34,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>>
>> All heartbeat messages sent by the same publisher to its socket 60
>> around 07:41
>> 4848  07:41:00.872882 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:09.873237 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:18.873612 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0) = 89
>> 4848  07:41:27.874226 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0 <unfinished ...>
>> 4848  07:41:36.873452 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0 <unfinished ...>
>> 4848  07:41:45.872701 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0 <unfinished ...>
>> 4848  07:41:54.872940 send(60,
>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>> 89, 0 <unfinished ...>
>>
>> As you can see, for socket 34, the packet sent at 07:41:18 is not
>> correct. The first message part (the string finishing with .heartbeat)
>> is not transmitted.
>> But for socket 60, at the same date (07:41:18), it is correctly sent. To
>> me, it seems that when ZMQ sends its messages to all the connected
>> subscribers, it
>> from time to time forget to send the first part of the message to one
>> subscriber.
>> This happens rarely. On this specific publisher, it happens only once
>> during a 64 hours recording session!
>>
>> Are you aware of this kind of problem?
>> Is there a chance that this problem will disappear when we will upgrade
>> our production system to 3.2?
>>
>> Due to the complexity of the system set-up (150 publishers running on
>> 150 different hosts) and to the rare ocuurence of the problem, we are
>> not able (yet?) to provide a simple test case which reproduces the problem.
>>
>> Thank's for your answers and help
>>
>> Emmanuel
>>
>> _______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev





More information about the zeromq-dev mailing list