[zeromq-dev] ZMQ (3.1.0) loosing first part message in multi-part message
Pieter Hintjens
ph at imatix.com
Tue Mar 26 14:29:19 CET 2013
Hi Emmanuel,
This is great news. Version 3.3 will be able to talk to 3.1 peers.
-Pieter
On Tue, Mar 26, 2013 at 2:02 PM, Emmanuel TAUREL <taurel at esrf.fr> wrote:
> Hi Pieter,
>
> I have done more investigation on this problem.
> It seems that the problem always happens after a subscriber shutdown. On the
> publisher, I see packet containing "" arriving from the killed subscriber
> and the next multi-part message sent by ZMQ is wrong. Not always but very
> often.
> From this remark, I was able to re-produce the problem in a small
> development environment.
> Upgrading the development environment to ZMQ 3.2.2 solves the problem which
> is a good news.
>
> Another reason to move our production system to ZMQ 3.2.2.
>
> Thank's for your answer.
>
> Emmanuel
>
>
> On 25/03/2013 10:30, Pieter Hintjens wrote:
>>
>> Hi Emmanuel,
>>
>> There were a number of issues with PUB sockets and multipart, e.g
>> commit eb9bc1 from two years ago. But these changes are all in 3.1.
>>
>> So most likely this issue is still present in the current master. It
>> may be related to the HWM, so you could try raising this on the
>> production PUB side, and using a lower HWM when you try to reproduce
>> the problem.
>>
>> -Pieter
>>
>>
>> On Mon, Mar 25, 2013 at 9:32 AM, Emmanuel TAUREL <taurel at esrf.fr> wrote:
>>>
>>> Hello all,
>>>
>>> Due to the wire incompatibility between ZMQ 3.1 and 3.2, we are still
>>> using 3.1 in our production environment.
>>> We are using the PUB/SUB pattern. The subscriber is a GUI given to our
>>> users. There are several instances of this GUI running. There are around
>>> 150 publishers running on different hosts. Every publisher process
>>> publishes a heartbeat message every 9 seconds. ZMQ propagates (PUB/SUB
>>> pattern) these heartbeat messages to
>>> every running subscribers (GUI). These heartbeat messages are multi-part
>>> messages with 3 parts. Our problem is that "from time to time", the first
>>> part of the multi-part message is not sent by ZMQ to some registered
>>> subscriber!
>>>
>>> I have recorded all the network packets sent by some of our publishers
>>> using the strace command (- e trace=network -o <file>).
>>> Here are some lines extracted from this strace generated file where the
>>> problem is clear.
>>>
>>> All heartbeat messages sent by the publisher to its socket 34 around
>>> 07:41. I remind you that they are sent every 9 seconds.
>>>
>>> 4848 07:41:00.873685 send(34,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:09.875210 send(34,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:18.873082 send(34,
>>> "\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 26, 0) = 26
>>> 4848 07:41:27.875234 send(34,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:36.874466 send(34,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:45.873737 send(34,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:54.873934 send(34,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>>
>>> All heartbeat messages sent by the same publisher to its socket 60
>>> around 07:41
>>> 4848 07:41:00.872882 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:09.873237 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:18.873612 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0) = 89
>>> 4848 07:41:27.874226 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0 <unfinished ...>
>>> 4848 07:41:36.873452 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0 <unfinished ...>
>>> 4848 07:41:45.872701 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0 <unfinished ...>
>>> 4848 07:41:54.872940 send(60,
>>>
>>> ">\1tango://orion.esrf.fr:10000/dserver/starter/l-c32-1.heartbeat\2\1\1\26\0\1\0\0\0y\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0",
>>> 89, 0 <unfinished ...>
>>>
>>> As you can see, for socket 34, the packet sent at 07:41:18 is not
>>> correct. The first message part (the string finishing with .heartbeat)
>>> is not transmitted.
>>> But for socket 60, at the same date (07:41:18), it is correctly sent. To
>>> me, it seems that when ZMQ sends its messages to all the connected
>>> subscribers, it
>>> from time to time forget to send the first part of the message to one
>>> subscriber.
>>> This happens rarely. On this specific publisher, it happens only once
>>> during a 64 hours recording session!
>>>
>>> Are you aware of this kind of problem?
>>> Is there a chance that this problem will disappear when we will upgrade
>>> our production system to 3.2?
>>>
>>> Due to the complexity of the system set-up (150 publishers running on
>>> 150 different hosts) and to the rare ocuurence of the problem, we are
>>> not able (yet?) to provide a simple test case which reproduces the
>>> problem.
>>>
>>> Thank's for your answers and help
>>>
>>> Emmanuel
>>>
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>
>
More information about the zeromq-dev
mailing list