[zeromq-dev] PUB/SUB unreliability

Gerry Steele gerry.steele at gmail.com
Sun Jun 15 13:43:40 CEST 2014


Thanks Charles, that's pretty much my understanding too, which means this
is a bug either in my implementation or in zeromq.
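
To make that shared understanding concrete, here is a toy model of the
queueing rules as Charles describes them in the quoted reply. It is only a
sketch and not libzmq code: the class ToyPubSocket, its methods and the
run() helper are hypothetical names invented for illustration, and the
subscriber is assumed never to drain its queue.

```python
from collections import deque


class ToyPubSocket:
    """Toy model of the PUB-side rules described in the quoted reply.

    Hypothetical illustration only; this is NOT how libzmq is implemented.
    """

    def __init__(self, sndhwm):
        self.sndhwm = sndhwm   # 0 means "no limit", as in the ZMQ_SNDHWM docs
        self.queues = {}       # one outgoing queue per connected subscriber
        self.dropped = 0

    def connect(self, subscriber):
        self.queues[subscriber] = deque()

    def send(self, msg):
        if not self.queues:
            # No subscribers connected: no queue exists, the message is dropped.
            self.dropped += 1
            return
        for queue in self.queues.values():
            if self.sndhwm != 0 and len(queue) >= self.sndhwm:
                # This peer's queue is at its high water mark (mute state): drop.
                self.dropped += 1
            else:
                queue.append(msg)


def run(sndhwm, n_messages):
    """Send n_messages to one subscriber that never reads (assumed worst case)."""
    pub = ToyPubSocket(sndhwm)
    pub.connect("slow-subscriber")
    for i in range(n_messages):
        pub.send(i)
    return len(pub.queues["slow-subscriber"]), pub.dropped


if __name__ == "__main__":
    print(run(sndhwm=1000, n_messages=5000))  # (queued, dropped) with a limit
    print(run(sndhwm=0, n_messages=5000))     # (queued, dropped) with no limit
```

Under this model a limit of 1000 queues 1000 messages and drops the rest,
while a limit of 0 queues everything and drops nothing, which is why loss
with the HWM set to 0 points at a bug somewhere.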

I understand the implications of the slow consumer problem but the
fundamental issue here is to establish trust in PUB/SUB.
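
One way to put a number on the loss is to compare the subscriber's receive
count with the publisher's sequence number in each logged line. The sketch
below does that for lines in the format shown in my original post further
down. The function name cumulative_loss and the regular expression are my
own illustration, and it assumes both counters start at 1, as in the small
run.

```python
import re

# Matches subscriber log lines such as "RECV:10000|MESSAGE ... #11000".
LINE_RE = re.compile(r"RECV:(\d+)\|.*#(\d+)\s*$")


def cumulative_loss(log_lines):
    """Return (received, publisher_seq, missing) for each parsable log line."""
    rows = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip anything that is not a RECV line
        received, seq = int(match.group(1)), int(match.group(2))
        # Messages sent so far (seq) minus messages received so far.
        rows.append((received, seq, seq - received))
    return rows


# Three lines taken from the subscriber output quoted in this thread.
sample = [
    "RECV:10000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #11000",
    "RECV:30000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #31610",
    "RECV:160000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #192247",
]

if __name__ == "__main__":
    for received, seq, missing in cumulative_loss(sample):
        print(f"received={received} publisher_seq={seq} missing={missing}")
```

For the three sample lines the missing counts come out as 1000, 1610 and
32247, so the gap keeps growing over the course of the run.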


On 14 June 2014 21:09, Charles Remes <lists at chuckremes.com> wrote:

> Let’s back up for a second.
>
> Take a look at the man page for zmq_setsockopt and read the section on
> ZMQ_SNDHWM. It clearly states that zero means “no limit.” Second, it also
> states that when the socket reaches its exceptional state then it will
> either block or drop messages depending on socket type.
>
> Next, look at the man page for zmq_socket and check the ZMQ_PUB section.
> The socket will reach its mute state (its exceptional state) when it
> reaches its high water mark. When it’s mute, it will drop messages.
>
> So, taking the two together, a socket with a ZMQ_SNDHWM of 0 should
> never drop messages because it will never reach its mute state.
>
> The one exception to this is when there are no SUB sockets connected to
> the PUB socket. When there are no connections, all messages are dropped
> (because no one is listening and there are no queues created).
>
> However, I highly recommend *against* setting HWM to 0 for a PUB socket.
> Here’s why:
>
> 1. It gives you a false sense of security that all messages will be
> delivered.
> If the publishing process dies, any messages in queue go with it so
> they’ll never get delivered.
>
> 2. Your subscribers might be too slow.
> If your subscribers can’t keep up with the message flow and the publisher
> starts queueing, it *will* run out of memory. You’ll either exhaust the
> amount of memory allowed by your process, or your OS will start paging &
> swapping and you’ll wish the process had just died.
>
> cr
>
>
> On Jun 13, 2014, at 5:34 PM, Gerry Steele <gerry.steele at gmail.com> wrote:
>
> Hi Brian
>
> I noticed your comment on another thread about this and I think you got it
> a bit wrong:
>
> > The high water mark is a hard limit on the maximum number of
> outstanding messages ØMQ shall queue in memory for any single peer that the
> specified *socket* is communicating with. *A value of zero means no limit.*
>
> and from your link:
>
> > Since v3.x, ØMQ forces default limits on its internal buffers (the
> so-called high-water mark or HWM), so publisher crashes are rarer *unless
> you deliberately set the HWM to infinite.*
>
> Nothing I have read indicates anything other than that no messages sent
> after the connections have been made should be dropped.
>
> Thanks
> G
>
>
>
> On 13 June 2014 23:17, Brian Knox <bknox at digitalocean.com> wrote:
>
>> "From what I've read, PUB SUB should be reliable when the _HWM are set to
>> zero (don't drop). By reliable I mean no messages should fail to be
>> delivered to an already connected consumer."
>>
>>
>> Your understanding of pub-sub behavior and how it interacts with the HWM
>> is incorrect. Please see: http://zguide.zeromq.org/php:chapter5
>>
>> Brian
>>
>>
>>
>>
>> On Fri, Jun 13, 2014 at 2:33 PM, Gerry Steele <gerry.steele at gmail.com>
>> wrote:
>>
>>> I've read everything I can find, including the printed book, but I am at
>>> a loss for a definitive description of how PUB/SUB should behave in
>>> zmq.
>>>
>>> A production system I'm using is experiencing message loss between
>>> several nodes using PUB/SUB.
>>>
>>> From what I've read, PUB SUB should be reliable when the _HWM are set to
>>> zero (don't drop). By reliable I mean no messages should fail to be
>>> delivered to an already connected consumer.
>>>
>>> I implemented some utilities to reproduce the message loss in my system:
>>>
>>> zmq_sub: https://gist.github.com/easytiger/992b3a29eb5c8545d289
>>> zmq_pub: https://gist.github.com/easytiger/e382502badab49856357
>>>
>>>
>>> zmq_pub takes the number of events to send and the logging frequency;
>>> zmq_sub takes only the logging frequency. zmq_sub prints the number of
>>> messages received alongside the message contents, which contain the
>>> integer message count from the publisher.
>>>
>>> When sending events in a tight loop, messages simply go missing midway
>>> through (the loss is not at the beginning or the end, which rules out
>>> slow connectors etc.).
>>>
>>> In a small loop it usually works ok:
>>>
>>> $ ./zmq_pub 2000 1000
>>> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #1000 with
>>> rc=58
>>> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #2000 with
>>> rc=58
>>>
>>> $ ./zmq_sub 1
>>>
>>> RECV:1|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #1
>>> RECV:2|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #2
>>> RECV:3|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #3
>>> RECV:4|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #4
>>> [...]
>>> RECV:2000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #2000
>>>
>>> You can see every message was received, as the counts align.
>>>
>>> However, increase the message count and messages start going missing:
>>>
>>> $ ./zmq_pub 200000 100000
>>>
>>> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #100000 with
>>> rc=60
>>> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #200000 with
>>> rc=60
>>>
>>> $ ./zmq_sub 10000
>>> RECV:10000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #11000
>>> RECV:20000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #21000
>>> RECV:30000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #31610
>>> RECV:40000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #42000
>>> RECV:50000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #52524
>>> RECV:60000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #64654
>>> RECV:70000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #77298
>>> RECV:80000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #90117
>>> RECV:90000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #102864
>>> RECV:100000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #115846
>>> RECV:110000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #129135
>>> RECV:120000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #141606
>>> RECV:130000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #154179
>>> RECV:140000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #166627
>>> RECV:150000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #179166
>>> RECV:160000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #192247
>>>
>>>
>>> Is this expected behaviour? With PUSH/PULL I get no loss at all with
>>> similar utilities.
>>>
>>> If I put more work between sends (e.g. writing the full message to cout
>>> each time), the results are better.
>>>
>>> zmq_push: https://gist.github.com/easytiger/2c4f806594ccfbc74f54
>>> zmq_pull: https://gist.github.com/easytiger/268a630fd22f959fde93
>>>
>>> Is there an issue/bug in my implementation that would cause this?
>>>
>>> Using zeromq 4.0.3
>>>
>>> Many Thanks
>>> Gerry
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Gerry Steele
>>>
>>>
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>>
>>
>
>
> --
> Gerry Steele
>


-- 
Gerry Steele