[zeromq-dev] PUB/SUB unreliability
Charles Remes
lists at chuckremes.com
Sat Jun 14 22:09:18 CEST 2014
Let’s back up for a second.
Take a look at the man page for zmq_setsockopt and read the section on ZMQ_SNDHWM. First, it clearly states that zero means “no limit.” Second, it states that when the socket reaches its exceptional state it will either block or drop messages, depending on the socket type.
Next, look at the man page for zmq_socket and check the ZMQ_PUB section. The socket reaches its mute state (its exceptional state) when it reaches its high water mark. While it’s mute, it drops messages.
So, taking the two together, a socket with a ZMQ_SNDHWM of 0 should never drop messages, because it will never reach its mute state.
The one exception to this is when there are no SUB sockets connected to the PUB socket. When there are no connections, all messages are dropped (because no one is listening and there are no queues created).
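
To make the rules above concrete, here is a minimal sketch of them as a toy model in plain Python. It is not libzmq and does not use any ZeroMQ API: the class ToyPubSocket, its methods, and the helper run() are hypothetical names invented for illustration, and the HWM of 1000 is just an example value. It only encodes the three rules as I read them from the man pages: one queue per connected subscriber, drop when a queue is at the high water mark, and drop everything when nobody is connected.

```python
from collections import deque


class ToyPubSocket:
    """Toy model of the PUB rules described above. NOT libzmq."""

    def __init__(self, sndhwm=1000):
        self.sndhwm = sndhwm  # 0 means "no limit"
        self.queues = {}      # one outgoing queue per connected subscriber
        self.dropped = 0      # messages that were silently discarded

    def connect_subscriber(self, name):
        # A queue only exists once a subscriber has connected.
        self.queues[name] = deque()

    def send(self, msg):
        if not self.queues:
            # No subscribers connected: nobody is listening, so drop.
            self.dropped += 1
            return
        for queue in self.queues.values():
            if self.sndhwm != 0 and len(queue) >= self.sndhwm:
                # Mute state for this peer: the message is dropped.
                self.dropped += 1
            else:
                queue.append(msg)


def run(sndhwm, n_messages):
    """Send to a subscriber that never drains its queue (assumed worst case)."""
    pub = ToyPubSocket(sndhwm)
    pub.send("sent before any subscriber connected")
    pub.connect_subscriber("slow-sub")
    for i in range(n_messages):
        pub.send(i)
    return pub.dropped, len(pub.queues["slow-sub"])


if __name__ == "__main__":
    print("HWM 1000:", run(1000, 5000))
    print("HWM 0:   ", run(0, 5000))
```

In this model, an HWM of 0 loses only the message sent before the subscriber connected, while every other message stays queued in the publisher's memory. That is the trade-off discussed next.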
However, I highly recommend *against* setting HWM to 0 for a PUB socket. Here’s why:
1. It gives you a false sense of security that all messages will be delivered.
If the publishing process dies, any messages in queue go with it so they’ll never get delivered.
2. Your subscribers might be too slow.
If your subscribers can’t keep up with the message flow and the publisher starts queueing, it *will* run out of memory. You’ll either exhaust the amount of memory allowed by your process, or your OS will start paging & swapping and you’ll wish the process had just died.
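
As a rough back-of-the-envelope sketch of point 2, the following estimates how fast an unbounded queue grows. The function names and the send/receive rates are hypothetical; the 60-byte message size is taken from the rc=60 in the zmq_pub output quoted below. It counts payload bytes only and ignores any per-message overhead inside the library, so real memory use would be higher.

```python
def queue_growth(pub_rate, sub_rate, msg_bytes, seconds):
    """Return (messages queued, payload bytes held) after `seconds`.

    Rates are in messages per second. Assumes constant rates and an
    unbounded queue (HWM of 0).
    """
    backlog = max(0, pub_rate - sub_rate) * seconds
    return backlog, backlog * msg_bytes


def seconds_until_full(pub_rate, sub_rate, msg_bytes, memory_budget_bytes):
    """Return seconds until the backlog fills the budget, or None if never."""
    surplus = pub_rate - sub_rate
    if surplus <= 0:
        return None  # the subscriber keeps up, so nothing accumulates
    return memory_budget_bytes / (surplus * msg_bytes)


if __name__ == "__main__":
    # Hypothetical rates: publisher at 200k msg/s, subscriber at 150k msg/s.
    msgs, nbytes = queue_growth(200_000, 150_000, 60, 3600)
    print(f"after one hour: {msgs} messages, {nbytes / 1e9:.1f} GB of payload")
    print("seconds to fill 8 GB:", seconds_until_full(200_000, 150_000, 60, 8 * 10**9))
```

Even a subscriber that is only 25% slower than the publisher leaves tens of millions of messages in the publisher's memory within an hour under these assumed rates.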
cr
On Jun 13, 2014, at 5:34 PM, Gerry Steele <gerry.steele at gmail.com> wrote:
> Hi Brian
>
> I noticed your comment on another thread about this and I think you got it a bit wrong:
>
> > The high water mark is a hard limit on the maximum number of outstanding messages ØMQ shall queue in memory for any single peer that the specified socket is communicating with. A value of zero means no limit.
>
> and from your link:
>
> > Since v3.x, ØMQ forces default limits on its internal buffers (the so-called high-water mark or HWM), so publisher crashes are rarer unless you deliberately set the HWM to infinite.
>
> Nothing I have read indicates anything other than that no messages should be dropped once connections have been made.
>
> Thanks
> G
>
>
>
> On 13 June 2014 23:17, Brian Knox <bknox at digitalocean.com> wrote:
> "From what I've read, PUB/SUB should be reliable when the _HWM options are set to zero (don't drop). By reliable I mean no messages should fail to be delivered to an already connected consumer."
>
>
> Your understanding of pub-sub behavior and how it interacts with the HWM is incorrect. Please see: http://zguide.zeromq.org/php:chapter5
>
> Brian
>
>
>
>
> On Fri, Jun 13, 2014 at 2:33 PM, Gerry Steele <gerry.steele at gmail.com> wrote:
> I've read everything I can find, including the printed book, but I am at a loss for a definitive description of how PUB/SUB should behave in zmq.
>
> A production system I'm using is experiencing message loss between several nodes using PUB/SUB.
>
> From what I've read, PUB/SUB should be reliable when the _HWM options are set to zero (don't drop). By reliable I mean no messages should fail to be delivered to an already connected consumer.
>
> I implemented some utilities to reproduce the message loss in my system:
>
> zmq_sub: https://gist.github.com/easytiger/992b3a29eb5c8545d289
> zmq_pub: https://gist.github.com/easytiger/e382502badab49856357
>
>
> zmq_pub takes the number of events to send and the logging frequency; zmq_sub takes only the logging frequency. zmq_sub prints the number of messages received next to the packet contents, which contain the integer packet count from the publisher.
>
> When sending events in a tight loop, messages simply go missing midway through (the loss is not at the beginning or end, which rules out slow connectors etc.).
>
> In a small loop it usually works ok:
>
> $ ./zmq_pub 2000 1000
> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #1000 with rc=58
> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #2000 with rc=58
>
> $ ./zmq_sub 1
>
> RECV:1|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #1
> RECV:2|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #2
> RECV:3|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #3
> RECV:4|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #4
> [...]
> RECV:2000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #2000
>
> You can see every message was received, as the counts align.
>
> However, increase the message count and messages start going missing:
>
> $ ./zmq_pub 200000 100000
> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #100000 with rc=60
> sent MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #200000 with rc=60
>
> $ ./zmq_sub 10000
> RECV:10000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #11000
> RECV:20000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #21000
> RECV:30000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #31610
> RECV:40000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #42000
> RECV:50000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #52524
> RECV:60000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #64654
> RECV:70000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #77298
> RECV:80000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #90117
> RECV:90000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #102864
> RECV:100000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #115846
> RECV:110000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #129135
> RECV:120000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #141606
> RECV:130000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #154179
> RECV:140000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #166627
> RECV:150000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #179166
> RECV:160000|MESSAGE PAYLOAD OF A NONTRIVAL SIZE KIND OF AND SUCH #192247
>
>
> Is this expected behaviour? With PUSH/PULL I get no loss at all with similar utilities.
>
> If I put more work between sends (e.g. cout the full message each time), the results are better.
>
> zmq_push: https://gist.github.com/easytiger/2c4f806594ccfbc74f54
> zmq_pull: https://gist.github.com/easytiger/268a630fd22f959fde93
>
> Is there an issue/bug in my implementation that would cause this?
>
> Using zeromq 4.0.3
>
> Many Thanks
> Gerry
>
>
>
>
>
> --
> Gerry Steele
>
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>
>