[zeromq-dev] PUB/SUB missed initial message even using monitor events (possibly just #2267, but not sure)

Bill Torpey wallstprog at gmail.com
Mon Feb 6 16:22:43 CET 2023

Hi Jason:

> Is my example code simply demonstrating this known issue? 

If you have pub sockets connecting to subs, then probably yes.  The issue is that the HANDSHAKE_SUCCEEDED event only tells you that the sockets have connected, not that subscription information has been exchanged.

When a sub connects to a pub, the sub’s subscriptions are sent along with the connection request.  So when the connection is made, the pub already has the sub’s subscriptions.

When a pub connects to a sub, there is an additional exchange following the connect for the sub to send its subscription information to the pub, which takes additional time and in any event is asynchronous to the publishing of messages.
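
To make the ordering concrete, here is a toy model in plain Python. It does not use ZeroMQ at all: the ToyPublisher class and its methods are hypothetical names invented for this sketch, and it only assumes what is described above, namely that the pub filters on its own side and learns the sub's subscriptions some time after the connection is made.

```python
from collections import deque


class ToyPublisher:
    """Toy stand-in for a PUB socket that filters on the publisher side."""

    def __init__(self):
        self.connected = False      # set once the handshake finishes
        self.subscriptions = set()  # topic prefixes the pub has learned so far
        self.in_flight = deque()    # subscriptions sent but not yet processed
        self.delivered = []         # messages that actually reach the sub

    def handshake(self, subscriber_topics):
        # Models HANDSHAKE_SUCCEEDED: the peers are connected and the sub has
        # queued its subscriptions, but the pub has not read them yet.
        self.connected = True
        self.in_flight.extend(subscriber_topics)

    def process_io(self):
        # Models the later, asynchronous step in which the pub side
        # reads the queued subscriptions.
        while self.in_flight:
            self.subscriptions.add(self.in_flight.popleft())

    def publish(self, message):
        # A message is silently dropped unless a known prefix matches it.
        if any(message.startswith(prefix) for prefix in self.subscriptions):
            self.delivered.append(message)
            return True
        return False


pub = ToyPublisher()
pub.handshake(["status"])
first = pub.publish("status: first")    # sent right after the handshake
pub.process_io()
second = pub.publish("status: second")  # sent after the subscriptions arrive
print(first, second, pub.delivered)
```

In this model the first message is dropped even though the handshake has completed, which matches the symptom reported in the original post.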

This is a long-standing issue with ZeroMQ and there’s no way around it.  If you can arrange to have subs connect to pubs, life will be much simpler.

Also, it’s not a great idea to use the socket monitor for “command and control”: for one thing, the notifications can lag the actual events, and that can be tricky to program for.  If you need that kind of visibility and control, you’re probably better off implementing your own protocol using ZeroMQ as the transport.  Lots of projects have done that, two examples being Zyre and OZ (https://github.com/nyfix/OZ), which I created.  You may be able to use one of these “out of the box”, or get some ideas from the problems they’ve had to solve.

Hope this helps.

Bill Torpey

> On Feb 4, 2023, at 9:14 PM, Jason Heeris <jason.heeris at gmail.com> wrote:
>
> Some time ago I encountered an issue using PUB/SUB sockets for some
> integration tests I was writing. For context let me say that yes, I
> know that PUB/SUB semantics are explicitly not (by themselves) about
> reliable message delivery. Indeed in my real services (what I'm
> testing), it's not something I rely on. In my tests, I wanted to kind
> of "imitate" having long-running processes already established, but
> also did not want to actually have tests that took minutes or hours to
> run. What motivates this question is simply that I want to know what's
> going on under the hood.
>
> So certain tests took the form of: start up two processes, one pub,
> one sub. Use socket monitor events or a "side-channel" (like a pipe)
> to synchronise them on when the sockets are ready (eg. when the
> subscriber is connected and subscribed, when the publisher has bound).
> Now we can pretend we're testing PUB/SUB between established
> processes. Except that sometimes, these particular tests would hang in
> CI. What was really happening under the hood was that some tests would
> fail to receive the first message in the test. After a lot of digging,
> I found I could reproduce the issue only by running under docker with
> settings that force it to run (and be pinned to) a single CPU. I
> actually went so far as to intercept and synchronise on monitor
> events, specifically "HANDSHAKE_SUCCEEDED". Still missed the first
> message.
>
> I eventually pared my code down to a smallish (~300 line) example. It
> uses Rust's ZMQ bindings (0.10), you need Docker to reliably reproduce
> it, it's up on Gitlab with instructions:
> https://gitlab.com/detly/zeromq-mre
>
> Specifically, I can run and reproduce this on my system with Rust
> stable 1.66.1, zmq crate 0.10 which uses libzmq 4.3.4, Ubuntu 22.10,
> Docker 20.10.23 (but only with the CPU pinning mentioned). I can also
> trigger it on an MT7628 (single core, single thread, embedded/router
> CPU using mipsel_24kc arch), but not every time.
>
> Of course, AFTER writing this all and posting it, I found a couple of
> other interesting discussions. (I absolutely swear I searched for this
> and didn't see these until last week.) First is this mailing list post
> from December 2022:
> https://lists.zeromq.org/pipermail/zeromq-dev/2022-December/033802.html
> Second is the issue it links to:
> https://github.com/zeromq/libzmq/issues/2267
>
> Is my example code simply demonstrating this known issue? On the
> surface it certainly looks like it, the only thing that makes me
> sceptical is that I do wait for the handshake exchange to complete
> before proceeding, and doesn't that imply that the necessary "one
> extra poll/socket action/whatever" is being performed, which should be
> enough to exchange subscription information? Or is that an
> oversimplified understanding of what's needed?
>
> I'd appreciate any insight into this. As I said, in my real code, it
> doesn't matter. I just want to satisfy my curiosity now.
> Cheers,
> Jason