[zeromq-dev] PUB/SUB missed initial message even using monitor events (possibly just #2267, but not sure)
Jason Heeris
jason.heeris at gmail.com
Sun Feb 5 03:14:54 CET 2023
Some time ago I encountered an issue using PUB/SUB sockets for some
integration tests I was writing. For context let me say that yes, I
know that PUB/SUB semantics are explicitly not (by themselves) about
reliable message delivery. Indeed in my real services (what I'm
testing), it's not something I rely on. In my tests, I wanted to kind
of "imitate" having long-running processes already established, but
also did not want to actually have tests that took minutes or hours to
run. What motivates this question is simply that I want to know what's
going on under the hood.
So certain tests took the form of: start up two processes, one pub,
one sub. Use socket monitor events or a "side-channel" (like a pipe)
to synchronise them on when the sockets are ready (e.g. when the
subscriber is connected and subscribed, when the publisher has bound).
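To make the synchronisation step concrete, here is a minimal sketch of
the ordering only. It does not use ZeroMQ at all: two threads stand in
for the two processes, std::sync::mpsc channels stand in for the
side-channel, and the names Ready and run_sync are hypothetical (they
are not taken from the linked repository). The comments mark where real
code would bind, connect/subscribe and send.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical readiness signals each side reports over the side-channel.
#[derive(Debug, PartialEq)]
enum Ready {
    PublisherBound,
    SubscriberSubscribed,
}

/// Runs both "processes" as threads and returns the steps in the order
/// the harness observed them.
fn run_sync() -> Vec<&'static str> {
    let (ready_tx, ready_rx) = mpsc::channel::<Ready>();
    let (go_tx, go_rx) = mpsc::channel::<()>();
    let (log_tx, log_rx) = mpsc::channel::<&'static str>();

    let pub_ready = ready_tx.clone();
    let pub_log = log_tx.clone();
    let publisher = thread::spawn(move || {
        // Real code would bind the PUB socket here.
        pub_ready.send(Ready::PublisherBound).unwrap();
        // Block until the harness says both sides are ready.
        go_rx.recv().unwrap();
        // Real code would send the first message here.
        pub_log.send("first message sent").unwrap();
    });

    let subscriber = thread::spawn(move || {
        // Real code would connect the SUB socket and subscribe here.
        ready_tx.send(Ready::SubscriberSubscribed).unwrap();
    });

    // Harness: wait for both readiness signals, in whichever order they arrive.
    let mut seen = Vec::new();
    for _ in 0..2 {
        seen.push(ready_rx.recv().unwrap());
    }
    assert!(seen.contains(&Ready::PublisherBound));
    assert!(seen.contains(&Ready::SubscriberSubscribed));

    log_tx.send("both sides ready").unwrap();
    go_tx.send(()).unwrap();

    publisher.join().unwrap();
    subscriber.join().unwrap();
    drop(log_tx); // close the log so the iterator below terminates
    log_rx.iter().collect()
}

fn main() {
    for step in run_sync() {
        println!("{step}");
    }
}
```

This only shows the ordering that the side-channel enforces. It says
nothing about whether libzmq has finished exchanging subscription
information by the time the publisher is released.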
Now we can pretend we're testing PUB/SUB between established
processes. Except that sometimes, these particular tests would hang in
CI. What was really happening under the hood was that some tests would
fail to receive the first message in the test. After a lot of digging,
I found I could reproduce the issue only by running under docker with
settings that force it to run (and be pinned to) a single CPU. I
actually went so far as to intercept and synchronise on monitor
events, specifically "HANDSHAKE_SUCCEEDED". Still missed the first
message.
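For reference, a sketch of recognising that event in a raw monitor
frame, assuming the version 1 monitor format documented for
zmq_socket_monitor (first frame: a 16-bit event number followed by a
32-bit event value, both in native byte order). The constant is the
value of ZMQ_EVENT_HANDSHAKE_SUCCEEDED from zmq.h; the struct and
function names are hypothetical, and this is not the code from the
reproduction repository. In real code the frame would be read from a
PAIR socket connected to the monitor endpoint.

```rust
/// Value of ZMQ_EVENT_HANDSHAKE_SUCCEEDED in libzmq's zmq.h.
const ZMQ_EVENT_HANDSHAKE_SUCCEEDED: u16 = 0x1000;

/// Hypothetical decoded form of the first frame of a monitor event.
#[derive(Debug, PartialEq)]
struct MonitorEvent {
    id: u16,
    value: u32,
}

/// Decodes the 6-byte first frame of a version 1 monitor event.
/// Returns None if the frame does not have the expected length.
fn parse_monitor_frame(frame: &[u8]) -> Option<MonitorEvent> {
    if frame.len() != 6 {
        return None;
    }
    let id = u16::from_ne_bytes([frame[0], frame[1]]);
    let value = u32::from_ne_bytes([frame[2], frame[3], frame[4], frame[5]]);
    Some(MonitorEvent { id, value })
}

/// True only if the frame decodes to a HANDSHAKE_SUCCEEDED event.
fn is_handshake_succeeded(frame: &[u8]) -> bool {
    parse_monitor_frame(frame).map_or(false, |e| e.id == ZMQ_EVENT_HANDSHAKE_SUCCEEDED)
}

fn main() {
    // Build a frame the way libzmq lays it out: event id, then value.
    let mut frame = Vec::new();
    frame.extend_from_slice(&ZMQ_EVENT_HANDSHAKE_SUCCEEDED.to_ne_bytes());
    frame.extend_from_slice(&0u32.to_ne_bytes());

    if let Some(event) = parse_monitor_frame(&frame) {
        println!("event id {:#06x}, value {}", event.id, event.value);
    }
    println!("handshake succeeded: {}", is_handshake_succeeded(&frame));
}
```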
I eventually pared my code down to a smallish (~300 line) example. It
uses Rust's ZMQ bindings (0.10), you need Docker to reproduce it
reliably, and it's up on GitLab with instructions:
https://gitlab.com/detly/zeromq-mre
Specifically, I can run and reproduce this on my system with Rust
stable 1.66.1, zmq crate 0.10 which uses libzmq 4.3.4, Ubuntu 22.10,
Docker 20.10.23 (but only with the CPU pinning mentioned). I can also
trigger it on an MT7628 (single core, single thread, embedded/router
CPU using mipsel_24kc arch), but not every time.
Of course, AFTER writing this all and posting it, I found a couple of
other interesting discussions. (I absolutely swear I searched for this
and didn't see these until last week.) First is this mailing list post
from December 2022:
https://lists.zeromq.org/pipermail/zeromq-dev/2022-December/033802.html
Second is the issue it links to:
https://github.com/zeromq/libzmq/issues/2267
Is my example code simply demonstrating this known issue? On the
surface it certainly looks like it. The only thing that makes me
sceptical is that I do wait for the handshake exchange to complete
before proceeding. Doesn't that imply that the necessary "one
extra poll/socket action/whatever" is being performed, which should be
enough to exchange subscription information? Or is that an
oversimplified understanding of what's needed?
I'd appreciate any insight into this. As I said, in my real code, it
doesn't matter. I just want to satisfy my curiosity now.
Cheers,
Jason