[zeromq-dev] PUB/SUB missed initial message even using monitor events (possibly just #2267, but not sure)

Bill Torpey wallstprog at gmail.com
Wed Feb 8 15:09:41 CET 2023



> On Feb 7, 2023, at 8:54 PM, Jason Heeris <jason.heeris at gmail.com> wrote:
> 
> On Wed, 8 Feb 2023 at 05:41, Bill Torpey <wallstprog at gmail.com> wrote:
>> Btw, I’m assuming connection-oriented (e.g., TCP) transport here.  Semantics
>> could be very different w/other mechanisms.
> 
> No, IPC (ie. Unix sockets in the abstract namespace eg. "ipc://@zmq-test", see
> eg. subscribe.rs:6). I actually didn't realise the semantics were so different
> between IPC and TCP, so apologies for not mentioning it in the first place.

I haven’t used ipc much at all, so I can’t really help there.  It should be pretty easy to replace the ipc endpoints with tcp ones and see whether the behavior changes.

> 
>> On Feb 6, 2023, at 8:10 PM, Jason Heeris <jason.heeris at gmail.com> wrote:
>> So, the initial connect should timeout — correct?
> 
> At the level my code is using the API, the connect call returns immediately and
> the connection is eventually done asynchronously. I assume this is because I set
> the connect timeout to 0 and then as you say, zmq handles it asynchronously.
> 
> Based on the details you explained, it does actually sound like this is simply
> #2267, partly obscured by leaning heavily on the asynchronous behaviour around
> connect/bind.

Yes, connect will return immediately, but if you have a socket monitor running you would see a series of events like this:

13:22:23.486050	socket:0x7f2f7c0008c0 name:dataPub value:12 event:8 desc:LISTENING endpoint:tcp://127.0.0.1:22690
13:22:23.486446	socket:0x7f2f7c01bac0 name:proxySub value:115 event:2 desc:CONNECT_DELAYED endpoint:tcp://127.0.0.1:5555
13:22:23.486481	socket:0x7f2f7c01bac0 name:proxySub value:23 event:1 desc:CONNECTED endpoint:tcp://127.0.0.1:5555
13:22:23.486594	socket:0x7f2f7c01bac0 name:proxySub value:0 event:4096 desc:HANDSHAKE_SUCCEEDED endpoint:tcp://127.0.0.1:5555
…
13:23:24.555382	socket:0x7f2f7c009780 name:dataSub value:25 event:512 desc:DISCONNECTED endpoint:tcp://127.0.0.1:22690
13:23:24.555504	socket:0x7f2f7c009780 name:dataSub value:119 event:4 desc:CONNECT_RETRIED endpoint:tcp://127.0.0.1:22690


With TCP, the first underlying connect attempt will fail with ECONNREFUSED if nothing is listening on the endpoint yet, and then zmq will start retrying.  No idea what happens with ipc.
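
For reference, the event numbers in the monitor output above are libzmq's ZMQ_EVENT_* bit flags from zmq.h. Per the zmq_socket_monitor documentation, each (version 1) monitor message has two frames: a 16-bit event number followed by a 32-bit value, then the endpoint string. The value depends on the event (a file descriptor for CONNECTED, LISTENING and DISCONNECTED, the reconnect interval in milliseconds for CONNECT_RETRIED). What follows is only a minimal sketch of decoding such a message using the Python standard library. The function name decode_monitor_event and the table name ZMQ_EVENTS are made up for illustration and are not part of any zmq binding.

```python
import struct

# Event numbers as defined in libzmq's zmq.h (ZMQ_EVENT_* bit flags).
ZMQ_EVENTS = {
    0x0001: "CONNECTED",
    0x0002: "CONNECT_DELAYED",
    0x0004: "CONNECT_RETRIED",
    0x0008: "LISTENING",
    0x0010: "BIND_FAILED",
    0x0020: "ACCEPTED",
    0x0040: "ACCEPT_FAILED",
    0x0080: "CLOSED",
    0x0100: "CLOSE_FAILED",
    0x0200: "DISCONNECTED",
    0x0400: "MONITOR_STOPPED",
    0x0800: "HANDSHAKE_FAILED_NO_DETAIL",
    0x1000: "HANDSHAKE_SUCCEEDED",
    0x2000: "HANDSHAKE_FAILED_PROTOCOL",
    0x4000: "HANDSHAKE_FAILED_AUTH",
}


def decode_monitor_event(frame1: bytes, frame2: bytes):
    """Decode the two frames of a version 1 socket monitor message.

    Hypothetical helper: returns (event_name, value, endpoint).
    """
    # First frame: 16-bit event number, then 32-bit value,
    # native byte order, no padding (6 bytes in total).
    event, value = struct.unpack("=HI", frame1)
    name = ZMQ_EVENTS.get(event, f"UNKNOWN({event})")
    # Second frame: the affected endpoint as a string.
    return name, value, frame2.decode()
```

A monitor loop would read both frames from the inproc monitor socket and pass them to this helper. Seeing CONNECT_DELAYED and CONNECT_RETRIED before CONNECTED is the usual sign that the connecting side started before the binding side.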

> 
>> It looks like your code calls socket_monitor *after* the bind/connect calls —
>> it’s better to start monitoring immediately after the create in order to see
>> what is going on with the connect/bind calls.  I’m not a “rustacean” myself,
>> but it looks like you’re missing some events given the way you sequence the
>> calls to monitor.
> 
> Ah yes, you are right. I had doubled-down on looking at the handshake event and
> anything after, nothing earlier.
> 
>> BTW, I know this doesn’t answer your question as to why this is happening, but
>> a very helpful feature in zmq is the “welcome” msg — see here
>> (https://web.archive.org/web/20160208000728/http://somdoron.com/2015/09/reliable-pubsub/)
>> and here (https://github.com/somdoron/ReliablePubSub).   OZ uses this to know
>> for sure when a sub is connected to a pub.  You might also find some of this
>> info helpful:
>> https://github.com/nyfix/OZ/blob/master/doc/Reconnects-Heartbeats.md.
> 
> I have been looking at OZ with great interest actually, it's a good protocol! In
> my application I lean more towards the out-of-band snapshot ie. events have a
> sequence number, and there's another channel for getting the full initial state,
> ensuring there's no gap, etc. This works where the model is "have big app
> config, publish tiny updates after validation".

Yup, our application(s) do that as well, but that’s handled at a layer above OZ, using pub/sub.
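
To make the snapshot-plus-sequence-number idea concrete, here is a minimal sketch of the subscriber side, assuming updates are buffered from the sub socket while the snapshot is fetched over the other channel. The function name and the (seq, key, value) update layout are assumptions for illustration only; neither OZ nor the application discussed here is known to use this exact shape.

```python
def apply_snapshot_and_updates(snapshot_seq, snapshot_state, buffered_updates):
    """Merge an out-of-band snapshot with updates buffered from pub/sub.

    snapshot_seq:     sequence number of the last update in the snapshot
    snapshot_state:   dict holding the full initial state
    buffered_updates: iterable of (seq, key, value) in arrival order
                      (hypothetical message layout)

    Returns (state, last_seq); raises RuntimeError on a sequence gap.
    """
    state = dict(snapshot_state)
    last_seq = snapshot_seq
    for seq, key, value in buffered_updates:
        if seq <= last_seq:
            # Already reflected in the snapshot, or a duplicate: skip it.
            continue
        if seq != last_seq + 1:
            # An update was missed (e.g. published before the
            # subscription took effect): the caller should re-snapshot.
            raise RuntimeError(f"gap: expected {last_seq + 1}, got {seq}")
        state[key] = value
        last_seq = seq
    return state, last_seq
```

The point of the design is that a missed initial message shows up as a detectable gap rather than as silent data loss, so the subscriber can simply request a fresh snapshot.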

> 
> But note that this question is specifically in the context of tests. Although it
> seems against the grain, for *some* integration tests I try to use non-ZMQ
> out-of-band signalling, because I want to test eg. what happens a long time
> after that initial snapshot. So sort-of-reimplementing-but-not-quite a test
> version of that would defeat the point. For these specific tests, I've
> refactored them so that wherever pub/sub exchanges are involved, the pub is
> started and bound first. (Again, this is just for a subset of tests. For others,
> yes, a "real" protocol is good and useful.)
> 
> Thanks so much for the detail, there are some hard-won insights here I can carry
> over to my applications to improve them!
> 
> Cheers,
> Jason


