[zeromq-dev] zeromq protocol_error handling

James Harvey jamesdillonharvey at gmail.com
Thu May 13 17:48:51 CEST 2021


Hi,

I have a rare, seemingly random bug that causes my ZMQ_SUB socket to fail
for a certain endpoint, with no way to detect it or be notified.  Yes, it
happens because a SUB connects to a REQ socket, but once you start to use
zeromq for lots of transient systems in a large company, this kind of
thing will happen occasionally.

The process happens like this:

  - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral port)
  - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
  - ZMQ_PUB goes down
  - An unrelated process (ZMQ_REQ) comes up and grabs the same
    1.2.3.4:44444 as its ephemeral port
  - ZMQ_SUB has not yet been told to disconnect, so it reconnects to the
    ZMQ_REQ
  - A protocol error occurs and the connection is terminated in the
    session/engine
  - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
  - ZMQ_SUB gets a new instruction to connect()
  - connect() returns without doing anything (a no-op).
    - socket_base thinks it still has a valid endpoint, and SUB only
      connects once to each endpoint.
  - At this point there are no errors and no data flowing.

My question is, should the protocol_error in the session propagate up to
remove the endpoint from the socket?
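
To make the sequence and the question concrete, here is a toy model in
plain Python.  It does not use libzmq and is not how libzmq is
implemented; ToySubSocket, on_protocol_error, drop_on_protocol_error and
replay are made-up names, and only the behaviour described in the steps
above is modelled.  The flag is an assumption that stands in for "the
protocol error propagates up and removes the endpoint".

```python
class ToySubSocket:
    """Toy stand-in for a SUB socket's endpoint bookkeeping (not libzmq code)."""

    def __init__(self, drop_on_protocol_error=False):
        # Hypothetical switch: False models the behaviour described in the
        # steps above, True models the proposed change.
        self.drop_on_protocol_error = drop_on_protocol_error
        self.endpoints = set()  # endpoints the socket believes it has
        self.live = set()       # endpoints that still have a working session

    def connect(self, endpoint):
        """Return True if a new session was started, False for a no-op."""
        if endpoint in self.endpoints:
            # SUB connects only once per endpoint, so this does nothing.
            return False
        self.endpoints.add(endpoint)
        self.live.add(endpoint)
        return True

    def on_protocol_error(self, endpoint):
        """The session terminates and does not reconnect."""
        self.live.discard(endpoint)
        if self.drop_on_protocol_error:
            # Proposed change: also forget the endpoint at socket level.
            self.endpoints.discard(endpoint)

    def receiving_from(self, endpoint):
        return endpoint in self.live


def replay(drop_on_protocol_error):
    """Replay the steps; return (connect started a session, data flows)."""
    sub = ToySubSocket(drop_on_protocol_error)
    endpoint = "tcp://1.2.3.4:44444"
    sub.connect(endpoint)            # initial connect to the PUB
    sub.on_protocol_error(endpoint)  # the reconnect landed on the REQ
    started = sub.connect(endpoint)  # new instruction to connect()
    return started, sub.receiving_from(endpoint)


if __name__ == "__main__":
    print("as described:    ", replay(False))
    print("with propagation:", replay(True))
```

In this model the second connect() only starts a new session when the
endpoint was dropped on the protocol error; otherwise the socket keeps a
stale entry and stays silent.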

If yes, I can look at adding that; if not, do you have any suggestions?

Thanks for your time

James

Some links to the code:

If the socket is SUB and the endpoint is already present, don't connect:
https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901

Terminate with no reconnect on protocol_error:
https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486