[zeromq-dev] zeromq protocol_error handling

Bill Torpey wallstprog at gmail.com
Fri May 21 22:12:31 CEST 2021


Hey James:

Going back over your original scenario:

>  - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)

>  - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)

>  - ZMQ_PUB goes down

At this point the SUB should get a disconnect.  It will then start trying to reconnect, and will keep doing so “forever” unless something else intervenes.  (The default for ZMQ_RECONNECT_IVL is 100 ms.)

This PR (https://github.com/zeromq/libzmq/pull/3831) explicitly checks for the scenario where a previously-connected socket gets ECONNREFUSED when attempting to reconnect.  If that condition is detected, the reconnect is aborted AND the endpoint address is “forgotten” so subsequent attempts to connect (not re-connect) to that endpoint are not silently ignored. 

Note that you have to ask for this behavior, as it’s not the default, by calling something like "zmq_setsockopt(socket, ZMQ_RECONNECT_STOP, ZMQ_RECONNECT_STOP_CONN_REFUSED ...)".

(FWIW, I initially suggested that silently ignoring duplicate connection attempts is a bad idea, and would prefer that the connect return an error (like EAGAIN), but there was push-back on that as it’s a change in behavior.  I still think that’s a better approach).
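
To make the bookkeeping concrete, here is a toy model in plain C of the behavior described above. It is only a sketch: toy_socket, toy_connect and toy_on_conn_refused are made-up names, not libzmq code, and the stop_on_conn_refused flag merely stands in for opting in via ZMQ_RECONNECT_STOP.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_ENDPOINTS 8
#define TOY_MAX_ADDR 64

/* Toy stand-in for a socket's endpoint table (hypothetical, NOT libzmq's real structure). */
typedef struct {
    char endpoints[TOY_MAX_ENDPOINTS][TOY_MAX_ADDR];
    int count;
    bool stop_on_conn_refused; /* models opting in to ZMQ_RECONNECT_STOP_CONN_REFUSED */
} toy_socket;

typedef enum {
    TOY_CONNECT_STARTED, /* endpoint was new: a connection attempt begins */
    TOY_CONNECT_IGNORED, /* endpoint already registered: silent no-op */
    TOY_CONNECT_FAILED   /* table full or address too long */
} toy_result;

void toy_init (toy_socket *s, bool stop_on_conn_refused)
{
    memset (s, 0, sizeof *s);
    s->stop_on_conn_refused = stop_on_conn_refused;
}

/* Return the index of addr in the table, or -1 if it is not registered. */
int toy_find (const toy_socket *s, const char *addr)
{
    for (int i = 0; i < s->count; i++)
        if (strcmp (s->endpoints[i], addr) == 0)
            return i;
    return -1;
}

/* A repeat connect to a registered endpoint is ignored without any error. */
toy_result toy_connect (toy_socket *s, const char *addr)
{
    if (toy_find (s, addr) >= 0)
        return TOY_CONNECT_IGNORED;
    if (s->count == TOY_MAX_ENDPOINTS || strlen (addr) >= TOY_MAX_ADDR)
        return TOY_CONNECT_FAILED;
    strcpy (s->endpoints[s->count++], addr);
    return TOY_CONNECT_STARTED;
}

/* Called when a reconnect attempt to a previously-connected endpoint is refused. */
void toy_on_conn_refused (toy_socket *s, const char *addr)
{
    if (!s->stop_on_conn_refused)
        return; /* default: keep retrying, endpoint stays registered */
    int i = toy_find (s, addr);
    if (i < 0)
        return;
    /* Forget the endpoint by moving the last entry into its slot. */
    if (i != s->count - 1)
        strcpy (s->endpoints[i], s->endpoints[s->count - 1]);
    s->count--;
}
```

With the flag set, a refused reconnect forgets the endpoint, so a later toy_connect on the same address starts a fresh connection; without it the address stays registered and the repeat call is ignored. Returning an error in the TOY_CONNECT_IGNORED case instead is the alternative mentioned above.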


>  - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444 as its ephemeral


It seems unlikely that another process could grab the same ephemeral port without an intervening ECONNREFUSED (no code listening at port). 

You really need to implement the socket monitoring code (as I’ve already suggested).  Make sure to use zmqBridgeMamaTransportImpl_monitorEvent_v2 as that will give you both endpoint addresses.
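
For a rough idea of what a monitor handler has to decode, here is a small sketch in plain C. It assumes the older (v1) monitor layout documented for zmq_socket_monitor, where the first frame carries a 16-bit event id followed by a 32-bit value and the second frame carries the endpoint string; the v2 format mentioned above is richer and is not shown. The TOY_EV_* constants are local copies of a few ZMQ_EVENT_* values from zmq.h, and both function names are made up for illustration. Which events actually fire in this scenario is exactly what tracing would tell you; nothing here asserts that.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of a few ZMQ_EVENT_* ids; real code should include <zmq.h> instead. */
enum {
    TOY_EV_CONNECTED = 0x0001,
    TOY_EV_CONNECT_RETRIED = 0x0004,
    TOY_EV_DISCONNECTED = 0x0200,
    TOY_EV_HANDSHAKE_FAILED_PROTOCOL = 0x2000
};

/* Decode the first frame of a v1 monitor message (6 bytes, host byte order).
   Returns 0 on success, -1 if the frame has an unexpected size. */
int decode_monitor_event (const void *frame, size_t len, uint16_t *event, uint32_t *value)
{
    if (len != sizeof (uint16_t) + sizeof (uint32_t))
        return -1;
    /* memcpy avoids unaligned reads from the message buffer. */
    memcpy (event, frame, sizeof *event);
    memcpy (value, (const unsigned char *) frame + sizeof *event, sizeof *value);
    return 0;
}

/* Map an event id to a short name for logging. */
const char *monitor_event_name (uint16_t event)
{
    switch (event) {
        case TOY_EV_CONNECTED:
            return "CONNECTED";
        case TOY_EV_CONNECT_RETRIED:
            return "CONNECT_RETRIED";
        case TOY_EV_DISCONNECTED:
            return "DISCONNECTED";
        case TOY_EV_HANDSHAKE_FAILED_PROTOCOL:
            return "HANDSHAKE_FAILED_PROTOCOL";
        default:
            return "OTHER";
    }
}
```

In a real program the frame would come from a ZMQ_PAIR socket connected to the inproc endpoint passed to zmq_socket_monitor, and each decoded event would be logged together with the endpoint from the second frame.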

If that’s too much trouble, you may be able to use zmtpdump (https://github.com/zeromq/zmtpdump) or Wireshark to see what is really going on.

Last but not least, you are likely better off creating an issue on GitHub for this.

Regards,

Bill


> On May 21, 2021, at 2:38 PM, James Harvey <jamesdillonharvey at gmail.com> wrote:
> 
> Hi Bill,
> 
> I will check/reply to the rest of the points later (I'm in the pub), but that is the point. The protocol_error stops everything, so there is no more reconnect from the pub socket. It's effectively a zombie: it's terminated, but the endpoint is still registered on the socket.
> 
> Cheers
> 
> James
> 
> 
> On Fri, 21 May 2021, 18:43 Bill Torpey, <wallstprog at gmail.com> wrote:
> Hi James:
> 
> A couple of questions:
> 
> - Is the SUB socket attempting to reconnect?  (Default is yes).
> 
> - Are you activating any of the socket options added by recent changes?  IIRC none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED)  have any effect by default — they need to be activated explicitly.
> 
> - Are you tracing socket events?  If not, you should give that a try; it will tell you what is going on “under the covers”. You can find an example at https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549
> 
> I’ll try to take a look when I have some time, but not sure when that will be …
> 
> Regards,
> 
> Bill
> 
>> On May 21, 2021, at 10:04 AM, James Harvey <jamesdillonharvey at gmail.com> wrote:
>> 
>> Thanks Bill 
>> 
>> I pulled the latest libzmq and the issue still occurs.
>> 
>> I have tracked it down to the protocol_error handling.  In the case of a ZMQ_SUB connecting to a ZMQ_REQ a protocol_error happens (expected) and the session is terminated.
>> 
>> The termination does not remove that connection endpoint from the socket. This means subsequent calls to socket->connect on the same endpoint (after the correct service has resumed) are no ops because SUB can only have one connection to a single endpoint.
>> 
>> 
>> The change below fixes my issue but I'm not sure if it's correct for other protocol errors.  I haven't worked on the sessions/pipes before.    I noticed in gdb the second session has a _pipe but is not fully created.
>> 
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
>> 
>>         case i_engine::protocol_error:
>> //            if (_pending) {
>>             if (_pending || handshaked_) {  // <<<  if handshaked we should also terminate pipes.
>>                 if (_pipe)
>>                     _pipe->terminate (false);
>>                 if (_zap_pipe)
>>                     _zap_pipe->terminate (false);
>>             } else {
>>                 terminate ();
>>             }
>> 
>> I am happy to create a pull request to discuss if I am on the right track?
>> 
>> I have test code to recreate.
>> 
>> #include "testutil.hpp"
>> #include "testutil_unity.hpp"
>> #include <iostream>
>> #include <stdlib.h>
>> SETUP_TEARDOWN_TESTCONTEXT
>> char end[] = "tcp://127.0.0.1:55667";
>> 
>> void test_pubreq ()
>> {
>>    
>> // SUB up and connect to 55667
>>     void *sub = test_context_socket (ZMQ_SUB);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>> 
>> // REQ is up incorrectly on 55667 
>>     void *req = test_context_socket (ZMQ_REQ);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
>>     msleep(1000);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
>> // REQ is down
>> // At this point the SUB socket has a protocol_error on 55667 (so no reconnect) but the socket thinks it is still connected to 55667
>> 
>>     msleep(1000);
>> 
>> // PUB correctly comes up on 55667
>>     void *pub = test_context_socket (ZMQ_PUB);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));
>> 
>> // NOTE: If we force a disconnect here it works.
>> //    TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));
>> 
>> // Connect again: the call succeeds but is silently ignored, so the receive below fails
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>> 
>>     msleep(100);
>> 
>>     send_string_expect_success (pub, "Hello", 0);
>>     
>>     msleep(100);
>> 
>>     recv_string_expect_success (sub, "Hello", 0);
>> 
>>     msleep(100);
>> 
>>     test_context_socket_close (pub);
>>     test_context_socket_close (req);
>>     test_context_socket_close (sub);
>> 
>> }
>> 
>> int main (void)
>> {
>>     setup_test_environment ();
>> 
>>     UNITY_BEGIN ();
>>     RUN_TEST (test_pubreq);
>>     return UNITY_END (); 
>> }
>> 
>> On Thu, May 20, 2021 at 4:56 PM Bill Torpey <wallstprog at gmail.com> wrote:
>> Sorry — meant to get back to you sooner, but it’s been a crazy week.
>> 
>> You don’t say what version you’re running, but there have been some changes in that area not that long ago — check these out and see if they help:
>> 
>> https://github.com/zeromq/libzmq/pull/3831
>> 
>> https://github.com/zeromq/libzmq/pull/3960
>> 
>> https://github.com/zeromq/libzmq/pull/4053
>> 
>> Good luck.
>> 
>> Bill
>> 
>> 
>>> On May 20, 2021, at 10:26 AM, James Harvey <jamesdillonharvey at gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I will try and simplify my previous long email.
>>> 
>>> If a stream gets into a protocol error state (e.g. a TCP SUB connecting to a REQ):
>>> 
>>> Should the information (connection is terminated) be passed somehow back to the parent socket, so that if connect() is called again it attempts to connect rather than being a no-op?
>>> 
>>> OR
>>> 
>>> Should we add a protocol error event to the socket monitor, so the calling process can handle it by calling disconnect/connect?
>>> 
>>> Just want some clarification so I work on the correct code.
>>> 
>>> Thanks
>>> 
>>> James
>>> 
>>> On Thu, May 13, 2021 at 4:48 PM James Harvey <jamesdillonharvey at gmail.com> wrote:
>>> Hi,
>>> 
>>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a certain endpoint with no way to track/notify.  Yes it's because a SUB connects to a REQ socket but once you start to use zeromq for lots of transient systems in a large company this kind of thing will happen occasionally.
>>> 
>>> The process happens like this:
>>> 
>>>   - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>>>   - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>>>   - ZMQ_PUB goes down
>>>   - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444 as its ephemeral
>>>   - ZMQ_SUB has not yet been told to disconnect so it reconnects to the ZMQ_REQ
>>>   - protocol error happens and the connection is terminated in the session/engine
>>>   - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>>>   - ZMQ_SUB gets new instruction to connect()
>>>   - connect() just returns noop.
>>>     - The socket_base thinks it still has a valid endpoint and SUB only connects once to each endpoint.
>>>   - At this point there are no errors and no data flowing.
>>> 
>>> My question is, should the protocol_error in the session propagate up to remove the endpoint from the socket?
>>> 
>>> If yes I can look at adding that, if no do you have any suggestions?
>>> 
>>> Thanks for your time
>>> 
>>> James
>>> 
>>> Some links to the code:
>>> 
>>> If the socket is SUB and the endpoint is already present, don't connect.
>>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>> 
>>> Terminate with no reconnect on protocol_error.
>>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>> 
