[zeromq-dev] zeromq protocol_error handling

James Harvey jamesdillonharvey at gmail.com
Fri May 21 23:11:28 CEST 2021


Thanks Bill for the advice. I will implement the monitoring to gather more
data. I think I have sufficient information to create an issue now.

In general ZeroMQ has a steep learning curve, and it is hard to work out
whether behaviour that looks wrong is really a bug or is expected.

The maintainers of zmq clearly have far superior knowledge, so it's easy
to just let them do all the work. This feels wrong, so I want to help.




On Fri, 21 May 2021, 21:16 Bill Torpey, <wallstprog at gmail.com> wrote:

> Hey James:
>
> Going back over your original scenario:
>
>  - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>
>  - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>
>  - ZMQ_PUB goes down
>
>
> At this point the SUB should get a disconnect.  It will then start trying
> to reconnect, which it will do “forever” without any other action.  (The
> default for ZMQ_RECONNECT_IVL is 100 milliseconds.)
>
> This PR (https://github.com/zeromq/libzmq/pull/3831) explicitly checks
> for the scenario where a previously-connected socket gets ECONNREFUSED when
> attempting to reconnect.  If that condition is detected, the reconnect is
> aborted AND the endpoint address is “forgotten” so subsequent attempts to
> connect (not re-connect) to that endpoint are not silently ignored.
>
> Note that you have to ask for this behavior, as it’s not the default, by
> calling something like "zmq_setsockopt(socket, ZMQ_RECONNECT_STOP,
> ZMQ_RECONNECT_STOP_CONN_REFUSED ..)".
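
As a rough illustration of the rule described above, here is a toy decision
function in C. It is not libzmq code: `toy_on_reconnect_failure`,
`TOY_STOP_CONN_REFUSED` and the `toy_action_t` values are hypothetical names
invented for this sketch, and the "previously connected" check is modelled
only from the description in this thread, not from the PR itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for ZMQ_RECONNECT_STOP_CONN_REFUSED. */
#define TOY_STOP_CONN_REFUSED 0x1

typedef enum {
    TOY_RETRY,          /* keep reconnecting after the reconnect interval */
    TOY_STOP_AND_FORGET /* abort reconnecting and drop the endpoint */
} toy_action_t;

/* Decide what to do after a failed reconnect attempt.
   err           errno-style code from the failed attempt (must be non-zero)
   stop_flags    bit mask the application opted into (0 = default behaviour)
   was_connected true if this endpoint had been connected before */
static toy_action_t
toy_on_reconnect_failure (int err, int stop_flags, bool was_connected)
{
    assert (err != 0);
    if (was_connected && err == ECONNREFUSED
        && (stop_flags & TOY_STOP_CONN_REFUSED) != 0)
        return TOY_STOP_AND_FORGET;
    return TOY_RETRY;
}
```

With `stop_flags` left at 0 the function always returns `TOY_RETRY`, which
matches the point that the new behaviour has to be requested explicitly.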
>
> (FWIW, I initially suggested that silently ignoring duplicate connection
> attempts is a bad idea, and would prefer that the connect return an error
> (like EAGAIN), but there was push-back on that as it’s a change in
> behavior.  I still think that’s a better approach).
>
>
>  - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444 as
> its ephemeral
>
>
> It seems unlikely that another process could grab the same ephemeral port
> without an intervening ECONNREFUSED (no code listening at port).
>
> You really need to implement the socket monitoring code (as I’ve already
> suggested).  Make sure to use zmqBridgeMamaTransportImpl_monitorEvent_v2 as
> that will give you both endpoint addresses.
>
> If that’s too much trouble, you may be able to use zmtpdump
> (https://github.com/zeromq/zmtpdump) or Wireshark to see what is really
> going on.
>
> Last but not least, you are likely better off creating an issue on GitHub
> for this.
>
> Regards,
>
> Bill
>
>
> On May 21, 2021, at 2:38 PM, James Harvey <jamesdillonharvey at gmail.com>
> wrote:
>
> Hi Bill,
>
> I will check and reply to the rest of your points later (I'm in the pub),
> but that is the point. The protocol_error stops everything, so there are no
> more reconnect attempts to the pub socket. It's effectively a zombie: the
> session is terminated, but the endpoint is still registered on the socket.
>
> Cheers
>
> James
>
>
> On Fri, 21 May 2021, 18:43 Bill Torpey, <wallstprog at gmail.com> wrote:
>
>> Hi James:
>>
>> A couple of questions:
>>
>> - Is the SUB socket attempting to reconnect?  (Default is yes).
>>
>> - Are you activating any of the socket options added by recent changes?
>> IIRC none of the new options (e.g., ZMQ_RECONNECT_STOP_CONN_REFUSED)  have
>> any effect by default — they need to be activated explicitly.
>>
>> - Are you tracing socket events?  If not, you should give that a try — it
>> will tell you what is going on “under the covers”. You can find an example
>> at
>> https://github.com/nyfix/OZ/blob/4627b0364be80de4451bf1a80a26c00d0ba9310f/src/transport.c#L1549
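
For anyone wiring up event tracing, here is a minimal sketch in C of decoding
the first frame of a classic (v1) socket monitor message. It assumes the
layout given in the zmq_socket_monitor documentation: a 16-bit event number
followed by a 32-bit event value, in host byte order, with the endpoint
string arriving in a second frame. The `EV_*` constants are local copies of a
few ZMQ_EVENT_* values and should be checked against your zmq.h;
`monitor_event_t`, `decode_monitor_frame` and `event_name` are names invented
for this sketch. The versioned (v2) monitor uses a different layout, which
this does not cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of a few ZMQ_EVENT_* values; verify against your zmq.h. */
enum {
    EV_CONNECTED = 0x0001,
    EV_CONNECT_RETRIED = 0x0004,
    EV_DISCONNECTED = 0x0200,
    EV_HANDSHAKE_FAILED_PROTOCOL = 0x2000
};

typedef struct {
    uint16_t event; /* which event fired */
    uint32_t value; /* event-specific value, e.g. an fd or an error code */
} monitor_event_t;

/* Decode the first monitor frame. Returns 0 on success, or -1 if the
   frame is too short to hold a 16-bit event and a 32-bit value. */
static int
decode_monitor_frame (const void *data, size_t size, monitor_event_t *out)
{
    assert (data != NULL && out != NULL);
    if (size < sizeof (uint16_t) + sizeof (uint32_t))
        return -1;
    const unsigned char *p = data;
    /* memcpy avoids unaligned reads from the message buffer. */
    memcpy (&out->event, p, sizeof (uint16_t));
    memcpy (&out->value, p + sizeof (uint16_t), sizeof (uint32_t));
    return 0;
}

/* Map the events of interest in this thread to printable names. */
static const char *event_name (uint16_t event)
{
    switch (event) {
        case EV_CONNECTED:
            return "CONNECTED";
        case EV_CONNECT_RETRIED:
            return "CONNECT_RETRIED";
        case EV_DISCONNECTED:
            return "DISCONNECTED";
        case EV_HANDSHAKE_FAILED_PROTOCOL:
            return "HANDSHAKE_FAILED_PROTOCOL";
        default:
            return "OTHER";
    }
}
```

In a real program the `data` and `size` arguments would come from
zmq_msg_data() and zmq_msg_size() on the first frame read from the PAIR
socket connected to the monitor's inproc address.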
>>
>> I’ll try to take a look when I have some time, but not sure when that
>> will be …
>>
>> Regards,
>>
>> Bill
>>
>> On May 21, 2021, at 10:04 AM, James Harvey <jamesdillonharvey at gmail.com>
>> wrote:
>>
>> Thanks Bill
>>
>> I pulled the latest libzmq and the issue still occurs.
>>
>> I have tracked it down to the protocol_error handling.  In the case of a
>> ZMQ_SUB connecting to a ZMQ_REQ a protocol_error happens (expected) and the
>> session is terminated.
>>
>> The termination does not remove that connection endpoint from the socket.
>> This means subsequent calls to socket->connect on the same endpoint (after
>> the correct service has resumed) are no-ops, because a SUB socket can only
>> have one connection to a given endpoint.
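
To make the failure mode concrete, here is a toy model in C of the endpoint
bookkeeping just described. It is not libzmq code: every type and function
(`toy_sub_socket_t`, `toy_connect`, `toy_disconnect`, `toy_protocol_error`)
is hypothetical, and the `forget_endpoint` flag only stands in for the idea
of removing the endpoint when the session terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these types or functions exist in libzmq. */
#define TOY_MAX_ENDPOINTS 8
#define TOY_MAX_ADDR 64

typedef struct {
    char addr[TOY_MAX_ADDR];
    bool session_alive; /* false once the session has been terminated */
} toy_endpoint_t;

typedef struct {
    toy_endpoint_t endpoints[TOY_MAX_ENDPOINTS];
    size_t count;
} toy_sub_socket_t;

static toy_endpoint_t *toy_find (toy_sub_socket_t *s, const char *addr)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp (s->endpoints[i].addr, addr) == 0)
            return &s->endpoints[i];
    return NULL;
}

/* Returns 1 if a new session was started, 0 if the call was silently
   ignored because the endpoint is already registered. */
static int toy_connect (toy_sub_socket_t *s, const char *addr)
{
    if (toy_find (s, addr) != NULL)
        return 0;
    assert (s->count < TOY_MAX_ENDPOINTS && strlen (addr) < TOY_MAX_ADDR);
    toy_endpoint_t *e = &s->endpoints[s->count++];
    strcpy (e->addr, addr);
    e->session_alive = true;
    return 1;
}

/* Removes the endpoint; returns 0 on success, -1 if it was not registered. */
static int toy_disconnect (toy_sub_socket_t *s, const char *addr)
{
    toy_endpoint_t *e = toy_find (s, addr);
    if (e == NULL)
        return -1;
    *e = s->endpoints[s->count - 1]; /* move the last entry into the gap */
    s->count--;
    return 0;
}

/* The session terminates. With forget_endpoint == false the endpoint stays
   registered (the behaviour described above); with true it is removed. */
static void toy_protocol_error (toy_sub_socket_t *s, const char *addr,
                                bool forget_endpoint)
{
    toy_endpoint_t *e = toy_find (s, addr);
    if (e == NULL)
        return;
    e->session_alive = false;
    if (forget_endpoint)
        toy_disconnect (s, addr);
}
```

In this model, toy_connect() after toy_protocol_error(..., false) returns 0
even though the session is dead, which is the zombie state. An explicit
toy_disconnect() first, or forget_endpoint == true, lets the next
toy_connect() start a fresh session.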
>>
>>
>> The change below fixes my issue, but I'm not sure if it's correct for
>> other protocol errors.  I haven't worked on the sessions/pipes before.  I
>> noticed in gdb that the second session has a _pipe but is not fully created.
>>
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
>>
>>         case i_engine::protocol_error:
>>             // Original condition:
>>             //   if (_pending) {
>>             // Proposed: if handshaked, we should also terminate the pipes.
>>             if (_pending || handshaked_) {
>>                 if (_pipe)
>>                     _pipe->terminate (false);
>>                 if (_zap_pipe)
>>                     _zap_pipe->terminate (false);
>>             } else {
>>                 terminate ();
>>             }
>>
>> I am happy to create a pull request to discuss this. Am I on the right track?
>>
>> I have test code that reproduces it:
>>
>> #include "testutil.hpp"
>> #include "testutil_unity.hpp"
>> #include <iostream>
>> #include <stdlib.h>
>> SETUP_TEARDOWN_TESTCONTEXT
>> char end[] = "tcp://127.0.0.1:55667";
>>
>> void test_pubreq ()
>> {
>>
>> // SUB comes up and connects to 55667
>>     void *sub = test_context_socket (ZMQ_SUB);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>>
>> // REQ is up incorrectly on 55667
>>     void *req = test_context_socket (ZMQ_REQ);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
>>     msleep(1000);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
>> // REQ is down
>> // At this point the SUB socket has had a protocol_error on 55667 (so no
>> // reconnect), but the socket thinks it is still connected to 55667
>>
>>     msleep(1000);
>>
>> // PUB correctly comes up on 55667
>>     void *pub = test_context_socket (ZMQ_PUB);
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));
>>
>> // NOTE: If we force a disconnect here it works.
>> //    TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));
>>
>> // Connect again: the call returns success but is silently a no-op
>>     TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
>>
>>     msleep(100);
>>
>>     send_string_expect_success (pub, "Hello", 0);
>>
>>     msleep(100);
>>
>>     recv_string_expect_success (sub, "Hello", 0);
>>
>>     msleep(100);
>>
>>     test_context_socket_close (pub);
>>     test_context_socket_close (req);
>>     test_context_socket_close (sub);
>>
>> }
>>
>> int main (void)
>> {
>>     setup_test_environment ();
>>
>>     UNITY_BEGIN ();
>>     RUN_TEST (test_pubreq);
>>     return UNITY_END ();
>> }
>>
>> On Thu, May 20, 2021 at 4:56 PM Bill Torpey <wallstprog at gmail.com> wrote:
>>
>>> Sorry — meant to get back to you sooner, but it’s been a crazy week.
>>>
>>> You don’t say what version you’re running, but there have been some
>>> changes in that area not that long ago — check these out and see if they
>>> help:
>>>
>>> https://github.com/zeromq/libzmq/pull/3831
>>>
>>> https://github.com/zeromq/libzmq/pull/3960
>>>
>>> https://github.com/zeromq/libzmq/pull/4053
>>>
>>> Good luck.
>>>
>>> Bill
>>>
>>>
>>> On May 20, 2021, at 10:26 AM, James Harvey <jamesdillonharvey at gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I will try and simplify my previous long email.
>>>
>>> If a stream gets into a protocol error state (e.g. a TCP SUB connects to
>>> a REQ):
>>>
>>> Should the information (connection is terminated) be passed back somehow
>>> to the parent socket, so that if connect() is called again it attempts to
>>> connect rather than being a no-op?
>>>
>>> OR
>>>
>>> Should we add a protocol error event to the socket monitor, so the calling
>>> process can handle it by calling disconnect/connect?
>>>
>>> I just want some clarification so that I work on the correct code.
>>>
>>> Thanks
>>>
>>> James
>>>
>>> On Thu, May 13, 2021 at 4:48 PM James Harvey <
>>> jamesdillonharvey at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
>>>> certain endpoint with no way to track/notify.  Yes it's because a SUB
>>>> connects to a REQ socket but once you start to use zeromq for lots of
>>>> transient systems in a large company this kind of thing will happen
>>>> occasionally.
>>>>
>>>> The process happens like this:
>>>>
>>>>   - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>>>>   - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>>>>   - ZMQ_PUB goes down
>>>>   - Unrelated process (ZMQ_REQ) comes up and grabs the same
>>>> 1.2.3.4:44444 as its ephemeral
>>>>   - ZMQ_SUB has not yet been told to disconnect so it reconnects to the
>>>> ZMQ_REQ
>>>>   - protocol error happens and the connection is terminated in the
>>>> session/engine
>>>>   - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>>>>   - ZMQ_SUB gets new instruction to connect()
>>>>   - connect() returns, but it is a no-op.
>>>>     - The socket_base thinks it still has a valid endpoint and SUB only
>>>> connects once to each endpoint.
>>>>   - At this point there are no errors and no data flowing.
>>>>
>>>> My question is, should the protocol_error in the session propagate up
>>>> to remove the endpoint from the socket?
>>>>
>>>> If yes I can look at adding that, if no do you have any suggestions?
>>>>
>>>> Thanks for your time
>>>>
>>>> James
>>>>
>>>> Some links to the code:
>>>>
>>>> If the socket is SUB and the endpoint is already present, don't connect:
>>>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>>>
>>>> Terminate with no reconnect on protocol_error:
>>>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>>>>
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev