[zeromq-dev] zeromq protocol_error handling
James Harvey
jamesdillonharvey at gmail.com
Fri May 21 16:04:16 CEST 2021
Thanks Bill
I pulled the latest libzmq and the issue still occurs.
I have tracked it down to the protocol_error handling. When a ZMQ_SUB
connects to a ZMQ_REQ, a protocol_error occurs (as expected) and the
session is terminated.
The termination does not remove that connection endpoint from the socket.
This means subsequent calls to socket->connect on the same endpoint (after
the correct service has resumed) are no-ops, because a SUB socket can only
have one connection to a single endpoint.
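To make the bookkeeping problem concrete, here is a minimal sketch of the behaviour described above. It is a toy model, not libzmq code: ToySubSocket, its members, and reconnect_works are hypothetical names invented for illustration. It models only what the report states: a protocol error kills the connection but leaves the endpoint record in place, so a later connect() does nothing unless the endpoint is explicitly disconnected first.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy stand-in for a SUB socket's endpoint bookkeeping. This is NOT libzmq
// code; the class and member names are hypothetical.
class ToySubSocket
{
  public:
    // Returns true if a new connection attempt was started, false if the
    // call was a no-op because the endpoint is already recorded.
    bool connect (const std::string &endpoint)
    {
        if (_endpoints.count (endpoint) != 0)
            return false; // endpoint already known: nothing happens
        _endpoints.insert (endpoint);
        _live.insert (endpoint);
        return true;
    }

    // Models the session being terminated after a protocol error: the
    // connection dies, but the endpoint record stays behind.
    void protocol_error (const std::string &endpoint)
    {
        assert (_endpoints.count (endpoint) != 0);
        _live.erase (endpoint);
    }

    // Models an explicit disconnect: connection and record are both removed.
    void disconnect (const std::string &endpoint)
    {
        _live.erase (endpoint);
        _endpoints.erase (endpoint);
    }

    bool is_live (const std::string &endpoint) const
    {
        return _live.count (endpoint) != 0;
    }

  private:
    std::set<std::string> _endpoints; // what the socket believes it is connected to
    std::set<std::string> _live;      // connections that can actually carry data
};

// Replays the reported sequence and returns whether data could flow after
// the second connect().
bool reconnect_works (bool disconnect_first)
{
    const std::string endpoint = "tcp://127.0.0.1:55667";
    ToySubSocket sub;
    sub.connect (endpoint);        // SUB connects while a REQ owns the port
    sub.protocol_error (endpoint); // handshake fails, session is terminated
    if (disconnect_first)
        sub.disconnect (endpoint); // explicit disconnect clears the record
    sub.connect (endpoint);        // PUB is back, application reconnects
    return sub.is_live (endpoint);
}
```

In this model reconnect_works (false) reports that no data can flow, while reconnect_works (true) succeeds, because only the explicit disconnect clears the stale endpoint record.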
The change below fixes my issue but I'm not sure if it's correct for other
protocol errors. I haven't worked on the sessions/pipes before. I
noticed in gdb the second session has a _pipe but is not fully created.
https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L487
        case i_engine::protocol_error:
            // if (_pending) {
            // <<< if handshaked we should also terminate pipes.
            if (_pending || handshaked_) {
                if (_pipe)
                    _pipe->terminate (false);
                if (_zap_pipe)
                    _zap_pipe->terminate (false);
            } else {
                terminate ();
            }
I am happy to create a pull request to discuss whether I am on the right
track.
I have test code to reproduce the issue:
#include "testutil.hpp"
#include "testutil_unity.hpp"
#include <iostream>
#include <stdlib.h>

SETUP_TEARDOWN_TESTCONTEXT

char end[] = "tcp://127.0.0.1:55667";

void test_pubreq ()
{
    // SUB comes up and connects to 55667
    void *sub = test_context_socket (ZMQ_SUB);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_setsockopt (sub, ZMQ_SUBSCRIBE, "", 0));
    TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));

    // REQ is up incorrectly on 55667
    void *req = test_context_socket (ZMQ_REQ);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (req, end));
    msleep (1000);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_unbind (req, end));
    // REQ is down

    // At this point the SUB socket has a protocol_error on 55667 (so no
    // reconnect) but the socket thinks it is still connected to 55667
    msleep (1000);

    // PUB correctly comes up on 55667
    void *pub = test_context_socket (ZMQ_PUB);
    TEST_ASSERT_SUCCESS_ERRNO (zmq_bind (pub, end));

    // NOTE: If we force a disconnect here it works.
    // TEST_ASSERT_SUCCESS_ERRNO (zmq_disconnect (sub, end));

    // Connecting again is a no-op, so the receive below fails
    TEST_ASSERT_SUCCESS_ERRNO (zmq_connect (sub, end));
    msleep (100);
    send_string_expect_success (pub, "Hello", 0);
    msleep (100);
    recv_string_expect_success (sub, "Hello", 0);
    msleep (100);

    test_context_socket_close (pub);
    test_context_socket_close (req);
    test_context_socket_close (sub);
}

int main (void)
{
    setup_test_environment ();

    UNITY_BEGIN ();
    RUN_TEST (test_pubreq);
    return UNITY_END ();
}

On Thu, May 20, 2021 at 4:56 PM Bill Torpey <wallstprog at gmail.com> wrote:
> Sorry — meant to get back to you sooner, but it’s been a crazy week.
>
> You don’t say what version you’re running, but there have been some
> changes in that area not that long ago — check these out and see if they
> help:
>
> https://github.com/zeromq/libzmq/pull/3831
>
> https://github.com/zeromq/libzmq/pull/3960
>
> https://github.com/zeromq/libzmq/pull/4053
>
> Good luck.
>
> Bill
>
>
> On May 20, 2021, at 10:26 AM, James Harvey <jamesdillonharvey at gmail.com>
> wrote:
>
> Hi,
>
> I will try and simplify my previous long email.
>
> If a stream gets into a protocol error state (e.g. a TCP SUB connecting
> to a REQ):
>
> Should the information (that the connection is terminated) be passed
> back to the parent socket, so that if connect() is called again it
> attempts to connect rather than being a no-op?
>
> OR
>
> Should we add a protocol error event to the socket monitor so the calling
> process can handle it by calling disconnect/connect?
>
> Just want some clarification so I work on the correct code.
>
> Thanks
>
> James
>
> On Thu, May 13, 2021 at 4:48 PM James Harvey <jamesdillonharvey at gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a rare/random bug that causes my ZMQ_SUB socket to fail for a
>> certain endpoint with no way to track/notify. Yes, it's because a SUB
>> connects to a REQ socket, but once you start to use zeromq for lots of
>> transient systems in a large company, this kind of thing will happen
>> occasionally.
>>
>> The process happens like this:
>>
>> - ZMQ_PUB binds on 1.2.3.4:44444 (ephemeral)
>> - ZMQ_SUB connects to 1.2.3.4:44444 (data flows)
>> - ZMQ_PUB goes down
>> - Unrelated process (ZMQ_REQ) comes up and grabs the same 1.2.3.4:44444
>> as its ephemeral
>> - ZMQ_SUB has not yet been told to disconnect so it reconnects to the
>> ZMQ_REQ
>> - protocol error happens and the connection is terminated in the
>> session/engine
>> - Now a good ZMQ_PUB comes up and binds on 1.2.3.4:44444
>> - ZMQ_SUB gets new instruction to connect()
>> - connect() just returns as a no-op.
>> - The socket_base thinks it still has a valid endpoint and SUB only
>> connects once to each endpoint.
>> - At this point there are no errors and no data flowing.
>>
>> My question is, should the protocol_error in the session propagate up to
>> remove the endpoint from the socket?
>>
>> If yes I can look at adding that, if no do you have any suggestions?
>>
>> Thanks for your time
>>
>> James
>>
>> Some links to the code:
>>
>> If the socket is SUB and the endpoint is present, don't connect.
>> https://github.com/zeromq/libzmq/blob/master/src/socket_base.cpp#L901
>>
>> terminate with no reconnect on protocol_error
>> https://github.com/zeromq/libzmq/blob/master/src/session_base.cpp#L486
>>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>