[zeromq-dev] Malamute (reconnection and some more questions)
Kevin Sapper
kevinsapper88 at gmail.com
Mon Apr 18 13:29:34 CEST 2016
Okay, seems I was a little bit too quick :(. Great analysis btw :)
You're correct, the client cannot recover from the disconnected state. The
heartbeat event has been overridden there so that the client itself stops
sending heartbeats to the server. But as a result the client also ignores
any heartbeats from the revived server. This is definitely a bug! Instead of
ignoring the heartbeat events we need to stop the client heartbeat timer
and restart it upon reconnect.
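To make the proposed fix a bit more concrete, here is a minimal sketch in C.
It is a toy model and not the actual mlm_client code: the toy_client_t type,
the STATE_* and EVENT_* names and both functions are made up for
illustration. It assumes that a heartbeat event is answered with a
connection ping in every state, and that a connection pong is what brings
the client back to "connected".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this only models the idea, it is not
   the generated mlm_client state machine. */
typedef enum { STATE_CONNECTED, STATE_DISCONNECTED } state_t;
typedef enum { EVENT_HEARTBEAT, EVENT_EXPIRED, EVENT_CONNECTION_PONG } event_t;

typedef struct {
    state_t state;
    bool heartbeat_timer_on;   /* is the client's own heartbeat timer armed? */
    int pings_sent;            /* counts connection pings we would send */
} toy_client_t;

void toy_client_init (toy_client_t *self)
{
    assert (self != NULL);
    self->state = STATE_CONNECTED;
    self->heartbeat_timer_on = true;
    self->pings_sent = 0;
}

void toy_client_handle (toy_client_t *self, event_t event)
{
    assert (self != NULL);
    switch (event) {
    case EVENT_HEARTBEAT:
        /* Never ignored, not even in the disconnected state:
           always answer with a connection ping. */
        self->pings_sent++;
        break;
    case EVENT_EXPIRED:
        if (self->state == STATE_CONNECTED) {
            /* Server went silent: stop our own timer instead of
               overriding the heartbeat event with a no-op. */
            self->heartbeat_timer_on = false;
            self->state = STATE_DISCONNECTED;
        }
        break;
    case EVENT_CONNECTION_PONG:
        if (self->state == STATE_DISCONNECTED) {
            /* Server is back: restart the timer upon reconnect. */
            self->heartbeat_timer_on = true;
            self->state = STATE_CONNECTED;
        }
        break;
    }
}
```

The point of the sketch is only the split of responsibilities: the timer is
what gets switched off and on, while the heartbeat event itself keeps its
normal handling in the disconnected state.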
@hintjens please correct me if I'm wrong.
2016-04-18 13:13 GMT+02:00 Kevin Sapper <kevinsapper88 at gmail.com>:
> Hi Alena,
>
> in mlm_client.xml there is a state named "defaults" which is inherited
> by many others, including "disconnecting". When the client is in the
> "disconnecting" state and the server reconnects, it will send a heartbeat,
> which the client will answer with a connection ping; upon a connection
> pong from the server the client will move from the "disconnecting" state
> into the "connected" state.
>
> //Kevin
>
> 2016-04-18 8:47 GMT+02:00 Alena Chernikava <e.c.6078570 at gmail.com>:
>
>> Hi,
>>
>> I would like to ask some questions and point out some problems in the
>> Malamute broker.
>>
>> I am facing a problem with the client reconnect procedure in Malamute.
>> A formal description usually helps me understand a problem better,
>> which is why I started the investigation by drawing a visualization of
>> the state machine for the Malamute client. I would say it helped me a
>> lot :) Right away I found some "strange behaviors". I would like to ask
>> some questions to make things clearer for me (maybe they were done
>> intentionally) before I try to "experiment" with fixes.
>>
>> In the attachment you can find my hand-made visualization of the state
>> machine (I made it for myself, so it has my thoughts written down).
>> (GREEN - states, RED - events, BLUE - actions). It is not complete, but
>> it has already helped me to spot some potential and real problems. Here
>> I describe some of the issues I found (numbering is the same as in the
>> picture).
>>
>> 1. Re-connection problem. It is actually the main problem I want to
>> discuss.
>>
>> Situation:
>> The client sends 3 PINGs and does not receive any PONGs back. After this
>> the client ends up in the "disconnected" state. I would call it a black
>> hole state, as the client cannot normally recover from it (to the
>> "connected" state) or at least move somewhere else.
>>
>> Analysis:
>> * We can destroy the client. We move out of the "disconnected" state,
>> but we have destroyed the client. :) End of work, nothing to do.
>> Everything is fine.
>> * We can move to the "connected" state if the client receives "PONG"
>> from the server, or to the "HAVE ERROR" state if the client receives
>> "ERROR" from the server. In order to receive a response from the server,
>> we need to send something to it. And here we are: the client does not
>> send anything to the server :( PINGs have been disabled in
>> "mlm_client.xml" from the very beginning.
>>
>> Questions:
>> * Why was PING disabled in the "disconnected" state?
>> * What was the basic idea behind the "reconnect" implementation?
>>
>> Proposal:
>> Enable PINGs. When the server receives a PING from an "unknown client" it
>> will send "ERROR" back, which will trigger the "reconnection" procedure.
>> I am still not sure the client would reconnect correctly, but at least we
>> give it a chance to do so, because right now the client has no chance to
>> reconnect (if the server is off for a longer period).
>>
>> 2. Take a look at the right corner of the picture.
>>
>> In mlm_client.xml:
>>
>> <state name = "connecting" inherit = "defaults">
>>     <event name = "OK" next = "connected">
>>         <action name = "signal success" />
>>         <action name = "client is connected" />
>>     </event>
>> As a result, the following checks can all pass (and I have actually seen
>> such behavior a couple of times):
>> int rv = mlm_client_connect();
>> assert (rv == 0);
>> assert (mlm_client_connected () == false);
>>
>> Proposal: do "signal success" after "client is connected".
>> Question: is there any reason to leave the order as it is?
>>
>> 3+4. There is one point I didn't understand from the code: when is the
>> client supposed to start heartbeating?
>> I thought that it should happen after the client gets the "OK" response
>> from the server, but from the state machine I see that heartbeating
>> starts in the "connecting" state (while waiting for the response from the
>> server). Is this a bug, or was it done intentionally?
>>
>> 5. It is just a bug, I will fix it later. If mlm_client_connect doesn't
>> work the first time, the client should remain in the "start" state.
>>
>> 6. It is a potential problem. If "PONG" comes before the "OK" message
>> from the server, mlm_client_set_producer/consumer/worker will not finish
>> correctly and potentially will never return. I propose: return to the
>> "confirming" state and wait for the "OK" response from the server. Do you
>> think this would break anything?
>>
>> Thank you for reading this, looking forward to your reply.
>> Alena Chernikava