[zeromq-dev] Malamute (reconnection and some more questions)

Pieter Hintjens ph at imatix.com
Wed Apr 20 18:57:37 CEST 2016

Sounds right to me.

On Mon, Apr 18, 2016 at 1:29 PM, Kevin Sapper <kevinsapper88 at gmail.com> wrote:
> Okay, it seems I was a little bit too quick :(. Great analysis btw :)
> You're correct: the client cannot recover from the disconnected state. The
> heartbeat event has been overridden so that the client itself stops sending
> heartbeats to the server. But this results in the client ignoring any
> heartbeats from the revived server. This is definitely a bug! Instead of
> ignoring the heartbeat events, we need to stop the client heartbeat timer and
> restart it upon reconnect.
> @hintjens please correct me if I'm wrong.
> 2016-04-18 13:13 GMT+02:00 Kevin Sapper <kevinsapper88 at gmail.com>:
>> Hi Alena,
>> In mlm_client.xml there is a state named "defaults" which is inherited
>> by many others, including "disconnecting". When the client is in the
>> "disconnecting" state and the server reconnects, it will send a heartbeat,
>> which the client will answer with a connection ping. Upon a connection pong
>> from the server, the client will move from the "disconnecting" state into
>> the "connected" state.
>> //Kevin
>> 2016-04-18 8:47 GMT+02:00 Alena Chernikava <e.c.6078570 at gmail.com>:
>>> Hi,
>>> I would like to ask some questions and point out some problems in the
>>> Malamute broker.
>>> I am facing a problem with the client reconnect procedure in Malamute.
>>> A formal description usually helps me understand a problem better,
>>> which is why I started my investigation by drawing a visualization of the
>>> state machine for the Malamute client. I would say it helped me a lot :) Right
>>> away I found some "strange behaviors". I would like to ask some questions to
>>> make things clearer for me (maybe it was done intentionally) before I
>>> try to "experiment" with fixes.
>>> In the attachment you can find my hand-made visualization of the state
>>> machine (I made it for myself, so it has my thoughts written down).
>>> (GREEN - states, RED - events, BLUE - actions). It is not complete, but it has
>>> already helped me spot some potential and real problems. Here I
>>> describe some issues I found (the numbering is the same as in the picture).
>>> 1. Re-connection problem. This is actually the main problem I want to
>>> discuss.
>>> Situation:
>>> The client sends 3 PINGs and does not receive any PONGs back. After this the
>>> client ends up in the "disconnected" state. I would call it a
>>> black hole state, as the client cannot normally recover from it (to the
>>> "connected" state) or even move anywhere else.
>>> Analysis:
>>> * We can destroy the client. We move out of the "disconnected" state,
>>> but we have destroyed the client. :) End of work, nothing to do. Everything is
>>> fine.
>>> * We can move to the "connected" state if the client receives "PONG"
>>> from the server, or to the "HAVE ERROR" state if the client receives
>>> "ERROR" from the server. In order to receive a response from the server, we need
>>> to send something to the server. And here we are: the client does not send
>>> anything to the server :( PINGs have been disabled in "mlm_client.xml" from
>>> the very beginning.
>>> Questions:
>>> * Why was PING disabled in the "disconnected" state?
>>> * What was the basic idea behind the "reconnect" implementation?
>>> Proposal:
>>> Enable PINGs. When the server receives a PING from an "unknown client" it will
>>> send "ERROR" back, which will trigger the "reconnection" procedure. I
>>> am still not sure whether the client would reconnect correctly, but at least we can give
>>> it a chance to do so, because right now the client has no chance to reconnect
>>> (if the server is off for a longer period).
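The "black hole" described in item 1 can be sketched with a toy state machine. This is only a minimal model under stated assumptions, not the real client: every name below (toy_client_t, toy_client_handle, toy_run_outage, the ST_* and EV_* constants) is hypothetical, the limit of 3 unanswered PINGs is taken from the description above, and the revived server is assumed to answer only when the client sends something.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names follow the thread, not the real mlm_client.xml. */
typedef enum { ST_CONNECTED, ST_DISCONNECTED, ST_HAVE_ERROR } toy_state_t;
typedef enum { EV_EXPIRED, EV_PONG, EV_ERROR } toy_event_t;

typedef struct {
    toy_state_t state;
    int unanswered_pings;         /* PINGs sent since the last PONG */
    int pings_sent;               /* total PINGs put on the wire */
    bool ping_when_disconnected;  /* true models the proposal */
} toy_client_t;

/* Apply one event to the toy client. */
void toy_client_handle (toy_client_t *self, toy_event_t event)
{
    switch (self->state) {
    case ST_CONNECTED:
        if (event == EV_EXPIRED) {
            if (self->unanswered_pings >= 3)
                self->state = ST_DISCONNECTED;
            else {
                self->unanswered_pings++;
                self->pings_sent++;
            }
        }
        else if (event == EV_PONG)
            self->unanswered_pings = 0;
        else
            self->state = ST_HAVE_ERROR;
        break;
    case ST_DISCONNECTED:
        if (event == EV_EXPIRED) {
            /* With PINGs disabled, nothing is ever sent from here. */
            if (self->ping_when_disconnected)
                self->pings_sent++;
        }
        else if (event == EV_PONG) {
            self->unanswered_pings = 0;
            self->state = ST_CONNECTED;
        }
        else
            self->state = ST_HAVE_ERROR;
        break;
    case ST_HAVE_ERROR:
        break;  /* reconnection logic is out of scope for this sketch */
    }
}

/* Simulate an outage: the server is silent until the client is
   disconnected, then comes back and answers each new PING with `reply`. */
toy_state_t toy_run_outage (bool ping_when_disconnected, toy_event_t reply)
{
    toy_client_t client = { ST_CONNECTED, 0, 0, ping_when_disconnected };
    while (client.state == ST_CONNECTED)
        toy_client_handle (&client, EV_EXPIRED);    /* no PONGs arrive */

    for (int tick = 0; tick < 5 && client.state == ST_DISCONNECTED; tick++) {
        int before = client.pings_sent;
        toy_client_handle (&client, EV_EXPIRED);
        if (client.pings_sent > before)             /* server only reacts to traffic */
            toy_client_handle (&client, reply);
    }
    return client.state;
}
```

In this model, toy_run_outage (false, EV_PONG) leaves the client in the disconnected state no matter how many timer ticks pass, while enabling PINGs lets it reach either "connected" or "have error", depending on what the server answers.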
>>> 2. Take a look at the right corner of the picture.
>>> In mlm_client.xml:
>>>     <state name = "connecting" inherit = "defaults">
>>>         <event name = "OK" next = "connected">
>>>             <action name = "signal success" />
>>>             <action name = "client is connected" />
>>>         </event>
>>> As a result, the following code can pass (and I have actually seen
>>> such behavior a couple of times):
>>>       int rv = mlm_client_connect ();
>>>       assert (rv == 0);
>>>       assert (mlm_client_connected () == false);
>>> Proposal: do "signal success" after "client is connected".
>>> Question: is there any reason to leave the order as it is?
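The ordering issue in item 2 can be illustrated with a small sketch. It assumes that "signal success" is the point at which the calling code is woken up and that "client is connected" is what sets the flag a connected-style query reads. The types and functions below (order_client_t, flag_seen_by_caller, and so on) are hypothetical stand-ins, and the sketch takes the worst case in which the caller reads the flag the moment it is signaled. In the real client the caller runs concurrently, so the outcome would vary from run to run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the client's internal state. */
typedef struct {
    bool connected;        /* what a connected-style query would report */
    bool seen_at_signal;   /* flag value the caller observes when woken */
} order_client_t;

typedef void (*order_action_fn) (order_client_t *self);

/* Models "signal success": the caller wakes up and reads the flag at once. */
void order_signal_success (order_client_t *self)
{
    self->seen_at_signal = self->connected;
}

/* Models "client is connected": the flag is set. */
void order_client_is_connected (order_client_t *self)
{
    self->connected = true;
}

/* Run the actions of one event in order; return what the caller saw. */
bool flag_seen_by_caller (const order_action_fn *actions, int count)
{
    order_client_t client = { false, false };
    for (int i = 0; i < count; i++)
        actions [i] (&client);
    return client.seen_at_signal;
}

/* Order as in the quoted XML: signal first, then set the flag. */
bool seen_with_current_order (void)
{
    order_action_fn actions [] = { order_signal_success, order_client_is_connected };
    return flag_seen_by_caller (actions, 2);
}

/* Proposed order: set the flag first, then signal. */
bool seen_with_proposed_order (void)
{
    order_action_fn actions [] = { order_client_is_connected, order_signal_success };
    return flag_seen_by_caller (actions, 2);
}
```

With the current order the caller can observe connected == false right after a successful connect. With the proposed order the flag is already set by the time the caller is woken.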
>>> 3+4. There is one point I didn't understand from the code: when is the client
>>> supposed to start heartbeating?
>>> I thought that it should happen after the client gets the "OK" response from the
>>> server, but from the state machine I see that heartbeating starts in the
>>> "connecting" state (while waiting for the response from the server). Is
>>> this a bug, or was it done intentionally?
>>> 5. This is just a bug; I will fix it later. If mlm_client_connect didn't
>>> work the first time, the client should remain in the "start" state.
>>> 6. This is a potential problem. If "PONG" comes before the "OK" message
>>> from the server, mlm_client_set_producer/consumer/worker will not finish
>>> correctly and potentially will never return. I propose: return to the
>>> "confirming" state and wait for the "OK" response from the server. Do you think
>>> this would break anything?
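The scenario in item 6 can be modeled roughly as follows. This is a sketch under assumptions, not the actual state machine: it assumes that a "PONG" received in "confirming" moves the client to "connected" without signaling the caller, and that a late "OK" in "connected" is not handled. All identifiers (confirm_client_t, confirm_client_handle, caller_returns, the CF_* constants) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { CF_CONFIRMING, CF_CONNECTED } confirm_state_t;
typedef enum { CF_EV_OK, CF_EV_PONG } confirm_event_t;

typedef struct {
    confirm_state_t state;
    bool caller_signaled;            /* true once the blocked API call may return */
    bool pong_stays_in_confirming;   /* true models the proposal */
} confirm_client_t;

/* Apply one server message to the toy client. */
void confirm_client_handle (confirm_client_t *self, confirm_event_t event)
{
    if (self->state == CF_CONFIRMING) {
        if (event == CF_EV_OK) {
            self->caller_signaled = true;
            self->state = CF_CONNECTED;
        }
        else if (event == CF_EV_PONG && !self->pong_stays_in_confirming)
            self->state = CF_CONNECTED;   /* leaves without signaling the caller */
    }
    /* Assumption: in CF_CONNECTED a late OK is ignored. */
}

/* Deliver PONG before OK and report whether the caller was ever signaled. */
bool caller_returns (bool pong_stays_in_confirming)
{
    confirm_client_t client = { CF_CONFIRMING, false, pong_stays_in_confirming };
    confirm_client_handle (&client, CF_EV_PONG);
    confirm_client_handle (&client, CF_EV_OK);
    return client.caller_signaled;
}
```

Under these assumptions, caller_returns (false) reports that the caller is never signaled, while keeping the client in "confirming" on "PONG" lets the later "OK" complete the call.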
>>> Thank you for reading this; I look forward to your reply.
>>> Alena Chernikava
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev