[zeromq-dev] Malamute (reconnection and some more questions)
e.c.6078570 at gmail.com
Mon Apr 18 08:47:12 CEST 2016
I would like to ask some questions and point out some problems in the Malamute broker.
I am facing a problem with the client reconnect procedure in Malamute. A formal description usually helps me understand a problem better, so I started my investigation by drawing the state machine of the Malamute client. It helped me a lot :) Right away I found some "strange behavior". Before I start to "experiment" with fixes, I would like to ask some questions to make things clearer for me (maybe it was done intentionally).
In the attachment you can find my hand-made visualization of the state machine (I made it for myself, so it has my thoughts written down). GREEN marks states, RED marks events, BLUE marks actions. It is not complete, but it has already helped me spot some potential and real problems. Below I describe the issues I found (the numbering is the same as in the picture).
1. Re-connection problem. This is the main problem I want to discuss.
The client sends 3 PINGs and does not receive any PONGs back. After this the client ends up in the "disconnected" state. I would call it a black hole state, because the client cannot normally recover from it (back to the "connected" state) or even move anywhere else.
* We can destroy the client. That moves us out of the "disconnected" state, but the client is gone :) End of work, nothing left to do, everything is fine.
* We can move to the "connected" state if the client receives a "PONG" from the server, or to the "HAVE ERROR" state if the client receives an "ERROR" from the server. But to receive any response from the server, we first need to send something to it. And here we are: the client does not send anything to the server :( PINGs are disabled in "mlm_client.xml" from the very beginning.
Questions:
* Why were PINGs disabled in the "disconnected" state?
* What was the basic idea behind the "reconnect" implementation?
Proposal: enable PINGs. When the server receives a PING from an "unknown client", it sends an "ERROR" back, which triggers the "reconnection" procedure. I am still not sure the client would reconnect correctly, but at least this gives it a chance; right now the client has no chance to reconnect (if the server is off for a longer period).
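To make the "black hole" argument concrete, here is a minimal sketch in C. It is a toy model only: every name in it (toy_state_t, toy_server_reply, toy_wait_disconnected) is hypothetical and not part of the Malamute API, and it assumes, as described above, that a server which does not know the client answers a PING with ERROR.

```c
#include <assert.h>
#include <stdbool.h>

//  Toy model only: these names are hypothetical, not the Malamute API.
typedef enum { DISCONNECTED, CONNECTED, HAVE_ERROR } toy_state_t;
typedef enum { REPLY_NONE, REPLY_PONG, REPLY_ERROR } toy_reply_t;

//  The server only answers when it receives something. Assumption taken
//  from the text above: a PING from an unknown client is answered with ERROR.
static toy_reply_t
toy_server_reply (bool got_ping, bool server_knows_client)
{
    if (!got_ping)
        return REPLY_NONE;
    return server_knows_client ? REPLY_PONG : REPLY_ERROR;
}

//  Let the heartbeat timer fire up to `ticks` times while the client sits
//  in "disconnected", and return the state it ends up in.
static toy_state_t
toy_wait_disconnected (bool ping_enabled, bool server_knows_client, int ticks)
{
    assert (ticks > 0);
    toy_state_t state = DISCONNECTED;
    for (int tick = 0; tick < ticks && state == DISCONNECTED; tick++) {
        //  With PINGs disabled nothing is sent, so nothing can come back
        switch (toy_server_reply (ping_enabled, server_knows_client)) {
            case REPLY_PONG:
                state = CONNECTED;
                break;
            case REPLY_ERROR:
                state = HAVE_ERROR;     //  would trigger the reconnect
                break;
            case REPLY_NONE:
                break;                  //  stay in the black hole
        }
    }
    return state;
}
```

In this model, toy_wait_disconnected (false, ...) stays in DISCONNECTED no matter how many ticks pass, while toy_wait_disconnected (true, false, 1) already reaches HAVE_ERROR on the first tick. Whether the real client then reconnects correctly is exactly the open question.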
2. Take a look at the right corner of the picture.
In mlm_client.xml:
<state name = "connecting" inherit = "defaults">
    <event name = "OK" next = "connected">
        <action name = "signal success" />
        <action name = "client is connected" />
    </event>
</state>
As a result, the following code can pass (and I have actually seen this behavior a couple of times):
int rv = mlm_client_connect (client, endpoint, timeout, address);
assert (rv == 0);
assert (mlm_client_connected (client) == false);
Proposal: do "signal success" after "client is connected"
Question: is there any reason to leave the order as it is?
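The effect of the action order can be sketched with a small single-threaded model in C. All names here are hypothetical and nothing below is the real Malamute code. The model assumes that "signal success" is the moment the caller of mlm_client_connect wakes up, and that "client is connected" sets the flag that mlm_client_connected reports. It shows the worst case, where the caller reads the flag immediately after being woken.

```c
#include <assert.h>
#include <stdbool.h>

//  Toy model only: these names are hypothetical, not the Malamute API.
typedef struct {
    bool connected;         //  flag set by the "client is connected" action
    bool seen_by_caller;    //  flag value at the instant the caller wakes up
} toy_client_t;

typedef void (toy_action_fn) (toy_client_t *self);

//  "signal success": the caller wakes up and may read the flag right away
static void
toy_signal_success (toy_client_t *self)
{
    self->seen_by_caller = self->connected;
}

//  "client is connected": only now does the flag become true
static void
toy_client_is_connected (toy_client_t *self)
{
    self->connected = true;
}

//  Run the two actions of the OK event in the given order and return
//  what the caller observed when it was woken.
static bool
toy_run_ok_event (toy_action_fn *first, toy_action_fn *second)
{
    assert (first && second);
    toy_client_t client = { false, false };
    first (&client);
    second (&client);
    return client.seen_by_caller;
}
```

With the current order, toy_run_ok_event (toy_signal_success, toy_client_is_connected) returns false, which matches the asserts above. With the proposed order it returns true. In the real client the two sides run in different threads, so the stale read would only happen sometimes.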
3+4. One point I could not understand from the code: when is the client supposed to start heartbeating?
I thought it should happen after the client gets the "OK" response from the server, but the state machine shows that heartbeating already starts in the "connecting" state (while waiting for the response from the server). Is this a bug, or was it done intentionally?
5. This is just a bug, I will fix it later. If mlm_client_connect does not succeed the first time, the client should remain in the "start" state.
6. This is a potential problem. If a "PONG" arrives before the "OK" message from the server, mlm_client_set_producer/consumer/worker will not complete correctly and may never return. I propose to return to the "confirming" state and wait for the "OK" response from the server. Do you think this could break anything?
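A minimal sketch of point 6 in C, again as a toy model with hypothetical names only. It assumes my reading of the state machine: a PONG received in "confirming" currently moves the client on without signalling the caller, and a late OK is then no longer treated as the confirmation. The proposed variant simply stays in "confirming" on PONG.

```c
#include <assert.h>
#include <stdbool.h>

//  Toy model only: these names are hypothetical, not the Malamute API.
typedef enum { TOY_CONFIRMING, TOY_CONNECTED } toy_state_t;
typedef enum { EV_OK, EV_PONG } toy_event_t;

//  Feed server events to a client that starts in "confirming" and return
//  true if the blocked caller (set_producer etc.) was ever signalled.
static bool
toy_caller_returns (const toy_event_t *events, int count,
                    bool pong_stays_in_confirming)
{
    assert (events && count > 0);
    toy_state_t state = TOY_CONFIRMING;
    bool signalled = false;
    for (int index = 0; index < count; index++) {
        if (state != TOY_CONFIRMING)
            continue;               //  assumption: a late OK is ignored here
        if (events [index] == EV_OK) {
            signalled = true;       //  "signal success" wakes the caller
            state = TOY_CONNECTED;
        }
        else
        if (!pong_stays_in_confirming)
            state = TOY_CONNECTED;  //  current behavior: leave without signal
        //  proposed behavior: stay in "confirming" and keep waiting for OK
    }
    return signalled;
}
```

For the sequence PONG, OK the model never signals the caller with the current behavior, and signals it once the OK arrives with the proposed one. A plain OK works the same in both variants.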
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 1946345 bytes
Desc: not available
-------------- next part --------------
Thank you for reading this. I am looking forward to your reply.