[zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
Pieter Hintjens
ph at imatix.com
Wed Mar 25 18:45:12 CET 2015
It's similar to the model we use in projects like Malamute. You don't
need a UUID, however. The client sends PING, and the server replies
PING-OK if it recognizes the client, else it replies with an
"unexpected command" error. The client handles that by restarting its
protocol handshake.
The server can time out idle clients, and clients can detect dead servers.
There are a few corner cases, e.g. don't send more than 3-4 PINGs
without getting a response.
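
In code the client side of that comes down to something like this (a rough
Java/JeroMQ fragment, since that's what's in use downthread; the poller, the
timeout constant, and restartHandshake() are assumed/hypothetical):

    socket.send("PING", ZMQ.DONTWAIT);
    if (poller.poll(HEARTBEAT_TIMEOUT_MS) > 0) {
        String reply = socket.recvStr();
        if ("PING-OK".equals(reply))
            missedPings = 0;              // server still recognizes us
        else
            restartHandshake();           // "unexpected command": start over
    }
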
-Pieter
On Wed, Mar 25, 2015 at 6:38 PM, Russell Della Rosa
<rdellar2000-zeromq at yahoo.com> wrote:
> I implemented ping pong heartbeats with the UUID idea and it works great. Thanks!
>
> It keeps the server stateless (just a UUID on construction was the primary
> change) and very simple.
>
> And the client controls the entire ping / pong lifecycle, which is what I really
> like. (The client is complex, but I'd rather have the controlling logic all in
> the client.)
>
>
> Here's a high-level summary of how I did the ping pong heartbeats... I listed
> the pseudocode as accurately as I could recall it, but no warranty, implied or
> otherwise. :)
>
> Hopefully this pattern is useful to others...
>
>
> Setup:
> ------
> client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc--> workers[DEALER]
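>
> In JeroMQ the middle hop can be the built-in proxy; a minimal sketch (the
> endpoint names and port are illustrative):
>
>     ZMQ.Context ctx = ZMQ.context(1);
>     ZMQ.Socket frontend = ctx.socket(ZMQ.ROUTER);
>     frontend.bind("tcp://*:5555");          // clients connect here
>     ZMQ.Socket backend = ctx.socket(ZMQ.DEALER);
>     backend.bind("inproc://workers");       // worker DEALERs connect here
>     ZMQ.proxy(frontend, backend, null);     // blocks, shuttling frames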
>
> The ping request contains:
> - A message header that denotes it as a ping (no payload)
>
> The pong response contains:
> - Server UUID
> - Health
>
> The client side keeps the following state information on each server (a code
> sketch follows the list):
> - Dead/Alive
> - missedPongs
> - successfulPongs
> - Socket / Url / UUID
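>
> A minimal sketch of that record in Java (the class and field names are mine,
> not from the actual implementation):
>
>     import org.zeromq.ZMQ;
>
>     class ServerState {
>         boolean alive = true;          // Dead/Alive
>         int missedPongs = 0;           // counted while Alive
>         int successfulPongs = 0;       // counted while Dead
>         boolean uuidConflict = false;  // set when the UUID changes
>         String url;                    // e.g. "tcp://primary:5555"
>         String uuid;                   // null until the first pong arrives
>         ZMQ.Socket socket;             // rebuilt on failover
>
>         ServerState(String url) { this.url = url; }
>     }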
>
> Client Ping:
> ------------
> - A ping (async & dontwait) is sent to all servers
>
> - The client polls for up to HEARTBEAT_TIMEOUT seconds to receive pongs.
> -- Upon pong receipt
> --- If the UUID is unknown, set the UUID = received UUID
> --- Records the pong as valid IF:
> ---- The UUID didn't change
> ---- The pong listed the server as healthy
> ---- The pong was received before the timeout (implied, see notes)
> --- For each valid pong:
> ---- If current status is Alive, reset the missedPongs count to 0
> ---- If current status is Dead, increment the successfulPongs by 1
>
> - After each HEARTBEAT_TIMEOUT poll completes, update state for all servers
> -- If no valid pong was received
> --- If current status is Alive, increment the missedPongs count by 1
> --- If current status is Dead, reset the successfulPongs count to 0
> -- If (Alive && missedPongs == HEARTBEAT_UNHEALTHY_THRESHOLD) or UUID changed
> --- Mark the server as Dead
> --- Failover to the next best server by rebuilding the socket and resending
> --- Reset server state variables properly (missed/successful=0, etc)
> -- If Dead && successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD
> --- Mark the server as Alive
> --- Fail back to the best available server by rebuilding the socket and resending
> --- Reset server state variables properly (missed/successful=0, etc)
> -- If there was a UUID conflict, reset the stored UUID to unknown
>
> - Delay for whatever remains of HEARTBEAT_INTERVAL and then repeat... (one
> pass of this loop is sketched below)
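>
> Here's roughly what one pass looks like in JeroMQ (a sketch under my own
> naming: ctx, servers (a List<ServerState>), the java.util imports, the
> constants, and failover() are all assumed; the frame layout matches the pong
> sketch further down):
>
>     // Send a ping to every server, async so a dead peer can't block us.
>     for (ServerState s : servers)
>         s.socket.send("PING", ZMQ.DONTWAIT);
>
>     // Rebuild the poll list every pass (see the note about reconnects below).
>     ZMQ.Poller poller = ctx.poller(servers.size());
>     for (ServerState s : servers)
>         poller.register(s.socket, ZMQ.Poller.POLLIN);
>
>     Set<ServerState> ponged = new HashSet<ServerState>();
>     long deadline = System.currentTimeMillis() + HEARTBEAT_TIMEOUT_MS;
>     long left;
>     while ((left = deadline - System.currentTimeMillis()) > 0) {
>         if (poller.poll(left) <= 0)
>             break;                                   // window expired
>         for (int i = 0; i < servers.size(); i++) {
>             if (!poller.pollin(i)) continue;
>             ServerState s = servers.get(i);
>             String uuid = s.socket.recvStr();
>             String health = s.socket.hasReceiveMore() ? s.socket.recvStr() : "";
>             if (s.uuid == null) s.uuid = uuid;       // first contact
>             if (uuid.equals(s.uuid) && "healthy".equals(health))
>                 ponged.add(s);                       // duplicate pongs collapse here
>             else if (!uuid.equals(s.uuid)) {
>                 s.uuidConflict = true;               // forces Dead below
>                 s.uuid = null;                       // reset stored UUID to unknown
>             }
>         }
>     }
>
>     for (ServerState s : servers) {
>         if (ponged.contains(s)) {
>             if (s.alive) s.missedPongs = 0; else s.successfulPongs++;
>         } else {
>             if (s.alive) s.missedPongs++; else s.successfulPongs = 0;
>         }
>         if (s.alive && (s.missedPongs >= HEARTBEAT_UNHEALTHY_THRESHOLD
>                         || s.uuidConflict)) {
>             s.alive = false;
>             s.missedPongs = 0; s.uuidConflict = false;
>             failover(s);                             // rebuild socket, resend
>         } else if (!s.alive
>                    && s.successfulPongs >= HEARTBEAT_HEALTHY_THRESHOLD) {
>             s.alive = true;
>             s.successfulPongs = 0;
>             failover(s);                             // fail back to the better server
>         }
>     }
>     // ...then sleep whatever remains of HEARTBEAT_INTERVAL and repeat.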
>
>
> Server Pong:
> ------------
> - Generates a UUID on startup. (Also uses this as the ROUTER socket identity)
> - Replies to a ping with a pong that includes the UUID & health
> -- I decided to include the health in case the server was starting or
> shutting down (the ponger is sketched below)
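>
> A rough sketch of the ponger thread, assuming it owns its own ROUTER socket
> on a dedicated endpoint (the endpoint, frame layout, and the shuttingDown
> flag are all illustrative):
>
>     import java.util.UUID;
>     import org.zeromq.ZMQ;
>
>     String uuid = UUID.randomUUID().toString();   // fresh per process
>     ZMQ.Socket router = ctx.socket(ZMQ.ROUTER);
>     router.setIdentity(uuid.getBytes());          // UUID doubles as identity
>     router.bind("tcp://*:5556");
>
>     while (!Thread.currentThread().isInterrupted()) {
>         byte[] clientId = router.recv();          // ROUTER prepends the sender
>         String header = router.recvStr();
>         if ("PING".equals(header)) {
>             router.send(clientId, ZMQ.SNDMORE);   // route back to the sender
>             router.send(uuid, ZMQ.SNDMORE);
>             // shuttingDown: hypothetical flag set during startup/shutdown
>             router.send(shuttingDown ? "unhealthy" : "healthy", 0);
>         }
>     }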
>
>
> Notes:
>
> The above includes failback also... The pattern is just like missedPongs
> except you track successfulPongs if the server is dead. And when
> successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD you bring a server back to
> life. (You will also fail back to the primary if it comes alive. Note that
> HEARTBEAT_HEALTHY_THRESHOLD should be quite a bit bigger than
> HEARTBEAT_UNHEALTHY_THRESHOLD.)
>
> Make sure to rebuild the poll list each time as well, since a reconnect will
> foul the old socket (roughly as sketched below).
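>
> In JeroMQ terms that rebuild looks roughly like this (a sketch; the HWM
> setting is explained in the next note):
>
>     poller.unregister(s.socket);        // drop the fouled socket from the poll set
>     s.socket.close();
>     s.socket = ctx.socket(ZMQ.DEALER);
>     s.socket.setSndHWM(HEARTBEAT_HWM);  // must be set before connect
>     s.socket.connect(s.url);
>     poller.register(s.socket, ZMQ.Poller.POLLIN);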
>
> Also, if both servers are dead, I decided to keep trying to send requests to
> the last known live server. (If both servers die at the same time I will
> reconnect / resend on the first failure, to handle the quick-server-death
> issue on a single server.)
>
> I set the heartbeat high-water mark to something low so that after a few
> outstanding pings it wouldn't queue any more. Depending on what you set the
> HWM to, you will need to properly handle receiving multiple pongs when a
> server comes back to life. (I treated multiple pongs within the same
> HEARTBEAT_TIMEOUT period as a single pong. This simplified the logic so I
> didn't have to track whether a pong REALLY came back during the window. It
> can come back in the next window and not be double counted.)
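>
> Concretely that's one socket option at construction time (the value and
> serverUrl are illustrative):
>
>     ZMQ.Socket ping = ctx.socket(ZMQ.DEALER);
>     ping.setSndHWM(4);     // after ~4 unsent pings, further DONTWAIT sends
>                            // return false instead of queueing more
>     ping.connect(serverUrl);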
>
> In the actual implementation the client pinger and server ponger run in
> separate threads. I did this so pings wouldn't back up behind real work and
> real work wouldn't wait on pings. This means a valid response from the
> server isn't counted as a pong. I debated this and decided it was OK since
> the ping/pongs will flow even if the server is working. Normally a valid
> server reply would count as a pong, though.
>
> Hope this helps someone else,
>
> -- Russell
>
> ________________________________
> From: Stephen Lord <Steve.Lord at quantum.com>
> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> Sent: Monday, March 23, 2015 10:50 AM
> Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>
>
>
> Have the heartbeat reply include a guid which represents the instance of the
> server; the server picks a guid at startup and always uses it. If the client
> sees two different guids then it knows the server restarted and can take
> action. The server-side state is minimal; the client needs to track the
> guids it gets back on a per-server basis.
>
>
>
>>
>>
>>On Mar 23, 2015, at 9:39 AM, Russell Della Rosa <rdellar2000-zeromq at yahoo.com>
>>wrote:
>>
>>I'm doing this using JeroMq (may use jzmq at some point) so I'm at the mercy
>>of the JVM.
>>
>>
>>I have a wrapper around the JVM that heartbeats also, and it will kill the JVM
>>if it doesn't reply with a pong. After the wrapper kills the JVM, it will
>>quickly restart the JVM, so I'm not sure there is a good point at which to
>>send this shutdown message. (The wrapper might be able to, but I think that
>>might get complex.)
>>
>>
>>I like this idea though since it keeps the server stateless.
>>
>>
>>________________________________
>> From: Justin Karneges <justin at affinix.com>
>>To: zeromq-dev at lists.zeromq.org
>>Sent: Friday, March 20, 2015 2:41 PM
>>Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>>
>>
>>> I'm curious if anyone has solved this quick server restart problem in a
>>> clean way with socket patterns? Or if you have other suggestions? Or if
>>> you have example code of ping/pong handling this case I'd love to see it.
>>
>>I suggest having the server send some kind of shutdown message. This is
>>basically the same as how regular TCP connection loss is indicated,
>>except that you have to do it yourself rather than the OS doing it for you.
>>
>>Of course, the advantage of the OS doing it for you is that you can
>>ensure a close packet is sent even if your process crashes. This may be
>>a bit harder to do with ZeroMQ, depending on the language.
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev