[zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue

Russell Della Rosa rdellar2000-zeromq at yahoo.com
Wed Mar 25 18:38:37 CET 2015


I implemented ping pong heartbeats with the UUID idea and it works great.  Thanks!

It keeps the server stateless (just a UUID on construction was the primary
change) and very simple. 

And the client controls the entire ping / pong lifecycle, which is what I really
like.  (The client is complex, but I'd rather have the controlling logic all in
the client.)


Here's a high-level summary of how I did the ping pong heartbeats...  I listed
the pseudocode as accurately as I could recall, but no warranty, implied or otherwise.  :)

Hopefully this pattern is useful to others...


Setup:
------
client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc--> workers[DEALER]
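
For reference, the primary server's frontend/backend wiring is just the standard
ROUTER-to-DEALER proxy.  A minimal sketch, assuming a reasonably recent JeroMQ
API; the port and endpoint names are made up, not my actual values:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class PrimaryServer {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            // Clients connect here over TCP.
            ZMQ.Socket frontend = ctx.createSocket(SocketType.ROUTER);
            frontend.bind("tcp://*:5555");            // illustrative port

            // Workers connect DEALER sockets here over inproc.
            ZMQ.Socket backend = ctx.createSocket(SocketType.DEALER);
            backend.bind("inproc://workers");

            // ... start worker threads that connect to inproc://workers ...

            // Shuttle frames between clients and workers until the context closes.
            ZMQ.proxy(frontend, backend, null);
        }
    }
}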

The ping request contains:
- A message header that denotes it as a ping  (No payload)

The pong response contains:
- Server UUID
- Health

The client side keeps the following state information on each server (see the sketch after this list):
- Dead/Alive
- missedPongs
- successfulPongs
- Socket / Url / UUID
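
For concreteness, the per-server state is roughly this shape (illustrative only,
not my exact code):

import org.zeromq.ZMQ;

// Rough per-server state holder kept by the client.
class ServerState {
    final String url;             // tcp endpoint of this server
    String uuid;                  // null until the first pong fixes it
    boolean alive = true;         // Dead/Alive
    int missedPongs = 0;          // counted while Alive
    int successfulPongs = 0;      // counted while Dead
    ZMQ.Socket socket;            // rebuilt on failover/failback

    ServerState(String url) { this.url = url; }
}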

Client Ping:
------------
- A ping (async & dontwait) is sent to all servers

- Will poll up to HEARTBEAT_TIMEOUT seconds to receive a pong.
-- Upon pong receipt
--- If the UUID is unknown, set the UUID = received UUID
--- Records the pong as valid IF:
---- The UUID didn't change
---- The pong listed the server as healthy
---- The pong was received before the timeout (implied, see notes)
--- For each valid pong:
---- If current status is Alive, reset the missedPongs count to 0
---- If current status is Dead, increment the successfulPongs count by 1

- After each HEARTBEAT_TIMEOUT poll completes, update state for all servers
-- If no valid pong was received
--- If current status is Alive, increment the missedPongs count by 1
--- If current status is Dead, reset the successfulPongs count to 0
-- If (Alive && missedPongs == HEARTBEAT_UNHEALTHY_THRESHOLD) or UUID changed
--- Mark the server as Dead
--- Fail over to the next best server by rebuilding the socket and resending
--- Reset server state variables properly (missed/successful=0, etc)
-- If Dead && successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD
--- Mark the server as Alive
--- Fail back to the best available server by rebuilding the socket and resending
--- Reset server state variables properly (missed/successful=0, etc)
-- If there was a UUID conflict, reset the stored UUID to unknown

- Delay for the remainder of HEARTBEAT_INTERVAL and then repeat...  (A rough sketch of this whole loop follows.)
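
To make that loop concrete, here is a stripped-down JeroMQ-style sketch of one
heartbeat cycle, using the ServerState holder from above.  The constants, the
"PING"/"HEALTHY" frame strings, and the markDeadAndFailover / markAliveAndFailback
helper names are all made up for illustration, not my exact code:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

class HeartbeatClient {
    static final int HEARTBEAT_TIMEOUT_MS = 1000;          // illustrative values
    static final int HEARTBEAT_INTERVAL_MS = 2500;
    static final int HEARTBEAT_UNHEALTHY_THRESHOLD = 3;
    static final int HEARTBEAT_HEALTHY_THRESHOLD = 10;

    // One heartbeat cycle, roughly as described in the list above.
    void heartbeatCycle(ZContext ctx, List<ServerState> servers) {
        // 1. Send a ping (async & dontwait) to every server.  send() returns false
        //    once the low HWM is hit, which is fine -- no more pings should queue.
        for (ServerState s : servers) {
            s.socket.send("PING", ZMQ.DONTWAIT);
        }

        // 2. Poll up to HEARTBEAT_TIMEOUT for pongs.  The poller is rebuilt every
        //    cycle because a failover may have replaced sockets since the last one.
        ZMQ.Poller poller = ctx.createPoller(servers.size());
        for (ServerState s : servers) {
            poller.register(s.socket, ZMQ.Poller.POLLIN);
        }
        Set<ServerState> gotValidPong = new HashSet<>();   // multiple pongs count once
        Set<ServerState> uuidConflict = new HashSet<>();
        long deadline = System.currentTimeMillis() + HEARTBEAT_TIMEOUT_MS;
        long remaining;
        while ((remaining = deadline - System.currentTimeMillis()) > 0) {
            if (poller.poll(remaining) <= 0) break;
            for (int i = 0; i < servers.size(); i++) {
                if (!poller.pollin(i)) continue;
                ServerState s = servers.get(i);
                String uuid = s.socket.recvStr(ZMQ.DONTWAIT);    // pong frame 1: UUID
                String health = s.socket.recvStr(ZMQ.DONTWAIT);  // pong frame 2: health
                if (uuid == null) continue;
                if (s.uuid == null) s.uuid = uuid;               // first pong fixes the UUID
                if (uuid.equals(s.uuid) && "HEALTHY".equals(health)) {
                    gotValidPong.add(s);
                } else if (!uuid.equals(s.uuid)) {
                    uuidConflict.add(s);                         // server must have restarted
                }
            }
        }
        poller.close();

        // 3. After the poll window closes, update state for every server.
        for (ServerState s : servers) {
            if (gotValidPong.contains(s)) {
                if (s.alive) s.missedPongs = 0; else s.successfulPongs++;
            } else {
                if (s.alive) s.missedPongs++; else s.successfulPongs = 0;
            }
            boolean tooManyMisses = s.alive && s.missedPongs >= HEARTBEAT_UNHEALTHY_THRESHOLD;
            if (tooManyMisses || uuidConflict.contains(s)) {
                markDeadAndFailover(s);      // rebuild socket, resend work, reset counters
            } else if (!s.alive && s.successfulPongs >= HEARTBEAT_HEALTHY_THRESHOLD) {
                markAliveAndFailback(s);     // rebuild socket, resend work, reset counters
            }
            if (uuidConflict.contains(s)) {
                s.uuid = null;               // forget the stored UUID after a conflict
            }
        }
        // 4. The caller delays for the remainder of HEARTBEAT_INTERVAL and repeats.
    }

    // Hypothetical helpers: flip Dead/Alive, rebuild the socket, resend outstanding
    // work to the best available server, and reset the pong counters.
    void markDeadAndFailover(ServerState s)  { /* ... */ }
    void markAliveAndFailback(ServerState s) { /* ... */ }
}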


Server Pong:
------------
- Generates a UUID on startup.  (Also uses this as the ROUTER socket identity)
- Replies to a ping with a pong that includes the UUID & health
-- I decided to include the health in case the server was starting / shutting down
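
A rough sketch of a worker-side ponger (again illustrative: the "PING"/"HEALTHY"
strings and endpoint name are made up, and how ping traffic is kept separate from
real work in my actual setup is covered in the notes below):

import java.util.UUID;
import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

// Sketch of the server-side ponger running behind the ROUTER/DEALER proxy.
class PongWorker implements Runnable {
    private final ZContext ctx;
    private final String serverUuid = UUID.randomUUID().toString();  // generated once at startup

    PongWorker(ZContext ctx) { this.ctx = ctx; }

    public void run() {
        ZMQ.Socket socket = ctx.createSocket(SocketType.DEALER);
        socket.connect("inproc://workers");        // backend of the ROUTER/DEALER proxy

        while (!Thread.currentThread().isInterrupted()) {
            // Frames from the proxy: [client identity][message header]...
            byte[] clientId = socket.recv();
            String header = socket.recvStr();
            if ("PING".equals(header)) {
                socket.send(clientId, ZMQ.SNDMORE);                      // route the pong back
                socket.send(serverUuid, ZMQ.SNDMORE);                    // pong frame 1: UUID
                socket.send(isHealthy() ? "HEALTHY" : "UNHEALTHY", 0);   // pong frame 2: health
            } else {
                // ... otherwise it's real work; handle it and reply as usual ...
            }
        }
    }

    // Hypothetical health check, e.g. false while starting up or shutting down.
    private boolean isHealthy() { return true; }
}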


Notes:

The above also includes failback...  The pattern is just like missedPongs
except you track successfulPongs while the server is dead.  And when
successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD you bring the server back to
life.  (You will also fail back to the primary if it comes alive.  Note that
HEARTBEAT_HEALTHY_THRESHOLD should be quite a bit bigger than
HEARTBEAT_UNHEALTHY_THRESHOLD.)

Make sure to rebuild the poll list each time also, since a reconnect will foul
the old socket.
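
In code terms, "rebuilding the socket" during failover/failback is basically the
following (a sketch, reusing the ZContext/ServerState shapes from above):

import org.zeromq.SocketType;
import org.zeromq.ZContext;

class SocketRebuilder {
    // Tear down the stale DEALER socket and connect a fresh one (illustrative).
    static void rebuildSocket(ZContext ctx, ServerState s) {
        if (s.socket != null) {
            ctx.destroySocket(s.socket);   // closes the socket; queued messages are dropped
        }
        s.socket = ctx.createSocket(SocketType.DEALER);
        // Set socket options (e.g. the low heartbeat HWMs mentioned below) before connecting.
        s.socket.connect(s.url);
        // The heartbeat loop recreates its poller every cycle, so the registration
        // for the old socket simply disappears instead of fouling the poll list.
    }
}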

Also, if both servers are dead I decided to keep trying to send requests to the
last known live server.  (If both servers die at the same time I will reconnect
/ resend the first time, to handle the quick-server-death-on-a-single-server
issue.)

I set the heartbeat high water mark to something low so after a few outstanding
pings it wouldn't queue any more.  Depending on what you set the HWM to, you
will need to properly handle receiving multiple pongs when a server comes back to
life.  (I treated multiple pongs within the same HEARTBEAT_TIMEOUT period as a
single pong.  This simplified the logic so I didn't have to track if a pong
REALLY came back during the window.  It can come back in the next window and
not be double counted.)
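
Concretely, that's just the send/receive HWM options on the heartbeat socket; the
values here are made up, use whatever small number fits your HEARTBEAT_INTERVAL:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

class HeartbeatSockets {
    // Build a heartbeat DEALER with low HWMs so only a few pings/pongs can queue.
    static ZMQ.Socket newHeartbeatSocket(ZContext ctx, String serverUrl) {
        ZMQ.Socket hb = ctx.createSocket(SocketType.DEALER);
        hb.setSndHWM(2);   // at most a couple of unsent pings queue while the server is down
        hb.setRcvHWM(2);   // at most a couple of pongs queue up when the server comes back
        hb.connect(serverUrl);
        return hb;
    }
}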

In the actual implementation the client pinger and server ponger are running in
separate threads.  I did this so the pings wouldn't back up behind real work
and real work wouldn't wait on pings.  This means a valid response from the
server isn't counted as a pong.  I debated this and decided it was ok since
the ping/pongs will flow even if the server is working.  Normally a valid
server reply would count as a pong though.
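
Wiring-wise, that just means the pinger gets its own thread and its own DEALER
sockets (ZeroMQ sockets shouldn't be shared across threads).  Something like the
following, with made-up endpoints and reusing the sketches above:

import java.util.List;
import org.zeromq.ZContext;

class PingerBootstrap {
    // Start the heartbeat loop on a dedicated thread with its own sockets (sketch).
    static Thread startPinger() {
        Thread pinger = new Thread(() -> {
            try (ZContext ctx = new ZContext()) {
                List<ServerState> servers = List.of(
                        new ServerState("tcp://primary-host:5555"),   // endpoints are made up
                        new ServerState("tcp://backup-host:5555"));
                for (ServerState s : servers) {
                    s.socket = HeartbeatSockets.newHeartbeatSocket(ctx, s.url);
                }
                HeartbeatClient client = new HeartbeatClient();
                while (!Thread.currentThread().isInterrupted()) {
                    long start = System.currentTimeMillis();
                    client.heartbeatCycle(ctx, servers);
                    long sleep = HeartbeatClient.HEARTBEAT_INTERVAL_MS
                            - (System.currentTimeMillis() - start);
                    if (sleep > 0) {
                        try { Thread.sleep(sleep); } catch (InterruptedException e) { break; }
                    }
                }
            }
        }, "heartbeat-pinger");
        pinger.setDaemon(true);
        pinger.start();
        return pinger;
    }
}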

Hope this helps someone else,

-- Russell

________________________________
From: Stephen Lord <Steve.Lord at quantum.com>
To: ZeroMQ development list <zeromq-dev at lists.zeromq.org> 
Sent: Monday, March 23, 2015 10:50 AM
Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue



Have the heartbeat reply include a guid which represents the instance of the
server, the server picks a guid at startup and always uses it. If the client
sees two different guids then it knows the server restarted and can take
action. The server side state is minimal, the client needs to track the guids
it gets back on a per server basis.



>
>
>On Mar 23, 2015, at 9:39 AM, Russell Della Rosa <rdellar2000-zeromq at yahoo.com>
wrote:
>
>I'm doing this using JeroMQ (may use jzmq at some point) so I'm at the mercy
of the JVM.
>
>
>I have a wrapper around the JVM that heartbeats also, and it will kill the JVM
if it doesn't reply with a pong.  After the wrapper kills the JVM, it will
quickly restart the JVM, so I'm not sure there is a good point to send this
shutdown message.  (The wrapper might be able to, but I think that might get
complex.)
>
>
>I like this idea though since it keeps the server stateless.
>
>
>________________________________
> From: Justin Karneges <justin at affinix.com>
>To: zeromq-dev at lists.zeromq.org 
>Sent: Friday, March 20, 2015 2:41 PM
>Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>
>
>> I'm curious if anyone has solved this quick server restart problem in a
>> clean way with socket patterns?  Or if you have other suggestions?  Or if
>> you have example code of ping/pong handling this case I'd love to see it.
>
>I suggest having the server send some kind of shutdown message. This is
>basically the same as how regular TCP connection loss is indicated,
>except that you have to do it yourself rather than the OS doing it for you.
>
>Of course, the advantage of the OS doing it for you is that you can
>ensure a close packet is sent even if your process crashes. This may be
>a bit harder to do with ZeroMQ, depending on the language.


_______________________________________________
zeromq-dev mailing list
zeromq-dev at lists.zeromq.org
http://lists.zeromq.org/mailman/listinfo/zeromq-dev


