[zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue

Russell Della Rosa rdellar2000-zeromq at yahoo.com
Wed Mar 25 20:33:37 CET 2015


> It's similar to the model we use in projects like Malamute. You don't
> need a UUID however. The client sends PING, and the server replies
> PING-OK if it recognizes the client, else it replies with an
> "unexpected command" error. The client handles that by restarting its
> protocol handshake.

I considered that but I didn't go that route for a few reasons:
- I think that setup requires some type of initial handshake / login
- It requires the server to keep some basic state / handle some of the complexity
- In my use case, the client drives all the traffic, so the server doesn't care 
  if the client dies.  (The service will eventually finish/reply back or fail.
  I could probably fail in the server faster if I knew the client died though...)

However, the over-the-wire ping messages would be MUCH smaller without the UUID.
(Long term I'll probably cut the UUID down to something much smaller that has a
near-zero chance of generating the same ID.  But I figured I would start with a UUID.)
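
To get a rough feel for how short such an ID could go, here is a minimal sketch
of the standard birthday-bound estimate.  The function names and the 8-byte
default are assumptions for illustration; nothing in this thread fixes a length.

```python
import math
import secrets


def short_instance_id(num_bytes: int = 8) -> bytes:
    """Return a random per-process ID (hypothetical stand-in for the UUID)."""
    return secrets.token_bytes(num_bytes)


def collision_probability(num_ids: int, num_bytes: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))."""
    space = 2.0 ** (8 * num_bytes)
    return -math.expm1(-num_ids * (num_ids - 1) / (2.0 * space))
```

Under this approximation, a million server restarts with 8-byte IDs still give
a collision chance well below one in a million, while 4-byte IDs would almost
certainly collide somewhere in that set.  A client that only compares the new
ID against the previous one for the same server needs far less than that.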




...On a related note, I haven't completely thought through this, but...

What would be interesting (don't think it is in the spec) would be if the Identity
of the router was exchanged with the client dealer during the initial zmq handshake.
(I think the way it is now, the Identity is ignored unless the socket is connecting
TO a router.  It would be interesting to always save off the peer's identity regardless
of socket type.)  Then, if the server's identity was accessible, the client could just
use the server's Identity in the ping/pong model in place of the UUID I'm generating.






Either way, how to handle heartbeating is an interesting problem...

Thanks!
-- Russell



> ----- Original Message -----
> From: Pieter Hintjens <ph at imatix.com>
> To: Russell Della Rosa <rdellar2000-zeromq at yahoo.com>; ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> Cc: 
> Sent: Wednesday, March 25, 2015 12:45 PM
> Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>
> It's similar to the model we use in projects like Malamute. You don't
> need a UUID however. The client sends PING, and the server replies
> PING-OK if it recognizes the client, else it replies with an
> "unexpected command" error. The client handles that by restarting its
> protocol handshake.
> 
> Server can time-out idle clients, and clients can detect dead servers.
> There are a few corner cases, e.g. don't send more than 3-4 PINGs
> before getting a response.
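
A minimal sketch of the exchange described above, with no sockets involved.
PING and PING-OK come from the message; the HELLO / HELLO-OK handshake
commands, the error string, and the cap of 3 outstanding PINGs are assumptions
made for illustration and are not Malamute's actual protocol.

```python
from typing import Optional, Set


class Server:
    """Remembers which clients completed the (hypothetical) HELLO handshake."""

    def __init__(self) -> None:
        self.known_clients: Set[str] = set()

    def handle(self, client_id: str, command: str) -> str:
        if command == "HELLO":
            self.known_clients.add(client_id)
            return "HELLO-OK"
        if command == "PING" and client_id in self.known_clients:
            return "PING-OK"
        # A restarted server has an empty client set, so it lands here.
        return "ERROR unexpected command"


class Client:
    MAX_OUTSTANDING_PINGS = 3  # assumed cap, per "3-4 PINGs" above

    def __init__(self, client_id: str) -> None:
        self.client_id = client_id
        self.handshaken = False
        self.outstanding_pings = 0

    def next_command(self) -> Optional[str]:
        """Return the next command to send, or None if the PING cap is hit."""
        if not self.handshaken:
            return "HELLO"
        if self.outstanding_pings >= self.MAX_OUTSTANDING_PINGS:
            return None
        self.outstanding_pings += 1
        return "PING"

    def on_reply(self, reply: str) -> None:
        if reply == "HELLO-OK":
            self.handshaken = True
        elif reply != "PING-OK":
            # Any error reply: restart the protocol handshake.
            self.handshaken = False
        self.outstanding_pings = 0
```

What the client does once `next_command()` returns None (treat the server as
dead, reconnect, and so on) is left to the caller.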
>
> -Pieter
>
> On Wed, Mar 25, 2015 at 6:38 PM, Russell Della Rosa
> <rdellar2000-zeromq at yahoo.com> wrote:
>> I implemented ping pong heartbeats with the UUID idea and it works great.  Thanks!
>>
>> It keeps the server stateless (just a UUID on construction was the primary
>> change) and very simple.
>>
>> And the client controls the entire ping / pong lifecycle which is what I really
>> like.  (The client is complex, but I'd rather have the controlling logic all in
>> the client.)
>>
>>
>> Here's a high-level summary of how I did the ping pong heartbeats...  I listed
>> the pseudocode as accurately as I could recall, but no warranty implied or otherwise.  :)
>>
>> Hopefully this pattern is useful to others...
>>
>>
>> Setup:
>> ------
>> client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc--> workers[DEALER]
>>
>> The ping request contains:
>> - A message header that denotes it as a ping  (No payload)
>>
>> The pong response contains:
>> - Server UUID
>> - Health
>>
>> The client side keeps the following state information on each server:
>> - Dead/Alive
>> - missedPongs
>> - successfulPongs
>> - Socket / Url / UUID
>>
>> Client Ping:
>> ------------
>> - A ping (async & dontwait) is sent to all servers
>>
>> - Will poll up to HEARTBEAT_TIMEOUT seconds to receive a pong.
>> -- Upon pong receipt
>> --- If the UUID is unknown, set the UUID = received UUID
>> --- Records the pong as valid IF:
>> ---- The UUID didn't change
>> ---- The pong listed the server as healthy
>> ---- The pong was received before the timeout (implied, see notes)
>> --- For each valid pong:
>> ---- If current status is Alive, reset the missedPongs count to 0
>> ---- If current status Dead, increment the successfulPongs by 1
>>
>> - After each HEARTBEAT_TIMEOUT poll completes, update state for all servers
>> -- If no valid pong was received
>> --- If current status is Alive, increment the missedPongs count by 1
>> --- If current status Dead, reset the successfulPongs count to 0
>> -- If (Alive && missedPongs == HEARTBEAT_UNHEALTHY_THRESHOLD) or UUID changed
>> --- Mark the server as Dead
>> --- Failover to the next best server by rebuilding the socket and resending
>> --- Reset server state variables properly (missed/successful=0, etc)
>> -- If Dead && successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD
>> --- Mark the server as Alive
>> --- Fail back to the best available server (e.g. the primary) by rebuilding the socket and resending
>> --- Reset server state variables properly (missed/successful=0, etc)
>> -- If there was a UUID conflict, reset the stored UUID to unknown
>>
>> - Delay for what is remaining of HEARTBEAT_INTERVAL and then repeat...
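
The per-server bookkeeping in the steps above can be sketched as a small state
machine.  This is one reading of the pseudocode, not the original JeroMQ code:
the threshold values, the field names, and the choice to return a transition
string for the caller to act on (failover, socket rebuild) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_UNHEALTHY_THRESHOLD = 3   # assumed value
HEARTBEAT_HEALTHY_THRESHOLD = 10    # assumed value, deliberately larger


@dataclass
class ServerState:
    url: str
    alive: bool = True
    uuid: Optional[str] = None
    missed_pongs: int = 0
    successful_pongs: int = 0
    uuid_conflict: bool = False
    got_valid_pong: bool = False

    def on_pong(self, uuid: str, healthy: bool) -> None:
        """Record one pong received inside the current timeout window."""
        if self.uuid is None:
            self.uuid = uuid
        if uuid != self.uuid:
            self.uuid_conflict = True   # server restarted behind our back
            return
        if not healthy or self.got_valid_pong:
            return                      # extra pongs in a window count once
        self.got_valid_pong = True
        if self.alive:
            self.missed_pongs = 0
        else:
            self.successful_pongs += 1

    def end_of_window(self) -> Optional[str]:
        """Close the window; return 'dead', 'alive', or None if unchanged."""
        transition = None
        if not self.got_valid_pong:
            if self.alive:
                self.missed_pongs += 1
            else:
                self.successful_pongs = 0
        # >= instead of == is a defensive choice; the pseudocode uses ==.
        missed_out = self.missed_pongs >= HEARTBEAT_UNHEALTHY_THRESHOLD
        if (self.alive and missed_out) or self.uuid_conflict:
            transition = "dead" if self.alive else None
            self.alive = False
            self.missed_pongs = self.successful_pongs = 0
        elif (not self.alive
              and self.successful_pongs >= HEARTBEAT_HEALTHY_THRESHOLD):
            transition = "alive"
            self.alive = True
            self.missed_pongs = self.successful_pongs = 0
        if self.uuid_conflict:
            self.uuid = None            # relearn the UUID from the next pong
            self.uuid_conflict = False
        self.got_valid_pong = False
        return transition
```

The caller would invoke `on_pong` for each pong drained during the poll and
`end_of_window` once per HEARTBEAT_TIMEOUT, rebuilding the socket and resending
whenever a transition is returned.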
>>
>>
>> Server Pong:
>> ------------
>> - Generates a UUID on startup.  (Also uses this as the Router socket identity)
>> - Replies to a ping with a pong that includes the UUID & health
>> -- I decided to include the health in case the server was starting / shutting down
>>
>>
>> Notes:
>>
>> The above includes failback also...  The pattern is just like missedPongs
>> except you track successfulPongs if the server is dead.  And when
>> successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD you bring a server back to
>> life.  (You will also failback to the primary if it comes alive.  Note that
>> HEARTBEAT_HEALTHY_THRESHOLD should be quite a bit bigger than
>> HEARTBEAT_UNHEALTHY_THRESHOLD.)
>>
>> Make sure to rebuild the poll list each time also, since a reconnect will foul
>> the old socket.
>>
>> Also if both servers are dead I decided to keep trying to send requests to the
>> last known live server.  (If both servers die at the same time I will reconnect
>> / resend the first time to handle the quick server death on a single server
>> issue.)
>>
>> I set the heartbeat high water mark to something low so after a few outstanding
>> pings it wouldn't queue any more.  Depending on what you set the HWM to you
>> will need to properly handle receiving multiple pongs when a server comes to
>> life.  (I treated multiple pongs within the same HEARTBEAT_TIMEOUT period as a
>> single pong.  This simplified the logic so I didn't have to track if a pong
>> REALLY came back during the window.  It can come back in the next window and
>> not be double counted.)
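
The "multiple pongs in one window count as a single pong" rule can be sketched
as a drain loop that only remembers whether anything arrived.  `try_recv` is a
hypothetical non-blocking receive supplied by the caller (it returns None when
nothing is queued); the 1 ms idle sleep is likewise an assumption.

```python
import time
from typing import Callable, Optional


def pong_seen_in_window(
    try_recv: Callable[[], Optional[bytes]],
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
    idle: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain pongs until the window closes; report only whether any arrived."""
    deadline = clock() + timeout_s
    seen = False
    while clock() < deadline:
        pong = try_recv()
        if pong is None:
            idle(0.001)      # nothing queued yet; avoid a busy spin
        else:
            seen = True      # a backlog of pongs still counts as one
    return seen
```

Because the loop keeps draining until the deadline, a burst of queued pongs
from a server that just came back is consumed in one window instead of being
spread across several.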
>>
>> In the actual implementation the client pinger and server ponger are running in
>> separate threads.  I did this so the pings wouldn't back up behind real work
>> and real work wouldn't wait on pings.  This means a valid response from the
>> server doesn't count as a pong.  I debated this and decided it was ok since
>> the ping/pongs will flow even if the server is working.  Normally a valid
>> server reply would count as a pong though.
>>
>> Hope this helps someone else,
>>
>> -- Russell
>>
>> ________________________________
>> From: Stephen Lord <Steve.Lord at quantum.com>
>> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
>> Sent: Monday, March 23, 2015 10:50 AM
>> Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>>
>>
>>
>> Have the heartbeat reply include a guid which represents the instance of the
>> server, the server picks a guid at startup and always uses it. If the client
>> sees two different guids then it knows the server restarted and can take
>> action. The server side state is minimal, the client needs to track the guids
>> it gets back on a per server basis.
>>
>>
>>
>>>
>>>
>>>On Mar 23, 2015, at 9:39 AM, Russell Della Rosa <rdellar2000-zeromq at yahoo.com>
>> wrote:
>>>
>>>I'm doing this using JeroMQ (may use jzmq at some point) so I'm at the mercy
>>>of the JVM.
>>>
>>>
>>>I have a wrapper around the JVM that also heartbeats it, and will kill the JVM
>>>if it doesn't reply with a pong.  After the wrapper kills the JVM, it will
>>>quickly restart the JVM so I'm not sure there is a good point to send this
>>>shutdown message.  (The wrapper might be able to but I think that might get
>>>complex.)
>>>
>>>
>>>I like this idea though since it keeps the server stateless.
>>>
>>>
>>>________________________________
>>> From: Justin Karneges <justin at affinix.com>
>>>To: zeromq-dev at lists.zeromq.org
>>>Sent: Friday, March 20, 2015 2:41 PM
>>>Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>>>
>>>
>>>> I'm curious if anyone has solved this quick server restart problem in a
>>>> clean way with socket patterns?  Or if you have other suggestions?  Or if
>>>> you have example code of ping/pong handling this case I'd love to see it.
>>>
>>>I suggest having the server send some kind of shutdown message. This is
>>>basically the same as how regular TCP connection loss is indicated,
>>>except that you have to do it yourself rather than the OS doing it for you.
>>>
>>>Of course, the advantage of the OS doing it for you is that you can
>>>ensure a close packet is sent even if your process crashes. This may be
>>>a bit harder to do with ZeroMQ, depending on the language.
>>>_______________________________________________
>>>zeromq-dev mailing list
>>>zeromq-dev at lists.zeromq.org
>>>http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>>
>>>
>>>


