[zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue

Doron Somech somdoron at gmail.com
Sat Mar 21 20:44:33 CET 2015


Make the protocol between the client and server stateful by adding some kind
of handshake between them.

When the server dies and then restarts quickly, a client that sends a ping
will get an error reply, because the restarted server doesn't know the
client. The client then re-initiates the handshake and immediately expires
any pending requests, which you can then resend. To summarize:

1. Have a handshake / login process between client and server
2. Server keeps a hash table of all clients (routing id to client), to which
it adds each client as part of the handshake process
3. Client sends a ping every X seconds
4. Server receives a ping and replies with a pong if the client is known, or
with an error if the client is unknown
5. A client that receives an error in reply to a ping re-initiates the
handshake/login process
6. Client immediately expires pending requests when the error is received, or
resends all pending requests after the handshake is completed
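
The steps above can be sketched roughly as follows. This is only an
illustration in plain Python with the ZeroMQ transport left out: the message
names (HELLO, WELCOME, PING, PONG, ERROR, ACCEPTED) and the Server / Client
classes are made up for the example, and a server restart is modelled by
simply creating a new Server object with an empty client table.

```python
class Server:
    """Server side: knows only the clients that completed the handshake."""

    def __init__(self):
        # Step 2: routing id -> per-client state. Empty after every restart.
        self.clients = {}

    def handle(self, routing_id, message):
        if message == "HELLO":
            # Steps 1-2: the handshake registers the client.
            self.clients[routing_id] = {"logged_in": True}
            return "WELCOME"
        if routing_id not in self.clients:
            # Step 4: unknown client (e.g. the server was restarted).
            return "ERROR"
        if message == "PING":
            return "PONG"
        # Anything else is treated as a request; a real server would hand
        # it to a worker and reply later.
        return "ACCEPTED"


class Client:
    """Client side: tracks pending requests and re-logs-in on ERROR."""

    def __init__(self, routing_id):
        self.routing_id = routing_id
        self.pending = []  # requests sent but not yet answered

    def login(self, server):
        reply = server.handle(self.routing_id, "HELLO")
        if reply != "WELCOME":
            raise RuntimeError("handshake failed: " + reply)

    def request(self, server, payload):
        self.pending.append(payload)
        return server.handle(self.routing_id, payload)

    def ping(self, server):
        """Step 3. Returns the requests to resend (empty if all is well)."""
        if server.handle(self.routing_id, "PING") == "PONG":
            return []
        # Step 5: the server does not know us, so handshake again.
        self.login(server)
        # Step 6: everything still pending was lost with the old server.
        return list(self.pending)


if __name__ == "__main__":
    server = Server()
    client = Client("client-1")
    client.login(server)
    client.request(server, "job-1")
    server = Server()           # quick restart: client table is gone
    print(client.ping(server))  # requests the client should resend
```

With this in place the client learns about the restart on its next ping
rather than after the full request timeout. The resend variant of step 6 is
used here; expiring the pending requests instead is just a matter of failing
them at the same point.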




On Fri, Mar 20, 2015 at 8:53 PM, Russell Della Rosa <
rdellar2000-zeromq at yahoo.com> wrote:

> Hi all,
>
> Quite a bit of setup to ask a question...
>
> Setup:
> ------
> I set up a simple zmq async request/reply architecture that looks like this:
> client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc-->
> workers[DEALER]
>
> An identical secondary server exists for failover.
>
> To make this more robust against server death / network issues I setup a
> ping pong heartbeat system as described in the guide.  (I liked the
> elegance of having the client control the timeouts, ping times, etc in
> ping/pong.  This makes the server simple since all timings are controlled
> by the client and all the server has to do is reply with a pong.  I
> basically modeled it after the Amazon EC2 ELB heartbeating, but for brevity
> won't go into that.)
>
> The client sends requests & pings in an async manner, but it will poll
> waiting for a reply before sending the next request.  The client will fail
> a request if the REQUEST_TIMEOUT[300s] is exceeded.  While waiting for a
> reply the client will ping the server every HB_INTERVAL[10s], wait
> HB_TIMEOUT[2s] in the poll loop to get a pong, and after some threshold of
> missed pings, HB_UNHEALTHY_THRESHOLD[2], the client will failover to a
> secondary server.  (Note that the REQUEST_TIMEOUT is quite long at 300s
> since some requests can take quite a while to complete.)
>
> Using the settings above, in brackets, all this works very well and only
> causes a worst case delay of around 20s on an outstanding request before it
> will failover to the secondary.
>
> Problem:
> --------
> This works well, except in this one case:
> - Client sends a request, server receives the request, server dies, server
> is restarted very quickly (fast enough to miss no more than one ping)
>
> In this case the client will wait the entire REQUEST_TIMEOUT and then
> fail the request.  (The client assumes the server was working so it waits.
> The pings kept flowing to the server, save maybe 1, so it is treated as alive.)
>
> I have various ideas on how to fix this issue and resend faster, but none
> are that elegant.
> 1) Could add a retry after REQUEST_TIMEOUT, but that is a long time [300s]
> to wait before retrying.  Easiest...
> 2) Could add the server zmq identity to the pong message and force a
> reconnect when the pong identity changes, but that can get complex with
> multiple servers.
> 3) I considered using a ROUTER as the client so the pings would be dropped
> when a server dies, but that is difficult to setup the first time and
> various posts on this forum (see below) mention client routers coming and
> going as being troublesome.  (And ROUTER to ROUTER looks tricky to get
> correct.)
>
> I considered Pub/Sub and one way heartbeats but neither would change this
> behavior, the pong messages would still flow.
>
> I have the service setup to auto-recover on a crash so it's more than just
> an edge case.
>
> Question:
> ---------
> I'm curious if anyone has solved this quick server restart problem in a
> clean way with socket patterns?  Or if you have other suggestions?
>
> Or if you have example code of ping/pong handling this case I'd love to
> see it.
>
> Thanks!
>  -- Russell
>
> Related threads:
> ----------------
> Disconnects / Retry Logic -
> http://lists.zeromq.org/pipermail/zeromq-dev/2012-January/015024.html
> Using a router with an identity issue -
> http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025206.html