[zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue

Russell Della Rosa rdellar2000-zeromq at yahoo.com
Fri Mar 20 20:22:28 CET 2015

(Resending plaintext.)

Hi all,

Quite a bit of setup to ask a question...

I set up a simple zmq async request/reply architecture that looks like this:
client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc--> workers[DEALER]

An identical secondary server exists for failover.

To make this more robust against server death and network issues, I set up a ping-pong heartbeat system as described in the guide. (I liked the elegance of having the client control the timeouts, ping times, etc. in ping/pong. This keeps the server simple: all timings are controlled by the client, and all the server has to do is reply with a pong. I basically modeled it after the Amazon EC2 ELB heartbeating, but for brevity won't go into that.)

The client sends requests and pings asynchronously, but it polls for a reply before sending the next request. The client fails a request if REQUEST_TIMEOUT[300s] is exceeded. While waiting for a reply, the client pings the server every HB_INTERVAL[10s] and waits up to HB_TIMEOUT[2s] in the poll loop for a pong. After a threshold of missed pings, HB_UNHEALTHY_THRESHOLD[2], the client fails over to a secondary server. (Note that REQUEST_TIMEOUT is quite long at 300s since some requests can take quite a while to complete.)
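The timing rules above can be sketched as a small state machine, kept free of any socket code so it can be read on its own. This is only an illustration under stated assumptions: the class and method names (HeartbeatTracker, on_tick, on_pong) are hypothetical, pings are assumed to go out on a fixed schedule, and the poll loop is assumed to call on_tick with the time elapsed since the request was sent.

```python
from dataclasses import dataclass
from typing import Optional

# Values from the post; the names match the bracketed settings.
HB_INTERVAL = 10.0            # seconds between pings
HB_TIMEOUT = 2.0              # seconds to wait for a pong
HB_UNHEALTHY_THRESHOLD = 2    # consecutive missed pongs before failover


@dataclass
class HeartbeatTracker:
    """Hypothetical client-side heartbeat bookkeeping (no sockets)."""
    next_ping_at: float = HB_INTERVAL
    pong_deadline: Optional[float] = None
    missed: int = 0

    def on_tick(self, now: float) -> Optional[str]:
        """Called from the poll loop. Returns 'ping', 'failover', or None."""
        # A pong that did not arrive in time counts as one missed ping.
        if self.pong_deadline is not None and now >= self.pong_deadline:
            self.pong_deadline = None
            self.missed += 1
            if self.missed >= HB_UNHEALTHY_THRESHOLD:
                return "failover"
        # Time to send the next ping and start waiting for its pong.
        if now >= self.next_ping_at:
            self.next_ping_at = now + HB_INTERVAL
            self.pong_deadline = now + HB_TIMEOUT
            return "ping"
        return None

    def on_pong(self) -> None:
        """Any pong marks the server healthy and clears the miss counter."""
        self.pong_deadline = None
        self.missed = 0
```

In a real client, 'ping' would trigger a send on the DEALER socket and 'failover' would trigger a reconnect to the secondary; both are left out here.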

With the bracketed settings above, all this works very well and causes a worst-case delay of only around 20s on an outstanding request before the client fails over to the secondary.
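As a rough check on the "around 20s" figure, the delay can be estimated from the three heartbeat settings. This assumes pings go out on a fixed HB_INTERVAL schedule and that a miss is declared HB_TIMEOUT after each unanswered ping; the function names are made up for illustration and the real client may time things slightly differently.

```python
HB_INTERVAL = 10.0
HB_TIMEOUT = 2.0
HB_UNHEALTHY_THRESHOLD = 2


def worst_case_failover_delay(interval, timeout, threshold):
    """Server dies right after answering a ping: the client waits a full
    interval for each of the next `threshold` pings, plus the final timeout."""
    return threshold * interval + timeout


def best_case_failover_delay(interval, timeout, threshold):
    """Server dies right before a ping is sent: the first miss starts
    almost immediately, so one interval is saved."""
    return (threshold - 1) * interval + timeout


worst = worst_case_failover_delay(HB_INTERVAL, HB_TIMEOUT, HB_UNHEALTHY_THRESHOLD)
best = best_case_failover_delay(HB_INTERVAL, HB_TIMEOUT, HB_UNHEALTHY_THRESHOLD)
print(f"failover delay between {best:.0f}s and {worst:.0f}s")
```

Under these assumptions the delay falls between 12s and 22s, which matches the roughly 20s worst case observed.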

This works well, except in this one case:
- Client sends a request, server receives the request, server dies, server is restarted very quickly (fast enough to miss no more than one ping)

In this case the client will wait the entire REQUEST_TIMEOUT and then fail the request. (The client assumes the server is still working on the request, so it waits. The pings kept flowing to the server, save maybe one, so it is treated as alive.)
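The failure mode can be reproduced with a small timeline simulation. It is a simplification, not the actual client: a ping is assumed to be answered if and only if the server is up at the moment the ping is sent, and simulate() is a hypothetical helper.

```python
REQUEST_TIMEOUT = 300.0
HB_INTERVAL = 10.0
HB_TIMEOUT = 2.0
HB_UNHEALTHY_THRESHOLD = 2


def simulate(down_from, down_until, horizon=REQUEST_TIMEOUT):
    """Return the time at which the client fails over, or None if it
    never does before `horizon`. The server is down in [down_from, down_until)."""
    missed = 0
    t = HB_INTERVAL                      # first ping after one interval
    while t + HB_TIMEOUT <= horizon:
        server_up = not (down_from <= t < down_until)
        if server_up:
            missed = 0                   # a pong resets the counter
        else:
            missed += 1                  # no pong within HB_TIMEOUT
            if missed >= HB_UNHEALTHY_THRESHOLD:
                return t + HB_TIMEOUT
        t += HB_INTERVAL
    return None


# Server stays dead: two consecutive misses, failover at 22s.
print(simulate(down_from=5.0, down_until=float("inf")))
# Server is back within 10s: only the ping at t=10 is missed, the counter
# resets at t=20, and the client never fails over.
print(simulate(down_from=5.0, down_until=15.0))
```

In the second run the restarted server has lost the in-flight request, yet the heartbeat logic sees nothing wrong, so the client only gives up when REQUEST_TIMEOUT expires.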

I have various ideas on how to fix this issue and resend faster, but none are that elegant. My goal is to resend the message as fast as possible to the secondary.
1) Could add a retry after REQUEST_TIMEOUT, but that is a long time [300s] to wait before retrying.  Easiest...
2) Could add the server identity to the pong message and force a reconnect when the pong identity changes, but that can get complex with multiple servers.
3) I considered using a ROUTER as the client so the pings would be dropped when a server dies, but that is difficult to set up the first time, and various posts on this forum (see below) mention client routers coming and going as being troublesome.  (And ROUTER to ROUTER looks tricky to get correct.)
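To make option 2 concrete, here is a minimal sketch of restart detection via an instance ID carried in the pong. Everything in it is an assumption for illustration: the pong layout, the ServerInstance and ClientSession classes, and the idea of keying known IDs by endpoint to cope with more than one server. No sockets are involved; the classes only model the bookkeeping.

```python
import uuid


class ServerInstance:
    """Stands in for one server process; a restart creates a new object."""

    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.instance_id = uuid.uuid4().hex   # fresh on every start

    def pong(self):
        # Hypothetical pong payload carrying the instance ID.
        return {"endpoint": self.endpoint, "instance": self.instance_id}


class ClientSession:
    """Tracks the last instance ID seen per endpoint."""

    def __init__(self):
        self.known = {}          # endpoint -> instance ID
        self.outstanding = None  # request awaiting a reply, if any

    def send_request(self, request):
        self.outstanding = request

    def on_pong(self, pong):
        """Return the outstanding request if the server behind this
        endpoint restarted (so the caller can resend it), else None."""
        endpoint, instance = pong["endpoint"], pong["instance"]
        previous = self.known.get(endpoint)
        self.known[endpoint] = instance
        if previous is not None and previous != instance:
            return self.outstanding
        return None


client = ClientSession()
primary = ServerInstance("tcp://primary:5555")    # hypothetical endpoint
client.on_pong(primary.pong())                    # learn the instance ID
client.send_request("long-running job")
primary = ServerInstance("tcp://primary:5555")    # quick crash and restart
print(client.on_pong(primary.pong()))             # request to resend
```

Whether the resend goes to the restarted primary or to the secondary is left to the caller; the sketch only shows that a changed ID is enough to notice the restart within one HB_INTERVAL.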

I considered Pub/Sub and one-way heartbeats, but neither would change this behavior, since the heartbeat messages would still flow after the restart.

I have the service set up to auto-recover on a crash, so it's more than just an edge case.

I'm curious if anyone has solved this quick server restart problem in a clean way with socket patterns?  Or if you have other suggestions?  Or if you have example code of ping/pong handling this case I'd love to see it.

-- Russell

Related threads:
Disconnects / Retry Logic - http://lists.zeromq.org/pipermail/zeromq-dev/2012-January/015024.html
Using a router with an identity issue - http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025206.html
