[zeromq-dev] Proposal for a revised zlist API

Goswin von Brederlow goswin-v-b at web.de
Fri Sep 12 16:58:49 CEST 2014

On Wed, Sep 10, 2014 at 06:42:18PM +0200, Pieter Hintjens wrote:
> Sorry, I've lost track of some of the details here. Could you
> re-describe your timeout queue starting from the problem so I can
> think about how I'd implement it?
> Thanks
> Pieter

The problem is that every connected host has a heartbeat timeout, and
every message sent or received has an ACK timeout. Most of the time a
timeout is canceled because there is activity, or because the message
gets ACKed in time. And this has to scale well: it must cope with
1000+ hosts and 1000000+ messages in flight. So there can be a lot of
timers and a lot of activity. It is important that creating, canceling
and restarting a timeout timer is fast.

The normal approach to timeouts is to have a priority queue based on a
heap or balanced tree. This makes creating, canceling and restarting a
timeout an O(log n) operation while finding the next timeout is O(1).
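To make the cost comparison concrete, here is a minimal sketch of the
classic heap-based approach (a tiny fixed-size binary min-heap of
deadlines; names and the HEAP_MAX limit are mine, purely illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEAP_MAX 1024

typedef struct {
    uint64_t item [HEAP_MAX];
    size_t size;
} heap_t;

//  Insert a deadline: sift up through at most log2 (size) levels,
//  hence O(log n).
static void
heap_push (heap_t *self, uint64_t deadline)
{
    size_t i = self->size++;
    self->item [i] = deadline;
    while (i > 0 && self->item [(i - 1) / 2] > self->item [i]) {
        uint64_t tmp = self->item [i];
        self->item [i] = self->item [(i - 1) / 2];
        self->item [(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

//  Next deadline to expire is always at the root: O(1).
static uint64_t
heap_peek (heap_t *self)
{
    return self->item [0];
}
```

Canceling an arbitrary timeout in such a heap also costs O(log n),
and first requires knowing its position, which is the part the bucket
scheme below sidesteps.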

My timeout queue uses a different approach, taken from the Linux
kernel's network code. Most timeouts get canceled early, so the idea
is to delay the work for each timeout as long as possible, in the hope
that it gets canceled and you never have to spend time on it at all.
For this I have an array of buckets (zring). Bucket N holds timeouts
that expire between 2^N and 2^(N+1) ticks from now. Every 2^N ticks,
all timeouts in bucket N are inspected and resorted into new buckets.
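Picking the bucket is just finding the position of the highest set bit
in the remaining ticks. A minimal sketch, assuming 64-bit tick counts
(the function name is mine, not CZMQ's):

```c
#include <assert.h>
#include <stdint.h>

//  Return the bucket index N such that 2^N <= ticks_from_now < 2^(N+1).
//  Timeouts due within 0 or 1 ticks land in bucket 0.
static int
timeout_bucket (uint64_t ticks_from_now)
{
    int n = 0;
    while (ticks_from_now >>= 1)
        n++;
    return n;
}
```

As the deadline approaches, the remaining tick count shrinks, so each
re-inspection computes a smaller bucket index and the timeout drifts
toward bucket 0.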

Over time, timeouts migrate from the highest bucket to the lowest and
then expire, unless they get canceled first. A timeout is also
inspected only a bounded number of times [iirc the number of buckets
+ 1], fewer if it is canceled. The overall work is therefore
comparable to a heap or sorted tree if all timeouts expire, but much
less if they get canceled early.
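One way to schedule the inspections (my reading of the scheme, not
necessarily how the zring code does it): at tick t, every bucket N
with t divisible by 2^N is due, so the highest due bucket is the
number of trailing zero bits in t. Bucket 0 is due every tick, bucket
1 every second tick, and so on:

```c
#include <assert.h>
#include <stdint.h>

//  Highest bucket index due for inspection at tick t, i.e. the
//  number of trailing zero bits in t. All buckets 0..result are due.
static int
highest_due_bucket (uint64_t t)
{
    int n = 0;
    while (t && (t & 1) == 0) {
        t >>= 1;
        n++;
    }
    return n;
}
```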

Creating a new timeout is trivial: I just compute which bucket it
belongs to and insert it into the zring. Canceling is harder, since
searching the zring for the right timeout would take too long. That's
where the node_t pointer comes in: it allows the timeout to remove
itself without searching.
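The self-removal trick is just doubly-linked-list unlinking: holding a
pointer to your own node lets you splice yourself out in O(1). A
minimal sketch with an intrusive ring node (names are illustrative,
not the actual zring API):

```c
#include <assert.h>

typedef struct node_t {
    struct node_t *prev;
    struct node_t *next;
} node_t;

//  Insert self into the ring, directly after node 'at'.
static void
node_insert_after (node_t *at, node_t *self)
{
    self->prev = at;
    self->next = at->next;
    at->next->prev = self;
    at->next = self;
}

//  Unlink self from its ring in O(1), with no search. Afterwards the
//  node points at itself, marking it detached.
static void
node_remove (node_t *self)
{
    self->prev->next = self->next;
    self->next->prev = self->prev;
    self->prev = self->next = self;
}
```

This is why canceling a timeout stays cheap no matter how many
timeouts are pending: the caller keeps the node_t pointer from
creation and hands it back to cancel.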

