[zeromq-dev] Exact matching on subscription topics

Staffan Gimåker staffan at spotify.com
Mon Jan 9 17:56:20 CET 2012


Hey guys,

Having exact topic matching built into zmq would be very useful for a
couple of reasons:

 * Hash maps are less gluttonous for memory than tries, and should be as
fast or faster.
 * Exact matching semantics provide a nice way to scale, more on that
below.

I did a quick and dirty throw-away prototype that supports both prefix
and exact matching:
https://github.com/gimaker/libzmq/tree/exact-matching-prototype (given
the lack of a hash map in C++03 and my laziness I used a trie for exact
subscriptions for now, quick and dirty!)

To make exact subscriptions/unsubscriptions you do:
  zmq_setsockopt(sock, ZMQ_SUBSCRIBE_EXACT, topic, topic_len); and
  zmq_setsockopt(sock, ZMQ_UNSUBSCRIBE_EXACT, topic, topic_len);

Prefix matched topics are added and removed as normal with ZMQ_SUBSCRIBE
and ZMQ_UNSUBSCRIBE respectively.

The cost of mixed prefix/exact matching is two lookups instead of one
(one for exact matching, one for prefix matching) but you can have fast
paths for when only one kind of matching is used, making the added cost
negligible unless you use a mix of prefix and exact matching.
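
A minimal sketch of the two-lookup scheme with fast paths, to make the cost
argument concrete. This is not the code from the prototype branch: the class
and method names are hypothetical, the hash set is the C++11
std::unordered_set, a linear scan over prefixes stands in for the trie, and
subscription reference counting is left out.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical subscription table, for illustration only.
class SubscriptionTable {
public:
    void subscribe_exact(const std::string& topic) { exact_.insert(topic); }
    void unsubscribe_exact(const std::string& topic) { exact_.erase(topic); }
    void subscribe_prefix(const std::string& prefix) { prefixes_.push_back(prefix); }

    // Returns true if the message topic matches any subscription.
    bool matches(const std::string& topic) const {
        // Fast paths: only one lookup when one kind of matching is unused.
        if (prefixes_.empty()) return exact_.count(topic) != 0;
        if (exact_.empty()) return matches_prefix(topic);
        // Mixed case: one exact lookup plus one prefix lookup.
        return exact_.count(topic) != 0 || matches_prefix(topic);
    }

private:
    bool matches_prefix(const std::string& topic) const {
        for (const std::string& p : prefixes_) {
            // True when topic starts with p; an empty prefix matches everything.
            if (topic.compare(0, p.size(), p) == 0) return true;
        }
        return false;
    }

    std::unordered_set<std::string> exact_;   // exact subscriptions
    std::vector<std::string> prefixes_;       // stand-in for the prefix trie
};
```

With only exact subscriptions present, matches() does a single hash lookup
and never touches the prefix structure, and vice versa.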

TL;DR: Our situation is this: we have a large number of active
subscriptions (in the range of tens of millions, and counting) that are
propagated to all publisher instances (of the same type), which results
in memory usage in the 10-20 GiB range per instance, just for storing
subscriptions, independent of how many machines we throw at the
problem. Thus, we need some way to shard subscription information.

Sharding subscription information is easy enough -- we're building a
small pubsub service that sits in between publishers and subscribers
with the sole purpose of sharding the subscription information. So, with
N machines each machine handles a fraction 1/N of all subscriptions.
Subscribers subscribe only to the token responsible for the topic and
publishers publish to all intermediary machines.

This should scale well with regard to memory, but less so with regard
to throughput and bandwidth as each intermediary machine still has to
process all published messages. Exact matching would allow us to publish
to a single token rather than all of them. There are also some potential
headaches with having pubsub across multiple data centers that I can
elaborate on if anyone is interested.
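
A rough sketch of the routing described above, under my own assumptions: the
function names are hypothetical, and FNV-1a is only an example of a hash
that every subscriber and publisher would compute identically (std::hash
gives no such guarantee across builds). With exact matching the publisher
can compute the same shard as the subscriber; with prefix matching it has
to send to all N intermediaries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a: a simple deterministic hash, chosen here for illustration.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (char c : s) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Hypothetical helper: index of the intermediary responsible for a topic.
std::size_t shard_for(const std::string& topic, std::size_t n) {
    assert(n > 0);
    return static_cast<std::size_t>(fnv1a(topic) % n);
}

// Intermediaries a publisher must send a message to.
std::vector<std::size_t> publish_targets(const std::string& topic,
                                         std::size_t n, bool exact_only) {
    std::vector<std::size_t> targets;
    if (exact_only) {
        // Exact matching: the subscription lives on exactly one shard.
        targets.push_back(shard_for(topic, n));
    } else {
        // Prefix matching: a matching subscription could be on any shard.
        for (std::size_t i = 0; i < n; ++i) targets.push_back(i);
    }
    return targets;
}
```

A subscriber would connect to intermediary shard_for(topic, N) and subscribe
there; the publisher side then sends one copy per message instead of N.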

Of course, the path of least resistance is to just disregard that zmq
uses prefix matching and treat it as if it were exact, but then we'd lose
out on the memory savings (and likely better performance) of using hash
maps. And there are probably cases where prefix matching would be useful
for us as well, although the bulk of our traffic is better suited for
exact matching.

So, is this something someone else would be interested in having in zmq?
And if we do it, what are the chances of it getting merged?

/S



