[zeromq-dev] Exact matching on subscription topics

Martin Sustrik sustrik at 250bpm.com
Mon Jan 9 22:30:18 CET 2012


Hi Staffan,

I think this kind of thing should be definitely built into 0MQ, however, 
we should take a look at big picture first.

The big picture being forwarding of subscriptions in large distribution 
trees.

What we need to do is to add the capability to add new matching 
algorithms without disrupting the existing infrastructure (change in 
wire formats etc.)

My idea was that subscription matching can be thought of as end-to-end 
feature (topic string is provided on topmost publisher, filtering is 
going in leaf subscribers) with subscription forwarding being an 
optional optimisation.

That way, new matching algorithms can be added and the only nodes that 
would have to be aware of them are terminal publishers and subscribers. 
Intermediate devices could handle unknown matching algorithms by simply 
passing all the messages on.

The actual wire format I've proposed for this is described here:

http://groups.google.com/group/sp-discuss-group/msg/22ee4d6e9f82857a

Maybe it would make sense to add this to 0MQ now, while 3-1 is still in 
beta.

Having this extra field in the wire format then makes it easy to add new 
matching algos like the one you've implemented. For now we could start 
with following list of algorithms:

0 - all (any message matches)
1 - exact
2 - prefix

What do you think?

See some more comments inlined:

> Having exact topic matching built into zmq would be very useful for a
> couple of reasons:
>
>   * Hash maps are less gluttonous for memory than tries, and should be as
> fast or faster.
>   * Exact matching semantics provides a nice a way to scale, more on that
> below.

Ack.

> I did a quick and dirty throw-away prototype that supports both prefix
> and exact matching:
> https://github.com/gimaker/libzmq/tree/exact-matching-prototype (given
> the lack of a hash map in C++03 and my laziness I used a trie for exact
> subscriptions for now, quick and dirty!)
>
> To make exact subscriptions/unsubscriptions you do:
>    zmq_setsockopt(sock, ZMQ_SUBSCRIBE_EXACT, topic, topic_len); and
>    zmq_setsockopt(sock, ZMQ_UNSUBSCRIBE_EXACT, topic, topic_len);
>
> Prefix matched topics are added and removed as normal with ZMQ_SUBSCRIBE
> and ZMQ_UNSUBSCRIBE respectively.
>
> The cost of mixed prefix/exact matching is two lookups instead of one
> (one for exact matching, one for prefix matching) but you can have fast
> paths for when only one kind of matching is used, making the added cost
> negligible unless you use a mix of prefix and exact matching.

Yes. The overhead is checking whether the number of prefix/exact 
subscriptions is zero, ie. couple of nanoseconds.

> TL;DR: Our situation is this: we have a large amount of active
> subscriptions (in the range of tens of millions, and counting) that are
> propagated to all publisher instances (of the same type), which results
> in memory usage in the 10-20 GiB range per instance, just for storing
> subscriptions, independent on the how many machines we throw at the
> problem. Thus, we need some way to shard subscription information.
>
> Sharding subscription information is easy enough -- we're building a
> small pubsub service that sits in between publishers and subscribers
> with the sole purpose of sharding the subscription information. So, with
> N machines each machine handles a fraction 1/N of all subscriptions.
> Subscribers subscribe only to the token responsible for the topic and
> publishers publish to all intermediary machines.

What's "token"? I've got a bit lost...

> This should scale well with regards to memory, but less so with regards
> to throughput and bandwidth as each intermediary machine still has to
> process all published messages. Exact matching would allow us to publish
> to a single token rather than all of them. There are also some potential
> headaches with having pubsub across multiple data centers that I can
> elaborate on if anyone is interested.

I would be extremely interested in that. Sounds like it's relevant to 
designing efficient large-scale pub-sub topologies, what's what I am 
focusing on myself.

> Of course, the path of least resistance is to just disregard that zmq
> uses prefix matching and treat it as it was exact, but then we'd lose
> out on the memory savings (and likely better performance) of using hash
> maps. And there are probably cases where prefix matching would be useful
> for us as well, although the bulk of our traffic is better suited for
> exact matching.

I think the discussion above about extensible set of matching algos 
applies here.

> So, is this something someone else would be interested in having in zmq?
> And if we do it, what are the chances of it getting merged?

Close to 100% I guess :)

Martin



More information about the zeromq-dev mailing list