[zeromq-dev] "Threadless" version

Martin Sustrik sustrik at 250bpm.com
Wed Jan 13 09:42:26 CET 2010


Hi Erik,

> I have some more comments regarding zeromq2. The code seems to be
> highly optimized for message throughput. As I understand it, the
> application thread puts stuff in the io thread's queue and vice versa.
> This is excellent for throughput as the io thread can keep on
> preparing messages for delivery to the application thread while the
> application thread is working and the io thread can wait for slow tcp
> receivers without blocking.

Right.

> The problem is that thread context switching is expensive. I did some
> testing on Solaris running the latency benchmark from zeromq2 git repo
> and got latencies around 35-40µs between two processes on the same
> machine. The loopback tcp latency on this machine is 10µs. So zeromq
> adds a significant overhead here. It's not hard to see why: the
> application to io thread signalling will add 5-10µs in latency.

Ack. On Linux the signaling latency is somewhat lower, but still 
significant.

> Maybe
> it's possible to add support for having the application thread itself
> write to the sockets and read from them. This would reduce latencies
> at the cost of throughput. For some applications this could be
> important.

Doing this in a generic way is quite complex. There was such 
functionality when we started with 0MQ three years ago, but later on, as 
other aspects of 0MQ got more complex, the feature was dropped in favour 
of simplicity.

In short, when a blocking recv() is called and no messages are 
available, the application thread has to ask the I/O thread to hand over 
its sockets. Once the sockets have been passed to the application 
thread, it can poll on them itself, get the message and send the sockets 
back to the I/O thread before returning the message.

Obviously, "initialising" the recv() function can take a significant 
amount of time (asking the I/O thread to hand over its sockets, passing 
the sockets to the app thread etc.). Still, once the message arrives it 
can be processed immediately, avoiding the latency impact of passing the 
message between threads. Thus, such an optimisation would help if the 
message load is low, but latency is the priority.
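
A rough sketch of the hand-off described above, in Python rather than in
0MQ's own code. All names here (io_thread, blocking_recv, the three
queues) are hypothetical and are not part of 0MQ. The sketch also assumes
a single socket and a short polling timeout in place of a proper
signalling mechanism.

```python
import queue
import select
import socket
import threading


def io_thread(sock, inbox, requests, handoff, stop):
    """Owns `sock` and queues incoming messages until asked to hand it over."""
    while not stop.is_set():
        try:
            requests.get_nowait()            # has the app thread asked for the socket?
        except queue.Empty:
            # Normal mode: poll the socket and pass messages via the queue.
            readable, _, _ = select.select([sock], [], [], 0.01)
            if readable:
                inbox.put(sock.recv(4096))
            continue
        returned = threading.Event()
        handoff.put((sock, returned))        # give the socket to the app thread
        returned.wait()                      # do not touch it until it comes back


def blocking_recv(inbox, requests, handoff):
    """Return a queued message if there is one, else borrow the socket and poll it."""
    try:
        return inbox.get_nowait()
    except queue.Empty:
        pass
    requests.put(None)                       # the slow "initialising" step
    sock, returned = handoff.get()
    try:
        try:
            return inbox.get_nowait()        # message queued during the hand-off
        except queue.Empty:
            pass
        select.select([sock], [], [])        # app thread polls the socket itself
        return sock.recv(4096)               # no thread hop once the message arrives
    finally:
        returned.set()                       # send the socket back to the I/O thread


def demo():
    peer, sock = socket.socketpair()
    inbox, requests, handoff = queue.Queue(), queue.Queue(), queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(
        target=io_thread, args=(sock, inbox, requests, handoff, stop))
    worker.start()
    threading.Timer(0.1, peer.send, args=(b"hello",)).start()
    message = blocking_recv(inbox, requests, handoff)
    stop.set()
    worker.join()
    peer.close()
    sock.close()
    return message
```

Calling demo() starts the I/O thread, delivers one message 100 ms later
and returns it from blocking_recv(). The second look at the inbox after
the hand-off covers the case where the I/O thread read a message just
before giving up the socket.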

> It could even be fine to disconnect any socket that returns
> -EAGAIN, ie slow receivers should pull their orders from the market
> anyway since they have stale data.

Yes. That's viable. It would make 0MQ just a thin synchronous wrapper on 
top of the underlying transport. The drawback would be decreased 
throughput, as no message batching can be done in synchronous mode and 
each message would have to be written to the underlying socket 
separately.
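
To make the "thin synchronous wrapper" idea concrete, here is a minimal
sketch using plain Python sockets, not the 0MQ API. The function name
publish is hypothetical. It assumes every subscriber socket has already
been put into non-blocking mode, and it treats a partial write the same
way as EAGAIN, which is a design choice of the sketch rather than
something stated in the thread.

```python
import socket


def publish(subscribers, payload):
    """Write `payload` to each subscriber from the calling thread.

    Sockets must be non-blocking. A subscriber whose kernel buffer is full
    (EAGAIN/EWOULDBLOCK) or that accepts only part of the payload is
    considered too slow and is disconnected. Returns the sockets that
    are still connected.
    """
    alive = []
    for sock in subscribers:
        try:
            sent = sock.send(payload)        # one write per message, no batching
        except BlockingIOError:              # EAGAIN: receiver is not keeping up
            sock.close()
            continue
        if sent < len(payload):              # partial write: also treated as slow
            sock.close()
            continue
        alive.append(sock)
    return alive
```

Each call costs one system call per subscriber per message, which is
where the throughput penalty mentioned above comes from.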

> 29West's solution is similar to zeromq, they seem to run a configurable
> amount of io threads in the background. But judging from some of the
> latency data they present (20µs end to end using openonload) it
> seems like they also have a possibility for the app thread to write to
> the sockets directly. They also have a drop slow TCP receivers mode
> which seems to corroborate this.

No idea what's under the covers of LBM; however, 20us is achievable even 
with a threaded design. Some time ago a few tests were done with 0MQ on 
top of a 10GbE/Open-MX/Linux stack, with latencies of ~15us.

> One reason I'm bringing this up is also because of our discussion on
> shared memory. Shared memory won't give much gains except for very
> large messages unless it's possible to bypass the io thread.

Right. With shmem avoiding I/O thread makes perfect sense as the only 
thing sent through the sockets are notifications about data available in 
the shared memory, i.e. 1 bit in the most optimised case.
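
A small sketch of that split between data and notification, using
Python's standard library. The layout (a 4-byte length header followed by
the payload) and the function names are assumptions made for the
example, and the notification is a whole byte because that is the
smallest unit a socket can carry. Both ends run in one process here to
keep it short; a real reader would attach to the segment by name from
another process.

```python
import socket
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<I")   # 4-byte payload length at the start of the segment


def write_message(shm, notify_sock, payload):
    """Place the payload in shared memory, then signal the reader."""
    HEADER.pack_into(shm.buf, 0, len(payload))
    shm.buf[HEADER.size:HEADER.size + len(payload)] = payload
    notify_sock.sendall(b"\x01")             # only the notification hits the socket


def read_message(shm, notify_sock):
    """Wait for the notification, then copy the payload out of shared memory."""
    notify_sock.recv(1)
    (length,) = HEADER.unpack_from(shm.buf, 0)
    return bytes(shm.buf[HEADER.size:HEADER.size + length])


def demo():
    shm = shared_memory.SharedMemory(create=True, size=4096)
    writer_sock, reader_sock = socket.socketpair()
    try:
        write_message(shm, writer_sock, b"order book update")
        return read_message(shm, reader_sock)
    finally:
        writer_sock.close()
        reader_sock.close()
        shm.close()
        shm.unlink()
```

The payload size does not affect what goes through the socket, so the
cost of the socket path stays constant while the message grows.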

It's a complex topic and it requires more design work IMO, so keep 
thinking about it and share your ideas. Btw, Jon is still working on the 
UNIX domain socket transport, so that would make an ideal starting point 
for experimenting with all kinds of IPC optimisation.

One more idea: Have you thought of moving your applications into 
different threads within a single process? In such a case you can use 
the inproc transport, which avoids I/O threads altogether. Latency 
should drop to the level of latency of sending a byte over socketpair.
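
To get a feeling for that baseline, the round trip of a single byte over
a socketpair between two threads can be timed roughly as below. This
measures plain sockets from Python, not the inproc transport. The
function names are made up for the example, and interpreter overhead
means the result is an upper bound on what C code would see.

```python
import socket
import threading
import time


def echo(sock, rounds):
    """Peer thread: send every received byte straight back."""
    for _ in range(rounds):
        sock.sendall(sock.recv(1))


def socketpair_roundtrip_us(rounds=1000):
    """Average round-trip time, in microseconds, of one byte between two threads."""
    a, b = socket.socketpair()
    peer = threading.Thread(target=echo, args=(b, rounds))
    peer.start()
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(b"x")
        a.recv(1)
    elapsed = time.perf_counter() - start
    peer.join()
    a.close()
    b.close()
    return elapsed / rounds * 1e6
```

Halving the returned value gives an approximate one-way figure to compare
against the loopback TCP latency quoted earlier in the thread.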

Martin


