[zeromq-dev] "Threadless" version
Martin Sustrik
sustrik at 250bpm.com
Wed Jan 13 09:42:26 CET 2010
Hi Erik,
> I have some more comments regarding zeromq2. The code seems to be
> highly optimized for message throughput. As I understand it, basically the
> application thread puts stuff in the io thread's queue and vice versa.
> This is excellent for throughput, as the io thread can keep on
> preparing messages for delivery to the application thread while the
> application thread is working, and the io thread can wait for slow tcp
> receivers without blocking.
Right.
> The problem is that thread context switching is expensive. I did some
> testing on Solaris running the latency benchmark from the zeromq2 git repo
> and got latencies around 35-40µs between two processes on the same
> machine. The loopback tcp latency on this machine is 10µs. So zeromq
> adds a significant overhead here. It's not hard to see why: the
> application to io thread signalling will add 5-10µs in latency.
Ack. On Linux the signaling latency is somewhat lower, but still
significant.
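For reference, that signalling cost can be eyeballed with a trivial
socketpair ping-pong between two threads. This is just a rough
illustration, not a 0MQ benchmark; the numbers depend heavily on the
kernel and CPU:

/* Rough measurement of inter-thread signalling cost: ping-pong a single
   byte over a socketpair and time the round trips. Illustration only. */
#include <stdio.h>
#include <pthread.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define ROUNDTRIPS 100000

static void *echo (void *arg)
{
    int fd = *(int *) arg;
    char c;
    for (int i = 0; i != ROUNDTRIPS; i++) {
        read (fd, &c, 1);
        write (fd, &c, 1);
    }
    return NULL;
}

int main (void)
{
    int sv [2];
    socketpair (AF_UNIX, SOCK_STREAM, 0, sv);

    pthread_t t;
    pthread_create (&t, NULL, echo, &sv [1]);

    struct timeval start, end;
    gettimeofday (&start, NULL);
    char c = 0;
    for (int i = 0; i != ROUNDTRIPS; i++) {
        write (sv [0], &c, 1);
        read (sv [0], &c, 1);
    }
    gettimeofday (&end, NULL);

    double us = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec);
    printf ("one-way signalling latency: ~%.2f us\n",
        us / ROUNDTRIPS / 2);

    pthread_join (t, NULL);
    return 0;
}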
> Maybe
> it's possible to add support for having the application thread itself
> write to the sockets and read from them. This would reduce latencies
> at the cost of throughput. For some applications this could be
> important.
Doing this in a generic way is quite complex. There was such functionality
when we started with 0MQ three years ago, but later on, as other
aspects of 0MQ got more complex, the feature was dropped in favour of
simplicity.
In short, when a blocking recv() is called and there are no messages
available, the application thread has to ask the I/O thread to pass it its
sockets. Once the sockets are handed over to the application thread it can
poll on them, get the message and send the sockets back to the
I/O thread before returning the message.
Obviously, such an "initialising" recv() call can take a significant amount
of time (asking the I/O thread to hand over its sockets, passing the sockets
to the app thread etc.). Still, once the message arrives it can be processed
immediately, avoiding the latency impact of passing the message between
threads. Thus, such an optimisation would help when the message load is
low but latency is the priority.
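To make the idea concrete, here is a rough C sketch of that flow. The two
hand-over helpers are purely hypothetical placeholders for whatever
mechanism the I/O thread would expose; 0MQ framing and error handling are
ignored:

/* Hypothetical sketch only -- not actual 0MQ code. */
#include <poll.h>
#include <unistd.h>
#include <sys/types.h>

int request_fd_from_io_thread (void);   /* assumed: blocks until the I/O
                                           thread relinquishes its socket */
void return_fd_to_io_thread (int fd);   /* assumed: hands the socket back */

ssize_t threadless_recv (void *buf, size_t len)
{
    /* 1. Take ownership of the underlying socket from the I/O thread. */
    int fd = request_fd_from_io_thread ();

    /* 2. Wait for data directly in the application thread. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    poll (&pfd, 1, -1);

    /* 3. Read the message without a thread hop. */
    ssize_t nbytes = read (fd, buf, len);

    /* 4. Hand the socket back before returning the message. */
    return_fd_to_io_thread (fd);
    return nbytes;
}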
> It could even be fine to disconnect any socket that returns
> -EAGAIN, i.e. slow receivers should pull their orders from the market
> anyway since they have stale data.
Yes. That's viable. It would make 0MQ just a thin synchronous wrapper on
top of the underlying transport. The drawback would be decreased
throughput, as no message batching can be done in synchronous mode and
each message would have to be written to the underlying socket separately.
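A rough illustration of the difference (0MQ framing and error handling
omitted): the asynchronous I/O thread can flush a whole queue of messages
in a single system call, while in synchronous mode each message costs its
own syscall.

#include <sys/uio.h>
#include <unistd.h>

/* Batched path (what the asynchronous I/O thread can do): one writev()
   for all messages currently sitting in the queue. */
void flush_batch (int fd, struct iovec *msgs, int nmsgs)
{
    writev (fd, msgs, nmsgs);          /* one syscall, many messages */
}

/* Synchronous path: the application thread writes each message as it is
   produced, paying the per-syscall cost every time. */
void send_sync (int fd, const void *msg, size_t len)
{
    write (fd, msg, len);              /* one syscall per message */
}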
> 29West's solution is similar to zeromq; they seem to run a configurable
> number of io threads in the background. But judging from some of the
> latency data they present (20µs end to end using OpenOnload) it
> seems like they also have a possibility for the app thread to write to
> the sockets directly. They also have a drop-slow-TCP-receivers mode,
> which seems to corroborate this.
No idea what's under the covers of LBM; however, 20µs is achievable even
with a threaded design. Some time ago a few tests were done with 0MQ on top
of a 10GbE/Open-MX/Linux stack with latencies of ~15µs.
> One reason I'm bringing this up is also because of our discussion on
> shared memory. Shared memory won't give much gain except for very
> large messages unless it's possible to bypass the io thread.
Right. With shmem, avoiding the I/O thread makes perfect sense, as the only
thing sent through the sockets is a notification that data is available in
the shared memory, i.e. 1 bit in the most optimised case.
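For example, the idea could be sketched like this with POSIX shared memory
and a socketpair used only for the notification. The segment name and size
are made up for the example:

/* Sketch only: payload lives in shared memory, the socket carries just a
   1-byte "data available" token. Both ends are in one process here so the
   snippet is self-contained. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#define SHM_NAME "/zmq_example_shm"     /* hypothetical segment name */
#define SHM_SIZE 4096

int main (void)
{
    int sv [2];
    socketpair (AF_UNIX, SOCK_STREAM, 0, sv);

    /* Message payload lives in shared memory... */
    int fd = shm_open (SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate (fd, SHM_SIZE);
    char *buf = mmap (NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    strcpy (buf, "hello");

    /* ...and only a 1-byte notification travels over the socket. */
    char token = 1;
    write (sv [0], &token, 1);

    /* Receiver side: wait for the token, then read straight from shmem;
       no copy of the payload ever passes through the socket. */
    read (sv [1], &token, 1);

    shm_unlink (SHM_NAME);
    return 0;
}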
It's a complex topic and it requires more design work IMO, so keep
thinking about it and share your ideas. Btw, Jon is still working on the UNIX
domain socket transport, so that would make an ideal starting point for
experimenting with all kinds of IPC optimisations.
One more idea: have you thought of moving your applications into
different threads within a single process? In that case you can use the
inproc transport, which avoids I/O threads altogether. Latency should drop
to the level of sending a byte over a socketpair.
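A minimal sketch of that setup, assuming the zmq_ctx_new/ZMQ_PAIR style of
the libzmq C API and two threads in one process:

#include <pthread.h>
#include <stdio.h>
#include <zmq.h>

static void *worker (void *ctx)
{
    /* Peer thread: connects over inproc and sends one message. */
    void *s = zmq_socket (ctx, ZMQ_PAIR);
    zmq_connect (s, "inproc://latency-test");
    zmq_send (s, "ping", 4, 0);
    zmq_close (s);
    return NULL;
}

int main (void)
{
    void *ctx = zmq_ctx_new ();

    /* Bind must happen before the peer connects on inproc. */
    void *s = zmq_socket (ctx, ZMQ_PAIR);
    zmq_bind (s, "inproc://latency-test");

    pthread_t t;
    pthread_create (&t, NULL, worker, ctx);

    char buf [16];
    int n = zmq_recv (s, buf, sizeof buf, 0);
    printf ("got %.*s\n", n, buf);

    pthread_join (t, NULL);
    zmq_close (s);
    zmq_ctx_term (ctx);
    return 0;
}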
Martin