[zeromq-dev] "Threadless" version
Dhammika Pathirana
dhammika at gmail.com
Wed Jan 13 11:06:55 CET 2010
Hi Martin,
Have we run zmq on Xeon Nehalem CPUs?
The following Intel benchmarks are impressive:
http://www.vyatta.com/downloads/whitepapers/Intel_Router_solBrief_r04.pdf
http://www.nyse.com/pdfs/Data-Fabric-Intel-Product-Sheet.pdf
On 1/13/10, Martin Sustrik <sustrik at 250bpm.com> wrote:
> Hi Erik,
>
>
> > I have some more comments regarding zeromq2. The code seems to be
> > highly optimized for message throughput. As I understand it, the
> > application thread basically puts messages in the I/O thread's queue
> > and vice versa. This is excellent for throughput, as the I/O thread
> > can keep preparing messages for delivery to the application thread
> > while the application thread is working, and the I/O thread can wait
> > for slow TCP receivers without blocking.
>
>
> Right.
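>
> The pattern, stripped to its bones, is a producer/consumer queue
> between the application thread and the I/O thread. A toy mutex-based
> sketch of that shape (0MQ's real pipes are lock-free, so this shows
> the idea only, not the implementation):
>
>     /* App thread enqueues; I/O thread drains the queue and does
>        the actual socket I/O. Toy code, not from the 0MQ tree. */
>     #include <pthread.h>
>
>     typedef struct node {void *msg; struct node *next;} node_t;
>
>     static node_t *head, *tail;
>     static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>     static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
>
>     /* Called from the application thread: hand the message over. */
>     void app_send (node_t *n)
>     {
>         pthread_mutex_lock (&lock);
>         n->next = NULL;
>         if (tail) tail->next = n; else head = n;
>         tail = n;
>         pthread_cond_signal (&nonempty);   /* wake the I/O thread */
>         pthread_mutex_unlock (&lock);
>     }
>
>     /* Called from the I/O thread: block until there's work. */
>     node_t *io_dequeue (void)
>     {
>         pthread_mutex_lock (&lock);
>         while (!head)
>             pthread_cond_wait (&nonempty, &lock);
>         node_t *n = head;
>         head = n->next;
>         if (!head) tail = NULL;
>         pthread_mutex_unlock (&lock);
>         return n;   /* the I/O thread writes it to the socket */
>     }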
>
>
> > The problem is that thread context switching is expensive. I did some
> > testing on Solaris running the latency benchmark from the zeromq2 git
> > repo and got latencies around 35-40µs between two processes on the
> > same machine. The loopback TCP latency on this machine is 10µs, so
> > zeromq adds a significant overhead here. It's not hard to see why:
> > the application-to-I/O-thread signalling will add 5-10µs of latency.
>
>
> Ack. On Linux the signaling latency is somewhat lower, but still
> significant.
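>
> A quick way to put a number on the signalling cost is to ping-pong a
> single byte over a socketpair between two threads and halve the mean
> round-trip time. A minimal sketch (plain POSIX, not 0MQ code):
>
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <sys/socket.h>
>     #include <sys/time.h>
>     #include <unistd.h>
>
>     #define ROUNDS 100000
>
>     static int fds[2];
>
>     /* Peer thread: echo each byte straight back. */
>     static void *echo (void *arg)
>     {
>         char c;
>         for (int i = 0; i < ROUNDS; i++) {
>             read (fds[1], &c, 1);
>             write (fds[1], &c, 1);
>         }
>         return NULL;
>     }
>
>     int main (void)
>     {
>         socketpair (AF_UNIX, SOCK_STREAM, 0, fds);
>         pthread_t t;
>         pthread_create (&t, NULL, echo, NULL);
>
>         struct timeval start, end;
>         char c = 0;
>         gettimeofday (&start, NULL);
>         for (int i = 0; i < ROUNDS; i++) {
>             write (fds[0], &c, 1);      /* signal the peer */
>             read (fds[0], &c, 1);       /* wait for the echo */
>         }
>         gettimeofday (&end, NULL);
>         pthread_join (t, NULL);
>
>         double us = (end.tv_sec - start.tv_sec) * 1e6 +
>                     (end.tv_usec - start.tv_usec);
>         printf ("one-way signalling: ~%.2f us\n",
>             us / ROUNDS / 2);
>         return 0;
>     }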
>
>
> > Maybe it's possible to add support for having the application thread
> > itself write to the sockets and read from them. This would reduce
> > latency at the cost of throughput. For some applications this could
> > be important.
>
>
> Doing this in a generic way is quite complex. There was such
> functionality when we started with 0MQ three years ago, but later on,
> as other aspects of 0MQ got more complex, the feature was dropped in
> favour of simplicity.
>
> In short, when a blocking recv() is called and there are no messages
> available, the application thread has to ask the I/O thread to hand
> over its sockets. Once the sockets are in the application thread's
> hands, it can poll on them, get the message and send the sockets back
> to the I/O thread before returning the message.
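>
> In pseudo-C the handover would look roughly like this (all the helper
> names are invented for illustration; nothing like this exists in the
> current tree):
>
>     /* Hypothetical low-latency recv() path. */
>     msg_t *recv_lowlat (socket_t *s)
>     {
>         msg_t *msg = try_dequeue (s);   /* fast path: queued message */
>         if (msg)
>             return msg;
>
>         /* Slow path: borrow the raw fds from the I/O thread... */
>         fdset_t fds = request_sockets_from_io_thread (s);
>
>         /* ...poll and read them in the application thread itself... */
>         msg = poll_and_read (&fds);
>
>         /* ...and hand them back before returning the message. */
>         return_sockets_to_io_thread (s, &fds);
>         return msg;
>     }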
>
> Obviously, such an "initialising" recv() call can take a significant
> amount of time (asking the I/O thread to hand over its sockets,
> passing the sockets to the app thread etc.). Still, once the message
> arrives it can be processed immediately, avoiding the latency impact
> of passing the message between threads. Thus, such an optimisation
> would help when the message load is low but latency is the priority.
>
>
> > It could even be fine to disconnect any socket that returns
> > -EAGAIN, i.e. slow receivers should pull their orders from the
> > market anyway since they have stale data.
>
>
> Yes. That's viable. It would make 0MQ just a thin synchronous wrapper
> on top of the underlying transport. The drawback would be decreased
> throughput, as no message batching can be done in synchronous mode and
> each message would have to be written to the underlying socket
> separately.
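>
> To make the trade-off concrete: the batched path can flush many queued
> messages with a single syscall, while the synchronous path pays one
> syscall per message. A sketch in plain POSIX, just to show the shape:
>
>     #include <sys/uio.h>
>     #include <unistd.h>
>
>     /* Batched mode: one writev() flushes N queued messages. */
>     ssize_t flush_batch (int fd, struct iovec *msgs, int count)
>     {
>         return writev (fd, msgs, count);  /* 1 syscall, N messages */
>     }
>
>     /* Synchronous mode: every message is its own syscall. */
>     ssize_t send_sync (int fd, const void *buf, size_t len)
>     {
>         return write (fd, buf, len);      /* 1 syscall per message */
>     }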
>
>
> > 29West's solution is similar to zeromq; they seem to run a
> > configurable number of I/O threads in the background. But judging
> > from some of the latency data they present (20µs end to end using
> > OpenOnload), it seems they also have a way for the app thread to
> > write to the sockets directly. They also have a mode that drops slow
> > TCP receivers, which seems to corroborate this.
>
>
> No idea what's under the covers of LBM; however, 20µs is achievable
> even with a threaded design. Some time ago a few tests were done with
> 0MQ on top of a 10GbE/Open-MX/Linux stack, with latencies of ~15µs.
>
>
> > One reason I'm bringing this up is also because of our discussion on
> > shared memory. Shared memory won't give much gain except for very
> > large messages unless it's possible to bypass the I/O thread.
>
>
> Right. With shmem, avoiding the I/O thread makes perfect sense, as the
> only thing sent through the sockets is notifications about data
> available in the shared memory, i.e. 1 bit in the most optimised case.
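>
> The notification itself can be a single byte acting as a doorbell.
> A sketch, assuming some shared-memory ring buffer (shm_ring_write()
> and shm_ring_read() are invented names):
>
>     #include <stddef.h>
>     #include <unistd.h>
>
>     /* Writer: put the payload in the shared-memory ring, then
>        wake the reader with a 1-byte doorbell. */
>     void shm_send (int doorbell_fd, const void *data, size_t len)
>     {
>         shm_ring_write (data, len);      /* hypothetical ring buffer */
>         char wake = 1;
>         write (doorbell_fd, &wake, 1);   /* the minimal notification */
>     }
>
>     /* Reader: block on the doorbell, then drain the ring. */
>     size_t shm_recv (int doorbell_fd, void *buf, size_t cap)
>     {
>         char wake;
>         read (doorbell_fd, &wake, 1);
>         return shm_ring_read (buf, cap); /* hypothetical ring buffer */
>     }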
>
> It's a complex topic and it requires more design work IMO, so keep
> thinking about it and share your ideas. Btw, Jon is still working on
> the UNIX domain socket transport, so that would make an ideal starting
> point for experimenting with all kinds of IPC optimisations.
>
> One more idea: have you thought of moving your applications into
> different threads within a single process? In that case you can use
> the inproc transport, which avoids I/O threads altogether. Latency
> should drop to the level of sending a byte over a socketpair.
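>
> A minimal sketch of that setup, two threads echoing a message over
> inproc (using the zmq_init() signature from the current 2.0 tree;
> details may differ in your checkout):
>
>     #include <pthread.h>
>     #include <string.h>
>     #include <zmq.h>
>
>     static void *worker (void *ctx)
>     {
>         void *s = zmq_socket (ctx, ZMQ_REP);
>         zmq_connect (s, "inproc://lowlat");
>
>         zmq_msg_t msg;
>         zmq_msg_init (&msg);
>         zmq_recv (s, &msg, 0);          /* arrives without I/O thread */
>         zmq_send (s, &msg, 0);          /* echo it straight back */
>         zmq_msg_close (&msg);
>         zmq_close (s);
>         return NULL;
>     }
>
>     int main (void)
>     {
>         /* 2 app threads, 1 I/O thread (idle on the inproc path). */
>         void *ctx = zmq_init (2, 1, 0);
>
>         /* inproc endpoints must be bound before they are connected. */
>         void *s = zmq_socket (ctx, ZMQ_REQ);
>         zmq_bind (s, "inproc://lowlat");
>
>         pthread_t t;
>         pthread_create (&t, NULL, worker, ctx);
>
>         zmq_msg_t msg;
>         zmq_msg_init_size (&msg, 5);
>         memcpy (zmq_msg_data (&msg), "hello", 5);
>         zmq_send (s, &msg, 0);
>         zmq_recv (s, &msg, 0);          /* full round trip in-process */
>         zmq_msg_close (&msg);
>
>         pthread_join (t, NULL);
>         zmq_close (s);
>         zmq_term (ctx);
>         return 0;
>     }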
>
>
> Martin
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>