[zeromq-dev] "Threadless" version

Dhammika Pathirana dhammika at gmail.com
Wed Jan 13 11:06:55 CET 2010


Hi Martin,

Have we run zmq on Xeon Nehalem CPUs?

The following Intel benchmarks are impressive:
http://www.vyatta.com/downloads/whitepapers/Intel_Router_solBrief_r04.pdf
http://www.nyse.com/pdfs/Data-Fabric-Intel-Product-Sheet.pdf



On 1/13/10, Martin Sustrik <sustrik at 250bpm.com> wrote:
> Hi Erik,
>
>
>  > I have some more comments regarding zeromq2. The code seems to be
>  > highly optimized for message throughput. As I understand it, the
>  > application thread basically puts messages into the I/O thread's queue
>  > and vice versa. This is excellent for throughput, as the I/O thread can
>  > keep preparing messages for delivery to the application thread while
>  > the application thread is working, and the I/O thread can wait for slow
>  > TCP receivers without blocking it.
>
>
> Right.
>
>
>  > The problem is that thread context switching is expensive. I did some
>  > testing on Solaris running the latency benchmark from the zeromq2 git
>  > repo and got latencies around 35-40µs between two processes on the same
>  > machine. The loopback TCP latency on this machine is 10µs, so zeromq
>  > adds a significant overhead here. It's not hard to see why: the
>  > application-to-I/O-thread signalling alone will add 5-10µs of latency.
>
>
> Ack. On Linux the signaling latency is somewhat lower, but still
>  significant.
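
As a point of reference, that signalling cost is easy to measure outside
of 0MQ. Below is a minimal sketch (plain POSIX and pthreads, nothing
0MQ-specific; numbers will vary from box to box) that ping-pongs a single
byte over a socketpair between two threads. Half of the reported round
trip is roughly the one-way cost of waking up the other thread:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define ROUNDTRIPS 100000

static int sv [2];   /* sv[0]: "application" side, sv[1]: "I/O" side */

/* Echo thread: plays the role of the I/O thread, bouncing the byte back. */
static void *echo (void *arg)
{
    char c;
    for (int i = 0; i != ROUNDTRIPS; i++)
        if (read (sv [1], &c, 1) != 1 || write (sv [1], &c, 1) != 1)
            abort ();
    return NULL;
}

int main (void)
{
    if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv))
        abort ();

    pthread_t t;
    pthread_create (&t, NULL, echo, NULL);

    struct timeval start, end;
    gettimeofday (&start, NULL);

    char c = 0;
    for (int i = 0; i != ROUNDTRIPS; i++)
        if (write (sv [0], &c, 1) != 1 || read (sv [0], &c, 1) != 1)
            abort ();

    gettimeofday (&end, NULL);
    pthread_join (t, NULL);

    double us = (end.tv_sec - start.tv_sec) * 1e6 +
        (end.tv_usec - start.tv_usec);
    printf ("average round trip: %.2f us\n", us / ROUNDTRIPS);
    return 0;
}

Compile with something like "gcc -O2 signal_latency.c -lpthread" (file name
is arbitrary) and compare the result with the figures above.
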
>
>
>  > Maybe
>  > it's possible to add support for having the application thread itself
>  > write to the sockets and read from them. This would reduce latencies
>  > at the cost of throughput. For some applications this could be
>  > important.
>
>
> Doing this in a generic way is quite complex. There was such
>  functionality when we started with 0MQ three years ago, but later on,
>  as other aspects of 0MQ got more complex, the feature was dropped in
>  favour of simplicity.
>
>  In short, when a blocking recv() is called and there are no messages
>  available, the application thread has to ask the I/O thread to pass it
>  its sockets. Once the sockets are handed over, the application thread
>  can poll on them, get the message and send the sockets back to the
>  I/O thread before returning the message.
>
>  Obviously, "initialising" the recv() call can take a significant amount
>  of time (asking the I/O thread to hand over its sockets, passing the
>  sockets to the app thread etc.). Still, once a message arrives it can be
>  processed immediately, avoiding the latency impact of passing the
>  message between threads. Thus, such an optimisation would help when the
>  message load is low but latency is the priority.
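
To make the handover idea concrete, here is a rough sketch of that
handshake, with the network socket simulated by a pipe and the queueing of
the normal asynchronous path left out. This is not how 0MQ is implemented,
and all the names are made up for the sketch; it only shows the shape of
the protocol described above: the application thread interrupts the I/O
thread's poll loop, borrows the descriptor, polls it directly, and hands
it back afterwards.

#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int data_fd;          /* "network" socket, simulated by a pipe */
static int wakeup_fd [2];    /* used by the app thread to interrupt poll () */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int io_owns_socket = 1;   /* who may poll data_fd at the moment */

/* I/O thread: normally polls data_fd; on request it lends the socket to
   the application thread and parks until the socket is handed back. */
static void *io_thread (void *arg)
{
    struct pollfd items [2] = {
        {.fd = data_fd, .events = POLLIN},
        {.fd = wakeup_fd [0], .events = POLLIN}
    };
    for (;;) {
        poll (items, 2, -1);
        if (items [1].revents & POLLIN) {
            char c;
            read (wakeup_fd [0], &c, 1);
            pthread_mutex_lock (&lock);
            io_owns_socket = 0;              /* hand the socket over */
            pthread_cond_signal (&cond);
            while (!io_owns_socket)          /* wait until it's returned */
                pthread_cond_wait (&cond, &lock);
            pthread_mutex_unlock (&lock);
        }
        else if (items [0].revents & POLLIN) {
            /* Normal asynchronous path: read the message and pass it to
               the application thread via a queue (omitted in this sketch). */
            char buf [64];
            read (data_fd, buf, sizeof buf);
        }
    }
    return NULL;
}

/* "Low-latency recv": borrow the socket, poll it directly, give it back. */
static ssize_t direct_recv (char *buf, size_t len)
{
    write (wakeup_fd [1], "x", 1);           /* ask for the socket */
    pthread_mutex_lock (&lock);
    while (io_owns_socket)
        pthread_cond_wait (&cond, &lock);
    pthread_mutex_unlock (&lock);

    struct pollfd item = {.fd = data_fd, .events = POLLIN};
    poll (&item, 1, -1);                     /* no thread hop on this path */
    ssize_t n = read (data_fd, buf, len);

    pthread_mutex_lock (&lock);
    io_owns_socket = 1;                      /* return the socket */
    pthread_cond_signal (&cond);
    pthread_mutex_unlock (&lock);
    return n;
}

/* Simulated remote peer: sends one message after a short delay. */
static void *peer (void *arg)
{
    sleep (1);
    write (*(int *) arg, "hello", 5);
    return NULL;
}

int main (void)
{
    int data_pipe [2];
    pipe (data_pipe);
    data_fd = data_pipe [0];
    pipe (wakeup_fd);

    pthread_t io, px;
    pthread_create (&io, NULL, io_thread, NULL);
    pthread_create (&px, NULL, peer, &data_pipe [1]);

    char buf [64];
    printf ("got %zd bytes\n", direct_recv (buf, sizeof buf));
    return 0;
}

As the text says, the expensive part is the handshake at the start of
direct_recv (); once the descriptor is borrowed, the message is read with
no extra thread hop.
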
>
>
>  > It could even be fine to disconnect any socket that returns
>  > -EAGAIN, i.e. slow receivers should pull their orders from the market
>  > anyway since they have stale data.
>
>
> Yes. That's viable. It would make 0MQ just a thin synchronous wrapper on
>  top of the underlying transport. The drawback would be decreased
>  throughput, as no message batching can be done in synchronous mode and
>  each message would have to be written to the underlying socket
>  separately.
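
The batching point is easy to see at the system-call level: an I/O thread
can drain its queue and hand many pending messages to the kernel in a
single call, while a synchronous wrapper pays one call per message. A
rough sketch of the difference (plain write ()/writev (), nothing
0MQ-specific, function names made up):

#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH 10

/* Synchronous mode: one system call per message (and, with small messages,
   often one TCP packet per message as well). */
static void send_unbatched (int fd, const char *msg, size_t len, int count)
{
    for (int i = 0; i != count; i++)
        write (fd, msg, len);
}

/* Asynchronous mode: an I/O thread can coalesce the pending messages in
   its queue and push them to the kernel with a single writev () call. */
static void send_batched (int fd, const char *msg, size_t len, int count)
{
    struct iovec iov [BATCH];
    while (count > 0) {
        int n = count < BATCH ? count : BATCH;
        for (int i = 0; i != n; i++) {
            iov [i].iov_base = (void *) msg;
            iov [i].iov_len = len;
        }
        writev (fd, iov, n);
        count -= n;
    }
}

int main (void)
{
    int sv [2];
    socketpair (AF_UNIX, SOCK_STREAM, 0, sv);
    send_unbatched (sv [0], "msg", 3, 100);
    send_batched (sv [0], "msg", 3, 100);
    return 0;
}
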
>
>
>  > 29West's solution is similar to zeromq: they seem to run a
>  > configurable number of I/O threads in the background. But judging from
>  > some of the latency data they present (20µs end to end using
>  > OpenOnload) it seems like they also allow the app thread to write to
>  > the sockets directly. They also have a "drop slow TCP receivers" mode
>  > which seems to corroborate this.
>
>
> No idea what's under the covers of LBM; however, 20µs is achievable even
>  with a threaded design. Some time ago a few tests were done with 0MQ on
>  top of a 10GbE/Open-MX/Linux stack with latencies of ~15µs.
>
>
>  > One reason I'm bringing this up is also because of our discussion on
>  > shared memory. Shared memory won't give much gain except for very
>  > large messages unless it's possible to bypass the I/O thread.
>
>
> Right. With shmem, avoiding the I/O thread makes perfect sense, as the
>  only thing sent through the sockets is a notification that data is
>  available in the shared memory, i.e. 1 bit in the most optimised case.
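
A minimal sketch of that pattern (plain POSIX, not tied to any existing
0MQ transport): the payload lives in a shared mapping and the socket
carries nothing but a one-byte doorbell (a byte rather than a bit, but the
idea is the same):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main (void)
{
    /* Memory shared between the two processes: the payload travels here,
       never through the socket. */
    char *shm = mmap (NULL, 4096, PROT_READ | PROT_WRITE,
        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    /* The socketpair carries nothing but the "data is ready" doorbell. */
    int sv [2];
    socketpair (AF_UNIX, SOCK_STREAM, 0, sv);

    if (fork () == 0) {
        /* Producer: write the message into shared memory, then wake the
           consumer with a single byte. */
        strcpy (shm, "message in shared memory");
        char doorbell = 1;
        write (sv [1], &doorbell, 1);
        _exit (0);
    }

    /* Consumer: block on the doorbell, then read the payload directly from
       shared memory - no copy through the kernel socket buffers. */
    char doorbell;
    read (sv [0], &doorbell, 1);
    printf ("consumer got: %s\n", shm);
    wait (NULL);
    return 0;
}
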
>
>  It's a complex topic and it requires more design work IMO, so keep
>  thinking about it and share your ideas. Btw, Jon is still working on the
>  UNIX domain socket transport, so that would make an ideal starting point
>  for experimenting with all kinds of IPC optimisation.
>
>  One more idea: have you thought of moving your applications into
>  different threads within a single process? In that case you can use the
>  inproc transport, which avoids I/O threads altogether. Latency should
>  drop to the level of sending a byte over a socketpair.
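
For the flavour of it, here is a toy sketch of passing a message between
two threads in one process with no I/O thread and no socket at all. It is
not 0MQ's inproc implementation (the mailbox name and helpers are made up
for the example); it only shows why the in-process case can be so cheap:
only a pointer changes hands.

#include <pthread.h>
#include <stdio.h>

/* A one-slot mailbox shared by two application threads. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static const char *slot = NULL;

static void mailbox_send (const char *msg)
{
    pthread_mutex_lock (&lock);
    slot = msg;                   /* pass a pointer, no copy of the data */
    pthread_cond_signal (&cond);
    pthread_mutex_unlock (&lock);
}

static const char *mailbox_recv (void)
{
    pthread_mutex_lock (&lock);
    while (!slot)
        pthread_cond_wait (&cond, &lock);
    const char *msg = slot;
    slot = NULL;
    pthread_mutex_unlock (&lock);
    return msg;
}

static void *worker (void *arg)
{
    printf ("worker got: %s\n", mailbox_recv ());
    return NULL;
}

int main (void)
{
    pthread_t t;
    pthread_create (&t, NULL, worker, NULL);
    mailbox_send ("hello from the main thread");
    pthread_join (t, NULL);
    return 0;
}
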
>
>
>  Martin
>
> _______________________________________________
>  zeromq-dev mailing list
>  zeromq-dev at lists.zeromq.org
>  http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>


