[zeromq-dev] How to write high-perf messaging

Martin Sustrik sustrik at fastmq.com
Sun Jan 11 00:03:51 CET 2009


Hi all,

Thinking about the discussion with Jannes w.r.t. splice batching in the
Linux kernel, I've realised that some may appreciate a brief explanation
of how high messaging performance is achieved, so that they can
experiment with the code and settings and try to squeeze as much
performance from 0MQ as possible.

There are two diagrams attached. The first one shows how sending messages
is done, the second one illustrates message receiving. Both diagrams show
a messaging stack composed of network (NW), network interface card (NIC),
operating system (OS), messaging system (0MQ) and application (App). One
can devise more fine-grained diagrams, however, for now we'll do with
these 5 layers.

The 'sending' slide shows 3 different sending strategies. The leftmost
strategy ('no batching') is pretty straightforward. The application sends
a message, the messaging layer forwards it to the operating system, which
in turn passes it to the networking hardware. The obvious problem is that
every single message pays for its own trip up & down the stack, and that
overhead adds up. Thus we shouldn't expect this kind of solution to be
very efficient.

The second strategy is the well-known Nagle's algorithm. In this case,
instead of sending each message as a separate network packet, the OS
waits for a while, aggregates all the messages that were sent by the app
in the meantime and flushes them to the network as a single packet. The
problem with this approach is that a small message can be held back in
the OS (until earlier data is acknowledged or a full packet accumulates)
even if no further messages follow it, which obviously hurts latency.
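
For those who want to experiment with this layer: on a plain TCP socket,
Nagle's algorithm can be switched off with the standard TCP_NODELAY
option. Below is a minimal Python sketch. The helper name is made up for
illustration, and it says nothing about what 0MQ itself does internally.

```python
import socket

def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks the OS to send small writes immediately
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

sock = make_nodelay_socket()
# Read the option back to confirm the OS accepted it.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

With Nagle off, aggregation has to happen above the OS, otherwise you are
back to the 'no batching' case.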

The third strategy is the one currently used in 0MQ. Messages are batched
in 0MQ and the whole batch is passed to the OS using a single system call.
Moreover, there is no timeout. The principle is: Send all the messages
available to the OS - even if it's only a single message - and never wait
for more messages to arrive. Thus, if OS & network are able to keep up
with the message publishing rate, messages are passed as fast as possible
(no batching, lowest possible latency). However, if OS & network aren't
able to keep up with the message publishing rate, messages are batched
(high throughput, somewhat higher latency).
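
The "send whatever is available, never wait" principle can be sketched in
a few lines of Python. This is only an illustration, not 0MQ's actual
code: flush_available, outbox and the send callback are hypothetical
names, and send stands in for the single system call. Message framing is
left out to keep it short.

```python
import queue

def flush_available(outbox, send):
    """Drain whatever is queued right now and pass it to `send` in one call.

    Never waits for more messages: returns 0 at once if nothing is queued.
    """
    batch = []
    while True:
        try:
            batch.append(outbox.get_nowait())  # non-blocking, no timeout
        except queue.Empty:
            break
    if batch:
        send(b"".join(batch))  # one "system call" for the whole batch
    return len(batch)

calls = []  # records each payload handed to the stand-in OS layer

outbox = queue.Queue()
outbox.put(b"a")
flush_available(outbox, calls.append)   # network keeps up: batch of one

for m in (b"b", b"c", b"d"):            # sender got ahead of the network
    outbox.put(m)
flush_available(outbox, calls.append)   # three messages, still one call
```

The batch size is never configured anywhere. It falls out of how far the
sender has got ahead of the layers below it.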

Now let's have a look at the receiving side. The leftmost strategy is the
simplest possible one. As packets arrive they are passed to the OS. When
the application asks for a new message, the messaging system retrieves
just enough data from the OS for that one message and passes the message
to the application. Simple, but not very efficient.

The second strategy is the one used in 0MQ. Packets arrive from the
network and are passed to the OS. When the application asks for a new
message, the messaging system gets as much data as possible from the OS
and returns a message to the application. When the application asks for
the next message, chances are good that the data were already retrieved
from the OS and thus the operation is extremely cheap.
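
A rough Python sketch of this receiving strategy follows. The class name,
the read callback (standing in for the OS read call) and the 4-byte
length-prefix framing are all assumptions made for the example. This is
not 0MQ's wire format.

```python
import io
import struct

class BatchedReader:
    """Length-prefixed message reader that over-reads from the OS (sketch)."""

    def __init__(self, read, chunk=65536):
        self._read = read      # callable: read(n) -> up to n bytes
        self._chunk = chunk
        self._buf = b""

    def _take(self):
        # Return one complete message from the buffer, or None if incomplete.
        if len(self._buf) < 4:
            return None
        (size,) = struct.unpack("!I", self._buf[:4])
        if len(self._buf) < 4 + size:
            return None
        msg = self._buf[4:4 + size]
        self._buf = self._buf[4 + size:]
        return msg

    def recv(self):
        while True:
            msg = self._take()
            if msg is not None:
                return msg                  # served from the buffer, no OS call
            data = self._read(self._chunk)  # grab as much as is available
            if not data:
                raise EOFError("stream closed")
            self._buf += data

def frame(payload):
    # Hypothetical framing: 4-byte big-endian length, then the payload.
    return struct.pack("!I", len(payload)) + payload

stream = io.BytesIO(frame(b"one") + frame(b"two") + frame(b"three"))
read_sizes = []  # one entry per call into the stand-in OS layer

def counting_read(n):
    read_sizes.append(n)
    return stream.read(n)

reader = BatchedReader(counting_read)
received = [reader.recv(), reader.recv(), reader.recv()]
```

Here three messages are delivered to the application, but the stand-in OS
layer is called only once. The second and third recv() just slice the
buffer.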

The last strategy shows what happens when you have interrupt coalescing
turned on in your network interface card. The card waits for a certain
amount of time and reports several arrived packets using a single
interrupt. With small packets this eliminates a lot of stack traversal,
however, the tradeoff is much worse latency because of the timeout
involved. Thus we recommend switching interrupt coalescing off when
using 0MQ.

In summary, to get the best throughput & latency follow these four basic rules:

1. Avoid timeouts on all layers of the stack.
2. Batch data as soon as possible.
3. Unbatch the data as late as possible.
4. Don't re-batch the data that were already batched.

Hope this gives an idea of how to tune individual layers of the stack to 
get the best performance.

Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: receiving.png
Type: image/png
Size: 20173 bytes
Desc: not available
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20090111/64ca29fd/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sending.png
Type: image/png
Size: 14703 bytes
Desc: not available
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20090111/64ca29fd/attachment-0001.png>

