[zeromq-dev] How to write high-perf messaging
Martin Sustrik
sustrik at fastmq.com
Sun Jan 11 00:03:51 CET 2009
Hi all,
Thinking about the discussion with Jannes w.r.t. splice batching in the Linux
kernel, I've realised that some may appreciate a brief explanation of how
high messaging performance is achieved, so that they can experiment with
the code and settings and try to squeeze as much performance from 0MQ as
possible.
There are two diagrams attached. First one shows how sending messages is
done, second one illustrates message receiving. Both diagrams show
messaging stack composed of network (NW), network interface card (NIC),
operating system (OS), messaging system (0MQ) and application (App). One
can devise more fine-grained diagrams, however, for now we'll do with
these 5 layers.
The 'sending' slide shows 3 different sending strategies. The leftmost
strategy ('no batching') is pretty straightforward. Application sends a
message, messaging layer forwards it to the operating system, which in
its turn passes it to the networking hardware. The obvious problem is
that too much overhead is involved in moving repeatedly up & down the
stack. Thus we shouldn't expect this kind of solution to be very efficient.
Second strategy is well-known Nagle's algorithm. In this case instead of
sending each message as a separate network packet OS waits for a while,
aggregates all the messages that were sent by the app in the meantime
and flushes them to the network as a single packet. The problem with
this approach is that a small message can sit in the OS until previously
sent data are acknowledged (in practice often until a delayed-ACK timer
on the peer expires) before it is physically sent, which obviously
hurts latency.
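For those who want to experiment: on TCP sockets Nagle's algorithm can be
switched off per socket with the standard TCP_NODELAY option. A minimal
sketch in Python follows. The helper name make_low_latency_socket is made up
for illustration and is not part of 0MQ; the sketch only shows the socket
option, not how 0MQ itself configures its connections.

```python
import socket

def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 asks the kernel not to hold small writes back
    # while waiting to coalesce them with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

sock = make_low_latency_socket()
# Read the option back to confirm the kernel accepted it (non-zero means on).
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

With Nagle switched off, every write goes out immediately, so batching has
to be done above the OS if small messages are not to become small packets.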
Third strategy is the one currently used in 0MQ. Messages are batched in
0MQ and the whole batch is passed to the OS using a single system call.
Moreover, there is no timeout. The principle is: Send all the messages
available to the OS - even if it's only a single message - and never wait
for more messages to arrive. Thus, if OS & network are able to keep up with
the message publishing rate, messages are passed as fast as possible (no
batching, lowest possible latency). However, if OS & network aren't able
to keep up with the message publishing rate, messages are batched (high
throughput, somewhat higher latency).
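The "send whatever is available, never wait" principle can be sketched in a
few lines of Python. This is an illustration only, not 0MQ's actual code: the
names frame, next_batch and sender_loop, the 4-byte length prefix and the
None stop marker are all assumptions made for the example.

```python
import queue
import socket
import struct

def frame(msg: bytes) -> bytes:
    """Prefix a message with its length (assumed framing, not 0MQ's wire format)."""
    return struct.pack("!I", len(msg)) + msg

def next_batch(q: queue.Queue) -> list:
    """Block for the first message, then take whatever else is already queued.

    There is no timeout: we never wait for more messages to arrive.
    """
    batch = [q.get()]
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            return batch

def sender_loop(q: queue.Queue, sock: socket.socket) -> None:
    """Pass each batch to the OS at once; None in the queue stops the loop."""
    while True:
        batch = next_batch(q)
        stop = None in batch
        data = b"".join(frame(m) for m in batch if m is not None)
        if data:
            # sendall normally hands the whole batch over in one send() call;
            # it only loops internally if the socket buffer is full.
            sock.sendall(data)
        if stop:
            return
```

If the application is slower than the network, each batch contains a single
message and nothing is delayed. If the application is faster, the queue fills
up while the previous batch is being sent and the batches grow on their own.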
Now let's have a look at the receiving side. Leftmost strategy is the
simplest possible one. As packets arrive they are passed to the OS. When
application asks for a new message, messaging system retrieves just
enough data for that one message from the OS and passes the message to
the application. Simple, but not very efficient.
Second strategy is the one used in 0MQ. Packets arrive from the network and
are passed to the OS. When application asks for a new message, messaging
system gets as much data as possible from the OS and returns a message
to the application. When application asks for the next message, chances are
good that the data were already retrieved from the OS and thus the
operation is extremely cheap.
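A rough Python sketch of this receiving strategy follows. It is not 0MQ's
implementation: the class name BatchingReceiver, the 64kB chunk size, the
4-byte length prefix and the syscalls counter are assumptions made for the
example.

```python
import socket
import struct

class BatchingReceiver:
    """Read as much as possible from the OS, hand out one message at a time."""

    def __init__(self, sock: socket.socket, chunk_size: int = 65536):
        self.sock = sock
        self.chunk_size = chunk_size
        self.buf = bytearray()   # data already retrieved from the OS
        self.syscalls = 0        # number of recv() calls, for illustration

    def _try_parse(self):
        """Return one complete message from the buffer, or None if incomplete."""
        if len(self.buf) < 4:
            return None
        (size,) = struct.unpack_from("!I", self.buf)
        if len(self.buf) < 4 + size:
            return None
        msg = bytes(self.buf[4:4 + size])
        del self.buf[:4 + size]
        return msg

    def recv_message(self) -> bytes:
        while True:
            msg = self._try_parse()
            if msg is not None:
                return msg       # cheap path: no system call needed
            chunk = self.sock.recv(self.chunk_size)
            self.syscalls += 1
            if not chunk:
                raise ConnectionError("peer closed the connection")
            self.buf += chunk
```

When several small messages arrive together, the first recv_message call
pays for one recv() and the following calls are served from the buffer.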
Last strategy shows what happens when you have interrupt coalescing
turned on in your network interface card. The card waits for a certain
amount of time and reports several arrived packets using a single
interrupt. With small packets this eliminates a lot of stack traversal;
however, the tradeoff is much worse latency because of the timeout
involved. Thus we recommend switching interrupt coalescing off when
using 0MQ.
In summary, to get best throughput & latency follow these four basic rules:
1. Avoid timeouts on all layers of the stack.
2. Batch data as soon as possible.
3. Unbatch the data as late as possible.
4. Don't re-batch the data that were already batched.
Hope this gives an idea of how to tune individual layers of the stack to
get the best performance.
Martin
Attachments:
receiving.png (image/png, 20173 bytes)
<https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20090111/64ca29fd/attachment.png>
sending.png (image/png, 14703 bytes)
<https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20090111/64ca29fd/attachment-0001.png>