[zeromq-dev] (almost) zero-copy message receive
Jens Auer
jens.auer at betaversion.net
Sun May 31 19:29:59 CEST 2015
Hi,
I did some performance analysis of a program which receives data on a (SUB or
PULL) socket, filters it for some criteria, extracts a value from each message
and uses this value as the subscription topic to forward the data to a PUB
socket. As expected, most of the time is spent in memory allocations and memcpy
operations, so I decided to check if there is an opportunity to minimize these
operations in the critical path. From my analysis, the path is as follows:
1. stream_engine receives data from the socket into a static buffer of 8192
bytes
2. decoder/v2_decoder implements a state machine which reads the flags and
message size, creates a new message and copies the data into the message's
data field
3. When sending, stream_engine copies the flags, message size and message data
into a static buffer and sends the whole buffer to the socket
Memory allocations are done in v2_decoder when a new message is created, and
deallocations are done when the message is sent. As sketched after the list
below, memcpy operations are done in the decoder to copy
- the flags byte into a temporary buffer
- the message size into a temporary buffer
- the message data into the dynamically allocated storage
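To make that path concrete, here is a simplified sketch of the copy-based
decode step. It is not the actual libzmq decoder (the real v2_decoder copies
the flags and size fields via memcpy into temporary buffers because a frame
can straddle two reads); all names here are illustrative:

#include <cstdlib>
#include <cstring>

struct message
{
    unsigned char flags;
    size_t size;
    void *data;          // dynamically allocated per message
};

// 'buf' is a slice of the static 8192-byte receive buffer filled by the
// stream engine; a short frame is [flags byte][size byte][size bytes of body].
static size_t decode_one_frame (const unsigned char *buf, message &msg)
{
    msg.flags = buf[0];                          // flags byte
    msg.size = buf[1];                           // one-byte size field
    msg.data = std::malloc (msg.size);           // per-message allocation
    std::memcpy (msg.data, buf + 2, msg.size);   // body copied out of the buffer
    return 2 + msg.size;                         // bytes consumed
}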
Since the allocations and memcpys are the dominating operations, I implemented
a scheme where these operations are minimized. The main idea is to allocate
the receive buffer of 8192 bytes dynamically and use it as the data storage
for zero-copy messages created with msg_t::init_data. This replaces n = 8192 /
(m_size + 10) memory allocations with a single allocation, and it gets rid of
the same number of memcpy operations for the message data (a sketch of the
idea follows the results below). I implemented this in a fork
(https://github.com/jens-auer/libzmq/tree/zero_copy_receive). For testing, I
ran the throughput test (message size 100, 100000 messages) locally and
profiled for memory allocations and memcpy. The results are promising:
- memory allocations reduced from 100,260 to 2,573
- memcpy operations reduced from 301,227 to 202,449. This is expected because
three memcpys are done for every message, and the patch removes only the data
memcpy.
- throughput increased by about 30-40% (I only did a couple of runs to test
it, no thorough benchmarking)
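For illustration, here is a minimal sketch of the buffer-sharing idea using
the public zmq_msg_init_data API; the patch itself applies the same idea
inside v2_decoder via msg_t::init_data, and the refcounted_buffer type and
function names below are made up for the example:

#include <zmq.h>
#include <atomic>
#include <cstddef>

struct refcounted_buffer
{
    std::atomic<int> refs;
    unsigned char data[8192];        // the chunk the stream engine reads into
};

// One allocation per 8192-byte chunk instead of one per message.
static refcounted_buffer *alloc_buffer ()
{
    refcounted_buffer *buf = new refcounted_buffer;
    buf->refs.store (1);             // reference held by the decoder itself
    return buf;
}

// Called by libzmq when the last copy of a message is closed.
static void release_buffer (void *, void *hint)
{
    refcounted_buffer *buf = static_cast<refcounted_buffer *> (hint);
    if (buf->refs.fetch_sub (1) == 1)
        delete buf;                  // last reference gone: free the chunk
}

// Build a message whose data pointer aims directly into the shared chunk
// instead of copying the body into freshly allocated storage.
static int make_message (zmq_msg_t *msg, refcounted_buffer *buf,
                         size_t offset, size_t size)
{
    buf->refs.fetch_add (1);
    return zmq_msg_init_data (msg, buf->data + offset, size,
                              release_buffer, buf);
}

The n = 8192 / (m_size + 10) per-message allocations per chunk collapse into a
single one, at the price of keeping the whole chunk alive until the last
message referencing it has been closed.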
For the implementation, I had to change two other things. After my first
implementation, I realized that msg_t::init_data does a malloc to create the
content_t member. Given that msg_t's size is now 64 bytes, I removed content_t
completely by adding its members to the lmsg_t union. However, this is a
problem with the current code because one of the members is an
atomic_counter_t, which is a non-POD type and cannot be a union member. For my
proof-of-concept implementation, I switched on C++11 mode because it relaxes
the restriction that union members must be POD types.
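To illustrate the restriction, here is a minimal sketch; counter_t stands in
for zmq::atomic_counter_t, and the layout is made up rather than copied from
msg_t:

struct counter_t
{
    counter_t () : value (0) {}   // user-provided constructor => non-trivial
    int value;
};

union lmsg_like_t
{
    // Because counter_t has a non-trivial constructor, this member is not a
    // POD: C++03 rejects it as a union member, while C++11's unrestricted
    // unions accept it (construction then has to be managed explicitly,
    // which msg_t already does via its init_*/close functions).
    struct
    {
        void *data;
        unsigned long size;
        counter_t refcnt;
    } content;
    unsigned char raw[64];
};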
I hope this is useful and can maybe be included in the main branch. My next
step is to change the encoder/stream_engine to use writev to skip the memcpy
operations when sending messages.
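Roughly, the idea is the writev pattern below, assuming a POSIX socket fd;
this is not the stream_engine code, and send_frame is just an illustrative
name:

#include <sys/uio.h>
#include <cstdint>
#include <cstddef>

// Send one short frame as [flags][size][body] without first assembling it in
// a contiguous buffer: header and message body stay in separate memory.
static ssize_t send_frame (int fd, unsigned char flags,
                           const void *body, uint8_t size)
{
    unsigned char header[2] = { flags, size };

    struct iovec iov[2];
    iov[0].iov_base = header;                    // 2-byte frame header
    iov[0].iov_len = sizeof header;
    iov[1].iov_base = const_cast<void *> (body); // message data, not copied
    iov[1].iov_len = size;

    // A real engine would handle short writes and EAGAIN; omitted here.
    return writev (fd, iov, 2);
}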
Best wishes,
Jens Auer