[zeromq-dev] (almost) zero-copy message receive

Jens Auer jens.auer at betaversion.net
Sun May 31 19:29:59 CEST 2015


I did some performance analysis of a program which receives data on a (SUB or 
PULL) socket, filters it by some criteria, extracts a value from the message 
and uses this as a subscription to forward the data to a PUB socket. As 
expected, most time is spent in memory allocations and memcpy operations, so I 
decided to check whether there is an opportunity to minimize these operations 
in the critical path. From my analysis, the path is as follows:
1. stream_engine receives data from a socket into a static buffer of 8192 bytes
2. decoder/v2_decoder implement a state machine which reads the flags and 
message size, creates a new message and copies the data into the message data
3. when sending, stream_engine copies the flags field, message size and message 
data into a static buffer and sends this buffer completely to the socket

Memory allocations are done in v2_decoder when a new message is created, and 
deallocations are done when sending the message. Memcpy operations are done in 
decoder to copy
- the flags byte into a temporary buffer
- the message size into a temporary buffer
- the message data into the dynamically allocated storage

Since allocations and memcpy are the dominant operations, I implemented a 
scheme that minimizes them. The main idea is to allocate the 8192-byte receive 
buffer dynamically and use it as the data storage for zero-copy messages 
created with msg_t::init_data. This replaces n = 8192 / (m_size + 10) memory 
allocations with a single allocation, and it eliminates the same number of 
memcpy operations for the message data. I implemented this in a fork 
(https://github.com/jens-auer/libzmq/tree/zero_copy_receive). For testing, I 
ran the throughput test (message size 100, 100000 messages) locally and 
profiled for memory allocations and memcpy. The results are promising:
- memory allocations reduced from 100,260 to 2,573
- memcpy operations reduced from 301,227 to 202,449. This is expected because 
three memcpys are done per message, and the patch removes only the data 
memcpy.
- throughput increased by about 30-40% (I only did a couple of runs to test 
it, no thorough benchmarking)

For the implementation, I had to change two other things. After my first 
implementation, I realized that msg_t::init_data does a malloc to create the 
content_t member. Given that msg_t's size is now 64 bytes, I removed content_t 
completely by adding its members to the lmsg_t union. However, this is a 
problem with the current code because one of the members is an 
atomic_counter_t, which is a non-POD type and cannot be a union member. For my 
proof-of-concept implementation, I switched on C++11 mode because it relaxes 
the POD requirements for union members. 

I hope this is useful and can be included in the main branch. My next step is 
to change the encoder/stream_engine to use writev to skip the memcpy 
operations when sending messages.

Best wishes,
  Jens Auer
