[zeromq-dev] (almost) zero-copy message receive

Arnaud Loonstra arnaud at sphaero.org
Tue Jun 2 10:13:43 CEST 2015

Although I'm not very familiar with zmq's internals this looks 
Did you test whether your implementation remains correct, i.e. that it
doesn't introduce deadlocks or race conditions?



On 2015-05-31 19:29, Jens Auer wrote:
> Hi,
> I did some performance analysis of a program which receives data on a
> (SUB or PULL) socket, filters it for some criteria, extracts a value
> from the message and uses this as a subscription to forward the data
> to a PUB socket. As expected, most time is spent in memory allocations
> and memcpy operations, so I decided to check if there is an
> opportunity to minimize these operations in the critical path. From my
> analysis, the path is as follows:
> 1. stream_engine receives data from a socket into a static buffer of
>    8192 bytes
> 2. decoder/v2_decoder implement a state machine which reads the flags
>    and message size, creates a new message and copies the data into
>    the message data field
> 3. When sending, stream_engine copies the flags field, message and
>    message data into a static buffer and sends this buffer completely
>    to the socket
> Memory allocations are done in v2_decoder when a new message is
> created, and deallocations are done when sending the message. Memcpy
> operations are done in decoder to copy
> - the flags byte into a temporary buffer
> - the message size into a temporary buffer
> - the message data into the dynamically allocated storage
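
To make the three copies concrete, here is a minimal sketch of a copying
decoder. It is not libzmq's v2_decoder: Message and decode_copying are
hypothetical names, and the framing is an assumed simplification (one
flags byte, then a 1-byte size, or an 8-byte big-endian size when flag
bit 0x02 is set).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a decoded message (not libzmq's msg_t).
struct Message {
    std::uint8_t flags;
    std::vector<std::uint8_t> data;  // separate heap storage per message
};

// Decode all complete frames in buf[0..len). `copies` is incremented
// once for every memcpy call that is made.
std::vector<Message> decode_copying(const std::uint8_t* buf, std::size_t len,
                                    std::size_t& copies) {
    std::vector<Message> out;
    std::size_t pos = 0;
    while (pos < len) {
        std::uint8_t flags;
        std::memcpy(&flags, buf + pos, 1);          // copy 1: flags byte
        ++copies;
        const std::size_t size_len = (flags & 0x02) ? 8 : 1;
        if (len - pos - 1 < size_len) break;        // size field incomplete
        std::uint8_t tmp[8];
        std::memcpy(tmp, buf + pos + 1, size_len);  // copy 2: size field
        ++copies;
        std::uint64_t size = 0;
        for (std::size_t i = 0; i < size_len; ++i) size = (size << 8) | tmp[i];
        const std::size_t body = pos + 1 + size_len;
        if (size > len - body) break;               // payload incomplete
        const std::size_t n = static_cast<std::size_t>(size);
        Message m;
        m.flags = flags;
        m.data.resize(n);                           // allocation per message
        if (n > 0) {
            std::memcpy(m.data.data(), buf + body, n);  // copy 3: payload
            ++copies;
        }
        out.push_back(m);
        pos = body + n;
    }
    return out;
}
```

With two short, non-empty frames in the buffer this performs six memcpy
calls and two payload allocations, which is the per-message cost the
original post sets out to reduce.
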
> Since the allocations and memcpy are the dominating operations, I
> implemented a scheme where these operations are minimized. The main
> idea is to allocate the receive buffer of 8192 bytes dynamically and
> use this as the data storage for zero-copy messages created with
> msg_t::init_data. This replaces n = 8192 / (m_size + 10) memory
> allocations with one allocation, and it gets rid of the same number of
> memcpy operations for the message data. I implemented this in a fork
> (https://github.com/jens-auer/libzmq/tree/zero_copy_receive).
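
The shared-buffer idea can be sketched as follows. This is not the code
from the fork: it swaps in std::shared_ptr for the reference counting
that msg_t::init_data would provide, Buffer, View and decode_zero_copy
are hypothetical names, and only short frames (one flags byte plus a
1-byte size) are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical receive buffer, shared by all messages decoded from it.
typedef std::shared_ptr<std::vector<std::uint8_t> > Buffer;

// A message that points into the receive buffer instead of owning a copy.
struct View {
    Buffer owner;              // keeps the receive buffer alive
    std::uint8_t flags;
    const std::uint8_t* data;  // points into *owner, no payload copy
    std::size_t size;
};

// Split the buffer into views, one per complete short frame.
std::vector<View> decode_zero_copy(const Buffer& buf) {
    std::vector<View> out;
    const std::uint8_t* p = buf->data();
    const std::size_t len = buf->size();
    std::size_t pos = 0;
    while (len - pos >= 2) {
        const std::uint8_t flags = p[pos];
        const std::size_t size = p[pos + 1];
        if (size > len - pos - 2) break;  // payload incomplete
        View v = { buf, flags, p + pos + 2, size };
        out.push_back(v);
        pos += 2 + size;
    }
    return out;
}
```

The buffer is released only when the last view referring to it is
destroyed, so the payload storage is allocated once per receive buffer
rather than once per message.
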
> For testing, I ran the throughput test (message size 100, 100000
> messages) locally and profiled for memory allocations and memcpy. The
> results are promising:
> - memory allocations reduced from 100,260 to 2,573
> - memcpy operations reduced from 301,227 to 202,449. This is expected
>   because for every message, three memcpys are done, and the patch
>   removes the data memcpy only.
> - throughput increased significantly, by about 30-40% (I only did a
>   couple of runs to test it, no thorough benchmarking)
> For the implementation, I had to change two other things. After my
> first implementation, I realized that msg_t::init_data does a malloc
> to create the content_t member. Given that msg_t's size is now 64
> bytes, I removed content_t completely by adding the members of
> content_t to the lmsg_t union. However, this is a problem with the
> current code because one of the members is an atomic_counter_t, which
> is a non-POD type and cannot be a union member. For my
> proof-of-concept implementation, I switched on C++11 mode because
> this relaxes the requirements for PODs.
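
For reference, the C++11 feature that permits this is that union members
may have non-trivial constructors (unrestricted unions). A minimal
sketch, assuming hypothetical types Counter, SmallBody, LargeBody and
Message rather than libzmq's actual atomic_counter_t and msg_t layout:

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical counter with a user-provided constructor, which makes it
// a non-POD type (a stand-in, not libzmq's atomic_counter_t).
class Counter {
public:
    explicit Counter(int v = 0) : value_(v) {}
    void add(int n) { value_ += n; }
    // Returns false once the count has dropped to zero.
    bool sub(int n) { value_ -= n; return value_ != 0; }
    int get() const { return value_; }
private:
    int value_;
};

struct SmallBody { unsigned char data[32]; unsigned char size; };
struct LargeBody { void* data; std::size_t size; Counter refcnt; };

struct Message {
    enum Kind { kSmall, kLarge };
    Kind kind;
    // C++03 rejects this union because LargeBody has a non-trivial
    // default constructor; C++11 accepts it.
    union {
        SmallBody small;
        LargeBody large;
    };

    // Must be user-provided: the implicit default constructor is deleted.
    Message() : kind(kSmall), small() {}

    // Switch to the large variant, storing the counter inside the message.
    void init_large(void* data, std::size_t size) {
        new (&large) LargeBody{data, size, Counter(1)};
        kind = kLarge;
    }
};
```

The counter then lives inside the message itself, so no separate
allocation is needed to hold it.
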
> I hope this could be useful and maybe included in the main branch. My
> next step is to change the encoder/stream engine to use writev to skip
> the memcpy operations when sending messages.
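
A gather write with POSIX writev could look roughly like this. It is a
sketch only: send_frame_gather is a hypothetical helper, the framing is
assumed to be one flags byte plus a 1-byte size (payloads up to 255
bytes), and partial writes, which a real engine must handle, are
ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Send header and payload in one system call without first copying the
// payload into a contiguous send buffer.
// Returns the number of bytes written, or -1 on error.
ssize_t send_frame_gather(int fd, std::uint8_t flags,
                          const void* payload, std::size_t size) {
    if (size > 255) return -1;  // sketch supports short frames only
    std::uint8_t header[2] = { flags, static_cast<std::uint8_t>(size) };
    iovec iov[2];
    iov[0].iov_base = header;                      // small header on the stack
    iov[0].iov_len = sizeof header;
    iov[1].iov_base = const_cast<void*>(payload);  // payload sent in place
    iov[1].iov_len = size;
    return writev(fd, iov, 2);
}
```

For example, send_frame_gather(fd, 0, "abc", 3) puts five bytes on the
wire: the two header bytes followed by the three payload bytes.
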
> Best wishes,
>   Jens Auer