[zeromq-dev] (almost) zero-copy message receive

Auer, Jens jens.auer at cgi.com
Tue Jun 2 15:00:56 CEST 2015

Hi Arnaud,

I am quite sure that there are no deadlocks because no locking is used in the changes :-). The main idea is to allocate a new buffer of 8k + sizeof(int) for receiving data, and put an atomic_counter_t in the first bytes. This counter is incremented every time a new message using the buffer as its storage is created, and the free function passed to msg_t::init_data decrements the counter and frees the buffer when it reaches zero. Each decoder has its own buffer, and I think that decoders are used by a single thread only.
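To make the counting scheme concrete, here is a minimal sketch, not the code from the branch. It uses std::atomic<int> as a stand-in for atomic_counter_t, and the names shared_buffer_t, buffer_alloc, buffer_add_ref, buffer_release and buffers_freed are made up for illustration. It also assumes that the decoder itself holds one reference while it is still filling the buffer, which is an assumption of the sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <new>

// Hypothetical layout: the counter sits in the first bytes of the block,
// followed by the 8k receive area.
struct shared_buffer_t
{
    std::atomic<int> refs; // stand-in for zmq's atomic_counter_t
    unsigned char data[8192];

    // Assumption: the decoder owns the initial reference.
    shared_buffer_t () : refs (1) {}
};

// Only here so the sketch can be observed from a test.
static int buffers_freed = 0;

// One malloc for counter and payload together.
shared_buffer_t *buffer_alloc ()
{
    void *mem = std::malloc (sizeof (shared_buffer_t));
    if (!mem)
        return nullptr;
    return new (mem) shared_buffer_t ();
}

// Called when a message starts using part of the buffer as its storage.
void buffer_add_ref (shared_buffer_t *buf)
{
    buf->refs.fetch_add (1, std::memory_order_relaxed);
}

// Same shape as a free function for msg_t::init_data (void *data, void *hint).
// 'hint' carries the buffer; 'data' (the message payload inside it) is unused.
void buffer_release (void *data, void *hint)
{
    (void) data;
    assert (hint != nullptr);
    shared_buffer_t *buf = static_cast<shared_buffer_t *> (hint);
    // fetch_sub returns the previous value, so 1 means this was the last user.
    if (buf->refs.fetch_sub (1, std::memory_order_acq_rel) == 1) {
        buf->~shared_buffer_t ();
        std::free (buf);
        ++buffers_freed;
    }
}
```

No lock is taken anywhere; the only shared state is the counter, and whoever drops the last reference frees the block.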

I ran the tests included in ZeroMQ, and they all passed (except for test_system, which had an issue in the cleanup code). I did not do much functional testing because I don't have a large test suite available. I just ran the throughput tests and a simple multiplexing/demultiplexing example I am evaluating. It would be great if somebody else could run some tests on the branch.

Best wishes,
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer at cgi.com
Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB are available at de.cgi.com/pflichtangaben.

From: zeromq-dev-bounces at lists.zeromq.org [zeromq-dev-bounces at lists.zeromq.org] on behalf of Arnaud Loonstra [arnaud at sphaero.org]
Sent: Tuesday, 2 June 2015 10:13
To: ZeroMQ development list
Subject: Re: [zeromq-dev] (almost) zero-copy message receive

Although I'm not very familiar with zmq's internals this looks
Did you test if your implementation remains correct? I.e., that it doesn't
introduce deadlocks or other race conditions?



On 2015-05-31 19:29, Jens Auer wrote:
> Hi,
> I did some performance analysis of a program which receives data on a (SUB or PULL) socket, filters it for some criteria, extracts a value from the message and uses this as a subscription to forward the data to a PUB socket. As expected, most time is spent in memory allocations and memcpy operations, so I decided to check if there is an opportunity to minimize these operations in the critical path. From my analysis, the path is as follows:
> 1. stream_engine receives data from a socket into a static buffer of 8192 bytes
> 2. decoder/v2_decoder implement a state machine which reads the flag and message size, create a new message and copy the data into the message data field
> 3. When sending, stream_engine copies the flags field, message and message data into a static buffer and sends this buffer completely to the socket
> Memory allocations are done in v2_decoder when a new message is created, and deallocations are done when sending the message. Memcpy operations are done in decoder to copy
> - the flags byte into a temporary buffer
> - the message size into a temporary buffer
> - the message data into the dynamically allocated storage
> Since the allocations and memcpy are the dominating operations, I implemented a scheme where these operations are minimized. The main idea is to allocate the receive buffer of 8192 bytes dynamically and use this as the data storage for zero-copy messages created with msg_t::init_data. This replaces n = 8192 / (m_size + 10) memory allocations with one allocation, and it gets rid of the same number of memcpy operations for the message data. I implemented this in a fork (https://github.com/jens-auer/libzmq/tree/zero_copy_receive). For testing, I ran the throughput test (message size 100, 100000 messages) locally and profiled for memory allocations and memcpy. The results are promising:
> - memory allocations reduced from 100,260 to 2,573
> - memcpy operations reduced from 301,227 to 202,449. This is expected because for every message, three memcpys are done, and the patch removes the data memcpy only.
> - throughput increased significantly by about 30-40% (I only did a couple of runs to test it, no thorough benchmarking)
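As a quick check on these numbers, a small sketch of the arithmetic. The helper names are made up for illustration, and the 10 bytes of per-message overhead are simply the estimate from the formula n = 8192 / (m_size + 10).

```cpp
#include <cstddef>

// Best case: how many messages of msg_size bytes fit into one receive buffer,
// assuming roughly 10 bytes of overhead per message as in the estimate above.
constexpr std::size_t messages_per_buffer (std::size_t buffer_size,
                                           std::size_t msg_size)
{
    return buffer_size / (msg_size + 10);
}

// Lower bound on buffer allocations for a run of total_msgs messages
// (integer division rounded up).
constexpr std::size_t buffers_needed (std::size_t total_msgs,
                                      std::size_t per_buffer)
{
    return (total_msgs + per_buffer - 1) / per_buffer;
}

static_assert (messages_per_buffer (8192, 100) == 74,
               "74 messages of 100 bytes per 8192-byte buffer");
static_assert (buffers_needed (100000, 74) == 1352,
               "at least 1352 buffers for 100000 messages");
```

So for the throughput test the best case is 74 messages per buffer and at least 1,352 buffer allocations. The measured 2,573 is above that bound, which would fit reads that do not always fill the whole buffer, though that is a guess. The memcpy difference, 301,227 - 202,449 = 98,778, is close to one saved copy per message.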
> For the implementation, I had to change two other things. After my first implementation, I realized that msg_t::init_data does a malloc to create the content_t member. Given that msg_t's size is now 64 bytes, I removed content_t completely by adding the members of content_t to the lmsg_t union. However, this is a problem with the current code because one of the members is an atomic_counter_t, which is a non-POD type and cannot be a union member. For my proof-of-concept implementation, I switched on C++11 mode because this relaxes the requirements for PODs.
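The union problem can be illustrated with a small sketch. This is not the real msg_t layout: counter_t is a stand-in for atomic_counter_t, and vsm_t, lmsg_t and msg_sketch_t are simplified, hypothetical types. In C++03 a class with a user-provided constructor cannot be a union member; C++11 allows it, as long as the enclosing type constructs the member explicitly, here with placement new.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// Stand-in for atomic_counter_t: the user-provided constructor makes it non-POD.
class counter_t
{
  public:
    counter_t () : value (0) {}
    void set (int n) { value.store (n); }
    int get () const { return value.load (); }

  private:
    std::atomic<int> value;
};

// Very small message: payload stored inline.
struct vsm_t
{
    unsigned char data[29];
    unsigned char size;
};

// Large message: fields that used to live in a separately malloc'ed content_t.
struct lmsg_t
{
    void *data;
    std::size_t size;
    counter_t refcnt;
};

struct msg_sketch_t
{
    bool is_long;
    union
    {
        vsm_t vsm;
        lmsg_t lmsg; // only legal as a union member since C++11
    };

    // The union has no usable default constructor, so pick a member here.
    msg_sketch_t () : is_long (false), vsm () {}

    // Switches the active member to lmsg without any heap allocation.
    void init_long (void *data, std::size_t size)
    {
        assert (!is_long);
        new (&lmsg) lmsg_t ();
        lmsg.data = data;
        lmsg.size = size;
        lmsg.refcnt.set (1);
        is_long = true;
    }
};

static_assert (sizeof (msg_sketch_t) <= 64, "sketch fits into 64 bytes");
```

The counter now lives inside the message itself, so initializing a large message needs no malloc for the bookkeeping.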
> I hope this could be useful and maybe included in the main branch. My next step is to change the encoder/stream engine to use writev to skip the memcpy operations when sending messages.
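For the writev idea, a minimal sketch of what a gather write could look like, assuming a POSIX system. The function send_frame and its parameters are hypothetical and not part of libzmq; the header bytes stand for whatever the encoder would otherwise copy in front of the payload.

```cpp
#include <sys/uio.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>

// Hands header and payload to the kernel in one call instead of first
// copying both into a contiguous send buffer.
// Returns the number of bytes written, or -1 on error (see errno).
ssize_t send_frame (int fd,
                    const unsigned char *header,
                    std::size_t header_size,
                    const void *payload,
                    std::size_t payload_size)
{
    assert (header != nullptr);
    struct iovec iov[2];
    // iovec is not const-correct; writev does not modify the data.
    iov[0].iov_base = const_cast<unsigned char *> (header);
    iov[0].iov_len = header_size;
    iov[1].iov_base = const_cast<void *> (payload);
    iov[1].iov_len = payload_size;
    return writev (fd, iov, 2);
}
```

Like write, writev may transfer fewer bytes than requested, so a real encoder would still have to track how far it got and resume from there.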
> Best wishes,
>   Jens Auer
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
