[zeromq-dev] Memory pool for zmq_msg_t

Francesco francesco.montorsi at gmail.com
Sat Aug 17 15:57:05 CEST 2019


Hi Luca,
Thanks for the explanation. It seems there is no need to do memory
pooling for packet RX, right?
One allocation every 19kB seems pretty efficient already (nice work! :))

Still, I wonder if we can somehow improve the performance of
zmq::v2_decoder_t::size_ready,
since that function appears to be the bottleneck in my latest performance
benchmarks (see my previous email).
My feeling is that if memory management is not a problem along the RX path,
then a single zmq background IO thread/core (on a fast CPU) should be able
to do more than the approx. 2 Mpps limit that I found...
My concern is that this is a fundamental limit in zmq scalability: since a
single zmq socket is always handled by a single zmq background thread,
even if I buy 100 Gbps of bandwidth I will not be able to use
more than 2-3 Gbps when sending 64 B messages on that socket.

Thanks for any hint or comment,
Francesco



On Fri, 16 Aug 2019, 17:20 Luca Boccassi <luca.boccassi at gmail.com>
wrote:

> The message structures themselves are always on the stack. The TCP
> receive is batched, and if there are multiple messages in an 8KB kernel
> buffer, each message's content_t simply points to the right place for the
> data in that shared buffer, which is refcounted. The content_t structures are
> also in the same memory zone, which is split to allow enough content_t for
> 8KB/minimum_size_msg+1 messages - so in practice there is one allocation of
> ~19KB which is shared by as many messages as can fit their data in the 8KB
> that are received in one TCP read.
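
The sharing scheme described above can be illustrated with a minimal sketch. This is not libzmq's implementation: it uses std::shared_ptr's aliasing constructor as a stand-in for the refcounted buffer, and the MsgView type, the split_batch function and the 1-byte length prefix are all assumptions made only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// One receive batch, allocated once and shared by every message cut from it.
using Batch = std::vector<unsigned char>;

// Hypothetical message view: points into the batch and keeps it alive.
struct MsgView {
    std::shared_ptr<const unsigned char> data;
    std::size_t size;
};

// Split a batch of length-prefixed messages (1-byte length, assumed framing)
// into views without copying any payload bytes.
inline std::vector<MsgView> split_batch(const std::shared_ptr<Batch>& batch) {
    std::vector<MsgView> out;
    std::size_t off = 0;
    while (off < batch->size()) {
        const std::size_t len = (*batch)[off];
        if (off + 1 + len > batch->size())
            break; // incomplete trailing message, left for the next read
        // Aliasing constructor: shares ownership of the whole batch but
        // points at this message's payload.
        std::shared_ptr<const unsigned char> p(batch, batch->data() + off + 1);
        out.push_back(MsgView{p, len});
        off += 1 + len;
    }
    return out;
}
```

The batch is released only when the last view referring to it goes away, which is the property the refcounted shared buffer provides on the receive path.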
>
> On Fri, 2019-08-16 at 16:46 +0200, Francesco wrote:
>
> Hi Doron,
> Ok the zmq_msg_init_allocator approach looks fine to me. I hope I have
> time to work on that in the next couple of weeks (unless someone else wants
> to step in of course :-) ).
>
> Anyway the current approach works for sending messages...I wonder how the
> Rx side works and if we could exploit memory pooling also for that... Is
> there any kind of documentation on how the engine works for Rx (or some
> email thread) perhaps?
>
> I know there is some zero copy mechanism in place but it's not totally
> clear to me: is the zmq_msg_t coming out of zmq API pointing directly to
> the kernel buffers?
>
> Thanks
> Francesco
>
>
> On Thu, 15 Aug 2019, 11:39 Doron Somech <somdoron at gmail.com> wrote:
>
> Maybe zmq_msg_init_allocator, which accepts the allocator.
>
> With that pattern we do need the release method; the zmq_msg will handle
> it internally and register the release method as the free method of the
> zmq_msg. They do need to have the same signature.
>
> On Thu, Aug 15, 2019 at 12:35 PM Francesco <francesco.montorsi at gmail.com>
> wrote:
>
> Hi Doron, hi Jens,
> Yes the allocator method is a nice solution.
> I think it would be nice to have libzmq provide also a memory pool
> implementation but use as default the malloc/free implementation for
> backward compatibility.
>
> It's also important to have a smart allocator that internally contains not
> just one but several pools for different packet size classes, to avoid
> memory waste. But I think this can fit easily in the allocator pattern
> sketched out by Jens.
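
A minimal sketch of such a multi-class pool follows. The class sizes, the names (kClassSizes, class_index, SizeClassPool) and the choice to have the caller pass the size back on release are all assumptions for illustration; the sketch is not thread-safe and is not taken from the merge request.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Assumed size classes; a real pool would tune these to the traffic profile.
constexpr std::size_t kClassSizes[] = {64, 256, 1024, 4096};
constexpr std::size_t kNumClasses = sizeof(kClassSizes) / sizeof(kClassSizes[0]);

// Index of the smallest class that fits n bytes, or kNumClasses if none does.
inline std::size_t class_index(std::size_t n) {
    for (std::size_t i = 0; i < kNumClasses; ++i)
        if (n <= kClassSizes[i])
            return i;
    return kNumClasses;
}

class SizeClassPool {
public:
    ~SizeClassPool() {
        for (auto& list : free_)
            for (void* p : list)
                std::free(p);
    }

    void* allocate(std::size_t n) {
        const std::size_t c = class_index(n);
        if (c == kNumClasses)
            return std::malloc(n); // oversize request: plain malloc
        if (free_[c].empty())
            return std::malloc(kClassSizes[c]); // grow the class on demand
        void* p = free_[c].back();
        free_[c].pop_back();
        return p;
    }

    // The caller passes back the size it asked for, so no per-buffer header
    // is needed to find the class again.
    void release(void* p, std::size_t n) {
        const std::size_t c = class_index(n);
        if (c == kNumClasses)
            std::free(p);
        else
            free_[c].push_back(p);
    }

private:
    std::vector<void*> free_[kNumClasses];
};
```

With these classes a 100-byte and a 200-byte request both land in the 256-byte class, so a buffer released by one can be reused by the other, while the worst-case waste per buffer stays bounded by the gap between adjacent classes.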
>
> Btw, another issue unrelated to the allocator API but regarding performance:
> I think it's important to avoid not only the allocation of the msg buffer
> but also that of the content_t structure, and indeed in my preliminary
> merge request I modified zmq_msg_t of type_lmsg to use the first 40 bytes
> inside the pooled buffer for it.
> Of course this approach is not backward compatible with the _init_data()
> semantics.
> How do you think this would best be approached?
> I guess we may have a new _init_data_and_controlblock() helper that does
> the trick of taking the first 40 bytes of the provided buffer?
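
The idea of carving the control block out of the front of the pooled buffer could look roughly like the sketch below. ControlBlock is a hypothetical stand-in for libzmq's internal content_t, and emplace_header and payload_of are illustrative names, not existing or proposed libzmq functions.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical control block standing in for libzmq's internal content_t.
struct ControlBlock {
    explicit ControlBlock(std::size_t n) : refcnt(1), size(n) {}
    std::atomic<int> refcnt;
    std::size_t size;
};

// Header size rounded up so the payload that follows stays max-aligned.
constexpr std::size_t header_size =
    (sizeof(ControlBlock) + alignof(std::max_align_t) - 1)
    / alignof(std::max_align_t) * alignof(std::max_align_t);

// Construct the control block in place at the start of a pooled buffer.
// `buf` must be suitably aligned and hold at least header_size + payload bytes.
inline ControlBlock* emplace_header(void* buf, std::size_t payload) {
    return new (buf) ControlBlock(payload);
}

// The user-visible payload starts right after the (padded) header.
inline unsigned char* payload_of(ControlBlock* cb) {
    return reinterpret_cast<unsigned char*>(cb) + header_size;
}
```

The cost of this layout is that the caller must hand over a buffer with the header space reserved, which is why it cannot be retrofitted onto the existing _init_data() semantics.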
>
> Thanks
> Francesco
>
>
> On Wed, 14 Aug 2019, 22:23 Doron Somech <somdoron at gmail.com> wrote:
>
> Jens, I like the idea.
>
> We actually don't need the release method.
> The signature of allocate should receive the zmq_msg and allocate it.
>
> int (&allocate)(zmq_msg *msg, size_t size, void *obj);
>
> When the allocator creates the zmq_msg, it will provide the release
> method to the zmq_msg in the constructor.
>
> This is important in order to forward messages between sockets, so the
> release method is part of the msg. This is already supported by zmq_msg,
> which accepts a free method with a hint (obj in your example).
>
> The return value of allocate will be a success indication, like the rest of
> the zeromq methods.
>
> zeromq actually already supports a pool mechanism when sending, using the
> zmq_msg API. Receiving is the problem; your suggestion solves it nicely.
>
> By the way, a memory pool is already supported in NetMQ with a very similar
> solution to the one you suggested. (It is global for all sockets, without
> override.)
>
>
>
> On Wed, Aug 14, 2019, 22:41 Jens Auer <jens.auer at betaversion.net> wrote:
>
> Hi,
>
> Maybe this can be combined with a request that I have seen a couple of
> times to be able to configure the allocator used in libzmq? I am thinking
> of something like
>
> struct zmq_allocator {
>     void* obj;
>     void* (&allocate)(size_t n, void* obj);
>     void (&release)(void* ptr, void* obj);
> };
>
> void* useMalloc(size_t n, void*) {return malloc(n);}
> void freeMalloc(void* ptr, void*) {free(ptr);}
>
> zmq_allocator& zmq_default_allocator() {
>     static zmq_allocator defaultAllocator = {nullptr, useMalloc,
> freeMalloc};
>     return defaultAllocator;
> }
>
> The context could then store the allocator for libzmq, and users could set
> a specific allocator as a context option, e.g. with a zmq_ctx_set. A socket
> created for a context can then inherit the default allocator or set a
> special allocator as a socket option.
>
> class MemoryPool {…}; // hopefully thread-safe
>
> MemoryPool pool;
>
> void* allocatePool(size_t n, void* pool) {return
> static_cast<MemoryPool*>(pool)->allocate(n);}
> void releasePool(void* ptr, void* pool)
> {static_cast<MemoryPool*>(pool)->release(ptr);}
>
> zmq_allocator pooledAllocator {
>     &pool, allocatePool, releasePool
> };
>
> void* ctx = zmq_ctx_new();
> // ZMQ_ALLOCATOR is the proposed option, not an existing one
> zmq_ctx_set(ctx, ZMQ_ALLOCATOR, &pooledAllocator);
>
> Cheers,
> Jens
>
> On 13.08.2019 at 13:24, Francesco <francesco.montorsi at gmail.com> wrote:
>
> Hi all,
>
> today I've taken some time to attempt building a memory-pooling
> mechanism in ZMQ local_thr/remote_thr benchmarking utilities.
> Here's the result:
> https://github.com/zeromq/libzmq/pull/3631
> This PR is a work in progress and is a simple modification to show the
> effects of avoiding malloc/free when creating zmq_msg_t with the
> standard benchmark utils of ZMQ.
>
> In particular the very fast, zero-lock,
> single-producer/single-consumer queue from:
> https://github.com/cameron314/readerwriterqueue
> is used to maintain between the "remote_thr" main thread and its ZMQ
> background IO thread a list of free buffers that can be used.
>
> Here are the graphical results:
> with mallocs / no memory pool:
>
> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
> with memory pool:
>
> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
>
> Doing the math the memory pooled approach shows:
>
> - mostly the same performances for messages <= 32B
> - +15% pps/throughput increase @ 64B
> - +60% pps/throughput increase @ 128B
> - +70% pps/throughput increase @ 210B
>
> [the tests were stopped at 210B because my current quick-dirty memory
> pool approach has fixed max msg size of about 210B].
>
> Honestly this is not a huge speedup, even if still interesting.
> Indeed with these changes the performances now seem to be bounded by
> the "local_thr" side and not by the "remote_thr" anymore. Indeed the
> zmq background IO thread for local_thr is the only thread at 100% in
> the 2 systems and its "perf top" now shows:
>
>  15,02%  libzmq.so.5.2.3     [.] zmq::metadata_t::add_ref
>  14,91%  libzmq.so.5.2.3     [.] zmq::v2_decoder_t::size_ready
>   8,94%  libzmq.so.5.2.3     [.] zmq::ypipe_t<zmq::msg_t, 256>::write
>   6,97%  libzmq.so.5.2.3     [.] zmq::msg_t::close
>   5,48%  libzmq.so.5.2.3     [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
>   5,40%  libzmq.so.5.2.3     [.] zmq::pipe_t::write
>   4,94%  libzmq.so.5.2.3     [.] zmq::shared_message_memory_allocator::inc_ref
>   2,59%  libzmq.so.5.2.3     [.] zmq::msg_t::init_external_storage
>   1,63%  [kernel]            [k] copy_user_enhanced_fast_string
>   1,56%  libzmq.so.5.2.3     [.] zmq::msg_t::data
>   1,43%  libzmq.so.5.2.3     [.] zmq::msg_t::init
>   1,34%  libzmq.so.5.2.3     [.] zmq::pipe_t::check_write
>   1,24%  libzmq.so.5.2.3     [.] zmq::stream_engine_base_t::in_event_internal
>   1,24%  libzmq.so.5.2.3     [.] zmq::msg_t::size
>
> Do you know what this profile might mean?
> I would expect that ZMQ background thread to spend most of its time in
> the read() system call (from the TCP socket)...
>
> Thanks,
> Francesco
>
>
> On Fri, 19 Jul 2019 at 18:15 Francesco
> <francesco.montorsi at gmail.com> wrote:
>
>
> Hi Yan,
> Unfortunately I have interrupted my attempts in this area after getting
> some strange results (possibly due to the fact that I tried in a complex
> application context... I should probably try hacking a simple zeromq
> example instead!).
>
> I'm also a bit surprised that nobody has tried and posted online a way to
> achieve something similar (memory pool zmq send)... But anyway, it remains
> in my plans to try that out when I have a bit more spare time...
> If you manage to have some results earlier, I would be eager to know :-)
>
> Francesco
>
>
> On Fri, 19 Jul 2019, 04:02 Yan, Liming (NSB - CN/Hangzhou) <
> liming.yan at nokia-sbell.com> wrote:
>
>
> Hi Francesco,
>   Could you please share the final solution and benchmark results for plan
> 2? Many thanks.
>   I'm concerned about this because I tried something similar before with
> zmq_msg_init_data() and zmq_msg_send(), but failed because of two issues.
>  1) My process runs in the background for a long time, and I eventually
> found that it occupies more and more memory until it exhausts the system
> memory. It seems there is a memory leak with this approach.
>  2) I provided *ffn for deallocation, but memory is freed back much more
> slowly than it is consumed, so my own customized pool could also be
> exhausted. How do you solve this?
>   I had to go back to using zmq_send(). I know it has a memory copy penalty,
> but it's the easiest and most stable way to send messages. I'm still using
> 0MQ 4.1.x.
>   Thanks.
>
> BR
> Yan Limin
>
> -----Original Message-----
> From: zeromq-dev [mailto:zeromq-dev-bounces at lists.zeromq.org
> <zeromq-dev-bounces at lists.zeromq.org>] On Behalf Of Luca Boccassi
> Sent: Friday, July 05, 2019 4:58 PM
> To: ZeroMQ development list <zeromq-dev at lists.zeromq.org>
> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>
> There's no need to change the source for experimenting, you can just use
> _init_data without a callback and with a callback (yes the first case will
> leak memory but it's just a test), and measure the difference between the
> two cases. You can then immediately see if it's worth pursuing further
> optimisations or not.
>
> _external_storage is an implementation detail, and it's non-shared because
> it's used in the receive case only, as it's used with a reference to the
> TCP buffer used in the system call for zero-copy receives. Exposing that
> means that those kinds of messages could not be used with pub-sub or
> radio-dish, as they can't have multiple references without copying them,
> which means there would be a semantic difference between the different
> message initialisation APIs, unlike now when the difference is only in who
> owns the buffer. It would make the API quite messy in my opinion, and be
> quite confusing as pub/sub is probably the most well known pattern.
>
> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>
> Hi Luca,
> thanks for the details. Indeed I understand why the "content_t" needs
> to be allocated dynamically: it's just like the control block used by
> STL's std::shared_ptr<>.
>
> And you're right: I'm not sure how much gain there is in removing 100%
> of malloc operations from my TX path... still I would be curious to
> find it out but right now it seems I need to patch ZMQ source code to
> achieve that.
>
> Anyway I wonder if it could be possible to expose in the public API a
> method like "zmq::msg_t::init_external_storage()" that, AFAICS, allows
> to create a non-shared zero-copy long message... it appears to be used
> only by v2 decoder internally right now...
> Is there a specific reason why that's not accessible from the public
> API?
>
> Thanks,
> Francesco
>
>
>
>
>
> On Thu, 4 Jul 2019 at 20:25 Luca Boccassi <
> luca.boccassi at gmail.com> wrote:
>
> Another reason for that small struct to be on the heap is so that it
> can be shared among all the copies of the message (eg: a pub socket
> has N copies of the message on the stack, one for each subscriber).
> The struct has an atomic counter in it, so that when all the copies
> of the message on the stack have been closed, the userspace buffer
> deallocation callback can be invoked. If the atomic counter were on
> the stack inlined in the message, this wouldn't work.
> So even if room were to be found, a malloc would still be needed.
>
> If you _really_ are worried about it, and testing shows it makes a
> difference, then one option could be to pre-allocate a set of these
> metadata structures at startup, and just assign them when the
> message is created. It's possible, but increases complexity quite a
> bit, so it needs to be worth it.
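
The role of the shared counter can be sketched as follows. Content and Msg are hypothetical stand-ins for libzmq's internal structures, the Pool is deliberately trivial, and only the callback signature (void (void *data, void *hint)) matches the real zmq_free_fn type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Same shape as libzmq's zmq_free_fn: void (void *data, void *hint).
using free_fn = void(void* data, void* hint);

// Hypothetical heap block shared by all copies of one message.
struct Content {
    Content(void* d, free_fn* f, void* h) : data(d), ffn(f), hint(h), refcnt(1) {}
    void* data;
    free_fn* ffn;
    void* hint;
    std::atomic<int> refcnt;
};

// Cheap handle that lives on the stack; copies share the same Content.
struct Msg {
    Content* content;
};

inline Msg msg_init(void* data, free_fn* ffn, void* hint) {
    return Msg{new Content(data, ffn, hint)};
}

inline Msg msg_copy(const Msg& m) {
    m.content->refcnt.fetch_add(1, std::memory_order_relaxed);
    return Msg{m.content};
}

// The deallocation callback runs only when the last copy is closed.
inline void msg_close(Msg& m) {
    if (m.content->refcnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        m.content->ffn(m.content->data, m.content->hint);
        delete m.content;
    }
    m.content = nullptr;
}

// Trivial pool: the hint is the pool, the data is the buffer being returned.
struct Pool {
    std::vector<void*> free_list;
};

inline void return_to_pool(void* data, void* hint) {
    static_cast<Pool*>(hint)->free_list.push_back(data);
}
```

Because the counter lives in one heap block rather than inside each stack-allocated handle, every copy sees the same count, and the buffer goes back to the pool exactly once.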
>
> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>
> The second malloc cannot be avoided, but it's tiny and fixed in size at
> compile time, so the compiler and glibc will be able to optimize it to
> death.
>
> The reason for that is that there's not enough room in the 64 bytes to
> store that structure, and increasing the message allocation on the
> stack past 64 bytes means it will no longer fit in a single cache
> line, which will incur in a performance penalty far worse than the small
> malloc (I tested this some time ago). That is of course unless you are
> running on s390 or a POWER with 256 bytes cacheline, but given it's
> part of the ABI it would be a bit of a mess for the benefit of very few
> users if any.
>
> So I'd recommend to just go with the second plan, and compare what the
> result is when passing a deallocation function vs not passing it (yes
> it will leak the memory but it's just for the test). My bet is that the
> difference will not be that large.
>
> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>
> Hi Stephan, Hi Luca,
>
> thanks for your hints. However I inspected
>
> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>
> and I don't think it's saving from malloc()...  see my point 2) below:
>
> Indeed I realized that probably current ZMQ API does not allow me to
> achieve the 100% of what I intended to do.
> Let me rephrase my target: my target is to be able to
> - memory pool creation: do a large memory allocation of, say, 1M
> zmq_msg_t only at the start of my program; let's say I create all
> these zmq_msg_t of a size of 2k bytes each (let's assume this is the
> max size of message possible in my app)
> - during application lifetime: call zmq_msg_send() at anytime
> always avoiding malloc() operations (just picking the first
> available unused entry of zmq_msg_t from the memory pool).
>
> Initially I thought that was possible but I think I have identified 2
> blocking issues:
> 1) If I try to recycle zmq_msg_t directly: in this case I will fail
> because I cannot really change only the "size" member of a
> zmq_msg_t without reallocating it... so that I'm forced (in my
> example) to always send 2k bytes out (!!)
> 2) if I do create only a memory pool of buffers of 2k bytes and
> then wrap the first available buffer inside a zmq_msg_t
> (allocated on the stack, not in the heap): in this case I need to
> know when the internals of ZMQ have completed using the zmq_msg_t and
> thus when I can mark that buffer as available again in my memory pool.
> However I see that zmq_msg_init_data() ZMQ code contains:
>
>    //  Initialize constant message if there's no need to deallocate
>    if (ffn_ == NULL) {
> ...
>        _u.cmsg.data = data_;
>        _u.cmsg.size = size_;
> ...
>    } else {
> ...
>        _u.lmsg.content =
>          static_cast<content_t *> (malloc (sizeof (content_t)));
> ...
>        _u.lmsg.content->data = data_;
>        _u.lmsg.content->size = size_;
>        _u.lmsg.content->ffn = ffn_;
>        _u.lmsg.content->hint = hint_;
>        new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>    }
>
> So that I skip malloc() operation only if I pass ffn_ == NULL. The
> problem is that if I pass ffn_ == NULL, then I have no way to know
> when the internals of ZMQ have completed using the zmq_msg_t...
>
> Any way to workaround either issue 1) or issue 2) ?
>
> I understand that the malloc is just of size(content_t)~= 40B... but
> still I'd like to avoid it...
>
> Thanks!
> Francesco
>
>
>
>
>
> On Thu, 4 Jul 2019 at 14:58 Stephan Opfer <opfer at vs.uni-kassel.de>
> wrote:
> On 04.07.19 14:29, Luca Boccassi wrote:
>
> How users make use of these primitives is up to them though, I don't
> think anything special was shared before, as far as I remember.
>
> Some example can be found here:
>
> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>
> The classes Publisher and Subscriber should replace the publisher and
> subscriber in a former Robot-Operating-System-based System. I
> hope that the subscriber is actually using the method Luca is
> talking about on the receiving side.
>
> The message data here is a Cap'n Proto container that we "simply"
> serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
>
>
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>