[zeromq-dev] ZMQ I/O threads CPU usage

Brett Viren brett.viren at gmail.com
Fri Apr 2 16:06:04 CEST 2021


Francesco <francesco.montorsi at gmail.com> writes:

> Here's what I get varying the spin_loop duration:

Thanks for sharing it.  It's cool to see the effect in action!

> Do you think the CPU load of the zmq background thread would be caused by
> the much more frequent TCP ACKs coming from the SUB when the
> "batching" suddenly stops happening?

Well, I don't know enough about all the mechanisms to say with any
certainty that the TCP ACKs are what drives the effect.  Though, that
certainly sounds reasonable to me.

> Your suggestion is that if the application thread is fast enough (the spin loop is "short enough")
> then the while() loop body is actually executed 2-3-4 times and we send() a large TCP packet,
> thereby reducing both syscall overhead and the number of TCP ACKs from the SUB (and thus kernel
> overhead).
> If instead the application thread is not fast enough (the spin loop is "too long") then the while()
> loop body executes only once and we send my 300B frames one by one to zmq::tcp_write()
> and the send() syscall. That would kill the performance of the zmq background thread.
> Is that correct?

Yep, that's the basic premise I had.

Though, I don't know the exact mechanisms beyond "more stuff happens
when many tiny packets are sent". :)

> Now the other 1M$ question: if that's the case, is there any tuning I
> can do to force the zmq background thread to wait for some time before
> invoking send() ?

> I'm thinking that I could try to replace the TCP_NODELAY option that is set on the TCP socket
> with the TCP_CORK option instead and see what happens. In this way I basically go in the
> opposite direction in the throughput-vs-latency tradeoff ...
> Or maybe I could change the libzmq source code to invoke tcp_write() only e.g. every N times
> out_event() is invoked? I think I risk getting some bytes stuck in the stream engine if at
> some point I stop sending out messages though....
>
> Any other suggestion?

Nothing specific.

As you say, it's a throughput-vs-latency problem.  And in this case it
is a bit more complicated because the particular size/time parameters
bring the problem to a place where the Nagle "step function" matters.

Two approaches to try, though perhaps without much hope of huge
improvements, are to push Nagle's algorithm out of libzmq: either back
down into the TCP stack or up into the application layer.

I don't know how to tell libzmq to give this optimization back to the
TCP stack.  I recall reading (maybe on this list) about someone doing
work in this direction.  I also don't remember the outcome of that work
but I'd guess there was not much benefit.  The libzmq developers made
the effort to pull Nagle up into libzmq (presumably) because libzmq has
more knowledge than the TCP stack does and so can perform the
optimization more... er, optimally.

Likewise, doing message batching in the application may or may not
help, but in this case it would be rather easy to try.  There are two
ways to do it: either send N 300B parts as an N-part multipart message,
or do the join/split in the application layer; a rough sketch of both
follows below.
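
To make that concrete, here is a rough, untested sketch in C of both
application-side options.  It assumes a PUB socket 'pub' that is
already set up, fixed 300-byte payloads and a batch size of 8; the
names and numbers are only illustrative, not from your setup.

  /* Application-level batching, two variants.  'pub' is an already
   * connected/bound ZMQ_PUB socket; PART_SIZE and N_PARTS are
   * illustrative values. */
  #include <string.h>
  #include <zmq.h>

  #define PART_SIZE 300
  #define N_PARTS   8

  /* Variant 1: send N parts as one N-part multipart message.  The
   * parts are queued atomically, so the I/O thread gets them all at
   * once and can batch them onto the wire. */
  static int send_multipart (void *pub, const char parts[N_PARTS][PART_SIZE])
  {
      for (int i = 0; i < N_PARTS; i++) {
          int flags = (i < N_PARTS - 1) ? ZMQ_SNDMORE : 0;
          if (zmq_send (pub, parts[i], PART_SIZE, flags) == -1)
              return -1;
      }
      return 0;
  }

  /* Variant 2: join in the application layer, i.e. concatenate the N
   * fixed-size payloads into one frame and send that. */
  static int send_joined (void *pub, const char parts[N_PARTS][PART_SIZE])
  {
      zmq_msg_t msg;
      if (zmq_msg_init_size (&msg, (size_t) N_PARTS * PART_SIZE) == -1)
          return -1;
      char *dst = (char *) zmq_msg_data (&msg);
      for (int i = 0; i < N_PARTS; i++)
          memcpy (dst + (size_t) i * PART_SIZE, parts[i], PART_SIZE);
      if (zmq_msg_send (&msg, pub, 0) == -1) {
          zmq_msg_close (&msg);
          return -1;
      }
      return 0;   /* on success libzmq owns the message, no close needed */
  }

The multipart variant keeps the 300B boundaries visible on the SUB
side; the joined variant hands libzmq a single large frame, at the cost
of re-splitting (or reading in place) on the receiver.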

In particular, if the application can directly deal with concatenated
parts so no explicit join/split is required, then you may solve this
problem.  At least, reading N 300B blocks "in place" on the recv() side
should be easy enough.  As an example, zproto-generated code uses this
trick to "unpack-in-place" highly structured data.
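
For what it's worth, here is a rough sketch of what reading fixed-size
blocks in place could look like on the SUB side, assuming the
joined-frame variant above (again, the names and the 300B record size
are just illustrative):

  #include <zmq.h>

  #define PART_SIZE 300

  /* Receive one large frame and walk it in fixed 300B strides; each
   * record is processed where it sits in the frame, so no explicit
   * split or copy is needed. */
  static int recv_in_place (void *sub)
  {
      zmq_msg_t msg;
      zmq_msg_init (&msg);
      if (zmq_msg_recv (&msg, sub, 0) == -1)
          return -1;

      const char *data = (const char *) zmq_msg_data (&msg);
      size_t size = zmq_msg_size (&msg);

      for (size_t off = 0; off + PART_SIZE <= size; off += PART_SIZE) {
          const char *record = data + off;
          (void) record;   /* hand 'record' to the application here */
      }
      zmq_msg_close (&msg);
      return 0;
  }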


My other general suggestion is to step back and see what the application
actually requires w.r.t. throughput vs. latency.  Testing the limits is
one (interesting) thing but in practical use, will the app actually come
close to the limit?  If it really must push 300B messages at low latency
and at 1 Gbps then using faster links may be appropriate.  E.g., 10 GbE
can give better than 2 Gbps throughput for 300B messages[1] while
keeping latency low.


-Brett.


[1]
http://wiki.zeromq.org/results:10gbe-tests-v432
http://wiki.zeromq.org/results:100gbe-tests-v432