[zeromq-dev] ZMQ I/O threads CPU usage
Francesco
francesco.montorsi at gmail.com
Mon Apr 12 01:50:01 CEST 2021
Hi all,
An update on this topic: I didn't give up yet :)
I'm trying to rewrite the ZMQ proxy code so that it includes a "batching
queue" in the frontend -> backend direction (the direction in which, in my
use case, most of the data flows).
The intent is for this "batching queue" to let me tune the "throughput vs
latency" tradeoff more finely. I hope I can share positive results soon.
I have a question on the current code in proxy.cpp (the version using the
ZMQ poller mechanism). I see there's a lot of logic that swaps the poller
in use depending on whether the frontend socket or the backend socket gets
blocked.
My question is: is this logic really necessary? Is there some side effect
of using zmq_poller_wait_all() on a socket that is in the mute state after
reaching its HWM?
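To make the question concrete, the naive loop I would write (untested
sketch, using the draft zmq_poller API, with multipart and CONTROL socket
handling omitted) is:

#include <zmq.h>   /* needs a libzmq built with the draft API */

static void naive_proxy (void *frontend, void *backend)
{
    void *poller = zmq_poller_new ();
    zmq_poller_add (poller, frontend, NULL, ZMQ_POLLIN);
    zmq_poller_add (poller, backend, NULL, ZMQ_POLLIN);

    zmq_poller_event_t events[2];
    while (1) {
        /* Returns as soon as at least one socket has an event, filling up
           to 2 entries of events[]. */
        int n = zmq_poller_wait_all (poller, events, 2, -1);
        if (n == -1)
            break;

        for (int i = 0; i < n; i++) {
            void *from = events[i].socket;
            void *to = (from == frontend) ? backend : frontend;
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            if (zmq_msg_recv (&msg, from, 0) == -1)
                break;
            /* If 'to' is in the mute state because it reached its HWM,
               this blocking send just waits here. Is that the side effect
               the poller-swapping logic in proxy.cpp is avoiding? */
            if (zmq_msg_send (&msg, to, 0) == -1)
                break;
        }
    }
    zmq_poller_destroy (&poller);
}

i.e. both sockets stay registered in a single poller and a blocking send()
absorbs the back-pressure, instead of switching to a different poller when
one direction gets blocked.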
Thanks,
Francesco
On Fri, Apr 2, 2021 at 4:06 PM Brett Viren <brett.viren at gmail.com>
wrote:
> Francesco <francesco.montorsi at gmail.com> writes:
>
> > Here's what I get varying the spin_loop duration:
>
> Thanks for sharing it. It's cool to see the effect in action!
>
> > Do you think the CPU load of the zmq background thread would be caused by
> > the much more frequent TCP ACKs coming from the SUB when the
> > "batching" suddenly stops happening?
>
> Well, I don't know enough about all the mechanisms to personally say it
> is the TCP ACKs which are the driver of the effect. Though, that
> certainly sounds reasonable to me.
>
> > Your suggestion is that if the application thread is fast enough (spin
> > loop is "short enough") then the while() loop body is actually executed
> > 2-3-4 times and we send() a large TCP packet, thereby reducing both
> > syscall overhead and the number of TCP ACKs from the SUB (and thus
> > kernel overhead).
> > If instead the application thread is not fast enough (spin loop is "too
> > long") then the while() loop body executes only once and we send my 300B
> > frames one by one to zmq::tcp_write() and the send() syscall. That would
> > kill the performance of the zmq background thread.
> > Is that correct?
>
> Yep, that's the basic premise I had.
>
> Though, I don't know the exact mechanisms beyond "more stuff happens
> when many, tiny packets are sent". :)
>
> > Now the other 1M$ question: if that's the case, is there any tuning I
> > can do to force the zmq background thread to wait for some time before
> > invoking send()?
>
> > I'm thinking that I could try changing the TCP_NODELAY option that is
> > set on the TCP socket to TCP_CORK instead and see what happens. In this
> > way I basically go in the opposite direction of the
> > throughput-vs-latency tradeoff...
> > Or maybe I could change the libzmq source code to invoke tcp_write()
> > only e.g. every N times out_event() is invoked? I think I risk getting
> > some bytes stuck in the stream engine if at some point I stop sending
> > out messages, though...
> >
> > Any other suggestion?
>
> Nothing specific.
>
> As you say, it's a throughput-vs-latency problem. And in this case it
> is a bit more complicated because the particular size/time parameters
> bring the problem to a place where the Nagle "step function" matters.
>
> Two approaches to try, with maybe not much hope of huge improvements, are
> to push Nagle's algorithm out of libzmq and either back down to the TCP
> stack or up into the application layer.
>
> I don't know how to tell libzmq to give this optimization back to the
> TCP stack. I recall reading (maybe on this list) about someone doing
> work in this direction. I also don't remember the outcome of that work
> but I'd guess there was not much benefit. The libzmq developers took
> the effort to bring Nagle up into libzmq (presumably) because libzmq has
> more knowledge than exists down in the TCP stack and so can perform the
> optimization more... er, optimally.
>
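Just to make my TCP_CORK idea above concrete: the patch I have in mind
touches the place where libzmq currently disables Nagle on the connected
socket (tune_tcp_socket() in tcp.cpp, if I read the code correctly) and
would look roughly like this (untested, Linux-only; 'fd' stands for the
already-connected TCP descriptor):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int tune_for_throughput (int fd)
{
    /* Leave Nagle enabled instead of setting TCP_NODELAY = 1 as today. */
    int nodelay = 0;
    if (setsockopt (fd, IPPROTO_TCP, TCP_NODELAY, &nodelay,
                    sizeof nodelay) != 0)
        return -1;

    /* TCP_CORK: the kernel holds partial segments until the cork is
       removed or a 200 ms ceiling expires. */
    int cork = 1;
    return setsockopt (fd, IPPROTO_TCP, TCP_CORK, &cork, sizeof cork);
}

Of course the data may then sit in the kernel for up to 200 ms, which is
precisely the latency end of the tradeoff.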
> Likewise, doing message batching in the application may or may not help.
> But, in this case it would be rather easy to try. And there are two
> approaches to try. Either send N 300B parts in an N-part multipart
> message or enact join/split operations in the application layer.
>
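The first approach is easy for me to prototype; on the sending side it is
just something like this (untested sketch, PARTS_PER_BATCH is a
placeholder):

#include <zmq.h>

#define PARTS_PER_BATCH 8
#define PART_SIZE       300

/* Send PARTS_PER_BATCH 300-byte payloads as one multipart message; as far
   as I understand, the pipe towards the I/O thread is flushed only when
   the last part (no ZMQ_SNDMORE) is queued, so the whole batch reaches the
   engine together. */
static int send_batch (void *socket, const char payloads[][PART_SIZE])
{
    for (int i = 0; i < PARTS_PER_BATCH; i++) {
        int flags = (i < PARTS_PER_BATCH - 1) ? ZMQ_SNDMORE : 0;
        if (zmq_send (socket, payloads[i], PART_SIZE, flags) == -1)
            return -1;
    }
    return 0;
}

(and symmetrically the SUB side loops on zmq_recv() checking ZMQ_RCVMORE).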
> In particular, if the application can directly deal with concatenated
> parts so no explicit join/split is required, then you may solve this
> problem. At least, reading N 300B blocks "in place" on the recv() side
> should be easy enough. As an example, zproto-generated code uses this
> trick to "unpack-in-place" highly structured data.
>
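Right, the second approach also looks simple on my side: the sender
concatenates K records into one single-part message and the receiver walks
them in place inside the zmq_msg_t, without any split or copy (untested
sketch; the fixed 300-byte record layout is just my current payload):

#include <zmq.h>
#include <stddef.h>

#define RECORD_SIZE 300

static int process_in_place (void *socket,
                             void (*handle) (const unsigned char *record))
{
    zmq_msg_t msg;
    zmq_msg_init (&msg);
    if (zmq_msg_recv (&msg, socket, 0) == -1) {
        zmq_msg_close (&msg);
        return -1;
    }

    const unsigned char *data = zmq_msg_data (&msg);
    size_t size = zmq_msg_size (&msg);

    /* Handle each RECORD_SIZE slice directly in the message buffer. */
    for (size_t off = 0; off + RECORD_SIZE <= size; off += RECORD_SIZE)
        handle (data + off);

    zmq_msg_close (&msg);
    return 0;
}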
>
> My other general suggestion is to step back and see what the application
> actually requires w.r.t. throughput-vs-latency. Testing the limits is
> one (interesting) thing but in practical use, will the app actually come
> close to the limit? If it really must push 300B messages at low latency
> and at 1 Gbps then using faster links may be appropriate. E.g., 10 GbE can
> give better than 2 Gbps throughput for 300B messages[1] while keeping
> latency low.
>
>
> -Brett.
>
>
> [1]
> http://wiki.zeromq.org/results:10gbe-tests-v432
> http://wiki.zeromq.org/results:100gbe-tests-v432
>