[zeromq-dev] Inefficient TCP connection for my PUB-SUB zmq communication

Francesco francesco.montorsi at gmail.com
Sun Mar 28 17:43:59 CEST 2021

Hi all,

A few more questions after inspecting ZMQ source code:
- I see that in June 2019 the following PR was merged:
https://github.com/zeromq/libzmq/pull/3555   This one exposes the engine's
internal batch sizes as context options. At first look it may seem exactly
what I was looking for, but the thing is that the default value is already
quite high (8192)... in my use case it would probably be enough to coalesce
a maximum of 5 or 6 messages to reach the MTU size.
- The thread publishing on my PUB zmq socket takes roughly 100-500usec to
generate a new message. That means that generating 5 messages might take up
to 2.5msec in the worst case. I would be OK paying this latency in order to
improve throughput... is there any way to achieve that? What happens if I
disable the code in ZMQ that sets TCP_NODELAY and replace it with TCP_CORK?
Do you think that could cause some kind of breakage of my PUB/SUB
connections?

and one consideration:
 - I discovered why my tcpdump capture contains larger-than-MTU packets
(even though they are <1% of the total): capturing traffic on the same
server that sends/receives it is not a good idea, because tcpdump sees
outgoing packets before TCP segmentation offload (TSO/GSO) in the NIC has
split them into MTU-sized frames. I will try to acquire tcpdumps from the
SPAN port of a managed switch instead, though I don't think the results
will change much.

Thanks for any hint,

Il giorno sab 27 mar 2021 alle ore 10:22 Francesco <
francesco.montorsi at gmail.com> ha scritto:

> Hi Jim,
> You're right, and I plan to change the MTU to 9000 for sure.
> However even now, with the MTU being 1500, I see most packets are very far
> from the limit.
> Attached is a screenshot of the capture:
> [image: tcp_capture.png]
> By looking at the timestamps I see that the packets of size 583B and 376B
> are spaced just roughly 100us apart, and the packets of 376B and 366B are
> spaced 400us apart.
> In this case I'd be more than happy to pay some extra latency to merge
> all three of these packets together.
> After some more digging I found this code in ZMQ:
>     //  Disable Nagle's algorithm. We are doing data batching on 0MQ level,
>     //  so using Nagle wouldn't improve throughput in anyway, but it would
>     //  hurt latency.
>     int nodelay = 1;
>     const int rc =
>       setsockopt (s_, IPPROTO_TCP, TCP_NODELAY,
>                   reinterpret_cast<char *> (&nodelay), sizeof (int));
>     assert_success_or_recoverable (s_, rc);
>     if (rc != 0)
>         return rc;
> Now my next question is: where does this "data batching on 0MQ level"
> happen? Can I tune it somehow? Can I restore Nagle's algorithm?
> I saw also from here
>   https://man7.org/linux/man-pages/man7/tcp.7.html
> that it is possible to set the TCP_CORK option on the socket to try to
> optimize throughput... is there any way to do that through ZMQ?
> Thanks!!
> Francesco
> Il giorno sab 27 mar 2021 alle ore 05:01 Jim Melton <jim at melton.space> ha
> scritto:
>> Small TCP packets will never achieve maximum throughput. This is
>> independent of ZMQ. Each TCP packet requires a synchronous round-trip.
>> For a 20 Gbps network, you need a larger MTU to achieve close to
>> theoretical bandwidth, and each packet needs to be close to MTU. Jumbo MTU
>> is typically 9000 bytes. The TCP ACK packets will kill your throughput,
>> though.
>> --
>> Jim Melton
>> (303) 829-0447
>> http://blogs.melton.space/pharisee/
>> jim at melton.space
>> On Mar 26, 2021, at 4:17 PM, Francesco <francesco.montorsi at gmail.com>
>> wrote:
>> Hi all,
>> I'm using ZMQ in a product that moves a lot of data using TCP as
>> transport and PUB-SUB as communication pattern. "A lot" here means around
>> 1Gbps. The software is actually a mono-directional chain of small
>> components each linked to the previous with a SUB socket (to receive data)
>> and a PUB socket (to send data to next stage).
>> I'm debugging an issue with one of these components receiving 1.1Gbps
>> from its SUB socket and sending out 1.1Gbps on its PUB socket (no wonder
>> the two numbers match, since the component does no aggregation whatsoever).
>> The "problem" is that we are currently using 16 ZMQ background threads to
>> move a total of 2.2Gbps for that software component (note the physical
>> links can carry up to 20Gbps so we're far from saturation of the link).
>> IIRC the "golden rule" for sizing number of ZMQ background threads is 1Gbps
>> = 1 thread.
>> As you can see we're very far from this golden rule, and that's what I'm
>> trying to debug.
>> The ZMQ background threads have a CPU usage ranging from 98% to 80%.
>> Using "strace" I see that most of the time for these threads is spent in
>> the "sendto" syscall.
>> So I started digging on the quality of the TX side of the TCP connection,
>> recording a short trace of the traffic outgoing from the software component.
>> Analyzing the traffic with wireshark it turns out that the TCP packets
>> for the PUB connection are pretty small:
>> * 50% of them are 66B long; these are the TCP ACK packets (incoming)
>> * 21% of them are in the range 160B-320B
>> * 18% in the range 320B-640B
>> * 5% in range 640B-1280B
>> * just 3% reach the MTU equal to 1500B
>> * [a <1% fraction even exceeds the link MTU of 1500B, which I'm not
>> sure how is possible]
>> My belief is that having fewer packets, all close to the MTU of the
>> link, should greatly improve performance. Would you agree with that?
>> Is there any configuration I can apply on the PUB socket to force the
>> Linux TCP stack to generate fewer but larger TCP segments on the wire?
>> Thanks for any hint,
>> Francesco
>> _______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org
>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tcp_capture.png
Type: image/png
Size: 56670 bytes
Desc: not available
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20210328/e393db31/attachment.png>
