[zeromq-dev] Pub socket slows down for one client

Auer, Jens jens.auer at cgi.com
Mon Jul 11 11:47:20 CEST 2016


Hi,

so I added some more logging to check whether the consumer is being slowed down. In stream_engine::out_event, I check whether the call to tcp_write would block and count those blocked calls. To make sure that I don't miss any blocked send calls, I also added output to tcp.cpp. I also increased in_batch_size and out_batch_size to reduce the number of syscalls. With that, I ran my test setup and observed the problem once more; the results are below.
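For reference, the counting amounts to the following (a self-contained sketch of the idea on a plain non-blocking socket, not the actual patch to stream_engine.cpp / tcp.cpp; the struct and function names are made up for illustration):

// Count how often send() on a non-blocking TCP socket would have blocked,
// i.e. how often the kernel send buffer was full. This is roughly what a
// zero return from tcp_write() in out_event means in libzmq.
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>

struct send_stats
{
    unsigned long calls = 0;     // attempted writes
    unsigned long blocked = 0;   // writes that hit EAGAIN/EWOULDBLOCK
    unsigned long bytes = 0;     // bytes actually handed to the kernel
};

// Returns bytes written (0 if the write would block), -1 on a real error.
inline ssize_t counted_send (int fd, const void *data, size_t size,
                             send_stats &stats)
{
    ++stats.calls;
    const ssize_t n = ::send (fd, data, size, 0);
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        ++stats.blocked;         // a genuinely slow consumer shows up here
        return 0;
    }
    if (n > 0)
        stats.bytes += static_cast<unsigned long> (n);
    return n;
}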

The good connection consistently sends at ~100 Mbps (~12.7 MB/s):
Mon Jul 11 09:02:28 2016
1468227748668193        147     Sent 12769753 of 12769753 bytes in 1.00001s (1000010us) with 303 calls 0 blocked
Mon Jul 11 09:02:29 2016
1468227749668202        147     Sent 12700046 of 12700046 bytes in 1.00001s (1000009us) with 295 calls 0 blocked
Mon Jul 11 09:02:30 2016
1468227750668249        147     Sent 12792417 of 12792417 bytes in 1.00005s (1000047us) with 295 calls 0 blocked
Mon Jul 11 09:02:31 2016
1468227751668461        147     Sent 12787390 of 12787390 bytes in 1.00021s (1000212us) with 345 calls 0 blocked
Mon Jul 11 09:02:32 2016
1468227752668635        147     Sent 12808874 of 12808874 bytes in 1.00017s (1000174us) with 308 calls 0 blocked
Mon Jul 11 09:02:33 2016
1468227753668824        147     Sent 12832523 of 12832523 bytes in 1.00019s (1000189us) with 274 calls 0 blocked
Mon Jul 11 09:02:34 2016
1468227754668855        147     Sent 12861635 of 12861635 bytes in 1.00003s (1000031us) with 234 calls 0 blocked

The second connection on the same PUB socket slows down after a couple of minutes:
Mon Jul 11 09:02:45 2016
1468227765157113        159     Sent 9568256 of 9568256 bytes in 1.01988s (1019875us) with 146 calls 0 blocked
Mon Jul 11 09:02:46 2016
1468227766180625        159     Sent 9175040 of 9175040 bytes in 1.02351s (1023512us) with 140 calls 0 blocked
Mon Jul 11 09:02:47 2016
1468227767181102        159     Sent 8060928 of 8060928 bytes in 1.00048s (1000477us) with 123 calls 0 blocked
Mon Jul 11 09:02:48 2016
1468227768200968        159     Sent 7471104 of 7471104 bytes in 1.01987s (1019866us) with 114 calls 0 blocked
Mon Jul 11 09:02:49 2016
1468227769226228        159     Sent 7864320 of 7864320 bytes in 1.02526s (1025260us) with 120 calls 0 blocked
Mon Jul 11 09:02:50 2016
1468227770226817        159     Sent 8323072 of 8323072 bytes in 1.00059s (1000589us) with 127 calls 0 blocked
Mon Jul 11 09:02:51 2016
1468227771246216        159     Sent 7143424 of 7143424 bytes in 1.0194s (1019399us) with 109 calls 0 blocked
Mon Jul 11 09:02:52 2016
1468227772270349        159     Sent 6553600 of 6553600 bytes in 1.02413s (1024133us) with 100 calls 0 blocked

and then catches up after some time. Since I have no blocked send calls, I am very sure that it is not a slow-consumer problem. It looks like a scheduling problem in ZeroMQ (or maybe the OS scheduler?).
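For context, a genuinely slow subscriber would show up differently on a PUB socket: once ZMQ_SNDHWM messages are queued for that subscriber, ZeroMQ starts dropping messages for it, which would mean lost data. A minimal illustration of the option involved (the HWM value and endpoint are examples only, not my configuration):

#include <zmq.h>
#include <cassert>

//  A PUB socket drops messages for a subscriber whose pipe has reached
//  ZMQ_SNDHWM outstanding messages. Since no send call ever blocks and no
//  data is lost (see the earlier mail below), that never happens here.
void *make_pub (void *ctx)
{
    void *pub = zmq_socket (ctx, ZMQ_PUB);
    int hwm = 1000;                       //  example value only
    int rc = zmq_setsockopt (pub, ZMQ_SNDHWM, &hwm, sizeof hwm);
    assert (rc == 0);
    rc = zmq_bind (pub, "tcp://*:5556");  //  example endpoint only
    assert (rc == 0);
    return pub;
}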

Best wishes,
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.auer at cgi.com

________________________________________
From: zeromq-dev [zeromq-dev-bounces at lists.zeromq.org] on behalf of Jens Auer [jens.auer at betaversion.net]
Sent: Sunday, 10 July 2016 10:59
To: 'ZeroMQ development list'
Subject: [zeromq-dev] Pub socket slows down for one client

Hi,

I am having a problem in my system that looks like a slow consumer, but I
don't think it is. All my processes are running at ~20% CPU at most, and
there is plenty of margin on the system and on the network. However, I see
that sometimes one of the PUB connections slows down for a rather long time
(on the order of 5-40 s), then catches up and runs for hours without any
further problems.

I added some logging to stream_engine::in_event and out_event, but I am not
sure how to interpret the observations. I analyzed one case of slowdown.
Here, I have one process receiving a constant stream of ~32,000,000 bytes
per second on a ZMQ_STREAM raw TCP socket. The process basically splits this
into individual streams using an id in the data and then publishes each one
with its id as the subscription topic (a minimal sketch of this splitter is
shown after the list below). Two subscribers are connected to this PUB
socket. Sometimes, one of these SUBs gets slowed down. I have logging at the
application<->ZeroMQ boundary to count the number of messages received and
sent, plus logging at the ZeroMQ<->system boundary to count the number of
bytes received/sent and the number of syscalls. I see the following pattern
in the logs:
- No data is lost. Everything that is sent is received.
- The number of messages received and sent in the application is as
expected.
- One of the connections on the PUB socket does not send the same amount of
data as the other one. This looks like a slow receiver, but the receiver is
running at 20% CPU, the system is far from being loaded, and the network is
a 10 Gbit network with all hosts connected to a single switch. The TCP
windows are sized to a 6 MB default value for an expected RTT of 160 ms.
- Everything runs fine until I start to simulate latency with netem.
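To make the setup concrete, the splitter process is roughly of this shape (a
minimal sketch only; the endpoints, the 8-byte stream id and the framing are
invented for illustration, the real protocol differs):

#include <zmq.h>
#include <cassert>

int main ()
{
    void *ctx = zmq_ctx_new ();

    //  Raw TCP input carrying the multiplexed byte stream.
    void *stream = zmq_socket (ctx, ZMQ_STREAM);
    int rc = zmq_bind (stream, "tcp://*:5555");   //  example endpoint
    assert (rc == 0);

    //  PUB output; the per-stream id is used as the subscription topic.
    void *pub = zmq_socket (ctx, ZMQ_PUB);
    rc = zmq_bind (pub, "tcp://*:5556");          //  example endpoint
    assert (rc == 0);

    char id [256];
    char buf [8192];
    while (true) {
        //  ZMQ_STREAM delivers [routing id][data] message pairs.
        int id_size = zmq_recv (stream, id, sizeof id, 0);
        int n = zmq_recv (stream, buf, sizeof buf, 0);
        if (id_size < 0 || n <= 8)
            continue;   //  connect/disconnect notifications, short frames
        if (n > (int) sizeof buf)
            n = (int) sizeof buf;                 //  zmq_recv truncates

        //  Assume each frame starts with an 8-byte stream id followed by
        //  the payload.
        zmq_send (pub, buf, 8, ZMQ_SNDMORE);      //  topic = stream id
        zmq_send (pub, buf + 8, n - 8, 0);        //  payload
    }
    //  not reached in this sketch
    zmq_close (pub);
    zmq_close (stream);
    zmq_ctx_term (ctx);
    return 0;
}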

Looking at the log files, I see the expected output for the fast output
stream. The log prints a time stamp, the number of bytes sent since the last
log output, the time between the outputs, the number of frames (each 1142
bytes), and the number of calls to out_event / send.
Time                    Bytes           delta time      Frames  Syscalls to send
1467902214174220        12775617        1.00005s        1645    11187
1467902215174480        12724076        1.00026s        1615    11142
1467902216174610        12913520        1.00012s        1657    11308
1467902217174610        12913979        1.00001s        1661    11308
1467902218174670        12619185        1.00005s        1636    11050
1467902219174780        12699323        1.00011s        1608    11120
1467902220174780        12941494        1s      1666    11332

The problematic stream slows down, then catches up; after sending the
buffered frames, everything is fine again:
1467902253096660        7487488 1.00045s        914     6556
1467902254113250        7929856 1.01659s        968     6944
1467902255137710        7675904 1.02446s        937     6721
1467902256137830        7979008 1.00013s        974     6987
1467902257154080        8888320 1.01624s        1085    7783
1467902258180360        8863744 1.02628s        1082    7762
1467902259181080        9691136 1.00072s        1183    8486
1467902260197550        11091968        1.01648s        1354    9713
1467902261222090        11558912        1.02454s        1411    10122
1467902262222300        12984320        1.00021s        1585    11370
1467902263240030        15409152        1.01773s        1881    13493
1467902264263870        16228352        1.02384s        1981    14210
1467902265264570        18800640        1.0007s         2295    16463
1467902266281750        22364160        1.01719s        2730    19583
1467902267305500        24100864        1.02375s        2942    21104
1467902268306090        29376512        1.00059s        3586    25724
1467902269324240        35127296        1.01815s        4288    30759
1467902270347700        35618816        1.02347s        4348    31190
1467902271347770        39297024        1.00006s        4797    34411
1467902272482910        44589056        1.13515s        5443    39045
1467902273487890        42328064        1.00498s        5167    37065
1467902274510060        35741696        1.02217s        4363    31297
1467902275510560        38371328        1.0005s         4684    33600
1467902276530410        43376640        1.01985s        5295    37983
1467902277594050        35787401        1.06365s        4382    31337
1467902278594850        12757268        1.00079s        1582    11171
1467902279595210        12757337        1.00037s        1577    11171

What puzzles me is that the number of calls to out_event / send goes down
for a considerable time and then catches up. Does ZeroMQ dynamically
schedule attempts to send data? If so, what is the algorithm here? Can it be
influenced?

One thing I forgot to mention is that this only happens when I use netem to
simulate the expected one-way latency of 80 ms (RTT = 160 ms). The TCP
settings are tuned, and I consistently achieve the expected performance with
iperf etc. I have the feeling that there is some interaction between the TCP
layer and ZeroMQ that I am missing.
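
For completeness, the latency is injected with netem in the usual way, e.g.
"tc qdisc add dev eth0 root netem delay 80ms" on the hosts' egress
interfaces (the interface name is only an example). At ~100 Mbit/s per
connection and an RTT of 160 ms the bandwidth-delay product is roughly
100 Mbit/s * 0.16 s = 16 Mbit, i.e. about 2 MB, so the 6 MB TCP windows
mentioned above should leave plenty of headroom for a single connection.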

Best wishes,
  Jens



_______________________________________________
zeromq-dev mailing list
zeromq-dev at lists.zeromq.org
http://lists.zeromq.org/mailman/listinfo/zeromq-dev


