[zeromq-dev] Pub socket slows down for one client

Jens Auer jens.auer at betaversion.net
Sun Jul 10 10:59:54 CEST 2016


Hi,

I am having a problem in my system that looks like a slow consumer, but I
don't think it is. All my processes are running at ~20% CPU maximum, and
there is plenty of margin on the system and on the network. However, I
sometimes see one of the PUB connections slow down for a rather long time
(on the order of 5-40 s) and then catch up, after which it runs for hours
without any further problems.

I added some logging to stream_engine::in_event and out_event, but I am not
sure how to interpret the observations. I analyzed one case of slowdown.
Here, I have one process receiving a constant stream of ~32000000 bytes on
a ZMQ_STREAM raw TCP socket. The process basically splits this into
individual streams using an id in the data and then publishes each message
on a PUB socket with the id as the subscription topic. Two subscribers are
connected to this PUB socket. Sometimes, one of these SUBs gets slowed down.
I have logging at the application<->ZeroMQ boundaries to count the number of
messages received and sent, plus logging at the ZeroMQ<->system boundary to
count the number of bytes received/sent and the number of syscalls. I see
the following pattern in the logs:
- No data is lost. Everything that is sent is received.
- The number of messages received and sent in the application is as
expected.
- One of the connections on the PUB socket does not send the same amount of
data as the other one. This looks like a slow receiver, but the receiver is
running at 20% CPU, the system is far from loaded, and the network is a
10 Gb network with all hosts connected to a single switch. TCP windows are
sized to a 6 MB default value for an expected latency of 160 ms.
- Everything runs fine until I start to simulate latency with netem. 
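
As a sanity check on those numbers, here is the bandwidth-delay product
arithmetic, using the 6 MB window and 160 ms RTT quoted above and the
steady per-connection rate of ~12.75 MB/s from the logs below (I read
6 MB as 6 MiB; the arithmetic itself is the only thing added here):

```python
# Bandwidth-delay product sanity check. All input numbers are taken
# from the figures quoted in this mail; 6 MB is assumed to mean 6 MiB.

window = 6 * 1024 * 1024   # 6 MB TCP window, in bytes
rtt = 0.160                # 160 ms round-trip time, in seconds

# Maximum throughput a single TCP connection can sustain with that window:
max_rate = window / rtt            # bytes/s
print(max_rate / 1e6)              # ~39.3 MB/s

# Window actually needed to sustain the steady per-connection rate of
# ~12.75 MB/s seen in the logs:
steady_rate = 12.75e6              # bytes/s
needed_window = steady_rate * rtt  # bytes
print(needed_window / 1e6)         # ~2.04 MB
```

So a single connection tops out at ~39 MB/s with this window, while the
steady stream only needs ~2 MB of window - on paper there is plenty of
headroom, which is why the slowdown is so puzzling.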

Looking at the log files, I see the expected output for the fast output
stream. The log prints a time stamp, the number of bytes sent since the
last log output, the time between outputs, the number of frames (each 1142
bytes) and the number of calls to out_event, i.e. send.
Time			Bytes		Delta time	Frames	Send syscalls
1467902214174220	12775617	1.00005s	1645	11187
1467902215174480	12724076	1.00026s	1615	11142
1467902216174610	12913520	1.00012s	1657	11308
1467902217174610	12913979	1.00001s	1661	11308
1467902218174670	12619185	1.00005s	1636	11050
1467902219174780	12699323	1.00011s	1608	11120
1467902220174780	12941494	1s	1666	11332

The problematic stream slows down, then catches up; after the buffered
frames have been sent, everything is fine:
1467902253096660	7487488	1.00045s	914	6556
1467902254113250	7929856	1.01659s	968	6944
1467902255137710	7675904	1.02446s	937	6721
1467902256137830	7979008	1.00013s	974	6987
1467902257154080	8888320	1.01624s	1085	7783
1467902258180360	8863744	1.02628s	1082	7762
1467902259181080	9691136	1.00072s	1183	8486
1467902260197550	11091968	1.01648s	1354	9713
1467902261222090	11558912	1.02454s	1411	10122
1467902262222300	12984320	1.00021s	1585	11370
1467902263240030	15409152	1.01773s	1881	13493
1467902264263870	16228352	1.02384s	1981	14210
1467902265264570	18800640	1.0007s		2295	16463
1467902266281750	22364160	1.01719s	2730	19583
1467902267305500	24100864	1.02375s	2942	21104
1467902268306090	29376512	1.00059s	3586	25724
1467902269324240	35127296	1.01815s	4288	30759
1467902270347700	35618816	1.02347s	4348	31190
1467902271347770	39297024	1.00006s	4797	34411
1467902272482910	44589056	1.13515s	5443	39045
1467902273487890	42328064	1.00498s	5167	37065
1467902274510060	35741696	1.02217s	4363	31297
1467902275510560	38371328	1.0005s		4684	33600
1467902276530410	43376640	1.01985s	5295	37983
1467902277594050	35787401	1.06365s	4382	31337
1467902278594850	12757268	1.00079s	1582	11171
1467902279595210	12757337	1.00037s	1577	11171
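
To make the tables easier to read, here is the per-interval throughput
implied by a few representative rows (bytes divided by delta time; the
(bytes, delta) pairs are copied verbatim from the tables above):

```python
# Throughput per logging interval, computed from (bytes, delta_time)
# pairs copied from the tables above.

rows = [
    (12775617, 1.00005),  # steady phase (first table)
    (7487488,  1.00045),  # during the slowdown
    (44589056, 1.13515),  # peak of the catch-up
    (12757268, 1.00079),  # back to normal
]

for nbytes, dt in rows:
    print(f"{nbytes / dt / 1e6:.2f} MB/s")
```

The rate drops to ~7.5 MB/s, then the catch-up peaks around ~39 MB/s -
which happens to match what a 6 MB window can carry over a 160 ms RTT
(6 MiB / 0.16 s ~ 39 MB/s). That may be a hint that the connection is
window-limited during the recovery, though I cannot prove it from the
logs alone.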

What puzzles me is that the number of calls to out_event, i.e. send, goes
down for a considerable time and then catches up. Does ZeroMQ dynamically
schedule attempts to send data? If so, what is the algorithm here? Can it
be influenced?

One thing I forgot to mention: this only happens when I use netem to
simulate the expected latency of 80 ms (RTT = 160 ms). The TCP settings are
tuned, and I consistently achieve the expected performance with iperf etc.
I have the feeling that there is some interaction between the TCP layer and
ZeroMQ that I am missing.
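
In case the kernel send buffer turns out to be the limit once netem adds
the latency, the per-socket knobs on the ZeroMQ side are ZMQ_SNDBUF
(SO_SNDBUF on the underlying TCP socket) and ZMQ_SNDHWM (the ZeroMQ-side
queue limit). A minimal pyzmq sketch - the sizes are illustrative
assumptions, not measured recommendations:

```python
import zmq

# Illustrative only: raise the kernel send buffer and the ZeroMQ
# high-water mark on a PUB socket. The sizes below are assumptions,
# not recommendations for this particular workload.
ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)

pub.setsockopt(zmq.SNDBUF, 8 * 1024 * 1024)  # SO_SNDBUF on the TCP socket
pub.setsockopt(zmq.SNDHWM, 100000)           # queue limit, in messages

# Options are applied when the underlying TCP connection is created,
# so they must be set before bind()/connect().
pub.bind("tcp://127.0.0.1:*")                # wildcard port, for the example

sndbuf, hwm = pub.getsockopt(zmq.SNDBUF), pub.getsockopt(zmq.SNDHWM)
print(sndbuf, hwm)

pub.close()
ctx.term()
```

Note that on Linux the kernel caps SO_SNDBUF at net.core.wmem_max (and
internally doubles the requested value), so the sysctl limits need checking
as well before trusting the number passed to ZMQ_SNDBUF.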

Best wishes, 
  Jens





