[zeromq-dev] QoS in PUB/SUB networks: transmission latency grows as communication becomes more asynchronous.

Brian Rossa rossabri at hotmail.com
Sun Aug 7 22:41:49 CEST 2011


Hi all,
I'm continuing to see some less-than-desirable latency behavior in PUB/SUB networks despite HWM=1. I first posted about this back in May. See http://www.mail-archive.com/zeromq-dev@lists.zeromq.org/msg08943.html. I've picked this issue up again and have done quite a bit more testing, now with libzmq-3.0. The behavior is still present, and I have some slightly stronger results.
= Methods =
I construct a PUB/SUB network with 1 message source, N processing nodes, and 1 sink. The source multicasts to the N processing nodes, which run in parallel. The nodes are "dummies": they just sleep() for some amount of time and then send the message on to the sink. The sink prints out some message metrics: arrival time, transmission latency (total time in ZMQ, excluding the sleep() time at the intermediate node), message id, and route.
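For readers who don't want to dig through the pastebin, here is roughly how the wiring looks in pyzmq. This is a minimal sketch, not my actual test code: the endpoints and the use of PUB/SUB for the node-to-sink hop are illustrative, and note that under libzmq-3.0 the old single HWM option is split into SNDHWM and RCVHWM. Everything is shown in one process for brevity:

    import zmq

    ctx = zmq.Context()

    # --- source: multicasts to all N processing nodes ---
    src = ctx.socket(zmq.PUB)
    src.setsockopt(zmq.SNDHWM, 1)          # HWM=1 on the send side
    src.bind("tcp://*:5555")

    # --- processing node (one of N) ---
    node_in = ctx.socket(zmq.SUB)
    node_in.setsockopt(zmq.RCVHWM, 1)      # HWM=1 on the receive side
    node_in.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to everything
    node_in.connect("tcp://localhost:5555")

    node_out = ctx.socket(zmq.PUB)
    node_out.setsockopt(zmq.SNDHWM, 1)
    node_out.connect("tcp://localhost:5556")  # the sink binds here

    # --- sink: collects from all N nodes ---
    sink = ctx.socket(zmq.SUB)
    sink.setsockopt(zmq.RCVHWM, 1)
    sink.setsockopt(zmq.SUBSCRIBE, b"")
    sink.bind("tcp://*:5556")

(Bind/connect direction is independent of socket type in ZMQ, so having the N PUB node outputs connect to a single bound SUB at the sink is fine.)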
Throughout, I try to simulate the video processing system that I'm working on. The source sends at 30 msgs/sec, the nodes have variable processing times, and the sink works as fast as it can. The "variable processing time" is achieved by sleep()ing for some time X, where X is normally distributed with mean u and variance q. Each processing node gets its own u and q. (This will be relevant later.)
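Concretely, each node draws its per-message delay like this (a sketch; note that since q is a variance, the standard deviation handed to the RNG is sqrt(q), and negative samples get clamped to zero):

    import math
    import random
    import time

    def process(u, q):
        # Simulate variable processing time: X ~ Normal(mean=u, variance=q).
        x = max(random.gauss(u, math.sqrt(q)), 0.0)  # clamp: can't sleep a negative time
        time.sleep(x)
        return x  # report the actual sleep so it can be excluded from latency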
You can find my simulation code here: http://pastebin.com/dSBtD1u7. It's about 100 lines and I would really appreciate it if somebody would sanity-check it for me. There is also a utility for graphing the data: http://pastebin.com/EBwdxu6M. A simple 30-second run looks something like "python lat_test.py 30 > foo && python plot_lat.py foo"
= Results =
The following hold for both the TCP and IPC transports:

1) In my tests, ZMQ never dropped any messages despite HWM=1. This was the case even when the processing nodes were made to run much slower than 30 msgs/sec.

2) Total transmission latency (time in ZMQ) is much greater with asynchronous networks than with synchronous ones.
Both of these results are startling, but the second deserves some explanation. Remember I said that I was simulating processing time by sampling random sleep times from a normal distribution? Well, nodes with greater *variance* (the "q" parameter) in their sleep times were much more likely to suffer from high transmission latency. (Again, the "transmission latency" does not include the sleep time itself, just the time on the wire.) This is a crazy result! One expects transmission latency to vary with things like message size and number of connections, but not with processing jitter.
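To be precise about what "transmission latency" means here: the message carries its own send timestamp plus an accumulator for the time the nodes spent sleeping, and the sink subtracts both from the arrival time. A sketch of the bookkeeping (the JSON payload format is illustrative, not necessarily what lat_test.py does):

    import json
    import time

    def stamp(msg_id):
        # Source side: record the send time in the message itself.
        return json.dumps({"id": msg_id, "sent": time.time(), "slept": 0.0})

    def add_sleep(payload, slept):
        # Node side: accumulate actual sleep time so the sink can exclude it.
        msg = json.loads(payload)
        msg["slept"] += slept
        return json.dumps(msg)

    def transmission_latency(payload):
        # Sink side: total wall-clock time minus time spent sleeping in nodes.
        msg = json.loads(payload)
        return time.time() - msg["sent"] - msg["slept"]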
In the case of my real-time video processing system, this relationship between processing jitter and transmission latency wouldn't be a deal-breaker if it weren't for the *magnitude* of the resulting latency. We're consistently seeing upwards of 5 seconds of transmission latency on network paths that go via certain "jittery" system components, and this is easy to reproduce in my simulation.
= Thanks! =
I hope this little testing effort that I've undertaken proves useful to the project. I encourage everybody who is interested in QoS under ZMQ to run my code and check the results for themselves. I am also eager to hear from people who are using ZMQ under hard real-time constraints. Feedback and technical direction are appreciated. If this is a bug, I'm keen to squash it! And I am certainly hoping to continue with ZMQ on my real-time video project.
Cheers!
~Brian