[zeromq-dev] Abnormal Latency Issue
Aja at ciscor.com
Thu Feb 2 20:26:46 CET 2012
I have been observing some abnormally large latencies in an application currently under development that uses the 0MQ library for messaging between components distributed within and across multiple processes. I am using 0MQ version 2.1.11 (though I also observed the issue with 2.1.10) with the Java bindings. I see the abnormal latencies on my Windows 7 machine, but not on my Linux machine.
What I have been seeing is that occasionally a message sent from one component to another will take a relatively large amount of time to arrive at the destination component. Most messages have a latency of well under a millisecond, but occasionally a message will take almost exactly 10 milliseconds (within a couple hundred microseconds or so). Here is a link to a minimal-ish test program that demonstrates these abnormal latencies on my Windows 7 machine:
The test program currently starts one thread to send ping messages and one thread to echo them back. The pinging thread records the time at which the ping messages are sent and the time at which the echoes are received. If the elapsed time is greater than 5 milliseconds it is considered to be an abnormal latency. For each echo received, the latency in milliseconds is printed out, along with an overall percentage of all echoes that exceeded the 5 millisecond threshold.
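To make the measurement described above concrete, here is a minimal sketch of the bookkeeping only. It is not the linked test program: the class name LatencyTracker is hypothetical, and a pair of in-memory BlockingQueues stands in for the 0MQ sockets so that the example runs without the Java bindings. Only the 5 millisecond threshold and the per-echo output come from the description above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical helper: counts echoes and how many exceeded the threshold.
final class LatencyTracker {
    private final double thresholdMs;
    private long total;
    private long abnormal;

    LatencyTracker(double thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // Records one round trip from two System.nanoTime() stamps; returns the latency in ms.
    double record(long sentNanos, long receivedNanos) {
        double ms = (receivedNanos - sentNanos) / 1_000_000.0;
        total++;
        if (ms > thresholdMs) {
            abnormal++;
        }
        return ms;
    }

    // Percentage of all recorded echoes that exceeded the threshold.
    double abnormalPercent() {
        return total == 0 ? 0.0 : 100.0 * abnormal / total;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: in-memory queues replace the 0MQ transport in this sketch.
        BlockingQueue<Long> pings = new ArrayBlockingQueue<>(1);
        BlockingQueue<Long> echoes = new ArrayBlockingQueue<>(1);

        // Echoer thread: sends every ping straight back.
        Thread echoer = new Thread(() -> {
            try {
                while (true) {
                    echoes.put(pings.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        echoer.setDaemon(true);
        echoer.start();

        // Pinger (main thread): stamps each ping and times its echo.
        LatencyTracker tracker = new LatencyTracker(5.0);
        for (int i = 0; i < 1000; i++) {
            pings.put(System.nanoTime());
            long sent = echoes.take();
            double ms = tracker.record(sent, System.nanoTime());
            System.out.printf("latency %.3f ms, abnormal so far %.2f%%%n",
                    ms, tracker.abnormalPercent());
        }
    }
}
```

With the queues swapped for real sockets, the same record() call would classify each echo against the 5 millisecond threshold.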
The test program as-is shows the large latencies for roughly 2% of echoes. If more pingers are added, either in the same process or in a separate process, this percentage increases rather quickly. For example, with 5 pinger threads, about half of all echoed messages show the large latency. Using an inproc:// transport decreases the rate of abnormal latencies but does not eliminate it. The issue is still observed when running the echoer on Linux and the pinger on Windows 7, or vice versa. Since it does not occur with both the echoer and the pinger on Linux, this seems to indicate that the latencies can occur in either direction (from pinger to echoer or from echoer to pinger).
In the application currently under development (not the supplied test program), there are many dozens of components communicating with one another, though they do not send messages as rapidly as the pingers in the test program. Still, in that application I see about 80% of all messages sent from one component to another suffer from these large latencies. As a message will often have to make several hops to get to its final destination, these 10-millisecond latencies add up quickly.
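As a rough illustration of how these stalls add up, the sketch below estimates the expected extra delay over several hops. The 80% rate and the 10 millisecond stall are the figures above. The hop count of 4, the class and method names, and the assumption that each hop stalls independently of the others are mine and purely illustrative.

```java
// Hypothetical back-of-the-envelope estimate, assuming independent stalls per hop.
final class HopLatencyEstimate {
    // Expected added delay in ms when each of `hops` hops stalls with probability p for stallMs.
    static double expectedStallMs(int hops, double p, double stallMs) {
        return hops * p * stallMs;
    }

    // Probability that at least one of `hops` hops stalls.
    static double probAnyStall(int hops, double p) {
        return 1.0 - Math.pow(1.0 - p, hops);
    }

    public static void main(String[] args) {
        int hops = 4;          // assumed hop count, not a measured value
        double p = 0.80;       // observed fraction of slow messages
        double stallMs = 10.0; // observed size of each stall
        System.out.printf("expected added delay: %.1f ms%n",
                expectedStallMs(hops, p, stallMs));
        System.out.printf("chance of at least one stall: %.2f%%%n",
                100.0 * probAnyStall(hops, p));
    }
}
```

Under those assumptions a 4-hop message picks up 4 x 0.8 x 10 ms = 32 ms of extra delay on average.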
The thing I find most interesting is that a message will either have a sub-millisecond latency or it will have a 10-millisecond latency, but nothing in between. This looks to me like an artifact of something internal, rather than the degradation in performance of an overloaded thread.
So, does anyone have any thoughts, suggestions, or requests for clarification?