[zeromq-dev] Very Small Messages/Manual

Ben Kloosterman bklooste at gmail.com
Mon Jul 26 15:02:27 CEST 2010

>>> Yes, there should be a guide to compile-time performance tuning of
>>> The question is what percentage of users has enough theoretical
>>> background, experience, available HW resources and funding to do
>>> relevant benchmarking and tuning.
>> I dislike any form of "compile time" tuning ... it is one of those
>> things that has turned building Unix apps into a dogs breakfast with
>> painful make / configuration systems; it used to be so simple.
> Agreed, but there's no way to set the MAX_VSM_SIZE at runtime. Compiler
> has to be aware of it.

True, but I bet changing it is not tested either.

This is more of an architecture decision: instead of a fixed size we could
just assume the message is flexible, get the pipe to allocate the memory,
and place the struct at the header. Instead of incrementing it by a
constant message size, you bump the counter up by the size in the message.
My focus is really super-high in-proc speed, and I realize 0mq's focus is
more mixed (which gives my system more options and is why I'm trying to
bang 0mq into my requirements).
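A minimal sketch of that idea in C (the names `msg_hdr_t` and `pipe_write`
are illustrative, not 0mq's real internals): rather than a fixed
MAX_VSM_SIZE slot per message, each message carries its size in a small
header and the pipe's write cursor advances by the actual size.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef struct {
    uint32_t size;        /* payload size of this message           */
    unsigned char data[]; /* flexible array member: payload follows */
} msg_hdr_t;

/* Append one message to a flat byte buffer; returns the new write
 * offset.  memcpy is used so unaligned offsets are safe. */
static size_t pipe_write(unsigned char *buf, size_t off,
                         const void *payload, uint32_t size)
{
    memcpy(buf + off, &size, sizeof size);
    memcpy(buf + off + sizeof size, payload, size);
    /* bump the cursor by the real size, not a fixed constant */
    return off + sizeof size + size;
}
```

The point is that small and large messages then share one layout, at the
cost of a 4-byte header and slightly more bookkeeping per write.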

>>> Existing compile time constants are carefully chosen to perform best
>>> on common modern hardware. Playing with them is likely to cause more
>>> harm than good.
>> Agree, considering how fast it is in most cases it would be premature
>> optimization. Once you have your app finished, tune by all means.
>> Doing some testing the last few days, and quite a few surprises.
> My point was that doing perf tests and optimisation without
> understanding how it should be done can be pretty much misleading.
> For example, when measuring throughput, using the [msgs/sec] unit is
> extremely misleading. It blows small fluctuations or chip/memory
> implementation details completely out of proportion.
> 10M msgs/sec is twice as good as 5M msgs/sec, right?
> When you think of it, 10M msgs/sec means each message is processed in
> 100 nanoseconds. 5M msgs/sec means each message is processed in 200
> nanoseconds.
> The real improvement is 100 nanoseconds per message. Which can result
> from a couple of additional CPU instructions, a small measurement
> imprecision, or maybe even from minor manufacturing flaws of your CPU /
> memory.
> In short, when doing throughput tests, use the "time to process one
> message" metric rather than "messages per second". It'll give you a more
> sane picture of what the actual performance impact is.

Noted; it does give a better figure across architectures. That said, I'm
not a big fan of messages per second or time to process one message, and
prefer throughput over a time sufficient to saturate the cache. Also, 100
nanoseconds per message is a LOT of cycles: a 3GHz CPU runs 3 * 10^9
cycles per second, so 100 ns = 300 cycles, which at 4 superscalar
instructions per cycle (Core 2) would mean a potential 1200 instructions,
and even 10 ns potentially 120 instructions. That's not to say this will
be achieved, but it shows that stalls, especially memory stalls, have the
biggest impact (as can clearly be seen with larger messages), and hence
optimizing for memory prefetch and maximum cache use (and, more
importantly, no dirtying of cache lines for temporal data) is preferred
and will give the best results regardless of memory speed. None of which
is easy ....
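A back-of-envelope check of that arithmetic, under the same assumptions
as above (3 GHz clock, 4-wide superscalar issue as on Core 2; the helper
names are mine):

```c
/* cycles available per message at a given clock rate;
 * 1 GHz = 1 cycle per nanosecond */
static double cycles_per_msg(double ns, double ghz)
{
    return ns * ghz;
}

/* peak instructions retirable in that window on a superscalar core
 * of the given issue width (an upper bound; stalls eat into it) */
static double peak_instructions(double ns, double ghz, double width)
{
    return cycles_per_msg(ns, ghz) * width;
}
```

At 3 GHz, 100 ns is 300 cycles and up to 1200 instructions; even 10 ns
leaves room for 120, which is why memory stalls rather than instruction
count dominate.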


