I’ve been testing many combinations of ZeroMQ on Java, comparing the pure-Java jeromq with jzmq, the JNI binding to the libzmq C code. My impression so far is that jeromq is much faster than the binding. That is not a criticism of the binding code; my feeling is simply that the JNI jump slows everything down. At a certain point, though, I needed a simple zmq_proxy network node, and I was pretty sure the C code must be faster than jeromq. I have some ideas for improving the jeromq proxy code, but it felt easier to just compile the zmq_proxy code from the book.

Unfortunately something went completely wrong on my side, so I need your help to understand what is happening here.

Mac OS X Mavericks fully updated, MacBook Pro i7 4x2 CPU 2.2 GHz, 16 GB
libzmq from git head
(both jeromq and libzmq are from git head, although I’m using my own forks so I can send pull requests back)
My data are JSON lines that range from about 100 bytes to some multi-MB exceptions, but the average over those million messages is about 500 bytes.

Test 1: pure local_thr and remote_thr:

iDavi:perf bruno$ ./local_thr tcp:// 500 1000000 &
iDavi:perf bruno$ time ./remote_thr tcp:// 500 1000000 &
real	0m0.732s
user	0m0.516s
sys	0m0.394s
message size: 500 [B]
message count: 1000000
mean throughput: 1418029 [msg/s]
mean throughput: 5672.116 [Mb/s]

Test 2: change local_thr to perform connect instead of bind, and put a proxy in the middle.
The proxy is the first C code example from the book, available here https://gist.github.com/davipt/7361477
iDavi:c bruno$ gcc -o proxy proxy.c -I /usr/local/include/ -L /usr/local/lib/ -lzmq
iDavi:c bruno$ ./proxy tcp://*:8881 tcp://*:8882 1
Proxy type=PULL/PUSH in=tcp://*:8881 out=tcp://*:8882

iDavi:perf bruno$ ./local_thr tcp:// 500 1000000 &
iDavi:perf bruno$ time ./remote_thr tcp:// 500 1000000 &
iDavi:perf bruno$ message size: 500 [B]
message count: 1000000
mean throughput: 74764 [msg/s]
mean throughput: 299.056 [Mb/s]

real	0m10.358s
user	0m0.668s
sys	0m0.508s

Test 3: use the jeromq equivalent of the proxy: https://gist.github.com/davipt/7361623

iDavi:perf bruno$ ./local_thr tcp:// 500 1000000 &
[1] 15816
iDavi:perf bruno$ time ./remote_thr tcp:// 500 1000000 &
[2] 15830
iDavi:perf bruno$ 
real	0m3.429s
user	0m0.654s
sys	0m0.509s
message size: 500 [B]
message count: 1000000
mean throughput: 293532 [msg/s]
mean throughput: 1174.128 [Mb/s]

The performance coming out of Java is okay-ish; it’s here just for comparison, and I’ll spend some time looking at it.

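The core question is the C proxy: why is it so much slower than the no-proxy version? By throughput it is almost 19 times slower (74764 vs 1418029 msg/s), and about 14 times slower by wall-clock time (10.358s vs 0.732s).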
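
For reference, the figures reported in the three tests can be cross-checked with a little arithmetic. The sketch below is not part of the perf tools: the helper names `throughput_mbps` and `slowdown` are made up for illustration, and it assumes the Mb/s figure is the message rate times the message size times 8, divided by 10^6, which reproduces the numbers printed above.

```c
#include <stdio.h>

/* Hypothetical helper: megabits per second for a given message rate and
 * message size, assuming 1 Mb = 10^6 bits (this matches the Mb/s values
 * printed in the test output above). */
double throughput_mbps(double msgs_per_sec, double msg_size_bytes)
{
    return msgs_per_sec * msg_size_bytes * 8.0 / 1e6;
}

/* Hypothetical helper: how many times slower `rate` is than `baseline`. */
double slowdown(double baseline, double rate)
{
    return baseline / rate;
}

/* Print one result line relative to the no-proxy baseline (500-byte messages). */
void report(const char *label, double rate, double baseline)
{
    printf("%-14s %9.0f msg/s %9.3f Mb/s  %5.1fx slower\n",
           label, rate, throughput_mbps(rate, 500.0), slowdown(baseline, rate));
}
```

For example, `throughput_mbps(74764, 500)` gives 299.056, and `slowdown(1418029, 74764)` is just under 19, so by throughput the C proxy run is closer to 19 times slower than the direct run, while the jeromq proxy run is a bit under 5 times slower.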

One thing I noticed, by coincidence, is that on the upstream side of the proxy, with both the C “producer” and the Java one, tcpdump consistently shows packets of 16332 bytes (or the MTU size if using Ethernet, 1438 I think). This value is the same for all 4 combinations of producers and proxies (jeromq vs C).

But on the other side of the proxy, the result is completely different. With the jeromq proxy I see packets of 8192 bytes, but with the C proxy I see packets of either 509 or 1010 bytes. It feels like the proxy is sending the messages one by one. Again, this is the same whichever PULL consumer sits after the proxy, C or Java.
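
The “one by one” impression can be made a bit more concrete by estimating how many messages fit into each observed packet. This is only a rough sketch: the 9 bytes of per-message framing (one flags byte plus an 8-byte length, as used for frames longer than 255 bytes) is an assumption about the wire format in use, and `messages_per_packet` is a hypothetical helper, not a libzmq function.

```c
/* Assumed framing overhead per message, in bytes: 1 flags byte plus an
 * 8-byte length field. This is an assumption, not something measured. */
#define ASSUMED_FRAME_OVERHEAD 9

/* Hypothetical helper: approximate number of messages of msg_size bytes
 * carried by one packet with packet_size bytes of payload. */
double messages_per_packet(int packet_size, int msg_size)
{
    return (double)packet_size / (double)(msg_size + ASSUMED_FRAME_OVERHEAD);
}
```

With 500-byte messages this gives roughly 32 messages per 16332-byte packet and roughly 16 per 8192-byte packet, but only 1 for a 509-byte packet and about 2 for a 1010-byte packet. Under that assumption, the C proxy would be writing one or two messages per packet, while the other paths batch dozens.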

So this seems to be something on the “backend” socket side of zmq_proxy.

Also, I see quite similar behavior with a PUB - [XSUB+Proxy+XPUB] - SUB version.

What do I need to tweak in proxy.c?

Thanks in advance

