[zeromq-dev] missing messages on 40GbE network

Marko Vendelin markov at sysbio.ioc.ee
Fri Jul 3 13:19:38 CEST 2015


Thank you. I have done some testing, and the results are below. First,
a few words on the configuration: we have two 40GbE cards linked
directly, without any switch. When I am NOT writing to files, I can get
sustained 36Gb/s transfers with ZeroMQ for as long as I tried. The few
dropped frames probably occurred during boot. I have rebooted both
machines and now, after all tests, the ifconfig output shows no errors:

<receiver> ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 192.168.38.1  netmask 255.255.255.252  broadcast 192.168.38.3
        inet6 fe80::225:90ff:fe9c:62c3  prefixlen 64  scopeid 0x20<link>
        ether 00:25:90:9c:62:c3  txqueuelen 1000  (Ethernet)
        RX packets 1263870484  bytes 11244160379023 (10.2 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 220745627  bytes 14803910877 (13.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

<sender> eth3      Link encap:Ethernet  HWaddr 00:25:90:9c:63:1a
          inet addr:192.168.38.2  Bcast:192.168.38.3  Mask:255.255.255.252
          inet6 addr: fe80::225:90ff:fe9c:631a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:258755797 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1403606650 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17238102838 (17.2 GB)  TX bytes:12498689995721 (12.4 TB)
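
The same error counters can also be read programmatically, which makes
it easier to compare them before and after a test run. A minimal sketch
in Python, assuming the usual Linux sysfs layout
/sys/class/net/<iface>/statistics/; the helper names and the selection
of counters are illustrative and not part of the test programs:

```python
from pathlib import Path

# Counters related to the error fields shown by ifconfig.
COUNTERS = ("rx_errors", "rx_dropped", "rx_fifo_errors", "rx_frame_errors",
            "tx_errors", "tx_dropped")


def read_counters(iface, base="/sys/class/net"):
    """Return {counter: value} for one interface (hypothetical helper).

    Counters that a driver does not expose are skipped, not guessed.
    """
    stats_dir = Path(base) / iface / "statistics"
    result = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def counter_delta(before, after):
    """Return only the counters that increased between two snapshots."""
    return {name: after[name] - before[name]
            for name in after
            if name in before and after[name] > before[name]}
```

Taking one snapshot with read_counters("ens1f1") before a run and one
after it, counter_delta(before, after) returns an empty dict when no
error counter has moved.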

The tests were performed after setting the limits as follows (with a
reboot after the limits were set):

<receiver> ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-u: processes                       257167
-n: file descriptors                1024000
-l: locked-in-memory size (kbytes)  64
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 257167
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

<sender> ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         unlimited
-m: resident set size (kbytes)      unlimited
-u: processes                       257240
-n: file descriptors                1024000
-l: locked-in-memory size (kbytes)  62914560
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 257240
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 99
-N 15:                              unlimited

As you can see, I set the open file limit (nofile) to a very large
value. However, the results appear to be the same with a limit of 10240.
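
The limit that a process actually runs with can be checked from inside
the process, since it may differ from what an interactive shell reports.
A minimal sketch using Python's standard resource module (Unix only);
the function names are illustrative, and 10240 in the usage note is
simply the smaller value mentioned above:

```python
import resource


def nofile_limit():
    """Return the (soft, hard) RLIMIT_NOFILE pair of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_nofile_soft_limit(wanted):
    """Raise the soft limit towards `wanted`, capped at the hard limit.

    Never lowers the soft limit. Returns the resulting (soft, hard) pair.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft != resource.RLIM_INFINITY and wanted > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling raise_nofile_soft_limit(10240) at program start and printing
the result shows whether the requested value was really in effect
during a test.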

Tests:

* As mentioned above, without storing the datasets to files, a transfer
rate of 36-37Gb/s is sustained. No messages are lost and all arrive
within the specified 10s timeout (using polling). As a first guess, this
seems to rule out network card problems.

* When writing the datasets to files on the receiver, at some point the
socket stops receiving new messages. I can close the socket, create a
new one, and get new messages after requesting them from the sender
(REQ-REP pattern). While this helps to keep the rate at about 29-32 Gb/s
in the beginning, after 5-15 minutes the transfer rate slowly starts to
decrease and reaches sub-1Gb/s rates in 20-30 minutes. The same occurs
whether I use zero-copy or not.

* I have rewritten the simple programs to use nanomsg. With nanomsg, I
obtain sustained rates of 33.9 Gb/s while writing to files and using its
zero-copy mechanism. No missing frames have been identified, and the
load is distributed rather equally among the slaves (the distribution
would be disturbed if messages went missing).
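
For a sense of scale, the aggregate rates can be converted into
per-slave figures. A small sketch, assuming the 32 disks (one slave per
disk) and the 2MB message size from the original message quoted below,
with 2MB taken as 2*10^6 bytes and the load spread evenly; the function
name is illustrative:

```python
def per_slave_load(rate_gbps, n_slaves=32, msg_bytes=2_000_000):
    """Split an aggregate rate in Gb/s evenly across the slaves.

    Returns (MB/s per slave, messages/s per slave) in decimal units.
    The slave count and message size are assumptions, see the text above.
    """
    total_bytes_per_s = rate_gbps * 1e9 / 8
    bytes_per_slave = total_bytes_per_s / n_slaves
    return bytes_per_slave / 1e6, bytes_per_slave / msg_bytes


for rate in (36.0, 33.9, 1.0):
    mb_s, msg_s = per_slave_load(rate)
    print(f"{rate:5.1f} Gb/s -> {mb_s:6.1f} MB/s, "
          f"{msg_s:5.1f} messages/s per slave")
```

Under these assumptions, 36 Gb/s corresponds to about 140.6 MB/s, or
roughly 70 messages of exactly 2MB per second, for each slave.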

On the basis of these tests, it seems to me that either there is a
hardware bug that gets triggered by ZeroMQ, or there is some
restriction in ZeroMQ that my usage pattern hits. If it is a hardware
bug, nanomsg manages to avoid it somehow. If it is a ZeroMQ
restriction, what could it be?

Best wishes,

Marko

On Fri, Jul 3, 2015 at 7:15 AM, Ben Kloosterman <bklooste at gmail.com> wrote:
> Try changing /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max.
>
> Also try testing with tcpdump and check for drops.
>
> Frame errors are CRC errors. There are not many, but I bet they are the
> big packets you lost. This could be the cable, switch, etc.
>
>
> Here is some older material for 10Gig:
> http://datatag.web.cern.ch/datatag/howto/tcp.html
>
> Regards,
>
> Ben
>
> On Wed, Jul 1, 2015 at 9:45 PM, Marko Vendelin <markov at sysbio.ioc.ee> wrote:
>>
>> Dear ØMQ developers:
>>
>> Synopsis: I am observing a strange interaction between storing a
>> data stream on hard disks and a loss of ZeroMQ messages. It seems that
>> in my use case, when messages are larger than 2MB, some of them are
>> randomly dropped.
>>
>> Full story:
>>
>> I need to pump images acquired by fast scientific cameras into
>> files at rates approaching 25Gb/s. For that, images are acquired
>> on one server and transferred to the hard disk array over a 40Gb/s
>> network. Since Linux-based solutions using iSCSI were not working very
>> well (they may need more optimization) and plain network applications
>> could use the full bandwidth, I decided to use a RAID-0 inspired
>> approach: make a filesystem on each of the 32 hard disks separately,
>> run small slave programs, one per filesystem, and let the slaves ask
>> the dataset server for a dataset in a loop. As the messaging system, I
>> use ZeroMQ with a REQ/REP connection. In general, everything seems to
>> work perfectly: I am able to stream and record data at rates of about
>> 36Gb/s. However, at some point (within 5-10 min), messages sometimes
>> get lost. Intriguingly, this occurs only if I write files and the
>> messages are 2MB or larger. Much smaller messages do not seem to
>> trigger this effect. If I just stream the data and either dump it or
>> just do calculations on it, all messages go through. All messages also
>> go through if I use a 1Gb network.
>>
>> While in the production code I stream data into HDF5 and use zmqpp and
>> polling to receive messages, I have reduced the problematic code to
>> the simplest case using zmq.hpp, regular files, and plain send/recv
>> calls. The code is available at
>>
>> http://www.ioc.ee/~markov/zmq/problem-missing-messages/
>>
>> At the same time, there don't seem to be any excessive drops on the
>> Ethernet cards, as reported by ifconfig in Linux (the slaves run on
>> Gentoo, the server on Ubuntu):
>>
>>
>> ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>>         inet 192.168.38.1  netmask 255.255.255.252  broadcast 192.168.38.3
>>         inet6 fe80::225:90ff:fe9c:62c3  prefixlen 64  scopeid 0x20<link>
>>         ether 00:25:90:9c:62:c3  txqueuelen 1000  (Ethernet)
>>         RX packets 8568340799  bytes 76612663159251 (69.6 TiB)
>>         RX errors 7  dropped 0  overruns 0  frame 7
>>         TX packets 1558294820  bytes 93932603947 (87.4 GiB)
>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>> eth3      Link encap:Ethernet  HWaddr 00:25:90:9c:63:1a
>>           inet addr:192.168.38.2  Bcast:192.168.38.3  Mask:255.255.255.252
>>           inet6 addr: fe80::225:90ff:fe9c:631a/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
>>           RX packets:1558294810 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:8570261350 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:102083292705 (102.0 GB)  TX bytes:76629844394725 (76.6 TB)
>>
>>
>> So, it should not be a simple dropped frames problem.
>>
>> Since the problem occurs only with larger messages, is there any
>> size-limited buffer in ZeroMQ that may cause dropping of the messages?
>> Or any other possible solution?
>>
>> Thank you for your help,
>>
>> Marko
>> _______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev


