[zeromq-dev] missing messages on 40GbE network

Marko Vendelin markov at sysbio.ioc.ee
Wed Jul 8 12:24:59 CEST 2015


I have used the default ones (1000) and also increased them at least 10x.
No difference, as far as I remember off the top of my head.

Note that while I use PAIR sockets to communicate between server and
clients (one PAIR per client), the communication pattern is still
similar to REQ/REP: a client asks for a new dataset and gets it from the
server. If no reply has been received within the specified timeout, the
client asks again. The protocol runs for a while (about 10 minutes), and
once the timeouts start to occur, things gradually go downhill. However,
since I have only 31 clients, the HWM settings should be more than
sufficient. As far as I understand, there should never be more than 31
messages in each of the send and receive queues.
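
The ask-again-on-timeout loop described above can be sketched roughly as
follows. This is only a minimal illustration, not the code used in the
actual setup: it uses Python's standard queue module in place of a PAIR
socket, and request_with_retry, lossy_server, and all parameter values are
hypothetical names chosen for the example.

```python
import queue
import threading


def request_with_retry(to_server, from_server, request, timeout_s, max_tries):
    """Send `request` and wait for a reply, resending after each timeout.

    Returns (reply, attempts_used); reply is None if every attempt timed out.
    """
    for attempt in range(1, max_tries + 1):
        to_server.put(request)  # (re)send the request
        try:
            reply = from_server.get(timeout=timeout_s)
            return reply, attempt
        except queue.Empty:
            continue  # no reply within the timeout: ask again
    return None, max_tries


def lossy_server(to_server, from_server, drop_first):
    """Hypothetical server that ignores the first `drop_first` requests,
    answers the next one, and then exits."""
    for _ in range(drop_first):
        to_server.get()  # swallow a request without replying
    request = to_server.get()
    from_server.put(b"dataset for " + request)


if __name__ == "__main__":
    to_server, from_server = queue.Queue(), queue.Queue()
    threading.Thread(
        target=lossy_server, args=(to_server, from_server, 2), daemon=True
    ).start()
    reply, attempts = request_with_retry(
        to_server, from_server, b"client-7", timeout_s=0.2, max_tries=5
    )
    print(reply, attempts)
```

Note that in such a loop a request is sent again whether its reply was lost
or merely late, so a slow reply can leave more than one message per client
in flight.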

Marko

On Tue, Jul 7, 2015 at 7:51 PM, Peter Krey <peterjkrey at gmail.com> wrote:
> What are your High Water Mark settings (HWM) ?
>
> On Tue, Jul 7, 2015 at 9:35 AM, A. Mark <gougolith at gmail.com> wrote:
>>
>> Hello,
>>
>> Are you doing extensive error checking with ZMQ? If you are flooding the
>> network, some of your ZMQ clients may be timing out on either end, and the
>> sockets may simply be closed before they have a chance to send/recv anything?
>>
>> Mark
>>
>> On Tue, Jul 7, 2015 at 8:36 AM, Thomas Rodgers <rodgert at twrodgers.com>
>> wrote:
>>>
>>> Is the filesystem ext4? We have seen issues with high rates of smallish
>>> writes to ext4 (it seems related to failing to acquire a lock in
>>> http://lxr.free-electrons.com/source/fs/ext4/extents.c?v=2.6.32#L3228).
>>>
>>> Using XFS seems to improve the situation for us.
>>>
>>> On Tue, Jul 7, 2015 at 2:16 AM, Marko Vendelin <markov at sysbio.ioc.ee>
>>> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> thank you for the pointers. It now seems that there is some problem
>>>> with the disk I/O, as first suspected. Namely, when the system starts to
>>>> 'crawl', I can fire up new clients that don't write anything, and these
>>>> clients do absolutely fine (recording high rates). New clients
>>>> with disk I/O crawl immediately.
>>>>
>>>> I'll look into it and will try to isolate the issue further.
>>>>
>>>> REP-REQ: No, I was using multiple requests in PAIR sockets, as you
>>>> advised earlier.
>>>>
>>>> NORM: When things work, TCP is fine. As far as I know a lot is
>>>> processed on the cards internally and I can get to the rates that are
>>>> as large as needed.
>>>>
>>>> I'll let the list know if the problem is in disk I/O and what was the
>>>> cause of it.
>>>>
>>>> Best wishes,
>>>>
>>>> Marko
>>>>
>>>>
>>>> On Mon, Jul 6, 2015 at 11:30 PM, Peter Krey <krey at ripple.com> wrote:
>>>> > You may want to try switching to a UDP-based protocol like NORM on zmq.
>>>> > This will let you achieve higher throughput, as there will be no TCP
>>>> > packet handshakes.
>>>> >
>>>> > You can also try installing multiple NICs in your computer and bind
>>>> > them together into one device for higher throughput if you think the
>>>> > cards' device buffers are being overrun.
>>>> >
>>>> > On Mon, Jul 6, 2015 at 1:25 PM, Peter Krey <krey at ripple.com> wrote:
>>>> >>
>>>> >> You are not using REQ-REP properly; a REQ-REP socket will not accept two
>>>> >> REQ messages in a row; it needs a REP before it will proceed, otherwise
>>>> >> it will block.
>>>> >>
>>>> >> I highly advise using the PAIR type for all sockets in your application
>>>> >> and no REQ-REP sockets at all, especially given the throughput required
>>>> >> in your application.
>>>> >>
>>>> >> On Sun, Jul 5, 2015 at 9:58 AM, Marko Vendelin <markov at sysbio.ioc.ee>
>>>> >> wrote:
>>>> >>>
>>>> >>> I did reprogram using PAIR sockets, one per client. They were still
>>>> >>> using the request-reply pattern, and when a request was not replied to,
>>>> >>> the client repeated the request. Unfortunately, similar behaviour was
>>>> >>> observed: the initially fast rate dropped and never recovered.
>>>> >>>
>>>> >>> I'm wondering whether it is possible to get error codes out of zeromq to
>>>> >>> see where the problem is.
>>>> >>>
>>>> >>> Best wishes
>>>> >>>
>>>> >>> Marko
>>>> >>>
>>>> >>> On Jul 4, 2015 12:04 AM, "Marko Vendelin" <marko.vendelin at gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Is there any way I could check for automatic drop of messages by
>>>> >>>> zeromq? I could recompile the library with some debug settings if
>>>> >>>> needed, but this information would be very valuable.
>>>> >>>>
>>>> >>>> In that case I would expect to see the same with nanomsg as well, and
>>>> >>>> at the beginning of the test with ZeroMQ. Our disk I/O should be faster
>>>> >>>> than the network. Since the drop-off happens at ~10 minutes when using
>>>> >>>> zeromq, RAM would not be able to cache the data either (by that time I
>>>> >>>> have already transferred ~2TB on machines with 64GB of RAM).
>>>> >>>>
>>>> >>>> Use of REQ/REP allows me to spread the load among all disks
>>>> >>>> automatically. Since there is one disk writer per HDD, and each writes
>>>> >>>> a dataset to disk after receiving it, the load per disk is proportional
>>>> >>>> to its speed. The rates I am getting in the beginning with ZMQ (first
>>>> >>>> ~10 min, ~30-36Gb/s) are above our requirements and would fit the
>>>> >>>> application perfectly, if only I could sustain them for as long as the
>>>> >>>> disk space allows.
>>>> >>>>
>>>> >>>> Re PAIR: I was thinking about giving PAIR a try. I would need to
>>>> >>>> reprogram a bit, but it's possible.
>>>> >>>>
>>>> >>>> Best wishes,
>>>> >>>>
>>>> >>>> Marko
>>>> >>>>
>>>> >>>>
>>>> >>>> On Fri, Jul 3, 2015 at 10:52 PM, Peter Krey <peterjkrey at gmail.com>
>>>> >>>> wrote:
>>>> >>>> > You may be sending messages faster than you can receive them and write
>>>> >>>> > them to disk, overflowing the zeromq message send buffer and causing
>>>> >>>> > zeromq to automatically discard some messages. This is expected
>>>> >>>> > behavior.
>>>> >>>> >
>>>> >>>> > Also, do not use the request-reply socket type; use PAIR. This will not
>>>> >>>> > require your app to recv and reply before sending the next image; your
>>>> >>>> > app can send async.
>>>> >>>> >
>>>> >>>> > On Wednesday, July 1, 2015, Marko Vendelin <markov at sysbio.ioc.ee>
>>>> >>>> > wrote:
>>>> >>>> >>
>>>> >>>> >> Dear ØMQ developers:
>>>> >>>> >>
>>>> >>>> >> Synopsis: I am observing a strange interaction between storing a
>>>> >>>> >> data stream on hard disks and a loss of ZeroMQ messages. It seems that
>>>> >>>> >> in my use case, when messages are larger than 2MB, some of them are
>>>> >>>> >> randomly dropped.
>>>> >>>> >>
>>>> >>>> >> Full story:
>>>> >>>> >>
>>>> >>>> >> I need to pump images acquired by fast scientific cameras into files
>>>> >>>> >> at rates approaching 25Gb/s. For that, images are acquired on one
>>>> >>>> >> server and transferred to the hard disk array over a 40Gb/s network.
>>>> >>>> >> Since Linux-based solutions using iSCSI were not working very well
>>>> >>>> >> (maybe they need more optimization) and plain network applications
>>>> >>>> >> could use the full bandwidth, I decided to use a RAID-0-inspired
>>>> >>>> >> approach: make a filesystem on each of the 32 hard disks separately,
>>>> >>>> >> run small slave programs, one per filesystem, and let the slaves ask
>>>> >>>> >> the dataset server for a dataset in a loop. As a messaging system, I
>>>> >>>> >> use ZeroMQ and a REQ/REP connection. In general, everything seems to
>>>> >>>> >> work perfectly: I am able to stream and record data at about 36Gb/s.
>>>> >>>> >> However, at some point (within 5-10 min), messages sometimes get lost.
>>>> >>>> >> Intriguingly, this occurs only if I write files and messages are 2MB
>>>> >>>> >> or larger. Much smaller messages do not seem to trigger this effect.
>>>> >>>> >> If I just stream data and either dump it or just calculate on the
>>>> >>>> >> basis of it, all messages go through. All messages also go through if
>>>> >>>> >> I use a 1Gb network.
>>>> >>>> >>
>>>> >>>> >> While in production code I stream data into HDF5 and use zmqpp and
>>>> >>>> >> polling to receive messages, I have reduced the problematic code to
>>>> >>>> >> the simplest case using zmq.hpp, regular files, and plain send/recv
>>>> >>>> >> calls. Code is available at
>>>> >>>> >>
>>>> >>>> >> http://www.ioc.ee/~markov/zmq/problem-missing-messages/
>>>> >>>> >>
>>>> >>>> >> At the same time, there don't seem to be any excessive drops in
>>>> >>>> >> ethernet cards, as reported by ifconfig in Linux (slaves run on
>>>> >>>> >> Gentoo, server on Ubuntu):
>>>> >>>> >>
>>>> >>>> >>
>>>> >>>> >> ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>>>> >>>> >>         inet 192.168.38.1  netmask 255.255.255.252  broadcast 192.168.38.3
>>>> >>>> >>         inet6 fe80::225:90ff:fe9c:62c3  prefixlen 64  scopeid 0x20<link>
>>>> >>>> >>         ether 00:25:90:9c:62:c3  txqueuelen 1000  (Ethernet)
>>>> >>>> >>         RX packets 8568340799  bytes 76612663159251 (69.6 TiB)
>>>> >>>> >>         RX errors 7  dropped 0  overruns 0  frame 7
>>>> >>>> >>         TX packets 1558294820  bytes 93932603947 (87.4 GiB)
>>>> >>>> >>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>>> >>>> >>
>>>> >>>> >> eth3      Link encap:Ethernet  HWaddr 00:25:90:9c:63:1a
>>>> >>>> >>           inet addr:192.168.38.2  Bcast:192.168.38.3  Mask:255.255.255.252
>>>> >>>> >>           inet6 addr: fe80::225:90ff:fe9c:631a/64 Scope:Link
>>>> >>>> >>           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
>>>> >>>> >>           RX packets:1558294810 errors:0 dropped:0 overruns:0 frame:0
>>>> >>>> >>           TX packets:8570261350 errors:0 dropped:0 overruns:0 carrier:0
>>>> >>>> >>           collisions:0 txqueuelen:1000
>>>> >>>> >>           RX bytes:102083292705 (102.0 GB)  TX bytes:76629844394725 (76.6 TB)
>>>> >>>> >>
>>>> >>>> >>
>>>> >>>> >> So, it should not be a simple dropped frames problem.
>>>> >>>> >>
>>>> >>>> >> Since the problem occurs only with larger messages, is there any
>>>> >>>> >> size-limited buffer in ZeroMQ that may cause dropping of the messages?
>>>> >>>> >> Or any other possible solution?
>>>> >>>> >>
>>>> >>>> >> Thank you for your help,
>>>> >>>> >>
>>>> >>>> >> Marko
>>>> >>>> >> _______________________________________________
>>>> >>>> >> zeromq-dev mailing list
>>>> >>>> >> zeromq-dev at lists.zeromq.org
>>>> >>>> >> http://lists.zeromq.org/mailman/listinfo/zeromq-dev


