[zeromq-dev] missing messages on 40GbE network
Stephen Lord
Steve.Lord at quantum.com
Wed Jul 8 15:51:27 CEST 2015
If you are just writing to files with regular file I/O, then you are pushing data into memory and flushing happens in the background; you may be stalling when that background flush kicks in. You may want to use O_DIRECT on open, but that turns each write into a disk operation, so you need to make the writes large. If you know file sizes in advance, you could use a preallocation call to reserve space up front; this avoids checkerboarding between files, but it all depends on how you want to read the data later.
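As a rough sketch of that approach (Linux-specific; the path, chunk size, and file size below are made-up placeholders):

    // Sketch only: preallocate the file, then write large aligned chunks
    // with O_DIRECT so each write() goes straight to disk.
    // Compile with g++ (O_DIRECT needs _GNU_SOURCE, which g++ defines).
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const size_t chunk = 4 * 1024 * 1024;            // placeholder write size
        const int    nchunks = 1024;                     // placeholder file size

        int fd = open("/data/disk01/run0001.dat",        // placeholder path
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) return 1;

        // Reserve the full file up front to keep extents contiguous.
        posix_fallocate(fd, 0, (off_t)chunk * nchunks);

        // O_DIRECT requires a block-aligned buffer and write size.
        void *buf = nullptr;
        if (posix_memalign(&buf, 4096, chunk) != 0) return 1;
        memset(buf, 0, chunk);

        for (int i = 0; i < nchunks; ++i)
            if (write(fd, buf, chunk) != (ssize_t)chunk) break;

        close(fd);
        free(buf);
        return 0;
    }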
Try using iostat -x to monitor the I/O and see if you are saturating the storage.
You might want to simulate this part of your system without getting data over the network and make sure your I/O path can sustain constant throughput at the rate you need. Once you can sustain the write load, hook it back up to the network code again.
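For example, a minimal standalone harness along these lines (buffered I/O; path and sizes are placeholders) would already tell you a lot:

    // Sketch: measure sustained write throughput with no network involved.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t chunk = 4 * 1024 * 1024;                       // placeholder chunk size
        std::vector<char> buf(chunk, 0);

        std::FILE *f = std::fopen("/data/disk01/test.dat", "wb");   // placeholder path
        if (!f) return 1;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 2048; ++i) {                             // ~8 GB total
            std::fwrite(buf.data(), 1, chunk, f);
            if (i % 256 == 255) {                                    // report every ~1 GB
                auto dt = std::chrono::duration<double>(
                              std::chrono::steady_clock::now() - t0).count();
                std::printf("%.1f s: %.2f Gb/s\n",
                            dt, (double)(i + 1) * chunk * 8 / dt / 1e9);
            }
        }
        std::fclose(f);
        return 0;
    }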
Steve
Sent from my iPhone
> On Jul 8, 2015, at 4:25 AM, Marko Vendelin <markov at sysbio.ioc.ee> wrote:
>
> I have used the defaults (1000) and also increased them at least 10x.
> No difference, as far as I remember off the top of my head.
>
> Note that while I use PAIR sockets to communicate between the server and
> clients (one PAIR per client), the communication pattern is still similar
> to REQ/REP: a client asks for a new dataset and gets it from the server.
> If no reply has been received within the specified timeout, the client
> asks again. The protocol runs for a while (10 minutes), and when the
> timeouts start occurring, things gradually degrade. However, since I have
> only 31 clients, the HWM settings should be more than sufficient. As far
> as I understand, I should never have more than 31 messages in the send
> and receive queues (in each queue).
>
> Marko
>
>> On Tue, Jul 7, 2015 at 7:51 PM, Peter Krey <peterjkrey at gmail.com> wrote:
>> What are your high water mark (HWM) settings?
>>
>>> On Tue, Jul 7, 2015 at 9:35 AM, A. Mark <gougolith at gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> Are you doing extensive error checking with ZMQ? If you are flooding the
>>> network, some of your ZMQ clients may be timing out on either end, and the
>>> sockets may simply be closed before they have a chance to send/recv anything.
>>>
>>> Mark
>>>
>>> On Tue, Jul 7, 2015 at 8:36 AM, Thomas Rodgers <rodgert at twrodgers.com>
>>> wrote:
>>>>
>>>> Is the filesystem ext4? We have seen issues with high rates of smallish
>>>> writes to ext4 (it seems related to failing to acquire a lock in
>>>> http://lxr.free-electrons.com/source/fs/ext4/extents.c?v=2.6.32#L3228).
>>>>
>>>> Using XFS seems to improve the situation for us.
>>>>
>>>> On Tue, Jul 7, 2015 at 2:16 AM, Marko Vendelin <markov at sysbio.ioc.ee>
>>>> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> thank you for the pointers. It seems now that there is some problem
>>>>> with the disk I/O, as first suspected. Namely, when the system starts
>>>>> to 'crawl', I can fire up new clients that don't write anything, and
>>>>> these clients do absolutely fine (recording at high rates). New clients
>>>>> with disk I/O crawl immediately.
>>>>>
>>>>> I'll look into it and try to isolate the issue further.
>>>>>
>>>>> REQ-REP: No, I was using PAIR sockets with multiple requests, as you
>>>>> advised earlier.
>>>>>
>>>>> NORM: When things work, TCP is fine. As far as I know, a lot is
>>>>> processed internally on the cards, and I can reach rates as high as
>>>>> needed.
>>>>>
>>>>> I'll let the list know if the problem is in disk I/O and what the
>>>>> cause of it was.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Marko
>>>>>
>>>>>
>>>>>> On Mon, Jul 6, 2015 at 11:30 PM, Peter Krey <krey at ripple.com> wrote:
>>>>>> You may want to try switching to a UDP-based protocol like NORM on
>>>>>> zmq. This will let you achieve higher throughput, as there will be no
>>>>>> TCP packet handshakes.
>>>>>>
>>>>>> You can also try installing multiple NICs in your computer and bonding
>>>>>> them together into one device for higher throughput, if you think the
>>>>>> cards' device buffers are being overrun.
>>>>>>
>>>>>>> On Mon, Jul 6, 2015 at 1:25 PM, Peter Krey <krey at ripple.com> wrote:
>>>>>>>
>>>>>>> You are not using REQ-REP properly; a REQ-REP socket will not accept
>>>>>>> two REQ messages in a row. It needs a REP before it will proceed;
>>>>>>> otherwise it will block.
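>>>>>>>
>>>>>>> For illustration, a small sketch (placeholder endpoint) of what a
>>>>>>> second send on a REQ socket runs into before a reply has arrived:
>>>>>>>
>>>>>>>     // REQ enforces strict send/recv alternation.
>>>>>>>     #include <zmq.hpp>
>>>>>>>     #include <cstdio>
>>>>>>>     #include <cstring>
>>>>>>>
>>>>>>>     int main() {
>>>>>>>         zmq::context_t ctx(1);
>>>>>>>         zmq::socket_t req(ctx, ZMQ_REQ);
>>>>>>>         req.connect("tcp://127.0.0.1:5555");       // placeholder endpoint
>>>>>>>
>>>>>>>         zmq::message_t m1(5), m2(5);
>>>>>>>         memcpy(m1.data(), "hello", 5);
>>>>>>>         memcpy(m2.data(), "again", 5);
>>>>>>>
>>>>>>>         req.send(m1);                  // first request is queued
>>>>>>>         try {
>>>>>>>             req.send(m2);              // second send before any reply:
>>>>>>>         } catch (const zmq::error_t &e) {          // rejected (EFSM)
>>>>>>>             std::printf("send failed: %s\n", e.what());
>>>>>>>         }
>>>>>>>     }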
>>>>>>>
>>>>>>> I highly advise using the PAIR type for all sockets in your
>>>>>>> application and no REQ-REP sockets at all, especially given the
>>>>>>> throughput required in your application.
>>>>>>>
>>>>>>> On Sun, Jul 5, 2015 at 9:58 AM, Marko Vendelin <markov at sysbio.ioc.ee>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I did reprogram it using PAIR sockets, one per client. They were
>>>>>>>> still using the request-reply pattern, and when a request was not
>>>>>>>> replied to, the client repeated the request. Unfortunately, similar
>>>>>>>> behaviour was observed: the initial fast rate dropped and never
>>>>>>>> recovered.
>>>>>>>>
>>>>>>>> I'm wondering whether it is possible to get error codes out of
>>>>>>>> zeromq to see where the problem is?
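>>>>>>>>
>>>>>>>> One thing I could try is zmq_socket_monitor(), which reports socket
>>>>>>>> events (connects, disconnects, failures, ...) on an inproc PAIR
>>>>>>>> socket. A rough sketch (the endpoints here are placeholders):
>>>>>>>>
>>>>>>>>     // Sketch: attach a monitor to the data socket and print its events.
>>>>>>>>     // Event frame layout matches libzmq 4.x: a 16-bit event id plus a
>>>>>>>>     // 32-bit value, followed by a frame holding the endpoint string.
>>>>>>>>     #include <zmq.hpp>
>>>>>>>>     #include <cstdint>
>>>>>>>>     #include <cstdio>
>>>>>>>>     #include <cstring>
>>>>>>>>
>>>>>>>>     int main() {
>>>>>>>>         zmq::context_t ctx(1);
>>>>>>>>         zmq::socket_t data(ctx, ZMQ_PAIR);
>>>>>>>>         data.connect("tcp://192.168.38.1:5555");  // placeholder endpoint
>>>>>>>>
>>>>>>>>         zmq_socket_monitor((void *)data, "inproc://mon", ZMQ_EVENT_ALL);
>>>>>>>>         zmq::socket_t mon(ctx, ZMQ_PAIR);
>>>>>>>>         mon.connect("inproc://mon");
>>>>>>>>
>>>>>>>>         while (true) {
>>>>>>>>             zmq::message_t ev, addr;
>>>>>>>>             mon.recv(&ev);                        // event id + value
>>>>>>>>             mon.recv(&addr);                      // endpoint string
>>>>>>>>             uint16_t event; uint32_t value;
>>>>>>>>             memcpy(&event, ev.data(), sizeof event);
>>>>>>>>             memcpy(&value, (char *)ev.data() + 2, sizeof value);
>>>>>>>>             std::printf("event %u, value %u, endpoint %.*s\n", event,
>>>>>>>>                         value, (int)addr.size(), (char *)addr.data());
>>>>>>>>         }
>>>>>>>>     }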
>>>>>>>>
>>>>>>>> Best wishes
>>>>>>>>
>>>>>>>> Marko
>>>>>>>>
>>>>>>>> On Jul 4, 2015 12:04 AM, "Marko Vendelin" <marko.vendelin at gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Is there any way I could check whether zeromq is automatically
>>>>>>>>> dropping messages? I could recompile the library with some debug
>>>>>>>>> settings if needed, but this information would be very valuable.
>>>>>>>>>
>>>>>>>>> In this case I would expect to see the same with nanomsg as well,
>>>>>>>>> and at the beginning of the test with ZeroMQ. Our disk I/O should be
>>>>>>>>> faster than the network. Since the dropoff happens at ~10 minutes
>>>>>>>>> when using zeromq, RAM would not be able to cache the data either
>>>>>>>>> (by that time I have already transferred ~2TB on 64GB RAM machines).
>>>>>>>>>
>>>>>>>>> Use of REQ/REP allows me to spread the load among all disks
>>>>>>>>> automatically. Since there is one disk writer per HDD, and each one
>>>>>>>>> writes a dataset to disk after receiving it, the load per disk is
>>>>>>>>> proportional to its speed. The rates I am getting at the beginning
>>>>>>>>> with ZMQ (first ~10 min, ~30-36Gb/s) are above our requirements and
>>>>>>>>> would fit the application perfectly, if only I could sustain them
>>>>>>>>> for as long as the disk space allows.
>>>>>>>>>
>>>>>>>>> Re PAIR: I was thinking about giving PAIR a try. It would need a
>>>>>>>>> bit of reprogramming, but it's possible.
>>>>>>>>>
>>>>>>>>> Best wishes,
>>>>>>>>>
>>>>>>>>> Marko
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 3, 2015 at 10:52 PM, Peter Krey <peterjkrey at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> You may be sending messages faster than you can receive them and
>>>>>>>>>> write them to disk, overflowing the zeromq send buffer and causing
>>>>>>>>>> zeromq to automatically discard some messages. This is expected
>>>>>>>>>> behavior.
>>>>>>>>>>
>>>>>>>>>> Also, do not use the request-reply socket type; use PAIR. This will
>>>>>>>>>> not require your app to recv and reply before sending the next
>>>>>>>>>> image; your app can send asynchronously.
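>>>>>>>>>>
>>>>>>>>>> A minimal sketch of that sender side (the endpoint and image size
>>>>>>>>>> are placeholders, not your actual values):
>>>>>>>>>>
>>>>>>>>>>     // PAIR lets the sender keep pushing images without waiting for replies.
>>>>>>>>>>     #include <zmq.hpp>
>>>>>>>>>>     #include <cstddef>
>>>>>>>>>>
>>>>>>>>>>     int main() {
>>>>>>>>>>         zmq::context_t ctx(1);
>>>>>>>>>>         zmq::socket_t out(ctx, ZMQ_PAIR);
>>>>>>>>>>         out.bind("tcp://*:5555");                     // placeholder endpoint
>>>>>>>>>>
>>>>>>>>>>         const size_t image_bytes = 2 * 1024 * 1024;   // ~2MB images
>>>>>>>>>>         while (true) {
>>>>>>>>>>             zmq::message_t image(image_bytes);
>>>>>>>>>>             // ... fill image.data() with the next camera frame ...
>>>>>>>>>>             out.send(image);                          // no recv() in between
>>>>>>>>>>         }
>>>>>>>>>>     }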
>>>>>>>>>>
>>>>>>>>>> On Wednesday, July 1, 2015, Marko Vendelin <markov at sysbio.ioc.ee>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Dear ØMQ developers:
>>>>>>>>>>>
>>>>>>>>>>> Synopsis: I am observing a strange interaction between storing a
>>>>>>>>>>> data stream on hard disks and a loss of ZeroMQ messages. It seems
>>>>>>>>>>> that in my use case, when messages are larger than 2MB, some of
>>>>>>>>>>> them are randomly dropped.
>>>>>>>>>>>
>>>>>>>>>>> Full story:
>>>>>>>>>>>
>>>>>>>>>>> I need to pump images acquired by fast scientific cameras into
>>>>>>>>>>> files at rates approaching 25Gb/s. For that, images are acquired
>>>>>>>>>>> on one server and transferred to the harddisk array over a 40Gb/s
>>>>>>>>>>> network. Since Linux-based solutions using iSCSI were not working
>>>>>>>>>>> very well (maybe they need more optimization) and plain network
>>>>>>>>>>> applications could use the full bandwidth, I decided on a
>>>>>>>>>>> RAID-0-inspired approach: make a filesystem on each of the 32
>>>>>>>>>>> harddisks separately, run small slave programs, one per
>>>>>>>>>>> filesystem, and let the slaves ask the dataset server for a
>>>>>>>>>>> dataset in a loop. As the messaging system, I use ZeroMQ with a
>>>>>>>>>>> REQ/REP connection. In general, everything seems to work
>>>>>>>>>>> perfectly: I am able to stream and record data at about 36Gb/s.
>>>>>>>>>>> However, at some point (within 5-10 min), messages sometimes get
>>>>>>>>>>> lost. Intriguingly, this occurs only if I write files and messages
>>>>>>>>>>> are 2MB or larger. Much smaller messages do not seem to trigger
>>>>>>>>>>> this effect. If I just stream data and either dump it or just do
>>>>>>>>>>> calculations on the basis of it, all messages go through. All
>>>>>>>>>>> messages also go through if I use a 1Gb network.
>>>>>>>>>>>
>>>>>>>>>>> While in the production code I stream data into HDF5 and use
>>>>>>>>>>> zmqpp and polling to receive messages, I have reduced the
>>>>>>>>>>> problematic code to the simplest case using zmq.hpp, regular
>>>>>>>>>>> files, and plain send/recv calls. Code is available at
>>>>>>>>>>>
>>>>>>>>>>> http://www.ioc.ee/~markov/zmq/problem-missing-messages/
>>>>>>>>>>>
>>>>>>>>>>> At the same time, there don't seem to be any excessive drops on
>>>>>>>>>>> the ethernet cards, as reported by ifconfig in Linux (slaves run
>>>>>>>>>>> on Gentoo, the server on Ubuntu):
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>>>>>>>>>>>         inet 192.168.38.1  netmask 255.255.255.252  broadcast 192.168.38.3
>>>>>>>>>>>         inet6 fe80::225:90ff:fe9c:62c3  prefixlen 64  scopeid 0x20<link>
>>>>>>>>>>>         ether 00:25:90:9c:62:c3  txqueuelen 1000  (Ethernet)
>>>>>>>>>>>         RX packets 8568340799  bytes 76612663159251 (69.6 TiB)
>>>>>>>>>>>         RX errors 7  dropped 0  overruns 0  frame 7
>>>>>>>>>>>         TX packets 1558294820  bytes 93932603947 (87.4 GiB)
>>>>>>>>>>>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>>>>>>>>>>>
>>>>>>>>>>> eth3      Link encap:Ethernet  HWaddr 00:25:90:9c:63:1a
>>>>>>>>>>>           inet addr:192.168.38.2  Bcast:192.168.38.3  Mask:255.255.255.252
>>>>>>>>>>>           inet6 addr: fe80::225:90ff:fe9c:631a/64 Scope:Link
>>>>>>>>>>>           UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
>>>>>>>>>>>           RX packets:1558294810 errors:0 dropped:0 overruns:0 frame:0
>>>>>>>>>>>           TX packets:8570261350 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>>>>>>           collisions:0 txqueuelen:1000
>>>>>>>>>>>           RX bytes:102083292705 (102.0 GB)  TX bytes:76629844394725 (76.6 TB)
>>>>>>>>>>>
>>>>>>>>>>> So, it should not be a simple dropped frames problem.
>>>>>>>>>>>
>>>>>>>>>>> Since the problem occurs only with larger messages, is there any
>>>>>>>>>>> size-limited buffer in ZeroMQ that may cause dropping of the
>>>>>>>>>>> messages?
>>>>>>>>>>> Or any other possible solution?
>>>>>>>>>>>
>>>>>>>>>>> Thank you for your help,
>>>>>>>>>>>
>>>>>>>>>>> Marko