[zeromq-dev] Blocking issues with signaler_t::make_fdpair

Felipe Farinon felipe.farinon at powersyslab.com
Tue Dec 10 20:18:25 CET 2013


Maybe it's time to switch to ephemeral ports again.
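
For readers of the archive, here is a minimal sketch of what "ephemeral ports" means in this thread: bind the listening socket to port 0, let the OS pick a free port, and read it back with getsockname. No fixed port is shared, so no system-wide lock is needed. This is not libzmq's make_fdpair; the name make_loopback_pair is hypothetical, and the sketch uses POSIX sockets (on Windows the Winsock equivalents would be used instead).

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not libzmq code): builds a connected loopback TCP
// pair on an OS-assigned port. Returns 0 on success, -1 on failure.
int make_loopback_pair(int *reader_out, int *writer_out, uint16_t *port_out)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0)
        return -1;

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; // 0 asks the OS for any free (ephemeral) port

    socklen_t len = sizeof addr;
    if (bind(listener, reinterpret_cast<sockaddr *>(&addr), sizeof addr) != 0
        || listen(listener, 1) != 0
        // Read back the port the OS actually chose.
        || getsockname(listener, reinterpret_cast<sockaddr *>(&addr), &len) != 0) {
        close(listener);
        return -1;
    }
    assert(len == sizeof addr);

    int writer = socket(AF_INET, SOCK_STREAM, 0);
    if (writer < 0) {
        close(listener);
        return -1;
    }
    if (connect(writer, reinterpret_cast<sockaddr *>(&addr), sizeof addr) != 0) {
        close(writer);
        close(listener);
        return -1;
    }

    int reader = accept(listener, nullptr, nullptr);
    close(listener); // the listener is only needed for this one pair
    if (reader < 0) {
        close(writer);
        return -1;
    }

    *reader_out = reader;
    *writer_out = writer;
    *port_out = ntohs(addr.sin_port);
    return 0;
}
```

Each call gets its own port, so a crashed or hung process cannot block another process's call. A production version would probably also check that the accepted connection really comes from its own connecting socket.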

On 10/12/2013 14:42, Koby Boyango wrote:
> Sorry for my late reply, been sick for a few days. I've done some 
> tests using the make_fdpair from the master, and it seems like using 
> the ephemeral port support and avoiding the locking solved it. Thanks!
> But I do believe that if supporting a fixed signaler port is still 
> desired, we should better protect against the scenarios I've described 
> in my first mail. What do you think?
>
> Koby
>
>
> On Tue, Dec 10, 2013 at 12:37 AM, KIU Shueng Chuan <nixchuan at gmail.com> wrote:
>
>     I believe no permission is needed to do a pull request. :)
>
>     Upon rereading Koby's mail more closely, his problem can be
>     reproduced by having one background program use version 3.2.2. The
>     leaked event handle ensures that the global event stays alive and
>     doesn't get recreated each time by Windows.
>
>     On Dec 10, 2013 2:44 AM, "Felipe Farinon"
>     <felipe.farinon at powersyslab.com> wrote:
>
>         Since Koby hasn't answered and I can no longer reproduce the
>         problem, may I make the modification anyway? It will be tested
>         indirectly, since I am going to run it in the same environment
>         where the problem was happening.
>
>         On 01/12/2013 21:27, KIU Shueng Chuan wrote:
>>
>>         In master, you can switch to using ephemeral ports by
>>         modifying signaler_port to 0 in config.hpp. A new ephemeral
>>         port is used per make_fdpair call and no critical section is
>>         used.
>>
>>         Could you try that and see if it solves your problems?
>>
>>         On Dec 1, 2013 9:39 PM, "Koby Boyango" <koby.b at mce-sys.com> wrote:
>>
>>             Hi
>>             I'm fairly new to ZeroMQ, and have been working on
>>             integrating it using czmq in several projects, Windows only.
>>             I've opened an issue on GitHub, #767, and at Pieter's
>>             request I'm moving the discussion here. Here is what
>>             I've written there:
>>             While trying to integrate ZeroMQ in different
>>             modules\processes (Windows only), I've encountered a
>>             problem where in some situations a ZeroMQ call blocks -
>>             forever. After debugging the issue, I've found out that
>>             zmq_init wasn't returning, and after further debugging
>>             and digging through the code I've found out that the
>>             problem was in signaler_t::make_fdpair, where the
>>             WaitForSingleObject on the "zmq-signaler-port-sync"
>>             didn't return.
>>             Initially I wasn't sure in which situations it occurs. So
>>             I did some further investigation and found out that in my
>>             case:
>>
>>               * For some reason, when I close a test program with
>>                 Ctrl+C, the event stays un-signaled. Not sure why
>>                 yet, will need further debugging.
>>               * I had a node.js script, which uses ZeroMQ, running in
>>                 the background. Because it uses version 3.2.2 of
>>                 libzmq, which leaks the event handle, the existing
>>                 event wasn't deleted, and stayed in an un-signaled state.
>>               * Basically, from that point no one on the system can
>>                 use ZeroMQ.
>>
>>             I find make_fdpair to be very problematic on Windows:
>>
>>               * If one call exits without signaling the event, while
>>                 someone else is holding a handle to the event - All
>>                 further calls on the system will block. It can
>>                 happen, for example, if an assertion fails, and the
>>                 process crashes because of the exception raised.
>>               * It can also happen if an assertion has failed, an
>>                 exception was raised, but caught by the caller using
>>                 a __try & __except block (SEH). We can't simply rely
>>                 on the exception to crash the process (for example, a
>>                 program might wrap calls to its plugins with __try &
>>                 __except, so a faulty plugin won't crash the whole
>>                 program).
>>               * So it basically means that one faulty program can
>>                 cause other, unrelated programs, to block.
>>
>>             I suggest:
>>
>>               * No matter which synchronization mechanism is used,
>>                 wrap the code with __try & __finally, and release the
>>                 lock in the finally block. This will make sure that
>>                 we release it in case of an exception (in my case,
>>                 though, I tried it and it didn't help; the thread
>>                 might be terminated during the call).
>>               * If possible, don't use a global, system wide, lock.
>>                 From my understanding, it is used in order to reuse
>>                 the signaler port. So either use a random, available,
>>                 port, or make the port "libzmq instance" specific
>>                 (the first call binds on a random port, further
>>                 calls reuse the port) and protect it with a
>>                 critical section. This will at least limit the
>>                 problems to the same process.
>>               * If the system wide lock is really needed, I suggest
>>                 using a mutex instead of the event. When using a
>>                 mutex, if the owning thread dies without releasing
>>                 it, Windows automatically releases it and the next
>>                 call to WaitForSingleObject returns
>>                 WAIT_ABANDONED instead of blocking. We can then check
>>                 if the port was left in a "listening" state, close it
>>                 if necessary, and "re-listen" with a new socket.
>>
>>             I'm using libzmq 4.0.1 with czmq 2.0.2. I saw that the
>>             make_fdpair was improved in the master, but I believe it
>>             still doesn't entirely solve it.
>>             What do you say?
>>
>>             Koby
>>
>>             _______________________________________________
>>             zeromq-dev mailing list
>>             zeromq-dev at lists.zeromq.org
>>             <mailto:zeromq-dev at lists.zeromq.org>
>>             http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>
>
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
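
The mutex suggestion in the original report (an abandoned lock is handed to the next waiter instead of blocking it forever) can be illustrated with a runnable analog. The sketch below does not use the Win32 API: it swaps in POSIX robust mutexes, where pthread_mutex_lock returns EOWNERDEAD in roughly the situation where WaitForSingleObject on a Windows mutex would return WAIT_ABANDONED. The names robust_lock, robust_lock_init, robust_lock_acquire and lock_and_die are hypothetical.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>

// Hypothetical wrapper around a POSIX robust mutex (an analog of a
// Windows mutex, not Win32 code).
struct robust_lock {
    pthread_mutex_t m;
};

// Initializes the mutex as "robust" so a dead owner is detectable.
// Returns 0 on success or an errno-style code on failure.
int robust_lock_init(robust_lock *l)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(&l->m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Returns 0 on a normal acquire, 1 if the lock was acquired after the
// previous owner died while holding it, and -1 on any other error.
int robust_lock_acquire(robust_lock *l)
{
    int rc = pthread_mutex_lock(&l->m);
    if (rc == 0)
        return 0;
    if (rc == EOWNERDEAD) {
        // We now own the mutex. This is the point where shared state
        // (for example, a port left in a listening state) would be
        // cleaned up before marking the mutex usable again.
        if (pthread_mutex_consistent(&l->m) != 0)
            return -1;
        return 1;
    }
    return -1;
}

// Simulates a faulty owner: takes the lock and exits without releasing it.
void *lock_and_die(void *arg)
{
    int rc = pthread_mutex_lock(&static_cast<robust_lock *>(arg)->m);
    assert(rc == 0);
    (void) rc;
    return nullptr;
}
```

If a thread runs lock_and_die and exits, the next robust_lock_acquire call returns 1 rather than hanging, which is the behavior the event-based "zmq-signaler-port-sync" lock lacks.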
