[zeromq-dev] Blocking issues with signaler_t::make_fdpair

Koby Boyango koby.b at mce-sys.com
Sun Dec 1 14:39:45 CET 2013


Hi
I'm fairly new to ZeroMQ, and have been working on integrating it using
czmq in several projects, Windows only.
I've opened an issue on GitHub, #767, and at Pieter's request I'm moving
the discussion here. So here is what I've written there:
While trying to integrate ZeroMQ in different modules\processes (Windows
only), I've encountered a problem where in some situations a ZeroMQ call
blocks - forever. After debugging the issue, I've found out that zmq_init
wasn't returning, and after further debugging and digging through the code
I've found out that the problem was in signaler_t::make_fdpair, where the
WaitForSingleObject on the "zmq-signaler-port-sync" didn't return.
Initially I wasn't sure in which situations it occurs. So I did some
further investigation and found out that in my case:

   - For some reason, when I close a test program with Ctrl+C, the event
   stays un-signaled. Not sure why yet, will need further debugging.
   - I had a node.js script, which uses ZeroMQ, running in the background.
   Because it uses version 3.2.2 of libzmq, which leaks the event handle, the
   existing event wasn't deleted, and stayed in an un-signaled state.
   - Basically, from that point no one on the system can use ZeroMQ.

I find make_fdpair to be very problematic on Windows:

   - If one call exits without signaling the event, while someone else is
   holding a handle to the event, all further calls on the system will block.
   It can happen, for example, if an assertion fails, and the process crashes
   because of the exception raised.
   - It can also happen if an assertion has failed, an exception was
   raised, but caught by the caller using a __try & __except block (SEH). We
   can't simply rely on the exception to crash the process (for example, a
   program might wrap calls to its plugins with __try & __except, so a faulty
   plugin won't crash the whole program).
   - So it basically means that one faulty program can cause other,
   unrelated programs, to block.

I suggest:

   - No matter which synchronization mechanism is used, wrap the code with
   __try & __finally, and release the lock in the finally block. This will
   make sure that we'll release in case of an exception (in my case, though, I
   tried it and it didn't help; the thread might be terminated during the
   call).
   - If possible, don't use a global, system wide, lock. From my
   understanding, it is used in order to reuse the signaler port. So either
   use a random, available, port, or make the port "libzmq instance" specific
   (the first call binds on a random port, further calls will reuse the port)
   and protect it with a critical section. This will at least limit the problems
   to the same process.
   - If the system wide lock is really needed, I suggest using a mutex
   instead of the event. When using a mutex, if the owning thread dies without
   releasing it, Windows automatically releases it and the next call to
   WaitForSingleObject will return WAIT_ABANDONED and will not block. We can
   then check if the port was left in a "listening" state, close it if
   necessary, and "re-listen" with a new socket.
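
To make the second suggestion concrete, here is a rough sketch (not libzmq's
actual code) of a per-process signaler port: the first call binds a loopback
listener to port 0 so the OS picks a free port, and later calls reuse it. The
lock is process-local, so a stuck caller can only affect its own process. All
names (signaler_port, listen_on_ephemeral_port, g_signaler_mutex, and so on)
are made up for illustration, std::mutex stands in for a Windows critical
section, and the connect/accept step that would actually create the socket
pair is left out. Note that std::lock_guard releases the lock on normal
returns and C++ exceptions only; it does not help if the thread is terminated
mid-call.

```cpp
// Sketch only: hypothetical names, not taken from libzmq.
#include <cassert>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <stdexcept>

#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
typedef SOCKET socket_t;
typedef int socklen_type;
static const socket_t bad_socket = INVALID_SOCKET;
static void close_socket(socket_t s) { closesocket(s); }
#else
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
typedef int socket_t;
typedef socklen_t socklen_type;
static const socket_t bad_socket = -1;
static void close_socket(socket_t s) { close(s); }
#endif

// Binds a TCP listener to 127.0.0.1 with port 0, which asks the OS to pick
// any free port, then reads the chosen port back with getsockname().
static socket_t listen_on_ephemeral_port(std::uint16_t *port_out)
{
#ifdef _WIN32
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        throw std::runtime_error("WSAStartup failed");
#endif
    socket_t s = socket(AF_INET, SOCK_STREAM, 0);
    if (s == bad_socket)
        throw std::runtime_error("socket() failed");

    sockaddr_in addr;
    std::memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); // 0 = let the OS choose the port

    socklen_type len = sizeof addr;
    if (bind(s, reinterpret_cast<sockaddr *>(&addr), sizeof addr) != 0
        || listen(s, 1) != 0
        || getsockname(s, reinterpret_cast<sockaddr *>(&addr), &len) != 0) {
        close_socket(s);
        throw std::runtime_error("bind/listen/getsockname failed");
    }
    *port_out = ntohs(addr.sin_port);
    return s;
}

// Process-local state: nothing here is visible to other processes.
static std::mutex g_signaler_mutex;
static socket_t g_listener = bad_socket;
static std::uint16_t g_signaler_port = 0;

// Returns the port used by this process, binding it on the first call.
std::uint16_t signaler_port()
{
    // The guard unlocks on every exit from this scope, including when
    // listen_on_ephemeral_port() throws.
    std::lock_guard<std::mutex> guard(g_signaler_mutex);
    if (g_listener == bad_socket)
        g_listener = listen_on_ephemeral_port(&g_signaler_port);
    return g_signaler_port;
}
```

Calling signaler_port() twice in the same process should return the same
non-zero port, while two different processes would normally get different
ports and never wait on each other. On Windows this needs to be linked
against ws2_32.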

I'm using libzmq 4.0.1 with czmq 2.0.2. I saw that make_fdpair was
improved in master, but I believe it still doesn't entirely solve it.
What do you say?

Koby