[zeromq-dev] zmq-signaler-port-sync

KIU Shueng Chuan nixchuan at gmail.com
Fri Feb 15 10:13:49 CET 2013


Hi Pau,

I submitted a patch earlier this morning against the development branch.
https://github.com/zeromq/libzmq/pull/514
It only checks for error returns from connect() and accept(). The other
assertions that occur would be due to programming errors.


Regarding your patch, I see that yours doesn't have the CloseHandle(sync)
as fixed in the trunk branch. Although it was to fix a handle leak, not
having this fix also has the side effect of making the problem described in
this thread much more likely to surface.
- Process A and Process B both use zeromq and are running
- Due to the handle leak, the global Event is kept alive even when we are
not inside the critical section.
- Process A subsequently enters the critical section and asserts, leaving
it locked
- Sometime later when Process B (or another newly started zeromq
application) tries to enter the critical section, it will block forever
(With the fix, the problem surfaces only if another process was blocking to
enter the critical section at the point of assertion)


On Fri, Feb 15, 2013 at 4:21 PM, Pau <pau at teleopsia.com> wrote:

>  Kiu,
>
> I have rethought about the problem and it is not very clear to me how to
> run all make_fdpair() with no asserts. The problem is that I do not know
> waht may happen if we call a wsa function after a failure of a previous
> one. I guess that we could check in a per call way and see if it is ok. Or
> define an internal boolean that only calls the next function if the
> previous has suceeded.
>
> I am not familiar with the coding style in ZMQ (never made a patch) but I
> would say that the specific assert can look a bit better. I am testing this
> code.
> Anyway, if you think removing the asserts is ok, I am ok with it, as said
> I am new in this yard.
>
> Please check code:
>
> It is not clear to me what to do with the win_assert(..) in case SetEvent
> (sync) fails inside another assert. As it is now it is clear that it will
> never be called but I guess that in case of error you prefer to see the
> fisrt error. Not sure.
>
> #if defined ZMQ_HAVE_WINDOWS
> #define wsa_assert_fdpair(sync, no) \
>     {    \
>      bool brc = SetEvent (sync); \
>      wsa_assert (no); \
>      win_assert (brc != 0);    \
>     }
>
> #define zmq_assert_fdpair(sync, x) \
>     {    \
>      bool brc = SetEvent (sync); \
>      zmq_assert (x); \
>      win_assert (brc != 0);    \
>     }
>
> #define win_assert_fdpair(sync, x) \
>     {    \
>      bool brc = SetEvent (sync); \
>      win_assert ( (x) && (brc != 0) );    \
>     }
>
> #endif
>
> int zmq::signaler_t::make_fdpair (fd_t *r_, fd_t *w_)
>
> ...
>
>     HANDLE sync = CreateEvent (&sa, FALSE, TRUE, TEXT
> ("Global\\zmq-signaler-port-sync"));
>     if (sync == NULL && GetLastError () == ERROR_ACCESS_DENIED)
>       sync = OpenEvent (SYNCHRONIZE | EVENT_MODIFY_STATE, FALSE, TEXT
> ("Global\\zmq-signaler-port-sync"));
>
>     win_assert (sync != NULL);
>
>     //  Enter the critical section.
>     DWORD dwrc = WaitForSingleObject (sync, INFINITE);
>     zmq_assert_fdpair (sync, dwrc == WAIT_OBJECT_0);
>
>     //  Windows has no 'socketpair' function. CreatePipe is no good as pipe
>     //  handles cannot be polled on. Here we create the socketpair by hand.
>     *w_ = INVALID_SOCKET;
>     *r_ = INVALID_SOCKET;
>
>     //  Create listening socket.
>     SOCKET listener;
>     listener = open_socket (AF_INET, SOCK_STREAM, 0);
>     wsa_assert_fdpair(sync, listener != INVALID_SOCKET);
>
>     //  Set SO_REUSEADDR and TCP_NODELAY on listening socket.
>     BOOL so_reuseaddr = 1;
>     int rc = setsockopt (listener, SOL_SOCKET, SO_REUSEADDR,
>         (char *)&so_reuseaddr, sizeof (so_reuseaddr));
>     wsa_assert (rc != SOCKET_ERROR);
>     BOOL tcp_nodelay = 1;
>     rc = setsockopt (listener, IPPROTO_TCP, TCP_NODELAY,
>         (char *)&tcp_nodelay, sizeof (tcp_nodelay));
>     wsa_assert_fdpair (sync, rc != SOCKET_ERROR);
>
>     //  Bind listening socket to any free local port.
>     struct sockaddr_in addr;
>     memset (&addr, 0, sizeof (addr));
>     addr.sin_family = AF_INET;
>     addr.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
>     addr.sin_port = htons (signaler_port);
>     rc = bind (listener, (const struct sockaddr*) &addr, sizeof (addr));
>     wsa_assert_fdpair (sync, rc != SOCKET_ERROR);
>
>     //  Listen for incomming connections.
>     rc = listen (listener, 1);
>     wsa_assert_fdpair (sync, rc != SOCKET_ERROR);
>
>     //  Create the writer socket.
>     *w_ = WSASocket (AF_INET, SOCK_STREAM, 0, NULL, 0,  0);
>     wsa_assert_fdpair (sync, *w_ != INVALID_SOCKET);
>
>     //  On Windows, preventing sockets to be inherited by child processes.
>     BOOL brc = SetHandleInformation ((HANDLE) *w_, HANDLE_FLAG_INHERIT, 0);
>     win_assert_fdpair (sync, brc);
>
>     //  Set TCP_NODELAY on writer socket.
>     rc = setsockopt (*w_, IPPROTO_TCP, TCP_NODELAY,
>         (char *)&tcp_nodelay, sizeof (tcp_nodelay));
>     wsa_assert_fdpair (sync, rc != SOCKET_ERROR);
>
>     //  Connect writer to the listener.
>     rc = connect (*w_, (struct sockaddr*) &addr, sizeof (addr));
>     wsa_assert_fdpair (sync, rc != SOCKET_ERROR);
>
>     //  Accept connection from writer.
>     *r_ = accept (listener, NULL, NULL);
>     wsa_assert_fdpair (sync, *r_ != INVALID_SOCKET);
>
>     //  On Windows, preventing sockets to be inherited by child processes.
>     brc = SetHandleInformation ((HANDLE) *r_, HANDLE_FLAG_INHERIT, 0);
>     win_assert_fdpair (sync, brc);
>
>     //  We don't need the listening socket anymore. Close it.
>     rc = closesocket (listener);
>     wsa_assert_fdpair (sync, rc != SOCKET_ERROR);
>
>     //  Exit the critical section.
>     brc = SetEvent (sync);
>     win_assert (brc != 0);
>
>     return 0;
>
>
>
> El 14/02/2013 1:46, KIU Shueng Chuan escribió:
>
> Are we okay with using assertions to catch both
> - programming errors
> - rare but known situations not handled in the code? (in this case
> resource exhaustion)
>
>  How about this: Have make_fdpair() return -1 (and release the critical
> section) on error returns from the calls to connect() and accept() only.
> The failure will be caught by signaler_t() which calls make_fdpair()
>
>
>
> On Wed, Feb 13, 2013 at 7:22 PM, Pau <pau at teleopsia.com> wrote:
>
>>  Hi, thanks
>>
>> I do not want to look bungler, but wouldn't be a shortcut to implement
>> asserts that clean the event before aborting?
>>
>> El 13/02/2013 9:54, KIU Shueng Chuan escribió:
>>
>> Hi Pau,
>>
>>  The system-wide critical section is currently implemented using a win32
>> Event which, as you observed, has the possibility of resulting in a
>> deadlock in the following situation:
>> 1) Process A takes the Event
>> 2) Process B tries to take the Event and blocks
>> 3) Process A aborts within the critical section (due to an assertion
>> being raised)
>> 4) Since Process B has opened the Event, the OS will not clean up the
>> Event.
>> 5) Process B and any subsequent process will now block forever for the
>> Event.
>>
>>  As I mentioned in the previous mail, if the critical section were to be
>> implemented using a Mutex instead, then in step 5, Process B would be able
>> to enter the critical section with a return code of WAIT_ABANDONED from
>> WaitForSingleObject. (Or at least that's what I read from MSDN)
>>
>>  Note: If Process A aborted due to some exhaustion of resources, then
>> Process B would likely hit the same assertion too.
>>
>>  The question is how to convert the Event to a Mutex and yet not break
>> compatibility with existing applications using older versions of the
>> library.
>>
>>
>>
>> On Wed, Feb 13, 2013 at 3:28 PM, Pau <pau at teleopsia.com> wrote:
>>
>>>  Hi,
>>>
>>> I am back with the asserts happening inside a critical section in
>>> signaler.cpp.
>>> The problem still is that in signale.cpp make_fdpair(..) creates
>>> system-wide critical section and does a number of things that can generate
>>> a wsa_assert() or win_assert() before releasing the session.
>>>
>>> I have seen that in the trunk someone has added a CloseHandle(sync) at
>>> the end of the function, I do not know if it had something related with
>>> this but I understand that the problem is still there. wsa_assert() and
>>> wsa_windows() end up in  RaiseException (0x40000015,
>>> EXCEPTION_NONCONTINUABLE, 1, extra_info) which I understand is a cul de sac
>>> that has no way out to clean up before leaving.
>>>
>>> I guess we need a special assert function to use inside this critical
>>> but I'd like a more documented opinion (Kiu?).
>>>
>>> thanks,
>>>
>>> Pau Ceano
>>>
>>> El 21/01/2013 23:37, KIU Shueng Chuan escribió:
>>>
>>> Hi Pau, a fix for the assertion on connection to port 5905 is in trunk
>>> branch.
>>>
>>> I think the dangling critical section possibility could be fixed by
>>> changing the Event to a Mutex. When an assertion occurs the mutex would
>>> just be abandoned. However this change will cause backward compatibility
>>> issues with older versions.
>>>  On Jan 22, 2013 2:04 AM, "Pieter Hintjens" <ph at imatix.com> wrote:
>>>
>>>> Hi Pau,
>>>>
>>>> So there are two different problems here, one is that we're hitting a
>>>> socket limit on WXP, and the other is that the asserts are happening
>>>> inside a critical section.
>>>>
>>>> I don't think we can fix the first one easily but we can presumably
>>>> assert in a smarter way. Do you want to try making a patch for this?
>>>>
>>>> -Pieter
>>>>
>>>> On Mon, Jan 21, 2013 at 6:23 PM, Pau <pau at teleopsia.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> >
>>>> > I am using (not yet in production) ZMQ on Windows and I have found
>>>> what
>>>> > I think is a big problem for Windows users.
>>>> > We use WXP and W7 and Visual C++ different versions. ZMQ version 3.2.0
>>>> > (as far as I see the same problem happens in 3.2.2)
>>>> >
>>>> > I do not fully understand ZMQ internals but I've seen that every time
>>>> a
>>>> > socket is created the function make_fdpair(..) is called and in
>>>> > signaler.cpp(line244) a system event "zmq-signaler-port-sync" is
>>>> created.
>>>> > This event is used as a system-wide critical section and, so all
>>>> > applications that try to create an event will WaitForSingleObject
>>>> (sync,
>>>> > INFINITE) until  SetEvent (...) is called.
>>>> > The problem is that the code between:
>>>> >   HANDLE sync = CreateEvent (NULL, FALSE, TRUE, TEXT
>>>> > ("zmq-signaler-port-sync"));
>>>> > and
>>>> > SetEvent (sync);
>>>> > is full of wsa_asserts(..) that will terminate the application if
>>>> > something goes wrong.
>>>> >
>>>> > It is clear that terminating the application not leaving the
>>>> system-wide
>>>> > critical section is a bad idea because all applications in the system
>>>> > will hang and you will have to stop all them to start again.
>>>> > I understand that no errors should happen but anyway to escape from
>>>> the
>>>> > error is not a good idea in this case.
>>>> >
>>>> > I do not know all possible reasons to generate a fatal wsa_assert(..)
>>>> > but there is at least one:
>>>> >
>>>> > I have seen that in XP it is possible that line 301  rc = connect
>>>> (*w_,
>>>> > (sockaddr *) &addr, sizeof (addr)); returns an error when a socket
>>>> tries
>>>> > to connect to 5905 and this has happened many times.
>>>> > Windows uses port numbers starting near 1400 and XP has a limit at
>>>> 5000.
>>>> > In W7 this does not look as a problem because maximum is 65000
>>>> > It sounds as if the number was big enough but apart from the fact that
>>>> > ZMQ uses a big number of connections (at least in my tests) I have
>>>> > experienced that Windows jumps port numbers by 7, so 5000 happens
>>>> > sometimes with catastrophic consequences.
>>>> >
>>>> > best,
>>>> >
>>>> > Pau Ceano
>>>> > _______________________________________________
>>>> > zeromq-dev mailing list
>>>> > zeromq-dev at lists.zeromq.org
>>>> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>> _______________________________________________
>>>> zeromq-dev mailing list
>>>> zeromq-dev at lists.zeromq.org
>>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>>
>>>
>>>
>>> _______________________________________________
>>> zeromq-dev mailing listzeromq-dev at lists.zeromq.orghttp://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>>
>>>
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev at lists.zeromq.org
>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>>
>>
>>
>> _______________________________________________
>> zeromq-dev mailing listzeromq-dev at lists.zeromq.orghttp://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>
>>
>>
>> _______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>
>>
>
>
> _______________________________________________
> zeromq-dev mailing listzeromq-dev at lists.zeromq.orghttp://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev at lists.zeromq.org
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.zeromq.org/pipermail/zeromq-dev/attachments/20130215/b09b0148/attachment.htm>


More information about the zeromq-dev mailing list