[zeromq-dev] zeromq, abort(), and high reliability environments
Goswin von Brederlow
goswin-v-b at web.de
Thu Aug 14 12:33:54 CEST 2014
On Wed, Aug 13, 2014 at 09:39:24AM -0500, Thomas Rodgers wrote:
> For the other cases where the assert happens in a background thread, I
> could see retrying before giving up in the event of transient errors, but
> there's still the fundamental complication of how you communicate the now
> asynchronous, hard failure back to the caller in some reliable/sane way (as
> was noted before, the choice the CUDA SDK made here is a great example of how
> not to do it).
One way may be to have an abort callback that language bindings (or
applications) can set. Instead of killing the program outright, the
library would invoke the abort callback, and the bindings / application
could take the proper actions. If no callback is set, or if the
callback returns, the real abort() would be called.
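
A minimal sketch of what such a hook could look like, in C. None of
these names exist in libzmq: set_abort_hook(), library_fatal(),
abort_hook_fn and example_longjmp_hook() are all hypothetical and only
illustrate the idea. The example hook escapes via longjmp(), which is
just one possible "proper action" and is only valid when the failure
is raised on the thread that called setjmp().

```c
#include <setjmp.h>
#include <stdlib.h>

/* Hypothetical hook type, not part of the libzmq API. */
typedef void (*abort_hook_fn)(const char *reason, void *user_data);

static abort_hook_fn abort_hook = NULL;
static void *abort_hook_data = NULL;

/* Reason passed to the most recent hook invocation (for illustration). */
static const char *last_reason = NULL;

/* Bindings or applications register their handler here (hypothetical). */
void set_abort_hook(abort_hook_fn fn, void *user_data)
{
    abort_hook = fn;
    abort_hook_data = user_data;
}

/* What the library would call instead of calling abort() directly. */
void library_fatal(const char *reason)
{
    if (abort_hook != NULL)
        abort_hook(reason, abort_hook_data);

    /* Hook unset, or the hook returned: fall back to the real abort(). */
    abort();
}

/* Example hook: record the reason, then jump back to the caller's
 * recovery point instead of returning into library_fatal(). */
void example_longjmp_hook(const char *reason, void *user_data)
{
    last_reason = reason;
    longjmp(*(jmp_buf *)user_data, 1);
}
```

A binding for a language with exceptions would presumably raise an
exception from the hook rather than use longjmp(). For a failure in a
background thread the hook could not unwind into the caller at all, so
it would more likely log, set a flag, and shut down cleanly.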
But this would be for unrecoverable errors. I don't think there should
be many of those in zmq.
There is still a class of errors left though. A background thread can
hit an error that is persistent but not unrecoverable. In the simplest
case, when a connection dies and the client doesn't reconnect, there
is an error. It won't go away. It won't fix itself. It doesn't
impact any other socket (or even other connections of the same
socket). So abort/assert is really the wrong thing. But how to tell the
application? Currently these kinds of errors get silently ignored in
zmq. Messages get dropped on the floor in most cases.
Note: I don't know of a better solution, so this isn't criticism. Just
an example of another class of errors.