[zeromq-dev] zeromq, abort(), and high reliability environments

Thomas Rodgers rodgert at twrodgers.com
Wed Aug 13 16:39:24 CEST 2014


For the class of errors where zmq may assert() on failure to meet
pre-conditions (e.g. the cases where it would likely return EINVAL), how
does retrying help?

For the other cases where the assert happens in a background thread, I
could see retrying before giving up in the event of transient errors, but
there's still the fundamental complication of how you communicate the now
asynchronous, hard failure back to the caller in some reliable/sane way (as
was noted before, the choice the CUDA SDK made here is a great example of how
not to do it).
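
To make the first point concrete, here is a minimal sketch of a retry
helper that only retries failures that could plausibly go away on their
own. It is not libzmq code: is_transient() and retry_transient() are
hypothetical names, and treating EAGAIN and EINTR as the transient set is
an assumption made for illustration.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical classification of errno-style codes worth retrying.
bool is_transient(int err) {
    return err == EAGAIN || err == EINTR;
}

// Calls op() up to max_attempts times. op returns 0 on success or an
// errno-style code on failure. Returns the last code seen.
int retry_transient(const std::function<int()>& op, int max_attempts) {
    int err = 0;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        err = op();
        if (err == 0 || !is_transient(err)) {
            return err;  // success, or a failure that retrying cannot fix
        }
    }
    return err;  // still failing with a transient code after all attempts
}
```

An operation that keeps returning EINVAL is called exactly once here,
because the arguments will be just as invalid on the second attempt.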


On Wed, Aug 13, 2014 at 9:29 AM, Prabhakara Yellai (pyellai) <
pyellai at cisco.com> wrote:

> Coming from the carrier class systems, and working in High Availability
> for more than a decade, I have seen and experienced that the error handling
> decision is better left to the application than library code making
> decisions. The same error/failure may be fatal to some apps but not fatal
> for some other apps. Library code should try to recover by retrying a few
> times to deal with transient errors/failures. If it couldn't recover, it
> can return the error to the App. Let the App determine whether (1) it wants
> to restart to recover (2) ignore as the operation it was performing is not
> critical to the system (3) let user/operator determine when and how to
> recover ('when' is very important). We actually changed several libraries
> from asserting inside the libs to returning the error to apps to reduce
> process crashes and to increase the system stability, resiliency and fault
> containment.
>
> Prabhakara
>
> -----Original Message-----
> From: zeromq-dev-bounces at lists.zeromq.org [mailto:
> zeromq-dev-bounces at lists.zeromq.org] On Behalf Of Pieter Hintjens
> Sent: Wednesday, August 13, 2014 12:08 PM
> To: ZeroMQ development list
> Subject: Re: [zeromq-dev] zeromq, abort(), and high reliability
> environments
>
> What we know, anyone who's written production systems knows, is that as
> Thomas says, when things go badly wrong, error handling simply will not
> work. There is a whole science to software reliability, kind of as there is
> for messaging reliability. Rule #1 is, also, simpler = more reliable. Error
> handling creates complexity. Especially error handling by an application on
> its own internal consistency... that's IMO a recipe for fragile software.
>
> In our APIs we've stripped down error reporting to a minimum. libzmq with
> its POSIX tendencies still relies IMO far too heavily on subtle error
> returns (errno == EAGAIN vs. errno == EINVAL?). CZMQ is much
> cleaner: a method works, or fails if there's a recoverable error, or
> asserts if there's an unrecoverable error.
>
> Note also that "assert" is just a name. Like "malloc", this can be
> replaced with more sophisticated diagnostics if there's profit in doing
> that. The adding in of extra diagnostics is fine. I'm usually more than
> happy with a stack backtrace. An assert invariably leads me to a bug in my
> application code and a fast, accurate fix.
>
> People who won't use a library because it contains asserts are
> misinformed, and remind me of people who think adding extra brokers to a
> network will magically add message "reliability".
>
> We get these threads every six months or so, and over time my attitude to
> asserts gets more militant, as I watch the quality of the code that's built
> on them, and the vast lack of counter-data grow and grow.
> You get more reliable software by eliminating those strange unexpected
> code paths, not by adding error handling into the mix.
>
> -Pieter
>
> On Wed, Aug 13, 2014 at 2:08 AM, Thomas Rodgers <rodgert at twrodgers.com>
> wrote:
> >> At least for languages that support exceptions, I believe throwing an
> >> exception for invalid arguments is far preferable to just killing the
> >> process.
> >
> >
> > I write a lot of C++ code for automated trading systems, so I come at
> > this from the view that there is no way in this world to light
> > yourself on fire faster than making the same stupid trade over and
> > over in a tight loop.  My experience has been that error recovery
> > logic is almost always poorly exercised and never works entirely as
> > intended when Bad Things happen.
> >
> > Spending time writing Erlang based systems has also changed my view to
> > favor the "just let it crash" approach (note, Erlang also has
> > exceptions, but they are not the most important feature of its error
> > handling/recovery model).
> > These days, I do not generally expect exceptions to be recoverable.
> > They are used as a mechanism where I can hang additional reporting on
> > what the failure was, and the context within which it happened, on the
> > way to a top level handler that does nothing but log the failure and
> > terminate the process.  It is then the responsibility of an external
> > process to put the system back into a known good state and restart the
> > failed process.
> >
> > To some extent, a library that aborts the process out from underneath
> > me denies me the opportunity to gather more context into my logs
> > before terminating, but core files are useful things for post mortem
> > debugging.
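
A rough sketch of the "hang context on the way to a top level handler"
pattern described above, using standard C++11 nested exceptions. The
function names (submit_order, handle_request, describe, run) are
hypothetical and only there to make the example self-contained.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical low-level operation that fails on a bad argument.
void submit_order(int quantity) {
    if (quantity <= 0) {
        throw std::invalid_argument("quantity must be positive");
    }
}

// Adds context on the way up; makes no attempt to recover.
void handle_request(int quantity) {
    try {
        submit_order(quantity);
    } catch (...) {
        std::throw_with_nested(std::runtime_error(
            "handle_request failed for quantity " + std::to_string(quantity)));
    }
}

// Flattens a chain of nested exceptions into a single log line.
std::string describe(const std::exception& e) {
    std::string text = e.what();
    try {
        std::rethrow_if_nested(e);
    } catch (const std::exception& inner) {
        text += " <- " + describe(inner);
    } catch (...) {
        text += " <- unknown failure";
    }
    return text;
}

// Top-level handler: records the failure and reports it to the caller.
// A real process would write the line to its log and then terminate.
int run(int quantity, std::string& log) {
    try {
        handle_request(quantity);
        return 0;
    } catch (const std::exception& e) {
        log = describe(e);
        return 1;
    }
}
```

In a real program main() would return the result of run(), or call
std::abort() after logging if a core file is wanted, and leave the
restart to an external supervisor.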
> >
> >
> > On Tue, Aug 12, 2014 at 6:35 PM, Michi Henning <michi at triodia.com>
> > wrote:
> >>
> >> > My current view on what constitutes a sane API and behavior from
> >> > the library is heavily driven by what I want, as a user. That is,
> >> > my C libraries are things I primarily make to use, not to sell. I
> >> > think it's been about 30 years that I wrote my first C libraries,
> >> > and my style and view has shifted massively since then, to what we
> >> > have in cases like CZMQ today.
> >>
> >> I can attest to having undergone a similar change of view over the
> >> past 30 years :-)
> >>
> >> > Mainly, the API enforces its style upwards, so that you simply
> >> > *cannot* get strange code paths and bizarre arguments. If you do,
> >> > your application is corrupt, or incompetent, and the library has a
> >> > responsibility to stop things immediately, not allow them to continue.
> >> >
> >> > It is a safety cord that has proven its usefulness many times.
> >> > Indeed, some of the hardest bugs to catch in recent months were
> >> > from older APIs that precisely returned EINVAL on bad arguments,
> >> > and where the calling code forgot to check the return code. Stuff is
> >> > then... bizarrely broken and tracking that down can be insanely hard.
> >>
> >> I hear you, and there is probably not a single one true answer here.
> >>
> >> Part of the problem is C, which makes it possible to ignore error
> >> codes and just blithely stumble on regardless.
> >>
> >> In languages with exception handling, it's a different matter though,
> >> because I can force the caller to pay attention to invalid arguments.
> >>
> >> My main concern is that, by aborting in the library, it becomes very
> >> difficult to write something that needs to have high reliability.
> >> Basically, I can be sure that my program won't dump core only if I
> >> have exercised it to the extent that all possible code paths with all
> >> possible argument values are tested under all possible combinations.
> >> For any sizeable program (especially with lots of threads and
> >> asynchronous things going on), that can be damn near impossible.
> >>
> >> In turn, if I still want to persist, I now have to wrap the
> >> underlying C API and check all the preconditions for every API call
> >> myself, just so I can throw an exception when a pre-condition is
> >> violated instead of having the program aborted by the library. But
> >> validating the pre-conditions myself may well be very difficult or
> >> very expensive. For example, the cost of verifying that a valid socket
> >> pointer is passed to every API call is quite high.
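
The wrapping described above might look roughly like the following. This
is a sketch with a made-up C-style API (c_socket, c_socket_send) standing
in for the real library, and a hypothetical Socket wrapper class; none of
these names come from libzmq or CZMQ.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical C-style handle.
struct c_socket {
    std::size_t bytes_sent = 0;
};

// Hypothetical C-style call that aborts on a violated pre-condition.
int c_socket_send(c_socket* s, const void* data, std::size_t len) {
    assert(s != nullptr && data != nullptr);
    s->bytes_sent += len;
    return 0;
}

// C++ wrapper that re-checks the pre-conditions first, so the caller gets
// an exception instead of an abort.
class Socket {
public:
    explicit Socket(c_socket* raw) : raw_(raw) {}

    void send(const void* data, std::size_t len) {
        if (raw_ == nullptr) {
            throw std::invalid_argument("socket is null");
        }
        if (data == nullptr) {
            throw std::invalid_argument("data is null");
        }
        if (c_socket_send(raw_, data, len) != 0) {
            throw std::runtime_error("send failed");
        }
    }

private:
    c_socket* raw_;
};
```

The null checks are the cheap part. The wrapper cannot tell a dangling or
already-closed handle from a valid one without tracking every live handle
itself, which is where the cost mentioned above comes from.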
> >>
> >> If I'm given the option of catching an exception, I may be able to
> >> recover from my own programming error, for example, by terminating
> >> only the current operation. At least, the program keeps running,
> >> instead of dumping core, and I can splatter my log with error messages
> >> or whatever I deem appropriate.
> >> The point here is that general-purpose libraries should avoid setting
> >> policy, because what should happen under certain error conditions is
> >> something that needs to be under control of the caller.
> >>
> >> I hear you about the difficulty of debugging code that ignores EINVAL
> >> from API calls. But that is the price of programming in C. It's no
> >> different from making system calls and ignoring the return value; do
> >> so at your peril. But a system call is policy-free: it allows me to
> >> decide what should happen when I have passed bad arguments, instead of
> >> taking that decision away from me.
> >>
> >> At least for languages that support exceptions, I believe throwing an
> >> exception for invalid arguments is far preferable to just killing the
> >> process.
> >>
> >> Cheers,
> >>
> >> Michi.
> >> _______________________________________________
> >> zeromq-dev mailing list
> >> zeromq-dev at lists.zeromq.org
> >> http://lists.zeromq.org/mailman/listinfo/zeromq-dev