[zeromq-dev] zeromq, abort(), and high reliability environments

Pieter Hintjens ph at imatix.com
Wed Aug 13 08:37:36 CEST 2014


What we know, and what anyone who's written production systems knows, is
that, as Thomas says, when things go badly wrong, error handling simply
will not work. There is a whole science to software reliability, much as
there is for messaging reliability. Rule #1 in both is that simpler =
more reliable. Error handling creates complexity. Especially error
handling by an application on its own internal consistency... that's IMO
a recipe for fragile software.

In our APIs we've stripped down error reporting to a minimum. libzmq
with its POSIX tendencies still relies IMO far too heavily on subtle
error returns (errno == EAGAIN vs. errno == EINVAL?). CZMQ is much
cleaner: a method works, or fails if there's a recoverable error, or
asserts if there's an unrecoverable error.
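
To make that three-way contract concrete, here is a minimal sketch in C. It is not CZMQ code: queue_t, queue_init, queue_push and QUEUE_CAPACITY are hypothetical names invented for illustration. A full queue is treated as the recoverable case (return -1), and a NULL handle is treated as a caller bug (assert).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size queue, for illustration only (not CZMQ). */
#define QUEUE_CAPACITY 2

typedef struct {
    int items[QUEUE_CAPACITY];
    size_t count;
} queue_t;

/* Prepare an empty queue. A NULL handle is a caller bug, so assert. */
void queue_init(queue_t *self)
{
    assert(self);
    self->count = 0;
}

/* Returns 0 if the item was stored, -1 if the queue is full.
   "Full" is a normal runtime condition the caller can act on;
   a NULL handle is not, so it stops the process at once. */
int queue_push(queue_t *self, int item)
{
    assert(self);                        /* unrecoverable: caller bug */
    if (self->count == QUEUE_CAPACITY)
        return -1;                       /* recoverable: try again later */
    self->items[self->count++] = item;
    return 0;
}
```

The caller has exactly one failure to check for, and there is no errno to decode. Note that a standard assert compiles away when NDEBUG is defined, so a library that relies on this style would presumably want its checks to stay on in release builds.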

Note also that "assert" is just a name. Like "malloc", it can be
replaced with more sophisticated diagnostics if there's profit in
doing that. Adding extra diagnostics is fine. I'm usually more than
happy with a stack backtrace. An assert invariably leads me to a bug
in my application code and a fast, accurate fix.
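
As a rough sketch of what "replacing assert" could look like, the following C fragment routes a failed check through a swappable handler. Everything here (CHECK, check_handler_fn, set_check_handler, default_check_handler) is a hypothetical name for illustration, not something libzmq or CZMQ provides. The default handler prints the expression and location and then calls abort(), so the process still stops immediately.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook type: called when a CHECK fails. */
typedef void (*check_handler_fn)(const char *expr, const char *file, int line);

/* Default behaviour: report where the check failed, then abort so
   the process stops and a core file can be written. */
static void default_check_handler(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: check failed: %s\n", file, line, expr);
    abort();
}

static check_handler_fn check_handler = default_check_handler;

/* Install a handler with richer diagnostics (extra logging, a
   backtrace, and so on). Passing NULL is itself a caller bug. */
void set_check_handler(check_handler_fn fn)
{
    assert(fn != NULL);
    check_handler = fn;
}

/* Unlike the standard assert, this stays active under NDEBUG. */
#define CHECK(expr) \
    ((expr) ? (void) 0 : check_handler(#expr, __FILE__, __LINE__))
```

An application that wants more context in its logs would install its own handler once at startup. The assumption in this sketch is that any such handler still ends by terminating the process rather than returning to the caller.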

People who won't use a library because it contains asserts are
misinformed, and remind me of people who think adding extra brokers to
a network will magically add message "reliability".

We get these threads every six months or so, and over time my attitude
to asserts gets more militant, as I watch the quality of the code
that's built on them, and the vast lack of counter-data grow and grow.
You get more reliable software by eliminating those strange unexpected
code paths, not by adding error handling into the mix.

-Pieter

On Wed, Aug 13, 2014 at 2:08 AM, Thomas Rodgers <rodgert at twrodgers.com> wrote:
>> At least for languages that support exceptions, I believe throwing an
>> exception for invalid arguments is far preferable to just killing the
>> process.
>
>
> I write a lot of C++ code for automated trading systems, so I come at this
> from the view that there is no way in this world to light yourself on fire
> faster than making the same stupid trade over and over in a tight loop.  My
> experience has been that error recovery logic is almost always poorly
> exercised and never works entirely as intended when Bad Things happen.
>
> Spending time writing Erlang based systems has also changed my view to favor
> the "just let it crash" approach (note, Erlang also has exceptions, but they
> are not the most important feature of its error handling/recovery model).
> These days, I do not generally expect exceptions to be recoverable.  They
> are used as a mechanism where I can hang additional reporting on what the
> failure was, and the context within which it happened, on the way to a top
> level handler that does nothing but log the failure and terminate the
> process.  It is then the responsibility of an external process to put the
> system back into a known good state and restart the failed process.
>
> To some extent, a library that aborts the process out from underneath me
> denies me the opportunity to gather more context into my logs before
> terminating, but core files are useful things for post mortem debugging.
>
>
> On Tue, Aug 12, 2014 at 6:35 PM, Michi Henning <michi at triodia.com> wrote:
>>
>> > My current view on what constitutes a sane API and behavior from the
>> > library is heavily driven by what I want, as a user. That is, my C
>> > libraries are things I primarily make to use, not to sell. I think
>> > it's been about 30 years since I wrote my first C libraries, and my
>> > style and view have shifted massively since then, to what we have in
>> > cases like CZMQ today.
>>
>> I can attest to having undergone a similar change of view over the past 30
>> years :-)
>>
>> > Mainly, the API enforces its style upwards, so that you simply
>> > *cannot* get strange code paths and bizarre arguments. If you do, your
>> > application is corrupt, or incompetent, and the library has a
>> > responsibility to stop things immediately, not allow them to continue.
>> >
>> > It is a safety cord that has proven its usefulness many times. Indeed,
>> > some of the hardest bugs to catch in recent months were from older
>> > APIs that precisely returned EINVAL on bad arguments, and where the
>> > calling code forgot to check the return code. Stuff is then...
>> > bizarrely broken and tracking that down can be insanely hard.
>>
>> I hear you, and there is probably not a single one true answer here.
>>
>> Part of the problem is C, which makes it possible to ignore error codes
>> and just blithely stumble on regardless.
>>
>> In languages with exception handling, it's a different matter though,
>> because I can force the caller to pay attention to invalid arguments.
>>
>> My main concern is that, by aborting in the library, it becomes very
>> difficult to write something that needs to have high reliability. Basically,
>> I can be sure that my program won't dump core only if I have exercised it to
>> the extent that all possible code paths with all possible argument values
>> are tested under all possible combinations. For any sizeable program
>> (especially with lots of threads and asynchronous things going on), that can
>> be damn near impossible.
>>
>> In turn, if I still want to persist, I now have to wrap the underlying C
>> API and check all the preconditions for every API call myself, just so I can
>> throw an exception when a pre-condition is violated instead of having the
>> program aborted by the library. But validating the pre-conditions myself may
>> well be very difficult or very expensive. For example, the cost of verifying
>> that a valid socket pointer is passed to every API call is quite high.
>>
>> If I'm given the option of catching an exception, I may be able to recover
>> from my own programming error, for example, by terminating only the current
>> operation. At least, the program keeps running, instead of dumping core, and
>> I can splatter my log with error messages or whatever I deem appropriate.
>> The point here is that general-purpose libraries should avoid setting
>> policy, because what should happen under certain error conditions is
>> something that needs to be under control of the caller.
>>
>> I hear you about the difficulty of debugging code that ignores EINVAL from
>> API calls. But that is the price of programming in C. It's no different from
>> making system calls and ignoring the return value; do so at your peril. But
>> a system call is policy-free: it allows me to decide what should happen when
>> I have passed bad arguments, instead of taking that decision away from me.
>>
>> At least for languages that support exceptions, I believe throwing an
>> exception for invalid arguments is far preferable to just killing the
>> process.
>>
>> Cheers,
>>
>> Michi.
>> _______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev