[zeromq-dev] zeromq, abort(), and high reliability environments

Prabhakara Yellai (pyellai) pyellai at cisco.com
Wed Aug 13 16:29:33 CEST 2014

Coming from carrier-class systems, and having worked in High Availability for more than a decade, I have seen that error-handling decisions are better left to the application than made by library code. The same error or failure may be fatal to some apps but not to others. Library code should try to recover from transient errors by retrying a few times. If it cannot recover, it should return the error to the app. Let the app determine whether to (1) restart to recover, (2) ignore the error because the operation it was performing is not critical to the system, or (3) let the user/operator determine when and how to recover ('when' is very important). We actually changed several libraries from asserting inside the library to returning the error to the app, to reduce process crashes and to increase system stability, resiliency and fault containment.
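
To make the retry-then-return idea above concrete, here is a minimal sketch in C++. The helper name `retry_transient`, the choice of `EAGAIN`/`EINTR` as the "transient" errors, and the attempt count are all assumptions for illustration, not part of any ZeroMQ API.

```cpp
#include <cerrno>
#include <functional>

// Hypothetical helper (not a libzmq or CZMQ function).
// `op` follows the POSIX convention: it returns 0 on success, or -1 with
// errno set on failure. Errors treated as transient (an assumption of this
// sketch: EAGAIN and EINTR) are retried up to `max_attempts` times. Any
// other error, or running out of attempts, is handed back to the caller
// with errno left as the last attempt set it.
int retry_transient(const std::function<int()>& op, int max_attempts) {
    if (!op || max_attempts < 1) {
        errno = EINVAL;  // report misuse to the caller instead of aborting
        return -1;
    }
    int rc = -1;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        rc = op();
        if (rc == 0) {
            return 0;  // success
        }
        if (errno != EAGAIN && errno != EINTR) {
            break;  // not transient: stop retrying, let the app decide
        }
    }
    return rc;
}
```

The application then inspects the return code and errno and picks its own policy: restart, ignore, or escalate to an operator.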


-----Original Message-----
From: zeromq-dev-bounces at lists.zeromq.org [mailto:zeromq-dev-bounces at lists.zeromq.org] On Behalf Of Pieter Hintjens
Sent: Wednesday, August 13, 2014 12:08 PM
To: ZeroMQ development list
Subject: Re: [zeromq-dev] zeromq, abort(), and high reliability environments

What we know, and what anyone who's written production systems knows, is that, as Thomas says, when things go badly wrong, error handling simply will not work. There is a whole science to software reliability, much as there is for messaging reliability. Rule #1, there as well, is simpler = more reliable. Error handling creates complexity. Especially error handling by an application on its own internal consistency... that's IMO a recipe for fragile software.

In our APIs we've stripped down error reporting to a minimum. libzmq with its POSIX tendencies still relies IMO far too heavily on subtle error returns (errno == EAGAIN vs. errno == EINVAL?). CZMQ is much cleaner: a method works, or fails if there's a recoverable error, or asserts if there's an unrecoverable error.
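
A rough sketch of that three-way contract, written in C++ for illustration. `BoundedQueue` is a hypothetical class invented for this example, not a CZMQ type: a full queue is treated as a recoverable condition and reported with -1, while a null argument is treated as a caller bug and asserted on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical class used only to illustrate the contract; not part of CZMQ.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);  // a zero-capacity queue is a programming error
    }

    // Returns 0 if the item was stored, -1 if the queue is full (recoverable:
    // the caller can drain the queue and try again). A null item is a bug in
    // the calling code, so it trips the assert instead of returning an error.
    int push(const char* item) {
        assert(item != nullptr);
        if (items_.size() >= capacity_) {
            return -1;
        }
        items_.push_back(item);
        return 0;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::size_t capacity_;
    std::vector<std::string> items_;
};
```

The caller has only one error code to check, and there is nothing like EINVAL to silently ignore.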

Note also that "assert" is just a name. Like "malloc", this can be replaced with more sophisticated diagnostics if there's profit in doing that. The adding in of extra diagnostics is fine. I'm usually more than happy with a stack backtrace. An assert invariably leads me to a bug in my application code and a fast, accurate fix.
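
As a sketch of what "replacing assert" could look like, the following C++ fragment routes a failed check through a swappable handler. The names `APP_ASSERT`, `FailureHandler` and `failure_handler` are made up for this example; neither libzmq nor CZMQ is claimed to provide them.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical hook type: receives the failed expression and its location.
using FailureHandler = void (*)(const char* expr, const char* file, int line);

// Default behaviour mirrors plain assert: print a diagnostic, then abort.
inline void default_failure_handler(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    std::abort();
}

// Returns a reference to the currently installed handler so that an
// application can swap in richer diagnostics (backtrace, log flush, ...).
inline FailureHandler& failure_handler() {
    static FailureHandler handler = default_failure_handler;
    return handler;
}

// Drop-in replacement for assert that goes through the installed handler.
#define APP_ASSERT(expr) \
    ((expr) ? (void)0 : failure_handler()(#expr, __FILE__, __LINE__))
```

An application that wants a backtrace or extra logging installs its own handler once at startup, for example `failure_handler() = my_handler;`, where `my_handler` is whatever function it supplies.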

People who won't use a library because it contains asserts are misinformed, and remind me of people who think adding extra brokers to a network will magically add message "reliability".

We get these threads every six months or so, and over time my attitude to asserts gets more militant, as I watch the quality of the code that's built on them, and the vast lack of counter-data grow and grow. You get more reliable software by eliminating those strange unexpected code paths, not by adding error handling into the mix.


On Wed, Aug 13, 2014 at 2:08 AM, Thomas Rodgers <rodgert at twrodgers.com> wrote:
>> At least for languages that support exceptions, I believe throwing an 
>> exception for invalid arguments is far preferable to just killing the 
>> process.
> I write a lot of C++ code for automated trading systems, so I come at 
> this from the view that there is no way in this world to light 
> yourself on fire faster than making the same stupid trade over and 
> over in a tight loop.  My experience has been that error recovery 
> logic is almost always poorly exercised and never works entirely as intended when Bad Things happen.
> Spending time writing Erlang based systems has also changed my view to 
> favor the "just let it crash" approach (note, Erlang also has 
> exceptions, but they are not the most important feature of its error handling/recovery model).
> These days, I do not generally expect exceptions to be recoverable.  
> They are used as a mechanism where I can hang additional reporting on 
> what the failure was, and the context within which it happened, on the 
> way to a top level handler that does nothing but log the failure and 
> terminate the process.  It is then the responsibility of an external 
> process to put the system back into a known good state and restart the failed process.
> To some extent, a library that aborts the process out from underneath 
> me denies me the opportunity to gather more context into my logs 
> before terminating, but core files are useful things for post mortem debugging.
> On Tue, Aug 12, 2014 at 6:35 PM, Michi Henning <michi at triodia.com> wrote:
>> > My current view on what constitutes a sane API and behavior from 
>> > the library is heavily driven by what I want, as a user. That is, 
>> > my C libraries are things I primarily make to use, not to sell. I 
>> > think it's been about 30 years since I wrote my first C libraries, 
>> > and my style and view has shifted massively since then, to what we 
>> > have in cases like CZMQ today.
>> I can attest to having undergone a similar change of view over the 
>> past 30 years :-)
>> > Mainly, the API enforces its style upwards, so that you simply
>> > *cannot* get strange code paths and bizarre arguments. If you do, 
>> > your application is corrupt, or incompetent, and the library has a 
>> > responsibility to stop things immediately, not allow them to continue.
>> >
>> > It is a safety cord that has proven its usefulness many times. 
>> > Indeed, some of the hardest bugs to catch in recent months were 
>> > from older APIs that precisely returned EINVAL on bad arguments, 
>> > and where the calling code forgot to check the return code. Stuff is
>> > then... bizarrely broken, and tracking that down can be insanely hard.
>> I hear you, and there is probably not a single one true answer here.
>> Part of the problem is C, which makes it possible to ignore error 
>> codes and just blithely stumble on regardless.
>> In languages with exception handling, it's a different matter though, 
>> because I can force the caller to pay attention to invalid arguments.
>> My main concern is that, by aborting in the library, it becomes very 
>> difficult to write something that needs to have high reliability. 
>> Basically, I can be sure that my program won't dump core only if I 
>> have exercised it to the extent that all possible code paths with all 
>> possible argument values are tested under all possible combinations. 
>> For any sizeable program (especially with lots of threads and 
>> asynchronous things going on), that can be damn near impossible.
>> In turn, if I still want to persist, I now have to wrap the 
>> underlying C API and check all the preconditions for every API call 
>> myself, just so I can throw an exception when a pre-condition is 
>> violated instead of having the program aborted by the library. But 
>> validating the pre-conditions myself may well be very difficult or 
>> very expensive. For example, the cost of verifying that a valid socket pointer is passed to every API call is quite high.
>> If I'm given the option of catching an exception, I may be able to 
>> recover from my own programming error, for example, by terminating 
>> only the current operation. At least, the program keeps running, 
>> instead of dumping core, and I can splatter my log with error messages or whatever I deem appropriate.
>> The point here is that general-purpose libraries should avoid setting 
>> policy, because what should happen under certain error conditions is 
>> something that needs to be under control of the caller.
>> I hear you about the difficulty of debugging code that ignores EINVAL 
>> from API calls. But that is the price of programming in C. It's no 
>> different from making system calls and ignoring the return value; do 
>> so at your peril. But a system call is policy-free: it allows me to 
>> decide what should happen when I have passed bad arguments, instead of taking that decision away from me.
>> At least for languages that support exceptions, I believe throwing an 
>> exception for invalid arguments is far preferable to just killing the 
>> process.
>> Cheers,
>> Michi.
>> _______________________________________________
>> zeromq-dev mailing list
>> zeromq-dev at lists.zeromq.org
>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev