[zeromq-dev] zeromq, abort(), and high reliability environments

Michi Henning michi at triodia.com
Wed Aug 13 01:35:18 CEST 2014


> My current view on what constitutes a sane API and behavior from the
> library is heavily driven by what I want, as a user. That is, my C
> libraries are things I primarily make to use, not to sell. I think
> it's been about 30 years since I wrote my first C libraries, and my
> style and view has shifted massively since then, to what we have in
> cases like CZMQ today.

I can attest to having undergone a similar change of view over the past 30 years :-)

> Mainly, the API enforces its style upwards, so that you simply
> *cannot* get strange code paths and bizarre arguments. If you do, your
> application is corrupt, or incompetent, and the library has a
> responsibility to stop things immediately, not allow them to continue.
> 
> It is a safety cord that has proven its usefulness many times. Indeed,
> some of the hardest bugs to catch in recent months were from older
> APIs that precisely returned EINVAL on bad arguments, and where the
> calling code forgot to check the return code. Stuff is then...
> bizarrely broken and tracking that down can be insanely hard.

I hear you, and there is probably not a single one true answer here.

Part of the problem is C, which makes it possible to ignore error codes and just blithely stumble on regardless.

In languages with exception handling, it's a different matter though, because I can force the caller to pay attention to invalid arguments.

My main concern is that, by aborting in the library, it becomes very difficult to write something that needs to have high reliability. Basically, I can be sure that my program won't dump core only if I have tested every possible code path with every possible argument value in every possible combination. For any sizeable program (especially with lots of threads and asynchronous things going on), that can be damn near impossible.

In turn, if I still want to persist, I now have to wrap the underlying C API and check all the preconditions for every API call myself, just so I can throw an exception when a precondition is violated instead of having the program aborted by the library. But validating the preconditions myself may well be very difficult or very expensive. For example, the cost of verifying that a valid socket pointer is passed to every API call is quite high.
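
To make that wrapping cost concrete, here is a minimal C++ sketch. Everything in it is an assumption for illustration: `lib_socket` and `lib_send` stand in for an aborting C API and are not real ZeroMQ calls, and `SocketRegistry` and `checked_send` are hypothetical names. The wrapper has to track every live socket itself, and take a lock and do a hash lookup on each call, just to turn a would-be abort into an exception.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <stdexcept>
#include <unordered_set>

// Hypothetical stand-in for a C library call (not a real ZeroMQ function):
// it aborts the process on a bad argument instead of returning an error.
struct lib_socket { std::size_t sent_bytes; };

int lib_send(lib_socket *s, const void *buf, std::size_t len) {
    if (s == nullptr || buf == nullptr) std::abort();
    s->sent_bytes += len;
    return 0;
}

// Hypothetical registry of sockets the application believes are alive.
// The wrapper needs this because a raw pointer cannot be validated otherwise.
class SocketRegistry {
public:
    void add(lib_socket *s) {
        std::lock_guard<std::mutex> guard(mutex_);
        live_.insert(s);
    }
    void remove(lib_socket *s) {
        std::lock_guard<std::mutex> guard(mutex_);
        live_.erase(s);
    }
    bool contains(lib_socket *s) {
        std::lock_guard<std::mutex> guard(mutex_);
        return live_.count(s) != 0;
    }
private:
    std::mutex mutex_;
    std::unordered_set<lib_socket *> live_;
};

// Checks the preconditions up front and throws instead of letting the
// library abort. Every call pays for a lock and a hash lookup.
void checked_send(SocketRegistry &registry, lib_socket *s,
                  const void *buf, std::size_t len) {
    if (!registry.contains(s))
        throw std::invalid_argument("checked_send: unknown socket");
    if (buf == nullptr)
        throw std::invalid_argument("checked_send: null buffer");
    lib_send(s, buf, len);
}
```

Even this sketch only catches pointers that were never registered or were already removed; it cannot cover every precondition the library might enforce internally.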

If I'm given the option of catching an exception, I may be able to recover from my own programming error, for example, by terminating only the current operation. At least, the program keeps running, instead of dumping core, and I can splatter my log with error messages or whatever I deem appropriate. The point here is that general-purpose libraries should avoid setting policy, because what should happen under certain error conditions is something that needs to be under the control of the caller.
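
As a rough illustration of "terminating only the current operation", consider the C++ sketch below. The names `handle_request`, `RunResult`, and `run_all` are hypothetical, and the negative-payload check merely stands in for whatever precondition a real library would enforce. One bad request is logged and dropped; the rest still complete.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical operation that rejects an invalid argument by throwing.
int handle_request(int payload) {
    if (payload < 0)
        throw std::invalid_argument("negative payload");
    return payload * 2;
}

struct RunResult {
    int completed = 0;             // requests that finished normally
    std::vector<std::string> log;  // one entry per rejected request
};

// The caller sets the policy: log the failure, abandon that one request,
// and carry on with the next.
RunResult run_all(const std::vector<int> &requests) {
    RunResult result;
    for (int payload : requests) {
        try {
            handle_request(payload);
            ++result.completed;
        } catch (const std::invalid_argument &e) {
            result.log.push_back(std::string("request rejected: ") + e.what());
        }
    }
    return result;
}
```

A different caller could just as easily rethrow or shut down cleanly in the `catch` block; the library does not need to know which.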

I hear you about the difficulty of debugging code that ignores EINVAL from API calls. But that is the price of programming in C. It's no different from making system calls and ignoring the return value; do so at your peril. But a system call is policy-free: it allows me to decide what should happen when I have passed bad arguments, instead of taking that decision away from me.
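
A small sketch of what "policy-free" means in practice, written in C++ but using the C-style `errno` convention. The function `lib_set_buffer_size` is a made-up stand-in, not a real system call or ZeroMQ function, and `configure_with_fallback` is just one policy a caller might choose. The callee only reports `EINVAL`; what to do about it is the caller's decision.

```cpp
#include <cerrno>

// Hypothetical policy-free call: reports a bad argument through the return
// value and errno, and leaves the setting untouched. It never aborts.
int lib_set_buffer_size(int *setting, int size) {
    if (setting == nullptr || size <= 0) {
        errno = EINVAL;
        return -1;
    }
    *setting = size;
    return 0;
}

// One possible caller policy: if the requested size is rejected,
// retry with a known-good fallback value.
int configure_with_fallback(int *setting, int requested, int fallback) {
    if (lib_set_buffer_size(setting, requested) == 0)
        return 0;
    return lib_set_buffer_size(setting, fallback);
}
```

A caller that ignores the return value of `lib_set_buffer_size` simply keeps running with the old setting, which is exactly the kind of silent breakage described in the quoted text, but the choice to ignore it was the caller's.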

At least for languages that support exceptions, I believe throwing an exception for invalid arguments is far preferable to just killing the process.

Cheers,

Michi.

