[zeromq-dev] czmq: Error Traceability with assert(...) and release code

Christoph Zach czach at rst-automation.de
Mon Mar 10 10:10:44 CET 2014


On Friday 07 March 2014 17:36:21 Pieter Hintjens wrote:
> On Fri, Mar 7, 2014 at 3:13 PM, Christoph Zach <czach at rst-automation.de> wrote:
> 
> > To further use zyre/czmq we are planning on replacing all the assert(...) statements
> > with actual error handling routines.
> 
> As Olaf explains, the asserts cannot ever happen in practice unless
> there is a coding bug in your app or in CZMQ.
> 
> If you can reproduce an assert under "normal" conditions, that is a
> bug that we take very seriously and fix.
> 
> Code that has hit an internal error _cannot_ continue to operate
> sanely. The extensive use of asserts is a deliberate and long-standing
> design choice, and though you may do what you like with your forks of
> the codebase, such patches would be rejected without much pity.
> 
> I'd not trust a system that had asserts disabled. Production code (and
> I've made that my profession for decades) should run with all asserts
> enabled. The correct response to an internal failure is crash fast,
> recover fast. You cannot run a software system reliably when you have
> internal errors. Adding error handling to recover from (by definition)
> unforeseen internal errors makes things less, not more reliable.
In principle we agree that invalid/fatal states must be detected. Let me explain
in more detail why error codes, and not assertions, should be used to
detect them:

1) Context Awareness
The issue with old-school assert statements is that, when they are
enabled, they simply terminate your application on the spot.
E.g. if you have a C++ app with RAII:
[...]
{
    RAIIWrapperX x (...);

    libraryPotentiallyGoingToAssert(....);

} // Never reached if the assert fires. --> The dtor of x is never called!

The issue is that when the library has detected that it has reached
an invalid/unknown/fatal state, it just quits and does not allow
RAIIWrapperX to clean up properly.
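
To make the difference concrete, here is a minimal sketch. All names in it
(g_events, libraryAsserting, libraryReturningError, caller) are hypothetical
stand-ins and not czmq or zyre API. It only illustrates that returning an
error code lets the destructor run, whereas a failing assert() calls abort(),
which ends the process without unwinding the stack.

```cpp
#include <cassert>
#include <string>
#include <vector>

// All names below are hypothetical stand-ins, not czmq or zyre API.
static std::vector<std::string> g_events;  // records what ran, in order

struct RAIIWrapperX {
    RAIIWrapperX()  { g_events.push_back("acquire"); }
    ~RAIIWrapperX() { g_events.push_back("release"); }
};

// Assert style: if the check fails (and NDEBUG is not defined), assert()
// calls abort(), which ends the process without unwinding the stack.
void libraryAsserting(bool state_ok) {
    assert(state_ok);
}

// Error-code style: the same check, reported to the caller instead.
int libraryReturningError(bool state_ok) {
    return state_ok ? 0 : -1;
}

int caller(bool state_ok) {
    RAIIWrapperX x;
    int rc = libraryReturningError(state_ok);
    if (rc != 0)
        g_events.push_back("error seen");
    return rc;
}  // dtor of x runs here on both the success and the error path
```

Calling caller(false) records "acquire", "error seen" and "release", in that
order. Swapping in libraryAsserting(false) would end the process before
"release" is ever recorded.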

The issue with the assumption 
    "You cannot run a software system reliably when you have internal errors" 
is that it ignores 'reduced functionality' states.
When a library has entered an unknown/invalid state, it
does NOT follow that the other parts of the system have too! 
Therefore, the other parts must be given a chance to clean up as much 
as possible. 
Please note that this does not protect against Machiavellian errors, where
someone simply corrupts the whole memory of your application. But then
again, there are unit testing and valgrind to detect such things.

2) Unit Testing
When unit testing a library, there are different kinds of tests. E.g. one test
can validate that the function f() does what it should do. 
Another test can then validate that f() protects itself against invalid input.
This means that no matter how invalid the given arguments are, the 
function f() reports an error and does not crash the application.
This kind of test (a.k.a. 'invalid parameter detection') is only possible with
error codes. If assert(...) statements are used, f() can never be fully tested.
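
As a rough illustration of the two kinds of tests, here is a sketch. The
function f(), the FResult codes and the test names are made up for this
example and are not taken from czmq. Because f() reports invalid input through
its return value, the second test can feed it bad arguments and still run to
completion.

```cpp
#include <cstddef>

// Hypothetical error codes and function; not part of czmq.
enum FResult { F_OK = 0, F_ERR_NULL_ARG = -1, F_ERR_EMPTY = -2 };

// Sums `count` values into *sum_out; invalid input is reported via the
// return code instead of terminating the process.
FResult f(const int* values, std::size_t count, long* sum_out) {
    if (values == nullptr || sum_out == nullptr) return F_ERR_NULL_ARG;
    if (count == 0) return F_ERR_EMPTY;
    long sum = 0;
    for (std::size_t i = 0; i < count; ++i) sum += values[i];
    *sum_out = sum;
    return F_OK;
}

// Test 1: f() does what it should do.
bool test_valid_input() {
    const int data[] = {1, 2, 3};
    long sum = 0;
    return f(data, 3, &sum) == F_OK && sum == 6;
}

// Test 2: f() rejects invalid input and the test keeps running.
bool test_invalid_input() {
    const int data[] = {1};
    long sum = 0;
    return f(nullptr, 3, &sum) == F_ERR_NULL_ARG
        && f(data, 1, nullptr) == F_ERR_NULL_ARG
        && f(data, 0, &sum) == F_ERR_EMPTY;
}
```

If f() used assert(...) for these checks, the first bad argument in
test_invalid_input() would abort the whole test run (or, with asserts
compiled out, go unchecked).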

3) Design Principle: "An API must be easy to use correctly and hard to
use incorrectly".
This is from Scott Meyers' article "The Most Important Design 
Guideline?". Besides this article, he has also written some pretty good books
on how to write/design good C++ software. They are on the same level as 
Herb Sutter's books.

> 
> What can be helpful is to replace the assert() macro with a more
> extensive error reporting system. 
That was my original intention. Instead of calling assert() and killing the
program, simply provide the user with an error code and a verbose message. Then
it's the user's responsibility to handle it correctly and clean up everything else.
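
One possible shape for this, as a sketch only: a check macro that records
where the check failed and returns an error code to the caller. ErrorInfo,
CHECK_OR_RETURN, setQueueSize and describe are hypothetical names invented
for this example; nothing like them exists in czmq as far as this mail is
concerned.

```cpp
#include <string>

// Hypothetical error record and macro; illustrative names only.
struct ErrorInfo {
    std::string expression;  // text of the failed check
    std::string file;
    int line = 0;            // 0 means "no error recorded"
};

// Instead of aborting, record where the check failed and return `code`.
#define CHECK_OR_RETURN(cond, err, code)   \
    do {                                   \
        if (!(cond)) {                     \
            (err).expression = #cond;      \
            (err).file = __FILE__;         \
            (err).line = __LINE__;         \
            return (code);                 \
        }                                  \
    } while (0)

// Example library function: returns 0 on success, -1 on a failed check.
int setQueueSize(int requested, int& current, ErrorInfo& err) {
    CHECK_OR_RETURN(requested > 0, err, -1);
    current = requested;
    return 0;
}

// Formats the record as "file:line: check failed: expression".
std::string describe(const ErrorInfo& err) {
    return err.file + ":" + std::to_string(err.line) +
           ": check failed: " + err.expression;
}
```

The caller checks the return value, logs describe(err) if it is non-zero, and
then decides for itself how much to clean up before shutting down.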

> However be careful you don't try to
> do too much: the state of the application when it hits an assert is
> unknown. You can have arbitrary memory corruption, for instance. Doing
> *anything* more than "print error & exit" leaves you open to worse
> damage.
To protect against such an issue, the only thing we can do is write
defensive code:
 * use const as much as possible
 * validate all input
 * report verbose errors (to better track the issue when the customer 
   reports it)
 * use unit testing (test against good and bad cases)
 * use the type system as much as possible
 * use valgrind when running unit tests
 * etc.

By applying all these (and many more) methods it's possible to
reduce the probability of such an event. That's all we can
do, because at run-time, if we detect a violated invariant, we cannot tell
whether it's wise to shut down immediately. Therefore, we should try to clean
up as much as possible. 

> 
> -Pieter
Best Regards

Christoph Zach

-----------------------------------------------------------------------------
RST Industrie Automation GmbH * Carl-Zeiss-Str. 51, D-85521 Ottobrunn 
Tel. +49-89-9616018-00 * Fax +49-89-9616018-10 * http://www.rst-automation.de

Geschäftsführer: Dipl.-Ing.(FH) Robert Schachner 
Amtsgericht München: HRB 103 626 * ID-Nr. DE 811 466 035
-----------------------------------------------------------------------------


