[zeromq-dev] czmq: Error Traceability with assert(...) and release code

Christoph Zach czach at rst-automation.de
Mon Mar 10 12:10:14 CET 2014


On Monday 10 March 2014 10:36:22 Pieter Hintjens wrote:
> This theory is fine in theory. 
Please note that everything I wrote is based on real-world projects with
strict requirements on safety and software resilience.
Here's a requirements excerpt from one of our projects:
1) The customer needs a test system (HIL tests).
2) The test system validates highly safety-critical components.
3) Every test must be logged correctly. Regardless of the test's state,
it must be ensured that whenever an engineer tests a hardware component,
the test is logged.
4) The system must be in the customer-provided safe states whenever possible,
to avoid harming the test engineers and destroying the tested hardware.
5) The customer wants to create its own test scripts, which run on top of
various client RPC libraries (zmq).
6) The customer wants to use RAII (C++) or with/finally (Python) to ensure
that he/she can clean up properly.

So by simply calling assert() and exiting the application, points 6, 5 and 4
have just been violated. To solve this, all the assert(...) statements could be
replaced with, e.g., ZMQ_ASSERT( toAssert, message ). The API would then check
'toAssert' internally and, on a violation, return a ZMQ_INVARIANT_ERROR error
code. In addition, 'message' would be logged to give verbose information about
the context. Later, in case of an error, the customer can simply send an error
report with all the verbose information. This makes it easy to identify and
fix the error.
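
A minimal sketch of how such a macro could look, and of why it keeps RAII
working. ZMQ_ASSERT and ZMQ_INVARIANT_ERROR are only the names proposed above;
neither exists in libzmq or CZMQ. The error value, library_call(),
SafeStateGuard and run_test() are hypothetical and exist purely for
illustration. The macro assumes the enclosing function returns int.

```cpp
#include <cstdio>

// Hypothetical error code from the proposal above; not part of libzmq or CZMQ.
constexpr int ZMQ_INVARIANT_ERROR = -1;

// Hypothetical macro: logs the failed condition plus context, then returns
// an error code from the enclosing function instead of calling abort().
#define ZMQ_ASSERT(toAssert, message)                                     \
    do {                                                                  \
        if (!(toAssert)) {                                                \
            std::fprintf(stderr, "%s:%d: invariant '%s' violated: %s\n",  \
                         __FILE__, __LINE__, #toAssert, (message));       \
            return ZMQ_INVARIANT_ERROR;                                   \
        }                                                                 \
    } while (0)

// Hypothetical library call that checks an invariant on its input.
int library_call(int frame_count)
{
    ZMQ_ASSERT(frame_count >= 3, "reply needs at least 3 frames");
    return 0;
}

// Hypothetical RAII wrapper; its destructor marks the system as safe again.
struct SafeStateGuard {
    bool &safe;
    explicit SafeStateGuard(bool &s) : safe(s) { safe = false; }
    ~SafeStateGuard() { safe = true; }
};

// Caller: the guard's destructor runs even when the library reports an error,
// because the error is a normal return and not a process abort.
int run_test(int frame_count, bool &safe)
{
    SafeStateGuard guard(safe);
    return library_call(frame_count);
}
```

With a plain assert() in library_call(), the process would be aborted and the
destructor of SafeStateGuard would not run.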


> In practice, could you provide a case
> that reproduces the crash you got?
https://github.com/imatix/zguide/blob/master/examples/C%2B%2B/mdcliapi.hpp
lines 150-156. If somebody sends garbage data, the application will simply
exit. In such a case, the API should inform the user that it received garbage,
so the user can clean up the context etc.
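
As a rough sketch of the alternative, incoming frames could be validated and
the result reported to the caller. This is not the code from mdcliapi.hpp: the
ReplyStatus enum, validate_reply() and the representation of a message as a
vector of strings are assumptions made for illustration, as is the expected
frame layout (at least three frames, a "MDPC01" protocol header first, then
the service name).

```cpp
#include <string>
#include <vector>

// Hypothetical status codes reported to the caller instead of aborting.
enum class ReplyStatus { Ok, TooFewFrames, BadHeader, WrongService };

// Hypothetical validation of a reply; frames stand in for message parts.
ReplyStatus validate_reply(const std::vector<std::string> &frames,
                           const std::string &expected_service)
{
    if (frames.size() < 3)
        return ReplyStatus::TooFewFrames;   // garbage: message too short
    if (frames[0] != "MDPC01")              // assumed client protocol header
        return ReplyStatus::BadHeader;
    if (frames[1] != expected_service)
        return ReplyStatus::WrongService;
    return ReplyStatus::Ok;
}
```

The caller can then log the status, drop the message and shut down its context
in an orderly way. A unit test can also feed garbage to validate_reply() and
check the returned status, which is not possible when the function aborts.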


> 
> 
> On Mon, Mar 10, 2014 at 10:10 AM, Christoph Zach
> <czach at rst-automation.de> wrote:
> > On Friday 07 March 2014 17:36:21 Pieter Hintjens wrote:
> >> On Fri, Mar 7, 2014 at 3:13 PM, Christoph Zach <czach at rst-automation.de> wrote:
> >>
> >> > To keep using zyre/czmq, we are planning on replacing all the assert(...)
> >> > statements with actual error handling routines.
> >>
> >> As Olaf explains, the asserts cannot ever happen in practice unless
> >> there is a coding bug in your app or in CZMQ.
> >>
> >> If you can reproduce an assert under "normal" conditions, that is a
> >> bug that we take very seriously and fix.
> >>
> >> Code that has hit an internal error _cannot_ continue to operate
> >> sanely. The extensive use of asserts is a deliberate and long-standing
> >> design choice, and though you may do what you like with your forks of
> >> the codebase, such patches would be rejected without much pity.
> >>
> >> I'd not trust a system that had asserts disabled. Production code (and
> >> I've made that my profession for decades) should run with all asserts
> >> enabled. The correct response to an internal failure is crash fast,
> >> recover fast. You cannot run a software system reliably when you have
> >> internal errors. Adding error handling to recover from (by definition)
> >> unforeseen internal errors makes things less, not more reliable.
> > Semantically, we agree on detecting invalid/fatal states. Let me explain
> > (in more detail) why error codes, not assertions, should be used to
> > detect these:
> >
> > 1) Context Awareness
> > The issue with the old school assert statements is that they will
> > simply quit your application immediately. Even when you have enabled
> > them. E.g. If you have a C++ app with RAII:
> > [...]
> > {
> >     RAIIWrapperX x (...);
> >
> >     libraryPotentiallyGoingToAssert(....);
> >
> > } // Never reached, so the dtor of x is never called!
> >
> > The issue is that when the library has detected that it has reached
> > an invalid/unknown/fatal state, it just quits and does not allow the
> > RAIIWrapperX to clean up properly.
> >
> > The issue with the assumption
> >     "You cannot run a software system reliably when you have internal errors"
> > is that 'reduced functionality' states are ignored.
> > This means that when a library has entered an unknown/invalid state it
> > does NOT mean that the other parts of the system have too!
> > Therefore, the other parts must be given a chance to clean up as much
> > as possible.
> > Please note that this does not protect against Machiavellian errors, where
> > someone simply corrupts the whole memory of your application. But then
> > again there's Unit Testing and valgrind to determine such things.
> >
> > 2) Unit Testing
> > When unit testing a library, there are different kinds of tests. E.g. one test
> > can validate that the function f() does what it should do.
> > Another test can then validate that f() protects itself against invalid input.
> > This means that no matter how invalid the given arguments are, the
> > function f() reports an error and does not crash the application.
> > This test (a.k.a. 'invalid parameter detection') is only possible with
> > error codes. If assert(...) statements are used, f() can never be fully tested.
> >
> > 3) Design Principle: "An API must be easy to use correctly and hard to
> > use incorrectly".
> > This is from Scott Meyers' article "The Most Important Design
> > Guideline?". Besides this article, he has also written some very good books
> > on how to write/design good C++ software. They are on the same level as
> > Herb Sutter's books.
> >
> >>
> >> What can be helpful is to replace the assert() macro with a more
> >> extensive error reporting system.
> > That was my original intention. Instead of calling assert() and killing the
> > program, simply provide the user with a verbose error & message. Then it's
> > the user's responsibility to handle it correctly and clean up everything else.
> >
> >> However be careful you don't try to
> >> do too much: the state of the application when it hits an assert is
> >> unknown. You can have arbitrary memory corruption, for instance. Doing
> >> *anything* more than "print error & exit" leaves you open to worse
> >> damage.
> > To protect against such an issue, the only thing we can do is write
> > defensive code:
> >  * const as much as possible
> >  * validate all input
> >  * report verbose errors (to better track the issue when the customer
> >    reports it)
> >  * use unit testing (test against good and bad cases)
> >  * use the type system as much as possible
> >  * use valgrind when running unit tests
> >  * etc.
> >
> > By applying all these (and many more) methods it's possible to
> > reduce the probability of such an event. That's everything we can
> > do, because at run-time, if we detect an invariant violation, we cannot
> > tell whether it's wise to shut down immediately. Therefore, we should try
> > to clean up as much as possible.
> >
> >>
> >> -Pieter
> > Best Regards
> >
> > Christoph Zach
> >
> > -----------------------------------------------------------------------------
> > RST Industrie Automation GmbH * Carl-Zeiss-Str. 51, D-85521 Ottobrunn
> > Tel. +49-89-9616018-00 * Fax +49-89-9616018-10 * http://www.rst-automation.de
> >
> > Geschäftsführer: Dipl.-Ing.(FH) Robert Schachner
> > Amtsgericht München: HRB 103 626 * ID-Nr. DE 811 466 035
> > -----------------------------------------------------------------------------
Best Regards

Christoph Zach



