Tony Abernethy wrote:
> Geoff Steckel wrote:
>> And yes, error recovery is a very significant part of any non-trivial
>> useful program which does (for instance) network I/O, because the
>> universe of possible errors is large.
> Error recovery?
> Does anyone ever debug error recovery?
> Is there any way anyone **COULD** debug error recovery?
> An order of magnitude more complicated and no tools --- predictable.
> Maybe I'm overly pessimistic, but if so, (try to) prove me wrong.

In the general case you're absolutely correct.

If you separate errors into

   local environment problems: sudden memory shortage,
   disk I/O error, hardware errors
        Usually these can only be dealt with by exiting
        as cleanly as possible as quickly as possible

   internal program error/inconsistencies
        Exit cleanly after recording as much as possible
        about the input provoking the problem
        These are (almost) always symptoms of bugs.

   input data malformed, inconsistent, or missing
   peer or server failure (no communication or bad communication)
        This is the area which can be dealt with.

A deceptively simple strategy handles the last case.

   there must be only one point in the entire program which
   can block waiting for input from the outside world.

   that point must have an analysis function which
   decodes the message, and a scheduler functionality which
   vectors depending on the decoded message and explicit
   current state

   all functionality dispatches from this point and
   returns to this point without blocking for anything
   other than local disk I/O, returning a "what was
   done" code which the scheduler uses to compute the
   next state. Errors detected are reflected in the
   "what was done" code.

   explicit error states are set as needed

   all errors have explicit new states from an explicit
   state transition table - NOT from random if-then-elses

   all input and output messages can be traced for
   debugging
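
A minimal sketch of the strategy above, in Python. The START/REPLY/TIMEOUT
messages, the state names, and the handlers are hypothetical, invented for
illustration; they are not from any real protocol. An iterator of strings
stands in for the single blocking read.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    WAITING = auto()
    FAILED = auto()   # explicit error state
    DONE = auto()

class Did(Enum):      # the "what was done" code
    SENT_REQUEST = auto()
    ACCEPTED_REPLY = auto()
    REJECTED_INPUT = auto()
    PEER_SILENT = auto()

def decode(raw):
    """Analysis function: classify one raw message (hypothetical format)."""
    parts = raw.split()
    if parts == ["START"]:
        return ("start", None)
    if len(parts) == 2 and parts[0] == "REPLY" and parts[1].isdecimal():
        return ("reply", int(parts[1]))
    if parts == ["TIMEOUT"]:
        return ("timeout", None)
    return ("malformed", raw)

# Handlers never block; each one only reports what it did.
def send_request(payload): return Did.SENT_REQUEST
def accept_reply(payload): return Did.ACCEPTED_REPLY
def reject(payload): return Did.REJECTED_INPUT
def peer_silent(payload): return Did.PEER_SILENT

# Scheduler vector: (current state, decoded message kind) -> handler.
# Anything not listed is treated as bad input for that state.
DISPATCH = {
    (State.IDLE, "start"): send_request,
    (State.WAITING, "reply"): accept_reply,
    (State.WAITING, "timeout"): peer_silent,
}

# Explicit state transition table: (state, what was done) -> next state.
TRANSITIONS = {
    (State.IDLE, Did.SENT_REQUEST): State.WAITING,
    (State.IDLE, Did.REJECTED_INPUT): State.IDLE,
    (State.WAITING, Did.ACCEPTED_REPLY): State.DONE,
    (State.WAITING, Did.REJECTED_INPUT): State.FAILED,
    (State.WAITING, Did.PEER_SILENT): State.FAILED,
}

def run(messages, trace=None):
    state = State.IDLE
    for raw in messages:  # the one point that waits for outside input
        kind, payload = decode(raw)
        handler = DISPATCH.get((state, kind), reject)
        did = handler(payload)
        # A missing entry raises KeyError: an internal inconsistency (a bug).
        new_state = TRANSITIONS[(state, did)]
        if trace is not None:
            trace.append((state.name, raw, did.name, new_state.name))
        state = new_state
        if state in (State.DONE, State.FAILED):
            break
    return state
```

Passing a list as trace records every (state, message, what was done,
new state) step, so a run can be inspected afterwards. A real program would
also put timeouts through the same decode step, as the TIMEOUT message
does here, so that a silent peer becomes just another input.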

This -can- be analyzed and (to a greater or lesser extent)
debugged. Some errors cannot be recovered from other than
by cleaning up the debris and exiting or starting over.
It is possible to be reasonably sure that the program will
never hang forever or loop forever (absent internal errors
of class 2 above).

It is isomorphic with a state driven parser. Indeed, some fairly
complex problems can be turned into explicit grammars.
Tools like yacc can be used to generate the required control mechanisms.
A great deal of code then doesn't need to be written.

Doing this requires an analysis of the problem far
deeper than most programmers will do or managers
wait for. Implementing requires a discipline most
programmers will not endure.

It works, and it works well, and it can be checked
by peer review and debug traces, and fixes can be
tested with message replay. Programs built this way
tend not to need repair and repairs tend to be simple,
if not necessarily easy.
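
A minimal Python sketch of testing a fix by message replay. The message
format, the state names, and both classify functions are hypothetical, made
up for illustration. Recorded messages are paired with the state a reviewer
expects, fed back through the transition step, and every divergence is
reported.

```python
# Explicit transition table: (state, decoded message kind) -> next state.
TABLE = {
    ("waiting", "reply"): "done",
    ("waiting", "other"): "failed",
}

def classify_buggy(message):
    # Bug: anything starting with REPLY is accepted, even a malformed one.
    return "reply" if message.startswith("REPLY") else "other"

def classify_fixed(message):
    parts = message.split()
    ok = len(parts) == 2 and parts[0] == "REPLY" and parts[1].isdecimal()
    return "reply" if ok else "other"

def make_step(classify):
    """Build a step function from a decoder and the shared table."""
    def step(state, message):
        return TABLE[(state, classify(message))]
    return step

# Hypothetical recorded trace: (state before, raw message, expected state).
RECORDED = [
    ("waiting", "REPLY 7", "done"),
    ("waiting", "REPLY x", "failed"),
    ("waiting", "garbage", "failed"),
]

def replay(step, recorded):
    """Re-run recorded messages and list every divergence from the trace."""
    mismatches = []
    for index, (state, message, expected) in enumerate(recorded):
        actual = step(state, message)
        if actual != expected:
            mismatches.append((index, state, message, expected, actual))
    return mismatches
```

Replaying RECORDED against make_step(classify_buggy) reports the
"REPLY x" entry; against make_step(classify_fixed) it reports nothing,
and the same recorded trace then serves as a regression test.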

    geoff steckel
