Re: error handling and reporting in the dispatcher

Michael Hudson-Doyle Thu, 10 Nov 2011 15:18:15 -0800

On Thu, 10 Nov 2011 16:48:47 -0600, Paul Larson <paul.lar...@linaro.org> wrote:
> On Thu, Nov 10, 2011 at 4:46 PM, Paul Larson <paul.lar...@linaro.org> wrote:
> > On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle <
> > michael.hud...@linaro.org> wrote:
> > ...
> >
> >>
> >> After all this thinking and typing I think I've come to some
> >> conclusions:
> >>
> >> 1) There is a category of error where we should just stop.  As far as I
> >>   know, this isn't really handled today.
> >>
> >> 2) There is another category of error where we should report failure,
> >>   but not execute any other actions.  This makes me think that having
> >>   'report results' as an action is a bit strange -- perhaps the
> >>   dashboard and bundle stream should really be directly in the job,
> >>   rather than an action?
> >>
> > ISTR we defined some exceptions in lava as being Fatal, or Non-Fatal -
> > the idea being that there would be subclasses of those to add detail.  That
> > way we don't need to decide on every single error, just how far to pass it
> > up before someone can take action on it.  The fatal ones of course would be
> > the ones where we just can't reasonably expect to proceed and gain anything
> > from it (ex. image fails to deploy).


Ah yeah, there is a CriticalError class.  Come to think of it, I think
I'm mostly unsettled[1] by how _other_ errors are handled -- basically I
think they should cause the dispatcher to exit, immediately, with code
!= 0.  Instead we currently usually try to send a bundle to the
dashboard, which if we're in some bizarro situation often fails and
fails in a way that obscures the original problem!

[1] most of the rest is the fact that we currently don't even think
    about errors in the job file.
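
To make the distinction concrete, here is a minimal sketch in Python.
Only the name CriticalError comes from this thread; its real definition
in the dispatcher may differ, and run_job, submit_bundle and the exit
codes are hypothetical names invented for the example.

```python
import traceback


class CriticalError(Exception):
    """Stand-in for the dispatcher's CriticalError; the real class may differ."""


def run_job(actions, submit_bundle):
    """Run actions in order and return a process exit code.

    `actions` is a list of zero-argument callables and `submit_bundle`
    is a callable taking a status string; both are hypothetical.
    """
    try:
        for action in actions:
            action()
    except CriticalError as exc:
        # A known failure (e.g. image fails to deploy): skip the
        # remaining actions but still report the failure.
        submit_bundle("fail: %s" % exc)
        return 1
    except Exception:
        # Anything else is probably a bug: print the original traceback
        # and exit non-zero without contacting the dashboard, so that a
        # failed submission cannot obscure the real problem.
        traceback.print_exc()
        return 2
    submit_bundle("pass")
    return 0
```

A caller would end with something like sys.exit(run_job(actions,
submit_bundle)), so that the exit code reaches whatever started the
dispatcher.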

> >> 3) I need to write another, probably equally long, email about
> >>   dependencies between actions :-)
> >>
> > Ah yes, we spoke a bit about that recently.  I'd love to hear your ideas
> > on it.

I'll get my thinking cap on then!

> I guess I forgot to add some things to the previous email... I'm mainly
> interested in 2 things when it comes to any of these errors:
> 1. highlighting them in a way that makes it easy for us to find out when
> something goes wrong.

Yes, that's true (a good bit of user focus!).  I guess it's worth
thinking about who cares about a particular class of error...

category 1 -- bugs -- that'd be us, the validation team

category 2 -- errors in the job file -- whoever submitted the job

category 3 -- failing tests -- whoever submitted the job

category 4 -- tests that fail to even run -- this one is harder to call
              I guess.  probably whoever submitted the tests as a first
              port of call though, it's kinda similar to a failing test.

category 5 -- board hangs -- depends when it happens, and even what the
              test is testing.  if it's during boot_linaro_image of a
              kernel ci test, then it should be the kernel team.  if
              it's booting a supposedly known good image for toolchain
              testing, then it's probably a lab issue.  I guess _most_
              of the time it's going to be a failure of what is being
              tested.

category 6 -- infrastructure failure -- hard to say again.  l-m-c fails
              sometimes because of bugs but also sometimes due to duff
              hwpacks.

category 7 -- recoverable errors -- the validation team might care,
              certainly no one else will.

So that was worth thinking through.
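
One way to write the list above down is as a lookup table.  This is
only a sketch: the category keys, the Audience enum and the
first_contact function are invented labels, and treating an unknown
category as a bug is an assumption rather than anything decided here.

```python
from enum import Enum


class Audience(Enum):
    VALIDATION_TEAM = "validation team"
    JOB_SUBMITTER = "job submitter"
    DEPENDS = "depends on context"


# Keys are invented labels for categories 1 to 7 in the list above.
FIRST_CONTACT = {
    "bug": Audience.VALIDATION_TEAM,
    "job_file_error": Audience.JOB_SUBMITTER,
    "test_failure": Audience.JOB_SUBMITTER,
    "test_failed_to_run": Audience.JOB_SUBMITTER,
    "board_hang": Audience.DEPENDS,
    "infrastructure_failure": Audience.DEPENDS,
    "recoverable_error": Audience.VALIDATION_TEAM,
}


def first_contact(category):
    """Return who should look at an error first.

    Unknown categories are treated like bugs (an assumption).
    """
    return FIRST_CONTACT.get(category, Audience.VALIDATION_TEAM)
```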

I think in general errors that should be interpreted by the job
submitter should be reported primarily in the dashboard, although I
don't really have a good idea for how to report errors in the job
file.

One way of distinguishing things like l-m-c bugs from duff input data is
to look across jobs -- if all deploy steps are failing, something is
probably wrong in the lab.  This obviously requires a wider view of
what's going on than the dispatcher has :-) but perhaps means we should
think about hooking the dispatcher up to statsd somehow or other.
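
A rough sketch of what that could look like, assuming a statsd daemon
listening on its usual UDP port 8125.  The metric names, the function
names and the threshold are all invented for illustration; the
cross-job check would have to live somewhere with a wider view than the
dispatcher.

```python
import socket


def statsd_counter(name, value=1):
    """Format a counter increment in the plain-text statsd wire format."""
    return ("%s:%d|c" % (name, value)).encode("ascii")


def report_deploy_result(ok, host="localhost", port=8125):
    """Send one UDP datagram counting a deploy success or failure.

    The metric names are hypothetical.
    """
    name = "dispatcher.deploy.ok" if ok else "dispatcher.deploy.fail"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_counter(name), (host, port))
    finally:
        sock.close()


def looks_like_lab_problem(recent_deploys, threshold=1.0):
    """Guess whether deploy failures point at the lab rather than the input.

    `recent_deploys` is a list of booleans gathered across jobs
    (True means the deploy step succeeded).  With the default
    threshold, only "every deploy failed" counts as a lab problem.
    """
    if not recent_deploys:
        return False
    failures = recent_deploys.count(False)
    return float(failures) / len(recent_deploys) >= threshold
```

The dispatcher would only call report_deploy_result; the decision in
looks_like_lab_problem belongs to whatever aggregates the counters.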

> I think some of this, perhaps, goes along with another conversation about
> parsing the serial log and splitting it up into sections.

Yeah, and submitting these sections to the dashboard I think (see
above -- I don't think people should look at the scheduler pages most of
the time).

> 2. capturing the full backtrace (we're better about this now I think, I
> have had much less frustration with this lately)

Yeah, I think I mostly squished that problem :-)

Cheers,
mwh

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
