On Thu, 10 Nov 2011 16:48:47 -0600, Paul Larson <paul.lar...@linaro.org> wrote:
> On Thu, Nov 10, 2011 at 4:46 PM, Paul Larson <paul.lar...@linaro.org> wrote:
> > On Thu, Nov 10, 2011 at 4:32 PM, Michael Hudson-Doyle <
> > michael.hud...@linaro.org> wrote:
> > ...
> >
> >> After all this thinking and typing I think I've come to some
> >> conclusions:
> >>
> >> 1) There is a category of error where we should just stop. As far as I
> >> know, this isn't really handled today.
> >>
> >> 2) There is another category of error where we should report failure,
> >> but not execute any other actions. This makes me think that having
> >> 'report results' as an action is a bit strange -- perhaps the
> >> dashboard and bundle stream should really be directly in the job,
> >> rather than an action?
> >>
> > ISTR we defined some exceptions in lava as being Fatal or Non-Fatal --
> > the idea being that there would be subclasses of those to add detail. That
> > way we don't need to decide on every single error, just how far to pass it
> > up before someone can take action on it. The fatal ones of course would be
> > the ones where we just can't reasonably expect to proceed and gain anything
> > from it (e.g. the image fails to deploy).
Ah yeah, there is a CriticalError class. Come to think of it, I think
I'm mostly unsettled[1] by how _other_ errors are handled -- basically I
think they should cause the dispatcher to exit, immediately, with code
!= 0. Instead we currently usually try to send a bundle to the
dashboard, which, if we're in some bizarro situation, often fails, and
fails in a way that obscures the original problem!

[1] Most of the rest is the fact that we currently don't even think
about errors in the job file.

> >> 3) I need to write another, probably equally long, email about
> >> dependencies between actions :-)
> >>
> > Ah yes, we spoke a bit about that recently. I'd love to hear your ideas
> > on it.

I'll get my thinking cap on then!

> I guess I forgot to add some things to the previous email... I'm mainly
> interested in 2 things when it comes to any of these errors:
> 1. highlighting them in a way that makes it easy for us to find out when
> something goes wrong.

Yes, that's true (a good bit of user focus!). I guess it's worth
thinking about who cares about a particular class of error...

category 1 -- bugs -- that'd be us, the validation team

category 2 -- errors in the job file -- whoever submitted the job

category 3 -- failing tests -- whoever submitted the job

category 4 -- tests that fail to even run -- this one is harder to call,
I guess. Probably whoever submitted the tests as a first port of call,
though; it's kinda similar to a failing test.

category 5 -- board hangs -- depends when it happens, and even what the
test is testing. If it's during boot_linaro_image of a kernel CI test,
then it should be the kernel team. If it's booting a supposedly
known-good image for toolchain testing, then it's probably a lab issue.
I guess _most_ of the time it's going to be a failure of whatever is
being tested.

category 6 -- infrastructure failure -- hard to say, again. l-m-c fails
because of bugs, but also sometimes because of duff hwpacks.

category 7 -- recoverable errors -- the validation team might care;
certainly no one else will.

So that was worth thinking through. I think that, in general, errors
that should be interpreted by the job submitter should be reported
primarily in the dashboard, although I don't really have a good idea for
how to report errors in the job file.

One way of distinguishing things like l-m-c bugs from duff input data is
to look across jobs -- if all deploy steps are failing, something is
probably wrong in the lab. This obviously requires a wider view of
what's going on than the dispatcher has :-) but it perhaps means we
should think about hooking the dispatcher up to statsd somehow or other.

> I think some of this, perhaps, goes along with another conversation about
> parsing the serial log and splitting it up into sections.

Yeah, and submitting those sections to the dashboard, I think (see above
-- I don't think people should have to look at the scheduler pages most
of the time).

> 2. capturing the full backtrace (we're better about this now I think, I
> have had much less frustration with this lately)

Yeah, I think I mostly squished that problem :-)

Cheers,
mwh
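P.S. To make the "exit non-zero and get out of the way" idea above a bit
more concrete, here's a rough sketch of the shape I have in mind. It is
not the real dispatcher code: CriticalError is the class we already
have, but the job/action objects and the exit codes are just made up for
illustration.

    import sys
    import traceback

    class CriticalError(Exception):
        """Fatal: there is no point running any further actions."""

    def run_job(job):
        # Non-fatal errors (the other bucket Paul mentions) would be
        # caught inside the individual actions and turned into results,
        # not propagated this far.
        try:
            for action in job.actions:
                action.run()
        except CriticalError as e:
            # Known-fatal, e.g. the image failed to deploy: say what
            # happened, run nothing further, and exit non-zero.  (This is
            # also where a failure bundle could be submitted.)
            sys.stderr.write("fatal: %s\n" % e)
            sys.exit(1)
        except Exception:
            # A bug in the dispatcher itself: don't try to submit a
            # bundle from whatever bizarro state we're in -- just dump
            # the traceback and exit non-zero so the real problem isn't
            # obscured.
            traceback.print_exc()
            sys.exit(2)

And the statsd thought, equally roughly: the dispatcher would fire a
counter at a statsd daemon whenever, say, a deploy step fails, and the
lab-wide "are _all_ deploys failing?" view gets built outside the
dispatcher. The host, port and metric name below are invented, but the
wire format really is just "name:value|c" over UDP.

    import socket

    def bump_counter(name, host="statsd.example.org", port=8125):
        # statsd counters are plain UDP datagrams of the form "name:1|c".
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(("%s:1|c" % name).encode("ascii"), (host, port))

    # e.g. in the deploy action's error path:
    #     bump_counter("lava.dispatcher.deploy.failed")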