On 10 Oct 2012, at 22:44, Andy Doan <andy.d...@linaro.org> wrote:

> On 10/10/2012 04:17 PM, Michael Hudson-Doyle wrote:
>> Andy Doan <andy.d...@linaro.org> writes:
>>
>>> On 10/10/2012 08:56 AM, Andrey Konovalov wrote:
>>>> Hi Dave,
>>>>
>>>> On 10/10/2012 11:35 AM, Dave Pigott wrote:
>>>>> Hi all,
>>>>>
>>>>> I found an interesting health failure today on origen07
>>>>>
>>>>> http://validation.linaro.org/lava-server/scheduler/job/35016/log_file
>>>>>
>>>>> When you look at the log, you see that the board starts off at the
>>>>> u-boot prompt. It then tries to do a "reboot", which (obviously)
>>>>> fails. So naturally, it then does a hard reset, and this is where it
>>>>> does something very odd: It interrupts the boot and tries to boot the
>>>>> previously installed test image. I haven't yet looked at the
>>>>> dispatcher code to figure out why (that's my next job).
>>>
>>> I'm not sure we can trust anything that occurred in this job file after
>>> the "deploy_linaro_image is finished with error". I think at this point
>>> the dispatcher is in an unknown state and doesn't know what it should be
>>> sending to the serial console.
>>>
>>> In this case, it still tried to do the boot_linaro_image action.
>>> However, we didn't successfully deploy an image, so anything going wrong
>>> there probably can't be trusted. I would have guessed it would have
>>> found the DTB file, but I'm not sure that's worth digging too far into.
>>>
>>> I think the real problem we see here is what you and I discussed on IRC
>>> earlier. There are certain actions in our job file that, if they fail,
>>> should be considered non-recoverable, i.e.:
>>>
>>> * if deploy_linaro_image fails, then boot_linaro_image can't run.
>>> * if boot_linaro_image fails, lava_test_install can't run
>>> * if lava_test_install fails - well that's tricky since it may have
>>> installed some of the tests we need but not all.
>>>
>>> I'm wondering if we need to spend some time trying to improve how
>>> actions relate to one another in code?
>>
>> Yes please. I don't know if we want to do something generic, or just
>> ensure deployment failures raise CriticalError -- which IIUC means no
>> further actions will be attempted.
>
> CriticalError should at least fix the immediate problem.
>
> Dave - you wanna take a stab at that for now, and we can do something more
> elaborate in the future?
>
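
For anyone following along, here is a rough sketch of the CriticalError
idea as I understand it from the discussion above. It is not the
dispatcher's actual code: run_job, the stub actions and the
"pass"/"fail"/"skipped" strings are all made up for illustration, and
CriticalError is redefined locally as a stand-in so the snippet runs on
its own.

```python
class CriticalError(Exception):
    """Stand-in (assumption): a failure after which no further actions run."""


def run_job(actions):
    """Run (name, callable) pairs in order (hypothetical runner).

    A CriticalError aborts the job: every later action is recorded as
    "skipped" instead of being attempted. Any other exception is
    recorded as a failure and the job carries on.
    """
    results = []
    aborted = False
    for name, action in actions:
        if aborted:
            results.append((name, "skipped"))
            continue
        try:
            action()
            results.append((name, "pass"))
        except CriticalError:
            results.append((name, "fail"))
            aborted = True
        except Exception:
            # Non-critical failure: note it and keep going.
            results.append((name, "fail"))
    return results


# Stub actions named after the ones in the job file; bodies are invented.
def deploy_linaro_image():
    raise CriticalError("deployment failed")


def boot_linaro_image():
    pass


def lava_test_install():
    pass


results = run_job([
    ("deploy_linaro_image", deploy_linaro_image),
    ("boot_linaro_image", boot_linaro_image),
    ("lava_test_install", lava_test_install),
])
for name, status in results:
    print(name, status)
```

With the deploy step failing this way, the boot and test-install steps
are never attempted, which is the behaviour we wanted in job 35016.
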
Yep. Will look at it today.

Thanks

Dave

_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev