On 10 Oct 2012, at 22:44, Andy Doan <andy.d...@linaro.org> wrote:

> On 10/10/2012 04:17 PM, Michael Hudson-Doyle wrote:
>> Andy Doan <andy.d...@linaro.org> writes:
>> 
>>> On 10/10/2012 08:56 AM, Andrey Konovalov wrote:
>>>> Hi Dave,
>>>> 
>>>> On 10/10/2012 11:35 AM, Dave Pigott wrote:
>>>>> Hi all,
>>>>> 
>>>>> I found an interesting health failure today on origen07:
>>>>> 
>>>>> http://validation.linaro.org/lava-server/scheduler/job/35016/log_file
>>>>> 
>>>>> When you look at the log, you see that the board starts off at the
>>>>> u-boot prompt. It then tries to do a "reboot", which (obviously)
>>>>> fails. So naturally, it then does a hard reset, and this is where it
>>>>> does something very odd: It interrupts the boot and tries to boot the
>>>>> previously installed test image. I haven't yet looked at the
>>>>> dispatcher code to figure out why (that's my next job).
>>> 
>>> I'm not sure we can trust anything that occurred in this job file after
>>> the "deploy_linaro_image is finished with error". I think at this point
>>> the dispatcher is in an unknown state and doesn't know what it should be
>>> sending to the serial console.
>>> 
>>> In this case, it still tried to do the boot_linaro_image action.
>>> However, we didn't successfully deploy an image, so anything going wrong
>>> there probably can't be trusted. I would have guessed it would have
>>> found the DTB file, but I'm not sure that's worth digging too far into.
>>> 
>>> I think the real problem we see here is what you and I discussed on IRC
>>> earlier. Certain actions in our job file should be considered
>>> non-recoverable if they fail, i.e.:
>>> 
>>> * if deploy_linaro_image fails, then boot_linaro_image can't run.
>>> * if boot_linaro_image fails, lava_test_install can't run.
>>> * if lava_test_install fails - well, that's tricky, since it may have
>>> installed some of the tests we need but not all.
>>> 
>>> I'm wondering if we need to spend some time trying to improve how
>>> actions relate to one another in code?
>> 
>> Yes please.  I don't know if we want to do something generic, or just
>> ensure deployment failures raise CriticalError -- which IIUC means no
>> further actions will be attempted.
> 
> CriticalError should at least fix the immediate problem.
> 
> Dave - you wanna take a stab at that for now, and we can do something more 
> elaborate in the future?
> 

Yep. Will look at it today.
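
For the record, the behaviour being asked for could look roughly like the sketch below. It is not the dispatcher's actual code: run_job, the "pass"/"fail"/"skipped" strings and the stub actions are made up for illustration, and CriticalError here is a local stand-in for the dispatcher's own exception. Only the action names come from the job file.

```python
class CriticalError(Exception):
    """Hypothetical stand-in: a failure after which later actions cannot be trusted."""


def run_job(actions):
    """Run (name, callable) pairs in order and stop at the first CriticalError.

    Returns a list of (name, status) tuples, where status is
    "pass", "fail" or "skipped".
    """
    results = []
    aborted = False
    for name, action in actions:
        if aborted:
            # An earlier action failed critically, so do not touch the board.
            results.append((name, "skipped"))
            continue
        try:
            action()
            results.append((name, "pass"))
        except CriticalError:
            results.append((name, "fail"))
            aborted = True
        except Exception:
            # Non-critical failure: record it and carry on with the next action.
            results.append((name, "fail"))
    return results


# Stub actions (assumptions): deployment fails, the others would succeed.
def deploy_linaro_image():
    raise CriticalError("deploy_linaro_image is finished with error")


def boot_linaro_image():
    pass


def lava_test_install():
    pass


job = [
    ("deploy_linaro_image", deploy_linaro_image),
    ("boot_linaro_image", boot_linaro_image),
    ("lava_test_install", lava_test_install),
]
```

With this, a failed deployment marks boot_linaro_image and lava_test_install as skipped instead of running them against a board in an unknown state. The partial-install case for lava_test_install is deliberately left out, since that one needs more thought.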

Thanks

Dave
_______________________________________________
linaro-dev mailing list
linaro-dev@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-dev
