On 04/09/2019 09:36, Jan Beulich wrote:
> On 03.09.2019 22:00, osstest service owner wrote:
>> flight 140960 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/140960/
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-amd64-amd64-xl-pvshim   18 guest-localmigrate/x10   fail REGR. vs. 139876
> This looks to be recurring, so I've taken another look.

I had a suspicion as well, but fixing the intermittent build problems
was the first priority.

A major change in shim in the range under test is switching from Credit1
to NULL as a scheduler, following Dario's fixing of what we thought was
the final outstanding bug.  Perhaps it wasn't the final bug...

>  The three
> migrations leave this abbreviated pattern in the log:
>
> Sep  3 14:20:42.446667 (XEN) HVM d1v0 save: CPU_MSR
> ...
> Sep  3 14:20:57.850670 (XEN) HVM2 restore: CPU 0
> ...
> Sep  3 14:21:37.062840 (XEN) HVM d2v0 save: CPU_MSR
> Sep  3 14:21:37.062888 (XEN) HVM3 restore: CPU 0
> ...
> Sep  3 14:21:56.438552 (XEN) HVM d3v0 save: CPU_MSR
> ...
> Sep  3 14:22:11.506508 (XEN) HVM4 restore: CPU 0
>
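
For reference, the gaps in the abbreviated pattern above can be worked out with a quick sketch like the one below. It only uses the six timestamps quoted in this mail and assumes they all fall on the same day; the lines elided by "..." are not considered, and the names PAIRS, GAPS and gap_seconds are made up for illustration.

```python
from datetime import datetime

# (save, restore) timestamps copied from the quoted log excerpt.
# Assumption: all entries are from the same day, so only the time is parsed.
PAIRS = [
    ("14:20:42.446667", "14:20:57.850670"),  # d1 save -> HVM2 restore
    ("14:21:37.062840", "14:21:37.062888"),  # d2 save -> HVM3 restore
    ("14:21:56.438552", "14:22:11.506508"),  # d3 save -> HVM4 restore
]


def gap_seconds(save, restore):
    """Return the time between a save line and a restore line, in seconds."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(restore, fmt) - datetime.strptime(save, fmt)
    return delta.total_seconds()


GAPS = [gap_seconds(save, restore) for save, restore in PAIRS]

for number, gap in enumerate(GAPS, start=1):
    print(f"migration {number}: {gap:.6f}s between save and restore")
```

On these figures the first and third migrations each show a gap of roughly 15s, while the second is effectively instantaneous.
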
> Therefore I wonder whether the first one got lucky and finished
> barely ahead of timing out, while the 2nd worked instantly and the
> 3rd then ended up exceeding the timeout. What is curious are the
> intermediate log entries (between the last "save" and the first
> corresponding "restore" log entries): Many ones of the form
>
> (XEN) emul-priv-op.c:1113:d0v2 Domain attempted WRMSR c0011020 from 0x0000000000000000 to 0x0040000000000000

This is due to a lack of MSR_VIRT_SPEC_CTRL.  It is sshd (or systemd on
its behalf - unclear which) using the SSBD prctl to protect itself.
Because Xen doesn't offer MSR_VIRT_SPEC_CTRL, Linux falls back to the
native method and falls foul of Xen's write/discard policy on MSRs.
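
To see which bit the guest is trying to flip in that WRMSR, a minimal sketch such as the following is enough. It only compares the "from" and "to" values printed in the log line; the helper name changed_bits is made up for illustration, and nothing here models how Xen actually handles the write.

```python
def changed_bits(old, new):
    """Return the positions of the bits that differ between two 64-bit values."""
    diff = old ^ new
    return [bit for bit in range(64) if (diff >> bit) & 1]


# Values taken from the quoted "Domain attempted WRMSR c0011020" message.
BITS = changed_bits(0x0000000000000000, 0x0040000000000000)
print(BITS)
```

The write differs from the current value in a single bit, bit 54.
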

> with a 15s gap between the first and many subsequent ones) and
> finally one of the form
>
> [  451.267669] systemd-logind[2766]: New session 39 of user root.
>
> And finally, at around the time of the failed migration
>
> INIT: Id "T0" respawning too fast: disabled for 5 minutes

Googling around suggests it is an inittab misconfiguration.

>
> While it's not clear that this parallel activity is causing the
> migration to progress too slowly, it looks to be a possibility at
> least. Can anyone explain what these are?
>
>>  build-amd64-xsm               6 xen-build                fail REGR. vs. 139876
> I take it that this is supposed to be taken care of by a342900d48
> ("tools/shim: Apply more duct tape to the linkfarm logic").

Yes - it should do.

~Andrew
