On Fri, Apr 08, 2022 at 09:01:11AM +0200, Jan Beulich wrote:
> On 07.04.2022 10:45, osstest service owner wrote:
> > flight 169199 xen-4.12-testing real [real]
> > http://logs.test-lab.xenproject.org/osstest/logs/169199/
> > 
> > Regressions :-(
> > 
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> >  test-amd64-amd64-xl-qemut-debianhvm-i386-xsm 12 debian-hvm-install fail 
> > REGR. vs. 168480
> 
> While the subsequent flight passed, I thought I'd still look into
> the logs here since the earlier flight had failed too. The state of
> the machine when the debug keys were issued is somewhat odd (and
> similar to the earlier failure's): 11 of the 56 CPUs try to
> acquire (apparently) Dom0's event lock, from evtchn_move_pirqs().
> All other CPUs are idle. The test failed because the sole guest
> didn't reboot in time. Whether the failure is actually connected to
> this apparent lock contention is unclear, though.
> 
> > One can further see that all of the roughly 70 ECS_PIRQ ports are
> > bound to vCPU 0 (which makes me wonder about the lack of balancing
> > inside Dom0 itself, but that's unrelated). This means that all
> > other vCPU-s have nothing at all to do in evtchn_move_pirqs().
> > Since this moving of pIRQ-s is an optimization (the value of which
> > has been called into question in the past, iirc), I wonder whether
> > we shouldn't add a check to the function for the list being empty
> > before actually acquiring the lock. I guess I'll make a patch and
> > post it as RFC.

Seems good to me.
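
FWIW, a minimal sketch of the idea, assuming the function still has
the shape it has in xen/common/event_channel.c (checking the list head
without holding the lock is racy, but benignly so):

    void evtchn_move_pirqs(struct vcpu *v)
    {
        struct domain *d = v->domain;
        const cpumask_t *mask = cpumask_of(v->processor);
        unsigned int port;
        struct evtchn *chn;

        /*
         * Nothing to move: don't take (and contend on) d->event_lock
         * just to find an empty list.  A pIRQ bound concurrently is
         * simply picked up on a later vCPU migration.
         */
        if ( !v->pirq_evtchn_head )
            return;

        spin_lock(&d->event_lock);
        for ( port = v->pirq_evtchn_head; port; port = chn->u.pirq.next_port )
        {
            chn = evtchn_from_port(d, port);
            pirq_set_affinity(d, chn->u.pirq.irq, mask);
        }
        spin_unlock(&d->event_lock);
    }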

I think a better model might be to migrate the PIRQs when they fire,
or even better when the EOI is performed?  That way Xen wouldn't
pointlessly migrate PIRQs for vCPUs that aren't running.
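
Purely as a sketch (the hook point and the helper name here are made
up; only pirq_set_affinity() and the evtchn/vcpu fields are existing
ones): something called from the guest EOI path could fix up just the
one pIRQ, and only while its target vCPU is actually running:

    /* Hypothetical helper invoked from the guest EOI path. */
    static void pirq_fixup_affinity(struct domain *d, struct evtchn *chn)
    {
        struct vcpu *v = d->vcpu[chn->notify_vcpu_id];

        /*
         * Touch only this pIRQ, and only while its target vCPU runs,
         * so idle or descheduled vCPUs cause no migrations at all.
         */
        if ( v->is_running )
            pirq_set_affinity(d, chn->u.pirq.irq,
                              cpumask_of(v->processor));
    }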

> And of course in a mostly idle system the other aspect here (again)
> is: Why are vCPU-s moved across pCPU-s in the first place? I've
> observed (and reported) such seemingly over-aggressive vCPU
> migration before, most recently in the context of putting together
> 'x86: make "dom0_nodes=" work with credit2'. Is there anything that
> can be done about this in credit2?
> 
> A final, osstest-related question is: Does it make sense to run Dom0
> on 56 vCPU-s, one each per pCPU? The bigger a system, the less
> useful it looks to me to actually also have a Dom0 as big, when the
> purpose of the system is to run guests, not meaningful other
> > workloads in Dom0. While this is Xen's default (i.e. in the absence
> > of command line options restricting Dom0), I don't think it
> > represents typical use of Xen in the field.

I could add a suitable dom0_max_vcpus parameter to osstest; XenServer,
for example, uses 16.
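
For reference, setting it would just mean adding something like the
following to the hypervisor command line (the documented syntax also
accepts a <min>-<max> range form, if I recall the docs correctly):

    dom0_max_vcpus=16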

Albeit not having such a parameter is likely what led you to figure
out this issue, so it might not be so bad.  I agree, however, that
it's likely better to test scenarios closer to real-world usage.

Thanks, Roger.
