On 09/25/2017 03:07 PM, Dario Faggioli wrote:
Hey,
Hi Dario,
On Mon, 2017-09-25 at 09:46 +0000, osstest service owner wrote:
flight 113807 xen-unstable real [real]
http://logs.test-lab.xenproject.org/osstest/logs/113807/
So, triggered by this:
Tests which are failing intermittently (not blocking):
test-armhf-armhf-xl-credit2 16 guest-start/debian.repeat fail in
113791 pass in 113807
I went to have a look, and discovered that it does indeed happen that,
from time to time, we fail to create a guest, on ARM, with Credit2.
Looking here:
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-credit2/xen-unstable
It seems to be happening only on the cubietracks, but in a non-linear
and non-deterministic fashion. E.g., 113791 failed on metzinger, which
is fine on 113800; 113611 and 113618 failed on baroque, which is fine
on 113638.
I don't see much in the logs, TBH, but both 'xl vcpu-list' and the 'r'
debug key seem to suggest that vCPU 0 is running, while the other vCPUs
have never run... like it was an issue with secondary (v)CPU bringup.
It indeed shows up with Credit2, as if it were _specific_ to it, but I'm
not 100% sure. In fact, it seems to never show up here:
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl/xen-unstable
but it looks like it may have shown up in 112460 (but we don't have the
logs any longer):
http://logs.test-lab.xenproject.org/osstest/results/history/test-armhf-armhf-xl-cubietruck/xen-unstable
So... ARM people? Does this ring a bell? Is this something known, or
easy to explain? What can I do to help?
It definitely rings a bell. I saw a similar trace in July and have
been working on a potential fix since then.
Most of the time when guest-start/debian.repeat fails, vCPU 0 is in a
data/prefetch abort state. My guess is a latent cache bug that Credit2
happens to expose.
Indeed, the arm32 kernel uses set/way cache flush instructions at
boot time. They are used to clean each level of cache, one by one, on
each CPU.
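To make "set/way" concrete, here is a minimal sketch of how the operand for such an instruction is built for one cache level, assuming the ARMv7 CCSIDR field layout and the usual set/way operand encoding. The names (cache_geom, decode_ccsidr, setway_operand, count_setway_ops) are made up for illustration; this is not code from Linux or Xen.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry of one cache level, decoded from an ARMv7 CCSIDR value. */
struct cache_geom {
    unsigned line_shift; /* log2(line size in bytes) */
    unsigned ways;       /* associativity */
    unsigned sets;       /* number of sets */
};

/* CCSIDR: LineSize in [2:0], Associativity in [12:3], NumSets in [27:13]. */
static struct cache_geom decode_ccsidr(uint32_t ccsidr)
{
    struct cache_geom g;

    g.line_shift = (ccsidr & 0x7) + 4;       /* field is log2(words) - 2 */
    g.ways = ((ccsidr >> 3) & 0x3ff) + 1;    /* field is ways - 1 */
    g.sets = ((ccsidr >> 13) & 0x7fff) + 1;  /* field is sets - 1 */
    return g;
}

/* Smallest number of bits able to index n items. */
static unsigned ceil_log2(unsigned n)
{
    unsigned bits = 0;

    while ((1u << bits) < n)
        bits++;
    return bits;
}

/*
 * Operand for a set/way operation: way in the top bits, set above the
 * line offset, 0-based cache level in bits [3:1].
 */
static uint32_t setway_operand(const struct cache_geom *g, unsigned level,
                               unsigned set, unsigned way)
{
    unsigned way_bits = ceil_log2(g->ways);
    uint32_t op = ((uint32_t)set << g->line_shift) | ((uint32_t)level << 1);

    assert(set < g->sets && way < g->ways && level < 7);
    if (way_bits)
        op |= (uint32_t)way << (32 - way_bits);
    return op;
}

/* A full clean of one level issues one operation per (set, way) pair. */
static unsigned long count_setway_ops(const struct cache_geom *g)
{
    return (unsigned long)g->ways * g->sets;
}
```

Note that the operand only names a position in the cache of the CPU executing the instruction. There is no address in it, which is why the result depends on which physical CPU the vCPU happens to be running on.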
At the moment, Xen does not trap those instructions. As you know, a cache
may not be private to a given physical processor. So if you happen to
migrate the vCPU to another physical CPU, you may hit stale data.
This means we have to trap and emulate set/way instructions. Per the ARM
ARM, and also from experience, emulating them is non-trivial.
Thankfully, people are trying to get rid of those instructions. For
instance, arm64 Linux does not use them anymore. Sadly, the arm32 Linux
maintainer does not want to remove them... They are also used by EDK2 at
the moment.
The solution is to go through the P2M and clean & invalidate every page
one by one. This process is really, really slow given that Xen on Arm
always populates the P2M at guest creation.
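As a rough illustration of why this is slow, here is a toy model of the walk, assuming 4KB pages and a flat array standing in for the P2M. Everything in it (toy_p2m_entry, toy_p2m_clean_invalidate, count_line) is hypothetical and much simpler than Xen's real page-table walk; it only counts the cache-line operations the walk would issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical, simplified P2M entry: not Xen's real entry layout. */
struct toy_p2m_entry {
    int valid;    /* non-zero if the guest frame is populated */
    uint64_t mfn; /* machine frame number backing it */
};

/* Callback standing in for a clean & invalidate by address of one line. */
typedef void (*line_op_fn)(uint64_t maddr, void *ctx);

/* Example callback: just counts how many lines were touched. */
static void count_line(uint64_t maddr, void *ctx)
{
    (void)maddr;
    (*(size_t *)ctx)++;
}

/*
 * Walk every entry and apply the line operation to each cache line of
 * each populated page. Returns the number of line operations issued.
 */
static size_t toy_p2m_clean_invalidate(const struct toy_p2m_entry *p2m,
                                       size_t nr_entries,
                                       unsigned line_size,
                                       line_op_fn op, void *ctx)
{
    size_t lines = 0;

    assert(line_size != 0 && TOY_PAGE_SIZE % line_size == 0);

    for (size_t gfn = 0; gfn < nr_entries; gfn++) {
        if (!p2m[gfn].valid)
            continue; /* nothing mapped, nothing to clean */

        uint64_t base = p2m[gfn].mfn * TOY_PAGE_SIZE;

        for (unsigned off = 0; off < TOY_PAGE_SIZE; off += line_size) {
            if (op)
                op(base + off, ctx);
            lines++;
        }
    }
    return lines;
}
```

In this model the cost scales with the number of populated entries: with 64-byte lines, each populated 4KB page costs 64 line operations, and unpopulated entries cost nothing.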
So I have been working for the past 2 months to add PoD support on Arm.
I have a proof of concept that boots a guest and properly handles set/way
cache instructions.
I am still cleaning up my work and hopefully can post a couple of series
soon. This is not targeting Xen 4.10, and I am not even sure it would fix
the problem here. But that's my best guess.
Cheers,
--
Julien Grall