On 27/01/2021 20:12, Elliott Mitchell wrote:
> On Wed, Jan 27, 2021 at 10:47:19AM +0100, Jan Beulich wrote:
>> On 26.01.2021 18:51, Elliott Mitchell wrote:
>>> Okay, this has been reliably reproducing for a while.  I had originally
>>> thought it was a problem of HVM plus memory != maxmem, but the
>>> non-immediate restart disagrees with that assessment.
>> I guess it's not really clear what you mean by this, but anyway:
>> The important aspect here that I'm concerned about is what the
>> manifestations of the issue are. I'm still hoping that you would
>> provide such information, so we can then start thinking about how
>> to solve these. If, of course, there is anything worse than the
>> expected effects which use of PoD can have on the guest itself.
> Manifestation is domain 0 and/or Xen panic a few seconds after the
> domain.cfg file is loaded via `xl`.  Everything on the host is lost and
> the host restarts.  Any VMs which were present are lost and need to
> restart, similar to power loss without UPS.
>
> Upon pressing return for `xl create domain.cfg` there is a short period
> of apparently normal behavior in domain 0.  After this there is a short
> period of very laggy behavior in domain 0.  Finally domain 0 goes
> unresponsive and so far by the time I've gotten to the host's console it
> has already started to reboot.
>
> The periods of apparently normal and laggy behavior are perhaps 5-10
> seconds each.
>
> The configurations I've reproduced with have had maxmem substantially
> larger than the total host memory (this is intended as a prototype of a
> future larger VM).  The first recorded observation of this was with
> Debian's build of Xen 4.8, though I recall running into it with Xen 4.4
> too.
>
> Part of the problem might also be attributable to QEMU touching all
> memory on start (thus causing PoD to try to populate *all* memory) or
> OVMF.

So.  What *should* happen is that if QEMU/OVMF dirties more memory than
exists in the PoD cache, the domain gets terminated.
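
For context, a minimal illustrative domain.cfg along these lines (the
names and sizes here are mine, not your actual config) is enough to put
an HVM guest into PoD mode, because memory < maxmem means the guest
boots with a PoD cache sized to `memory` while the p2m covers `maxmem`:

  # Hypothetical config -- an HVM guest started with memory < maxmem
  # runs on PoD until a balloon driver in the guest hands memory back.
  type   = "hvm"      # "builder" rather than "type" on older xl
  name   = "pod-test"
  memory = 2048       # initial allocation / PoD cache size, in MiB
  maxmem = 16384      # p2m size; deliberately larger than memory
  vcpus  = 2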

Irrespective of that, Xen/dom0 dying isn't an expected consequence of
any normal action like this.

Do you have a serial log of the crash?  If not, can you set up a crash
kernel environment to capture the logs, or alternatively reproduce the
issue on a different box which does have serial?
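
If the box does have a serial port, something like the following in
/etc/default/grub (Debian-style; the port and baud values are
assumptions to adjust for your hardware) should get the hypervisor's
dying words out before the reboot:

  # Serial console plus maximum log levels for the hypervisor.
  # sync_console makes output synchronous -- slow, but it avoids
  # losing the tail of the log when the host falls over.
  GRUB_CMDLINE_XEN_DEFAULT="com1=115200,8n1 console=com1,vga loglvl=all guest_loglvl=all sync_console"
  # Route the dom0 kernel console through Xen's console as well.
  GRUB_CMDLINE_LINUX_DEFAULT="console=hvc0 console=tty0"
  # then run update-grub and reboot before reproducing the crash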

Whatever the underlying bug is, avoiding 2M allocations degrading to
4K ones isn't a real fix, and is, at best, sidestepping the problem.

~Andrew
