kvm: Report the loss of a large memory page

William Roche Tue, 11 Feb 2025 13:31:44 -0800

On 2/10/25 17:48, Peter Xu wrote:

On Fri, Feb 07, 2025 at 07:02:22PM +0100, William Roche wrote:

[...]
So the main reason is a KVM "weakness" with kvm_send_hwpoison_signal(), and
the second reason is to have richer error messages.


This seems true, and I also remember something when I looked at this
previously but maybe nobody tried to fix it.  ARM seems to be correct on
that field, otoh.

Is it possible we fix KVM on x86?


Yes, very probably, and it would be a kernel fix.

This kernel change would first have to be running on the hypervisor
before new code in QEMU could use the SIGBUS siginfo information to
identify the size of the impacted page (instead of using an internal
addition to the KVM API). But this mechanism could help generate a
large page specific memory error message when the SIGBUS is received.
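
To illustrate the siginfo-based approach, here is a minimal sketch (not
code from this series) of how a SIGBUS handler could derive the size of
the impacted region from the si_addr_lsb field, which Linux fills in for
BUS_MCEERR_AR and BUS_MCEERR_AO. The names mce_region_size(),
mce_region_base() and sigbus_sketch() are hypothetical. The sketch
assumes the kernel reports the real page shift in that field, which, as
discussed above, is exactly what would need fixing for KVM on x86.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the region hit by the error, given si_addr_lsb.
 * Returns 0 if the shift is out of range for size_t. */
size_t mce_region_size(int addr_lsb)
{
    if (addr_lsb < 0 || addr_lsb >= (int)(sizeof(size_t) * 8)) {
        return 0;
    }
    return (size_t)1 << addr_lsb;
}

/* Start address of that region: the faulting address rounded down
 * to the region size. Returns addr unchanged if the size is unknown. */
uintptr_t mce_region_base(uintptr_t addr, int addr_lsb)
{
    size_t size = mce_region_size(addr_lsb);

    return size ? (addr & ~((uintptr_t)size - 1)) : addr;
}

/* Results recorded by the handler, to be reported outside of it. */
volatile uintptr_t last_poison_base;
volatile size_t last_poison_size;

/* Hypothetical SA_SIGINFO handler: only records the poisoned range. */
void sigbus_sketch(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;

    if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
        last_poison_base = mce_region_base((uintptr_t)si->si_addr,
                                           si->si_addr_lsb);
        last_poison_size = mce_region_size(si->si_addr_lsb);
    }
}
```

With a shift of 12 this yields a 4 KiB region, and with a shift of 21 a
2 MiB region, so a size larger than the base page size would be the
trigger for a large page specific message.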


I feel like when hwpoison becomes a serious topic, we need some more
serious reporting facility than error reports.  So that we could have this
as separate topic to be revisited.  It might speed up your prior patches
from not being blocked on this.


I explained why I think that error messages are important, but I don't want
to get blocked on fixing the hugepage memory recovery because of that.


What is the major benefit of reporting in QEMU's stderr in this case?

Such messages can be collected into a VM-specific log file, like any
other error_report() message, such as the existing x86 error injection
messages reported by QEMU. These messages should help the administrator
better understand the behavior of the VM.

For example, how should we consume the error reports that this patch
introduces?  Is it still for debugging purpose?

It's not only for debugging; it's a trace of a significant event that
can have major consequences on the VM.


I agree it's always better to dump something in QEMU when such happened,
but IIUC what I mentioned above (by monitoring QEMU ramblock setups, and
monitor host dmesg on any vaddr reported hwpoison) should also allow anyone
to deduce the page size of affected vaddr, especially if it's for debugging
purpose.  However I could possibly have missed the goal here..

You're right that, knowing the address, the administrator can deduce what
memory area was impacted and the associated page size. But the goal of
these large page specific messages was to give details on the event type
and immediately qualify the consequences.
Using large pages can also have drawbacks, and a large page specific
message on a memory error makes that more obvious! It is not only a debug
message, but an indication that the VM lost an unusually large amount of
its memory.


If you think that not displaying a specific message for large page loss can
help to get the recovery fixed, then I can change my proposal to do so.

Early next week, I'll send a simplified version of my first 3 patches
without these specific messages and without the preallocation handling in all
remap cases, so you can evaluate this possibility.


Yes IMHO it'll always be helpful to separate it if possible.

I'm now sending a v8 version, without the specific messages and the
remap notification. It should fix the main recovery bug we currently
have. More messages and a notification dealing with pre-allocation can
be added in a second step.

Please let me know if this v8 version can be integrated without the
prealloc handling and the specific messages.


Thanks,
William.

Re: [PATCH v7 3/6] accel/kvm: Report the loss of a large memory page
