On Fri, 14 Feb 2025 14:16:31 +1000
Gavin Shan <gs...@redhat.com> wrote:

> Currently, there is only one CPER buffer (entry), meaning only one
> memory error can be reported. In extreme case, multiple memory errors
> can be raised on different vCPUs. For example, a singile memory error
> on a 64KB page of the host can results in 16 memory errors to 4KB
> pages of the guest. Unfortunately, the virtual machine is simply aborted
> by multiple concurrent memory errors, as the following call trace shows.
> A SEA exception is injected to the guest so that the CPER buffer can
> be claimed if the error is successfully pushed by acpi_ghes_memory_errors(),
> Otherwise, abort() is triggered to crash the virtual machine.
> 
>   kvm_vcpu_thread_fn
>     kvm_cpu_exec
>       kvm_arch_on_sigbus_vcpu
>         kvm_cpu_synchronize_state
>         acpi_ghes_memory_errors         (a)
>         kvm_inject_arm_sea | abort
> 
> It's arguably to crash the virtual machine in this case. The better
> behaviour would be to retry on pushing the memory errors, to keep the
> virtual machine alive so that the administrator has chance to chime
> in, for example to dump the important data with luck. This series
> adds one more parameter to acpi_ghes_memory_errors() so that it will
> be tried to push the memory error until it succeeds.
Hi Gavin,

If the ultimate aim is to support multiple memory errors why not
just do that?  Been a while since I look at how that works, but 
the spec definitely allows it.  I think by just queuing up the errors
and updating the Error Status Address as each one is handled.
I think that's what GHESv2 ack is all about as it prevents the
RAS firmware updating the error record until it is acknowledged
at which point the RAS firmware can report the next one.

Or... Given the usecase above of a 64KiB host page and 4KiB guest
can we inject a single error record with multiple CPER entries and
just handle it all in one go?

Set the Error record header -> section count to 16 and provide
16 Memory Error Sections or equivalent.

Doesn't help with multiple errors in unrelated memory addresses but
maybe removes one problem case.

I've not checked all the information makes it to the right places
however or that we don't end up with a deadlock when multiple vCPU
involved.

If doing the more significant surgery this would involve, I'd
love to see Mauro's series land first as it cleans up a lot of
how HEST is handled etc.

Jonathan

> 
> Gavin Shan (4):
>   acpi/ghes: Make ghes_record_cper_errors() static
>   acpi/ghes: Use error_report() in ghes_record_cper_errors()
>   acpi/ghes: Allow retry to write CPER errors
>   target/arm: Retry pushing CPER error if necessary
> 
>  hw/acpi/ghes-stub.c    |  3 ++-
>  hw/acpi/ghes.c         | 45 +++++++++++++++++++++---------------------
>  include/hw/acpi/ghes.h |  5 ++---
>  target/arm/kvm.c       | 31 +++++++++++++++++++++++------
>  4 files changed, 51 insertions(+), 33 deletions(-)
> 


Reply via email to