On Fri, 21 Feb 2025 11:04:35 +0000
Jonathan Cameron <jonathan.came...@huawei.com> wrote:

> On Fri, 21 Feb 2025 15:27:36 +1000
> Gavin Shan <gs...@redhat.com> wrote:
> 
> > On 2/20/25 3:55 AM, Igor Mammedov wrote:  
> > > On Fri, 14 Feb 2025 14:16:35 +1000
> > > Gavin Shan <gs...@redhat.com> wrote:
> > >     
> > >> The error -1 is returned if the previously reported CPER error
> > >> hasn't been claimed yet, and the virtual machine is then terminated
> > >> by abort(). This conflicts with the ideal behaviour, which is for the
> > >> affected vCPU to retry pushing the CPER error in this case, since the
> > >> vCPU can't proceed with its execution anyway.
> > >>
> > >> Move the chunk of code that pushes the CPER error to a separate helper,
> > >> report_memory_error(), and retry the request when the return
> > >> value from acpi_ghes_memory_errors() is greater than zero.
> > >>
> > >> Signed-off-by: Gavin Shan <gs...@redhat.com>
> > >> ---
> > >>   target/arm/kvm.c | 31 +++++++++++++++++++++++++------
> > >>   1 file changed, 25 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> > >> index 5c0bf99aec..9f063f6053 100644
> > >> --- a/target/arm/kvm.c
> > >> +++ b/target/arm/kvm.c
> > >> @@ -2362,6 +2362,30 @@ int kvm_arch_get_registers(CPUState *cs, Error **errp)
> > >>       return ret;
> > >>   }
> > >>   
> > >> +static void report_memory_error(CPUState *c, hwaddr paddr)
> > >> +{
> > >> +    int ret;
> > >> +
> > >> +    while (true) {
> > >> +        /* Retry if the previously reported error hasn't been claimed */
> > >> +        ret = acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
> > >> +        if (ret <= 0) {
> > >> +            break;
> > >> +        }
> > >> +
> > >> +        bql_unlock();
> > >> +        g_usleep(1000);    
> > 
> > Igor, thanks for the detailed comments. Sorry for the slight delay in
> > replying, I was checking the code to understand it better :)
> 
> This is moderately tricky stuff so I'm not 100% sure of some of the things
> I said below, but I will be traveling for the next few weeks so I want to
> get some comments out before that!
> 
> >   
> > > even with bql released it's not safe to loop in here.
> > > consider,
> > >    a guest with 2 vcpus
> > >      * vcpu 1 gets SIGBUS due to error
> > >      * vcpu 2 trips over the same error and gets into this loop
> > >      * on guest side vcpu 1 continues to run to handle SEA but
> > >        might need to acquire a lock that vcpu 2 holds
> > >     
> > 
> > Agreed.
> >   
> > > The GHESv2 error source we support can report several errors;
> > > currently QEMU supports only one 'error status block', which
> > > can hold several error records (CPER) (though storage size is limited)
> > > 
> > > 1:
> > > We can potentially add support for more GHESv2 error sources
> > > with their own Read ACK registers (let's say =max_cpus)
> > > (that is under the assumption that no other error will be triggered
> > > while guest VCPUs handle their own SEA (up to clearing Read ACK))
> 
> This one seems straightforward, but I'd kind of like to know if real systems
> do this (I'll try and find out about ours).  I don't think there is
> any association available between a CPU and an SEA source, so Linux at
> least will just go looking for any that are active on each SEA.
> Locking looks fine but it won't help with performance.
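
To make #1 a bit more concrete, roughly what I had in mind on the QEMU side
(build_ghes_v2_source() and the per-vCPU source id arithmetic are invented
here purely for illustration; they are not the current acpi_ghes API):

/* Invented sketch: one GHESv2 error source (with its own error status
 * block and Read ACK register) per vCPU, so a vCPU taking an SEA never
 * has to wait for another vCPU's ack. */
static void build_hest_sea_sources(GArray *table_data, BIOSLinker *linker,
                                   int max_cpus)
{
    for (int cpu = 0; cpu < max_cpus; cpu++) {
        /* one source id per vCPU, e.g. ACPI_HEST_SRC_ID_SEA + cpu */
        build_ghes_v2_source(table_data, linker, ACPI_HEST_SRC_ID_SEA + cpu);
    }
}

and on injection the faulting vCPU would use its own source, something like:

    acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA + c->cpu_index, paddr, true);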
> 
> > > 
> > > 2:
> > > Another way could be for QEMU to allocate more error status _blocks_
> > > for the single error source it has now and try to find an
> > > empty status block to inject new error(s).  
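
(Roughly what I mean, with an invented pool structure and field names, just
to illustrate the lookup:)

/* Scan a small pool of status blocks for one the guest has already
 * consumed (Block Status == 0); -1 means every block is still pending. */
static int ghes_find_free_status_block(GhesBlockPool *pool)
{
    for (int i = 0; i < pool->nr_blocks; i++) {
        uint32_t block_status;

        cpu_physical_memory_read(pool->block_paddr[i], &block_status,
                                 sizeof(block_status));
        if (block_status == 0) {
            return i;           /* free: no errors currently recorded */
        }
    }
    return -1;                  /* all blocks still in flight in the guest */
}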
> 
> Let me try to get my head around this one...
> 
> Each GHESv2 entry points, indirectly, to a single error status block at a
> time (only one address to read that from).  A curious quirk is that the
> length of that error status block is fixed, as it's just a value in GHESv2,
> not an indirection via a register - however I think you can just make it
> 'big'.
> So what I think you are proposing here is that on a read_ack write (which
> we would need to monitor for), the value of the error status address
> register is updated to point to the next one of a queue of error blocks.

All my suggestions here aren't recommendations, but rather thinking out loud
to see if we can find a correct way to deal with the issue.

The part I don't fancy doing on the QEMU side is the read_ack write intercept.
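
For completeness, the kind of thing it would involve, sketched very roughly
(memory_region_init_io()/memory_region_add_subregion_overlap() are the real
QEMU APIs; the GhesReadAck struct and ghes_advance_to_next_status_block()
are invented):

/* Overlay a small MMIO region on the firmware-allocated Read Ack register
 * so QEMU observes the guest's ack and can re-point the error status
 * address register at the next queued status block. */
typedef struct GhesReadAck {
    MemoryRegion mr;
    uint64_t latched;
} GhesReadAck;

static uint64_t ghes_read_ack_read(void *opaque, hwaddr addr, unsigned size)
{
    GhesReadAck *ra = opaque;
    return ra->latched;
}

static void ghes_read_ack_write(void *opaque, hwaddr addr,
                                uint64_t val, unsigned size)
{
    GhesReadAck *ra = opaque;

    ra->latched = val;
    /* guest acknowledged the current block: move on to the next queued one */
    ghes_advance_to_next_status_block();
}

static const MemoryRegionOps ghes_read_ack_ops = {
    .read = ghes_read_ack_read,
    .write = ghes_read_ack_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

/* once the firmware-allocated Read Ack address is known: */
memory_region_init_io(&ra->mr, NULL, &ghes_read_ack_ops, ra,
                      "ghes-read-ack", sizeof(uint64_t));
memory_region_add_subregion_overlap(get_system_memory(), read_ack_paddr,
                                    &ra->mr, 1);

That's the intrusive part I'd rather not maintain.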

> 
> That can work.  I'm not sure it actually gets us anything over just queuing in
> QEMU and writing the same error status block.  Those status blocks can contain

The benefit would be that we won't have to maintain state on the QEMU side
and then migrate it.

> multiple Generic Error Data entries, but unless we have a load of them
> gathered up at the time of first notifying the guest, I'm not sure that
> helps us.
> 
> One thing that I'm nervous about is that I can't actually find spec language
> that says that the OS 'must' reread the error status address register on
> each event. That isn't mentioned in the GHESv2 flow description, which
> just says:

that bothers me as well.

> "
> These are the steps the OS must take once detecting an error from a 
> particular GHESv2 error source:
> • OSPM detects error (via interrupt/exception or polling the block status)
> • OSPM copies the error status block
> • OSPM clears the block status field of the error status block
> • OSPM acknowledges the error via Read Ack register. For example:
> – OSPM reads the Read Ack register –> X
> – OSPM writes –> (( X & ReadAckPreserve) | ReadAckWrite)
> "
> 
> The Linux code is confusing me, but I think it wonderfully changes
> the fixmap on every access as it needs to do an ioremap-type operation
> in NMI conditions.
> 
> > >   * it can be saturated with a high rate of errors (so what do we do
> > >     if that happens?)  
> 
> Make it big :)  But sure, big just makes the condition unlikely rather
> than solving it.
> 
> > >   * subject to a race between clearing/setting Read ACK
> > >      (maybe that can be dealt with on our side by keeping an internal
> > >      read_ack counter)
> 
> I don't think there are any races as long as we update the register only on
> clear, which should, I think, happen before the next SEA can happen?  My
> understanding, which may be wrong, is that the OS must just take a copy of
> the error status block and set the read_ack all in the exception handler.
> 
> > > 
> > > 3:
> > > And alternatively, queue incoming errors until read ack is cleared
> > > and then inject pending errors in one go.
> > > (the problem with that is that at the moment QEMU doesn't monitor the
> > > read ack register memory, so it won't notice the guest clearing it)  
> 
> We'd need to monitor it definitely.  Injecting all we have queued up
> in one go here seems like a reasonable optimization over doing them
> one at a time.
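
If we do go with #3, the rough shape could be something like this (the queue
and the ghes_read_ack_pending()/ghes_read_ack_cleared() hooks are invented;
the acpi_ghes_memory_errors() call is the one from this patch):

typedef struct PendingGhesError {
    uint64_t paddr;
    QTAILQ_ENTRY(PendingGhesError) node;
} PendingGhesError;

static QTAILQ_HEAD(, PendingGhesError) pending_errors =
    QTAILQ_HEAD_INITIALIZER(pending_errors);

/* From kvm_arch_on_sigbus_vcpu(): if the guest hasn't acked the previous
 * error yet, queue this one instead of spinning with the BQL dropped. */
static void ghes_queue_or_inject(uint64_t paddr)
{
    if (ghes_read_ack_pending()) {
        PendingGhesError *e = g_new0(PendingGhesError, 1);

        e->paddr = paddr;
        QTAILQ_INSERT_TAIL(&pending_errors, e, node);
        return;
    }
    acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, true);
}

/* Called when QEMU sees the guest clear the Read Ack register (which needs
 * the intercept discussed above): drain everything queued in one go. */
static void ghes_read_ack_cleared(void)
{
    PendingGhesError *e, *next;

    QTAILQ_FOREACH_SAFE(e, &pending_errors, node, next) {
        acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, e->paddr, true);
        QTAILQ_REMOVE(&pending_errors, e, node);
        g_free(e);
    }
}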
> 
> > > 
> > > 
> > > Given the spec has provision for multiple error status blocks/error
> > > data entries, it seems that #2 is the expected way to deal with the
> > > problem.
> > >     
> > 
> > I would say #1 is the ideal model because the read_ack_register is the
> > bottleneck and it should be scaled up to max_cpus. In that way, the
> > bottleneck can be avoided from the ground up. Another benefit of #1 is
> > that the error can be delivered immediately to the vCPU where the error
> > was raised. This matches the semantics of SEA to me.  
> 
> I don't think it helps with the bottleneck, in Linux at least.  A whole
> bunch of locks are taken on each SEA because of the novel use of the
> fixmap.  There is only one VA ever used to access the error status blocks;
> we just change what PA it points to under a spin lock.  Maybe that can be
> improved on if we can persuade people that error handling performance is a
> thing to care about!
> 
> > 
> > #2 still has the risk of saturating the multiple error status blocks if
> > there is a high rate of errors, as you said. Besides, the vCPU where the
> > read_ack_register is acknowledged can be different from the vCPU where
> > the error is raised, violating the semantics of SEA.
> > 
> > #3's drawback is that it violates the semantics of SEA, similar to #2.
> > 
> > However, #2/#3 wouldn't be as complicated as #1. I didn't expect big
> > surgery to the GHES module, but it seems there isn't a perfect solution
> > without big surgery. I would vote for #1 to resolve the issue from the
> > ground up. What do you think, Igor?
> > I'm also hoping Jonathan and Mauro can provide their preference.  
> 
> Ideally I'd like whatever we choose to look like what a bare metal machine
> does - mostly because we are less likely to hit untested OS paths.

Ack for that, but
that would need someone from the hw/firmware side, since error status block
handling is done by firmware.

Right now we are just making things up based on spec interpretation.

> > > PS:
> > > I'd prefer Mauro's series being merged 1st (once it's resplit),
> > > for it refactors a bunch of original code and hopefully makes
> > > code easier to follow/extend.
> > >     
> > 
> > Sure. I won't start the coding until the solution is confirmed. All the
> > followup work will be based on Mauro's series.
> >   
> > >> +        bql_lock();
> > >> +    }
> > >> +
> > >> +    if (ret == 0) {
> > >> +        kvm_inject_arm_sea(c);
> > >> +    } else {
> > >> +        error_report("Error %d while reporting memory error", ret);
> > >> +        abort();
> > >> +    }
> > >> +}
> > >> +
> > >>   void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> > >>   {
> > >>       ram_addr_t ram_addr;
> > >> @@ -2387,12 +2411,7 @@ void kvm_arch_on_sigbus_vcpu(CPUState *c, int code, void *addr)
> > >>                */
> > >>               if (code == BUS_MCEERR_AR) {
> > >>                   kvm_cpu_synchronize_state(c);
> > >> -                if (!acpi_ghes_memory_errors(ACPI_HEST_SRC_ID_SEA, paddr, false)) {
> > >> -                    kvm_inject_arm_sea(c);
> > >> -                } else {
> > >> -                    error_report("failed to record the error");
> > >> -                    abort();
> > >> -                }
> > >> +                report_memory_error(c, paddr);
> > >>               }
> > >>               return;
> > >>           }    
> > >     
> > 
> > Thanks,
> > Gavin
> > 
> >   
> 

