On Wed, 26 Feb 2025 14:58:46 +1000 Gavin Shan <gs...@redhat.com> wrote:
> On 2/25/25 9:19 PM, Igor Mammedov wrote:
> > On Fri, 21 Feb 2025 11:04:35 +0000
> > Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> >>
> >> Ideally I'd like whatever we choose to look like what a bare metal machine
> >> does - mostly because we are less likely to hit untested OS paths.
> >
> > Ack for that but,
> > that would need someone from hw/firmware side since error status block
> > handling is done by firmware.
> >
> > right now we are just making things up based on spec interpretation.
> >
>
> It's a good point. I think it's worthwhile to understand how the RAS event
> is processed and turned into CPER by firmware.
>
> I didn't figure out how CPER is generated by edk2 after looking into tf-a
> (Trusted Firmware-A) and edk2 for a while. I will consult the EDK2 developers
> for help. However, there is a note in tf-a that briefly explains how a RAS
> event is handled.
>
> From tf-a/plat/arm/board/fvp/aarch64/fvp_lsp_ras_sp.c:
> (g...@github.com:ARM-software/arm-trusted-firmware.git)
>
> /*
>  * Note: Typical RAS error handling flow with Firmware First Handling
>  *
>  * Step 1: Exception resulting from a RAS error in the normal world is routed to
>  *         EL3.
>  * Step 2: This exception is typically signaled as either a synchronous external
>  *         abort or SError or interrupt. TF-A (EL3 firmware) delegates the
>  *         control to platform specific handler built on top of the RAS helper
>  *         utilities.
>  * Step 3: With the help of a Logical Secure Partition, TF-A sends a direct
>  *         message to dedicated S-EL0 (or S-EL1) RAS Partition managed by SPMC.
>  *         TF-A also populates a shared buffer with a data structure containing
>  *         enough information (such as system registers) to identify and triage
>  *         the RAS error.
>  * Step 4: RAS SP generates the Common Platform Error Record (CPER) and shares
>  *         it with normal world firmware and/or OS kernel through a reserved
>  *         buffer memory.
>  * Step 5: RAS SP responds to the direct message with information necessary for
>  *         TF-A to notify the OS kernel.
>  * Step 6: Consequently, TF-A dispatches an SDEI event to notify the OS kernel
>  *         about the CPER records for further logging.
>  */
>
> According to the note, RAS SP (Secure Partition) is the black box where the RAS
> event raised by tf-a is turned into CPER. Unfortunately, I didn't find the source
> code to understand the details yet.

This is very much 'a flow' rather than 'the flow'. TF-A may not even
be involved in many systems, nor SDEI, nor EDK2 beyond passing through
some config.

One option, as I understand it, is to offload the firmware handling and
building of the record to a management processor and stick to SEA
for the signalling.

I'd be rather surprised if you can find anything beyond binary blobs
for that firmware (if that!). Maybe all we can get from publicish
sources is what the HEST tables look like. I've asked our firmware folk
if they can share more on how we do it, but it might take a while.

I have confirmed we only have one GHESv2 SEA entry on at least the one
random board I looked at (and various interrupt ones). That board may
not be representative, but it seems pushing everything through one
structure is an option.

Jonathan

>
> Thanks,
> Gavin