On Tue, Apr 29, 2025 at 07:21:09PM +0200, Fabio M. De Francesco wrote:
> When Firmware First is enabled, BIOS handles errors first and then it
> makes them available to the kernel via the Common Platform Error Record
> (CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
> via one of two similar paths, either ELOG or GHES.
> 
> Currently, ELOG and GHES show some inconsistencies in how they report to
> userspace via trace events.
> 
> Therfore make the two mentioned paths act similarly by tracing the CPER
> CXL Protocol Error Section (UEFI v2.10, Appendix N.2.13) signaled by the
> I/O Machine Check Architecture and reported by BIOS in FW-First.
> 
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.france...@linux.intel.com>
> ---
>  drivers/acpi/acpi_extlog.c | 60 ++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/ras.c     |  6 ++++
>  include/cxl/event.h        |  2 ++
>  3 files changed, 68 insertions(+)
> 
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index 7d7a813169f1..8f2ff3505d47 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -12,6 +12,7 @@
>  #include <linux/ratelimit.h>
>  #include <linux/edac.h>
>  #include <linux/ras.h>
> +#include <cxl/event.h>
>  #include <acpi/ghes.h>
>  #include <asm/cpu.h>
>  #include <asm/mce.h>
> @@ -157,6 +158,60 @@ static void extlog_print_pcie(struct cper_sec_pcie 
> *pcie_err,
>       }
>  }
>  
> +static void
> +extlog_cxl_cper_handle_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> +                             int severity)
> +{
> +#ifdef CONFIG_ACPI_APEI_PCIEAER

Why not apply this check on the function prototype?

Reference: Documentation/process/coding-style.rst
           Section 21) Conditional Compilation

> +     struct cxl_cper_prot_err_work_data wd;
> +     u8 *dvsec_start, *cap_start;
> +
> +     if (!(prot_err->valid_bits & PROT_ERR_VALID_AGENT_ADDRESS)) {
> +             pr_err_ratelimited("CXL CPER invalid agent type\n");
> +             return;
> +     }
> +
> +     if (!(prot_err->valid_bits & PROT_ERR_VALID_ERROR_LOG)) {
> +             pr_err_ratelimited("CXL CPER invalid protocol error log\n");
> +             return;
> +     }
> +
> +     if (prot_err->err_len != sizeof(struct cxl_ras_capability_regs)) {
> +             pr_err_ratelimited("CXL CPER invalid RAS Cap size (%u)\n",
> +                                prot_err->err_len);
> +             return;
> +     }
> +
> +     if (!(prot_err->valid_bits & PROT_ERR_VALID_SERIAL_NUMBER))
> +             pr_warn(FW_WARN "CXL CPER no device serial number\n");

Is this a requirement (in the spec) that we should warn users about?

The UEFI spec says that serial number is only used if "CXL agent" is a
"CXL device".

"CXL ports" won't have serial numbers. So this will be a false warning
for port errors.

> +
> +     switch (prot_err->agent_type) {
> +     case RCD:
> +     case DEVICE:
> +     case LD:
> +     case FMLD:
> +     case RP:
> +     case DSP:
> +     case USP:
> +             memcpy(&wd.prot_err, prot_err, sizeof(wd.prot_err));
> +
> +             dvsec_start = (u8 *)(prot_err + 1);
> +             cap_start = dvsec_start + prot_err->dvsec_len;
> +
> +             memcpy(&wd.ras_cap, cap_start, sizeof(wd.ras_cap));
> +             wd.severity = cper_severity_to_aer(severity);
> +             break;
> +     default:
> +             pr_err_ratelimited("CXL CPER invalid agent type: %d\n",

"invalid" is too harsh given that the specs may be updated. Maybe say
"reserved" or "unknown" or "unrecognized" instead.

Hopefully things will settle down to where a user will be able to have a
system with newer CXL "agents" without *requiring* a kernel update. :)

Thanks,
Yazen

Reply via email to