Hello Borislav,

On Mon, Jul 21, 2025 at 03:57:18PM +0200, Borislav Petkov wrote:
> On Mon, Jul 21, 2025 at 03:13:40AM -0700, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> >
> > This patch adds centralized logging for three common sources of
>
> "Add centralized... "

Ack!

> > recoverable hardware errors:
> >
> >   - PCIe AER Correctable errors
> >   - x86 Machine Check Exceptions (MCE)
> >   - APEI/CPER GHES corrected or recoverable errors
> >
> > hwerror_tracking is write-only at kernel runtime, and it is meant to be
> > read from vmcore using tools like crash/drgn. For example, this is how
> > it looks like when opening the crashdump from drgn.
> >
> > 	>>> prog['hwerror_tracking']
> > 	(struct hwerror_tracking_info [3]){
> > 		{
> > 			.count = (int)844,
> > 			.timestamp = (time64_t)1752852018,
> > 		},
> > 		...
> >
>
> I'm still missing the justification why rasdaemon can't be used here.
> You did explain it already in past emails.

Sorry, I will update it.

> > +enum hwerror_tracking_source {
> > +	HWE_RECOV_AER,
> > +	HWE_RECOV_MCE,
> > +	HWE_RECOV_GHES,
> > +	HWE_RECOV_MAX,
> > +};
>
> Are we confident this separation will serve all cloud dudes?

I am not, but I've added them to the CC list of this patch, so they are
more than free to chime in.

> > +void hwerror_tracking_log(enum hwerror_tracking_source src)
>
> A function should have a verb in its name explaining what it does:
>
> hwerr_log_error_type()
>
> or so.

Ack!

I will wait a bit more and send an updated version.

Thanks for the review
--breno
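
For readers following the thread, the scheme described in the quoted patch text (one counter and one last-seen timestamp per error source) can be sketched in user-space C as below. This is only an illustration under stated assumptions, not the patch itself: the struct fields mirror the drgn output quoted above, hwerr_log_error_type() is the name suggested in the review, and int64_t, time() and a plain increment stand in for whatever kernel types, time helpers and synchronization the real code uses.

```c
#include <stdint.h>
#include <time.h>

/* Error sources, as listed in the quoted patch. */
enum hwerror_tracking_source {
	HWE_RECOV_AER,
	HWE_RECOV_MCE,
	HWE_RECOV_GHES,
	HWE_RECOV_MAX,
};

/*
 * Field names follow the drgn dump quoted in the thread.
 * Assumption: int64_t stands in for the kernel's time64_t.
 */
struct hwerror_tracking_info {
	int count;
	int64_t timestamp;
};

/* One slot per source; written at runtime, read later from a dump. */
static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];

/*
 * Hypothetical user-space stand-in for the helper under review:
 * bump the counter for this source and remember when it last fired.
 * Out-of-range sources are ignored.
 */
void hwerr_log_error_type(enum hwerror_tracking_source src)
{
	if ((unsigned int)src >= HWE_RECOV_MAX)
		return;

	hwerror_tracking[src].count++;
	hwerror_tracking[src].timestamp = (int64_t)time(NULL);
}
```

Calling hwerr_log_error_type(HWE_RECOV_MCE) twice leaves the MCE slot with a count of 2 and a fresh timestamp, while the AER and GHES slots stay zeroed.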