On Wed, Jun 24, 2020 at 3:20 PM Sam Bobroff <sbobr...@linux.ibm.com> wrote: > > Give a unique ID to each recovery event, to ease log parsing and > prepare for parallel recovery. > > Also add some new messages with a very simple format that may be > useful to log-parsers. > > Signed-off-by: Sam Bobroff <sbobr...@linux.ibm.com> > --- > This patch should be applied on top of my recent(ish) set: > "powerpc/eeh: Synchronization for safety".
If you're going to do a respin I'd post these as a single series and rebase it on mainline. There's a bit of drift. > diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c > index 68e6dfa526a5..54f921ff7621 100644 > --- a/arch/powerpc/kernel/eeh.c > +++ b/arch/powerpc/kernel/eeh.c > @@ -197,7 +197,8 @@ EXPORT_SYMBOL_GPL(eeh_recovery_must_be_locked); > * for the indicated PCI device, and puts them into a buffer > * for RTAS error logging. > */ > -static size_t eeh_dump_dev_log(struct eeh_dev *edev, char *buf, size_t len) > +static size_t eeh_dump_dev_log(unsigned int event_id, struct eeh_dev *edev, > + char *buf, size_t len) If we're going to pass around some event context then IMO we should pass around the eeh_event itself rather than just an ID number. That would give us somewhere to put any extra per-event context (such as the saved stacktrace) rather than dumping it into eeh_pe. We'd probably have to fix the "special" events so they're signalled by some means other than a NULL event pointer. *snip* > @@ -280,19 +283,26 @@ static void eeh_pe_report_pdev(struct pci_dev *pdev, > eeh_report_fn fn, > driver = eeh_pcid_get(pdev); > > if (!driver) > - pci_info(pdev, "no driver"); > + pci_info(pdev, "EEH(%u): no driver", event_id); > else if (!driver->err_handler) > - pci_info(pdev, "driver not EEH aware"); > + pci_info(pdev, "EEH(%u): driver not EEH aware", > event_id); > else if (late) > - pci_info(pdev, "driver bound too late"); > + pci_info(pdev, "EEH(%u): driver bound too late", > event_id); > else { > - new_result = fn(pdev, driver); > + pr_warn("EEH(%u): EVENT=HANDLER_CALL > DEVICE=%04x:%02x:%02x.%x DRIVER='%s' HANDLER='%s'\n", WHY ARE WE YELLING > @@ -579,7 +598,8 @@ static void *eeh_add_virt_device(struct eeh_dev *edev) > * lock is dropped (which it must be in order to take the PCI rescan/remove > * lock without risking a deadlock). > */ > -static void eeh_rmv_device(struct pci_dev *pdev, void *userdata) > +static void eeh_rmv_device(unsigned int event_id, > + struct pci_dev *pdev, void *userdata) > { > struct eeh_dev *edev; > struct pci_driver *driver; > @@ -588,8 +608,8 @@ static void eeh_rmv_device(struct pci_dev *pdev, void > *userdata) > > edev = pci_dev_to_eeh_dev(pdev); > if (!edev) { > - pci_warn(pdev, "EEH: Device removed during processing > (#%d)\n", > - __LINE__); > + pci_warn(pdev, "EEH(%u): Device removed during processing > (#%d)\n", > + event_id, __LINE__); It's already there, but what's with the __LINE__ ? > diff --git a/arch/powerpc/kernel/eeh_event.c b/arch/powerpc/kernel/eeh_event.c > index a7a8dc182efb..bd38d6fe5449 100644 > --- a/arch/powerpc/kernel/eeh_event.c > +++ b/arch/powerpc/kernel/eeh_event.c > @@ -26,6 +26,9 @@ static DEFINE_SPINLOCK(eeh_eventlist_lock); > static DECLARE_COMPLETION(eeh_eventlist_event); > static LIST_HEAD(eeh_eventlist); > > +/* Event ID 0 is reserved for special events */ > +static atomic_t eeh_event_id = ATOMIC_INIT(1); > + I don't think using zero for all special events is a good idea. Special events are just events that are detected by the EEH notification interrupt. Unlike the MMIO / config space detection mechanism we don't have any device or PE context available in the interrupt handler so the work of figuring out where the error came from is punted to the recovery thread. IMO this function probably shouldn't be calling eeh_handle_normal_event() at all. Instead it should queue a new eeh_event (with a unique ID) for each error it finds. That way handling a "special" event just consists of scanning for which PHB / PE is currently broken and the actual recovery path is identical. If we switched to using a threaded IRQ handler (which can block) for the EEH notification interrupts we could probably kill off special events entirely. > @@ -1338,7 +1367,7 @@ void eeh_handle_special_event(void) > if (rc == EEH_NEXT_ERR_FROZEN_PE || > rc == EEH_NEXT_ERR_FENCED_PHB) { > eeh_pe_state_mark(pe, EEH_PE_RECOVERING); > - eeh_handle_normal_event(pe); > + eeh_handle_normal_event(0, pe); I think that needs to be a unique ID even if we keep this function calling eeh_handle_normal_event() directly. > } else { > eeh_for_each_pe(pe, tmp_pe) > eeh_pe_for_each_dev(tmp_pe, edev, tmp_edev) > @@ -1347,7 +1376,7 @@ void eeh_handle_special_event(void) > /* Notify all devices to be down */ > eeh_pe_state_clear(pe, EEH_PE_PRI_BUS, true); > eeh_set_channel_state(pe, > pci_channel_io_perm_failure); > - eeh_pe_report( > + eeh_pe_report(0, > "error_detected(permanent failure)", pe, > eeh_report_failure, NULL);