I opened https://bugzilla.kernel.org/show_bug.cgi?id=59531 for this issue.
Unfortunately, I don't have an army of minions to assign things like this to. It's easy to change the dev state from frozen to normal before calling the slot_reset() callback, but we first have to unravel what that means for pcie_portdrv_slot_reset(), which restores the device state if it is frozen. Bjorn On Thu, Jun 6, 2013 at 12:29 AM, Yanmin Zhang <yanmin_zh...@linux.intel.com> wrote: > On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote: >> On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang >> <yanmin_zh...@linux.intel.com> wrote: >> > On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote: >> >> I'm not sure where we are with this patch. I think Joseph initially >> >> reported a problem (though I haven't actually seen that), and this >> >> patch fixed it, so it seems like there's something we want to do here. >> > Yes, indeed. We checked Powerpc error handling and plan to use the >> > similar method to deal with the issue. >> > >> > Sorry for replying very late. I move to Android smartphone area and am >> > very busy on many top issues. >> >> OK. Can you point us at a bugzilla or email archive of Joseph's >> original problem report, or post it to this thread? Then maybe >> somebody else can pick it up and make progress on this. > > Joseph sent email to my outlook emailbox. I changed it a little to delete > company > sensitive product name. > > ===========================Joseph's original email to me================ > Hi Tom and Yanmin, > > Hope this email can reach you well. I am working on a driver's PCI error > recovery with AER. I have a question with one of my observations from > platform AER driver's behavior. As your names and emails are listed to the > PCIe AER driver coming with the kernel, I send this question to you: > > During injecting AER Non-Correctable/Fatal error and PCIe error recovery > process, I observed the following from our Low Level Driver (LLD): > > 1. The LLD's error_detected() callback got called and the PCI channel state > passed in as "pci_channel_io_frozen", as expected; > > 2. The LLD's error_detected() callback function returned with > PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset; > > 3. The LLD's slot_reset() callback got called and it attempted to > re-initialize the device hardware for the recovery. However, the PCI slot > state was still in "pci_channel_io_frozen" and the pci_channel_offline() > helper routine returned 1. This is not expected, and it actually prevented > LLD in performing needed access to memory mapped I/O space for preparing the > device for recovery; > > 4. Later, the LLD's resume() callback got called and the PCI slot state was > set to "pci_channel_io_normal"; > > In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, "STEP 4 Slot > Reset", it stated after platform has performed PCI slot reset and then calls > LLD's slot_reset() callback: > > "This call gives drivers the chance to re-initialize the hardware > (re-download firmware, etc.). At this point, the driver may assume that the > card is in a fresh state and is fully functional. The slot is unfrozen and > the driver has full access to PCI config space, memory mapped I/O space and > DMA. Interrupts (Legacy, MSI, or MSI-X) will also be available." > > I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the > PCI slot state was set to "pci_channel_io_frozen" on AER_FATAL serverity, and > only set back to "pci_channel_io_normal" in report_resume() function. The PCI > slot state was not set to "pci_channel_io_normal" when invoking > report_slot_reset(). > > As a comparison, I also perform the EEH error recovery with the LLD driver on > a PPC platform, which indeed set the PCI slot state to > "pci_channel_io_normal" when calling LLD's slot_reset() callback function. > > Can you please verify this platform AER driver's behavior is intended and > will stay with the AER platform support, or it should be changed to be > consistent with the PCI error recovery procedure described in the > pci-error-recovery.txt documentation? I also noticed that before invoking the > broadcast_error_message() function with "slot_reset", there is a comment in > the 3.7.0 kernel source: > > /* > * TODO: Should call platform-specific > * functions to reset slot before calling > * drivers' slot_reset callbacks? > */ > And I do see that only the PCIe Link_Reset was performed, no PCI fundamental > reset was performed even the LLD set the PCIe device's "pdev->needs_freset = > 1", is this the area that later will be added for performing needed PCI hot > reset or fundamental reset at this stage? > > Your timely response is very appreciated. Thanks in advance and please let me > know if you think I should redirect the question to someone else. > > Best Regards, > Joseph Liu, Ph.D. > Senior Principal Engineer > Emulex Corporation > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/