Sorry for the delayed response. (2013/04/30 23:54), Sumner, William wrote: > I have installed your original patch set (from last November) and tested with > three platforms, each with a different IO configuration. On the first > platform crashdumps were consistently successful. On the second and third > platforms, the reset of one specific PCI device on each platform (a different > device type on each platform) immediately hung the crash kernel. This was > consistent across multiple tries. Temporarily modifying the patch to skip > resetting the one problem-device on each platform allowed each platform to > create a dump successfully. > > I am now working with our IO team to determine the exact nature of the hang > and to see if the hang is unique to the specific cards or the specific > platforms. > > Also I am starting to look at an alternate possibility: > > The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level > Reset) describes reset use-cases for "a partitioned environment where > hardware is migrated from one partition to another" and for "system software > is taking down the software stack for a Function and then rebuilding that > stack". > > These use-cases seem very similar to transitioning into the crash kernel. > The "Implementation Note" at the end of that section describes an algorithm > for "Avoiding Data Corruption From Stale Completions" which looks like it > might be applicable to stopping ongoing DMA. I am hopeful that adding some > of the steps from this algorithm to one of the already proposed patches would > avoid the hang that I saw on two platforms.
It seems that the algorithm you mentioned requires four steps. 1. Clear Command register 2. Wait a few milliseconds (Or polling Transactions Pending bit) 3. Do FLR 4. Wait 100 ms My patch does not do step 1 and 2. So, I would appreciate it if you could add the following into save_config() in my latest patch and confirm if kernel still hangs up or not. subordinate = dev->subordinate; list_for_each_entry(child, &subordinate->devices, bus_list) { dev_info(&child->dev, "save state\n"); pci_save_state(child); + pci_write_config_word(child, PCI_COMMAND, 0); + msleep(1000); } Thanks, Takao Indoh > > Bill Sumner > > -----Original Message----- > From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile > Sent: Friday, April 26, 2013 8:43 AM > To: Takao Indoh > Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; > linux-kernel@vger.kernel.org; tin...@gmail.com; > io...@lists.linux-foundation.org; bhelg...@google.com > Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA > > On 04/25/2013 11:10 PM, Takao Indoh wrote: >> (2013/04/26 3:01), Don Dutile wrote: >>> On 04/25/2013 01:11 AM, Takao Indoh wrote: >>>> (2013/04/25 4:59), Don Dutile wrote: >>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote: >>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When >>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each >>>>>> PCIe root port and downstream port to reset its downstream endpoint. >>>>>> >>>>>> Problem: >>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is >>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors >>>>>> occur in second kernel and it causes problems like driver error or PCI >>>>>> SERR, at last kdump fails. This problem is caused as follows. >>>>>> 1) Devices are working on first kernel. >>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working >>>>>> and its DMA continues during this switch. >>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes >>>>>> dma-remapping errors. >>>>>> >>>>>> Solution: >>>>>> All DMA transactions have to be stopped before iommu is initialized. By >>>>>> this patch devices are reset and in-flight DMA is stopped before >>>>>> pci_iommu_init. >>>>>> >>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset. >>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root >>>>>> port/downstream port whose child is PCIe endpoint, and then reset link >>>>>> between them. If the endpoint is VGA device, it is skipped because the >>>>>> monitor blacks out if VGA controller is reset. >>>>>> >>>>> Couple questions wrt VGA device: >>>>> (1) Many graphics devices are multi-function, one function being VGA; >>>>> is the VGA always function 0, so this scan sees it first& >>>>> doesn't >>>>> do a reset on that PCIe link? if the VGA is not function 0, >>>>> won't >>>>> this logic break (will reset b/c function 0 is non-VGA graphics) >>>>> ? >>>> >>>> VGA is not reset irrespective of its function number. The logic of this >>>> patch is: >>>> >>>> for_each_pci_dev(dev) { >>>> if (dev is not PCIe) >>>> continue; >>>> if (dev is not root port/downstream port) ---(1) >>>> continue; >>>> list_for_each_entry(child,&dev->subordinate->devices, bus_list) { >>>> if (child is upstream port or bridge or VGA) ---(2) >>>> continue; >>>> } >>>> do_reset_its_child(dev); >>>> } >>>> >>>> Therefore VGA itself is skipped by (1), and upstream device(root port or >>>> downstream port) of VGA is also skipped by (2). >>>> >>>> >>>>> (2) I'm hearing VGA will soon not be the a required console; this logic >>>>> assumes it is, and why it isn't blanked. >>>>> Q: Should the filter be based on a device having a device-class >>>>> of display ? >>>> >>>> I want to avoid the situation that user's monitor blacks out and user >>>> cannot know what's going on. That's reason why I introduced the logic to >>>> skip VGA. As far as I tested the logic based on device-class works well, >>> sorry, I read your description, which said VGA, but your are filtering on >>> display class, >>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA >>> display devices >>> are probably one of the most aggressive DMA engines on a system.... and >>> will grow as >>> asymmetric processing using GPUs gets architected into a device-agnostic >>> manner. >>> So, this may work well for servers, which is the primary consumer/user of >>> this feature, >>> and they typically have built-in graphics that are generally used in simple >>> VGA mode, >>> so this may be sufficient for now. >> >> Ok, understood. >> >> >>> >>>> but I would appreciate it if there are better ways. >>>> >>> You probably don't want to hear it but.... >>> a) only turn off cmd-reg master enable bit >>> b) only do reset based on a list of devices known not to >>> obey their cmd-reg master enable bit, and only do reset to those >>> devices. >>> But, given the testing you've done so far, this optional (need cmdline) >>> feature, >>> let's start here. >> >> Ok. Either way I think we need more testing. >> >> >>>>> >>>>>> Actually this is v8 patch but quite different from v7 and it's been so >>>>>> long since previous post, so I start over again. >>>>> Thanks for this re-start. I need to continue reviewing the rest. >>>> >>>> Thank you for your review! >>>> >>>>> >>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a >>>>> crash >>>>> dump? After the crash dump, the system is rebooting to previous >>>>> (iommu=on) setting. >>>>> That logic, along w/your previous patch to disable the IOMMU if >>>>> iommu=off >>>>> is set, would remove this (relatively slow) PCI init sequencing ? >>>> >>>> To force iommu off, all ongoing DMA have to be stopped before that since >>>> they are accessing the device address, not physical address. If we disable >>>> iommu without stopping in-flihgt DMA, devices access invalid memory area >>>> and it causes memory corruption or PCI-SERR due to DMA error. >>> Right, that's a 'duh' on my part. >>> I thought 'disable iommu' == 'block all dma' and it just turns it off& >>> let's the ongoing DMA run... >>> Please ignore this question... sigh. >>> >>>> >>>> So, whether we use iommu or not in second kernel, we have to stop DMA in >>>> second kernel if iommu is used in first kernel. >>>> >>>> Thanks, >>>> Takao Indoh >>>> >>>> >>>>> >>>>>> Previous post: >>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump >>>>>> https://lkml.org/lkml/2012/11/26/814 >>>>>> >>>>>> Signed-off-by: Takao Indoh<indou.ta...@jp.fujitsu.com> >>>>>> --- >>>>>> Documentation/kernel-parameters.txt | 2 + >>>>>> drivers/pci/pci.c | 103 >>>>>> +++++++++++++++++++++++++++++++++++ >>>>>> 2 files changed, 105 insertions(+), 0 deletions(-) >>>>>> >>>>>> diff --git a/Documentation/kernel-parameters.txt >>>>>> b/Documentation/kernel-parameters.txt >>>>>> index 4609e81..2a31ade 100644 >>>>>> --- a/Documentation/kernel-parameters.txt >>>>>> +++ b/Documentation/kernel-parameters.txt >>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also >>>>>> be entirely omitted. >>>>>> any pair of devices, possibly at the cost of >>>>>> reduced performance. This also guarantees >>>>>> that hot-added devices will work. >>>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot >>>>>> + reset >>>>>> cbiosize=nn[KMG] The fixed amount of bus space which is >>>>>> reserved for the CardBus bridge's IO window. >>>>>> The default value is 256 bytes. >>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>>>> index b099e00..42385c9 100644 >>>>>> --- a/drivers/pci/pci.c >>>>>> +++ b/drivers/pci/pci.c >>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus >>>>>> *bus) >>>>>> } >>>>>> EXPORT_SYMBOL(pci_fixup_cardbus); >>>>>> >>>>>> +/* >>>>>> + * Return true if dev is PCIe root port or downstream port whose child >>>>>> is PCIe >>>>>> + * endpoint except VGA device. >>>>>> + */ >>>>>> +static int __init need_reset(struct pci_dev *dev) >>>>>> +{ >>>>>> + struct pci_bus *subordinate; >>>>>> + struct pci_dev *child; >>>>>> + >>>>>> + if (!pci_is_pcie(dev) || !dev->subordinate || >>>>>> + list_empty(&dev->subordinate->devices) || >>>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&& >>>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >>>>>> + return 0; >>>>>> + >>>>>> + subordinate = dev->subordinate; >>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >>>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY)) >>>>>> + /* Don't reset switch, bridge, VGA device */ >>>>>> + return 0; >>>>>> + } >>>>>> + >>>>>> + return 1; >>>>>> +} >>>>>> + >>>>>> +static void __init save_config(struct pci_dev *dev) >>> This should be renamed. it implies it is saving the config of the pdev >>> passed in, >>> when in reality, it is saving the config of all the devices attached to >>> this pdev. >>> i.e., save_downstream_configs() >>> >>>>>> +{ >>>>>> + struct pci_bus *subordinate; >>>>>> + struct pci_dev *child; >>>>>> + >>>>>> + if (!need_reset(dev)) >>>>>> + return; >>>>>> + >>>>>> + subordinate = dev->subordinate; >>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>> + dev_info(&child->dev, "save state\n"); >>>>>> + pci_save_state(child); >>>>>> + } >>>>>> +} >>>>>> + >>>>>> +static void __init restore_config(struct pci_dev *dev) >>> inverse of above: restore_downstream_configs() >>>>>> +{ >>>>>> + struct pci_bus *subordinate; >>>>>> + struct pci_dev *child; >>>>>> + >>>>>> + if (!need_reset(dev)) >>>>>> + return; >>>>>> + >>>>>> + subordinate = dev->subordinate; >>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>> + dev_info(&child->dev, "restore state\n"); >>>>>> + pci_restore_state(child); >>>>>> + } >>>>>> +} >>>>>> + >>>>>> +static void __init do_device_reset(struct pci_dev *dev) >>> do_downstream_device_reset() -- it's not resetting this pdev, >>> but the pdev's of the devices attached to it. >>> >> Right, I'll change them as you said. >> >> >>>>>> +{ >>>>>> + u16 ctrl; >>>>>> + >>>>>> + if (!need_reset(dev)) >>>>>> + return; >>>>>> + >>>>>> + dev_info(&dev->dev, "Reset Secondary bus\n"); >>>>>> + >>>>>> + /* Assert Secondary Bus Reset */ >>>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl); >>>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>>> + >>>>>> + msleep(2); >>>>>> + >>> This works well for x86, which uses ioport registers to access >>> these<256-offset registers, b/c the write() function can't return >>> until the write is actually completed, but for a non-x86 system, >>> with fully mmconf'd PCI space, a write() may still be a write& run >>> (sitting in CPU write-merge) buffer, so if you need a full 2ms, >>> you ought to do another read_config() to the device, to flush the write, >>> before starting the msleep(2) clock. >>> >> I didn't know that... I'll insert something like this before msleep. >> >> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy); >> >> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code. >> Probably need the same fix, right? >> > sounds like it. > Interested to hear from someone in non-x86-land how > they ensure pci cfg writes aren't buffered in their cpus; > maybe handled via special uncached regions on other arches. > >> >>>>>> + /* De-assert Secondary Bus Reset */ >>>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET; >>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>>> +} >>>>>> + >>>>>> +static int __initdata pcie_reset_devices; >>>>>> +static int __init reset_pcie_devices(void) >>> this should be reset_pcie_endpoints() ... it's not resetting all pcie >>> devices >>>>>> +{ >>>>>> + struct pci_dev *dev = NULL; >>>>>> + >>>>>> + if (!pcie_reset_devices) >>>>>> + return 0; >>>>>> + >>>>>> + for_each_pci_dev(dev) >>>>>> + save_config(dev); >>>>>> + >>>>>> + for_each_pci_dev(dev) >>>>>> + do_device_reset(dev); >>>>>> + >>>>>> + msleep(1000); >>>>>> + >>>>>> + for_each_pci_dev(dev) >>>>>> + restore_config(dev); >>>>>> + >>> My apologies if past thread covered this sequence... >>> Why three loops through all PCIe devices on the system? >>> why not have the first for-each-pci-dev() loop filter devices >>> to be reset, and save_config those pdev's, >>> return a list of saved_pdev's; feed that list into the do_device_reset(); >>> then mpsleep(), then restore the list. >>> in fact, once you create a link list of pdev's to reset, >>> just loop that list doing save, then reset; rtn the list, do msleep(), >>> then restore the config of pdevs in the list. >>> Otherwise, doing much more traversing than what's needed. >>> Doing a great deal more config-saving then needed right now as well >>> (saving all non-endpt devices that aren't reset). >>> >> >> Creating list is nice idea, thanks. I'll do. >> >> I'll have vacation next week, so I'll post v2 patch the week after >> next. >> >> Thanks, >> Takao Indoh >> > Have a good vacation! Thanks for your willingness > to respin again & make this improvement. > > > _______________________________________________ > kexec mailing list > ke...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/kexec > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/