On 05/07/2013 03:09 AM, Takao Indoh wrote: > Sorry for the delayed response. > > (2013/04/30 23:54), Sumner, William wrote: >> I have installed your original patch set (from last November) and tested >> with three platforms, each with a different IO configuration. On the first >> platform crashdumps were consistently successful. On the second and third >> platforms, the reset of one specific PCI device on each platform (a >> different device type on each platform) immediately hung the crash kernel. >> This was consistent across multiple tries. Temporarily modifying the patch >> to skip resetting the one problem-device on each platform allowed each >> platform to create a dump successfully. >> >> I am now working with our IO team to determine the exact nature of the hang >> and to see if the hang is unique to the specific cards or the specific >> platforms. >> >> Also I am starting to look at an alternate possibility: >> >> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level >> Reset) describes reset use-cases for "a partitioned environment where >> hardware is migrated from one partition to another" and for "system software >> is taking down the software stack for a Function and then rebuilding that >> stack". >> >> These use-cases seem very similar to transitioning into the crash kernel. >> The "Implementation Note" at the end of that section describes an algorithm >> for "Avoiding Data Corruption From Stale Completions" which looks like it >> might be applicable to stopping ongoing DMA. I am hopeful that adding some >> of the steps from this algorithm to one of the already proposed patches >> would avoid the hang that I saw on two platforms. > > It seems that the algorithm you mentioned requires four steps. > > 1. Clear Command register > 2. Wait a few milliseconds (Or polling Transactions Pending bit) > 3. Do FLR > 4. Wait 100 ms > > My patch does not do step 1 and 2. So, I would appreciate it if you > could add the following into save_config() in my latest patch and > confirm if kernel still hangs up or not. > > subordinate = dev->subordinate; > list_for_each_entry(child,&subordinate->devices, bus_list) { > dev_info(&child->dev, "save state\n"); > pci_save_state(child); > + pci_write_config_word(child, PCI_COMMAND, 0); > + msleep(1000); As Linus pointed out in an earlier patch, msleep() after each device is s-l-o-w and unnecessary; should do a single msleep(1000) after *all* the command registers are written. That way, the added delay is 1sec, not 1sec*ndevs. q: is this turning off command register for only PCI(e) endpoints? -- shouldn't have to turn off command register in bridges/switches.
> } > > Thanks, > Takao Indoh > > >> >> Bill Sumner >> >> -----Original Message----- >> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don >> Dutile >> Sent: Friday, April 26, 2013 8:43 AM >> To: Takao Indoh >> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; >> linux-kernel@vger.kernel.org; tin...@gmail.com; >> io...@lists.linux-foundation.org; bhelg...@google.com >> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA >> >> On 04/25/2013 11:10 PM, Takao Indoh wrote: >>> (2013/04/26 3:01), Don Dutile wrote: >>>> On 04/25/2013 01:11 AM, Takao Indoh wrote: >>>>> (2013/04/25 4:59), Don Dutile wrote: >>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote: >>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When >>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each >>>>>>> PCIe root port and downstream port to reset its downstream endpoint. >>>>>>> >>>>>>> Problem: >>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is >>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors >>>>>>> occur in second kernel and it causes problems like driver error or PCI >>>>>>> SERR, at last kdump fails. This problem is caused as follows. >>>>>>> 1) Devices are working on first kernel. >>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working >>>>>>> and its DMA continues during this switch. >>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes >>>>>>> dma-remapping errors. >>>>>>> >>>>>>> Solution: >>>>>>> All DMA transactions have to be stopped before iommu is initialized. By >>>>>>> this patch devices are reset and in-flight DMA is stopped before >>>>>>> pci_iommu_init. >>>>>>> >>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset. >>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root >>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link >>>>>>> between them. If the endpoint is VGA device, it is skipped because the >>>>>>> monitor blacks out if VGA controller is reset. >>>>>>> >>>>>> Couple questions wrt VGA device: >>>>>> (1) Many graphics devices are multi-function, one function being VGA; >>>>>> is the VGA always function 0, so this scan sees it first& >>>>>> doesn't >>>>>> do a reset on that PCIe link? if the VGA is not function 0, >>>>>> won't >>>>>> this logic break (will reset b/c function 0 is non-VGA >>>>>> graphics) ? >>>>> >>>>> VGA is not reset irrespective of its function number. The logic of this >>>>> patch is: >>>>> >>>>> for_each_pci_dev(dev) { >>>>> if (dev is not PCIe) >>>>> continue; >>>>> if (dev is not root port/downstream port) ---(1) >>>>> continue; >>>>> list_for_each_entry(child,&dev->subordinate->devices, bus_list) { >>>>> if (child is upstream port or bridge or VGA) ---(2) >>>>> continue; >>>>> } >>>>> do_reset_its_child(dev); >>>>> } >>>>> >>>>> Therefore VGA itself is skipped by (1), and upstream device(root port or >>>>> downstream port) of VGA is also skipped by (2). >>>>> >>>>> >>>>>> (2) I'm hearing VGA will soon not be the a required console; this logic >>>>>> assumes it is, and why it isn't blanked. >>>>>> Q: Should the filter be based on a device having a >>>>>> device-class of display ? >>>>> >>>>> I want to avoid the situation that user's monitor blacks out and user >>>>> cannot know what's going on. That's reason why I introduced the logic to >>>>> skip VGA. As far as I tested the logic based on device-class works well, >>>> sorry, I read your description, which said VGA, but your are filtering on >>>> display class, >>>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA >>>> display devices >>>> are probably one of the most aggressive DMA engines on a system.... and >>>> will grow as >>>> asymmetric processing using GPUs gets architected into a device-agnostic >>>> manner. >>>> So, this may work well for servers, which is the primary consumer/user of >>>> this feature, >>>> and they typically have built-in graphics that are generally used in >>>> simple VGA mode, >>>> so this may be sufficient for now. >>> >>> Ok, understood. >>> >>> >>>> >>>>> but I would appreciate it if there are better ways. >>>>> >>>> You probably don't want to hear it but.... >>>> a) only turn off cmd-reg master enable bit >>>> b) only do reset based on a list of devices known not to >>>> obey their cmd-reg master enable bit, and only do reset to those >>>> devices. >>>> But, given the testing you've done so far, this optional (need cmdline) >>>> feature, >>>> let's start here. >>> >>> Ok. Either way I think we need more testing. >>> >>> >>>>>> >>>>>>> Actually this is v8 patch but quite different from v7 and it's been so >>>>>>> long since previous post, so I start over again. >>>>>> Thanks for this re-start. I need to continue reviewing the rest. >>>>> >>>>> Thank you for your review! >>>>> >>>>>> >>>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a >>>>>> crash >>>>>> dump? After the crash dump, the system is rebooting to >>>>>> previous (iommu=on) setting. >>>>>> That logic, along w/your previous patch to disable the IOMMU if >>>>>> iommu=off >>>>>> is set, would remove this (relatively slow) PCI init sequencing >>>>>> ? >>>>> >>>>> To force iommu off, all ongoing DMA have to be stopped before that since >>>>> they are accessing the device address, not physical address. If we disable >>>>> iommu without stopping in-flihgt DMA, devices access invalid memory area >>>>> and it causes memory corruption or PCI-SERR due to DMA error. >>>> Right, that's a 'duh' on my part. >>>> I thought 'disable iommu' == 'block all dma' and it just turns it off& >>>> let's the ongoing DMA run... >>>> Please ignore this question... sigh. >>>> >>>>> >>>>> So, whether we use iommu or not in second kernel, we have to stop DMA in >>>>> second kernel if iommu is used in first kernel. >>>>> >>>>> Thanks, >>>>> Takao Indoh >>>>> >>>>> >>>>>> >>>>>>> Previous post: >>>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump >>>>>>> https://lkml.org/lkml/2012/11/26/814 >>>>>>> >>>>>>> Signed-off-by: Takao Indoh<indou.ta...@jp.fujitsu.com> >>>>>>> --- >>>>>>> Documentation/kernel-parameters.txt | 2 + >>>>>>> drivers/pci/pci.c | 103 >>>>>>> +++++++++++++++++++++++++++++++++++ >>>>>>> 2 files changed, 105 insertions(+), 0 deletions(-) >>>>>>> >>>>>>> diff --git a/Documentation/kernel-parameters.txt >>>>>>> b/Documentation/kernel-parameters.txt >>>>>>> index 4609e81..2a31ade 100644 >>>>>>> --- a/Documentation/kernel-parameters.txt >>>>>>> +++ b/Documentation/kernel-parameters.txt >>>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also >>>>>>> be entirely omitted. >>>>>>> any pair of devices, possibly at the cost of >>>>>>> reduced performance. This also guarantees >>>>>>> that hot-added devices will work. >>>>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot >>>>>>> + reset >>>>>>> cbiosize=nn[KMG] The fixed amount of bus space which >>>>>>> is >>>>>>> reserved for the CardBus bridge's IO window. >>>>>>> The default value is 256 bytes. >>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>>>>>> index b099e00..42385c9 100644 >>>>>>> --- a/drivers/pci/pci.c >>>>>>> +++ b/drivers/pci/pci.c >>>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus >>>>>>> *bus) >>>>>>> } >>>>>>> EXPORT_SYMBOL(pci_fixup_cardbus); >>>>>>> >>>>>>> +/* >>>>>>> + * Return true if dev is PCIe root port or downstream port whose child >>>>>>> is PCIe >>>>>>> + * endpoint except VGA device. >>>>>>> + */ >>>>>>> +static int __init need_reset(struct pci_dev *dev) >>>>>>> +{ >>>>>>> + struct pci_bus *subordinate; >>>>>>> + struct pci_dev *child; >>>>>>> + >>>>>>> + if (!pci_is_pcie(dev) || !dev->subordinate || >>>>>>> + list_empty(&dev->subordinate->devices) || >>>>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&& >>>>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))) >>>>>>> + return 0; >>>>>>> + >>>>>>> + subordinate = dev->subordinate; >>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) || >>>>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) || >>>>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY)) >>>>>>> + /* Don't reset switch, bridge, VGA device */ >>>>>>> + return 0; >>>>>>> + } >>>>>>> + >>>>>>> + return 1; >>>>>>> +} >>>>>>> + >>>>>>> +static void __init save_config(struct pci_dev *dev) >>>> This should be renamed. it implies it is saving the config of the pdev >>>> passed in, >>>> when in reality, it is saving the config of all the devices attached to >>>> this pdev. >>>> i.e., save_downstream_configs() >>>> >>>>>>> +{ >>>>>>> + struct pci_bus *subordinate; >>>>>>> + struct pci_dev *child; >>>>>>> + >>>>>>> + if (!need_reset(dev)) >>>>>>> + return; >>>>>>> + >>>>>>> + subordinate = dev->subordinate; >>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>>> + dev_info(&child->dev, "save state\n"); >>>>>>> + pci_save_state(child); >>>>>>> + } >>>>>>> +} >>>>>>> + >>>>>>> +static void __init restore_config(struct pci_dev *dev) >>>> inverse of above: restore_downstream_configs() >>>>>>> +{ >>>>>>> + struct pci_bus *subordinate; >>>>>>> + struct pci_dev *child; >>>>>>> + >>>>>>> + if (!need_reset(dev)) >>>>>>> + return; >>>>>>> + >>>>>>> + subordinate = dev->subordinate; >>>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) { >>>>>>> + dev_info(&child->dev, "restore state\n"); >>>>>>> + pci_restore_state(child); >>>>>>> + } >>>>>>> +} >>>>>>> + >>>>>>> +static void __init do_device_reset(struct pci_dev *dev) >>>> do_downstream_device_reset() -- it's not resetting this pdev, >>>> but the pdev's of the devices attached to it. >>>> >>> Right, I'll change them as you said. >>> >>> >>>>>>> +{ >>>>>>> + u16 ctrl; >>>>>>> + >>>>>>> + if (!need_reset(dev)) >>>>>>> + return; >>>>>>> + >>>>>>> + dev_info(&dev->dev, "Reset Secondary bus\n"); >>>>>>> + >>>>>>> + /* Assert Secondary Bus Reset */ >>>>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl); >>>>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET; >>>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>>>> + >>>>>>> + msleep(2); >>>>>>> + >>>> This works well for x86, which uses ioport registers to access >>>> these<256-offset registers, b/c the write() function can't return >>>> until the write is actually completed, but for a non-x86 system, >>>> with fully mmconf'd PCI space, a write() may still be a write& run >>>> (sitting in CPU write-merge) buffer, so if you need a full 2ms, >>>> you ought to do another read_config() to the device, to flush the write, >>>> before starting the msleep(2) clock. >>>> >>> I didn't know that... I'll insert something like this before msleep. >>> >>> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy); >>> >>> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code. >>> Probably need the same fix, right? >>> >> sounds like it. >> Interested to hear from someone in non-x86-land how >> they ensure pci cfg writes aren't buffered in their cpus; >> maybe handled via special uncached regions on other arches. >> >>> >>>>>>> + /* De-assert Secondary Bus Reset */ >>>>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET; >>>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl); >>>>>>> +} >>>>>>> + >>>>>>> +static int __initdata pcie_reset_devices; >>>>>>> +static int __init reset_pcie_devices(void) >>>> this should be reset_pcie_endpoints() ... it's not resetting all pcie >>>> devices >>>>>>> +{ >>>>>>> + struct pci_dev *dev = NULL; >>>>>>> + >>>>>>> + if (!pcie_reset_devices) >>>>>>> + return 0; >>>>>>> + >>>>>>> + for_each_pci_dev(dev) >>>>>>> + save_config(dev); >>>>>>> + >>>>>>> + for_each_pci_dev(dev) >>>>>>> + do_device_reset(dev); >>>>>>> + >>>>>>> + msleep(1000); >>>>>>> + >>>>>>> + for_each_pci_dev(dev) >>>>>>> + restore_config(dev); >>>>>>> + >>>> My apologies if past thread covered this sequence... >>>> Why three loops through all PCIe devices on the system? >>>> why not have the first for-each-pci-dev() loop filter devices >>>> to be reset, and save_config those pdev's, >>>> return a list of saved_pdev's; feed that list into the do_device_reset(); >>>> then mpsleep(), then restore the list. >>>> in fact, once you create a link list of pdev's to reset, >>>> just loop that list doing save, then reset; rtn the list, do msleep(), >>>> then restore the config of pdevs in the list. >>>> Otherwise, doing much more traversing than what's needed. >>>> Doing a great deal more config-saving then needed right now as well >>>> (saving all non-endpt devices that aren't reset). >>>> >>> >>> Creating list is nice idea, thanks. I'll do. >>> >>> I'll have vacation next week, so I'll post v2 patch the week after >>> next. >>> >>> Thanks, >>> Takao Indoh >>> >> Have a good vacation! Thanks for your willingness >> to respin again& make this improvement. >> >> >> _______________________________________________ >> kexec mailing list >> ke...@lists.infradead.org >> http://lists.infradead.org/mailman/listinfo/kexec >> >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/