Re: [PATCH] Reset PCIe devices to stop ongoing DMA

Don Dutile Tue, 07 May 2013 05:51:47 -0700

On 05/07/2013 03:09 AM, Takao Indoh wrote:
> Sorry for the delayed response.
> 
> (2013/04/30 23:54), Sumner, William wrote:
>> I have installed your original patch set (from last November) and tested 
>> with three platforms, each with a different IO configuration.  On the first 
>> platform crashdumps were consistently successful.  On the second and third 
>> platforms, the reset of one specific PCI device on each platform (a 
>> different device type on each platform) immediately hung the crash kernel. 
>> This was consistent across multiple tries.  Temporarily modifying the patch 
>> to skip resetting the one problem-device on each platform allowed each 
>> platform to create a dump successfully.
>>
>> I am now working with our IO team to determine the exact nature of the hang 
>> and to see if the hang is unique to the specific cards or the specific 
>> platforms.
>>
>> Also I am starting to look at an alternate possibility:
>>
>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
>> Reset) describes reset use-cases for "a partitioned environment where 
>> hardware is migrated from one partition to another" and for "system software 
>> is taking down the software stack for a Function and then rebuilding that 
>> stack".
>>
>> These use-cases seem very similar to transitioning into the crash kernel.  
>> The "Implementation Note" at the end of that section describes an algorithm 
>> for "Avoiding Data Corruption From Stale Completions" which looks like it 
>> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
>> of the steps from this algorithm to one of the already proposed patches 
>> would avoid the hang that I saw on two platforms.
> 
> It seems that the algorithm you mentioned requires four steps.
> 
> 1. Clear Command register
> 2. Wait a few milliseconds (Or polling Transactions Pending bit)
> 3. Do FLR
> 4. Wait 100 ms
> 
> My patch does not do step 1 and 2. So, I would appreciate it if you
> could add the following into save_config() in my latest patch and
> confirm if kernel still hangs up or not.
> 
>          subordinate = dev->subordinate;
>          list_for_each_entry(child,&subordinate->devices, bus_list) {
>                  dev_info(&child->dev, "save state\n");
>                  pci_save_state(child);
> +               pci_write_config_word(child, PCI_COMMAND, 0);
> +               msleep(1000);
As Linus pointed out in an earlier patch, msleep() after each device
is s-l-o-w and unnecessary; should do a single msleep(1000) after
*all* the command registers are written.  That way, the added delay is 1sec,
not 1sec*ndevs.
q: is this turning off command register for only PCI(e) endpoints?
-- shouldn't have to turn off command register in bridges/switches.


>          }
> 
> Thanks,
> Takao Indoh
> 
> 
>>
>> Bill Sumner
>>
>> -----Original Message-----
>> From: kexec [mailto:[email protected]] On Behalf Of Don 
>> Dutile
>> Sent: Friday, April 26, 2013 8:43 AM
>> To: Takao Indoh
>> Cc: [email protected]; [email protected]; 
>> [email protected]; [email protected]; 
>> [email protected]; [email protected]
>> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
>>
>> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>>> (2013/04/26 3:01), Don Dutile wrote:
>>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>>
>>>>>>> Problem:
>>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>>> 1) Devices are working on first kernel.
>>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>>>          and its DMA continues during this switch.
>>>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>>>>          dma-remapping errors.
>>>>>>>
>>>>>>> Solution:
>>>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>>>> pci_iommu_init.
>>>>>>>
>>>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>>>> monitor blacks out if VGA controller is reset.
>>>>>>>
>>>>>> Couple questions wrt VGA device:
>>>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>>>>           is the VGA always function 0, so this scan sees it first&    
>>>>>> doesn't
>>>>>>           do a reset on that PCIe link?  if the VGA is not function 0, 
>>>>>> won't
>>>>>>           this logic break (will reset b/c function 0 is non-VGA 
>>>>>> graphics) ?
>>>>>
>>>>> VGA is not reset irrespective of its function number. The logic of this
>>>>> patch is:
>>>>>
>>>>> for_each_pci_dev(dev) {
>>>>>          if (dev is not PCIe)
>>>>>             continue;
>>>>>          if (dev is not root port/downstream port) ---(1)
>>>>>             continue;
>>>>>          list_for_each_entry(child,&dev->subordinate->devices, bus_list) {
>>>>>              if (child is upstream port or bridge or VGA) ---(2)
>>>>>                  continue;
>>>>>          }
>>>>>          do_reset_its_child(dev);
>>>>> }
>>>>>
>>>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>>>> downstream port) of VGA is also skipped by (2).
>>>>>
>>>>>
>>>>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>>>>           assumes it is, and why it isn't blanked.
>>>>>>           Q: Should the filter be based on a device having a 
>>>>>> device-class of display ?
>>>>>
>>>>> I want to avoid the situation that user's monitor blacks out and user
>>>>> cannot know what's going on. That's reason why I introduced the logic to
>>>>> skip VGA. As far as I tested the logic based on device-class works well,
>>>> sorry, I read your description, which said VGA, but your are filtering on 
>>>> display class,
>>>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
>>>> display devices
>>>> are probably one of the most aggressive DMA engines on a system.... and 
>>>> will grow as
>>>> asymmetric processing using GPUs gets architected into a device-agnostic 
>>>> manner.
>>>> So, this may work well for servers, which is the primary consumer/user of 
>>>> this feature,
>>>> and they typically have built-in graphics that are generally used in 
>>>> simple VGA mode,
>>>> so this may be sufficient for now.
>>>
>>> Ok, understood.
>>>
>>>
>>>>
>>>>> but I would appreciate it if there are better ways.
>>>>>
>>>> You probably don't want to hear it but....
>>>> a) only turn off cmd-reg master enable bit
>>>> b) only do reset based on a list of devices known not to
>>>>        obey their cmd-reg master enable bit, and only do reset to those 
>>>> devices.
>>>> But, given the testing you've done so far, this optional (need cmdline) 
>>>> feature,
>>>> let's start here.
>>>
>>> Ok. Either way I think we need more testing.
>>>
>>>
>>>>>>
>>>>>>> Actually this is v8 patch but quite different from v7 and it's been so
>>>>>>> long since previous post, so I start over again.
>>>>>> Thanks for this re-start.  I need to continue reviewing the rest.
>>>>>
>>>>> Thank you for your review!
>>>>>
>>>>>>
>>>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a 
>>>>>> crash
>>>>>>          dump?  After the crash dump, the system is rebooting to 
>>>>>> previous (iommu=on) setting.
>>>>>>          That logic, along w/your previous patch to disable the IOMMU if 
>>>>>> iommu=off
>>>>>>          is set, would remove this (relatively slow) PCI init sequencing 
>>>>>> ?
>>>>>
>>>>> To force iommu off, all ongoing DMA have to be stopped before that since
>>>>> they are accessing the device address, not physical address. If we disable
>>>>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>>>>> and it causes memory corruption or PCI-SERR due to DMA error.
>>>> Right, that's a 'duh' on my part.
>>>> I thought 'disable iommu' == 'block all dma' and it just turns it off&
>>>> let's the ongoing DMA run...
>>>> Please ignore this question... sigh.
>>>>
>>>>>
>>>>> So, whether we use iommu or not in second kernel, we have to stop DMA in
>>>>> second kernel if iommu is used in first kernel.
>>>>>
>>>>> Thanks,
>>>>> Takao Indoh
>>>>>
>>>>>
>>>>>>
>>>>>>> Previous post:
>>>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>>>>>>> https://lkml.org/lkml/2012/11/26/814
>>>>>>>
>>>>>>> Signed-off-by: Takao Indoh<[email protected]>
>>>>>>> ---
>>>>>>>        Documentation/kernel-parameters.txt |    2 +
>>>>>>>        drivers/pci/pci.c                   |  103 
>>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>        2 files changed, 105 insertions(+), 0 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/kernel-parameters.txt 
>>>>>>> b/Documentation/kernel-parameters.txt
>>>>>>> index 4609e81..2a31ade 100644
>>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also 
>>>>>>> be entirely omitted.
>>>>>>>                        any pair of devices, possibly at the cost of
>>>>>>>                        reduced performance.  This also guarantees
>>>>>>>                        that hot-added devices will work.
>>>>>>> +        pcie_reset_devices    Reset PCIe endpoint on boot by hot
>>>>>>> +                reset
>>>>>>>                cbiosize=nn[KMG]    The fixed amount of bus space which 
>>>>>>> is
>>>>>>>                        reserved for the CardBus bridge's IO window.
>>>>>>>                        The default value is 256 bytes.
>>>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>>>> index b099e00..42385c9 100644
>>>>>>> --- a/drivers/pci/pci.c
>>>>>>> +++ b/drivers/pci/pci.c
>>>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus 
>>>>>>> *bus)
>>>>>>>        }
>>>>>>>        EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>>>>
>>>>>>> +/*
>>>>>>> + * Return true if dev is PCIe root port or downstream port whose child 
>>>>>>> is PCIe
>>>>>>> + * endpoint except VGA device.
>>>>>>> + */
>>>>>>> +static int __init need_reset(struct pci_dev *dev)
>>>>>>> +{
>>>>>>> +    struct pci_bus *subordinate;
>>>>>>> +    struct pci_dev *child;
>>>>>>> +
>>>>>>> +    if (!pci_is_pcie(dev) || !dev->subordinate ||
>>>>>>> +        list_empty(&dev->subordinate->devices) ||
>>>>>>> +        ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
>>>>>>> +         (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>>>>>>> +        return 0;
>>>>>>> +
>>>>>>> +    subordinate = dev->subordinate;
>>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>>> +        if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>>>>> +            (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>>>>>>> +            ((child->class>>     16) == PCI_BASE_CLASS_DISPLAY))
>>>>>>> +            /* Don't reset switch, bridge, VGA device */
>>>>>>> +            return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return 1;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void __init save_config(struct pci_dev *dev)
>>>> This should be renamed.  it implies it is saving the config of the pdev 
>>>> passed in,
>>>> when in reality, it is saving the config of all the devices attached to 
>>>> this pdev.
>>>> i.e., save_downstream_configs()
>>>>
>>>>>>> +{
>>>>>>> +    struct pci_bus *subordinate;
>>>>>>> +    struct pci_dev *child;
>>>>>>> +
>>>>>>> +    if (!need_reset(dev))
>>>>>>> +        return;
>>>>>>> +
>>>>>>> +    subordinate = dev->subordinate;
>>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>>> +        dev_info(&child->dev, "save state\n");
>>>>>>> +        pci_save_state(child);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void __init restore_config(struct pci_dev *dev)
>>>> inverse of above: restore_downstream_configs()
>>>>>>> +{
>>>>>>> +    struct pci_bus *subordinate;
>>>>>>> +    struct pci_dev *child;
>>>>>>> +
>>>>>>> +    if (!need_reset(dev))
>>>>>>> +        return;
>>>>>>> +
>>>>>>> +    subordinate = dev->subordinate;
>>>>>>> +    list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>>>> +        dev_info(&child->dev, "restore state\n");
>>>>>>> +        pci_restore_state(child);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void __init do_device_reset(struct pci_dev *dev)
>>>> do_downstream_device_reset() -- it's not resetting this pdev,
>>>> but the pdev's of the devices attached to it.
>>>>
>>> Right, I'll change them as you said.
>>>
>>>
>>>>>>> +{
>>>>>>> +    u16 ctrl;
>>>>>>> +
>>>>>>> +    if (!need_reset(dev))
>>>>>>> +        return;
>>>>>>> +
>>>>>>> +    dev_info(&dev->dev, "Reset Secondary bus\n");
>>>>>>> +
>>>>>>> +    /* Assert Secondary Bus Reset */
>>>>>>> +    pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
>>>>>>> +    ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>>>>> +    pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>>> +
>>>>>>> +    msleep(2);
>>>>>>> +
>>>> This works well for x86, which uses ioport registers to access
>>>> these<256-offset registers, b/c the write() function can't return
>>>> until the write is actually completed, but for a non-x86 system,
>>>> with fully mmconf'd PCI space, a write() may still be a write&   run
>>>> (sitting in CPU write-merge) buffer, so if you need a full 2ms,
>>>> you ought to do another read_config() to the device, to flush the write,
>>>> before starting the msleep(2) clock.
>>>>
>>> I didn't know that... I'll insert something like this before msleep.
>>>
>>> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy);
>>>
>>> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code.
>>> Probably need the same fix, right?
>>>
>> sounds like it.
>> Interested to hear from someone in non-x86-land how
>> they ensure pci cfg writes aren't buffered in their cpus;
>> maybe handled via special uncached regions on other arches.
>>
>>>
>>>>>>> +    /* De-assert Secondary Bus Reset */
>>>>>>> +    ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>>>>> +    pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int __initdata pcie_reset_devices;
>>>>>>> +static int __init reset_pcie_devices(void)
>>>> this should be reset_pcie_endpoints() ... it's not resetting all pcie 
>>>> devices
>>>>>>> +{
>>>>>>> +    struct pci_dev *dev = NULL;
>>>>>>> +
>>>>>>> +    if (!pcie_reset_devices)
>>>>>>> +        return 0;
>>>>>>> +
>>>>>>> +    for_each_pci_dev(dev)
>>>>>>> +        save_config(dev);
>>>>>>> +
>>>>>>> +    for_each_pci_dev(dev)
>>>>>>> +        do_device_reset(dev);
>>>>>>> +
>>>>>>> +    msleep(1000);
>>>>>>> +
>>>>>>> +    for_each_pci_dev(dev)
>>>>>>> +        restore_config(dev);
>>>>>>> +
>>>> My apologies if past thread covered this sequence...
>>>> Why three loops through all PCIe devices on the system?
>>>> why not have the first for-each-pci-dev() loop filter devices
>>>> to be reset, and save_config those pdev's,
>>>> return a list of saved_pdev's;  feed that list into the do_device_reset();
>>>> then mpsleep(), then restore the list.
>>>> in fact, once you create a link list of pdev's to reset,
>>>> just loop that list doing save, then reset; rtn the list, do msleep(),
>>>> then restore the config of pdevs in the list.
>>>> Otherwise, doing much more traversing than what's needed.
>>>> Doing a great deal more config-saving then needed right now as well
>>>> (saving all non-endpt devices that aren't reset).
>>>>
>>>
>>> Creating list is nice idea, thanks. I'll do.
>>>
>>> I'll have vacation next week, so I'll post v2 patch the week after
>>> next.
>>>
>>> Thanks,
>>> Takao Indoh
>>>
>> Have a good vacation! Thanks for your willingness
>> to respin again&  make this improvement.
>>
>>
>> _______________________________________________
>> kexec mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/kexec
>>
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

Reply via email to