Gaurav Batra <gba...@linux.vnet.ibm.com> writes:

> Hello Michael,
>
> Did you get a chance to look into this patch? I don't mean to rush you.
> Just wondering if there is anything I can do to help get the patch
> upstream.
I skimmed it and decided it wasn't a critical bug fix, and hoped someone
else would review it - silly me :D

But the patch looks simple enough, and the explanation is very good, so I
think it looks good to merge. I'll apply it for v6.5.

cheers

> On 6/13/23 12:17 PM, Gaurav Batra wrote:
>> Hello Michael,
>>
>> I found this bug while going through the code. This bug is exposed when
>> the DDW is smaller than the max memory of the LPAR. This will result in
>> creating a DDW which has dynamically mapped TCEs (no direct mapping).
>>
>> I would like to stress that this bug is exposed only in the upstream
>> kernel. Current kernel levels in distros are not exposed to this, since
>> they don't have the concept of a "dynamically mapped" DDW.
>>
>> I didn't have access to any P10 boxes with a large amount of memory to
>> re-create the scenario. On P10 we have 2MB TCEs, which results in a DDW
>> large enough to cover the max memory I could have for the LPAR. As a
>> result, the IO bus addresses generated were always within the DDW
>> limits and no H_PARAMETER was returned by the HCALL.
>>
>> So, I hacked the kernel to force the use of 64K TCEs. This resulted in
>> a DDW smaller than max memory.
>>
>> When I tried to DLPAR ADD memory, it failed with an error code of -4
>> (H_PARAMETER) from the HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when
>> iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().
>>
>> I didn't test the DLPAR REMOVE path, to verify whether incorrect TCEs
>> are removed by tce_clearrange_multi_pSeriesLP(), since I would need to
>> hack the kernel to force dynamically added TCEs to the high IO bus
>> addresses. But the concept is the same.
>>
>> Thanks,
>>
>> Gaurav
>>
>> On 6/13/23 12:16 PM, Gaurav Batra wrote:
>>> When memory is dynamically added/removed, iommu_mem_notifier() is
>>> invoked. This routine traverses through all the DMA windows (DDW
>>> only, not default windows) to add/remove "direct" TCE mappings.
>>> The routines for this purpose are tce_setrange_multi_pSeriesLP() and
>>> tce_clearrange_multi_pSeriesLP().
>>>
>>> Both these routines are designed for direct-mapped DMA windows only.
>>>
>>> The issue is that there could be some DMA windows in the list which
>>> are not "direct" mapped. Calling these routines will either,
>>>
>>> 1) remove some dynamically mapped TCEs, or
>>> 2) try to add TCEs which are out of bounds, and the HCALL returns
>>>    H_PARAMETER
>>>
>>> Here are the side effects when these routines are incorrectly invoked
>>> for "dynamically" mapped DMA windows.
>>>
>>> tce_setrange_multi_pSeriesLP()
>>>
>>> This adds direct-mapped TCEs. This could invoke the HCALL to add TCEs
>>> with an out-of-bounds range. In this scenario, the HCALL will return
>>> H_PARAMETER and DLPAR ADD of memory will fail.
>>>
>>> tce_clearrange_multi_pSeriesLP()
>>>
>>> This will remove a range of TCEs. The TCE range that is calculated,
>>> depending on the memory range being added, could in fact be mapping
>>> some other memory address (in the dynamic DMA window scenario). This
>>> will wipe out those TCEs.
>>>
>>> The solution is for iommu_mem_notifier() to only invoke these
>>> routines for "direct" mapped DMA windows.
>>>
>>> Signed-off-by: Gaurav Batra <gba...@linux.vnet.ibm.com>
>>> Reviewed-by: Brian King <brk...@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 17 +++++++++++++----
>>>  1 file changed, 13 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index 918f511837db..24dd61636400 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
>>>  struct dma_win {
>>>  	struct device_node *device;
>>>  	const struct dynamic_dma_window_prop *prop;
>>> +	bool direct;
>>>  	struct list_head list;
>>>  };
>>>
>>> @@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>  		goto out_del_prop;
>>>
>>>  	if (direct_mapping) {
>>> +		window->direct = true;
>>> +
>>>  		/* DDW maps the whole partition, so enable direct DMA mapping */
>>>  		ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
>>>  				win64->value, tce_setrange_multi_pSeriesLP_walk);
>>> @@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>  		int i;
>>>  		unsigned long start = 0, end = 0;
>>>
>>> +		window->direct = false;
>>> +
>>>  		for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
>>>  			const unsigned long mask = IORESOURCE_MEM_64 | IORESOURCE_MEM;
>>>
>>> @@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>>  	case MEM_GOING_ONLINE:
>>>  		spin_lock(&dma_win_list_lock);
>>>  		list_for_each_entry(window, &dma_win_list, list) {
>>> -			ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> -					arg->nr_pages, window->prop);
>>> +			if (window->direct) {
>>> +				ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
>>> +						arg->nr_pages, window->prop);
>>> +			}
>>>  			/* XXX log error */
>>>  		}
>>>  		spin_unlock(&dma_win_list_lock);
>>> @@ -1597,8 +1604,10 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
>>>  	case MEM_OFFLINE:
>>>  		spin_lock(&dma_win_list_lock);
>>>  		list_for_each_entry(window, &dma_win_list, list) {
>>> -			ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> -					arg->nr_pages, window->prop);
>>> +			if (window->direct) {
>>> +				ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
>>> +						arg->nr_pages, window->prop);
>>> +			}
>>>  			/* XXX log error */
>>>  		}
>>>  		spin_unlock(&dma_win_list_lock);