Hello,

On Wed, Mar 26, 2025 at 09:46:11AM -0500, Gaurav Batra wrote:
> Hello Michal,
> 
> In the patch to fix the pmemory bug, I made some changes to the code that
> determines Max memory an LPAR can have (excluding pmemory). This information
> is needed while creating Dynamic DMA Window (DDW). These changes are in the
> main line code path of DDW creation. This might have irritated QEMU somehow,
> no idea yet on how.

Yes, it's definitely something with the DDW code. Using the
disable_ddw=1 kernel parameter avoids the qemu crash.

The kernels in
https://download.opensuse.org/repositories/Kernel:/SLE15-SP7/pool/ppc64le/

have the patch applied.

Booting the kernel inside a qemu VM with a PCI device (such as the USB
hub) and then rebooting the VM crashes qemu.
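
For reference, the guard the patch adds to iommu_mem_notifier() can be
modelled in userspace roughly as below. This is only a sketch: the helper
name range_needs_premap() and its parameters are made up for illustration,
and ddw_max stands in for whatever ddw_memory_hotplug_max() returns on a
given system. It does not reproduce the qemu crash, it only shows which
hotplugged ranges the patched notifier would pre-map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the condition added by the patch:
 *
 *   window->direct && (arg->start_pfn << PAGE_SHIFT) < ddw_memory_hotplug_max()
 *
 * window_direct : stands in for window->direct
 * start_pfn     : stands in for arg->start_pfn
 * page_shift    : stands in for PAGE_SHIFT
 * ddw_max       : assumed result of ddw_memory_hotplug_max()
 */
static bool range_needs_premap(bool window_direct, uint64_t start_pfn,
                               unsigned int page_shift, uint64_t ddw_max)
{
        /* A shift of 64 or more would be undefined for a 64-bit value. */
        assert(page_shift < 64);

        /* Only directly mapped windows have pre-mapped TCEs at all. */
        if (!window_direct)
                return false;

        /* Ranges starting at or above the limit are treated as pmemory. */
        return (start_pfn << page_shift) < ddw_max;
}
```

With a 64K page size (page_shift = 16) and an assumed limit of 1 TiB, a
range starting at pfn 0x100 would be pre-mapped, while a range starting
at the limit or above would be skipped.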

Thanks

Michal

> 
> Thanks,
> 
> Gaurav
> 
> On 3/19/25 12:29 PM, Michal Suchánek wrote:
> > Hello,
> > 
> > looks like this upsets some assumption qemu has about these windows.
> > 
> > https://lists.nongnu.org/archive/html/qemu-devel/2025-03/msg05137.html
> > 
> > When Linux kernel that has this patch applied is running inside a qemu
> > VM with a PCI device and the VM is rebooted qemu crashes shortly after
> > the next Linux kernel starts.
> > 
> > This is quite curious since qemu does AFAIK not support pmemory at all.
> > 
> > Any idea what went wrong there?
> > 
> > Thanks
> > 
> > Michal
> > 
> > On Thu, Jan 30, 2025 at 12:38:54PM -0600, Gaurav Batra wrote:
> > > iommu_mem_notifier() is invoked when RAM is dynamically added/removed. This
> > > notifier call is responsible to add/remove TCEs from the Dynamic DMA Window
> > > (DDW) when TCEs are pre-mapped. TCEs are pre-mapped only for RAM and not
> > > for persistent memory (pmemory). For DMA buffers in pmemory, TCEs are
> > > dynamically mapped when the device driver instructs to do so.
> > > 
> > > The issue is 'daxctl' command is capable of adding pmemory as "System RAM"
> > > after LPAR boot. The command to do so is -
> > > 
> > > daxctl reconfigure-device --mode=system-ram dax0.0 --force
> > > 
> > > This will dynamically add pmemory range to LPAR RAM eventually invoking
> > > iommu_mem_notifier(). The address range of pmemory is way beyond the Max
> > > RAM that the LPAR can have. Which means, this range is beyond the DDW
> > > created for the device, at device initialization time.
> > > 
> > > As a result when TCEs are pre-mapped for the pmemory range, by
> > > iommu_mem_notifier(), PHYP HCALL returns H_PARAMETER. This failed the
> > > command, daxctl, to add pmemory as RAM.
> > > 
> > > The solution is to not pre-map TCEs for pmemory.
> > > 
> > > Signed-off-by: Gaurav Batra <gba...@linux.ibm.com>
> > > ---
> > >   arch/powerpc/include/asm/mmzone.h      |  1 +
> > >   arch/powerpc/mm/numa.c                 |  2 +-
> > >   arch/powerpc/platforms/pseries/iommu.c | 29 ++++++++++++++------------
> > >   3 files changed, 18 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
> > > index d99863cd6cde..049152f8d597 100644
> > > --- a/arch/powerpc/include/asm/mmzone.h
> > > +++ b/arch/powerpc/include/asm/mmzone.h
> > > @@ -29,6 +29,7 @@ extern cpumask_var_t node_to_cpumask_map[];
> > >   #ifdef CONFIG_MEMORY_HOTPLUG
> > >   extern unsigned long max_pfn;
> > >   u64 memory_hotplug_max(void);
> > > +u64 hot_add_drconf_memory_max(void);
> > >   #else
> > >   #define memory_hotplug_max() memblock_end_of_DRAM()
> > >   #endif
> > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > > index 3c1da08304d0..603a0f652ba6 100644
> > > --- a/arch/powerpc/mm/numa.c
> > > +++ b/arch/powerpc/mm/numa.c
> > > @@ -1336,7 +1336,7 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
> > >           return nid;
> > >   }
> > > -static u64 hot_add_drconf_memory_max(void)
> > > +u64 hot_add_drconf_memory_max(void)
> > >   {
> > >           struct device_node *memory = NULL;
> > >           struct device_node *dn = NULL;
> > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > index 29f1a0cc59cd..abd9529a8f41 100644
> > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > @@ -1284,17 +1284,13 @@ static LIST_HEAD(failed_ddw_pdn_list);
> > >   static phys_addr_t ddw_memory_hotplug_max(void)
> > >   {
> > > - resource_size_t max_addr = memory_hotplug_max();
> > > - struct device_node *memory;
> > > + resource_size_t max_addr;
> > > - for_each_node_by_type(memory, "memory") {
> > > -         struct resource res;
> > > -
> > > -         if (of_address_to_resource(memory, 0, &res))
> > > -                 continue;
> > > -
> > > -         max_addr = max_t(resource_size_t, max_addr, res.end + 1);
> > > - }
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > + max_addr = hot_add_drconf_memory_max();
> > > +#else
> > > + max_addr = memblock_end_of_DRAM();
> > > +#endif
> > >           return max_addr;
> > >   }
> > > @@ -1600,7 +1596,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > >           if (direct_mapping) {
> > >                   /* DDW maps the whole partition, so enable direct DMA mapping */
> > > -         ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > > +         ret = walk_system_ram_range(0, ddw_memory_hotplug_max() >> PAGE_SHIFT,
> > >                                               win64->value, tce_setrange_multi_pSeriesLP_walk);
> > >                   if (ret) {
> > >                           dev_info(&dev->dev, "failed to map DMA window for %pOF: %d\n",
> > > @@ -2346,11 +2342,17 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > >           struct memory_notify *arg = data;
> > >           int ret = 0;
> > > + /* This notifier can get called when onlining persistent memory as well.
> > > +  * TCEs are not pre-mapped for persistent memory. Persistent memory will
> > > +  * always be above ddw_memory_hotplug_max()
> > > +  */
> > > +
> > >           switch (action) {
> > >           case MEM_GOING_ONLINE:
> > >                   spin_lock(&dma_win_list_lock);
> > >                   list_for_each_entry(window, &dma_win_list, list) {
> > > -                 if (window->direct) {
> > > +                 if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > +                         ddw_memory_hotplug_max()) {
> > >                                   ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
> > >                                                   arg->nr_pages, window->prop);
> > >                           }
> > > @@ -2362,7 +2364,8 @@ static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
> > >           case MEM_OFFLINE:
> > >                   spin_lock(&dma_win_list_lock);
> > >                   list_for_each_entry(window, &dma_win_list, list) {
> > > -                 if (window->direct) {
> > > +                 if (window->direct && (arg->start_pfn << PAGE_SHIFT) <
> > > +                         ddw_memory_hotplug_max()) {
> > >                                   ret |= tce_clearrange_multi_pSeriesLP(arg->start_pfn,
> > >                                                   arg->nr_pages, window->prop);
> > >                           }
> > > 
> > > base-commit: 95ec54a420b8f445e04a7ca0ea8deb72c51fe1d3
> > > -- 
> > > 2.39.3 (Apple Git-146)
> > > 
> > > 
