> -----Original Message-----
> From: Baolu Lu <baolu...@linux.intel.com>
> Sent: Sunday, March 16, 2025 1:33 PM
> To: Borah, Chaitanya Kumar <chaitanya.kumar.bo...@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; intel...@lists.freedesktop.org;
> io...@lists.linux.dev; Kurmi, Suresh Kumar
> <suresh.kumar.ku...@intel.com>; Saarinen, Jani <jani.saari...@intel.com>;
> De Marchi, Lucas <lucas.demar...@intel.com>
> Subject: Re: Regression on drm-tip
>
> On 3/16/25 15:27, Borah, Chaitanya Kumar wrote:
> >
> >> -----Original Message-----
> >> From: Baolu Lu <baolu...@linux.intel.com>
> >> Sent: Sunday, March 16, 2025 8:04 AM
> >> To: Borah, Chaitanya Kumar <chaitanya.kumar.bo...@intel.com>
> >> Cc: intel-gfx@lists.freedesktop.org; intel...@lists.freedesktop.org;
> >> io...@lists.linux.dev
> >> Subject: Re: Regression on drm-tip
> >>
> >> On 3/14/25 17:04, Borah, Chaitanya Kumar wrote:
> >>>
> >>>> -----Original Message-----
> >>>> From: Baolu Lu <baolu...@linux.intel.com>
> >>>> Sent: Thursday, March 13, 2025 7:53 PM
> >>>> To: Borah, Chaitanya Kumar <chaitanya.kumar.bo...@intel.com>
> >>>> Cc: baolu...@linux.intel.com; intel-gfx@lists.freedesktop.org;
> >>>> intel-x...@lists.freedesktop.org; io...@lists.linux.dev
> >>>> Subject: Re: Regression on drm-tip
> >>>>
> >>>> On 2025/3/13 16:51, Borah, Chaitanya Kumar wrote:
> >>>>> Hello Lu,
> >>>>>
> >>>>> Hope you are doing well. I am Chaitanya from the linux graphics
> >>>>> team in Intel.
> >>>>>
> >>>>> This mail is regarding a regression we are seeing in our CI
> >>>>> runs[1] on drm-tip repository.
> >>>>>
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>> <4>[    2.856622] WARNING: possible circular locking dependency detected
> >>>>> <4>[    2.856631] 6.14.0-rc5-CI_DRM_16217-gc55ef90b69d3+ #1 Tainted: G I
> >>>>> <4>[    2.856642] ------------------------------------------------------
> >>>>> <4>[    2.856650] swapper/0/1 is trying to acquire lock:
> >>>>> <4>[    2.856657] ffffffff8360ecc8 (iommu_probe_device_lock){+.+.}-{3:3}, at: iommu_probe_device+0x1d/0x70
> >>>>> <4>[    2.856679]
> >>>>>                   but task is already holding lock:
> >>>>> <4>[    2.856686] ffff888102ab6fa8 (&device->physical_node_lock){+.+.}-{3:3}, at: intel_iommu_init+0xea1/0x1220
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>> A detailed log can be found in [2].
> >>>>>
> >>>>> After bisecting the tree, the following patch [3] seems to be the
> >>>>> first "bad" commit:
> >>>>>
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>> commit b150654f74bf0df8e6a7936d5ec51400d9ec06d8
> >>>>> Author: Lu Baolu <baolu...@linux.intel.com>
> >>>>> Date:   Fri Feb 28 18:27:26 2025 +0800
> >>>>>
> >>>>>     iommu/vt-d: Fix suspicious RCU usage
> >>>>>
> >>>>> ``````````````````````````````````````````````````````````````````
> >>>>>
> >>>>> We also verified that if we revert the patch the issue is not seen.
> >>>>>
> >>>>> Could you please check why the patch causes this regression and
> >>>>> provide a fix if necessary?
> >>>>
> >>>> Can you please take a quick test to check if the following fix works?
> >>>>
> >>>> diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
> >>>> index e540092d664d..06debeaec643 100644
> >>>> --- a/drivers/iommu/intel/dmar.c
> >>>> +++ b/drivers/iommu/intel/dmar.c
> >>>> @@ -2051,8 +2051,13 @@ int enable_drhd_fault_handling(unsigned int cpu)
> >>>>  		if (iommu->irq || iommu->node != cpu_to_node(cpu))
> >>>>  			continue;
> >>>>
> >>>> +		/*
> >>>> +		 * Call dmar_alloc_hwirq() with dmar_global_lock held,
> >>>> +		 * could cause possible lock race condition.
> >>>> +		 */
> >>>> +		up_read(&dmar_global_lock);
> >>>>  		ret = dmar_set_interrupt(iommu);
> >>>> -
> >>>> +		down_read(&dmar_global_lock);
> >>>>  		if (ret) {
> >>>>  			pr_err("DRHD %Lx: failed to enable fault, interrupt, ret %d\n",
> >>>>  			       (unsigned long long)drhd->reg_base_addr, ret);
> >>>>
> >>>> Thanks,
> >>>> baolu
> >>>
> >>> We still see the issue with this change.
> >>
> >> I am attempting to reproduce this issue with my MTL machine. I pulled
> >> the test branch from:
> >>
> >> https://anongit.freedesktop.org/git/drm-tip.git
> >>
> >> and built the test kernel image using the configuration file from:
> >>
> >> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_16217/kconfig.txt
> >>
> >> But I did not observe the lockdep splat mentioned above after booting.
> >>
> >> Is there anything I might have missed?
> >>
> > +Suresh, Jani, Lucas
> >
> > We are seeing this only on Skylake and Kaby Lake in our CI runs.
>
> If so, will the change below make any difference?
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 85aa66ef4d61..ec2f385ae25b 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -3049,6 +3049,7 @@ static int __init probe_acpi_namespace_devices(void)
>  			if (dev->bus != &acpi_bus_type)
>  				continue;
>
> +			up_read(&dmar_global_lock);
>  			adev = to_acpi_device(dev);
>  			mutex_lock(&adev->physical_node_lock);
>  			list_for_each_entry(pn,
> @@ -3058,6 +3059,7 @@ static int __init probe_acpi_namespace_devices(void)
>  					break;
>  			}
>  			mutex_unlock(&adev->physical_node_lock);
> +			down_read(&dmar_global_lock);
>
>  			if (ret)
>  				return ret;
>
Thank you for the change. This seems to be working. Can we expect a fix patch soon?

Regards
Chaitanya

> Thanks,
> baolu