Re: [PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Lazar, Lijo
[AMD Official Use Only - General] A dynamic partition switch could happen later. The switch could still be successful in terms of hardware, and hence gives a false feeling of success even if there are no render nodes available for any app to make use of the partition. Also, a kfd node is not

Re: [PATCH -next 3/7] drm/msm: Remove unnecessary NULL values

2023-08-11 Thread Abhinav Kumar
On 8/8/2023 8:44 PM, Ruan Jinjie wrote: The NULL initialization of pointers that are then assigned by kzalloc() is unnecessary: if kzalloc() fails, the pointers are assigned NULL; otherwise it works as usual. So remove it. Signed-off-by: Ruan Jinjie --- drivers/gpu/drm/msm/d
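
The cleanup described in this patch can be illustrated with a small userspace sketch. It is not the msm driver code: demo_zalloc() is a hypothetical stand-in for kzalloc() built on calloc(), and struct demo_state is invented for the example. Both variants behave the same, because the allocator's return value overwrites the initializer before the pointer is ever read.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object type, for illustration only. */
struct demo_state {
	int refcount;
};

/* Userspace stand-in for kzalloc(): zeroed memory, or NULL on failure. */
static void *demo_zalloc(size_t size)
{
	assert(size > 0);
	return calloc(1, size);
}

/* Before: the "= NULL" is dead, it is overwritten on the next statement. */
static struct demo_state *demo_create_before(void)
{
	struct demo_state *state = NULL;

	state = demo_zalloc(sizeof(*state));
	if (!state)
		return NULL;

	state->refcount = 1;
	return state;
}

/* After: same behaviour without the redundant initializer. */
static struct demo_state *demo_create_after(void)
{
	struct demo_state *state;

	state = demo_zalloc(sizeof(*state));
	if (!state)
		return NULL;

	state->refcount = 1;
	return state;
}
```

The failure path is unchanged in both versions: the NULL check reads whatever the allocator returned, never the initial value.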

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
On 2023-08-11 17:27, Chen, Xiaogang wrote: On 8/11/2023 4:22 PM, Felix Kuehling wrote: On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, then there was no response, not even over ssh. This patch lets the prange list update task yield c

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
-Remove others, continue discussing internally On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, then there was no response, not even over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
One checkpoint: I saw they use a serial port for the console in the kernel parameters: console=ttyS0,115200n8 * Booting Linux using a console connection that is too slow to keep up with the boot-time console-message rate. For example, a 115Kbaud serial console can be way too slow to keep up w

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
If you have a complete kernel log, it may be worth looking at backtraces from other threads, to better understand the interactions. I'd expect that there is a thread there that's in an RCU read critical section. It may not be in our driver, though. If it's a customer system, it may also help to

Re: [PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Felix Kuehling
On 2023-08-11 17:06, James Zhu wrote: Return 0 when drm device alloc failed with -ENOSPC in order to allow amdgpu driver loading. But the xcp without a drm device node assigned won't be visible in user space. This helps amdgpu driver loading on systems which have more than 64 nodes, the current limi

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
On 8/11/2023 4:22 PM, Felix Kuehling wrote: On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, then there was no response, not even over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent t

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, then there was no response, not even over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent the task from holding the mm lock too long. Calling schedul

[pull] amdgpu, amdkfd, radeon, drm_buddy drm-next-6.6

2023-08-11 Thread Alex Deucher
Hi Dave, Daniel, New stuff for 6.6. The following changes since commit d9aa1da9a8cfb0387eb5703c15bd1f54421460ac: Merge tag 'drm-intel-gt-next-2023-08-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-next (2023-08-07 13:49:25 +1000) are available in the Git repository at: https

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
I don't understand why this loop is causing a stall. These stall warnings indicate that there is an RCU grace period that's not making progress. That means there must be an RCU read-side critical section that's being blocked. But there is no RCU read-side critical section in the svm_range_set_attr function.

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, then there was no response, not even over ssh. This patch lets the prange list update task yield the CPU after each range update. It can prevent the task from holding the mm lock too long. The mm lock is an rw_semaphore, not an RCU mechanism. Can you explain

[PATCH v3] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread James Zhu
Return 0 when drm device alloc failed with -ENOSPC in order to allow amdgpu driver loading. But the xcp without a drm device node assigned won't be visible in user space. This helps amdgpu driver loading on systems which have more than 64 nodes, the current limitation. The proposal to add more drm no
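
The error-handling policy in this commit message can be sketched in userspace C. This is only an illustration under stated assumptions, not the amdgpu code: demo_alloc_node(), demo_init_partitions(), struct demo_partition and the small DEMO_MAX_NODES limit are all hypothetical (the thread mentions a limit of 64). The point it shows is that -ENOSPC is treated as "leave this partition without a node and keep loading", while any other error is still returned to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_NODES 4	/* hypothetical limit for the example */

/* Hypothetical partition record: has_node != 0 means user space can see it. */
struct demo_partition {
	int has_node;
};

static int demo_nodes_used;

/* Hypothetical allocator: fails with -ENOSPC once the node pool is empty. */
static int demo_alloc_node(struct demo_partition *part)
{
	assert(part != NULL);

	if (demo_nodes_used >= DEMO_MAX_NODES)
		return -ENOSPC;

	demo_nodes_used++;
	part->has_node = 1;
	return 0;
}

/* Out-of-nodes is not fatal here; other errors still abort initialization. */
static int demo_init_partitions(struct demo_partition *parts, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		int ret = demo_alloc_node(&parts[i]);

		if (ret == -ENOSPC)
			break;		/* remaining partitions stay hidden */
		if (ret)
			return ret;	/* any other failure is reported */
	}

	return 0;
}
```

With six partitions and four nodes, the call succeeds and the last two partitions are left without a node, which mirrors the "driver loads, but the xcp is not visible in user space" behaviour the message describes.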

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
On 2023-08-11 16:06, Felix Kuehling wrote: On 2023-08-11 15:11, James Zhu wrote: update_list could be big in list_for_each_entry(prange, &update_list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on the CPU in this case. RIP: 0010:svm_rang

Re: [PATCH v2] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread Felix Kuehling
On 2023-08-11 16:23, James Zhu wrote: Return 0 when drm device alloc failed with -ENOSPC in order to allow amdgpu driver loading. But the xcp without a drm device node assigned won't be visible in user space. This helps amdgpu driver loading on systems which have more than 64 nodes, the current limi

[PATCH v2] drm/amdgpu: skip xcp drm device allocation when out of drm resource

2023-08-11 Thread James Zhu
Return 0 when drm device alloc failed with -ENOSPC in order to allow amdgpu driver loading. But the xcp without a drm device node assigned won't be visible in user space. This helps amdgpu driver loading on systems which have more than 64 nodes, the current limitation. The proposal to add more drm no

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
On 2023-08-11 15:11, James Zhu wrote: update_list could be big in list_for_each_entry(prange, &update_list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on the CPU in this case. RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgp

Re: [PATCH v6] drm/doc: Document DRM device reset expectations

2023-08-11 Thread Randy Dunlap
Hi, On 8/11/23 11:55, André Almeida wrote: > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Signed-off-by: André Almeida > > --- > > Changes: > - Due to substantial changes in the content, dropped Pekka's Acked-by > - Grammar fixes

[PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
update_list could be big in list_for_each_entry(prange, &update_list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on the CPU in this case. RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu] Code: 00 00 00 bf 00 02 00 00 48 81 c2

[PATCH v6] drm/doc: Document DRM device reset expectations

2023-08-11 Thread André Almeida
Create a section that specifies how to deal with DRM device resets for kernel and userspace drivers. Signed-off-by: André Almeida --- Changes: - Due to substantial changes in the content, dropped Pekka's Acked-by - Grammar fixes (Randy) - Add paragraph about disabling device resets - Add no

Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-11 Thread Eric Huang
On 2023-08-11 09:26, Felix Kuehling wrote: On 2023-08-10 at 18:27, Eric Huang wrote: No UNMAP_QUEUES command is sent for queue preemption because the queue is suspended and the test is close to the end. Function unmap_queue_cpsch will do nothing after that. How do you suspend queues w

Re: [PATCH] drm/amdgpu: don't allow userspace to create a doorbell BO

2023-08-11 Thread Felix Kuehling
On 2023-08-09 at 15:09, Alex Deucher wrote: We need the domains in amdgpu_drm.h for the kernel driver to manage the pool, but we don't want userspace using it until the code is ready. So reject for now. Signed-off-by: Alex Deucher Acked-by: Felix Kuehling --- drivers/gpu/drm/amd/amdgp

Re: [PATCH V2 1/5] drm/amdkfd: ignore crat by default

2023-08-11 Thread Alex Deucher
On Fri, Aug 11, 2023 at 9:45 AM Jason Gunthorpe wrote: > > On Mon, Aug 07, 2023 at 06:05:41PM -0400, Alex Deucher wrote: > > We are dropping the IOMMUv2 path, so no need to enable this. > > It's often buggy on consumer platforms anyway. > > > > Signed-off-by: Alex Deucher > > --- > > drivers/gpu

Re: [PATCH] drm/amdgpu: don't allow userspace to create a doorbell BO

2023-08-11 Thread Alex Deucher
Ping? On Thu, Aug 10, 2023 at 11:20 AM Alex Deucher wrote: > > Ping? > > On Wed, Aug 9, 2023 at 3:10 PM Alex Deucher wrote: > > > > We need the domains in amdgpu_drm.h for the kernel driver to manage > > the pool, but we don't want userspace using it until the code > > is ready. So reject for n

Re: [PATCH V2 1/5] drm/amdkfd: ignore crat by default

2023-08-11 Thread Jason Gunthorpe
On Mon, Aug 07, 2023 at 06:05:41PM -0400, Alex Deucher wrote: > We are dropping the IOMMUv2 path, so no need to enable this. > It's often buggy on consumer platforms anyway. > > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdkfd/kfd_crat.c | 4 > 1 file changed, 4 deletions(-)

[PATCH] drm/radeon: Use pci_dev_id() to simplify the code

2023-08-11 Thread Zheng Zengkai
PCI core API pci_dev_id() can be used to get the BDF number for a PCI device. We don't need to compose it manually. Use pci_dev_id() to simplify the code a little bit. Signed-off-by: Zheng Zengkai --- drivers/gpu/drm/radeon/radeon_acpi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) di

Re: [PATCH 28/29] drm/amdkfd: Refactor migrate init to support partition switch

2023-08-11 Thread Linux regression tracking #update (Thorsten Leemhuis)
[TLDR: This mail is primarily relevant for Linux kernel regression tracking. See link in footer if these mails annoy you.] On 19.07.23 18:17, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > On 17.07.23 15:09, Michel Dänzer wrote: >> On 5/10/23 23:23, Alex Deucher wrote: >>> From: Ph

Re: [PATCH 9/9] drm/amd: Hide unsupported power attributes

2023-08-11 Thread Alex Deucher
Series is: Reviewed-by: Alex Deucher On Thu, Aug 10, 2023 at 11:40 PM Mario Limonciello wrote: > > Some ASICs only offer one type of power attribute, so in the visible > callback check whether the attributes are supported and hide if not > supported. > > Signed-off-by: Mario Limonciello > --- >

Re: [PATCH] drm/amdkfd: fix address watch clearing bug for gfx v9.4.2

2023-08-11 Thread Felix Kuehling
On 2023-08-10 at 18:27, Eric Huang wrote: No UNMAP_QUEUES command is sent for queue preemption because the queue is suspended and the test is close to the end. Function unmap_queue_cpsch will do nothing after that. How do you suspend queues without sending an UNMAP_QUEUES command? Reg

Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-08-11 Thread Felix Kuehling
On 2023-08-11 at 06:11, Mike Lothian wrote: On Thu, 3 Aug 2023 at 20:43, Felix Kuehling wrote: Is your kernel configured without dynamic debugging? Maybe we need to wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE). Apologies, I thought I'd replied to this, yes I didn't have dynamic d

Re: [PATCH] drm/amdkfd: avoid svm dump when dynamic debug disabled

2023-08-11 Thread Mike Lothian
On Thu, 3 Aug 2023 at 20:43, Felix Kuehling wrote: > > Is your kernel configured without dynamic debugging? Maybe we need to > wrap this in some #if defined(CONFIG_DYNAMIC_DEBUG_CORE). > Apologies, I thought I'd replied to this, yes I didn't have dynamic debugging enabled
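
The "wrap this in some #if defined(...)" idea from the quoted reply can be sketched as a compile-time guard. This is a userspace illustration, not the amdkfd change: DEMO_DYNAMIC_DEBUG is a hypothetical macro standing in for the kernel config symbol, and demo_dump_ranges() is an invented helper. When the macro is not defined, the dump collapses to a stub that does no work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#if defined(DEMO_DYNAMIC_DEBUG)
/* Debug build: walk every entry and print it; returns entries dumped. */
static int demo_dump_ranges(const unsigned long *starts, size_t count)
{
	size_t i;

	assert(starts != NULL || count == 0);
	for (i = 0; i < count; i++)
		printf("range %zu starts at 0x%lx\n", i, starts[i]);

	return (int)count;
}
#else
/* Non-debug build: no walk and no output, so callers pay nothing. */
static inline int demo_dump_ranges(const unsigned long *starts, size_t count)
{
	(void)starts;
	(void)count;
	return 0;
}
#endif
```

Building with -DDEMO_DYNAMIC_DEBUG selects the printing variant; without it, the call sites compile unchanged but the traversal disappears.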

Re: [PATCH V8 2/9] drivers core: add ACPI based WBRF mechanism introduced by AMD

2023-08-11 Thread Simon Horman
On Thu, Aug 10, 2023 at 03:37:56PM +0800, Evan Quan wrote: > AMD has introduced an ACPI based mechanism to support WBRF for some > platforms with AMD dGPU + WLAN. This needs support from BIOS equipped > with necessary AML implementations and dGPU firmwares. > > For those systems without the ACPI m

Re: [PATCH V8 6/9] drm/amd/pm: setup the framework to support Wifi RFI mitigation feature

2023-08-11 Thread Simon Horman
On Thu, Aug 10, 2023 at 03:38:00PM +0800, Evan Quan wrote: > With WBRF feature supported, as a driver responding to the frequencies, > amdgpu driver is able to do shadow pstate switching to mitigate possible > interference (between its (G-)DDR memory clocks and local radio module > frequency bands u

Re: [PATCH] drm/amdgpu: Add memory vendor information

2023-08-11 Thread Lazar, Lijo
On 8/11/2023 12:36 PM, Chen, Guchun wrote: [Public] -----Original Message----- From: amd-gfx On Behalf Of Lijo Lazar Sent: Friday, August 11, 2023 12:12 PM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Zhang, Hawking Subject: [PATCH] drm/amdgpu: Add memory vendor information

RE: [PATCH] drm/amdgpu: Add memory vendor information

2023-08-11 Thread Chen, Guchun
[Public] > -----Original Message----- > From: amd-gfx On Behalf Of Lijo > Lazar > Sent: Friday, August 11, 2023 12:12 PM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Zhang, Hawking > > Subject: [PATCH] drm/amdgpu: Add memory vendor information > > For ASICs with GC v9.4.3, dete