Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()

2025-01-06 Thread Gerry Liu
> On Jan 5, 2025, at 13:22, Shuo Liu wrote: > > Hi Gerry, > > On Sun 5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote: >> Fix possible resource leakage on error recovery path in function >> kgd2kfd_device_init(). >> >> Signed-off-by: Jiang Liu >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 + >

Re: [PATCH v2 3/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-07 Thread Gerry Liu
> On Jan 6, 2025, at 14:51, Lazar, Lijo wrote: > > > > On 1/5/2025 8:15 AM, Jiang Liu wrote: >> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific >> drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used >> to do error recovery. >> >> Signed-off-by: Jiang Liu >> --- >>

Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 06:53, Chen, Xiaogang wrote: > > > > On 1/4/2025 8:45 PM, Jiang Liu wrote: >> Fix possible resource leakage on error recovery path in function >> kgd2kfd_device_init(). >> >> Signed-off-by: Jiang Liu >> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_devi

Re: [PATCH v2 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 06:55, Chen, Xiaogang wrote: > > > > On 1/4/2025 8:45 PM, Jiang Liu wrote: >> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use >> after free bug related to amdgpu_driver_release_kms() as: >> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference,

Re: [v1 2/4] drm/amdgpu: accumulate gfx_off_entrycount in smu_suspend()

2025-02-07 Thread Gerry Liu
> On Feb 7, 2025, at 16:04, Lazar, Lijo wrote: > > > > On 2/7/2025 12:14 PM, Jiang Liu wrote: >> As pwfw resets entrycount when device is suspended, we should >> accumulate the gfx_off_entrycount value instead of saving its last >> value. >> >> Signed-off-by: Jiang Liu >> --- >> drivers/gpu/drm/a

Re: [v1 2/4] drm/amdgpu: accumulate gfx_off_entrycount in smu_suspend()

2025-02-07 Thread Gerry Liu
> On Feb 7, 2025, at 16:34, Lazar, Lijo wrote: > > > > On 2/7/2025 2:00 PM, Gerry Liu wrote: >> >> >>> On Feb 7, 2025, at 16:04, Lazar, Lijo wrote: >>> >>> >>> >>> On 2/7/2025 12:14 PM, Jiang Liu wrote: >>>> As pwfw resets ent

Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 18:02, Lazar, Lijo wrote: > > > > On 1/8/2025 2:26 PM, Jiang Liu wrote: >> If an error happens before amdgpu_fence_driver_hw_init() gets called during >> device probe, it will trigger a false warning in amdgpu_irq_put() as >> below: >> [ 1209.300996] [ cut here ]

Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 17:31, Lazar, Lijo wrote: > > > > On 1/8/2025 2:26 PM, Jiang Liu wrote: >> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific >> drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used >> to do error recovery. >> >> Signed-off-by: Jiang Liu >> --- >>

Re:

2025-01-08 Thread Gerry Liu
> On Jan 9, 2025, at 00:33, Mario Limonciello wrote: > > On 1/8/2025 07:59, Jiang Liu wrote: >> Subject: [RFC PATCH 00/13] Enhance device state machine to better support >> suspend/resume > > I'm not sure how this happened, but your subject didn't end up in the subject > of the thread on patch 0 so the t

Re: [v3 6/6] drm/amdgpu: get rid of false warnings caused by amdgpu_irq_put()

2025-01-09 Thread Gerry Liu
> On Jan 8, 2025, at 17:05, Christian König wrote: > > On 08.01.25 at 09:56, Jiang Liu wrote: >> If an error happens before amdgpu_fence_driver_hw_init() gets called during >> device probe, it will trigger a false warning in amdgpu_irq_put() as >> below: >> [ 1209.300996] [ cut here ]

Re: [v4 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-09 Thread Gerry Liu
> On Jan 10, 2025, at 14:51, Christian König wrote: > > On 10.01.25 at 03:08, Jiang Liu wrote: >> Function detects initialization status by checking sched->ops, so set >> sched->ops to non-NULL just before return in function >> amdgpu_fence_driver_sw_fini() and amdgpu_device_init_schedulers() >> to avoid

Re: [RFC PATCH 03/13] drm/amdgpu: add a flag to track ras debugfs creation status

2025-01-09 Thread Gerry Liu
> On Jan 9, 2025, at 01:19, Mario Limonciello wrote: > > On 1/8/2025 07:59, Jiang Liu wrote: >> Add a flag to track ras debugfs creation status, to avoid possible >> incorrect reference count management for ras block object in function >> amdgpu_ras_aca_is_supported(). > > Rather than taking a marker posi

Re: [v5 5/5] drm/amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-23 Thread Gerry Liu
> On Jan 20, 2025, at 17:04, Christian König wrote: > > On 17.01.25 at 08:55, Jiang Liu wrote: >> Introduce amdgpu_device_fini_schedulers() to clean scheduler related >> resources, and avoid possible invalid memory access. >> >> Signed-off-by: Jiang Liu >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_devic

Re: [RFC v2 10/15] drm/admgpu: make device state machine work in stack like way

2025-01-13 Thread Gerry Liu
> On Jan 14, 2025, at 06:27, Mario Limonciello wrote: > > On 1/12/2025 19:42, Jiang Liu wrote: >> Make the device state machine work in a stack-like way to better support >> suspend/resume with the following changes: >> 1. amdgpu_driver_load_kms() >> amdgpu_device_init() >> amdgpu_device_ip_early

Re: [v4 1/5] drm/amdgpu: clear adev->in_suspend flag when fails to suspend

2025-01-13 Thread Gerry Liu
> On Jan 11, 2025, at 02:57, Mario Limonciello wrote: > > On 1/9/2025 20:08, Jiang Liu wrote: >> Clear adev->in_suspend flag when suspend fails, otherwise it will >> cause too many warnings like: >> [ 1802.212027] [ cut here ] >> [ 1802.212028] WARNING: CPU: 97 PID: 11282 at >>

Re:

2025-01-12 Thread Gerry Liu
text or HTML by holding shift when you hit "reply all" > > For my reply I'll convert my reply to plain text, please see inline below. > > On 1/8/2025 23:34, Gerry Liu wrote: >>> On Jan 9, 2025, at 00:33, Mario Limonciello >> <mailto:mario.limoncie...@amd.com>

Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs

2025-01-14 Thread Gerry Liu
> On Jan 14, 2025, at 18:46, Christian König wrote: > > Hi Jiang, > > Some of the firmware, especially the multimedia ones, keep FW pointers to > buffers in the suspend/resume state. > > In other words the firmware needs to be in the exact same location before and > after resume. That's why we don't un

Re: [PATCH 5/6] amdgpu: fix invalid memory access in amdgpu_fence_driver_sw_fini()

2025-01-03 Thread Gerry Liu
> On Jan 3, 2025, at 07:09, Chen, Xiaogang wrote: > > > On 1/1/2025 11:36 PM, Jiang Liu wrote: >> Function detects initialization status by checking sched->ops, so set >> sched->ops to non-NULL just before return in function drm_sched_init() >> to avoid possible invalid memory access on the error recovery path.

Re: [PATCH 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-03 Thread Gerry Liu
> On Jan 3, 2025, at 13:58, Chen, Xiaogang wrote: > > > > On 1/1/2025 11:36 PM, Jiang Liu wrote: >> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use >> after free bug related to amdgpu_driver_release_kms() as: >> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference,

Re: [PATCH 2/6] amdgpu: fix invalid memory access in kfd_cleanup_nodes()

2025-01-03 Thread Gerry Liu
> On Jan 3, 2025, at 13:44, Chen, Xiaogang wrote: > > > > On 1/2/2025 8:22 PM, Gerry Liu wrote: >> >> >>> On Jan 3, 2025, at 07:08, Chen, Xiaogang >> <mailto:xiaogang.c...@amd.com>> wrote: >>> >>> >>> >>> On 1/1/2025 11

Re: [PATCH 2/6] amdgpu: fix invalid memory access in kfd_cleanup_nodes()

2025-01-03 Thread Gerry Liu
> On Jan 3, 2025, at 07:08, Chen, Xiaogang wrote: > > > > On 1/1/2025 11:36 PM, Jiang Liu wrote: >> On the error recovery path during device probe, it may trigger invalid >> memory access as below: >> 024-12-25 12:00:53 [ 2703.773040] general protection fault, probably for >> non-canonical address 0x52445f474

Re: [PATCH 2/6] amdgpu: fix invalid memory access in kfd_cleanup_nodes()

2025-01-03 Thread Gerry Liu
> On Jan 3, 2025, at 14:19, Chen, Xiaogang wrote: > > > > On 1/2/2025 11:55 PM, Gerry Liu wrote: >> >> >>> On Jan 3, 2025, at 13:44, Chen, Xiaogang >> <mailto:xiaogang.c...@amd.com>> wrote: >>> >>> >>> >>>

Re: [PATCH 4/6] amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-04 Thread Gerry Liu
> On Jan 4, 2025, at 01:34, Chen, Xiaogang wrote: > > > On 1/3/2025 1:43 AM, Shuo Liu wrote: >> On Fri 3.Jan'25 at 15:02:38 +0800, Gerry Liu wrote: >>> >>> >>>> On Jan 3, 2025, at 13:58, Chen, Xiaogang wrote: >>>> >>>> >>>

Re: [PATCH v2 1/6] amdgpu: fix possible resource leakage in kfd_cleanup_nodes()

2025-01-05 Thread Gerry Liu
> On Jan 5, 2025, at 13:22, Shuo Liu wrote: > > Hi Gerry, > > On Sun 5.Jan'25 at 10:45:29 +0800, Jiang Liu wrote: >> Fix possible resource leakage on error recovery path in function >> kgd2kfd_device_init(). >> >> Signed-off-by: Jiang Liu >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 9 + >

Re: [PATCH 2/6] amdgpu: fix invalid memory access in kfd_cleanup_nodes()

2025-01-04 Thread Gerry Liu
> On Jan 4, 2025, at 01:33, Chen, Xiaogang wrote: > > > > On 1/3/2025 1:05 AM, Gerry Liu wrote: >> >> >>> On Jan 3, 2025, at 14:19, Chen, Xiaogang >> <mailto:xiaogang.c...@amd.com>> wrote: >>> >>> >>> >>>

Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs

2025-01-14 Thread Gerry Liu
mented hypervisor? Will the hypervisor need to cooperate with the gim driver to enable resume with different vGPUs? Regards Gerry > > Regards > Shaoyun.liu > > From: amd-gfx On Behalf Of Christian > König > Sent: Tuesday, January 14, 2025 7:44 AM > To: Gerry Li

Re: [RFC v1 0/2] Enable resume with different AMD SRIOV vGPUs

2025-01-14 Thread Gerry Liu
rt info. Yeah, there are different usage scenarios: 1) live migration 2) hibernate/suspend/resume 3) snapshot and clone. Currently we are focusing on live migration and hibernation, and hope that we can build both on common underlying technologies. > > Regards > Shaoyun.liu > > -O

Re: [v3 2/6] drm/amdxcp: introduce new API amdgpu_xcp_drm_dev_free()

2025-01-08 Thread Gerry Liu
> On Jan 9, 2025, at 00:04, Mario Limonciello wrote: > > On 1/8/2025 02:56, Jiang Liu wrote: >> Introduce new interface amdgpu_xcp_drm_dev_free() to free a specific >> drm_device created by amdgpu_xcp_drm_dev_alloc(), which will be used >> to do error recovery. >> Signed-off-by: Jiang Liu >> --- >> driv

Re: [v3 3/6] drm/amdgpu: fix use after free bug related to amdgpu_driver_release_kms()

2025-01-08 Thread Gerry Liu
> On Jan 8, 2025, at 17:54, Lazar, Lijo wrote: > > > > On 1/8/2025 2:26 PM, Jiang Liu wrote: >> If some GPU device failed to probe, `rmmod amdgpu` will trigger a use >> after free bug related to amdgpu_driver_release_kms() as: >> 2024-12-26 16:17:45 [16002.085540] BUG: kernel NULL pointer dereference, >>

Re: [v3 4/6] drm/amdgpu: enhance error handling in function amdgpu_pci_probe()

2025-01-08 Thread Gerry Liu
> On Jan 9, 2025, at 00:08, Mario Limonciello wrote: > > On 1/8/2025 02:56, Jiang Liu wrote: >> Enhance error handling in function amdgpu_pci_probe() to avoid >> possible resource leakage. >> Signed-off-by: Jiang Liu >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 12 +--- >> 1 file changed, 9