Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-30 Thread Felix Kuehling
On 2024-08-29 18:16, Philip Yang wrote:
>
> On 2024-08-29 17:15, Felix Kuehling wrote:
>> On 2024-08-23 15:49, Philip Yang wrote:
>>> If a GPU reset kicks in while the KFD restore_process_worker is running, this may
>>> cause different issues, for example the below rcu stall warning, because
>>> restore work

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Philip Yang
On 2024-08-29 17:15, Felix Kuehling wrote: On 2024-08-23 15:49, Philip Yang wrote: If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the below rcu stall warning,

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Felix Kuehling
On 2024-08-23 15:49, Philip Yang wrote: If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the below rcu stall warning, because the restore work may move BOs and evict queues under VRAM pressure. Fix this race by taking the adev reset_domain read sem

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-29 Thread Philip Yang
On 2024-08-28 18:01, Felix Kuehling wrote: On 2024-08-23 15:49, Philip Yang wrote: If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the below rcu stall warnin

Re: [PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-28 Thread Felix Kuehling
On 2024-08-23 15:49, Philip Yang wrote: If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the below rcu stall warning, because the restore work may move BOs and evict queues under VRAM pressure. Fix this race by taking the adev reset_domain read s

[PATCH] drm/amdkfd: restore_process_worker race with GPU reset

2024-08-23 Thread Philip Yang
If a GPU reset kicks in while the KFD restore_process_worker is running, this may cause different issues, for example the below rcu stall warning, because the restore work may move BOs and evict queues under VRAM pressure. Fix this race by taking the adev reset_domain read semaphore to prevent GPU reset in restore_pro