Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
On 2023-08-11 17:27, Chen, Xiaogang wrote: On 8/11/2023 4:22 PM, Felix Kuehling wrote: On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, after which there was no response and no ssh access. This patch lets the prange list update task yield c

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
-Remove others, continue discussing internally On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, after which there was no response and no ssh access. This patch lets the prange list update task yield the CPU after each range update. It can prevent

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
One checkpoint: I saw they use a serial port for the console via the kernel parameter console=ttyS0,115200n8. * Booting Linux using a console connection that is too slow to keep up with the boot-time console-message rate. For example, a 115Kbaud serial console can be way too slow to keep up w

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
If you have a complete kernel log, it may be worth looking at backtraces from other threads, to better understand the interactions. I'd expect that there is a thread there that's in an RCU read critical section. It may not be in our driver, though. If it's a customer system, it may also help to

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
On 8/11/2023 4:22 PM, Felix Kuehling wrote: On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, after which there was no response and no ssh access. This patch lets the prange list update task yield the CPU after each range update. It can prevent t

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
On 2023-08-11 17:12, Chen, Xiaogang wrote: I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, after which there was no response and no ssh access. This patch lets the prange list update task yield the CPU after each range update. It can prevent the task from holding the mm lock too long. Calling schedul

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
I don't understand why this loop is causing a stall. These stall warnings indicate that there is an RCU grace period that's not making progress. That means there must be an RCU read critical section that's being blocked. But there is no RCU read critical section in the svm_range_set_attr function.

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Chen, Xiaogang
I know the original Jira ticket. The system got an RCU CPU stall, then the kernel panicked, after which there was no response and no ssh access. This patch lets the prange list update task yield the CPU after each range update. It can prevent the task from holding the mm lock too long. The mm lock is an rw_semaphore, not an RCU mechanism. Can you explain

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
On 2023-08-11 16:06, Felix Kuehling wrote: On 2023-08-11 15:11, James Zhu wrote: update_list could be big in list_for_each_entry(prange, &update_list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on the CPU in this case. RIP: 0010:svm_rang

Re: [PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread Felix Kuehling
On 2023-08-11 15:11, James Zhu wrote: update_list could be big in list_for_each_entry(prange, &update_list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on the CPU in this case. RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgp

[PATCH] drm/amdkfd: add schedule to remove RCU stall on CPU

2023-08-11 Thread James Zhu
update_list could be big in list_for_each_entry(prange, &update_list, update_list), and mmap_read_lock(mm) is held the whole time; adding schedule() can remove the RCU stall on the CPU in this case. RIP: 0010:svm_range_cpu_invalidate_pagetables+0x317/0x610 [amdgpu] Code: 00 00 00 bf 00 02 00 00 48 81 c2