On 28.12.24 at 07:32, Shuai Xue wrote:
It has been observed that most GPU jobs use less than one full server,
typically with each GPU being used by an independent job. If a job consumes
poisoned data, a SIGBUS signal is sent to terminate it. Meanwhile, the
gpu_recovery parameter is set to -1 by def
Because freeing the pt is delayed, the BO that was supposed to be freed gets
reused, which causes an unexpected page fault and then a call to
svm_range_restore_pages.
Details below:
1. The following code wants to free the pt, but the pt is not freed
immediately; the free is deferred with "schedule_work(&vm->pt_free_work);".
[ 92.276838] Call T