Am 23.05.24 um 17:35 schrieb Li, Yunxiang (Teddy):
[Public]
Here is taking a different lock than the reset_domain->sem. It is a seperate
reset_domain->gpu_sem that is only locked when we will actuall do reset, it is not
taken in the skip_hw_reset path.
Exactly that is what you should *not* do. Please don't add any new lock to the
code. This is already complicated enough.
If you think that adding wrappers for reset lock makes sense then we can
probably do that, bot don't add any lock for hw access.
The two lock protects very different things though. The first case is we need
to block two resets running in parallel,
No, that's not correct. Two parallel resets are prevent by using a
worker queue for the resets.
The reset lock is here exactly to provide external thread the
opportunity to make their operation mutual exclusive with the reset.
this does not only cover GPU reset itself but also any teardown that happens
before GPU reset. The second case is we need to ensure exclusive access to the
GPU between GPU reset and GPU init, concurrent access is fine before GPU is
reset.
If that is true you could in theory lower the locked area of the
existing lock, but adding a new one is strict no-go from my side.
Regards,
Christian.
Theoretically, the second case happens within the first case, so locking the first
case would protect against both. But with the current implementation this is
infeasible, all the generic functions called between
amdgpu_device_lock/unlock_reset_domain would need to be swapped out for special
versions so the reset thread does not dead lock itself. It is much simpler to have a
second, much narrower lock that only covers GPU reset<->GPU init because all
the accesses there are very low level anyway.
Teddy