Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Kuehling, Felix
On 2019-08-29 1:21 p.m., Grodzovsky, Andrey wrote: > On 8/29/19 12:18 PM, Kuehling, Felix wrote: >> On 2019-08-29 10:08 a.m., Grodzovsky, Andrey wrote: >>> Agree, the placement of amdgpu_amdkfd_pre/post_reset in >>> amdgpu_device_lock/unlock_adev is a bit weird. >>> >> amdgpu_device_reset_sriov al

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Grodzovsky, Andrey
On 8/29/19 12:18 PM, Kuehling, Felix wrote: > On 2019-08-29 10:08 a.m., Grodzovsky, Andrey wrote: >> Agree, the placement of amdgpu_amdkfd_pre/post_reset in >> amdgpu_device_lock/unlock_adev is a bit weird. >> > amdgpu_device_reset_sriov already calls amdgpu_amdkfd_pre/post_reset > itself while i

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Kuehling, Felix
On 2019-08-29 10:08 a.m., Grodzovsky, Andrey wrote: > > Agree, the placement of amdgpu_amdkfd_pre/post_reset in > amdgpu_device_lock/unlock_adev is a bit weird. > amdgpu_device_reset_sriov already calls amdgpu_amdkfd_pre/post_reset itself while it has exclusive access to the GPU. It would make s

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Grodzovsky, Andrey
>> Grodzovsky, Andrey ; Zhang, Hawking >> >> Subject: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS. >> >> Problem: >> Under certain conditions, when some IP bocks take a RAS error, we can get > [Tao] typo: "dmr/amdgpu" -> "drm/amdgpu", "

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Grodzovsky, Andrey
Agree, the placement of amdgpu_amdkfd_pre/post_reset in amdgpu_device_lock/unlock_adev is a bit weird. Andrey On 8/29/19 10:06 AM, Koenig, Christian wrote: Felix advised that the way to stop all KFD activity is simply to NOT call amdgpu_amdkfd_post_reset so that's why I added this. Do you mean y

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Koenig, Christian
On 29.08.19 at 16:03, Grodzovsky, Andrey wrote: > On 8/29/19 3:30 AM, Christian König wrote: >> On 28.08.19 at 22:00, Andrey Grodzovsky wrote: >>> Problem: >>> Under certain conditions, when some IP bocks take a RAS error, >>> we can get into a situation where a GPU reset is not possible >>> due

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Grodzovsky, Andrey
On 8/29/19 3:30 AM, Christian König wrote: > On 28.08.19 at 22:00, Andrey Grodzovsky wrote: >> Problem: >> Under certain conditions, when some IP bocks take a RAS error, >> we can get into a situation where a GPU reset is not possible >> due to issues in RAS in SMU/PSP. >> >> Temporary fix until

RE: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Zhou1, Tao
> -----Original Message----- > From: amd-gfx On Behalf Of > Andrey Grodzovsky > Sent: August 29, 2019 4:00 > To: amd-gfx@lists.freedesktop.org > Cc: alexdeuc...@gmail.com; ckoenig.leichtzumer...@gmail.com; > Grodzovsky, Andrey ; Zhang, Hawking > > Subject: [PATCH 1/2] dmr

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-29 Thread Christian König
On 28.08.19 at 22:00, Andrey Grodzovsky wrote: Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-28 Thread Grodzovsky, Andrey
On 8/28/19 5:18 PM, Kuehling, Felix wrote: > On 2019-08-28 4:00 p.m., Andrey Grodzovsky wrote: >> Problem: >> Under certain conditions, when some IP bocks take a RAS error, >> we can get into a situation where a GPU reset is not possible >> due to issues in RAS in SMU/PSP. >> >> Temporary fix unti

Re: [PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-28 Thread Kuehling, Felix
On 2019-08-28 4:00 p.m., Andrey Grodzovsky wrote: > Problem: > Under certain conditions, when some IP bocks take a RAS error, > we can get into a situation where a GPU reset is not possible > due to issues in RAS in SMU/PSP. > > Temporary fix until proper solution in PSP/SMU is ready: > When uncor

[PATCH 1/2] dmr/amdgpu: Avoid HW GPU reset for RAS.

2019-08-28 Thread Andrey Grodzovsky
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast err