RAS error will be triggered if the HW identified a faulty physical page, the error comes through an interrupt where the data payload will have information that can be translated into the bad page address, we then as a recovery measure reset the ASIC, reserve this bad page so it cannot be used by anyone else and store it's address to EEPROM for future reservation after boot.
Andrey On 9/12/19 11:15 AM, Christian König wrote: > Well I still hope to avoid VRAM lost in most of the cases, but that is > really not guaranteed. > > What is the bad page and why do you need to reserve it? > > Christian. > > Am 12.09.19 um 16:09 schrieb Grodzovsky, Andrey: >> I am not sure VRAM loss happens every time, but when it does I would >> assume you would have to reserve them again as the page tables content >> was lost. On the other hand I do remember we keep shadow system memory >> copies of all page tables so maybe that not an issue, so yes, just try >> to allocate the bad page after reset and if it's still reserved you will >> fail. >> >> Andrey >> >> On 9/12/19 7:35 AM, Zhou1, Tao wrote: >>> Hi Andrey: >>> >>> Are you sure of the VRAM content loss after gpu reset? I'm not very >>> familiar with the detail of gpu reset and I'll do experiment to >>> confirm the case you mentioned. >>> >>> Regards, >>> Tao >>> >>>> -----Original Message----- >>>> From: Grodzovsky, Andrey <andrey.grodzov...@amd.com> >>>> Sent: 2019年9月12日 10:32 >>>> To: Chen, Guchun <guchun.c...@amd.com>; Zhou1, Tao >>>> <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org >>>> Cc: Deucher, Alexander <alexander.deuc...@amd.com> >>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context. >>>> >>>> That not what I meant. Let's say you handled one bad page interrupt >>>> and as >>>> a result have one bad page reserved. Now unrelated gfx ring timeout >>>> happens which triggers GPU reset and VRAM loss. When you come back >>>> from >>>> reset amdgpu_ras_reserve_bad_pages will be called but since >>>> last_reserved >>>> == data_count the bad page will not be reserved again, maybe we >>>> should just >>>> set data->last_reserved to 0 again if VRAM was lost during ASIC >>>> reset... >>>> >>>> Andrey >>>> >>>> ________________________________________ >>>> From: Chen, Guchun <guchun.c...@amd.com> >>>> Sent: 11 September 2019 21:53:03 >>>> To: Grodzovsky, Andrey; Zhou1, Tao; amd-gfx@lists.freedesktop.org >>>> Cc: Deucher, Alexander >>>> Subject: RE: [PATCH] drm/amdgpu: Fix mutex lock from atomic context. >>>> >>>> Comment inline. >>>> >>>> Regards, >>>> Guchun >>>> >>>> -----Original Message----- >>>> From: Grodzovsky, Andrey <andrey.grodzov...@amd.com> >>>> Sent: Wednesday, September 11, 2019 10:41 PM >>>> To: Zhou1, Tao <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org >>>> Cc: Chen, Guchun <guchun.c...@amd.com>; Deucher, Alexander >>>> <alexander.deuc...@amd.com> >>>> Subject: Re: [PATCH] drm/amdgpu: Fix mutex lock from atomic context. >>>> >>>> On second though this will break what about reserving bad pages when >>>> resetting GPU for non RAS error reason such as manual reset ,S3 or >>>> ring >>>> timeout, (amdgpu_ras_resume->amdgpu_ras_reset_gpu) so i will keep the >>>> code as is. >>>> >>>> Another possible issue in existing code - looks like no reservation >>>> will take >>>> place in those case even now as amdgpu_ras_reserve_bad_pages >>>> data->last_reserved will be equal to data->count , no ? Looks like for >>>> this case you need to add flag to FORCE reservation for all pages from >>>> 0 to data->counnt. >>>> [Guchun]Yes, last_reserved is not updated any more, unless we >>>> unload our >>>> driver. So it maybe always equal to data->count, then no new bad >>>> page will >>>> be reserved. >>>> I see we have one eeprom reset by user, can we put this >>>> last_reserved clean >>>> operation to user in the same stack as well? >>>> >>>> Andrey >>>> >>>> On 9/11/19 10:19 AM, Andrey Grodzovsky wrote: >>>>> I like this much more, I will relocate to >>>>> amdgpu_umc_process_ras_data_cb an push. >>>>> >>>>> Andrey >>>>> >>>>> On 9/10/19 11:08 PM, Zhou1, Tao wrote: >>>>>> amdgpu_ras_reserve_bad_pages is only used by umc block, so another >>>>>> approach is to move it into amdgpu_umc_process_ras_data_cb. >>>>>> Anyway, either way is OK and the patch is: >>>>>> >>>>>> Reviewed-by: Tao Zhou <tao.zh...@amd.com> >>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Andrey Grodzovsky <andrey.grodzov...@amd.com> >>>>>>> Sent: 2019年9月11日 3:41 >>>>>>> To: amd-gfx@lists.freedesktop.org >>>>>>> Cc: Chen, Guchun <guchun.c...@amd.com>; Zhou1, Tao >>>>>>> <tao.zh...@amd.com>; Deucher, Alexander >>>> <alexander.deuc...@amd.com>; >>>>>>> Grodzovsky, Andrey <andrey.grodzov...@amd.com> >>>>>>> Subject: [PATCH] drm/amdgpu: Fix mutex lock from atomic context. >>>>>>> >>>>>>> Problem: >>>>>>> amdgpu_ras_reserve_bad_pages was moved to amdgpu_ras_reset_gpu >>>>>>> because writing to EEPROM during ASIC reset was unstable. >>>>>>> But for ERREVENT_ATHUB_INTERRUPT amdgpu_ras_reset_gpu is called >>>>>>> directly from ISR context and so locking is not allowed. Also it's >>>>>>> irrelevant for this partilcular interrupt as this is generic RAS >>>>>>> interrupt and not memory errors specific. >>>>>>> >>>>>>> Fix: >>>>>>> Avoid calling amdgpu_ras_reserve_bad_pages if not in task context. >>>>>>> >>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4 +++- >>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h >>>>>>> index 012034d..dd5da3c 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h >>>>>>> @@ -504,7 +504,9 @@ static inline int amdgpu_ras_reset_gpu(struct >>>>>>> amdgpu_device *adev, >>>>>>> /* save bad page to eeprom before gpu reset, >>>>>>> * i2c may be unstable in gpu reset >>>>>>> */ >>>>>>> - amdgpu_ras_reserve_bad_pages(adev); >>>>>>> + if (in_task()) >>>>>>> + amdgpu_ras_reserve_bad_pages(adev); >>>>>>> + >>>>>>> if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) >>>>>>> schedule_work(&ras->recovery_work); >>>>>>> return 0; >>>>>>> -- >>>>>>> 2.7.4 >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx