[AMD Official Use Only - AMD Internal Distribution Only] -----Original Message----- From: Zhou1, Tao <tao.zh...@amd.com> Sent: Tuesday, April 29, 2025 6:08 PM To: Xie, Patrick <gangliang....@amd.com>; amd-gfx@lists.freedesktop.org Subject: RE: [PATCH] Add support for leagcy records in eeprom format V3
[AMD Official Use Only - AMD Internal Distribution Only] > -----Original Message----- > From: Xie, Patrick <gangliang....@amd.com> > Sent: Tuesday, April 29, 2025 5:04 PM > To: Zhou1, Tao <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org > Subject: RE: [PATCH] Add support for leagcy records in eeprom format > V3 > > [AMD Official Use Only - AMD Internal Distribution Only] > > -----Original Message----- > From: Zhou1, Tao <tao.zh...@amd.com> > Sent: Tuesday, April 29, 2025 4:53 PM > To: Xie, Patrick <gangliang....@amd.com>; > amd-gfx@lists.freedesktop.org > Subject: RE: [PATCH] Add support for leagcy records in eeprom format > V3 > > [AMD Official Use Only - AMD Internal Distribution Only] > > > -----Original Message----- > > From: Xie, Patrick <gangliang....@amd.com> > > Sent: Tuesday, April 29, 2025 3:01 PM > > To: amd-gfx@lists.freedesktop.org > > Cc: Zhou1, Tao <tao.zh...@amd.com>; Xie, Patrick > > <gangliang....@amd.com> > > Subject: [PATCH] Add support for leagcy records in eeprom format V3 > > > > After eeprom records format upgrades to V3, records that have > > 'address == 0' should be supported in NPS1 > > > > Signed-off-by: ganglxie <gangl...@amd.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 73 > > ++++++++++++++++--------- > > 1 file changed, 48 insertions(+), 25 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > index e85143acf3a2..4f4e12452ecc 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > > @@ -2881,6 +2881,17 @@ static int > > __amdgpu_ras_convert_rec_from_rom(struct > > amdgpu_device *adev, > > save_nps = (bps->retired_page >> UMC_NPS_SHIFT) & UMC_NPS_MASK; > > bps->retired_page &= ~(UMC_NPS_MASK << UMC_NPS_SHIFT); > > > > + /* if address== 0, the rec must be a PA rec under NPS1 */ > > + if (bps->address == 0) { > > + if (nps == AMDGPU_NPS1_PARTITION_MODE) { > > + if (amdgpu_umc_pages_in_a_row(adev, err_data, > > + bps->retired_page << > > AMDGPU_GPU_PAGE_SHIFT)) > > + return -EINVAL; > > + goto out; > > + } else { > > + return -EOPNOTSUPP; > > + } > > + } > > [Tao] better to add an empty line here. > [Patrick] will add this empty line > > > if (save_nps == nps) { > > [Tao] is possible that both "bps->address == 0" and "(save_nps == nps " are > true? > [Patrick]Yes, all of the "address==0" records have save_nps=NPS1, so > when loading the records in NPS1, save_nps==nps. But it does not matter, I > check "address==0" > ahead of "save_nps==nps" , the code won't go to "save_nps==nps"'s > checking at any case. [Tao] so how about this if structure: If (save_nps == nps) { ... } else { if (bps->address) { ... } else { ... } } [Patrick] Yes, it a better logic. Based on this logic, I think we can even abort checking "address==0", keep this function as before. "address==0" records can be already parsed correctly in "if (save_nps==nps)" branch, and we do not have to avoid "failed to query address 0x0" warning, it does no harm to code logic. After this, can I change the commit message to "Refine records counting and parsing in eeprom V3"? > > if (amdgpu_umc_pages_in_a_row(adev, err_data, > > bps->retired_page << > > AMDGPU_GPU_PAGE_SHIFT)) @@ -2889,6 +2900,8 @@ static int > > __amdgpu_ras_convert_rec_from_rom(struct amdgpu_device *adev, > > if (amdgpu_ras_mca2pa_by_idx(adev, bps, err_data)) > > return -EINVAL; > > } > > + > > +out: > > return __amdgpu_ras_restore_bad_pages(adev, > > err_data->err_addr, > > > > adev- > > >umc.retire_unit); > > } > > @@ -2903,7 +2916,7 @@ int amdgpu_ras_add_bad_pages(struct > > amdgpu_device *adev, > > &adev->psp.ras_context.ras->eeprom_control; > > enum amdgpu_memory_partition nps = AMDGPU_NPS1_PARTITION_MODE; > > int ret = 0; > > - uint32_t i; > > + uint32_t i = 0; > > > > if (!con || !con->eh_data || !bps || pages <= 0) > > return 0; > > @@ -2924,28 +2937,31 @@ int amdgpu_ras_add_bad_pages(struct > > amdgpu_device *adev, > > mutex_lock(&con->recovery_lock); > > > > if (from_rom) { > > - for (i = 0; i < pages; i++) { > > - if (control->ras_num_recs - i >= > > adev->umc.retire_unit) { > > - if ((bps[i].address == bps[i + 1].address) && > > - (bps[i].mem_channel == bps[i + > > 1].mem_channel)) { > > - //deal with retire_unit records a time > > - ret = > > __amdgpu_ras_convert_rec_array_from_rom(adev, > > - > > &bps[i], > > &err_data, nps); > > - if (ret) > > - goto free; > > - i += (adev->umc.retire_unit - 1); > > + /* there is no pa recs in V3, so skip pa recs processing */ > > + if (control->tbl_hdr.version < RAS_TABLE_VER_V3) { > > + for (i = 0; i < pages; i++) { > > + if (control->ras_num_recs - i >= > > + adev->umc.retire_unit) > > { > > + if ((bps[i].address == bps[i + > > 1].address) && > > + (bps[i].mem_channel == > > + bps[i + > > 1].mem_channel)) { > > + /* deal with retire_unit > > records a time */ > > + ret = > > __amdgpu_ras_convert_rec_array_from_rom(adev, > > + > > + &bps[i], > > &err_data, nps); > > + if (ret) > > + > > + control->ras_num_bad_pages - > > = adev->umc.retire_unit; > > + i += (adev->umc.retire_unit - > > 1); > > + } else { > > + break; > > + } > > } else { > > break; > > } > > - } else { > > - break; > > } > > } > > for (; i < pages; i++) { > > ret = __amdgpu_ras_convert_rec_from_rom(adev, > > &bps[i], &err_data, nps); > > if (ret) > > - goto free; > > + control->ras_num_bad_pages -= adev- > > >umc.retire_unit; > > } > > } else { > > ret = __amdgpu_ras_restore_bad_pages(adev, bps, > > pages); @@ - > > 3040,21 +3056,28 @@ static int amdgpu_ras_load_bad_pages(struct > > amdgpu_device *adev) > > dev_err(adev->dev, "Failed to load EEPROM table records!"); > > } else { > > if (adev->umc.ras && adev->umc.ras->convert_ras_err_addr) { > > - for (i = 0; i < control->ras_num_recs; i++) { > > - if ((control->ras_num_recs - i) >= adev- > > >umc.retire_unit) { > > - if ((bps[i].address == bps[i + > > 1].address) && > > - (bps[i].mem_channel == bps[i + > > 1].mem_channel)) { > > - control->ras_num_pa_recs += > > adev- > > >umc.retire_unit; > > - i += (adev->umc.retire_unit - > > 1); > > + /*In V3, there is no pa recs, and some > > + cases(when > > address==0) may be parsed > > + as pa recs, so add verion check to avoid it. > > + */ > > + if (control->tbl_hdr.version < RAS_TABLE_VER_V3) { > > + for (i = 0; i < control->ras_num_recs; i++) { > > + if ((control->ras_num_recs - > > + i) > > + >= adev- > > >umc.retire_unit) { > > + if ((bps[i].address == > > + bps[i + > > 1].address) && > > + > > + (bps[i].mem_channel == bps[i + > > 1].mem_channel)) { > > + > > + control->ras_num_pa_recs += > > adev->umc.retire_unit; > > + i += > > (adev->umc.retire_unit - 1); > > + } else { > > + > > + control->ras_num_mca_recs += > > + > > + (control- > > >ras_num_recs - i); > > + break; > > + } > > } else { > > - control->ras_num_mca_recs += > > - > > (control- > > >ras_num_recs - i); > > + > > + control->ras_num_mca_recs += > > (control->ras_num_recs - i); > > break; > > } > > - } else { > > - control->ras_num_mca_recs += (control- > > >ras_num_recs - i); > > - break; > > } > > + } else { > > + control->ras_num_mca_recs = control- > > >ras_num_recs; > > } > > } > > > > -- > > 2.34.1 > >