ta if invoke node buffer
| ta type --|
| ta id --|
| cmd id --|
|-- shared buf len -|
|-- shared buffer --|
The ta if invoke node buffer is laid out as above; copy the shared buffer
data to the correct location.
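A minimal sketch of reading such a buffer, assuming every header field is a 32-bit integer in host byte order (the snippet above does not state field widths). The struct and function names are hypothetical and are not taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header mirroring the layout above; field widths are assumed. */
struct ta_invoke_hdr {
    uint32_t ta_type;
    uint32_t ta_id;
    uint32_t cmd_id;
    uint32_t shared_buf_len;
};

/*
 * Read the header from `buf`, then copy the shared buffer that follows it
 * into `dst`. Returns 0 on success, -1 if the input is too short or `dst`
 * cannot hold shared_buf_len bytes.
 */
int ta_invoke_parse(const uint8_t *buf, size_t len,
                    struct ta_invoke_hdr *hdr,
                    uint8_t *dst, size_t dst_len)
{
    if (len < sizeof(*hdr))
        return -1;
    memcpy(hdr, buf, sizeof(*hdr));

    /* Reject a declared length that runs past the input or the destination. */
    if (hdr->shared_buf_len > len - sizeof(*hdr) ||
        hdr->shared_buf_len > dst_len)
        return -1;

    memcpy(dst, buf + sizeof(*hdr), hdr->shared_buf_len);
    return 0;
}
```

The length is validated before the copy so that a malformed shared buf len cannot cause an out-of-bounds read or write.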
Signed-off-by: Stanley.Yang
---
drive
Check amdgpu_ras_mask to fix the ras_mask setting being ineffective
on special asics that support poison but do not have sram ecc
enabled.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/
Don't modify the amdgpu gpu recover get operation;
add an amdgpu gpu recover set operation to select the
reset method. Only mode1 and mode2 are currently supported.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/d
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 80b9642f2bc4..5f5bf0c26b1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/a
GFX v9.4.4 uses mode1 reset to handle poison consumption.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 6 --
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
b/drivers/gpu/drm/amd/amdkfd/kfd_in
The eeprom table is empty before initializing,
add a get eeprom table version function according to the
UMC HWIP version before initializing the eeprom table.
Signed-off-by: Stanley.Yang
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 19 ++-
1 file changed, 18 insertions(+), 1 deletion(-
The eeprom table is empty before initializing,
set eeprom table version first before initializing.
Changed from V1:
Reuse amdgpu_ras_set_eeprom_table_version function
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
1 file changed, 3 insertions(+)
The way to get ras capability has changed for some asics,
both of them need to check the number of XGMI physical nodes to
set the XGMI WAFL ras enable bit.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +++---
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a
The ecc_irq is disabled during the GPU mode2 reset suspend process,
but is not re-enabled during the GPU mode2 reset resume process.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/aldebaran.c | 6
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 +
drivers/gpu/drm/a
The ecc_irq is disabled during the GPU mode2 reset suspend process,
but is not re-enabled during the GPU mode2 reset resume process.
Changed from V1:
only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip,
delete amdgpu_ras_late_resume function.
Signed-off-by: Stanley.Yang
---
driv
The ecc_irq is disabled during the GPU mode2 reset suspend process,
but is not re-enabled during the GPU mode2 reset resume process.
Changed from V1:
only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip
delete amdgpu_ras_late_resume function
Changed from V2:
check umc ras s
For the special asic with mem ecc enabled but sram ecc
not enabled, even if the ras block is not supported on
.ras_enabled, if the asic supports poison mode and the
ras block has ras configuration, it can be considered
that the ras block supports ras function only when sram
ecc is not enabled, othe
Show deferred error count for UMC sysfs node
Signed-off-by: Stanley.Yang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 ++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gp
Why:
The PCI error slot reset may be triggered after injecting ue to UMC
multiple times, which caused a system hang.
[ 557.371857] amdgpu :af:00.0: amdgpu: GPU reset succeeded, trying to
resume
[ 557.373718] [drm] PCIE GART of 512M enabled.
[ 557.373722] [drm] PTB located at 0x00
The high three bits of the ras features mask indicate the socket
id, so the high three bits of the ras features mask should be
skipped when checking it before disabling all ras features.
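A minimal sketch of separating the socket id from the feature bits, assuming the mask is 32 bits wide so that the high three bits are bits 29 to 31 (the snippet does not state the width). The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 32-bit mask, socket id stored in bits 31..29. */
#define RAS_SOCKET_ID_SHIFT   29u
#define RAS_FEATURE_BITS_MASK ((1u << RAS_SOCKET_ID_SHIFT) - 1u)

/* Extract the socket id carried in the high three bits. */
uint32_t ras_socket_id(uint32_t mask)
{
    return mask >> RAS_SOCKET_ID_SHIFT;
}

/* Report whether any real feature bit is set, ignoring the socket id. */
int ras_any_feature_enabled(uint32_t mask)
{
    return (mask & RAS_FEATURE_BITS_MASK) != 0;
}
```

Without the masking step, a non-zero socket id alone would make the mask look as if features were still enabled.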
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 ++-
drivers/gpu/drm/amd/amd
Test scenario:
modprobe amdgpu -> rmmod amdgpu -> modprobe amdgpu
Error log:
[ 54.396807] debugfs: File 'page_pool' in directory 'amdttm' already
present!
[ 54.396833] debugfs: File 'page_pool_shrink' in directory 'amdttm'
already present!
[ 54.396848] debugfs: File 'buffer_
update smu driver if version to avoid mismatch log
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
b/drivers/gpu
update smu driver if version to avoid mismatch log
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
b/drivers/gpu
update smu driver if and version to avoid mismatch log
v2:
update smu driver interface
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
.../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
update smu driver if version to 0x08 to avoid mismatch log
A version mismatch can still happen with an older FW
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
.../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
drivers/gpu/drm/amd/pm/inc/
add message smu to query error information
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 4 +
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 161
3 files changed, 181 insertions(+)
diff --git a/
support ECC TABLE message, this table includes umc ras error count
and error address
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 7
.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38 +++
.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 2
if smu supports ECCTABLE, the driver can message smu to get ecc_table
and then query umc error info from ECCTABLE
apply a pmfw version check to ensure backward compatibility
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 42 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras
update smu driver if version to 0x08 to avoid mismatch log
A version mismatch can still happen with an older FW
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
.../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
drivers/gpu/drm/amd/pm/inc/
add message smu to query error information
v2:
rename message_smu to ecc_info
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 4 +
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 161
3 files c
support ECC TABLE message, this table includes umc ras error count
and error address
v2:
add smu version check to query whether ecctable is supported
call smu_cmn_update_table to get ecctable directly
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 8 +++
driv
if smu supports ECCTABLE, the driver can message smu to get ecc_table
and then query umc error info from ECCTABLE
v2:
optimize source code to make the logic more reasonable
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 42 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_umc
Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so the ras ta is unloaded before the ras disable command is sent; the ras
disable operation must happen before hw fini.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
v2:
still need to call ras_disable_all_features to handle
the ras initialization failure case.
Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so the ras ta is unloaded before the ras disable command is sent; the ras
disable operation must happen before hw fini.
Signed-off-by: Stanley.Yang
---
{
[ 578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin!
[ 583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu features.
[ 583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features!
[ 583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
Reason:
{
[ 578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin!
[ 583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu features.
[ 583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features!
[ 583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
this is a workaround for the failure to get ecc info during gpu recovery
[ 700.236122] amdgpu :09:00.0: amdgpu: Failed to export SMU ecc table!
[ 700.236128] amdgpu :09:00.0: amdgpu: GPU reset begin!
[ 704.331171] amdgpu: qcm fence wait loop timeout expired
[ 704.331194] amdgpu: The cp migh
skip getting ecc info for aldebaran by checking the ip version,
so other asic types are not affected
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgp
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..e0a8224e466f 100644
-
remove the in recovery state check, skip umc ras err cnt
harvest in amdgpu_ras_log_on_err_counter
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 15 ++-
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/dri
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 0e16683876aa..d9d292c79cfa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +-
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index ec3ebc33ee03..8fdf355d7de8 100644
--- a/drivers/gpu/drm/a
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 22 ++
1 file changed, 22 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index daf63a4c1fff..dfeaa57dd7ea 100644
--- a/drivers/gpu/drm/amd/
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 10 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 ++-
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/dr
Changed from v1:
remove unused brace
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 -
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 ++-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers
Change-Id: Ic2a488ee253a913d806bd33ee9c90e31a71af320
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 23 ---
drivers/gpu/drm/amd/amdgpu/umc_v8_7.c | 6 --
2 files changed, 29 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
b/drive
Pmfw reads ecc info registers and stores the values in
eccinfo_table in the following order
umc0 ch_inst 0, 1, 2 ... 7
umc1 ch_inst 0, 1, 2 ... 7
...
umc3 ch_inst 0, 1, 2 ... 7
Driver should convert eccinfo_table_idx into channel_index according
to channel_idx_tbl.
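A minimal sketch of that conversion, assuming eight channels per umc (from the "ch_inst 0, 1, 2 ... 7" listing) and a two-dimensional lookup table indexed by umc instance and channel instance. The function name, the `tbl` parameter, and any table contents are hypothetical; the real table values are not shown in this snippet.

```c
#include <stddef.h>
#include <stdint.h>

#define CHANNELS_PER_UMC 8u /* taken from the "ch_inst 0, 1, 2 ... 7" listing */

/*
 * Split a linear eccinfo_table_idx into (umc_inst, ch_inst) using the
 * storage order described above, then look up the channel index.
 * Returns -1 when the index falls outside the table.
 */
int eccinfo_idx_to_channel(const uint32_t tbl[][CHANNELS_PER_UMC],
                           size_t umc_count, uint32_t eccinfo_table_idx)
{
    uint32_t umc_inst = eccinfo_table_idx / CHANNELS_PER_UMC;
    uint32_t ch_inst = eccinfo_table_idx % CHANNELS_PER_UMC;

    if (umc_inst >= umc_count)
        return -1;
    return (int)tbl[umc_inst][ch_inst];
}
```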
Change-Id: Icafe93e458912b729d2e30d6
Pmfw reads ecc info registers in the following order,
umc0: ch_inst 0, 1, 2 ... 7
umc1: ch_inst 0, 1, 2 ... 7
The position of the register value stored in eccinfo
table is calculated according to the below formula,
channel_index = umc_inst * channel_in_umc + ch_inst
Driver directly us
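The formula above can be sketched as a small helper. The function name is hypothetical; `channel_in_umc` would be 8 if each umc has channel instances 0 to 7 as listed.

```c
#include <stdint.h>

/* channel_index = umc_inst * channel_in_umc + ch_inst */
uint32_t eccinfo_channel_index(uint32_t umc_inst, uint32_t channel_in_umc,
                               uint32_t ch_inst)
{
    return umc_inst * channel_in_umc + ch_inst;
}
```

For example, with eight channels per umc, umc1 ch_inst 3 maps to position 1 * 8 + 3 = 11.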
The OOB table error count info should be reset after resetting
the eeprom table
Change-Id: I2a39e0e44b7b1a5ab7d6b4d4b73ebe48264396b7
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras
the UMC_STATUS register is not linear, adjust the offset
calculation formula to get the correct address
Change-Id: Ic8926078301848330babf289c4238dc8cbcf313d
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 7 +++
1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/amd
print more error info on a deferred uncorrectable ras error
changed from V1:
move the Deferred error msg into the query uncorrectable error
count function.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +-
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 72
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 5 +
drivers/gpu/drm/amd/amdgpu/umc_v8_7.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
index bbcccf53080d..e
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index f404c2321a6a..ca5a32944242 100644
--- a/drivers/gpu/drm/amd/amdgp
From: John Clements
support umc ras function initialization for aldebaran
Change-Id: I84155d4d3eaae86a8c1bd2331b1964946c47f6da
Signed-off-by: John Clements
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 13 +
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 15
GCEA/MMHUB EA error should not result in DF freeze, this is
fixed in next generation, but for some reasons the GCEA/MMHUB
EA error will result in DF freeze in previous generation,
driver should avoid indicating GCEA/MMHUB EA error as hw fatal
error in kernel message by read GCEA/MMHUB err status re
mmHDP_READ_CACHE_INVALIDATE register is in HDP not in NBIO
Signed-off-by: Stanley.Yang
Change-Id: I4375a8a67d3a13f9605479e169169e22dd5833d1
---
drivers/gpu/drm/amd/amdgpu/nv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/am
cause:
It is necessary to send the ras disable command to ras-ta to program
GB_EDC_MODE to "BYPASS" mode during gfx block ras late init,
because the ras capability read from vbios is disabled for vega20
gaming, but the ras context is released during the ras init process,
cause:
It is necessary to send the ras disable command to ras-ta during gfx
block ras late init, because the ras capability read from vbios
is disabled for vega20 gaming, but the ras context is released
during the ras init process, this will cause send ras disable comman
Signed-off-by: Stanley.Yang
Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 164 +
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 30 +++-
3 files changed, 196 insertions(
Changed from V1:
rename some function names, only init ras error handler data for
supported asic.
Signed-off-by: Stanley.Yang
Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c |
Changed from V1:
rename some function names, only init ras error handler data for
supported asic.
Changed from V2:
fix potential memory leak.
Signed-off-by: Stanley.Yang
Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
Changed from V1:
rename some function names, only init ras error handler data for
supported asic.
Changed from V2:
fix potential memory leak.
Signed-off-by: Stanley.Yang
Change-Id: Ia0ad9453ac3ac929f95c73cbee5b7a8fc42a9816
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
noretry = 0 causes the KFDGraphicsInterop test to fail on SRIOV platform
for vega10, so set noretry to 1 for vega10.
Signed-off-by: Stanley.Yang
Change-Id: I241da5c20970ea889909997ff044d6e61642da81
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/
The KFDTopologyTest.BasicTest will fail if smc, sdma, sos, ta and asd
fw are skipped in SRIOV for vega10, so adjust the above fw and skip
loading them in SRIOV only for navi12.
Signed-off-by: Stanley.Yang
Change-Id: Id354be93723d7b5d769d73dc67c596af300305af
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
The KFDTopologyTest.BasicTest will fail if smc, sdma, sos, ta and asd
fw are skipped in SRIOV for vega10, so adjust the above fw and skip
loading them in SRIOV only for navi12.
v2: remove unnecessary asic type check.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 3 -
each sdma instance fw_version and feature_version
should be set to the right value when asic type isn't
between SIENNA_CICHLID and CHIP_DIMGREY_CAVEFISH
Signed-off-by: Stanley.Yang
Change-Id: I1edbf3e0557d771eb4c0b686fa5299a3b5f26e35
---
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 2 +-
1 file changed, 1
skip loading smu and sdma fw on sriov because smc, sos,
ta and asd fw have been skipped for SIENNA_CICHLID.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c| 3 +++
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 4 +++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --g
skip loading smu and sdma fw on sriov because sos,
ta and asd fw have been skipped for SIENNA_CICHLID.
V2:
move asic check into smu11
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c | 3 +++
drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 10 --
drivers/
First check whether the block ras obj is set, and
return directly if the block ras obj is not set.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/dr
First check whether the block ras obj is set, and
return 0 directly if the block ras obj or hw_ops is not set.
Changed from V1:
return 0 directly if the block ras obj or hw ops is not set
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +-
1 file
This is a workaround for the ring test failing while ras
does gpu recovery for aqua vanjaram.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_
The amdgpu_ras_get_context may return NULL if the device does
not support the ras feature, so add a check before using it.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/
This is a workaround: the kiq ring test failed in the suspend stage
when doing ras recovery for gfx v9_4_3.
Change-Id: I8de9900aa76706f59bc029d4e9e8438c6e1db8e0
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 21 +
1 file changed, 21 insertions(+)
diff --git a/d
Enable smu_v13_0_6 mca debug mode when GFX RAS feature is enabled
on APU.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
b/drivers/gpu
Enable smu_v13_0_6 mca debug mode if ras is enabled.
Changed from V1:
enable mca debug mode if ras enabled.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/sws
Fix deleting nodes that have already been freed.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 8831859a2c49..867afbf84
Enable RAS feature by default for aqua vanjaram on apu
platform.
Change-Id: I02105d07d169d1356251c994249a134ca5dd2a7a
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/amd/am
Reset error data info stored in vram when the user clears the eeprom table.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 97 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 4 +
3 files changed,
Message SMU bad channel information bitmap to update OOB table
Change-Id: I49a79af64d5263c28db059ecb8b8405a471431b4
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 ++
.../gpu/drm/amd/amdgpu/amdgpu_ras_eep
support message SMU update bad channel info to update HBM bad channel
info in OOB table
Change-Id: I1e50ed8118f4c1aaefb04c040e59ae4918cdc295
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 12 ++
drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 1 +
drivers/gp
The driver should notify SMU to update bad channel info when an
uncorrectable error is detected in the UMC block
Change-Id: I2dc8848affdb53e52891013953ae9383fff5f20f
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++
.
Change-Id: I6afe0332cbb20528648c38665264930d6b091c2f
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 7 ++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 9a892d6d1d7a..89fbee5
Check whether the asic supports smu
before calling the smu powerplay function, otherwise
it may cause a null pointer on an asic without smu support.
Change-Id: Ib86f3d4c88317b23eb1040b9ce1c5c8dcae42488
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 6 ++
1 file changed, 6 insertions(+
Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 62 ++-
1 file changed, 60 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.
In order to debug ras errors, the driver will print the IPID/SYND/MISC0
register values if it detects a correctable or uncorrectable error.
Provide umc_query_error_status_helper function to reduce code
redundancy.
Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec
Signed-off-by: Stanley.Yang
---
drivers/gpu/d
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index 6d2879ac585b..f76b1cb8baf8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgp
XGMI RAS should depend on the gmc xgmi supported flag
and the number of xgmi physical nodes.
Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8
1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd
Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 2 ++
2 files changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
b/drivers/gpu/drm/amd/amdg
Aldebaran supports VCN and JPEG RAS, it reports an unexpected
block id message during VCN and JPEG RAS initialization if the VCN
and JPEG block ids are not defined.
Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4
drivers/
XGMI RAS should depend on the number of gmc xgmi physical nodes,
XGMI RAS should not be enabled if xgmi num_physical_nodes is zero.
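A minimal sketch of that rule, assuming RAS blocks are tracked as bits in a 32-bit enable mask. The bit position and the function name are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical bit position for the XGMI block in a RAS enable mask. */
#define RAS_BLOCK_XGMI_BIT (1u << 5)

/* Clear the XGMI RAS bit when there are no xgmi physical nodes. */
uint32_t ras_apply_xgmi_policy(uint32_t ras_mask, uint32_t num_physical_nodes)
{
    if (num_physical_nodes == 0)
        ras_mask &= ~RAS_BLOCK_XGMI_BIT;
    return ras_mask;
}
```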
Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++
1 file changed, 7 inser
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c | 4
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c
index 6f9895cdddb1..0ddb6955a6d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c
+++
Support VCN/JPEG instance mask checking, pass logical
mask directly except GFX/SDMA/VCN/JPEG blocks.
Changed from V1:
correct a typo
Signed-off-by: Stanley.Yang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +-
1 file changed, 5 i
pass xcc mask to ras ta, ras ta will compare
the mask with the one from chiplet topology.
Changed from V1:
Remove IP version checking.
Set ras_cmd->ras_init_message.init_flags.xcc_mask
directly because xcc_mask is a structure common to
all the devices.
Signed-off-by:
Changed from V1:
Remove amdgpu_ras_logical_mask_to_physical_mask
due to GET_MASK provides same feature.
Support convert VCN/JPEG logical mask to physical
mask.
Signed-off-by: Stanley.Yang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/
Rename RAS_TABLE_VER to RAS_TABLE_VER_V1,
move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 5 ++---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 2 ++
2 files changed, 4 insertions(+), 3 de
Add setting EEPROM table version interface for umcv8.10,
Add EEPROM table v2.1 to UMC v8.10.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 2 ++
drivers/gpu/drm/amd/amdgpu/umc_v8_10.c | 6 ++
2 files changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdg
Add ras info to EEPROM table, app can analyse device ECC
status without GPU driver through EEPROM table ras info.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 204 --
.../gpu/drm/amd/amdgpu
Set EEPROM ras info: rma status, health percent and bad
page threshold.
Signed-off-by: Stanley.Yang
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 24 +++
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h| 5
2 files changed, 29 insertions(+)
diff --git a/drivers/gpu/drm
It's more reasonable to check EEPROM table ras info bytes.
Signed-off-by: Stanley.Yang
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 19 +++
1 file changed, 19 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ra
Add RAS EEPROM table version 2.1 macro definition.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 1 +
2 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eepro
Use "is_app_apu" to identify whether the device is in native
APU mode or carveout mode.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 34 ++---
3 files cha
Do not compare injection address with mc_vram_size
if mc_vram_size is zero.
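A minimal sketch of such a guard. It assumes that an address is otherwise acceptable only when it lies below mc_vram_size; the snippet does not show the exact comparison, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the injection address passes the vram size check.
 * A zero mc_vram_size means there is nothing to compare against, so the
 * comparison is skipped entirely.
 */
bool ras_inject_addr_in_range(uint64_t addr, uint64_t mc_vram_size)
{
    if (mc_vram_size == 0)
        return true;
    return addr < mc_vram_size; /* assumed bound: addr must be below the size */
}
```

Without the zero check, every address would be rejected on a device that reports no vram size.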
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_r
The function callback handle_poison_consumption and the callback
function poison_consumption_handler handle poison consumption in
almost the same way, remove poison_consumption_handler.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 -
drivers/gpu/drm/amd/amdgpu/am
The function callback handle_poison_consumption and the callback
function poison_consumption_handler handle poison consumption in
almost the same way, remove poison_consumption_handler.
Changed from V1:
Add handle poison consumption function for VCN2.6, VCN4.0,
JPEG2.6 and JPEG4.0, return
[Why]
[ 754.862560] refcount_t: underflow; use-after-free.
[ 754.862898] Call Trace:
[ 754.862903]
[ 754.862913] amdgpu_job_free_cb+0xc2/0xe1 [amdgpu]
[ 754.863543] drm_sched_main.cold+0x34/0x39 [amd_sched]
[How]
The fw_fence may not be initialized, check whether dma_fenc