Register aqua vanjaram jpeg poison irq, add jpeg poison handle.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.c | 76
drivers/gpu/drm/amd/amdgpu/jpeg_v4_0_3.h | 7 +++
2 files changed, 83 insertions(+)
diff --git a/d
Register aqua vanjaram vcn poison irq, add vcn poison handle.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 65 +
drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.h | 6 +++
2 files changed, 71 insertions(+)
diff --git a/driv
Update ta ra block to keep sync with RAS TA.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 7 +++
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 +++
3 files changed, 11 insertions(+)
diff --git a/drivers/gpu/drm/amd/am
Add vcn and jpeg error count parsing.
Signed-off-by: Stanley.Yang
Reviewed-by: Yang Wang
---
.../drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 24 +++
1 file changed, 24 insertions(+)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
b/drivers/gpu/drm/amd/pm/sws
The eeprom table is empty before initializing,
set eeprom table version first before initializing.
Changed from V1:
Reuse amdgpu_ras_set_eeprom_table_version function
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
1 file changed, 3 insertions(+)
The eeprom table is empty before initializing,
add a function to get the eeprom table version according
to the UMC HWIP version before initializing the eeprom table.
Signed-off-by: Stanley.Yang
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 19 ++-
1 file changed, 18 insertions(+), 1 deletion(-
GFX v9.4.4 uses mode1 reset to handle poison consumption.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 6 --
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
b/drivers/gpu/drm/amd/amdkfd/kfd_in
The way to get the ras capability has changed for some asics,
both of them need to check the number of XGMI physical nodes to
set the XGMI WAFL ras enable bit.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +++---
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 80b9642f2bc4..5f5bf0c26b1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/a
Don't modify amdgpu gpu recover get operation,
add amdgpu gpu recover set operation to select
reset method, only support mode1 and mode2 currently.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h| 3 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
drivers/gpu/d
Check amdgpu_ras_mask to fix an ineffective ras_mask setting
on special asics that support poison but do not have sram ecc
enabled.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/
ta if invoke node buffer
| ta type --|
| ta id --|
| cmd id --|
|-- shared buf len -|
|-- shared buffer --|
ta if invoke node buffer is as above, copy shared buffer data to correct
location
Signed-off-by: Stanley.Yang
---
drive
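The buffer layout drawn above can be sketched in C as follows. This is only an illustration under stated assumptions: the diagram does not give field widths, so each header field (ta type, ta id, cmd id, shared buf len) is assumed to be a 32-bit word, and the helper name ta_invoke_copy_payload is hypothetical, not the driver's function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: four 32-bit header words precede the shared buffer:
 * ta type, ta id, cmd id, shared buf len. */
#define TA_INVOKE_HDR_WORDS 4u

/* Hypothetical helper: copy the shared buffer that follows the header
 * into dst. Returns the number of bytes copied, or -1 if the buffer is
 * too short or the declared length does not fit. */
static inline long ta_invoke_copy_payload(const uint8_t *buf, size_t buf_len,
                                          uint8_t *dst, size_t dst_len)
{
    uint32_t hdr[TA_INVOKE_HDR_WORDS];
    uint32_t payload_len;

    if (buf_len < sizeof(hdr))
        return -1;
    memcpy(hdr, buf, sizeof(hdr));

    payload_len = hdr[3]; /* shared buf len */
    if (payload_len > buf_len - sizeof(hdr) || payload_len > dst_len)
        return -1;

    /* The payload starts right after the four header words. */
    memcpy(dst, buf + sizeof(hdr), payload_len);
    return (long)payload_len;
}
```

The point of the sketch is that the copy source is offset by the whole header, and that the declared length is validated against both buffers before copying.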
The high three bits of the ras features mask indicate the socket
id, so the high three bits should be skipped when checking the ras
features mask before disabling all ras features.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 11 ++-
drivers/gpu/drm/amd/amd
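As a rough illustration of the masking described in this patch, the sketch below strips the socket id before testing whether any feature bit is set. The 32-bit mask width, the macro names and the helper names are assumptions made for the example; the patch text only states that the high three bits carry the socket id.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the features mask is 32 bits wide, so the socket id
 * occupies bits 31:29. */
#define SOCKET_ID_SHIFT 29u
#define SOCKET_ID_MASK  (UINT32_C(0x7) << SOCKET_ID_SHIFT)

/* Feature bits only, with the socket id stripped. */
static inline uint32_t ras_feature_bits(uint32_t features)
{
    return features & ~SOCKET_ID_MASK;
}

/* Socket id stored in the high three bits. */
static inline uint32_t ras_socket_id(uint32_t features)
{
    return (features & SOCKET_ID_MASK) >> SOCKET_ID_SHIFT;
}

/* True if any real feature bit is still set; the socket id alone
 * must not make the mask look non-empty. */
static inline bool ras_any_feature_enabled(uint32_t features)
{
    return ras_feature_bits(features) != 0;
}
```

With these helpers, a mask of 0xE0000000 (socket id 7, no features) is treated as having no features enabled.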
Why:
The PCI error slot reset may be triggered after injecting UE to UMC
multiple times, which caused a system hang.
[ 557.371857] amdgpu :af:00.0: amdgpu: GPU reset succeeded, trying to
resume
[ 557.373718] [drm] PCIE GART of 512M enabled.
[ 557.373722] [drm] PTB located at 0x00
Show deferred error count for UMC sysfs node
Signed-off-by: Stanley.Yang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 ++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gp
For the special asic with mem ecc enabled but sram ecc
not enabled, even if the ras block is not supported on
.ras_enabled, if the asic supports poison mode and the
ras block has ras configuration, it can be considered
that the ras block supports ras function only with sram
ecc is not enabled, othe
The ecc_irq is disabled during the GPU mode2 reset suspend process,
but is not re-enabled during the GPU mode2 reset resume process.
Changed from V1:
only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip
delete amdgpu_ras_late_resume function
Changed from V2:
check umc ras s
The ecc_irq is disabled during the GPU mode2 reset suspend process,
but is not re-enabled during the GPU mode2 reset resume process.
Changed from V1:
only do sdma/gfx ras_late_init in aldebaran_mode2_restore_ip,
delete amdgpu_ras_late_resume function.
Signed-off-by: Stanley.Yang
---
driv
The ecc_irq is disabled during the GPU mode2 reset suspend process,
but is not re-enabled during the GPU mode2 reset resume process.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/aldebaran.c | 6
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 37 +
drivers/gpu/drm/a
Reset error data info stored in vram when the user clears the eeprom table.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 97 ++-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 4 +
3 files changed,
Enable RAS feature by default for aqua vanjaram on apu
platform.
Change-Id: I02105d07d169d1356251c994249a134ca5dd2a7a
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 ++
1 file changed, 2 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/amd/am
Fix deletion of nodes that have already been freed.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 8831859a2c49..867afbf84
Enable smu_v13_0_6 mca debug mode if ras is enabled.
Changed from V1:
enable mca debug mode if ras enabled.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/sws
Enable smu_v13_0_6 mca debug mode when GFX RAS feature is enabled
on APU.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_6_ppt.c
b/drivers/gpu
This is a workaround: the kiq ring test failed in the suspend stage when doing ras
recovery for gfx v9_4_3.
Change-Id: I8de9900aa76706f59bc029d4e9e8438c6e1db8e0
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 21 +
1 file changed, 21 insertions(+)
diff --git a/d
amdgpu_ras_get_context may return NULL if the device
does not support the ras feature, so add a check before using it.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/
This is a workaround for the ring test failing while ras
does gpu recovery for aqua vanjaram.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c
b/drivers/gpu/drm/amd/amdgpu/gfx_v9_
First check whether the block ras obj is set, and
return 0 directly if the block ras obj or hw_ops is not set.
Changed from V1:
return 0 directly if block ras obj or hw ops is not set
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 +-
1 file
First check whether the block ras obj is set, and
return directly if the block ras obj is not set.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/dr
Only disable RAS by default for aqua vanjaram on APU platform.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 22
The xcc index should refer to xcc_mask; convert xcc_mask
to a count, then calculate the device instance.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 24 +---
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/
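The mask-to-count conversion mentioned in this patch amounts to counting the set bits in xcc_mask. A minimal, portable sketch follows; the helper name xcc_count_from_mask and the 16-bit mask width are assumptions for the example, and the later "calculate device instance" step is not shown because the snippet does not describe it.

```c
#include <stdint.h>

/* Hypothetical helper: number of XCCs present, i.e. the number of set
 * bits in xcc_mask. The 16-bit width is an assumption. */
static inline unsigned int xcc_count_from_mask(uint16_t xcc_mask)
{
    unsigned int count = 0;

    while (xcc_mask) {
        /* Clear the lowest set bit on each iteration. */
        xcc_mask &= (uint16_t)(xcc_mask - 1);
        count++;
    }
    return count;
}
```

In kernel code the same result would normally come from an existing population-count helper rather than an open-coded loop; the loop is used here only to keep the sketch self-contained.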
Disable RAS feature by default for aqua vanjaram on APU platform.
Changed from V1:
Split Disable RAS by default on APU platform into a
separate patch.
Changed from V2:
Avoid modifying global variable amdgpu_ras_enable.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Z
Enable RAS for aqua vanjaram.
Changed from V1:
Split the change in amdgpu_ras_asic_supported into a
separate patch.
Changed from V2:
Avoid modifying global variable amdgpu_ras_enable.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgp
Disable RAS feature by default for aqua vanjaram on APU platform.
Changed from V1:
Split Disable RAS by default on APU platform into a
separate patch.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 +
1 file chan
Enable RAS for aqua vanjaram.
Changed from V1:
Split the change in amdgpu_ras_asic_supported into a
separate patch.
Signed-off-by: Stanley.Yang
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/gp
The callback handle_poison_consumption and the callback
poison_consumption_handler handle poison consumption in almost
the same way, so remove poison_consumption_handler.
Changed from V1:
Add handle poison consumption function for VCN2.6, VCN4.0,
JPEG2.6 and JPEG4.0, return
The callback handle_poison_consumption and the callback
poison_consumption_handler handle poison consumption in almost
the same way, so remove poison_consumption_handler.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 -
drivers/gpu/drm/amd/amdgpu/am
Do not compare injection address with mc_vram_size
if mc_vram_size is zero.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_r
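The guard described in this patch can be sketched as a small predicate. The function name and the use of `>=` for the range comparison are assumptions for illustration; the snippet only states that the comparison must be skipped when mc_vram_size is zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: report whether an injection address should be
 * rejected as out of range. When mc_vram_size is zero there is no
 * meaningful bound, so the comparison is skipped. */
static inline bool inject_addr_out_of_range(uint64_t addr,
                                            uint64_t mc_vram_size)
{
    if (mc_vram_size == 0)
        return false;

    return addr >= mc_vram_size;
}
```

Without the zero check, every address would compare as out of range on a device that reports a VRAM size of zero.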
Using "is_app_apu" to identify device in the native
APU mode or carveout mode.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8 +++---
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 34 ++---
3 files cha
Add RAS EEPROM table version 2.1 macro definition.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 1 +
2 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eepro
Set EEPROM ras info: rma status, health percent and bad
page threshold.
Signed-off-by: Stanley.Yang
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 24 +++
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h| 5
2 files changed, 29 insertions(+)
diff --git a/drivers/gpu/drm
It's more reasonable to check EEPROM table ras info bytes.
Signed-off-by: Stanley.Yang
---
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 19 +++
1 file changed, 19 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ra
Add ras info to EEPROM table, app can analyse device ECC
status without GPU driver through EEPROM table ras info.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
.../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 204 --
.../gpu/drm/amd/amdgpu
Add setting EEPROM table version interface for umcv8.10,
Add EEPROM table v2.1 to UMC v8.10.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 2 ++
drivers/gpu/drm/amd/amdgpu/umc_v8_10.c | 6 ++
2 files changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdg
Rename RAS_TABLE_VER to RAS_TABLE_VER_V1,
move RAS_TABLE_VER_V1 from amdgpu_ras_eeprom.c to amdgpu_ras_eeprom.h.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 5 ++---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h | 2 ++
2 files changed, 4 insertions(+), 3 de
Changed from V1:
Remove amdgpu_ras_logical_mask_to_physical_mask
because GET_MASK provides the same feature.
Support converting the VCN/JPEG logical mask to a physical
mask.
Signed-off-by: Stanley.Yang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/
Pass the xcc mask to ras ta; ras ta will compare
the mask with the one from the chiplet topology.
Changed from V1:
Remove IP version checking.
Set ras_cmd->ras_init_message.init_flags.xcc_mask
directly because xcc_mask is a common structure for
all the devices.
Signed-off-by:
Support VCN/JPEG instance mask checking, pass logical
mask directly except GFX/SDMA/VCN/JPEG blocks.
Changed from V1:
correct a typo
Signed-off-by: Stanley.Yang
Reviewed-by: Tao Zhou
Reviewed-by: Hawking Zhang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 +-
1 file changed, 5 i
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c | 4
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c
index 6f9895cdddb1..0ddb6955a6d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4.c
+++
XGMI RAS should follow the gmc xgmi physical nodes number;
XGMI RAS should not be enabled if xgmi num_physical_nodes is zero.
Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++
1 file changed, 7 inser
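A minimal sketch of the rule stated in this patch: the XGMI RAS bit is cleared from an enable mask when there are no XGMI physical nodes. The bit position and the helper name are hypothetical and chosen only for the example.

```c
#include <stdint.h>

/* Hypothetical bit position of the XGMI WAFL block in a RAS enable mask. */
#define XGMI_WAFL_RAS_BIT 7u

/* Hypothetical helper: drop the XGMI RAS bit when the device has no
 * XGMI physical nodes, leave the mask untouched otherwise. */
static inline uint32_t ras_mask_apply_xgmi(uint32_t ras_mask,
                                           uint32_t num_physical_nodes)
{
    if (num_physical_nodes == 0)
        ras_mask &= ~(UINT32_C(1) << XGMI_WAFL_RAS_BIT);

    return ras_mask;
}
```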
Aldebaran supports VCN and JPEG RAS, but it reports an unexpected
block id message during VCN and JPEG RAS initialization if the VCN
and JPEG block ids are not defined.
Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4
drivers/
Change-Id: Icceb43556eec802f11c2077c1c58a1e92c9df599
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 2 ++
2 files changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
b/drivers/gpu/drm/amd/amdg
XGMI RAS should follow the gmc xgmi supported flag
and the xgmi physical nodes number.
Change-Id: Idf3600b30584b10b528e7237d103d84d5097b7e0
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 8
1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/amd
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 6 ++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index 6d2879ac585b..f76b1cb8baf8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgp
[Why]
[ 754.862560] refcount_t: underflow; use-after-free.
[ 754.862898] Call Trace:
[ 754.862903]
[ 754.862913] amdgpu_job_free_cb+0xc2/0xe1 [amdgpu]
[ 754.863543] drm_sched_main.cold+0x34/0x39 [amd_sched]
[How]
The fw_fence may be not init, check whether dma_fenc
The VCN and JPEG ras are supported, so add VCN and JPEG ras block id.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 4
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 2 ++
2 files changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
b/driver
The EccInfo_t struct in driver_if.h is as below in official release
version 68.55.0
typedef struct {
uint64_t mca_umc_status;
uint64_t mca_umc_addr;
uint16_t ce_count_lo_chip;
uint16_t ce_count_hi_chip;
uint32_t eccPadding;
uint64_t mca_ceumc_addr;
} EccInfo_t;
It's different
Fix the aldebaran ras supported check on the SRIOV guest side;
the previous check condition blocked all ras features
on baremetal
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdg
Changed from V1:
remove unnecessary same row physical address calculation
Changed from V2:
move record_ce_addr_supported to umc_ecc_info struct
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 5 ++
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c |
SMU adds a new variable mca_ceumc_addr to record
the umc correctable error address in the EccInfo table;
the driver side adds EccInfo_V2_t to support this feature
Changed from V1:
remove ecc_table_v2 and unnecessary table id, define a union struct
that includes EccInfo_t and EccInfo_V2_t.
Changed from
Changed from V1:
remove unnecessary same row physical address calculation
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 ++
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 52 ++-
.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 1 +
3
SMU adds a new variable mca_ceumc_addr to record
the umc correctable error address in the EccInfo table;
the driver side adds EccInfo_V2_t to support this feature
Changed from V1:
remove ecc_table_v2 and unnecessary table id, define a union struct
that includes EccInfo_t and EccInfo_V2_t.
Signed-off-b
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 ++
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 55 ++-
.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 2 +
3 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/
SMU adds a new variable mca_ceumc_addr to record
the umc correctable error address in the EccInfo table;
the driver side adds ecctable_v2 to support this feature
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 1 +
drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 2 +
.../in
support umc/gfx/sdma ras on guest side
Changed from V1:
move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler
Change-Id: Ic7dda45d8f8cf2d5f1abc7705abc153d558da8a1
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 +++
drivers/gpu/drm/amd/amdgpu/amdgpu
support umc/gfx/sdma ras on guest side
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4
drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c| 4
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 23 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.c |
To help debug ras errors, the driver prints the IPID/SYND/MISC0
register values if it detects a correctable or uncorrectable error.
Provide the umc_query_error_status_helper function to reduce code
redundancy.
Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec
Signed-off-by: Stanley.Yang
---
drivers/gpu/d
Change-Id: I09a2aae85cde3ab2cb6b042b973da6839ad024ec
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 62 ++-
1 file changed, 60 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
b/drivers/gpu/drm/amd/amdgpu/umc_v6_7.
Check whether the asic supports smu
before calling smu powerplay functions; otherwise
it may cause a null pointer dereference on asics without smu support.
Change-Id: Ib86f3d4c88317b23eb1040b9ce1c5c8dcae42488
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 6 ++
1 file changed, 6 insertions(+
Change-Id: I6afe0332cbb20528648c38665264930d6b091c2f
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 7 ++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index 9a892d6d1d7a..89fbee5
The driver should notify SMU to update bad channel info when an
uncorrectable error is detected in the UMC block
Change-Id: I2dc8848affdb53e52891013953ae9383fff5f20f
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 +++
.
Support the SMU update bad channel info message to update HBM bad channel
info in the OOB table
Change-Id: I1e50ed8118f4c1aaefb04c040e59ae4918cdc295
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 12 ++
drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 1 +
drivers/gp
Message SMU bad channel information bitmap to update OOB table
Change-Id: I49a79af64d5263c28db059ecb8b8405a471431b4
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 3 ++
.../gpu/drm/amd/amdgpu/amdgpu_ras_eep
Print more error info on deferred uncorrectable ras errors
Changed from V1:
move Deferred error msg into query uncorrectable error
count function.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 +-
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 72
The UMC_STATUS register is not linear; adjust the offset
calculation formula to get the correct address
Change-Id: Ic8926078301848330babf289c4238dc8cbcf313d
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 7 +++
1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/amd
The OOB table error count info should be reset after resetting
the eeprom table
Change-Id: I2a39e0e44b7b1a5ab7d6b4d4b73ebe48264396b7
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras
PMFW reads ecc info registers in the following order,
umc0: ch_inst 0, 1, 2 ... 7
umc1: ch_inst 0, 1, 2 ... 7
The position of the register value stored in eccinfo
table is calculated according to the below formula,
channel_index = umc_inst * channel_in_umc + ch_inst
Driver directly us
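The index formula quoted in this patch description can be written as a one-line helper. The function name is hypothetical; the value 8 used in the usage note comes from the ch_inst 0 ... 7 listing in the description.

```c
#include <stdint.h>

/* Position of a channel's entry in the eccinfo table, following
 * channel_index = umc_inst * channel_in_umc + ch_inst. */
static inline uint32_t eccinfo_channel_index(uint32_t umc_inst,
                                             uint32_t channel_in_umc,
                                             uint32_t ch_inst)
{
    return umc_inst * channel_in_umc + ch_inst;
}
```

With 8 channels per UMC, umc1 ch_inst 2 maps to entry 1 * 8 + 2 = 10, and umc0 ch_inst 7 maps to entry 7.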
PMFW reads ecc info registers and stores the values in
eccinfo_table in the following order
umc0 ch_inst 0, 1, 2 ... 7
umc1 ch_inst 0, 1, 2 ... 7
...
umc3 ch_inst 0, 1, 2 ... 7
Driver should convert eccinfo_table_idx into channel_index according
to channel_idx_tbe.
Change-Id: Icafe93e458912b729d2e30d6
Change-Id: Ic2a488ee253a913d806bd33ee9c90e31a71af320
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 23 ---
drivers/gpu/drm/amd/amdgpu/umc_v8_7.c | 6 --
2 files changed, 29 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v6_7.c
b/drive
Changed from v1:
remove unused brace
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 9 -
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 ++-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/drivers
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 10 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
drivers/gpu/drm/amd/amdgpu/ta_ras_if.h | 3 ++-
3 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/dr
remove the in-recovery state check, skip umc ras err cnt
harvest in amdgpu_ras_log_on_err_counter
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 15 ++-
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/dri
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
index 6e781cee8bb6..e0a8224e466f 100644
-
skip getting ecc info for aldebaran by checking the ip version,
so other asic types are not affected
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgp
this is a workaround because getting ecc info failed during gpu recovery
[ 700.236122] amdgpu :09:00.0: amdgpu: Failed to export SMU ecc table!
[ 700.236128] amdgpu :09:00.0: amdgpu: GPU reset begin!
[ 704.331171] amdgpu: qcm fence wait loop timeout expired
[ 704.331194] amdgpu: The cp migh
Reason:
{
[ 578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin!
[ 583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu features.
[ 583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features!
[ 583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
{
[ 578.019986] amdgpu :23:00.0: amdgpu: GPU reset begin!
[ 583.245566] amdgpu :23:00.0: amdgpu: Failed to disable smu features.
[ 583.245621] amdgpu :23:00.0: amdgpu: Fail to disable dpm features!
[ 583.245639] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR*
v2:
still need to call ras_disable_all_features to handle
the ras initialization failure case.
Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so the ras ta is unloaded before the ras disable command is sent; the ras disable operation
must happen before hw fini.
Signed-off-by: Stanley.Yang
---
Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so the ras ta is unloaded before the ras disable command is sent; the ras disable operation
must happen before hw fini.
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
If smu supports ECCTABLE, the driver can message smu to get ecc_table
and then query umc error info from ECCTABLE
v2:
optimize the source code to make the logic more reasonable
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 42 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_umc
Support the ECC TABLE message; this table includes umc ras error count
and error address
v2:
add smu version check to query whether support ecctable
call smu_cmn_update_table to get ecctable directly
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 8 +++
driv
add message smu to query error information
v2:
rename message_smu to ecc_info
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 4 +
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 161
3 files c
update smu driver if version to 0x08 to avoid mismatch log
A version mismatch can still happen with an older FW
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
.../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
drivers/gpu/drm/amd/pm/inc/
If smu supports ECCTABLE, the driver can message smu to get ecc_table
and then query umc error info from ECCTABLE
apply pmfw version check to ensure backward compatibility
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 42 ---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras
add message smu to query error information
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 16 +++
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h | 4 +
drivers/gpu/drm/amd/amdgpu/umc_v6_7.c | 161
3 files changed, 181 insertions(+)
diff --git a/
Support the ECC TABLE message; this table includes umc ras error count
and error address
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 7
.../drm/amd/pm/swsmu/smu13/aldebaran_ppt.c| 38 +++
.../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 2
update smu driver if version to 0x08 to avoid mismatch log
A version mismatch can still happen with an older FW
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
.../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
drivers/gpu/drm/amd/pm/inc/
update smu driver if and version to avoid mismatch log
v2:
update smu driver interface
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
.../drm/amd/pm/inc/smu13_driver_if_aldebaran.h | 18 +-
drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
update smu driver if version to avoid mismatch log
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
b/drivers/gpu
update smu driver if version to avoid mismatch log
Change-Id: I97f2bc4ed9a9cba313b744e2ff6812c90b244935
Signed-off-by: Stanley.Yang
---
drivers/gpu/drm/amd/pm/inc/smu_v13_0.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v13_0.h
b/drivers/gpu