[PATCH] drm/amdgpu/vcn: reset fw_shared when VCPU buffers corrupted on vcn v4.0.3

2024-11-19 Thread Xiang Liu
did v2: Refine commit message Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 32 ++--- 1 file changed, 23 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c index d011e4678ca1

[PATCH] drm/amdgpu/vcn: reset firmware flags after VCPU buffers are cleared to 0

2024-11-18 Thread Xiang Liu
In the case of RAS err_event_athub, the VCPU buffers are corrupted and cannot be restored in amdgpu_vcn_resume(), so the buffers are cleared to 0 for good. However, the firmware flags stored in the buffers need to be reset, or the firmware cannot work properly. Signed-off-by: Xiang Liu --- drivers

[PATCH] drm/amdgpu/vcn: reset fw_shared when VCPU buffers corrupted on vcn v4.0.3

2024-11-20 Thread Xiang Liu
redundant code like vcn_v4_0 did v2: Refine commit message v3: Drop the volatile v3: Refine commit message Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/vcn_v4_0_3.c | 30 ++--- 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu

[PATCH v2 06/12] drm/amdgpu: add RAS CPER ring buffer

2025-02-14 Thread Xiang Liu
From: Tao Zhou And initialize it; this is a pure software ring to store RAS CPER data. v2: update the initialization of count_dw of the CPER ring, since it is a dword variable. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 39 +++--- d

[PATCH v2 02/12] drm/amdgpu: Introduce funcs for populating CPER

2025-02-14 Thread Xiang Liu
From: Hawking Zhang Introduce utility functions designed to assist in populating CPER records. v2: call cper_init/fini in device_ip_init/fini. Signed-off-by: Hawking Zhang Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/Makefile| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h

[PATCH v2 00/12] Generate CPER records for RAS and commit to CPER ring

2025-02-14 Thread Xiang Liu
ring drm/amdgpu: add mutex lock for cper ring Xiang Liu (3): drm/amdgpu: Get timestamp from system time drm/amdgpu: Commit CPER entry drm/amdgpu: Generate bad page threshold cper records drivers/gpu/drm/amd/amdgpu/Makefile| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 4

[PATCH v2 01/12] drm/amd/include: Add amd cper header

2025-02-14 Thread Xiang Liu
From: Hawking Zhang AMD is using Common Platform Error Record (CPER) format to report all gpu hardware errors. v2: add program attribute Signed-off-by: Hawking Zhang Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/include/amd_cper.h | 269 + 1

[PATCH v2 09/12] drm/amdgpu: add mutex lock for cper ring

2025-02-14 Thread Xiang Liu
From: Tao Zhou Avoid conflicts between reads and writes of the ring buffer. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 21 +++

[PATCH v2 08/12] drm/amdgpu: add data write function for CPER ring

2025-02-14 Thread Xiang Liu
From: Tao Zhou Old CPER data is overwritten when the ring buffer is full, and the read pointer always points to a CPER header. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 93 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.h | 2
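The overwrite behaviour described in this snippet can be illustrated with a small user-space sketch. It is not the driver's code: the names `demo_ring` and `demo_ring_write`, the 16-dword ring size, and the convention that a record's first dword holds its length in dwords are all assumptions made for illustration. The point it shows is that dropping whole records, oldest first, keeps the read pointer on a record header.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RING_DW 16u /* hypothetical ring size in dwords */

struct demo_ring {
	uint32_t buf[DEMO_RING_DW];
	uint32_t rptr; /* index of the oldest record's header dword */
	uint32_t wptr; /* index of the next dword to write */
	uint32_t used; /* dwords currently occupied */
};

/*
 * Hypothetical record layout: rec[0] is the record length in dwords,
 * header included. Real CPER headers are laid out differently.
 */
static void demo_ring_write(struct demo_ring *r, const uint32_t *rec,
			    uint32_t len_dw)
{
	uint32_t i;

	assert(len_dw >= 1 && len_dw <= DEMO_RING_DW && rec[0] == len_dw);

	/* Drop whole old records until the new one fits, so rptr
	 * always lands on the header of a surviving record. */
	while (DEMO_RING_DW - r->used < len_dw) {
		uint32_t old_len = r->buf[r->rptr];

		r->rptr = (r->rptr + old_len) % DEMO_RING_DW;
		r->used -= old_len;
	}

	/* Copy the record dword by dword, wrapping at the ring end. */
	for (i = 0; i < len_dw; i++) {
		r->buf[r->wptr] = rec[i];
		r->wptr = (r->wptr + 1) % DEMO_RING_DW;
	}
	r->used += len_dw;
}
```

Writing three 6-dword records into this 16-dword ring evicts the first record when the third arrives, and the read pointer then sits on the header of the second.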

[PATCH v2 11/12] drm/amdgpu: Commit CPER entry

2025-02-14 Thread Xiang Liu
Commit the CPER entry to the ring buffer. Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c

[PATCH v2 04/12] drm/amdgpu: Introduce funcs for generating cper record

2025-02-14 Thread Xiang Liu
From: Hawking Zhang Introduce new functions that are used to generate cper ue or ce records. v2: return -ENOMEM instead of false v2: check return value of fill section function Signed-off-by: Hawking Zhang Signed-off-by: Xiang Liu Reviewed-by: Yang Wang Reviewed-by: Tao Zhou --- drivers

[PATCH v2 03/12] drm/amdgpu: Include ACA error type in aca bank

2025-02-14 Thread Xiang Liu
From: Hawking Zhang ACA error types managed by the driver do not have a direct 1:1 correspondence with those managed by firmware. To address this, for each ACA bank, include both the ACA error type and the ACA SMU type. This addition is useful for creating CPER records. Signed-off-by: Hawking Zhang Reviewed-

[PATCH v2 05/12] drm/amdgpu: Generate cper records

2025-02-14 Thread Xiang Liu
From: Hawking Zhang Encode the error information in CPER format and commit to the cper ring Signed-off-by: Hawking Zhang Reviewed-by: Yang Wang Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 32 + 1 file changed, 32 insertions(+) diff --git a/dri

[PATCH v2 07/12] drm/amdgpu: read CPER ring via debugfs

2025-02-14 Thread Xiang Liu
From: Tao Zhou We read CPER data from read pointer to write pointer without changing the pointers. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 47 ++-- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/dri
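A non-destructive read of this kind can be sketched as follows. This is a minimal user-space illustration, not the debugfs handler itself: `demo_ring_peek` and its parameters are hypothetical, and the sketch assumes that equal read and write pointers mean an empty ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy the dwords between rptr and wptr into out[] using a local
 * cursor, so the ring's own pointers are never modified.
 * Returns the number of dwords copied (at most out_dw).
 */
static size_t demo_ring_peek(const uint32_t *ring, uint32_t ring_dw,
			     uint32_t rptr, uint32_t wptr,
			     uint32_t *out, size_t out_dw)
{
	size_t n = 0;
	uint32_t cur = rptr; /* local copy; caller's rptr is untouched */

	assert(ring_dw > 0 && rptr < ring_dw && wptr < ring_dw);

	while (cur != wptr && n < out_dw) {
		out[n++] = ring[cur];
		cur = (cur + 1) % ring_dw; /* wrap at the end of the ring */
	}
	return n;
}
```

Because the pointers are passed by value, repeated reads return the same data until a writer advances them.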

[PATCH v2 10/12] drm/amdgpu: Get timestamp from system time

2025-02-14 Thread Xiang Liu
Get system local time and encode it to timestamp for CPER. Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu

[PATCH v2 12/12] drm/amdgpu: Generate bad page threshold cper records

2025-02-14 Thread Xiang Liu
Generate a CPER record when the bad page threshold is exceeded and commit it to the CPER ring. v2: return -ENOMEM instead of false v2: check return value of fill section function Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 23 +++ drivers

[PATCH] drm/amdgpu: Check if CPER enabled when generating CPER

2025-02-24 Thread Xiang Liu
When CPER is disabled, generating a CPER record without checking first causes a kernel NULL pointer dereference. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 3 +++ drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 5 +++-- 2 files changed, 6 insertions(+), 2 deletions(-) diff

[PATCH] drm/amdgpu: Set CPER enabled flag after ring initialized

2025-02-24 Thread Xiang Liu
Set cper.enabled to true only after the CPER ring is successfully created. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd
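The ordering this patch describes can be sketched in user space as below. The struct and function names (`demo_cper`, `demo_cper_init`, `demo_cper_commit`, `demo_cper_fini`) and the error codes are assumptions for illustration and do not mirror the driver's actual code. The idea shown is that the enabled flag is raised only once the ring exists, so any path that checks the flag first cannot touch a missing ring.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_cper {
	bool enabled;   /* true only while ring is valid */
	uint32_t *ring; /* hypothetical software ring storage */
};

static int demo_cper_init(struct demo_cper *c, size_t ring_dw)
{
	assert(c);
	c->enabled = false;
	c->ring = ring_dw ? calloc(ring_dw, sizeof(*c->ring)) : NULL;
	if (!c->ring)
		return -ENOMEM; /* flag stays false on failure */

	c->enabled = true; /* set only after the ring exists */
	return 0;
}

static int demo_cper_commit(struct demo_cper *c, uint32_t value)
{
	assert(c);
	if (!c->enabled)
		return -ENODEV; /* never dereference a missing ring */

	c->ring[0] = value;
	return 0;
}

static void demo_cper_fini(struct demo_cper *c)
{
	assert(c);
	c->enabled = false; /* clear the flag before freeing the ring */
	free(c->ring);
	c->ring = NULL;
}
```

With this ordering, a failed initialization leaves `enabled` false and later commits return an error instead of writing through a NULL ring pointer.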

[PATCH] drm/amdgpu: Decode deferred error type in aca bank parser

2025-02-25 Thread Xiang Liu
In the case of poison consumption's inband log, the error type needs to be specified by checking the poison bit of the status register. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 5 + drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 4 ++-- drivers/gpu/drm/amd/a

[PATCH] drm/amdgpu: Disable fru_id field in CPER section

2025-02-25 Thread Xiang Liu
The fru_id field is disabled because of a mismatched definition between the CPER spec and the driver. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm

[PATCH] drm/amdgpu: Decode deferred error type in aca bank parser

2025-02-25 Thread Xiang Liu
In the case of poison inband log, the error type needs to be specified by checking the deferred or poison bit of the status register. v2: check both deferred and poison bit Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 4

[PATCH] drm/amdgpu: Report generic instead of unknown boot time errors

2025-02-25 Thread Xiang Liu
Change the DMESG reporting of unknown errors to "Boot Controller Generic Error" to align with the RAS SPEC and provide more clarity to customers. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +- 2 files

[PATCH] drm/amdgpu: Remove redundant check of adev

2025-02-18 Thread Xiang Liu
There is no need to check adev for sure. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c index c0da9096a7fa..d11593cd1922

[PATCH] drm/amdgpu: Check aca enabled inside cper init/fini func

2025-02-18 Thread Xiang Liu
Move the check for whether ACA is enabled into the cper init/fini functions to make the code cleaner. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++ 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a

[PATCH] drm/amdgpu: Free CPER entry after committing to ring

2025-02-27 Thread Xiang Liu
Free CPER entry when it's committed to CPER ring to avoid memory leak. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c

[PATCH] drm/amdgpu: Use unique CPER record id across devices

2025-03-05 Thread Xiang Liu
Encode the socket id into the CPER record id so that it is unique across devices. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c

[PATCH] drm/amdgpu: Fix computation for remain size of CPER ring

2025-03-12 Thread Xiang Liu
A miscalculation of the remaining size of the CPER ring causes an endless while loop when the CPER ring overflows. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd

[PATCH v2] drm/amdgpu: Use unique CPER record id across devices

2025-03-11 Thread Xiang Liu
Encode the socket id into the CPER record id so that it is unique across devices. v2: add pointer check for adev->smuio.funcs->get_socket_id v2: set 0 if adev->smuio.funcs->get_socket_id is NULL Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 18 +- 1 file
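The v2 notes above (check the callback pointer, fall back to 0 when it is NULL) can be sketched as follows. Everything here is hypothetical: the `demo_` types stand in for the device and its callback table, and the `"socket:sequence"` string format is an assumption chosen for illustration, not the record id format the driver uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct demo_dev;

/* Hypothetical stand-in for an optional per-device callback table. */
struct demo_smuio_funcs {
	int (*get_socket_id)(struct demo_dev *dev);
};

struct demo_dev {
	const struct demo_smuio_funcs *smuio_funcs; /* may be NULL */
	int socket;            /* value the sample callback reports */
	unsigned int next_seq; /* per-device record counter */
};

/* Sample callback used only by this sketch. */
static int demo_get_socket_id(struct demo_dev *dev)
{
	return dev->socket;
}

/* Build "<socket>:<hex sequence>"; socket falls back to 0 when the
 * callback table or the callback itself is missing. */
static void demo_make_record_id(struct demo_dev *dev, char *out, size_t len)
{
	int socket_id = 0;

	assert(out && len > 0);

	if (dev->smuio_funcs && dev->smuio_funcs->get_socket_id)
		socket_id = dev->smuio_funcs->get_socket_id(dev);

	snprintf(out, len, "%d:%X", socket_id, dev->next_seq++);
}
```

Two devices on different sockets then produce different ids even when their per-device counters hold the same value.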

[PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14

2025-03-13 Thread Xiang Liu
Enable ACA by default for psp v13_0_6/v13_0_14. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 7cf8a3036828

[PATCH] drm/amdgpu: Use correct aca handle to validate aca bank

2025-03-18 Thread Xiang Liu
The aca handle is passed in by the upper caller, so it is inappropriate to poll aca handles to match and validate an aca bank, which causes unexpected RAS error reports. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 122 ++-- drivers/gpu/drm/amd/a

[PATCH v2] drm/amdgpu: Decode deferred error type in gfx aca bank parser

2025-03-20 Thread Xiang Liu
DEs among UEs Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 25 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 16 +++- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 8 3 files changed, 38 insertions(+), 11 deletions(-) diff --git a

[PATCH] drm/amdgpu: Decode deferred error type in gfx aca bank parser

2025-03-19 Thread Xiang Liu
When an uncorrected error is injected with a background workload, the deferred errors among the uncorrected errors need to be identified by checking the deferred and poison bits of the status register. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 3 +++ drivers/gpu/drm/amd

[PATCH] drm/amdgpu: Double check UC/PCC when parsing GFX UE

2025-03-18 Thread Xiang Liu
Double check the UC and PCC bits of the status register for GFX UE to avoid unexpected GFX UE reports. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c b/drivers

[PATCH] drm/amdgpu: Use correct gfx deferred error count

2025-03-21 Thread Xiang Liu
In the case of parsing GFX deferred error from SMU corrected error channel, the error count should be set to 1 instead of parsing from MISC0 register, which is 0. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions

[PATCH] drm/amdgpu: Parse all deferred errors with UMC aca handle

2025-03-24 Thread Xiang Liu
We should only increase the deferred errors in UMC block. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_aca.h | 8 drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 2 +- drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c | 8