Re: [PATCH] drm/amdgpu/mes: keep enforce isolation up to date

2025-02-14 Thread SRINIVASAN SHANMUGAM
On 2/14/2025 11:05 PM, Alex Deucher wrote: Re-send the mes message on resume to make sure the mes state is up to date. Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Signed-off-by: Alex Deucher Cc: Shaoyun Liu Cc: Srinivasan Shanmugam --- drivers/gpu/drm/amd/amdgpu/am

Re: [PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Liu, Shaoyun
[AMD Official Use Only - AMD Internal Distribution Only] I think I should make it clearer. When MES is being used, no matter whether it is pipe0 or pipe1, we expect both set_hw_resource and set_hw_resource_1 to be called; that's a requirement for mes_v12 and later. For non-unified MES config, the pi

Re: [PATCH] Documentation/gpu: Add acronyms for some firmware components

2025-02-14 Thread Alex Deucher
On Fri, Feb 14, 2025 at 6:38 PM Rodrigo Siqueira wrote: > > On 02/14, Alex Deucher wrote: > > On Fri, Feb 14, 2025 at 6:00 PM Rodrigo Siqueira > > wrote: > > > > > > Users can check the file "/sys/kernel/debug/dri/0/amdgpu_firmware_info" > > > to get information on the firmware loaded in the sys

Re: [PATCH] Documentation/gpu: Add acronyms for some firmware components

2025-02-14 Thread Rodrigo Siqueira
On 02/14, Alex Deucher wrote: > On Fri, Feb 14, 2025 at 6:00 PM Rodrigo Siqueira wrote: > > > > Users can check the file "/sys/kernel/debug/dri/0/amdgpu_firmware_info" > > to get information on the firmware loaded in the system. This file has > > multiple acronyms that are not documented in the gl

Re: [PATCH] Documentation/gpu: Add acronyms for some firmware components

2025-02-14 Thread Alex Deucher
On Fri, Feb 14, 2025 at 6:00 PM Rodrigo Siqueira wrote: > > Users can check the file "/sys/kernel/debug/dri/0/amdgpu_firmware_info" > to get information on the firmware loaded in the system. This file has > multiple acronyms that are not documented in the glossary. This commit > introduces some mi

[PATCH] Documentation/gpu: Add acronyms for some firmware components

2025-02-14 Thread Rodrigo Siqueira
Users can check the file "/sys/kernel/debug/dri/0/amdgpu_firmware_info" to get information on the firmware loaded in the system. This file has multiple acronyms that are not documented in the glossary. This commit introduces some missing acronyms to the AMD glossary documentation. The meaning of ea

RE: [PATCH 12/16] drm/amd/display: Support BT2020 YCbCr fullrange

2025-02-14 Thread Kovac, Krunoslav
[AMD Official Use Only - AMD Internal Distribution Only] Hi Robert, We only had one enum: COLOR_SPACE_2020_YCBCR. On the output side this assumed limited range. On the input side this apparently assumed full range given the dpp matrix. Now we split it into two enums to distinguish them and add li

RE: [PATCH 12/16] drm/amd/display: Support BT2020 YCbCr fullrange

2025-02-14 Thread Li, Roman
[Public] Hi Robert, thank you for the feedback. What about this version of commit message: Fix BT2020 YCbCr limited/full range input [Why] BT2020 YCbCr input is not handled properly when full range quantization is used and limited range is not supported at all. [How] - Add enums for BT2020 YCb

Re: [PATCH 1/3] drm/amdgpu: Do not program AGP BAR regs under SRIOV

2025-02-14 Thread Deucher, Alexander
[Public] Are there any cases where the asic_type check would cause this register to fail to get programmed? Alex From: amd-gfx on behalf of Victor Lu Sent: Thursday, February 13, 2025 7:13 PM To: amd-gfx@lists.freedesktop.org Cc: Lu, Victor Cheng Chi (Victor

RE: [PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Liu, Shaoyun
[AMD Official Use Only - AMD Internal Distribution Only] Oh, you're right. It's only for unified MES; for non-unified, it will still use the KIQ from CP directly on pipe1, so there is no MES API for it at all. It's my fault, please ignore my previous comments. Your current change for this

Re: [PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Deucher, Alexander
[AMD Official Use Only - AMD Internal Distribution Only] Does it matter which pipe we use for these packets? Alex From: Liu, Shaoyun Sent: Friday, February 14, 2025 12:36 PM To: Deucher, Alexander ; amd-gfx@lists.freedesktop.org Subject: RE: [PATCH 2/2] drm/am

RE: [PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Liu, Shaoyun
[AMD Official Use Only - AMD Internal Distribution Only] OK. From the MES point of view, we expect both set_hw_resource and set_hw_resource_1 to be called all the time. Reviewed-by: Shaoyun.liu From: Deucher, Alexander Sent: Friday, February 14, 2025 11:53 AM To: Liu, Shaoyun ; amd-gfx@list

[PATCH] drm/amdgpu/mes: keep enforce isolation up to date

2025-02-14 Thread Alex Deucher
Re-send the mes message on resume to make sure the mes state is up to date. Fixes: 8521e3c5f058 ("drm/amd/amdgpu: limit single process inside MES") Signed-off-by: Alex Deucher Cc: Shaoyun Liu Cc: Srinivasan Shanmugam --- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 13 - drivers/gpu/d

Re: [PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Deucher, Alexander
[AMD Official Use Only - AMD Internal Distribution Only] I can add that as a follow up patch as I don't want to change the current behavior to avoid a potential regression. Should we submit both the resource and resource_1 packets all the time? Thanks, Alex F

Re: [PATCH 1/2] drm/amdgpu/mes11: allocate hw_resource_1 buffer once

2025-02-14 Thread Alex Deucher
On Fri, Feb 14, 2025 at 11:42 AM Liu, Shaoyun wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > Looks good to me . > Reviewed-by: Shaoyun.liu < shaouyun@amd.com> Thanks, is this for the whole series or just this patch? Alex > > -Original Message- > From: amd-gf

RE: [PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Liu, Shaoyun
[AMD Official Use Only - AMD Internal Distribution Only] I'd suggest removing the enable_uni_mes check; set_hw_resource_1 is always required for gfx12 and up, especially after adding the cleaner_shader_fence_addr there. Regards Shaoyun.liu -Original Message- From: amd-gfx On Behalf Of A

RE: [PATCH 1/2] drm/amdgpu/mes11: allocate hw_resource_1 buffer once

2025-02-14 Thread Liu, Shaoyun
[AMD Official Use Only - AMD Internal Distribution Only] Looks good to me . Reviewed-by: Shaoyun.liu < shaouyun@amd.com> -Original Message- From: amd-gfx On Behalf Of Alex Deucher Sent: Friday, February 14, 2025 10:19 AM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander Subje

Re: [PATCH] drm/amd/display: Disable -Wenum-float-conversion for dml2_dpmm_dcn4.c

2025-02-14 Thread Alex Deucher
> > > Fixes") > > > Signed-off-by: Nathan Chancellor > > > --- > > > If you would prefer reapplying the local fix, feel free to do so, but I > > > would like for it to be in the upstream source so it does not have to > > > keep being applied. &

Re: [PATCH] drm/amd/display: Disable -Wenum-float-conversion for dml2_dpmm_dcn4.c

2025-02-14 Thread Nathan Chancellor
t to be in the upstream source so it does not have to > > keep being applied. > > I've reapplied the original fix and I've confirmed that the fix will > be pushed to the DML tree as well this time. Did that actually end up happening? Commit 1b30456150e5 ("drm/amd/display: DML21 Reintegration") in next-20250214 reintroduces this warning... I guess it may be a timing thing because the author date is three weeks ago or so. Should I send my "Reapply" patch or will you take care of it? Cheers, Nathan

RE: [PATCH] drm/amdgpu: simplify xgmi peer info calls

2025-02-14 Thread Kim, Jonathan
[Public] We could be talking about 2 types of bandwidth here. 1. Bandwidth per link 2. Bandwidth per peer i.e. multiple xgmi links that are used for SDMA gang submissions for effective max bandwidth * num_link copy speed. The is currently used by runtime i.e. max divide by min. The numb

Re: [PATCH] drm/amdgpu: simplify xgmi peer info calls

2025-02-14 Thread Lazar, Lijo
[Public] For minimum bandwidth, we should keep the possibility of going to FW to get the data when XGMI DPM is in place. So it is all wrapped inside the API when the devices passed are connected. The caller doesn't need to know. BTW, what is the real requirement of bandwidth data without any pe

[PATCH 2/2] drm/amdgpu/mes12: allocate hw_resource_1 buffer once

2025-02-14 Thread Alex Deucher
Allocate the buffer at sw init time so we don't alloc and free it for every suspend/resume or reset cycle. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 39 +- 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/drivers/gpu/drm/amd/a

[PATCH 1/2] drm/amdgpu/mes11: allocate hw_resource_1 buffer once

2025-02-14 Thread Alex Deucher
Allocate the buffer at sw init time so we don't alloc and free it for every suspend/resume or reset cycle. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 52 +- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/drivers/gpu/drm/amd/a

Re: [PATCH] drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV

2025-02-14 Thread Skvortsov, Victor
>> On 2/14/2025 2:39 PM, Christian König wrote: >>> Am 14.02.25 um 09:57 schrieb Srinivasan Shanmugam: >>> RLCG Register Access is a way for virtual functions to safely access GPU >>> registers in a virtualized environment, including TLB flushes and >>> register reads. When multiple threads o

[PATCH 06/16] drm/amd/display: Add clear DCC and Tiling callback for DCN

2025-02-14 Thread Roman.Li
From: Rodrigo Siqueira Introduce the DCC and Tiling reset callback to all DCN versions that can call it. Reviewed-by: Alvin Lee Signed-off-by: Rodrigo Siqueira Signed-off-by: Roman Li --- drivers/gpu/drm/amd/display/dc/core/dc_surface.c| 13 ++--- .../gpu/drm/amd/display/dc/hwss/

[PATCH 04/16] drm/amd/display: Add DCC/Tiling reset helper for DCN and DCE

2025-02-14 Thread Roman.Li
From: Rodrigo Siqueira This commit introduces a function helper for resetting DCN/DCE DCC and tiling. Those functions are generic for their respective DCN/DCE, so they were added to the oldest version of each architecture. Reviewed-by: Alvin Lee Signed-off-by: Rodrigo Siqueira Signed-off-by: R

[PATCH 13/16] drm/amd/display: Guard against setting dispclk low when active

2025-02-14 Thread Roman.Li
From: Nicholas Kazlauskas [Why] We should never apply a minimum dispclk value while in prepare_bandwidth or while displays are active. This is always an optimization for when all displays are disabled. [How] Defer dispclk optimization until safe_to_lower = true and display_count reaches 0. Sinc

[PATCH 10/16] drm/amd/display: Add total_num_dpps_required field to informative structure

2025-02-14 Thread Roman.Li
From: Oleh Kuzhylnyi [Why] The informative structure needs to be extended by the total number of DPPs required per each active plane. The new informative field is going to be used as a statistical indicator. [How] The dml2_core_calcs_get_informative() routine must count a total number of DPPs.

[PATCH 05/16] drm/amd/display: Rename panic function

2025-02-14 Thread Roman.Li
From: Rodrigo Siqueira Rename dc_plane_force_update_for_panic to dc_plane_force_dcc_and_tiling_disable to describe the function operation in the name. Also, this function might be used in other contexts, and a more generic name can be helpful for this purpose. Reviewed-by: Alvin Lee Signed-off-

[PATCH 09/16] drm/amd/display: Read LTTPR ALPM caps during link cap retrieval

2025-02-14 Thread Roman.Li
From: George Shen [Why] The latest DP spec requires the DP TX to read DPCD F0000h through F0009h when detecting LTTPR capabilities for the first time. [How] Update LTTPR cap retrieval to read up to F0009h (two more bytes than the previous F0007h), and store the LTTPR ALPM capabilities. Reviewed

[PATCH 16/16] drm/amd/display: 3.2.321

2025-02-14 Thread Roman.Li
From: Taimur Hassan Summary: * Add support for disconnected eDP streams * Add log for MALL entry on DCN32x * Add DCC/Tiling reset helper for DCN and DCE * Guard against setting dispclk low when active * Other minor fixes Reviewed-by: Aurabindo Pillai Signed-off-by: Taimur Hassan Signed-off-by

[PATCH 15/16] drm/amd/display: Add support for disconnected eDP streams

2025-02-14 Thread Roman.Li
From: Harry VanZyllDeJong [Why] eDP may not be connected to the GPU on driver start, causing enumeration to fail. [How] Move the virtual signal type check before the eDP connector signal check. Reviewed-by: Wenjing Liu Signed-off-by: Harry VanZyllDeJong Signed-off-by: Roman Li --- .../drm/amd/d

[PATCH 11/16] drm/amd/display: Add log for MALL entry on DCN32x

2025-02-14 Thread Roman.Li
From: Aurabindo Pillai [Why&How] Add a dyndbg log entry to check whether the driver requested scanout from MALL cache to PMFW via DMCUB Reviewed-by: Zaeem Mohamed Reviewed-by: Roman Li Signed-off-by: Aurabindo Pillai --- drivers/gpu/drm/amd/display/dc/hwss/dcn32/dcn32_hwseq.c | 2 ++ 1 file

[PATCH 14/16] drm/amd/display: dpia should avoid encoder used by dp2

2025-02-14 Thread Roman.Li
From: Peichen Huang [WHY] In the current HPO DP2 implementation, the driver enables/disables the DIG encoder when configuring HPO DP2. Therefore, USB4 DP tunnelling should not use the DIG encoder if the corresponding PHY is used by an HPO DP2 stream. [HOW] A DP2 stream is treated as a dig stream. Reviewe

[PATCH 08/16] drm/amd/display: Print seamless boot message in mark_seamless_boot_stream

2025-02-14 Thread Roman.Li
From: Alex Hung [WHAT & HOW] Add a message so users know the stream will be used for seamless boot. Reviewed-by: Mario Limonciello Reviewed-by: Rodrigo Siqueira Signed-off-by: Alex Hung Signed-off-by: Roman Li --- drivers/gpu/drm/amd/display/dc/core/dc_resource.c | 4 +++- 1 file changed, 3

[PATCH 12/16] drm/amd/display: Support BT2020 YCbCr fullrange

2025-02-14 Thread Roman.Li
From: Ilya Bakoulin [Why/How] Need to add support for full-range quantization for YCbCr in BT2020 color space. Reviewed-by: Krunoslav Kovac Signed-off-by: Ilya Bakoulin Signed-off-by: Roman Li Tested-by: Robert Mader --- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 +++---

[PATCH 02/16] drm/amd/display: Don't treat wb connector as physical in create_validate_stream_for_sink

2025-02-14 Thread Roman.Li
From: Harry Wentland Don't try to operate on a drm_wb_connector as an amdgpu_dm_connector. While dereferencing aconnector->base will "work" it's wrong and might lead to unknown bad things. Just... don't. Reviewed-by: Alex Hung Signed-off-by: Harry Wentland Signed-off-by: Roman Li --- .../gpu

[PATCH 07/16] drm/amd/display: Add clear DCC and Tiling callback for DCE

2025-02-14 Thread Roman.Li
From: Rodrigo Siqueira Introduce the DCC and Tiling reset callback to all DCE versions that can call it. Reviewed-by: Alvin Lee Signed-off-by: Rodrigo Siqueira Signed-off-by: Roman Li --- .../gpu/drm/amd/display/dc/core/dc_surface.c | 18 ++ .../amd/display/dc/dce60/dce60_h

[PATCH 00/16] DC Patches February 14, 2025

2025-02-14 Thread Roman.Li
From: Roman Li Summary: * Add support for disconnected eDP streams * Add log for MALL entry on DCN32x * Add DCC/Tiling reset helper for DCN and DCE * Guard against setting dispclk low when active * Other minor fixes Cc: Daniel Wheeler Alex Hung (1): drm/amd/display: Print seamless boot mess

[PATCH 01/16] drm/amd/display: Exit idle optimizations before accessing PHY

2025-02-14 Thread Roman.Li
From: Ovidiu Bunea [why & how] By default, DCN HW is in idle optimized state which does not allow access to PHY registers. If BIOS powers up the DCN, it is fine because they will power up everything. Only exit idle optimized state when not taking control from VBIOS. Fixes: 53f82eb16293 ("Revert

[PATCH 03/16] Revert "drm/amd/display: Request HW cursor on DCN3.2 with SubVP"

2025-02-14 Thread Roman.Li
From: Leo Zeng This reverts commit aaa44ed6cd8af2089d2bf6a2e66a0436fef9791f. Reason to revert: idle power regression found in testing. Reviewed-by: Dillon Varone Signed-off-by: Leo Zeng Signed-off-by: Roman Li --- drivers/gpu/drm/amd/display/dc/dml/dcn32/dcn32_fpu.c | 1 - 1 file changed, 1

RE: [PATCH] drm/amdgpu: simplify xgmi peer info calls

2025-02-14 Thread Kim, Jonathan
[Public] > -Original Message- > From: Lazar, Lijo > Sent: Friday, February 14, 2025 12:58 AM > To: Kim, Jonathan ; amd-gfx@lists.freedesktop.org > Subject: Re: [PATCH] drm/amdgpu: simplify xgmi peer info calls > > > > On 2/13/2025 9:20 PM, Kim, Jonathan wrote: > > [Public] > > > >> -O

RE: [PATCH] drm/amdgpu/display: Allow DCC for video formats on GFX12

2025-02-14 Thread Dong, Ruijing
[AMD Official Use Only - AMD Internal Distribution Only] Reviewed-by: Ruijing Dong -Original Message- From: amd-gfx On Behalf Of David Rosca Sent: Thursday, February 13, 2025 12:07 PM To: amd-gfx@lists.freedesktop.org Cc: Rosca, David Subject: [PATCH] drm/amdgpu/display: Allow DCC for

Re: [PATCH] drm/amd/pm: extend the gfxoff delay for compute workload

2025-02-14 Thread Alex Deucher
On Fri, Feb 14, 2025 at 7:32 AM Kenneth Feng wrote: > > extend the gfxoff delay for compute workload on smu 14.0.2/3 > to fix the kfd test issue. This doesn't make sense. We explicitly disallow gfxoff in amdgpu_amdkfd_set_compute_idle() already so it should already be disallowed. Alex > > Sig

[PATCH v5 3/6] drm/sched: Remove a hole from struct drm_sched_job

2025-02-14 Thread Tvrtko Ursulin
We can re-order some struct members and take u32 credits outside of the pointer sandwich and also for the last_dependency member we can get away with an unsigned int since for dependency we use xa_limit_32b. Pahole report before: /* size: 160, cachelines: 3, members: 14 */ /* sum m
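The effect of moving a 32-bit member out from between two pointers can be sketched outside the kernel. The two structs below are hypothetical stand-ins, not the real struct drm_sched_job layout; they only assume a typical ABI where pointers are wider than a 32-bit integer (for example LP64, where the first struct is typically 32 bytes and the second 24).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a u32 sandwiched between two pointers.
 * On LP64 the compiler pads after each 32-bit member so the
 * following pointer (and the struct end) stay 8-byte aligned. */
struct job_with_hole {
	void *prev;
	uint32_t credits;
	void *next;
	unsigned int last_dependency;
};

/* Same members, with the two 32-bit fields placed next to each
 * other so they can share one 8-byte slot. */
struct job_reordered {
	void *prev;
	void *next;
	uint32_t credits;
	unsigned int last_dependency;
};

/* Number of bytes the reordering saves on the current ABI
 * (zero on ABIs where pointers are as narrow as the integers). */
static size_t bytes_saved_by_reorder(void)
{
	assert(sizeof(struct job_reordered) <= sizeof(struct job_with_hole));
	return sizeof(struct job_with_hole) - sizeof(struct job_reordered);
}
```

Running pahole on a real build, as the patch does, remains the authoritative way to see the holes; this sketch only illustrates why the member order matters.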

[PATCH v5 5/6] drm/sched: Move internal prototypes to internal header

2025-02-14 Thread Tvrtko Ursulin
Now that we have a header file for internal scheduler interfaces we can move some more prototypes into it. By doing that we eliminate the chance of drivers trying to use something which was not intended to be used. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matth

[PATCH 5/5] drm/scheduler: Add a basic test for modifying entities scheduler list

2025-02-14 Thread Tvrtko Ursulin
Add a basic test for exercising modifying the entities scheduler list at runtime. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matthew Brost Cc: Philipp Stanner --- drivers/gpu/drm/scheduler/tests/tests_basic.c | 73 ++- 1 file changed, 72 insert

Re: [PATCH 2/3] drm/amdgpu: Pop jobs from the queue more robustly

2025-02-14 Thread Tvrtko Ursulin
Hi Christian, On 11/02/2025 10:21, Christian König wrote: Am 11.02.25 um 11:08 schrieb Philipp Stanner: On Tue, 2025-02-11 at 09:22 +0100, Christian König wrote: Am 06.02.25 um 17:40 schrieb Tvrtko Ursulin: Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly added __dr

[PATCH v5 4/6] drm/sched: Move drm_sched_entity_is_ready to internal header

2025-02-14 Thread Tvrtko Ursulin
Helper is for scheduler internal use so let's hide it from DRM drivers completely. At the same time we change the method of checking whether there is anything in the queue from peeking to looking at the node count. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matt

[PATCH 4/5] drm/scheduler: Add basic priority tests

2025-02-14 Thread Tvrtko Ursulin
Add some basic tests for exercising entity priority handling. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matthew Brost Cc: Philipp Stanner --- drivers/gpu/drm/scheduler/tests/tests_basic.c | 99 ++- 1 file changed, 98 insertions(+), 1 deletion(

[PATCH 0/5] DRM scheduler kunit tests

2025-02-14 Thread Tvrtko Ursulin
There has repeatedly been quite a bit of apprehension when any change to the DRM scheduler is proposed, with two main reasons being that the code base is considered fragile, not well understood and not very well documented, and secondly the lack of systematic testing outside the vendor specific test suites

[PATCH 2/5] drm/scheduler: Add scheduler unit testing infrastructure and some basic tests

2025-02-14 Thread Tvrtko Ursulin
Implement a mock scheduler backend and add some basic tests to exercise the core scheduler code paths. The mock backend (kind of like a very simple mock GPU) can either process jobs by tests manually advancing the "timeline" one job at a time, or alternatively jobs can be configured with a time duration in

Re: [PATCH 2/3] drm/amdgpu: Pop jobs from the queue more robustly

2025-02-14 Thread Tvrtko Ursulin
On 14/02/2025 10:31, Christian König wrote: Am 14.02.25 um 11:21 schrieb Tvrtko Ursulin: Hi Christian, On 11/02/2025 10:21, Christian König wrote: Am 11.02.25 um 11:08 schrieb Philipp Stanner: On Tue, 2025-02-11 at 09:22 +0100, Christian König wrote: Am 06.02.25 um 17:40 schrieb Tvrtko Ur

[PATCH v5 6/6] drm/sched: Group exported prototypes by object type

2025-02-14 Thread Tvrtko Ursulin
Do a bit of house keeping in gpu_scheduler.h by grouping the API by type of object it operates on. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matthew Brost Cc: Philipp Stanner --- include/drm/gpu_scheduler.h | 60 - 1 file c

[PATCH v5 2/6] drm/amdgpu: Pop jobs from the queue more robustly

2025-02-14 Thread Tvrtko Ursulin
Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly added drm_sched_entity_queue_pop. This allows breaking the hidden dependency that queue_node has to be the first element in struct drm_sched_job. A comment is also added with a reference to the mailing list discussion expla

[PATCH v5 0/6] drm/sched: Job queue peek/pop helpers and struct job re-order

2025-02-14 Thread Tvrtko Ursulin
Let's add some helpers for peeking and popping from the job queue which allow us to re-order the fields in struct drm_sched_job and remove one hole. As in the process we have added a header file for scheduler internal prototypes, let's also use it more and clean up the "exported" header a bit. v2:

[PATCH 3/5] drm/scheduler: Add a simple timeout test

2025-02-14 Thread Tvrtko Ursulin
Add a very simple timeout test which submits a single job and verifies that the timeout handling will run if the backend failed to complete the job in time. Signed-off-by: Tvrtko Ursulin Cc: Christian König Cc: Danilo Krummrich Cc: Matthew Brost Cc: Philipp Stanner --- .../gpu/drm/scheduler/

[PATCH 1/5] drm: Move some options to separate new Kconfig.debug

2025-02-14 Thread Tvrtko Ursulin
Move some options out into a new debug specific kconfig file in order to make things a bit cleaner. Signed-off-by: Tvrtko Ursulin --- drivers/gpu/drm/Kconfig | 109 ++ drivers/gpu/drm/Kconfig.debug | 103 2 files changed, 108

[PATCH v5 1/6] drm/sched: Add internal job peek/pop API

2025-02-14 Thread Tvrtko Ursulin
Idea is to add helpers for peeking and popping jobs from entities with the goal of decoupling the hidden assumption in the code that queue_node is the first element in struct drm_sched_job. That assumption usually comes in the form of: while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_
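The hidden assumption described above can be illustrated with a small userspace sketch. A plain cast from the embedded node to the containing job only works when the node is the first member, whereas an offset-based conversion works wherever the node sits. The struct and function names below (node, job, job_from_node) are hypothetical and this is not the helper the series adds; container_of is re-defined locally in its common offsetof form.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a queue node embedded in a larger object. */
struct node {
	struct node *next;
};

/* Hypothetical job: the node is deliberately NOT the first member,
 * so casting a node pointer straight to a job pointer would be wrong. */
struct job {
	int id;
	struct node queue_node;
};

/* Offset-based conversion from a member pointer to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Convert a popped node into its job; an empty queue (NULL) stays NULL. */
static struct job *job_from_node(struct node *n)
{
	return n ? container_of(n, struct job, queue_node) : NULL;
}

/* Usage example: recover the job from its embedded node. */
static int example_usage(void)
{
	struct job j = { .id = 7 };
	struct job *found = job_from_node(&j.queue_node);

	assert(found == &j);
	return found->id;
}
```

Hiding the conversion behind one helper means callers no longer depend on where the node sits inside the struct, which is what makes re-ordering the members safe.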

Re: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

2025-02-14 Thread Yadav, Arvind
On 2/14/2025 6:09 PM, Christian König wrote: Yeah, completely agree. But not checking the syncobj handle before doing the update is actually even more problematic than leaking the memory. This could be used by userspace to put the kernel into a broken situation it can't come out any more.

Re: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

2025-02-14 Thread Christian König
Yeah, completely agree. But not checking the syncobj handle before doing the update is actually even more problematic than leaking the memory. This could be used by userspace to put the kernel into a broken situation it can't come out any more. Arvin can you take care of the complete fix? Tha

RE: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

2025-02-14 Thread YuanShang Mao (River)
[AMD Official Use Only - AMD Internal Distribution Only] Better to put the fence outside amdgpu_gem_va_update_vm. Since it is passed to the caller, and the caller must keep one reference at least until this fence is no longer needed. Thanks River -Original Message- From: amd-gfx On Be

RE: [PATCH 4/4] drm/amdgpu/gfx12: Implement the GFX12 KCQ pipe reset

2025-02-14 Thread Liang, Prike
[Public] The implementation of the gfx11/gfx12 pipe reset is derived from the gfx9 pipe reset sequence. Consequently, the driver sequence may not undergo significant changes except for incorporating gfx11/gfx12 firmware support for the pipe reset. To reduce the effort needed to address merge co

Re: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

2025-02-14 Thread Yadav, Arvind
On 2/14/2025 4:08 PM, Christian König wrote: Adding Arvind, please make sure to keep him in the loop. Am 14.02.25 um 11:07 schrieb Le Ma: On systems with CONFIG_SLUB_DEBUG enabled, the memleak like below will show up explicitly during driver unloading if created bo without drm_timeline object

[PATCH] drm/amd/pm: extend the gfxoff delay for compute workload

2025-02-14 Thread Kenneth Feng
extend the gfxoff delay for compute workload on smu 14.0.2/3 to fix the kfd test issue. Signed-off-by: Kenneth Feng --- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 3 +++ drivers/gpu/drm/amd/pm/amdgpu_dpm.c | 14 ++ drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h | 1 +

Re: [PATCH 2/3] drm/amdgpu: Pop jobs from the queue more robustly

2025-02-14 Thread Christian König
Am 14.02.25 um 11:34 schrieb Tvrtko Ursulin: > > On 14/02/2025 10:31, Christian König wrote: >> Am 14.02.25 um 11:21 schrieb Tvrtko Ursulin: >>> >>> Hi Christian, >>> >>> On 11/02/2025 10:21, Christian König wrote: Am 11.02.25 um 11:08 schrieb Philipp Stanner: > On Tue, 2025-02-11 at 09:22

Re: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

2025-02-14 Thread Christian König
Adding Arvind, please make sure to keep him in the loop. Am 14.02.25 um 11:07 schrieb Le Ma: > On systems with CONFIG_SLUB_DEBUG enabled, the memleak like below > will show up explicitly during driver unloading if created bo without > drm_timeline object before. > > BUG drm_sched_fence (Tainte

Re: [PATCH 2/3] drm/amdgpu: Pop jobs from the queue more robustly

2025-02-14 Thread Christian König
Am 14.02.25 um 11:21 schrieb Tvrtko Ursulin: > > Hi Christian, > > On 11/02/2025 10:21, Christian König wrote: >> Am 11.02.25 um 11:08 schrieb Philipp Stanner: >>> On Tue, 2025-02-11 at 09:22 +0100, Christian König wrote: Am 06.02.25 um 17:40 schrieb Tvrtko Ursulin: > Replace a copy of DRM

[PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

2025-02-14 Thread Le Ma
On systems with CONFIG_SLUB_DEBUG enabled, the memleak like below will show up explicitly during driver unloading if created bo without drm_timeline object before. BUG drm_sched_fence (Tainted: G OE ): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()

Re: [PATCH] drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV

2025-02-14 Thread SRINIVASAN SHANMUGAM
On 2/14/2025 2:39 PM, Christian König wrote: Am 14.02.25 um 09:57 schrieb Srinivasan Shanmugam: RLCG Register Access is a way for virtual functions to safely access GPU registers in a virtualized environment, including TLB flushes and register reads. When multiple threads or VFs try to access

Re: [PATCH] drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV

2025-02-14 Thread Christian König
Am 14.02.25 um 09:57 schrieb Srinivasan Shanmugam: > RLCG Register Access is a way for virtual functions to safely access GPU > registers in a virtualized environment, including TLB flushes and > register reads. When multiple threads or VFs try to access the same > registers simultaneously, it can

[PATCH] drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV

2025-02-14 Thread Srinivasan Shanmugam
RLCG Register Access is a way for virtual functions to safely access GPU registers in a virtualized environment, including TLB flushes and register reads. When multiple threads or VFs try to access the same registers simultaneously, it can lead to race conditions. By using the RLCG interface, the

Re: [PATCH] drm/amdgpu: Fix crashes in enforce_isolation sysfs handling on non-supported systems

2025-02-14 Thread Christian König
Am 13.02.25 um 18:50 schrieb Srinivasan Shanmugam: > By adding these NULL pointer checks and improving error handling, we can > prevent crashes when the enforce_isolation sysfs file is accessed on > non-supported systems. > > Cc: Christian König > Cc: Alex Deucher > Signed-off-by: Srinivasan Shan

Re: [PATCH] drm/radeon/ci_dpm: Remove needless NULL checks of dpm tables

2025-02-14 Thread Nikita Zhandarovich
Gentle ping :) On 1/14/25 16:58, Nikita Zhandarovich wrote: > This patch removes useless NULL pointer checks in functions like > ci_set_private_data_variables_based_on_pptable() and > ci_setup_default_dpm_tables(). > > The pointers in question are initialized as addresses to existing > structures

[PATCH][next] drm/amd/pm: Avoid multiple -Wflex-array-member-not-at-end warnings

2025-02-14 Thread Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. So, in order to avoid ending up with a flexible-array member in the middle of other structs, we use the `struct_group_tagged()` helper to create a new tagged `struct NISLANDS_SMC_SWSTATE_HDR`

Re: [PATCH 2/3] drm/amdgpu: Do not write to GRBM_CNTL if Aldebaran SRIOV

2025-02-14 Thread Lazar, Lijo
On 2/14/2025 5:43 AM, Victor Lu wrote: > Aldebaran SRIOV VF does not have write permissions to GRBM_CTNL. > This access can be skipped to avoid a dmesg warning. > > Signed-off-by: Victor Lu > --- > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-

[PATCH v2 12/12] drm/amdgpu: Generate bad page threshold cper records

2025-02-14 Thread Xiang Liu
Generate a CPER record when the bad page threshold is exceeded and commit it to the CPER ring. v2: return -ENOMEM instead of false v2: check return value of fill section function Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 23 +++ drivers/gpu

Re: [PATCH 3/3] drm/amdgpu: Do not set power brake sequence for Aldebaran SRIOV

2025-02-14 Thread Lazar, Lijo
On 2/14/2025 5:43 AM, Victor Lu wrote: > Aldebaran SRIOV VF cannot access the power brake feature regs. > The accesses can be skipped to avoid a dmesg warning. > > Signed-off-by: Victor Lu > --- > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-)

[PATCH v2 11/12] drm/amdgpu: Commit CPER entry

2025-02-14 Thread Xiang Liu
Commit the CPER entry to the ring buffer. Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c in

[PATCH v2 08/12] drm/amdgpu: add data write function for CPER ring

2025-02-14 Thread Xiang Liu
From: Tao Zhou Old CPER data will be overwritten if ring buffer is full, and read pointer always points to CPER header. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 93 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.h | 2

[PATCH v2 01/12] drm/amd/include: Add amd cper header

2025-02-14 Thread Xiang Liu
From: Hawking Zhang AMD is using Common Platform Error Record (CPER) format to report all gpu hardware errors. v2: add program attribute Signed-off-by: Hawking Zhang Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/include/amd_cper.h | 269 + 1

[PATCH v2 09/12] drm/amdgpu: add mutex lock for cper ring

2025-02-14 Thread Xiang Liu
From: Tao Zhou Avoid conflicts between reads and writes of the ring buffer. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 21 +++

[PATCH v2 10/12] drm/amdgpu: Get timestamp from system time

2025-02-14 Thread Xiang Liu
Get system local time and encode it to timestamp for CPER. Signed-off-by: Xiang Liu Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 19 ++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/

[PATCH v2 05/12] drm/amdgpu: Generate cper records

2025-02-14 Thread Xiang Liu
From: Hawking Zhang Encode the error information in CPER format and commit to the cper ring Signed-off-by: Hawking Zhang Reviewed-by: Yang Wang Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_aca.c | 32 + 1 file changed, 32 insertions(+) diff --git a/dri

[PATCH v2 07/12] drm/amdgpu: read CPER ring via debugfs

2025-02-14 Thread Xiang Liu
From: Tao Zhou We read CPER data from read pointer to write pointer without changing the pointers. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 47 ++-- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/dri

[PATCH v2 00/12] Generate CPER records for RAS and commit to CPER ring

2025-02-14 Thread Xiang Liu
This patch series generates RAS CPER records for UE/DE/CE/BP threshold exceeded events. SMU_TYPE_CE banks are combined into 1 CPER entry; they could be CEs or DEs or both. UEs and BPs are encoded into separate CPER entries. RAS CPER records for CEs will be generated only after the CE count has been queried.

[PATCH v2 04/12] drm/amdgpu: Introduce funcs for generating cper record

2025-02-14 Thread Xiang Liu
From: Hawking Zhang Introduce new functions that are used to generate cper ue or ce records. v2: return -ENOMEM instead of false v2: check return value of fill section function Signed-off-by: Hawking Zhang Signed-off-by: Xiang Liu Reviewed-by: Yang Wang Reviewed-by: Tao Zhou --- drivers/gp

[PATCH v2 06/12] drm/amdgpu: add RAS CPER ring buffer

2025-02-14 Thread Xiang Liu
From: Tao Zhou And initialize it, this is a pure software ring to store RAS CPER data. v2: update the initialization of count_dw of cper ring, it's dword variable. Signed-off-by: Tao Zhou Reviewed-by: Hawking Zhang --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 39 +++--- d

[PATCH v2 02/12] drm/amdgpu: Introduce funcs for populating CPER

2025-02-14 Thread Xiang Liu
From: Hawking Zhang Introduce utility functions designed to assist in populating CPER records. v2: call cper_init/fini in device_ip_init/fini. Signed-off-by: Hawking Zhang Reviewed-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/Makefile| 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu.h

[PATCH v2 03/12] drm/amdgpu: Include ACA error type in aca bank

2025-02-14 Thread Xiang Liu
From: Hawking Zhang ACA error types managed by the driver do not have a direct 1:1 correspondence with those managed by firmware. To address this, for each ACA bank, include both the ACA error type and the ACA SMU type. This addition is useful for creating CPER records. Signed-off-by: Hawking Zhang Reviewed-