On Thursday 03/12 at 01:02 -0700, Nathan Chancellor wrote:
> Hi Calvin,
> 
> On Mon, Mar 09, 2026 at 09:24:57PM -0700, Calvin Owens wrote:
> > Commit e1b385726f7f ("drm/amd/display: Add additional checks for PSP
> > footer size") introduced a use of an uninitialized stack variable
> > in dm_dmub_sw_init() (region_params.bss_data_size).
> > 
> > Interestingly, this seems to cause no issue on normal kernels. But when
> > full LTO is enabled, it causes the compiler to "optimize" out huge
> > swaths of amdgpu initialization code, and the driver is unusable:
> 
> Yeah, this appears to be a very unfortunate case of "clang encountered known
> undefined behavior and stopped code generation", which we would like to
> avoid but figuring out a proper upstreamable solution is hard. The most
> recent attempt:
> 
>   https://github.com/llvm/llvm-project/pull/146791
> 
> My guess is that LTO allows inlining of
> dmub_srv_get_fw_meta_info_from_raw_fw() into dm_dmub_sw_init(), at which
> point it can see that the result of accessing an uninitialized
> region_params.bss_data_size will be used through
> fw_meta_info_params.fw_bss_data and gives up generating the rest of the
> function.

Thanks for looking, Nathan. I'll keep an eye on that and see if it's able
to catch this example. I've tried to come up with a minimal reproducer,
but I haven't had any luck yet: so far I always get the warning. Would
that be helpful at all?

I put the full W=2 output for the one file here in case anyone else
wants to look:

   https://github.com/jcalvinowens/lkml-debug/blob/main/amdgpu-lto/gcc-warns.txt
   https://github.com/jcalvinowens/lkml-debug/blob/main/amdgpu-lto/llvm-warns.txt

Somehow 'make drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.o' doesn't
work. I want to look at that later, because it was mildly annoying while
digging into this.

> >     amdgpu 0000:03:00.0: [drm] Loading DMUB firmware via PSP: version=0x07002F00
> >     amdgpu 0000:03:00.0: sw_init of IP block <dm> failed 5
> >     amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
> >     amdgpu 0000:03:00.0: Fatal error during GPU init
> > 
> > It surprises me that neither gcc nor clang emit a warning about this: I
> > only found it by bisecting the LTO breakage.
> 
> gcc's -Wmaybe-uninitialized is disabled by default for the kernel but
> even enabling it with KCFLAGS does not show an instance here, which I
> find quite surprising... for clang, it is harder because the warning
> happens early in the frontend where it might not be able to track a
> value that well.

GCC does flag what seems to me to be a real but benign warning about an
ERR_PTR check that doesn't handle NULL in the same file:

    https://lore.kernel.org/lkml/6aaf2cf4bd19363a85f35e649685d7bdae400253.1773157137.git.cal...@wbinvd.org/
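
For anyone unfamiliar with why an ERR_PTR-style check can miss NULL, here
is a rough userspace sketch. The helpers below are simplified stand-ins
modeled on the kernel's <linux/err.h>, not the real definitions, and the
sketch is not taken from the linked patch. It only illustrates that NULL
is not in the error-pointer range, so an IS_ERR()-only check lets NULL
through:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins modeled on the kernel's <linux/err.h>. */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the errno value back out of an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MAX_ERRNO bytes of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* True for error pointers and also for NULL. */
static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}
```

A caller that can receive either NULL or an error pointer needs the
second form, or a separate NULL check.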

I'm also trying to find a minimal reproducer for GCC, no luck yet.

> > Fix by using the old value for region_params.bss_data_size in place of
> > the uninitialized reference, which makes amdgpu work with LTO again.
> > 
> > Fixes: e1b385726f7f ("drm/amd/display: Add additional checks for PSP footer size")
> > Signed-off-by: Calvin Owens <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index b3d6f2cd8ab6..e69e61163ae9 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -2554,7 +2554,7 @@ static int dm_dmub_sw_init(struct amdgpu_device *adev)
> >     fw_meta_info_params.fw_inst_const = adev->dm.dmub_fw->data +
> >                                         le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
> >                                         PSP_HEADER_BYTES_256;
> > -   fw_meta_info_params.fw_bss_data = region_params.bss_data_size ? adev->dm.dmub_fw->data +
> > +   fw_meta_info_params.fw_bss_data = le32_to_cpu(hdr->bss_data_bytes) ? adev->dm.dmub_fw->data +
> 
> Maybe it would be better to use fw_meta_info_params.bss_data_size
> instead of le32_to_cpu(hdr->bss_data_bytes)? Obviously it is the same
> value but it would result in a smaller change. It seems likely that this
> was just a copy and paste failure.

Agreed. That ends up being almost self-evidently correct if I force git
to add an extra context line with the assignment. I always forget I can
do that:

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index b3d6f2cd8ab6..0d1c772ef713 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -2553,9 +2553,9 @@ static int dm_dmub_sw_init(struct amdgpu_device *adev)
        fw_meta_info_params.bss_data_size = le32_to_cpu(hdr->bss_data_bytes);
        fw_meta_info_params.fw_inst_const = adev->dm.dmub_fw->data +
                                            le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
                                            PSP_HEADER_BYTES_256;
-       fw_meta_info_params.fw_bss_data = region_params.bss_data_size ? adev->dm.dmub_fw->data +
+       fw_meta_info_params.fw_bss_data = fw_meta_info_params.bss_data_size ? adev->dm.dmub_fw->data +
                                          le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
                                          le32_to_cpu(hdr->inst_const_bytes) : NULL;
        fw_meta_info_params.custom_psp_footer_size = 0;
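
To make the shape of the change easier to see outside the driver, here is
a rough userspace sketch of the corrected logic. Everything in it is a
hypothetical stand-in: struct fw_header, struct meta_info_params and
fill_meta_info() are invented for illustration, and le32_to_cpu() and
PSP_HEADER_BYTES_256 are left out for brevity. The point is only that the
ternary now tests a field assigned a few lines earlier in the same
function, rather than a field of a different stack struct that has not
been written yet:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the firmware header. */
struct fw_header {
	uint32_t ucode_array_offset_bytes;
	uint32_t inst_const_bytes;
	uint32_t bss_data_bytes;
};

/* Hypothetical, simplified stand-in for fw_meta_info_params. */
struct meta_info_params {
	uint32_t bss_data_size;
	const uint8_t *fw_inst_const;
	const uint8_t *fw_bss_data;
};

static void fill_meta_info(struct meta_info_params *p,
			   const struct fw_header *hdr,
			   const uint8_t *fw_data)
{
	assert(p && hdr && fw_data);

	p->bss_data_size = hdr->bss_data_bytes;
	p->fw_inst_const = fw_data + hdr->ucode_array_offset_bytes;

	/*
	 * The buggy form tested a field of a separate, still-uninitialized
	 * stack struct at this point:
	 *     p->fw_bss_data = region_params.bss_data_size ? ... : NULL;
	 * Testing the value assigned just above is well defined.
	 */
	p->fw_bss_data = p->bss_data_size ?
		fw_data + hdr->ucode_array_offset_bytes + hdr->inst_const_bytes :
		NULL;
}
```

With bss_data_bytes set to zero the sketch leaves fw_bss_data as NULL;
otherwise it points just past the instruction constants.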
 

I'll send a v2 in a little bit.

Thanks,
Calvin

> >                                       le32_to_cpu(hdr->header.ucode_array_offset_bytes) +
> >                                       le32_to_cpu(hdr->inst_const_bytes) : NULL;
> >     fw_meta_info_params.custom_psp_footer_size = 0;
> > -- 
> > 2.47.3
> > 
