Thanks. You are right. I found a potential bug, and as I understand
it, the code only applies to the Aldebaran GPU, so I cannot check the
correctness of the code. I can only test code on my Navi 10 and run GPU
stress tests.
My knowledge of amdgpu is limited, and fixing potential bugs allows me
to lear
On 2022-04-04 at 18:21, Grigory Vasilyev wrote:
In the amdgpu_amdkfd_get_xgmi_bandwidth_mbytes function,
the peer_adev pointer can be NULL and is passed to amdgpu_xgmi_get_num_links.
In amdgpu_xgmi_get_num_links, the peer_adev pointer is dereferenced
without any checks: peer_adev->gmc.xgmi.node_id.
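For illustration only, here is a minimal sketch of the kind of caller-side
guard described above. It is not the actual patch: the struct, the sketch_*
function names, the mbytes_per_link parameter, and the choice to return 0 for
a NULL peer are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the driver's device structure. */
struct sketch_dev {
	uint64_t node_id;  /* plays the role of gmc.xgmi.node_id */
	int num_links;     /* toy model: links reported towards any peer */
};

/* Dereferences peer without a check, like the helper described above. */
static int sketch_get_num_links(const struct sketch_dev *dev,
				const struct sketch_dev *peer)
{
	if (dev->node_id == peer->node_id)
		return 0;
	return dev->num_links;
}

/*
 * Caller-side guard: bail out before the helper can dereference a NULL
 * pointer. Returning 0 here is an assumption of this sketch.
 */
static int sketch_get_bandwidth_mbytes(const struct sketch_dev *dev,
				       const struct sketch_dev *peer,
				       int mbytes_per_link)
{
	if (!dev || !peer)
		return 0;

	return sketch_get_num_links(dev, peer) * mbytes_per_link;
}
```

The check could equally live inside the helper; placing it in the caller
keeps the helper's contract ("both pointers are valid") unchanged.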
Signed-off-by: Grigory Vasilyev
---
drivers/gp
I think this could happen if KFD initialization fails for a device.
Currently we'd add the device, and then remove it again. That may leave
a gap in the proximity domains. Oak just had a fix recently to clean
that up by only adding KFD devices to the topology after successful
initialization.
R
According to the comment, "we will loop GPUs that already be processed (with
lower value of proximity_domain)", the device should already have been
added to the topology_device_list. So in this case,
kfd_topology_device_by_proximity_domain will not return a NULL pointer.
If you really get the nu
From: Colin Ian King
The call to kfd_topology_device_by_proximity_domain can return a NULL
pointer so add a null pointer check on peer_dev to the existing null
pointer check on peer_dev->gpu to avoid any potential null pointer
dereferences.
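As a rough sketch of the combined check (not the real KFD code), the loop
below skips a proximity domain when the lookup returns NULL, for example
because of a gap in the proximity domains, or when the device has no GPU.
All sketch_* types and functions are hypothetical stand-ins.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the topology structures. */
struct sketch_gpu {
	int id;
};

struct sketch_topo_dev {
	int proximity_domain;
	struct sketch_gpu *gpu;  /* NULL for a device without a GPU */
};

/* Lookup that returns NULL when no device has the requested domain. */
static struct sketch_topo_dev *
sketch_dev_by_proximity_domain(struct sketch_topo_dev *devs, size_t n,
			       int domain)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].proximity_domain == domain)
			return &devs[i];
	return NULL;
}

/* Count GPU peers in domains [0, upto), tolerating gaps in the domains. */
static int sketch_count_gpu_peers(struct sketch_topo_dev *devs, size_t n,
				  int upto)
{
	int count = 0;

	for (int d = 0; d < upto; d++) {
		struct sketch_topo_dev *peer_dev =
			sketch_dev_by_proximity_domain(devs, n, d);

		/* Check the lookup result first, then the gpu member. */
		if (!peer_dev || !peer_dev->gpu)
			continue;
		count++;
	}
	return count;
}
```

The order of the two conditions matters: peer_dev->gpu is only evaluated
once peer_dev is known to be non-NULL.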
Addresses-Coverity: ("Dereference on null return value"