Le 18/11/2019 à 03:36, Alistair Popple a écrit :
On Monday, 18 November 2019 12:24:24 PM AEDT Oliver O'Halloran wrote:
On Mon, Nov 18, 2019 at 12:06 PM Alistair Popple <alist...@popple.id.au>
wrote:

On Wednesday, 13 November 2019 4:38:21 AM AEDT Frederic Barrat wrote:

However, one question is whether this patch breaks nvlink and if nvlink
assumes the devices won’t go away because we explicitly take a reference
forever. In npu_dma.c, there are 2 functions which allow to find the GPU
associated to a npu device, and vice-versa. Both functions return a
pointer to a struct pci_dev, but they don’t take a reference on the
device being returned. So that seems dangerous. I’m probably missing
something.

Alexey, Alistair: what, if anything, guarantees, that the npu or gpu
devices stay valid. Is it because we simply don’t provide any means to
get rid of them ? Otherwise, don’t we need the callers of
pnv_pci_get_gpu_dev() and pnv_pci_get_npu_dev() to worry about reference
counting ? I’ve started looking into it and the changes are scary, which
explains Greg’s related commit 02c5f5394918b.

To be honest the reference counting looks like it has evolved into
something
quite suspect and I don't think you're missing anything. In practice
though we
likely haven't hit any issues because the original callers didn't store
references to the pdev which would make the window quite small (although
the
pass through work may have changed that). And as you say there simply
wasn't
any means to get rid of them anyway (EEH, hotplug, etc. has never been
implemented or supported for GPUs and all sorts of terrible things happen
if
you try).

In other words: leaking a ref is the only safe thing to do.

Correct.

The best solution would likely be to review the reference counting and to
teach callers of get_*_dev() to call pci_put_dev(), etc.

The issue is that the two callers of get_pci_dev() are non-GPL
exported symbols so we don't know what's calling them or what their
expectations are. We be doing whatever makes sense for OpenCAPI and if
that happens to cause a problem for someone else, then they can deal
with it.

The issue isn't that it's exported non-GPL vs. GPL, rather that there are
probably out-of-tree modules like the NVIDIA driver using it. However as you
say supporting out-of-tree modules is not generally a concern for kernel
developers so we certainly shouldn't let that prevent us doing a fix.


Thanks Alistair and Oliver. I'll bite the bullet and try do the right thing wrt ref counting in npu-dma.c

  Fred

- Alistair

Oliver





Reply via email to