On Sun, Jun 21, 2020 at 02:03:00AM -0400, Andrey Grodzovsky wrote: > This RFC is more of a proof of concept then a fully working solution as there > are a few unresolved issues we are hoping to get advise on from people on the > mailing list. > Until now extracting a card either by physical extraction (e.g. eGPU with > thunderbolt connection or by emulation through syfs -> > /sys/bus/pci/devices/device_id/remove) > would cause random crashes in user apps. The random crashes in apps were > mostly due to the app having mapped a device backed BO into its address space > was still > trying to access the BO while the backing device was gone. > To answer this first problem Christian suggested to fix the handling of > mapped memory in the clients when the device goes away by forcibly unmap all > buffers > the user processes has by clearing their respective VMAs mapping the device > BOs. Then when the VMAs try to fill in the page tables again we check in the > fault handler > if the device is removed and if so, return an error. This will generate a > SIGBUS to the application which can then cleanly terminate. > This indeed was done but this in turn created a problem of kernel OOPs were > the OOPSes were due to the fact that while the app was terminating because of > the SIGBUS > it would trigger use after free in the driver by calling to accesses device > structures that were already released from the pci remove sequence. > This was handled by introducing a 'flush' sequence during device removal were > we wait for drm file reference to drop to 0 meaning all user clients directly > using this device terminated. > With this I was able to cleanly emulate device unplug with X and glxgears > running and later emulate device plug back and restart of X and glxgears. > > v2: > Based on discussions in the mailing list with Daniel and Pekka [1] and based > on the document produced by Pekka from those discussions [2] the whole > approach with returning SIGBUS > and waiting for all user clients having CPU mapping of device BOs to die was > dropped. Instead as per the document suggestion the device structures are > kept alive until the last > reference to the device is dropped by user client and in the meanwhile all > existing and new CPU mappings of the BOs belonging to the device directly or > by dma-buf import are rerouted > to per user process dummy rw page. > Also, I skipped the 'Requirements for KMS UAPI' section of [2] since i am > trying to get the minimal set of requiremnts that still give useful solution > to work and this is the > 'Requirements for Render and Cross-Device UAPI' section and so my test case > is removing a secondary device, which is render only and is not involved in > KMS. > > This iteration is still more of a draft as I am still facing a few unsolved > issues such as a crash in user client when trying to CPU map imported BO if > the map happens after device was > removed and HW failure to plug back a removed device. Also since i don't have > real life setup with external GPU connected through TB I am using sysfs to > emulate pci remove and i > expect to encounter more issues once i try this on real life case. I am also > expecting some help on this from a user who volunteered to test in the > related gitlab ticket. > So basically this is more of a way to get feedback if I am moving in the > right direction. > > [1] - Discussions during v1 of the patchset > https://lists.freedesktop.org/archives/dri-devel/2020-May/265386.html > [2] - drm/doc: device hot-unplug for userspace > https://www.spinics.net/lists/dri-devel/msg259755.html > [3] - Related gitlab ticket > https://gitlab.freedesktop.org/drm/amd/-/issues/1081
A few high-level commments on the generic parts, I didn't really look at the amdgpu side yet. Also a nit: Please tell your mailer to break long lines, it looks funny and inconsistent otherwise, at least in some of the mailers I use here :-/ -Daniel > > > Andrey Grodzovsky (8): > drm: Add dummy page per device or GEM object > drm/ttm: Remap all page faults to per process dummy page. > drm/ttm: Add unampping of the entire device address space > drm/amdgpu: Split amdgpu_device_fini into early and late > drm/amdgpu: Refactor sysfs removal > drm/amdgpu: Unmap entire device address space on device remove. > drm/amdgpu: Fix sdma code crash post device unplug > drm/amdgpu: Prevent any job recoveries after device is unplugged. > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 19 +++++++- > drivers/gpu/drm/amd/amdgpu/amdgpu_atombios.c | 7 ++- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 +++++++++++++++++---- > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 23 ++++++++-- > drivers/gpu/drm/amd/amdgpu/amdgpu_gtt_mgr.c | 12 +++-- > drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 24 ++++++---- > drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h | 1 + > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 23 +++++++--- > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 8 +++- > drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 3 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 21 ++++++--- > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 17 +++++++- > drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 13 +++++- > drivers/gpu/drm/amd/amdgpu/df_v3_6.c | 10 +++-- > drivers/gpu/drm/drm_file.c | 8 ++++ > drivers/gpu/drm/drm_prime.c | 10 +++++ > drivers/gpu/drm/ttm/ttm_bo.c | 8 +++- > drivers/gpu/drm/ttm/ttm_bo_vm.c | 65 > ++++++++++++++++++++++++---- > include/drm/drm_file.h | 2 + > include/drm/drm_gem.h | 2 + > include/drm/ttm/ttm_bo_driver.h | 7 +++ > 22 files changed, 286 insertions(+), 55 deletions(-) > > -- > 2.7.4 > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel