Hi Vitaly,

Thank you for looking into this issue!
We have reproduced this issue with a Radeon RX 580 (Polaris 20) passed
through to a QEMU (4.0.0) VM via VFIO. All bugs were reproducible on the
recent 6.8-rc4 Linux kernel
(https://github.com/torvalds/linux/tree/v6.8-rc4), which I have just
double-checked with the earlier reproducer programs. Below are the QEMU
arguments used, followed by the in-VM lspci -vvv and /proc/cpuinfo output.
Should you need any more information, please let us know.

*QEMU arguments*
qemu-system-x86_64 -m 2G \
    -cpu host \
    -kernel $KERNEL \
    -append "console=ttyS0 root=/dev/sda earlyprintk=serial net.ifnames=0" \
    -drive file=$DRIVE_FILE,format=qcow2 \
    -enable-kvm \
    -device vfio-pci,host=$PCI_ADDR,id=gpu,multifunction=on,x-vga=on \
    -nographic

*root@qemu:~# lspci -vvv*
00:03.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 47)
    Subsystem: Gigabyte Technology Co., Ltd Ellesmere [Radeon RX 470/480/570/570X/580/580X/59]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <-
    Latency: 0
    Interrupt: pin A routed to IRQ 24
    Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at f0000000 (64-bit, prefetchable) [size=2M]
    Region 4: I/O ports at c000 [size=256]
    Region 5: Memory at feb80000 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s (downgraded), Width x4 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
            10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, Max1
            EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
            FRS-
            AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
            AtomicOpsCtl: ReqEn+
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPha-
            EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
            Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee01004 Data: 0021
    Kernel driver in use: amdgpu

*root@snapuzz:~# cat /proc/cpuinfo*
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 183
model name : 13th Gen Intel(R) Core(TM) i9-13900K
stepping : 1
microcode : 0x1
cpu MHz : 2995.200
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 31
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflushs
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexprioritg
bugs : spectre_v1 spectre_v2 spec_store_bypass mds swapgs itlb_multihit mmio_unknown eb
bogomips : 5990.40
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

On Thu, Feb 22, 2024 at 10:57 AM vitaly prosyak <vpros...@amd.com> wrote:

> Hi Joonkyo,
>
> Thanks for your report!
>
> I reproduced the first issue with 'amdgpu_gem_userptr_ioctl' when KASAN is
> enabled, but I could not reproduce the other two issues.
>
> Could you indicate which ASIC you used to reproduce the issue?
>
> Could you provide details of your system?
>
> Your response is much appreciated.
>
> Vitaly
>
>
> I placed your findings below to keep the context and track the issues.
>
>
> ===========================================================================================================================
>
> Reporting a slab-use-after-free in amdgpu.eml
>
> Subject:
> Reporting a slab-use-after-free in amdgpu
>
> From:
> Joonkyo Jung <joonk...@yonsei.ac.kr> <joonk...@yonsei.ac.kr>
>
> Date:
> 2024-02-16, 04:22
>
> To:
> alexander.deuc...@amd.com, christian.koe...@amd.com, xinhui....@amd.com
>
> CC:
> amd-gfx@lists.freedesktop.org, Dokyung Song <dokyu...@yonsei.ac.kr>
> <dokyu...@yonsei.ac.kr>, jisoo.j...@yonsei.ac.kr, yw9...@yonsei.ac.kr
>
> Hello,
>
> We would like to report a slab-use-after-free bug in the AMDGPU DRM driver
> in the Linux kernel v6.8-rc4 that we found with our customized Syzkaller.
> The bug can be triggered by sending two ioctls to the AMDGPU DRM driver in
> succession.
>
> In amdgpu_bo_move, struct ttm_resource *old_mem = bo->resource is assigned.
> As the alloc and free stack traces show, within that same call to
> amdgpu_bo_move, amdgpu_move_blit ends up freeing bo->resource in
> ttm_bo_move_accel_cleanup via ttm_bo_wait_free_node(bo, man->use_tt).
> However, amdgpu_bo_move continues after that and reaches
> trace_amdgpu_bo_move(abo, new_mem->mem_type, old_mem->mem_type) at the end,
> which reads the freed old_mem and causes the use-after-free.
>
> Steps to reproduce are as below.
> union drm_amdgpu_gem_create *arg1;
>
> arg1 = malloc(sizeof(union drm_amdgpu_gem_create));
> arg1->in.bo_size = 0x8;
> arg1->in.alignment = 0x0;
> arg1->in.domains = 0x4;
> arg1->in.domain_flags = 0x9;
> ioctl(fd, 0xc0206440, arg1);
>
> arg1->in.bo_size = 0x7fffffff;
> arg1->in.alignment = 0x0;
> arg1->in.domains = 0x4;
> arg1->in.domain_flags = 0x9;
> ioctl(fd, 0xc0206440, arg1);
>
> The KASAN report is as follows:
> ==================================================================
> BUG: KASAN: slab-use-after-free in amdgpu_bo_move+0x1479/0x1550
> Read of size 4 at addr ffff88800f5bee80 by task syz-executor/219
> Call Trace:
> <TASK>
> amdgpu_bo_move+0x1479/0x1550
> ttm_bo_handle_move_mem+0x4d0/0x700
> ttm_mem_evict_first+0x945/0x1230
> ttm_bo_mem_space+0x6c7/0x940
> ttm_bo_validate+0x286/0x650
> ttm_bo_init_reserved+0x34c/0x490
> amdgpu_bo_create+0x94b/0x1610
> amdgpu_bo_create_user+0xa3/0x130
> amdgpu_gem_create_ioctl+0x4bc/0xc10
> drm_ioctl_kernel+0x300/0x410
> drm_ioctl+0x648/0xb30
> amdgpu_drm_ioctl+0xc8/0x160
> </TASK>
>
> Allocated by task 219:
> kmalloc_trace+0x211/0x390
> amdgpu_vram_mgr_new+0x1d6/0xbe0
> ttm_resource_alloc+0xfd/0x1e0
> ttm_bo_mem_space+0x255/0x940
> ttm_bo_validate+0x286/0x650
> ttm_bo_init_reserved+0x34c/0x490
> amdgpu_bo_create+0x94b/0x1610
> amdgpu_bo_create_user+0xa3/0x130
> amdgpu_gem_create_ioctl+0x4bc/0xc10
> drm_ioctl_kernel+0x300/0x410
> drm_ioctl+0x648/0xb30
> amdgpu_drm_ioctl+0xc8/0x160
>
> Freed by task 219:
> kfree+0x111/0x2d0
> ttm_resource_free+0x17e/0x1e0
> ttm_bo_move_accel_cleanup+0x77e/0x9b0
> amdgpu_move_blit+0x3db/0x670
> amdgpu_bo_move+0xfa2/0x1550
> ttm_bo_handle_move_mem+0x4d0/0x700
> ttm_mem_evict_first+0x945/0x1230
> ttm_bo_mem_space+0x6c7/0x940
> ttm_bo_validate+0x286/0x650
> ttm_bo_init_reserved+0x34c/0x490
> amdgpu_bo_create+0x94b/0x1610
> amdgpu_bo_create_user+0xa3/0x130
> amdgpu_gem_create_ioctl+0x4bc/0xc10
> drm_ioctl_kernel+0x300/0x410
> drm_ioctl+0x648/0xb30
> amdgpu_drm_ioctl+0xc8/0x160
>
> The buggy address belongs to the object at ffff88800f5bee70
> which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 16 bytes inside of
> freed 96-byte region [ffff88800f5bee70, ffff88800f5beed0)
>
> Should you need any more information, please do not hesitate to contact us.
>
> Best regards,
> Joonkyo Jung
>
> Reporting a null-ptr-deref in amdgpu.eml
>
> Subject:
> Reporting a null-ptr-deref in amdgpu
>
> From:
> Joonkyo Jung <joonk...@yonsei.ac.kr> <joonk...@yonsei.ac.kr>
>
> Date:
> 2024-02-16, 04:20
>
> To:
> alexander.deuc...@amd.com, christian.koe...@amd.com, xinhui....@amd.com
>
> CC:
> Dokyung Song <dokyu...@yonsei.ac.kr> <dokyu...@yonsei.ac.kr>,
> jisoo.j...@yonsei.ac.kr, yw9...@yonsei.ac.kr, amd-gfx@lists.freedesktop.org
>
> Hello,
>
> We would like to report a null-ptr-deref bug in the AMDGPU DRM driver in
> the Linux kernel v6.8-rc4 that we found with our customized Syzkaller.
> The bug can be triggered by sending two ioctls to the AMDGPU DRM driver in
> succession.
>
> The first ioctl, amdgpu_ctx_ioctl, creates a ctx and returns ctx_id = 1 to
> userspace.
> The second ioctl, which can be any ioctl that eventually calls
> amdgpu_ctx_get_entity, carries this ctx_id and passes the context check.
> Here we use drm_amdgpu_wait_cs as an example.
> The validations in amdgpu_ctx_get_entity can also be passed with
> user-provided values from the ioctl arguments.
> This eventually leads to drm_sched_entity_init, where the null-ptr-deref
> triggers on RCU_INIT_POINTER(entity->last_scheduled, NULL);
>
> Steps to reproduce are as below.
> union drm_amdgpu_ctx *arg1;
> union drm_amdgpu_wait_cs *arg2;
>
> arg1 = malloc(sizeof(union drm_amdgpu_ctx));
> arg2 = malloc(sizeof(union drm_amdgpu_wait_cs));
>
> arg1->in.op = 0x1;
> ioctl(AMDGPU_renderD128_DEVICE_FILE, 0x140106442, arg1);
>
> arg2->in.handle = 0x0;
> arg2->in.timeout = 0x2000000000000;
> arg2->in.ip_type = 0x9;
> arg2->in.ip_instance = 0x0;
> arg2->in.ring = 0x0;
> arg2->in.ctx_id = 0x1;
> ioctl(AMDGPU_renderD128_DEVICE_FILE, 0xc0206449, arg2);
>
> The KASAN report is as follows:
> general protection fault, probably for non-canonical address
> 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
> Call Trace:
> <TASK>
> ? drm_sched_entity_init+0x16e/0x650
> ? drm_sched_entity_init+0x208/0x650
> amdgpu_ctx_get_entity+0x944/0xc30
> amdgpu_cs_wait_ioctl+0x13d/0x3f0
> drm_ioctl_kernel+0x300/0x410
> drm_ioctl+0x648/0xb30
> amdgpu_drm_ioctl+0xc8/0x160
> </TASK>
>
> Should you need any more information, please do not hesitate to contact us.
>
> Best regards,
> Joonkyo Jung
>
> Reporting a use-after-free in amdgpu.eml
>
> Subject:
> Reporting a use-after-free in amdgpu
>
> From:
> 정준교 <joonk...@yonsei.ac.kr> <joonk...@yonsei.ac.kr>
>
> Date:
> 2024-02-14, 21:34
>
> To:
> alexander.deuc...@amd.com, christian.koe...@amd.com, xinhui....@amd.com
>
> CC:
> amd-gfx@lists.freedesktop.org, Dokyung Song <dokyu...@yonsei.ac.kr>
> <dokyu...@yonsei.ac.kr>, jisoo.j...@yonsei.ac.kr, yw9...@yonsei.ac.kr
>
> Hello,
>
> We would like to report a use-after-free bug in the AMDGPU DRM driver in
> the Linux kernel 6.2 that we found with our customized Syzkaller.
> The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the
> AMDGPU DRM driver, with invalid addr and size.
> We have tested that this bug can still be triggered in the latest RC kernel,
> v6.8-rc4.
>
> Steps to reproduce are as below.
>
> struct drm_amdgpu_gem_userptr *arg;
> arg = malloc(sizeof(struct drm_amdgpu_gem_userptr));
> arg->addr = 0xffffffffffff0000;
> arg->size = 0x80000000;
> arg->flags = 0x7;
> ioctl(AMDGPU_renderD128_DEVICE_FILE, 0xc1186451, arg);
>
> The KASAN report is as follows:
> ==================================================================
> BUG: KASAN: use-after-free in switch_mm_irqs_off+0x89d/0xb70
> Read of size 8 at addr ffff88801f17bc00 by task syz-executor/386
> Call Trace:
> <TASK>
> switch_mm_irqs_off+0x89d/0xb70
> __schedule+0xa62/0x2630
> preempt_schedule_common+0x45/0xd0
> vfree+0x4d/0x60
> ttm_tt_fini+0xdf/0x1c0
> amdgpu_ttm_backend_destroy+0x9f/0xe0
> ttm_bo_cleanup_memtype_use+0x142/0x1f0
> ttm_bo_release+0x67d/0xc00
> ttm_bo_put+0x7c/0xa0
> amdgpu_bo_unref+0x3b/0x80
> amdgpu_gem_object_free+0x7f/0xc0
> drm_gem_object_free+0x5d/0x90
> amdgpu_gem_userptr_ioctl+0x452/0x7e0
> drm_ioctl_kernel+0x284/0x500
> drm_ioctl+0x55e/0xa50
> amdgpu_drm_ioctl+0xe3/0x1d0
> </TASK>
>
> Allocated by task 385:
> kmem_cache_alloc+0x174/0x300
> copy_process+0x32d1/0x6640
> kernel_clone+0xcd/0x690
>
> Freed by task 386:
> kmem_cache_free+0x13b/0x550
> mmu_interval_notifier_remove+0x4c8/0x610
> amdgpu_hmm_unregister+0x47/0x90
> amdgpu_gem_object_free+0x75/0xc0
> drm_gem_object_free+0x5d/0x90
> amdgpu_gem_userptr_ioctl+0x452/0x7e0
> drm_ioctl_kernel+0x284/0x500
> drm_ioctl+0x55e/0xa50
> amdgpu_drm_ioctl+0xe3/0x1d0
>
> The buggy address belongs to the object at ffff88801f17bb80
> which belongs to the cache mm_struct of size 2016
> The buggy address is located 128 bytes inside of
> 2016-byte region [ffff88801f17bb80, ffff88801f17c360)
>
> The buggy address belongs to the physical page:
> page:000000002c2a61bd refcount:1 mapcount:0 mapping:0000000000000000
> index:0x0 pfn:0x1f178
> head:000000002c2a61bd order:3 compound_mapcount:0 subpages_mapcount:0
> compound_pincount:0
> memcg:ffff8880141aa301
> flags: 0x100000000010200(slab|head|node=0|zone=1)
> raw: 0100000000010200 ffff88800a44fc80 ffffea00006ca400 dead000000000004
> raw: 0000000000000000 00000000800f000f 00000001ffffffff ffff8880141aa301
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff88801f17bb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88801f17bb80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88801f17bc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                    ^
>  ffff88801f17bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ffff88801f17bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
>
>
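The reproducers above pass raw ioctl request numbers (0xc0206440, 0x140106442, 0xc0206449, 0xc1186451) instead of the named DRM_IOCTL_AMDGPU_* macros. For anyone matching them against the handlers in the call traces, here is a minimal sketch of a decoder. It assumes the asm-generic ioctl encoding used on x86-64 (nr in bits 0-7, type in bits 8-15, argument size in bits 16-29, direction in bits 30-31); the names ioc_fields, ioc_decode and ioc_print are made up for this illustration and are not part of any kernel or libdrm API.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: the four fields packed into an ioctl request
 * number under the asm-generic encoding (an assumption; some architectures
 * use a different split between size and direction bits). */
struct ioc_fields {
    unsigned dir;  /* 0 = none, 1 = write (user to kernel), 2 = read, 3 = both */
    unsigned size; /* argument size in bytes as encoded in the number */
    unsigned type; /* "magic" character; DRM uses 'd' (0x64) */
    unsigned nr;   /* command number; DRM driver commands start at 0x40 */
};

struct ioc_fields ioc_decode(uint64_t request)
{
    /* The ioctl syscall takes the command as an unsigned int, so only the
     * low 32 bits of a wider constant such as 0x140106442 are used. */
    uint32_t cmd = (uint32_t)request;
    struct ioc_fields f;

    f.nr   = cmd & 0xffu;
    f.type = (cmd >> 8) & 0xffu;
    f.size = (cmd >> 16) & 0x3fffu;
    f.dir  = (cmd >> 30) & 0x3u;
    return f;
}

void ioc_print(uint64_t request)
{
    struct ioc_fields f = ioc_decode(request);

    printf("0x%08x: dir=%u size=%u type='%c' nr=0x%02x\n",
           (unsigned)(uint32_t)request, f.dir, f.size, (int)f.type, f.nr);
}
```

Calling ioc_print(0xc0206440) prints "0xc0206440: dir=3 size=32 type='d' nr=0x40", i.e. the first driver-specific DRM command with a 32-byte argument, which fits amdgpu_gem_create_ioctl in the first trace. The same decoding gives nr 0x42 for the ctx ioctl, 0x49 for the wait_cs ioctl and 0x51 for the userptr ioctl. Note that two of the reported numbers carry fuzzer-mutated bits outside the nr field (0x140106442 exceeds 32 bits, and 0xc1186451 encodes a size of 280 bytes), yet the call traces show they still reached the amdgpu handlers.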