++ linux-mm
Misbah Anjum N <misan...@linux.ibm.com> writes:

> Bug Description:
> When running Avocado-VT based functional tests on a KVM guest, the system
> encounters a kernel panic and crash during memory reclaim activity when
> zram is actively used for swap. The crash occurs in the kswapd0 kernel
> thread during what appears to be a write operation to zram.
>
>
> Steps to Reproduce:
> 1. Compile Upstream Kernel on LPAR
> 2. Compile Qemu, Libvirt for KVM Guest
> 3. Run Functional tests on KVM guest using Avocado-VT Regression Bucket
>    a. Clone: git clone https://github.com/lop-devops/tests.git
>    b. Setup: python3 avocado-setup.py --bootstrap --enable-kvm --install-deps
>    c. Add guest in folder: tests/data/avocado-vt/images/
>    d. Run: python3 avocado-setup.py --run-suite guest_regression --guest-os \
>       <Guest-name> --only-filter 'virtio_scsi virtio_net qcow2' --no-download
>
> The bug is reproducible when the Avocado-VT Regression bucket is executed,
> which consists of a series of functional tp-libvirt tests performed on the
> KVM guest in the following order: cpu, memory, network, storage and hotplug
> (disk, change media, libvirt_mem), etc.
> During execution, the system crashes in the test:
> io-github-autotest-libvirt.libvirt_mem.positive_test.mem_basic.cold_plug_discard
> Note: This does not appear to be caused by a single test, but by cumulative
> operations during the test sequence.
>
>
> Environment Details:
> Kernel: 6.15.0-rc1-g521d54901f98
> Reproducible with: 6.15.0-rc2-gf3a2e2a79c9d

Looks like the issue is happening on 6.15-rc2. Did git bisect reveal a
faulty commit?
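If not, a bisect between the last known-good kernel and rc1 should narrow
it down. A rough sketch, assuming v6.14 is good (adjust the good/bad points
to whatever you have actually tested):

  $ git bisect start
  $ git bisect bad v6.15-rc1     # first kernel known to reproduce the crash
  $ git bisect good v6.14        # assumed last good kernel
  # build + boot each candidate, run the reproducer, then mark it:
  $ git bisect good              # ...or "git bisect bad"
  $ git bisect reset             # once the first bad commit is reported

Limiting the search with a pathspec, e.g. "git bisect start -- mm/zsmalloc.c
drivers/block/zram", could also speed things up if zram/zsmalloc is the
suspect.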
> Platform: IBM POWER10 LPAR (ppc64le)
> Distro: Fedora42
> RAM: 64GB
> CPUs: 80
> Qemu: 9.2.93 (v10.0.0-rc3-10-g8bdd3a0308)
> Libvirt: 11.3.0
>
>
> System Memory State:
> # free -mh
>                total        used        free      shared  buff/cache   available
> Mem:            61Gi       3.0Gi        25Gi        11Mi        33Gi        58Gi
> Swap:          8.0Gi          0B       8.0Gi
> # zramctl
> NAME       ALGORITHM DISKSIZE DATA COMPR TOTAL STREAMS MOUNTPOINT
> /dev/zram0 lzo-rle         8G  64K  222B  128K         [SWAP]
> # swapon --show
> NAME       TYPE      SIZE USED PRIO
> /dev/zram0 partition   8G   0B  100
>
>
> Call Trace:
> [180060.602200] BUG: Unable to handle kernel data access on read at 0xc00800000a1b0000
> [180060.602219] Faulting instruction address: 0xc000000000175670
> [180060.602224] Oops: Kernel access of bad area, sig: 11 [#1]
> [180060.602227] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [180060.602232] Modules linked in: dm_thin_pool dm_persistent_data
>   vmw_vsock_virtio_transport_common vsock zram xfs dm_service_time sd_mod
> [180060.602345] CPU: 68 UID: 0 PID: 465 Comm: kswapd0 Kdump: loaded Not tainted
>   6.15.0-rc1-g521d54901f98 #1 VOLUNTARY
> [180060.602351] Hardware name: IBM,9080-HEX POWER10 (architected) 0x800200
>   0xf000006 of:IBM,FW1060.21 (NH1060_078) hv:phyp pSeries
> [180060.602355] NIP: c000000000175670 LR: c0000000006d96b4 CTR: 01fffffffffffc05
> [180060.602358] REGS: c0000000a5a56da0 TRAP: 0300 Not tainted (6.15.0-rc1-g521d54901f98)
> [180060.602362] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 44042880 XER: 20040001
> [180060.602370] CFAR: c0000000001756c8 DAR: c00800000a1b0000 DSISR: 40000000 IRQMASK: 0
> <...>
>
>
> Crash Utility Output:
> # crash /home/kvmci/linux/vmlinux vmcore
> crash 8.0.6-4.fc42
>
>       KERNEL: /home/kvmci/linux/vmlinux  [TAINTED]
>     DUMPFILE: vmcore  [PARTIAL DUMP]
>         CPUS: 80
>         DATE: Wed Dec 31 18:00:00 CST 1969
>       UPTIME: 2 days, 02:01:00
> LOAD AVERAGE: 0.72, 0.66, 0.64
>        TASKS: 1249
>     NODENAME: ***
>      RELEASE: 6.15.0-rc1-g521d54901f98
>      VERSION: #1 SMP Wed Apr 9 05:13:03 CDT 2025
>      MACHINE: ppc64le (3450 Mhz)
>       MEMORY: 64 GB
>        PANIC: "Oops: Kernel access of bad area, sig: 11 [#1]" (check log for details)
>          PID: 465
>      COMMAND: "kswapd0"
>         TASK: c000000006067d80  [THREAD_INFO: c000000006067d80]
>          CPU: 68
>        STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 465    TASK: c000000006067d80    CPU: 68    COMMAND: "kswapd0"
>  R0:  000000000e000000    R1:  c0000000a5a57040    R2:  c0000000017a8100
>  R3:  c000000d34cefd00    R4:  c00800000a1affe8    R5:  fffffffffffffffa
>  R6:  01ffffffffffffff    R7:  03ffffff2cb33000    R8:  0000000080000000
>  R9:  0000000000000010    R10: 0000000000000020    R11: 0000000000000030
>  R12: 0000000000000040    R13: c000000ffde34300    R14: 0000000000000050
>  R15: 0000000000000060    R16: 0000000000000070    R17: 5deadbeef0000122
>  R18: 0000000101124ad9    R19: 0000000000018028    R20: 0000000000c01400
>  R21: c0000000a5a574b0    R22: 0000000000010000    R23: 0000000000000051
>  R24: c000000002be02f8    R25: c00c000003ef7380    R26: c00800000a190000
>  R27: c0000001054dbbe0    R28: c00c0000034d3340    R29: 00000000000002d8
>  R30: fffffffffffffffa    R31: c0000001054dbbb0
>  NIP: c000000000175670   MSR: 8000000002009033   OR3: c0000000001756c8
>  CTR: 01fffffffffffc05   LR:  c0000000006d96b4   XER: 0000000020040001
>  CCR: 0000000044042880   MQ:  0000000000000000   DAR: c00800000a1b0000
>  DSISR: 0000000040000000  Syscall Result: 0000000000000000
>  [NIP : memcpy_power7+1648]
>  [LR  : zs_obj_write+548]
>  #0 [c0000000a5a56c50] crash_kexec at c00000000037f268
>  #1 [c0000000a5a56c80] oops_end at c00000000002b678
>  #2 [c0000000a5a56d00] __bad_page_fault at c00000000014f348
>  #3 [c0000000a5a56d70] data_access_common_virt at c000000000008be0
>  #4 [c0000000a5a57040] memcpy_power7 at c000000000175274
>  #5 [c0000000a5a57140] zs_obj_write at c0000000006d96b4

Looks like this is the new zsmalloc object mapping API being called, which
was merged in rc1? But let's first confirm via git bisect, unless someone
from linux-mm who knows the zsmalloc subsystem better can point out what
could be going wrong here.

>  #6 [c0000000a5a571b0] zram_write_page at c008000009174a50 [zram]
>  #7 [c0000000a5a57260] zram_bio_write at c008000009174ff4 [zram]
>  #8 [c0000000a5a57310] __submit_bio at c0000000009323ac
>  #9 [c0000000a5a573a0] __submit_bio_noacct at c000000000932614
> #10 [c0000000a5a57410] submit_bio_wait at c000000000926d34
> #11 [c0000000a5a57480] swap_writepage_bdev_sync at c00000000065ab5c
> #12 [c0000000a5a57540] swap_writepage at c00000000065b90c
> #13 [c0000000a5a57570] shmem_writepage at c0000000005ada30
> #14 [c0000000a5a57630] pageout at c00000000059d700
> #15 [c0000000a5a57850] shrink_folio_list at c00000000059eafc
> #16 [c0000000a5a57aa0] shrink_inactive_list at c0000000005a0b38
> #17 [c0000000a5a57b70] shrink_lruvec at c0000000005a12a0
> #18 [c0000000a5a57c80] shrink_node_memcgs at c0000000005a16f4
> #19 [c0000000a5a57d00] shrink_node at c0000000005a183c
> #20 [c0000000a5a57d80] balance_pgdat at c0000000005a24e0
> #21 [c0000000a5a57ef0] kswapd at c0000000005a2b50
> #22 [c0000000a5a57f80] kthread at c00000000026ea68
> #23 [c0000000a5a57fe0] start_kernel_thread at c00000000000df98

-ritesh