Hello, it looks like there has been a lot of investigation into this bug. Have any commits been identified that may resolve it?
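One way to answer that question, once a kernel build that does not reproduce the problem is known, is a "reverse" git bisect: mark the failing kernel as the old state and the working one as new, and bisect converges on the first commit where the bug is fixed rather than where it was introduced. Below is a self-contained sketch of the mechanics in a throwaway repository; the repo, the `state` file, and the commit messages are all hypothetical stand-ins for the real kernel tree and a real boot/stress test.

```shell
#!/bin/sh
# Toy demonstration of a "reverse" bisect: find the first commit where a
# bug is FIXED, by swapping bisect's terms. Everything here is a stand-in.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name  demo

echo broken > state; git add state; git commit -qm 'bug present'
echo fixed  > state; git commit -qam 'the (unknown) fix'
fix=$(git rev-parse HEAD)                    # the commit bisect should find
echo note > other; git add other; git commit -qm 'later unrelated change'
first=$(git rev-list --max-parents=0 HEAD)   # oldest commit, still broken

git bisect start --term-old=broken --term-new=fixed
git bisect broken "$first" >/dev/null
git bisect fixed  HEAD     >/dev/null

# "Test" each candidate git checks out (here: read the state file) until
# git names the first fixed commit.
result=
while [ -z "$result" ]; do
    if grep -qx fixed state; then out=$(git bisect fixed)
    else                          out=$(git bisect broken)
    fi
    case $out in *'is the first fixed commit'*) result=$out ;; esac
done
echo "$result" | head -n1
```

With a real kernel the per-step "test" would be: build/install the checked-out revision (or grab the matching mainline build), boot it, run the migration stress, and mark `fixed` or `broken` accordingly.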
Upstream v4.18-rc1 is now available. Perhaps this kernel could be tested to see whether a "reverse" bisect can identify a commit that fixes the issue:

  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18-rc1/

Thanks!

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => Joseph Salisbury (jsalisbury)

** Changed in: linux (Ubuntu)
     Assignee: Canonical Kernel Team (canonical-kernel-team) => Joseph Salisbury (jsalisbury)

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1768115

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3g1: Migration guest running
  with IO stress crashed@security_file_permission+0xf4/0x160

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  Triaged
Status in linux source package in Bionic:
  Triaged

Bug description:

  Problem Description:
  A guest being migrated while running an IO stress workload crashed at
  security_file_permission+0xf4/0x160 after a couple of migrations.

  Steps to re-create:

  Source host      - boslcp3
  Destination host - boslcp4

  1. Installed boslcp3 and boslcp4 with the latest kernel:

     root@boslcp3:~# uname -a
     Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

     root@boslcp4:~# uname -a
     Linux boslcp4 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

  2. Installed guest boslcp3g1 and started an LTP run from the boslcp3 host:

     root@boslcp3g1:~# uname -a
     Linux boslcp3g1 4.15.0-15-generic #16+bug166877 SMP Wed Apr 18 14:47:30 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

  3. Started migrating the boslcp3g1 guest from source to destination and
     vice versa.

  4. After a couple of migrations the guest crashed on boslcp4 and entered xmon:

     8:mon> t
     [c0000004f8a23d20] c0000000005a7674 security_file_permission+0xf4/0x160
     [c0000004f8a23d60] c0000000003d1d30 rw_verify_area+0x70/0x120
     [c0000004f8a23d90] c0000000003d375c vfs_read+0x8c/0x1b0
     [c0000004f8a23de0] c0000000003d3d88 SyS_read+0x68/0x110
     [c0000004f8a23e30] c00000000000b184 system_call+0x58/0x6c
     --- Exception: c01 (System Call) at 000071f1779fe280
     SP (7fffe99ece50) is in userspace

     8:mon> S
     msr   = 8000000000001033   sprg0 = 0000000000000000
     pvr   = 00000000004e1202   sprg1 = c000000007a85800
     dec   = 00000000591e3e03   sprg2 = c000000007a85800
     sp    = c0000004f8a234a0   sprg3 = 0000000000010008
     toc   = c0000000016eae00   dar   = 000000000000023c
     srr0  = c0000000000c355c   srr1  = 8000000000001033   dsisr = 40000000
     dscr  = 0000000000000000   ppr   = 0010000000000000   pir   = 00000011
     amr   = 0000000000000000   uamor = 0000000000000000
     dpdes = 0000000000000000   tir   = 0000000000000000   cir   = 00000000
     fscr  = 0500000000000180   tar   = 0000000000000000   pspb  = 00000000
     mmcr0 = 0000000080000000   mmcr1 = 0000000000000000   mmcr2 = 0000000000000000
     pmc1  = 00000000  pmc2 = 00000000  pmc3 = 00000000  pmc4 = 00000000
     mmcra = 0000000000000000   siar  = 0000000000000000   pmc5  = 0000026c
     sdar  = 0000000000000000   sier  = 0000000000000000   pmc6  = 00000861
     ebbhr = 0000000000000000   ebbrr = 0000000000000000   bescr = 0000000000000000
     iamr  = 4000000000000000   pidr  = 0000000000000034   tidr  = 0000000000000000

     cpu 0x8: Vector: 700 (Program Check) at [c0000004f8a23220]
         pc: c0000000000e4854: xmon_core+0x1f24/0x3520
         lr: c0000000000e4850: xmon_core+0x1f20/0x3520
         sp: c0000004f8a234a0
        msr: 8000000000041033
       current = 0xc0000004f89faf00
       paca    = 0xc000000007a85800   softe: 0   irq_happened: 0x01
       pid     = 24028, comm = top
     Linux version 4.15.0-20-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.17)
     cpu 0x8: Exception 700 (Program Check) in xmon, returning to main loop
  5. Backtrace, registers, and exception state at the crash:

     8:mon> t
     [c0000004f8a23d20] c0000000005a7674 security_file_permission+0xf4/0x160
     [c0000004f8a23d60] c0000000003d1d30 rw_verify_area+0x70/0x120
     [c0000004f8a23d90] c0000000003d375c vfs_read+0x8c/0x1b0
     [c0000004f8a23de0] c0000000003d3d88 SyS_read+0x68/0x110
     [c0000004f8a23e30] c00000000000b184 system_call+0x58/0x6c
     --- Exception: c01 (System Call) at 000071f1779fe280
     SP (7fffe99ece50) is in userspace

     8:mon> r
     R00 = c00000000043b7fc   R16 = 0000000000000000
     R01 = c0000004f8a23c90   R17 = ffffffffffffff70
     R02 = c0000000016eae00   R18 = 00000a51b4bebfc8
     R03 = c000000279557200   R19 = 00007fffe99edbb0
     R04 = c0000003242499c0   R20 = 00000a51b4c04db0
     R05 = 0000000000020000   R21 = 00000a51b4c20e90
     R06 = 0000000000000004   R22 = 0000000000040f00
     R07 = ffffff8100000000   R23 = 00000a51b4c06560
     R08 = ffffff8000000000   R24 = ffffffffffffff80
     R09 = 0000000000000000   R25 = 00000a51b4bec2b8
     R10 = 0000000000000000   R26 = 000071f177bb0b20
     R11 = 0000000000000000   R27 = 0000000000000000
     R12 = 0000000000002000   R28 = c000000279557200
     R13 = c000000007a85800   R29 = c0000004c7734210
     R14 = 0000000000000000   R30 = 0000000000000000
     R15 = 0000000000000000   R31 = c0000003242499c0
     pc  = c00000000043b808   __fsnotify_parent+0x88/0x1a0
     cfar= c0000000003f9e78   dget_parent+0xe8/0x150
     lr  = c00000000043b7fc   __fsnotify_parent+0x7c/0x1a0
     msr = 8000000000009033   cr  = 28002222
     ctr = c0000000006252b0   xer = 0000000000000000   trap = 300
     dar = 000000000000023c   dsisr = 40000000

     8:mon> e
     cpu 0x8: Vector: 300 (Data Access) at [c0000004f8a23a10]
         pc: c00000000043b808: __fsnotify_parent+0x88/0x1a0
         lr: c00000000043b7fc: __fsnotify_parent+0x7c/0x1a0
         sp: c0000004f8a23c90
        msr: 8000000000009033
        dar: 23c
      dsisr: 40000000
       current = 0xc0000004f89faf00
       paca    = 0xc000000007a85800   softe: 0   irq_happened: 0x01
       pid     = 24028, comm = top
     Linux version 4.15.0-20-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #21-Ubuntu SMP Tue Apr 24 06:14:44 UTC 2018 (Ubuntu 4.15.0-20.21-generic 4.15.17)

  6. The guest enters xmon after migrating from boslcp3 to boslcp4 (same
     backtrace and register dump as in step 5).

BUG_ON in jbd2_journal_write_metadata_buffer

I've included xmon crash data from a more recent crash, this time a BUG_ON
in fs/jbd2/journal.c:jbd2_journal_write_metadata_buffer():

    int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
                                           struct journal_head *jh_in,
                                           struct buffer_head **bh_out,
                                           sector_t blocknr)
    {
            int need_copy_out = 0;
            int done_copy_out = 0;
            int do_escape = 0;
            char *mapped_data;
            struct buffer_head *new_bh;
            struct page *new_page;
            unsigned int new_offset;
            struct buffer_head *bh_in = jh2bh(jh_in);
            journal_t *journal = transaction->t_journal;

            /*
             * The buffer really shouldn't be locked: only the current committing
             * transaction is allowed to write it, so nobody else is allowed
             * to do any IO.
             *
             * akpm: except if we're journalling data, and write() output is
             * also part of a shared mapping, and another thread has
             * decided to launch a writepage() against this buffer.
             */
            J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));

This is not the same as the original bug, but I suspect both belong to a
class of issues we are hitting under very particular circumstances that
might not generally be seen during normal operation, each triggering a
different corner case. As such, I think it makes sense to group them under
this bug for the time being.

The general workload is running IO-heavy disk workloads on large guests
(20 GB memory, 16 vCPUs) with SAN-based storage, and then performing
migration during the workload. During migration we begin to see a high
occurrence of rcu_sched stall warnings, and after 1-3 hours of operation
we hit filesystem-related crashes like the ones posted.
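For reference, the migrate-and-return stress described above can be sketched as a simple driver loop. This is an illustrative reconstruction, not the actual test harness: the guest and host names are the ones from this report, `virsh` is assumed to be the control interface, and `RUN` defaults to `echo` so the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the back-and-forth live-migration stress loop (assumed harness).
# Set RUN= (empty) on a real control host to actually invoke virsh.

RUN="echo"

migrate_cycle() {
    guest=$1 src=$2 dst=$3
    # Live-migrate while the LTP/IO workload keeps running in the guest...
    $RUN virsh migrate --live "$guest" "qemu+ssh://$dst/system"
    # ...and back again ("vice versa" in the report).
    $RUN virsh migrate --live "$guest" "qemu+ssh://$src/system"
}

i=1
while [ "$i" -le 3 ]; do   # the crash appeared after a couple of cycles
    migrate_cycle boslcp3g1 boslcp3 boslcp4
    i=$((i + 1))
done
```

The report indicates the rcu_sched stalls start during migration itself, so interleaving a stall check (e.g. scanning the guest console log) between cycles would be a natural extension.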
We've seen this with two separate FC cards, Emulex and QLogic, where we
invoke QEMU through libvirt as:

  LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
  QEMU_AUDIO_DRV=none /usr/bin/qemu-system-ppc64 \
    -name guest=boslcp3g1,debug-threads=on -S \
    -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-7-boslcp3g1/master-key.aes \
    -machine pseries-2.10,accel=kvm,usb=off,dump-guest-core=off,max-cpu-compat=power9 \
    -cpu host -m 20480 -realtime mlock=off \
    -smp 16,maxcpus=32,sockets=4,cores=8,threads=1 \
    -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=/dev/hugepages/libvirt/qemu/7-boslcp3g1,size=10737418240 \
    -numa node,nodeid=0,cpus=0-7,memdev=ram-node0 \
    -object memory-backend-ram,id=ram-node1,size=10737418240 \
    -numa node,nodeid=1,cpus=8-15,memdev=ram-node1 \
    -uuid bd110ed9-dcfc-4470-b4ae-6166a56819f0 \
    -display none -no-user-config -nodefaults \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-7-boslcp3g1/monitor.sock,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=control \
    -rtc base=utc -no-shutdown -boot menu=on,strict=on \
    -device spapr-pci-host-bridge,index=1,id=pci.1 \
    -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 \
    -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x2 \
    -drive file=/home/bionic-server-ppc64el.iso,format=raw,if=none,id=drive-scsi0-0-0-2,readonly=on,cache=none \
    -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2 \
    -drive file=/dev/disk/by-id/dm-uuid-part1-mpath-3600507680183050d28000000000002a4,format=raw,if=none,id=drive-virtio-disk0,cache=none \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -drive file=/dev/disk/by-id/dm-uuid-part2-mpath-3600507680183050d28000000000002a4,format=raw,if=none,id=drive-virtio-disk1,cache=none \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1 \
    -drive file=/dev/disk/by-id/dm-uuid-part3-mpath-3600507680183050d28000000000002a4,format=raw,if=none,id=drive-virtio-disk2,cache=none \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk2,id=virtio-disk2 \
    -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=30 \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:72:d2:69,bus=pci.0,addr=0x1,bootindex=2 \
    -chardev pty,id=charserial0 \
    -device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 \
    -msg timestamp=on

I will attach the libvirt XML separately.

IBM is requesting some general filesystem expertise from Canonical, if
available, as we continue debugging.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1768115/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp