** Changed in: linux (Ubuntu Noble)
       Status: In Progress => Fix Committed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2077722

Title:
  [Ubuntu 24.04] MultiVM - L2 guest(s) running stress-ng getting stuck
  at booting after triggering crash

Status in The Ubuntu-power-systems project:
  In Progress
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed
Status in linux source package in Oracular:
  Fix Released
Status in linux source package in Plucky:
  Fix Released

Bug description:
  SRU Justification:
  ==================

  [Impact]

   * L2 guest(s) (nested virtualization) running stress-ng get stuck
     at booting after triggering a crash.

   * When, for example, running two Ubuntu 24.04 guests with
     stress-ng (90% load) on both and triggering a crash simultaneously,
     the 1st guest gets stuck and does not boot up.

   * In one of the attempts, both guests got stuck on booting with a
     console hang.

  [Fix]

   * a373830f96db288a3eb43a8692b6bcd0bd88dfe1
     "KVM: PPC: Book3S HV: Mask off LPCR_MER for a vCPU before running
     it to avoid spurious interrupts"

  [Test Plan]

   * An Ubuntu Server 24.04 LPAR installation, acting as KVM host,
     on IBM Power 10 hardware (with nested-KVM-capable firmware FW1060
     or newer) is needed.

   * On top, two (or more) KVM guests (now nested), again running 24.04,
     need to be set up.

   * Run the attached stress-ng.sh script on both KVM guests.

   * Trigger crash(es) on both KVM guests at the same time:
     echo c >/proc/sysrq-trigger

   * Without the above patch in place, at least one KVM guest (sometimes
     both) is now stuck while rebooting.
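  The test plan above can be sketched as a single script. This is a hedged
  sketch only: the actual attached stress-ng.sh may differ, and since the
  sysrq trigger deliberately crashes the guest, the commands are printed
  for review rather than executed.

```shell
#!/bin/sh
# Hedged sketch of the reproduction steps from the test plan above.
# The real stress-ng.sh attachment may differ; commands are only
# printed, because "echo c > /proc/sysrq-trigger" crashes the guest.
NCPUS=$(nproc)
REPRO=$(cat <<EOF
# On each L2 guest, start ~90% CPU load across all CPUs:
stress-ng --cpu ${NCPUS} --cpu-load 90 --timeout 0 &
# Then, on both L2 guests at the same time, trigger a crash:
echo c > /proc/sysrq-trigger
EOF
)
printf '%s\n' "$REPRO"
```

  Run the printed stress-ng line inside each L2 guest, then issue the sysrq
  trigger on both guests simultaneously.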

  [Where problems could occur]

   * The changes are in arch/powerpc/kvm/book3s_hv.c only,
     hence are ppc specific and do not affect any other architecture.

   * The net change is more or less only two effective lines of code:
     an additional else case and the explicit masking off of the 'MER' bit.

   * Wrong assumptions may have a different impact on KVM guests (L0),
     or interfere with any other virtualization level.

   * But the commit is an upstream-accepted fix
     [for ec0f6639fa88 ("KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER
     bit is passed to the L0")]
     that landed in kernel 6.12 and was also accepted as a stable update
     for kernels v6.8+.

  [Other Info]

   * The fix/commit discussed here will be part of the planned
     target kernel for plucky, hence plucky/25.04 is not affected.

   * The fix/commit is already included in oracular master-next
     as 08cbc81b9a61 and included starting with kernel Ubuntu-6.11.0-17.17.
     
   * With that, only noble needs to be fixed (since this nested
     virtualization scenario is not supported by Ubuntu prior to noble).

   * Since the fix is marked upstream as a stable update,
     it would usually be picked up by the kernel team automatically.
   
   * But in order not to lose sight of the 24.04.2 window, I was asked
     to submit this patch separately.

  __________

  Problem:
  While bringing up 2 Ubuntu 24.04 guests and running stress-ng (90% load)
  on both and triggering a crash simultaneously, the 1st guest gets stuck
  and does not boot up. In one of the attempts, both guests got stuck on
  booting with a console hang.

  Attempts:
  Reproducible 3/3 consecutive times
  Run 1: L2-1 guest got stuck
  Run 2: L2-1 guest got stuck
  Run 3: L2-1 and L2-2 guest got stuck

  =================================================================
  L1 Host:
  1. PowerVM
  2. OS: Ubuntu 24.04
  3. Kernel: 6.8.0-31-generic
  4. Mem (free -mh): 47Gi
  5. cpus: 40

  Guest L2-1:
  1. OS: Ubuntu 24.04
  2. Kernel: 6.8.0-31-generic
  3. Mem (free -mh): 9.5Gi
  4. cpus: 8
  5. Stress: stress-ng - 90% load
  6. XML configuration:
     <vcpu placement='static' current='8'>16</vcpu>
     <memory unit='KiB'>10971520</memory>
     <topology sockets='8' dies='1' cores='1' threads='2'/>

  Guest L2-2:
  1. OS: Ubuntu 24.04
  2. Kernel: 6.8.0-31-generic
  3. Mem (free -mh): 9.5Gi
  4. cpus: 8
  5. Stress: stress-ng - 90% load
  6. XML configuration:
     <vcpu placement='static' current='8'>16</vcpu>
     <memory unit='KiB'>10971520</memory>
     <topology sockets='2' dies='1' cores='1' threads='8'/>

  =================================================================
  Steps to reproduce:
  1. Bring up 2 Ubuntu 24.04 L2 guests with the configuration mentioned above
  2. Run the attached stress-ng.sh script on both L2 guests
  3. Trigger a crash on both L2 guests at the same time:
     echo c >/proc/sysrq-trigger

  After triggering the crash, one or both guest consoles will get stuck.
  We are then able neither to enter the guest nor to shut it down.
  In order to boot into the guest again, a virsh destroy of the guest is
  required.

  =================================================================
  Run1: Console.log Error message of L2-1
    Booting `Ubuntu'

  Loading Linux 6.8.0-31-generic ...
  Loading initial ramdisk ...
  OF stdout device is: /vdevice/vty@30000000
  Preparing to boot Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) 
(powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU 
Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 
6.8.0-31.31-generic 6.8.1)
  Detected machine type: 0000000000000101
  command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic 
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
  Max number of cores passed to firmware: 1024 (NR_CPUS = 2048)
  Calling ibm,client-architecture-support... done
  memory layout at init:
    memory_limit : 0000000000000000 (16 MB aligned)
    alloc_bottom : 0000000009d70000
    alloc_top    : 0000000030000000
    alloc_top_hi : 00000002a0000000
    rmo_top      : 0000000030000000
    ram_top      : 00000002a0000000
  instantiating rtas at 0x000000002fff0000... done
  prom_hold_cpus: skipped
  copying OF device tree...
  Building dt strings...
  Building dt structure...
  Device tree strings 0x0000000009d80000 -> 0x0000000009d80bc6
  Device tree struct  0x0000000009d90000 -> 0x0000000009da0000
  Quiescing Open Firmware ...
  Booting Linux via __start() @ 0x0000000000230000 ...
  [    0.000000] random: crng init done
  [    0.000000] Reserving 512MB of memory at 512MB for crashkernel (System 
RAM: 10752MB)
  [    0.000000] radix-mmu: Page sizes from device-tree:
  [    0.000000] radix-mmu: Page size shift = 12 AP=0x0
  [    0.000000] radix-mmu: Page size shift = 16 AP=0x5
  [    0.000000] radix-mmu: Page size shift = 21 AP=0x1
  [    0.000000] radix-mmu: Page size shift = 30 AP=0x2
  [    0.000000] Activating Kernel Userspace Access Prevention
  [    0.000000] Activating Kernel Userspace Execution Prevention
  [    0.000000] radix-mmu: Mapped 0x0000000000000000-0x00000000038a0000 with 
64.0 KiB pages (exec)
  [    0.000000] radix-mmu: Mapped 0x00000000038a0000-0x00000002a0000000 with 
64.0 KiB pages
  [    0.000000] lpar: Using radix MMU under hypervisor
  [    0.000000] Linux version 6.8.0-31-generic (buildd@bos02-ppc64el-018) 
(powerpc64le-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU 
Binutils for Ubuntu) 2.42) #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024 (Ubuntu 
6.8.0-31.31-generic 6.8.1)
  [    0.000000] Secure boot mode disabled
  [    0.000000] Found initrd at 0xc000000006200000:0xc000000009d6da29
  [    0.000000] Hardware name: IBM pSeries (emulated by qemu) POWER10 (raw) 
0x800200 0xf000006 of:SLOF,HEAD hv:linux,kvm pSeries
  [    0.000000] printk: legacy bootconsole [udbg0] enabled
  [    0.000000] Partition configured for 16 cpus.
  [    0.000000] CPU maps initialized for 2 threads per core
  [    0.000000] numa: Partition configured for 1 NUMA nodes.
  [    0.000000] -----------------------------------------------------
  [    0.000000] phys_mem_size     = 0x2a0000000
  [    0.000000] dcache_bsize      = 0x80
  [    0.000000] icache_bsize      = 0x80
  [    0.000000] cpu_features      = 0x001400eb8f5f9187
  [    0.000000]   possible        = 0x001ffbfbcf5fb187
  [    0.000000]   always          = 0x0000000380008181
  [    0.000000] cpu_user_features = 0xdc0065c2 0xaef60000
  [    0.000000] mmu_features      = 0x3c007641
  [    0.000000] firmware_features = 0x00000a85455a445f
  [    0.000000] vmalloc start     = 0xc008000000000000
  [    0.000000] IO start          = 0xc00a000000000000
  [    0.000000] vmemmap start     = 0xc00c000000000000
  [    0.000000] -----------------------------------------------------
  [    0.000000] numa:   NODE_DATA [mem 0x28ae09c00-0x28ae1197f]
  [    0.000000] rfi-flush: fallback displacement flush available
  [    0.000000] rfi-flush: ori type flush available
  [    0.000000] rfi-flush: mttrig type flush available
  [    0.000000] count-cache-flush: hardware flush enabled.
  [    0.000000] link-stack-flush: software flush enabled.
  [    0.000000] stf-barrier: eieio barrier available
  [    0.000000] PPC64 nvram contains 65536 bytes
  [    0.000000] barrier-nospec: using ORI speculation barrier
  [    0.000000] Zone ranges:
  [    0.000000]   Normal   [mem 0x0000000000000000-0x000000029fffffff]
  [    0.000000]   Device   empty
  [    0.000000] Movable zone start for each node
  [    0.000000] Early memory node ranges
  [    0.000000]   node   0: [mem 0x0000000000000000-0x000000029fffffff]
  [    0.000000] Initmem setup node 0 [mem 
0x0000000000000000-0x000000029fffffff]
  [    0.000000] percpu: Embedded 12 pages/cpu s609960 r0 d176472 u786432
  [    0.000000] Kernel command line: BOOT_IMAGE=/vmlinux-6.8.0-31-generic 
root=/dev/mapper/ubuntu--vg-ubuntu--lv ro 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
  [    0.000000] Unknown kernel command line parameters 
"BOOT_IMAGE=/vmlinux-6.8.0-31-generic", will be passed to user space.
  [    0.000000] Dentry cache hash table entries: 2097152 (order: 8, 16777216 
bytes, linear)
  [    0.000000] Inode-cache hash table entries: 1048576 (order: 7, 8388608 
bytes, linear)
  [    0.000000] Fallback order for Node 0: 0
  [    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 171864
  [    0.000000] Policy zone: Normal
  [    0.000000] mem auto-init: stack:all(zero), heap alloc:on, heap free:off
  [    0.000000] Memory: 9947840K/11010048K available (23680K kernel code, 
4096K rwdata, 25472K rodata, 8832K init, 1901K bss, 1062208K reserved, 0K 
cma-reserved)
  [    0.000000] SLUB: HWalign=128, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
  [    0.000000] ftrace: allocating 51717 entries in 19 pages
  [    0.000000] ftrace: allocated 19 pages with 3 groups
  [    0.000000] trace event string verifier disabled
  [    0.000000] rcu: Hierarchical RCU implementation.
  [    0.000000] rcu:   RCU restricting CPUs from NR_CPUS=2048 to nr_cpu_ids=16.
  [    0.000000]        Rude variant of Tasks RCU enabled.
  [    0.000000]        Tracing variant of Tasks RCU enabled.
  [    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 
jiffies.
  [    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=16
  [    0.000000] NR_IRQS: 512, nr_irqs: 512, preallocated irqs: 16
  [    0.000000] xive: Using IRQ range [0-f]
  [    0.000000] xive: Interrupt handling initialized with spapr backend
  [    0.000000] xive: Using priority 6 for all interrupts
  [    0.000000] xive: Using 64kB queues
  [    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
  [    0.000000] time_init: 56 bit decrementer (max: 7fffffffffffff)
  [    0.001027] clocksource: timebase: mask: 0xffffffffffffffff max_cycles: 
0x761537d007, max_idle_ns: 440795202126 ns
  [    0.002881] clocksource: timebase mult[1f40000] shift[24] registered

  =================================================================
  Host side:
  When the L2-1 guest console got stuck on first attempt
  Run 1

  # top | cat

  top - 08:53:11 up 2 days, 14:15,  6 users,  load average: 9.00, 10.53, 12.53
  Tasks: 496 total,   1 running, 495 sleeping,   0 stopped,   0 zombie
  %Cpu(s):  1.2 us,  2.2 sy,  0.0 ni, 76.7 id,  0.0 wa, 20.0 hi,  0.0 si,  0.0 
st
  MiB Mem :  48414.8 total,    303.5 free,  24681.1 used,  23777.0 buff/cache
  MiB Swap:   8191.9 total,   7910.1 free,    281.8 used.  23733.7 avail Mem

  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  root      20   0   15.6g  10.5g  15360 S 800.0  22.2 146:46.26 qemu-system-ppc
  root      20   0   15.5g  10.5g  15360 S 100.0  22.1  88:03.53 qemu-system-ppc

  # free -mh
                 total        used        free      shared  buff/cache   
available
  Mem:            47Gi        24Gi       230Mi       2.2Mi        23Gi        
23Gi

  =================================================================
  Debugging logs/dumps:
  1. console.logs of both L2 guest consoles (All 3 attempts)
  2. virsh dump of both guests (All 3 attempts)

  Copying the above logs/dumps to june server machines under
  /dump/dumps/<bug-number>

  =================================================================
  Attachments:
  1. Run-1 console.log of L2-1 guest getting stuck
  2. Run-3 console.log of L2-1 guest getting stuck
  3. Run-3 console.log of L2-2 guest getting stuck
  4. Stress-ng script to run 90% load: stress-ng.sh

  $ pwd
  /home/dump/dumps/206735
  $ ls
  bug-206735-guest-console-logs  bug-206735-guest-virsh-dumps

  Thanks.

  ~/bug-206735-dumps# crash
  
/root/.cache/debuginfod_client/475c3a23ac990f64c5a03cf1fe8b229fde9a7692/debuginfo
  ./vmcore-ubuntu_vm1-1

  crash 8.0.4
  Copyright (C) 2002-2022  Red Hat, Inc.
  Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
  Copyright (C) 1999-2006  Hewlett-Packard Co
  Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
  Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
  Copyright (C) 2005, 2011, 2020-2022  NEC Corporation
  Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
  Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
  Copyright (C) 2015, 2021  VMware, Inc.
  This program is free software, covered by the GNU General Public License,
  and you are welcome to change it and/or distribute copies of it under
  certain conditions.  Enter "help copying" to see the conditions.
  This program has absolutely no warranty.  Enter "help warranty" for details.

  GNU gdb (GDB) 10.2
  Copyright (C) 2021 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.
  Type "show copying" and "show warranty" for details.
  This GDB was configured as "powerpc64le-unknown-linux-gnu".
  Type "show configuration" for configuration details.
  Find the GDB manual and other documentation resources online at:
      <http://www.gnu.org/software/gdb/documentation/>.

  For help, type "help".
  Type "apropos word" to search for commands related to "word"...

        KERNEL: 
/root/.cache/debuginfod_client/475c3a23ac990f64c5a03cf1fe8b229fde9a7692/debuginfo
      DUMPFILE: ./vmcore-ubuntu_vm1-1
          CPUS: 1
          DATE: Fri May 24 08:44:35 UTC 2024
        UPTIME: 00:00:00
  LOAD AVERAGE: 0.00, 0.00, 0.00
         TASKS: 1
      NODENAME: (none)
       RELEASE: 6.8.0-31-generic
       VERSION: #31-Ubuntu SMP Sat Apr 20 00:05:55 UTC 2024
       MACHINE: ppc64le  (3450 Mhz)
        MEMORY: 10.5 GB
         PANIC: ""
           PID: 0
       COMMAND: "swapper/0"
          TASK: c000000003bf8900  [THREAD_INFO: c000000003bf8900]
           CPU: 0
         STATE: TASK_RUNNING (ACTIVE)
       WARNING: panic task not found

  crash> bt
  PID: 0        TASK: c000000003bf8900  CPU: 0    COMMAND: "swapper/0"
   R0:  c0000000000de4f4    R1:  c00000028af13f80    R2:  c000000002254800
   R3:  c0000000048de000    R4:  c000000003c37bc0    R5:  0000000000000000
   R6:  0000000000000000    R7:  0000000000000000    R8:  c000000001724d18
   R9:  000000000000ff00    R10: 0000000286f80000    R11: 0000000053474552
   R12: c0000000000e4184    R13: c000000003e80000    R14: 0000000000000000
   R15: 0000000000000000    R16: 0000000000000000    R17: 0000000000000000
   R18: 0000000000000000    R19: 0000000000000000    R20: 0000000000000000
   R21: 0000000000000000    R22: 0000000000000000    R23: 0000000000000000
   R24: 0000000000000000    R25: c000000003c37bc0    R26: c00000028af13fe0
   R27: c000000003c34000    R28: c000000003c69e88    R29: c000000003c37c80
   R30: c000000003262bd8    R31: c0000000048de000
   NIP: c0000000000e41a4    MSR: 8000000000000033    OR3: 0000000000000000
   CTR: c0000000000e4184    LR:  c0000000000de4f4    XER: 0000000000000074
   CCR: 0000000082042840    MQ:  0000000000000000    DAR: 0000000000000000
   DSISR: 0000000000000000     Syscall Result: 0000000000000000
   [NIP  : xive_spapr_update_pending+32]
   [LR   : xive_get_irq+76]
   #0 [c00000028af13f80] (null) at 0  (unreliable)
   #1 [c00000028af13fb0] __do_irq at c000000000017a78
   #2 [c00000028af13fe0] __do_IRQ at c000000000018cd8
   #3 [c000000003c37bc0] (null) at 0  (unreliable)
   #4 [c000000003c37c20] do_IRQ at c000000000018e30
   #5 [c000000003c37c50] hardware_interrupt_common_virt at c00000000000953c
   #6 [c000000003c37f20] (null) at 9d6da29  (unreliable)
   #7 [c000000003c37f50] start_kernel at c00000000300fed0
   #8 [c000000003c37fe0] start_here_common at c00000000000e998
  crash> dis xive_spapr_update_pending+32
  0xc0000000000e41a4 <xive_spapr_update_pending+32>:      hwsync
  crash> dis -s xive_spapr_update_pending+32
  FILE: /build/linux-NbDBKx/linux-6.8.0/arch/powerpc/sysdev/xive/spapr.c
  LINE: 618

  dis: xive_spapr_update_pending+32: source code is not available

  crash>

  I debugged the 6.9.4-200.fc40 kernel of the evelp2g2 L2 VM, as given to
  me by Lekshmi, and I find that stress-ng has nothing to do with hitting
  this hang in the same __do_IRQ -> xive_get_irq ->
  xive_spapr_update_pending call stack.

  This call-stack hits randomly on one of the L2 vcpus whenever we do a
  "echo c > /proc/sysrq-trigger".

  The NIP points to the mb() macro in C code and hwsync in the GDB
  disassembly but this isn't really a problem with hwsync.

  I could reproduce this exact same problem with kernel 6.10.0-rc6+ on FC40 
after I:
  i)      Reset the crashkernel cmdline with the following command:
          grubby --update-kernel ALL --args 
"crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"
  ii)     Enabled kdump via the "systemctl enable kdump" and "systemctl start 
kdump" commands.

  On debugging the vcpu thread in the Qemu instance running in L1 I find
  that the vcpu thread doesn't exit from KVM_RUN with any error code.

  I find that the xive_get_irq() -> xive_spapr_update_pending() functions
  are actually being called repeatedly by do_IRQ() -> __do_IRQ() ->
  __do_irq() in the arch/powerpc code. This means that we are constantly
  getting interrupts on this CPU. When I debugged the IRQ number we are
  constantly getting, I see that it is 0x0, which is not informative as
  of now (to me at least).

  I think that there is some problem in the startup sequence of the
  secondary CPUs as I never faced this problem on the boot CPU as long
  as I tried today.

  I request the CPU team to investigate the startup sequence of the
  secondary SMP CPUs as they would be having a better idea of this for
  powerpc.

  The procedure to be followed is simple:
  i)    Put logs in the startup code of the secondary and primary CPU(s).
  ii)   Investigate the point at which the primary CPU waits for the secondary 
CPUs to come up and understand what isn't happening at the secondary CPUs
        such that the primary CPU doesn't go past the "smp: Bringing up 
secondary CPUs" log.

  (In reply to comment #21)
  > I think that there is some problem in the startup sequence of the secondary
  > CPUs as I never faced this problem on the boot CPU as long as I tried today.
  >
  > I request the CPU team to investigate the startup sequence of the secondary
  > SMP CPUs as they would be having a better idea of this for powerpc.
  >
  > The procedure to be followed is simple:
  > i)    Put logs in the startup code of the secondary and primary CPU(s).
  > ii)   Investigate the point at which the primary CPU waits for the secondary
  > CPUs to come up and understand what isn't happening at the secondary CPUs
  >       such that the primary CPU doesn't go past the "smp: Bringing up
  > secondary CPUs" log.

  We have done exactly that, and we see that one of the secondary threads
  during the bring-up is stuck in arch_local_irq_restore().
  We don't know why it gets stuck there, and that is a function that can't
  be instrumented, since doing so leads to other side effects even before
  hitting "Bringing up secondary CPUs".

  Even the addr2line for the address (NIP) shown in rcu stall points to
  the same arch_local_irq_restore().

  Just to level set.

  Gautham's patch helps if we are going to disable xive in L2
  Nick's patch will not throw RCU Stalls but L2 will still hang.

  Current interpretation of the investigation:
  When we have xive on L2, CPU bring-up gets stuck at mb() in
  xive_spapr_update_pending(). However, this is not reproducible with
  xive disabled.

  It's probably a bit early to conclude that xive is the problem.

  
-----------------------------------------------------------------------------------

  After reverting the following commits, pointed out by Gautam Menghani,
  the hang is not seen and kdump in L2 works as expected.

  df938a5576f3 KVM: PPC: Book3S HV nestedv2: Do not inject certain interrupts
  ec0f6639fa88 KVM: PPC: Book3S HV nestedv2: Ensure LPCR_MER bit is passed to 
the L0
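  For reference, the revert experiment can be sketched as below. This is a
  hypothetical sketch: the kernel tree path "linux" is an assumption, and
  the commands are only printed so the sequence can be reviewed before
  applying it to a real tree.

```shell
#!/bin/sh
# Hypothetical sketch: print the git revert commands for the two suspect
# commits named above. The tree path "linux" is an assumption; nothing
# is executed against a real tree here.
REVERTS=""
for c in df938a5576f3 ec0f6639fa88; do
  REVERTS="${REVERTS}git -C linux revert --no-edit ${c}
"
done
printf '%s' "$REVERTS"
```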

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/2077722/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
