[Kernel-packages] [Bug 2035166] Re: NULL Pointer Dereference During KVM MMU Page Invalidation

Ubuntu Kernel Bot Mon, 30 Oct 2023 21:08:48 -0700

This bug is awaiting verification that the
linux-intel-iotg/5.15.0-1044.50 kernel in -proposed solves the problem.
Please test the kernel and update this bug with the results. If the
problem is solved, change the tag
'verification-needed-jammy-linux-intel-iotg' to
'verification-done-jammy-linux-intel-iotg'. If the problem still exists,
change the tag 'verification-needed-jammy-linux-intel-iotg' to
'verification-failed-jammy-linux-intel-iotg'.



If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.


See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2035166

Title:
  NULL Pointer Dereference During KVM MMU Page Invalidation

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Jammy:
  Fix Released

Bug description:
  [Impact]
  During VM live migration, there is a potential risk of dereferencing a NULL pointer,
  which can lead to memory access issues and result in an unstable environment.

  [Fix]
  The call trace is as follows:

  kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
  kernel: #PF: supervisor write access in kernel mode
  kernel: #PF: error_code(0x0002) - not-present page
  kernel: PGD 0 P4D 0 
  kernel: Oops: 0002 [#1] SMP NOPTI
  kernel: CPU: 29 PID: 4063601 Comm: CPU 0/KVM Tainted: G          IOE     5.15.0-53-generic #59~20.04.1-Ubuntu
  kernel: Hardware name: Dell Inc. PowerEdge R640/0H28RR, BIOS 2.12.2 07/09/2021
  kernel: RIP: 0010:__handle_changed_spte+0x3a9/0x620 [kvm]
  kernel: Code: 48 8b 58 28 44 0f b6 63 24 48 8b 43 28 41 83 e4 0f 48 89 45 a0 0f 1f 44 00 00 45 84 d2 0f 85 06 02 00 00 48 8b 43 08 48 8b 13 <48> 89 42 08 48 89 10 44 0f b6 6b 23 48 b8 00 01 00 00 00 00 ad de
  kernel: RSP: 0018:ffffb580320278a8 EFLAGS: 00010246
  kernel: RAX: 0000000000000000 RBX: ffffa0fe29e94c38 RCX: 0000000000000027
  kernel: RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffb5801e24ba58
  kernel: RBP: ffffb58032027930 R08: 0000000000000000 R09: 0000000000000004
  kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000003
  kernel: R13: 0000000000000004 R14: 0000000000000000 R15: ffffb5801e235000
  kernel: FS:  00007f1553fff700(0000) GS:ffffa20eff780000(0000) knlGS:0000000000000000
  kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  kernel: CR2: 0000000000000008 CR3: 000000e7f7544004 CR4: 00000000007726e0
  kernel: PKRU: 55555554
  kernel: Call Trace:
  kernel:  <TASK>
  kernel:  ? __switch_to_xtra+0x109/0x510
  kernel:  zap_gfn_range+0x218/0x360 [kvm]
  kernel:  ? __smp_call_single_queue+0x59/0x90
  kernel:  ? alloc_cpumask_var_node+0x1/0x30
  kernel:  ? kvm_make_vcpus_request_mask+0x150/0x1d0 [kvm]
  kernel:  kvm_tdp_mmu_zap_invalidated_roots+0x5b/0xb0 [kvm]
  kernel:  kvm_mmu_zap_all_fast+0x19a/0x1d0 [kvm]
  --
  kernel: RAX: ffffffffffffffda RBX: 000000004020ae46 RCX: 00007f15aa26e3ab
  kernel: RDX: 00007f1553ffe050 RSI: 000000004020ae46 RDI: 000000000000002f
  kernel: RBP: 00005602a885a410 R08: 00005602a82ad000 R09: 00007f154c087470
  kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f1553ffe050
  kernel: R13: 00007f1553ffe160 R14: 0000000000000000 R15: 0000000000800000
  kernel:  </TASK>

  The error occurred randomly in different production environments of the customer, all with the same call trace.
  Therefore, the likelihood of other processes contaminating memory is low.
  After analyzing the call trace with the help of debug symbols, we can pinpoint the source of the error.

  root@focal:~/ddeb# eu-addr2line -ifae ./usr/lib/debug/lib/modules/5.15.0-53-generic/kernel/arch/x86/kvm/kvm.ko __handle_changed_spte+0x3a9
  0x0000000000068109
  __list_del inlined at /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:135:2 in __handle_changed_spte
  /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:112:13
  __list_del_entry
  /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:135:2
  list_del
  /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/include/linux/list.h:146:2
  tdp_mmu_unlink_page
  /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/arch/x86/kvm/mmu/tdp_mmu.c:305:2
  handle_removed_tdp_mmu_page
  /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/arch/x86/kvm/mmu/tdp_mmu.c:340:2
  __handle_changed_spte
  /build/linux-hwe-5.15-ZCQu4B/linux-hwe-5.15-5.15.0/arch/x86/kvm/mmu/tdp_mmu.c:491:3

  The error occurred when the kernel attempted to delete an entry from a list.
  The issue may be timing-related and has been hard to reproduce consistently, so we have not been able to pinpoint the root cause.
  Note that the current kernel has replaced the list_head with atomic_t, as indicated by the following commit.

  d25ceb926436 KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages

  This patch does not change the logic that triggers the problem, but it replaces the list manipulation where the crash occurs with a counter update and leaves the surrounding logic unchanged.
  If the underlying condition still occurs, it should no longer result in an invalid memory access.
  We also asked the customer to set up a test environment and simulate a workload similar to the production environment.
  The patch worked well and did not introduce any adverse effects.

  [Test Plan]
  Reproducing the issue has proven to be challenging.
  Simulating heavy live migration activity in the customer's production environment is the appropriate approach to ensure that applying the patch will not result in any adverse effects.

  [Where problems could occur]
  The patch will impact the live migration workflow, but it only modifies the data structure in use, and no functionality will be altered.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2035166/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
