[Kernel-packages] [Bug 1934424] Re: kernel NULL pointer dereference during xen hibernation

Guilherme G. Piccoli Tue, 20 Jul 2021 08:35:50 -0700

Logs:
[  594.291317] ACPI: Hardware changed while hibernated, success doubtful!
[  594.411609] BUG: kernel NULL pointer dereference, address: 00000000000001f4
[  594.424658] #PF: supervisor write access in kernel mode
[  594.424660] #PF: error_code(0x0002) - not-present page
[  594.424661] PGD 0 P4D 0 
[  594.424665] Oops: 0002 [#1] SMP PTI
[  594.424668] CPU: 3 PID: 362 Comm: systemd-timesyn Not tainted 5.8.0-1036-aws 
#38~20.04.1-Ubuntu
[  594.424669] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[  594.424675] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40
[...]
[  594.424692] Call Trace:
[  594.424699]  xennet_start_xmit+0x158/0x570


By looking the assembly, this is where it fails:

ffffffff8182d8e0 <xennet_start_xmit>:
[...]
ffffffff8182da0a:       48 83 f8 0d             cmp    $0xd,%rax
ffffffff8182da0e:       0f 86 46 03 00 00       jbe    ffffffff8182dd5a 
<xennet_start_xmit+0x47a>

### pahole --hex -C netfront_queue  usr/lib/debug/boot/vmlinux-5.8.0-1035-aws 
|grep lock
###### spinlock_t                 tx_lock;              /* 0x1f4   0x4 */

ffffffff8182da14:       49 8d 86 f4 01 00 00    lea    0x1f4(%r14),%rax # <-- 
#rax = &queue->tx_lock
ffffffff8182da1b:       45 8b 4c 24 70          mov    0x70(%r12),%r9d
ffffffff8182da20:       45 2b 4c 24 74          sub    0x74(%r12),%r9d
ffffffff8182da25:       48 89 c7                mov    %rax,%rdi # <-- %rdi = 
%rax
ffffffff8182da28:       44 89 4d 94             mov    %r9d,-0x6c(%rbp)
ffffffff8182da2c:       48 89 85 78 ff ff ff    mov    %rax,-0x88(%rbp)
ffffffff8182da33:       e8 38 52 36 00          callq  ffffffff81b92c70 
<_raw_spin_lock_irqsave> #<-- OOPS here
[...]

By correlating with the code, we have this in C:

static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device 
*dev)
[...]
len = skb_headlen(skb);

spin_lock_irqsave(&queue->tx_lock, flags); // <<< HERE

if (unlikely(!netif_carrier_ok(dev) ||
[...]

Happens that queue->tx_lock is NULL.
What is interesting is the message:

[  594.291317] ACPI: Hardware changed while hibernated, success
doubtful!


So, it means the hibernation woke in a different compute node than it went 
sleeping. I'm still not 100% sure of why that would cause such OOPS...but I 
have 2 ideas to either prevent it and validate that hypothesis:

(a) To modprobe unload the xen network driver right before hibernation
and load it in the last stage of wake-up - I've done that myself in the
hibernation scripts. This would likely prevent this issue.

(b) For testing personnel : maybe you're able to "lock" the testing to
sleep/wake-up *always* in the same compute node/instance. If that
prevents the issue, we're sure that the difference between the nodes rom
sleep/wake-up is triggering some memory corruption in the device queue,
that is somewhat propagated to the kernel memory.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-aws in Ubuntu.
https://bugs.launchpad.net/bugs/1934424

Title:
  kernel NULL pointer dereference during xen hibernation

Status in linux-aws package in Ubuntu:
  New

Bug description:
  Encountered the following panic while doing hibernation/resume testing
  with linux-aws 5.8 on Focal on an m3.xlarge (xen) instance type:

  [  594.291317] ACPI: Hardware changed while hibernated, success doubtful!
  [  594.411609] BUG: kernel NULL pointer dereference, address: 00000000000001f4
  [  594.424658] #PF: supervisor write access in kernel mode
  [  594.424660] #PF: error_code(0x0002) - not-present page
  [  594.424661] PGD 0 P4D 0 
  [  594.424665] Oops: 0002 [#1] SMP PTI
  [  594.424668] CPU: 3 PID: 362 Comm: systemd-timesyn Not tainted 
5.8.0-1036-aws #38~20.04.1-Ubuntu
  [  594.424669] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
  [  594.424675] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40
  [  594.424678] Code: 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 9c 
58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 
17 75 07 4c 89 e0 41 5c 5d c3 89 c6 e8 e9 d1 56 ff 66 90
  [  594.424679] RSP: 0018:ffffc900004e3848 EFLAGS: 00010046
  [  594.424680] RAX: 0000000000000000 RBX: ffff8883bcc0d000 RCX: 
0000000000000e02
  [  594.424681] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 
00000000000001f4
  [  594.424682] RBP: ffffc900004e3850 R08: ffff8883b90b5ec0 R09: 
000000000000005a
  [  594.424683] R10: ffffc900004e3910 R11: 0000000000000000 R12: 
0000000000000206
  [  594.424684] R13: ffffea000ee42d40 R14: 0000000000000000 R15: 
0000000000000001
  [  594.424686] FS:  00007f65ba055980(0000) GS:ffff8883c0ac0000(0000) 
knlGS:0000000000000000
  [  594.424687] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  594.424688] CR2: 00000000000001f4 CR3: 00000003b99f0001 CR4: 
00000000001606e0
  [  594.424692] Call Trace:
  [  594.424699]  xennet_start_xmit+0x158/0x570
  [  594.424704]  dev_hard_start_xmit+0x91/0x1f0
  [  594.424706]  ? validate_xmit_skb+0x300/0x340
  [  594.424710]  sch_direct_xmit+0x113/0x340
  [  594.424712]  __dev_queue_xmit+0x57c/0x8e0
  [  594.424714]  ? neigh_add_timer+0x37/0x60
  [  594.424716]  dev_queue_xmit+0x10/0x20
  [  594.424717]  neigh_resolve_output+0x112/0x1c0
  [  594.424721]  ip_finish_output2+0x19b/0x590
  [  594.424723]  __ip_finish_output+0xc8/0x1e0
  [  594.424725]  ip_finish_output+0x2d/0xb0
  [  594.424728]  ip_output+0x7a/0xf0
  [  594.424730]  ? __ip_finish_output+0x1e0/0x1e0
  [  594.424732]  ip_local_out+0x3d/0x50
  [  594.424734]  ip_send_skb+0x19/0x40
  [  594.424737]  udp_send_skb.isra.0+0x165/0x390
  [  594.424739]  udp_sendmsg+0xb0e/0xd50
  [  594.424742]  ? ip_reply_glue_bits+0x50/0x50
  [  594.424747]  ? delete_from_swap_cache+0x6a/0x90
  [  594.424750]  ? _cond_resched+0x19/0x30
  [  594.424754]  ? aa_sk_perm+0x43/0x1b0
  [  594.424757]  inet_sendmsg+0x65/0x70
  [  594.424759]  ? security_socket_sendmsg+0x35/0x50
  [  594.424760]  ? inet_sendmsg+0x65/0x70
  [  594.424764]  sock_sendmsg+0x5e/0x70
  [  594.424766]  __sys_sendto+0x113/0x190
  [  594.424770]  ? __secure_computing+0x42/0xe0
  [  594.424774]  ? syscall_trace_enter+0x10d/0x280
  [  594.424777]  __x64_sys_sendto+0x29/0x30
  [  594.424781]  do_syscall_64+0x49/0xc0
  [  594.424783]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [  594.424785] RIP: 0033:0x7f65baee4844
  [  594.424788] Code: 42 3f f7 ff 44 8b 4c 24 2c 4c 8b 44 24 20 89 c5 44 8b 54 
24 28 48 8b 54 24 18 b8 2c 00 00 00 48 8b 74 24 10 8b 7c 24 08 0f 05 <48> 3d 00 
f0 ff ff 77 30 89 ef 48 89 44 24 08 e8 68 3f f7 ff 48 8b
  [  594.424789] RSP: 002b:00007ffe9b5fd3a0 EFLAGS: 00000293 ORIG_RAX: 
000000000000002c
  [  594.424790] RAX: ffffffffffffffda RBX: 00007ffe9b5fd4e0 RCX: 
00007f65baee4844
  [  594.424791] RDX: 0000000000000030 RSI: 00007ffe9b5fd3f0 RDI: 
0000000000000010
  [  594.424792] RBP: 0000000000000000 R08: 0000560426541678 R09: 
0000000000000010
  [  594.424793] R10: 0000000000000040 R11: 0000000000000293 R12: 
0000000000000000
  [  594.424794] R13: 00007ffe9b5fd3e4 R14: 0000000000000068 R15: 
0000000000000000
  [  594.424796] Modules linked in: btrfs blake2b_generic xor raid6_pq ufs 
msdos xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd 
psmouse cryptd input_leds glue_helper serio_raw floppy sch_fq_codel drm 
ip_tables x_tables autofs4
  [  594.424813] CR2: 00000000000001f4
  [  594.424821] ---[ end trace bb5f35055c1a8060 ]---
  [  594.424822] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40
  [  594.424824] Code: 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 9c 
58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 
17 75 07 4c 89 e0 41 5c 5d c3 89 c6 e8 e9 d1 56 ff 66 90
  [  594.424824] RSP: 0018:ffffc900004e3848 EFLAGS: 00010046
  [  594.424825] RAX: 0000000000000000 RBX: ffff8883bcc0d000 RCX: 
0000000000000e02
  [  594.424826] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 
00000000000001f4
  [  594.424826] RBP: ffffc900004e3850 R08: ffff8883b90b5ec0 R09: 
000000000000005a
  [  594.424827] R10: ffffc900004e3910 R11: 0000000000000000 R12: 
0000000000000206
  [  594.424827] R13: ffffea000ee42d40 R14: 0000000000000000 R15: 
0000000000000001
  [  594.424828] FS:  00007f65ba055980(0000) GS:ffff8883c0ac0000(0000) 
knlGS:0000000000000000
  [  594.424829] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  594.424830] CR2: 00000000000001f4 CR3: 00000003b99f0001 CR4: 
00000000001606e0
  [  594.424830] Kernel panic - not syncing: Fatal exception in interrupt
  [  594.425091] Kernel Offset: disabled

  This log was captured from the AWS system log (which is not a live
  console as this is a xen instance type). From the user perspective,
  the instance appears to have just rebooted during the hibernation /
  resume. It's only by logging the console that we have any information
  at all.

  This is not a frequent event. From 100 test runs on this instance, two
  failures were observed: this failure and another which left no traces
  of the issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1934424/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1934424] Re: kernel NULL pointer dereference during xen hibernation

Reply via email to