Logs: [ 594.291317] ACPI: Hardware changed while hibernated, success doubtful! [ 594.411609] BUG: kernel NULL pointer dereference, address: 00000000000001f4 [ 594.424658] #PF: supervisor write access in kernel mode [ 594.424660] #PF: error_code(0x0002) - not-present page [ 594.424661] PGD 0 P4D 0 [ 594.424665] Oops: 0002 [#1] SMP PTI [ 594.424668] CPU: 3 PID: 362 Comm: systemd-timesyn Not tainted 5.8.0-1036-aws #38~20.04.1-Ubuntu [ 594.424669] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006 [ 594.424675] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40 [...] [ 594.424692] Call Trace: [ 594.424699] xennet_start_xmit+0x158/0x570
By looking the assembly, this is where it fails: ffffffff8182d8e0 <xennet_start_xmit>: [...] ffffffff8182da0a: 48 83 f8 0d cmp $0xd,%rax ffffffff8182da0e: 0f 86 46 03 00 00 jbe ffffffff8182dd5a <xennet_start_xmit+0x47a> ### pahole --hex -C netfront_queue usr/lib/debug/boot/vmlinux-5.8.0-1035-aws |grep lock ###### spinlock_t tx_lock; /* 0x1f4 0x4 */ ffffffff8182da14: 49 8d 86 f4 01 00 00 lea 0x1f4(%r14),%rax # <-- #rax = &queue->tx_lock ffffffff8182da1b: 45 8b 4c 24 70 mov 0x70(%r12),%r9d ffffffff8182da20: 45 2b 4c 24 74 sub 0x74(%r12),%r9d ffffffff8182da25: 48 89 c7 mov %rax,%rdi # <-- %rdi = %rax ffffffff8182da28: 44 89 4d 94 mov %r9d,-0x6c(%rbp) ffffffff8182da2c: 48 89 85 78 ff ff ff mov %rax,-0x88(%rbp) ffffffff8182da33: e8 38 52 36 00 callq ffffffff81b92c70 <_raw_spin_lock_irqsave> #<-- OOPS here [...] By correlating with the code, we have this in C: static netdev_tx_t xennet_start_xmit(struct sk_buff *skb, struct net_device *dev) [...] len = skb_headlen(skb); spin_lock_irqsave(&queue->tx_lock, flags); // <<< HERE if (unlikely(!netif_carrier_ok(dev) || [...] Happens that queue->tx_lock is NULL. What is interesting is the message: [ 594.291317] ACPI: Hardware changed while hibernated, success doubtful! So, it means the hibernation woke in a different compute node than it went sleeping. I'm still not 100% sure of why that would cause such OOPS...but I have 2 ideas to either prevent it and validate that hypothesis: (a) To modprobe unload the xen network driver right before hibernation and load it in the last stage of wake-up - I've done that myself in the hibernation scripts. This would likely prevent this issue. (b) For testing personnel : maybe you're able to "lock" the testing to sleep/wake-up *always* in the same compute node/instance. If that prevents the issue, we're sure that the difference between the nodes rom sleep/wake-up is triggering some memory corruption in the device queue, that is somewhat propagated to the kernel memory. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu. https://bugs.launchpad.net/bugs/1934424 Title: kernel NULL pointer dereference during xen hibernation Status in linux-aws package in Ubuntu: New Bug description: Encountered the following panic while doing hibernation/resume testing with linux-aws 5.8 on Focal on an m3.xlarge (xen) instance type: [ 594.291317] ACPI: Hardware changed while hibernated, success doubtful! [ 594.411609] BUG: kernel NULL pointer dereference, address: 00000000000001f4 [ 594.424658] #PF: supervisor write access in kernel mode [ 594.424660] #PF: error_code(0x0002) - not-present page [ 594.424661] PGD 0 P4D 0 [ 594.424665] Oops: 0002 [#1] SMP PTI [ 594.424668] CPU: 3 PID: 362 Comm: systemd-timesyn Not tainted 5.8.0-1036-aws #38~20.04.1-Ubuntu [ 594.424669] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006 [ 594.424675] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40 [ 594.424678] Code: 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 9c 58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 07 4c 89 e0 41 5c 5d c3 89 c6 e8 e9 d1 56 ff 66 90 [ 594.424679] RSP: 0018:ffffc900004e3848 EFLAGS: 00010046 [ 594.424680] RAX: 0000000000000000 RBX: ffff8883bcc0d000 RCX: 0000000000000e02 [ 594.424681] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000000001f4 [ 594.424682] RBP: ffffc900004e3850 R08: ffff8883b90b5ec0 R09: 000000000000005a [ 594.424683] R10: ffffc900004e3910 R11: 0000000000000000 R12: 0000000000000206 [ 594.424684] R13: ffffea000ee42d40 R14: 0000000000000000 R15: 0000000000000001 [ 594.424686] FS: 00007f65ba055980(0000) GS:ffff8883c0ac0000(0000) knlGS:0000000000000000 [ 594.424687] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 594.424688] CR2: 00000000000001f4 CR3: 00000003b99f0001 CR4: 00000000001606e0 [ 594.424692] Call Trace: [ 594.424699] xennet_start_xmit+0x158/0x570 [ 594.424704] dev_hard_start_xmit+0x91/0x1f0 [ 594.424706] ? validate_xmit_skb+0x300/0x340 [ 594.424710] sch_direct_xmit+0x113/0x340 [ 594.424712] __dev_queue_xmit+0x57c/0x8e0 [ 594.424714] ? neigh_add_timer+0x37/0x60 [ 594.424716] dev_queue_xmit+0x10/0x20 [ 594.424717] neigh_resolve_output+0x112/0x1c0 [ 594.424721] ip_finish_output2+0x19b/0x590 [ 594.424723] __ip_finish_output+0xc8/0x1e0 [ 594.424725] ip_finish_output+0x2d/0xb0 [ 594.424728] ip_output+0x7a/0xf0 [ 594.424730] ? __ip_finish_output+0x1e0/0x1e0 [ 594.424732] ip_local_out+0x3d/0x50 [ 594.424734] ip_send_skb+0x19/0x40 [ 594.424737] udp_send_skb.isra.0+0x165/0x390 [ 594.424739] udp_sendmsg+0xb0e/0xd50 [ 594.424742] ? ip_reply_glue_bits+0x50/0x50 [ 594.424747] ? delete_from_swap_cache+0x6a/0x90 [ 594.424750] ? _cond_resched+0x19/0x30 [ 594.424754] ? aa_sk_perm+0x43/0x1b0 [ 594.424757] inet_sendmsg+0x65/0x70 [ 594.424759] ? security_socket_sendmsg+0x35/0x50 [ 594.424760] ? inet_sendmsg+0x65/0x70 [ 594.424764] sock_sendmsg+0x5e/0x70 [ 594.424766] __sys_sendto+0x113/0x190 [ 594.424770] ? __secure_computing+0x42/0xe0 [ 594.424774] ? syscall_trace_enter+0x10d/0x280 [ 594.424777] __x64_sys_sendto+0x29/0x30 [ 594.424781] do_syscall_64+0x49/0xc0 [ 594.424783] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 594.424785] RIP: 0033:0x7f65baee4844 [ 594.424788] Code: 42 3f f7 ff 44 8b 4c 24 2c 4c 8b 44 24 20 89 c5 44 8b 54 24 28 48 8b 54 24 18 b8 2c 00 00 00 48 8b 74 24 10 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 30 89 ef 48 89 44 24 08 e8 68 3f f7 ff 48 8b [ 594.424789] RSP: 002b:00007ffe9b5fd3a0 EFLAGS: 00000293 ORIG_RAX: 000000000000002c [ 594.424790] RAX: ffffffffffffffda RBX: 00007ffe9b5fd4e0 RCX: 00007f65baee4844 [ 594.424791] RDX: 0000000000000030 RSI: 00007ffe9b5fd3f0 RDI: 0000000000000010 [ 594.424792] RBP: 0000000000000000 R08: 0000560426541678 R09: 0000000000000010 [ 594.424793] R10: 0000000000000040 R11: 0000000000000293 R12: 0000000000000000 [ 594.424794] R13: 00007ffe9b5fd3e4 R14: 0000000000000068 R15: 0000000000000000 [ 594.424796] Modules linked in: btrfs blake2b_generic xor raid6_pq ufs msdos xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd psmouse cryptd input_leds glue_helper serio_raw floppy sch_fq_codel drm ip_tables x_tables autofs4 [ 594.424813] CR2: 00000000000001f4 [ 594.424821] ---[ end trace bb5f35055c1a8060 ]--- [ 594.424822] RIP: 0010:_raw_spin_lock_irqsave+0x23/0x40 [ 594.424824] Code: 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 9c 58 0f 1f 44 00 00 49 89 c4 fa 66 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 07 4c 89 e0 41 5c 5d c3 89 c6 e8 e9 d1 56 ff 66 90 [ 594.424824] RSP: 0018:ffffc900004e3848 EFLAGS: 00010046 [ 594.424825] RAX: 0000000000000000 RBX: ffff8883bcc0d000 RCX: 0000000000000e02 [ 594.424826] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000000001f4 [ 594.424826] RBP: ffffc900004e3850 R08: ffff8883b90b5ec0 R09: 000000000000005a [ 594.424827] R10: ffffc900004e3910 R11: 0000000000000000 R12: 0000000000000206 [ 594.424827] R13: ffffea000ee42d40 R14: 0000000000000000 R15: 0000000000000001 [ 594.424828] FS: 00007f65ba055980(0000) GS:ffff8883c0ac0000(0000) knlGS:0000000000000000 [ 594.424829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 594.424830] CR2: 00000000000001f4 CR3: 00000003b99f0001 CR4: 00000000001606e0 [ 594.424830] Kernel panic - not syncing: Fatal exception in interrupt [ 594.425091] Kernel Offset: disabled This log was captured from the AWS system log (which is not a live console as this is a xen instance type). From the user perspective, the instance appears to have just rebooted during the hibernation / resume. It's only by logging the console that we have any information at all. This is not a frequent event. From 100 test runs on this instance, two failures were observed: this failure and another which left no traces of the issue. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1934424/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp