Public bug reported:

[Impact]
We have some Ubuntu 16.04 hosts (in Hyper-V) being used for testing Ubuntu 20.04 containers. As part of that testing, while attempting to take a memory dump of a container running SQL Server on Ubuntu 20.04 on the Ubuntu 16.04 host, we started seeing a kernel panic and core dump. It started happening after a specific Xenial kernel update on the host:

4.4.0-204-generic - systems that are crashing
4.4.0-201-generic - systems that are able to capture the dump

A note from the developer shows the following logging:

----
Now the following is output right after I attempt to start the dump.
(gdb, attach ###, generate-core-file /var/opt/mssql/log/rdorr.delme.core)

[Fri Mar 19 20:01:38 2021] systemd-journald[581]: Successfully sent stream file descriptor to service manager.
[Fri Mar 19 20:01:41 2021] cni0: port 9(vethdec5d2b7) entered forwarding state
[Fri Mar 19 20:02:42 2021] systemd-journald[581]: Successfully sent stream file descriptor to service manager.
[Fri Mar 19 20:03:04 2021] ------------[ cut here ]------------
[Fri Mar 19 20:03:04 2021] kernel BUG at /build/linux-qlAbvR/linux-4.4.0/mm/memory.c:3214!
[Fri Mar 19 20:03:04 2021] invalid opcode: 0000 [#1] SMP
[Fri Mar 19 20:03:04 2021] Modules linked in: veth vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs libcrc32c ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6_tables xt_comment xt_mark xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables br_netfilter bridge stp llc aufs overlay nls_utf8 isofs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd input_leds serio_raw i2c_piix4 hv_balloon hyperv_fb 8250_fintek joydev mac_hid autofs4 hid_generic hv_utils hid_hyperv ptp hv_netvsc hid hv_storvsc pps_core
[Fri Mar 19 20:03:04 2021]  hyperv_keyboard scsi_transport_fc psmouse pata_acpi hv_vmbus floppy fjes
[Fri Mar 19 20:03:04 2021] CPU: 1 PID: 24869 Comm: gdb Tainted: G W 4.4.0-204-generic #236-Ubuntu
[Fri Mar 19 20:03:04 2021] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 05/18/2018
[Fri Mar 19 20:03:04 2021] task: ffff880db9229c80 ti: ffff880d93b9c000 task.ti: ffff880d93b9c000
[Fri Mar 19 20:03:04 2021] RIP: 0010:[<ffffffff811cd93e>] [<ffffffff811cd93e>] handle_mm_fault+0x13de/0x1b80
[Fri Mar 19 20:03:04 2021] RSP: 0018:ffff880d93b9fc28 EFLAGS: 00010246
[Fri Mar 19 20:03:04 2021] RAX: 0000000000000100 RBX: 0000000000000000 RCX: 0000000000000120
[Fri Mar 19 20:03:04 2021] RDX: ffff880ea635f3e8 RSI: 00003ffffffff000 RDI: 0000000000000000
[Fri Mar 19 20:03:04 2021] RBP: ffff880d93b9fce8 R08: 00003ff32179a120 R09: 000000000000007d
[Fri Mar 19 20:03:04 2021] R10: ffff8800000003e8 R11: 00000000000003e8 R12: ffff8800ea672708
[Fri Mar 19 20:03:04 2021] R13: 0000000000000000 R14: 000000010247d000 R15: ffff8800f27fe400
[Fri Mar 19 20:03:04 2021] FS: 00007fdc26061600(0000) GS:ffff881025640000(0000) knlGS:0000000000000000
[Fri Mar 19 20:03:04 2021] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Mar 19 20:03:04 2021] CR2: 000055e3a0011290 CR3: 0000000d93ba4000 CR4: 0000000000160670
[Fri Mar 19 20:03:04 2021] Stack:
[Fri Mar 19 20:03:04 2021]  ffffffff81082929 fffffffffffffffd ffffffff81082252 ffff880d93b9fca8
[Fri Mar 19 20:03:04 2021]  ffffffff811c7bca ffff8800f27fe400 000000010247d000 ffff880e74a88090
[Fri Mar 19 20:03:04 2021]  000000003a98d7f0 ffff880e00000001 ffff8800000003e8 0000000000000017
[Fri Mar 19 20:03:04 2021] Call Trace:
[Fri Mar 19 20:03:04 2021]  [<ffffffff81082929>] ? mm_access+0x79/0xa0
[Fri Mar 19 20:03:04 2021]  [<ffffffff81082252>] ? mmput+0x12/0x130
[Fri Mar 19 20:03:04 2021]  [<ffffffff811c7bca>] ? follow_page_pte+0x1ca/0x3d0
[Fri Mar 19 20:03:04 2021]  [<ffffffff811c7fe4>] ? follow_page_mask+0x214/0x3a0
[Fri Mar 19 20:03:04 2021]  [<ffffffff811c82a0>] __get_user_pages+0x130/0x680
[Fri Mar 19 20:03:04 2021]  [<ffffffff8122b248>] ? path_openat+0x348/0x1360
[Fri Mar 19 20:03:04 2021]  [<ffffffff811c8b74>] get_user_pages+0x34/0x40
[Fri Mar 19 20:03:04 2021]  [<ffffffff811c90f4>] __access_remote_vm+0xe4/0x2d0
[Fri Mar 19 20:03:04 2021]  [<ffffffff811ef6ac>] ? alloc_pages_current+0x8c/0x110
[Fri Mar 19 20:03:04 2021]  [<ffffffff811cfe3f>] access_remote_vm+0x1f/0x30
[Fri Mar 19 20:03:04 2021]  [<ffffffff8128d3fa>] mem_rw.isra.16+0xfa/0x190
[Fri Mar 19 20:03:04 2021]  [<ffffffff8128d4c8>] mem_read+0x18/0x20
[Fri Mar 19 20:03:04 2021]  [<ffffffff8121c89b>] __vfs_read+0x1b/0x40
[Fri Mar 19 20:03:04 2021]  [<ffffffff8121d016>] vfs_read+0x86/0x130
[Fri Mar 19 20:03:04 2021]  [<ffffffff8121df65>] SyS_pread64+0x95/0xb0
[Fri Mar 19 20:03:04 2021]  [<ffffffff8186acdb>] entry_SYSCALL_64_fastpath+0x22/0xd0
[Fri Mar 19 20:03:04 2021] Code: d4 ee ff ff 48 8b 7d 98 89 45 88 e8 2d c7 fd ff 8b 45 88 89 c3 e9 be ee ff ff 48 8b bd 70 ff ff ff e8 c7 cf 69 00 e9 ad ee ff ff <0f> 0b 4c 89 e7 4c 89 9d 70 ff ff ff e8 f1 c9 00 00 85 c0 4c 8b
[Fri Mar 19 20:03:04 2021] RIP  [<ffffffff811cd93e>] handle_mm_fault+0x13de/0x1b80
[Fri Mar 19 20:03:04 2021]  RSP <ffff880d93b9fc28>
[Fri Mar 19 20:03:04 2021] ---[ end trace 9d28a7e662aea7df ]---
[Fri Mar 19 20:03:04 2021] systemd-journald[581]: Compressed data object 806 -> 548 using XZ
------------------------

We think the following code may be relevant to the crashing behavior. I think this is the relevant source for Ubuntu 4.4.0-204 (BTW, are you sure this is Ubuntu 20.04? 4.4.0 is a Xenial kernel):

memory.c\mm - ~ubuntu-kernel/ubuntu/+source/linux/+git/xenial (launchpad.net)

static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
                        unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
{
        ...
        /* A PROT_NONE fault should not end up here */
        BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));

The BUG_ON() above is line 3214.

We see the following fix but are not yet certain whether it is relevant. This is interesting:

mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing · torvalds/linux@38e0885 · GitHub

The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
PMDs respectively as requiring balancing upon a subsequent page fault.
User-defined PROT_NONE memory regions which also have this flag set will
not normally invoke the NUMA balancing code as do_page_fault() will send
a segfault to the process before handle_mm_fault() is even called.

However if access_remote_vm() is invoked to access a PROT_NONE region of
memory, handle_mm_fault() is called via faultin_page() and
__get_user_pages() without any access checks being performed, meaning the
NUMA balancing logic is incorrectly invoked on a non-NUMA memory region.
A simple means of triggering this problem is to access PROT_NONE mmap'd
memory using /proc/self/mem which reliably results in the NUMA handling
functions being invoked when CONFIG_NUMA_BALANCING is set.

This issue was reported in bugzilla (issue 99101) which includes some
simple repro code.

There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
added at commit c0e7cad to avoid accidentally provoking strange behavior
by attempting to apply NUMA balancing to pages that are in fact
PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

This patch moves the PROT_NONE check into mm/memory.c rather than
invoking BUG_ON() as faulting in these pages via faultin_page() is a
valid reason for reaching the NUMA check with the PROT_NONE page table
flag set and is therefore not always a bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101

We need help understanding how to prevent the kernel panic/core dump
while taking a memory dump of a Focal container on a Xenial host.
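For reference, here is a minimal userspace sketch of the trigger described in that commit message. This is an illustration based on the description above, not the actual repro code from bugzilla #99101; it assumes an affected kernel with CONFIG_NUMA_BALANCING enabled, and details such as whether the page must be populated before mprotect(PROT_NONE) are assumptions.

/*
 * Hypothetical repro sketch: read a PROT_NONE mapping through
 * /proc/self/mem, which reaches access_remote_vm() ->
 * __get_user_pages() -> handle_mm_fault() as in the call trace above.
 * On an affected kernel this path is reported to hit the BUG_ON() in
 * do_numa_page(); on a fixed kernel it should not trigger a kernel BUG.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    char buf[4096];

    /* Populate a page first, then make it PROT_NONE, so the VMA ends up
     * with none of VM_READ/VM_WRITE/VM_EXEC when it is faulted remotely. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 0xaa, len);
    if (mprotect(p, len, PROT_NONE)) { perror("mprotect"); return 1; }

    int fd = open("/proc/self/mem", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Like gdb's generate-core-file, this ends up in mem_rw() ->
     * access_remote_vm() in the kernel (see SyS_pread64 in the trace). */
    ssize_t n = pread(fd, buf, len, (off_t)(uintptr_t)p);
    printf("pread returned %zd\n", n);

    close(fd);
    return 0;
}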
[Test Plan]

Testing on a 16.04 Azure instance, follow these steps:

$ echo 'GRUB_FLAVOUR_ORDER="generic"' | sudo tee -a /etc/default/grub.d/99-custom.cfg
$ sudo apt install linux-generic
$ sudo reboot
# log in again and confirm the system booted with the 4.4 kernel
$ sudo apt install docker.io gdb
$ sudo docker pull mcr.microsoft.com/mssql/server:2019-latest
$ sudo docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=<YourStrong@Passw0rd>" \
    -p 1433:1433 --name sql1 -h sql1 \
    -d mcr.microsoft.com/mssql/server:2019-latest
$ ps -ef | grep sqlservr
$ sudo gdb -p $PID -ex generate-core-file
# A kernel BUG should be triggered
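If setting up Docker and SQL Server is inconvenient, a lighter-weight check along the same lines may also work. This is a hypothetical sketch, not part of the test plan above: it forks a child that owns a PROT_NONE region and reads that region through /proc/<pid>/mem, the same mem_rw()/access_remote_vm() path gdb uses, and it assumes ptrace permissions (e.g. yama ptrace_scope) allow a parent to read its child's memory.

/*
 * Hypothetical cross-process variant of the sketch above: the parent
 * reads a child's PROT_NONE page via /proc/<pid>/mem, mimicking what
 * gdb's generate-core-file does to the sqlservr process.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    if (pipe(pipefd)) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                        /* child */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        memset(p, 0x55, 4096);
        mprotect(p, 4096, PROT_NONE);      /* VMA now has no R/W/X flags */
        uintptr_t addr = (uintptr_t)p;
        write(pipefd[1], &addr, sizeof(addr));
        pause();                           /* wait to be killed */
        _exit(0);
    }

    uintptr_t addr;
    read(pipefd[0], &addr, sizeof(addr));  /* child's mapping address */

    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = pread(fd, buf, sizeof(buf), (off_t)addr);
    printf("pread returned %zd\n", n);     /* affected kernels BUG here */

    close(fd);
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return 0;
}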
[Where problems could occur]

The patch touches the mm subsystem, so there is always the potential for
significant regressions; in that case a revert and a re-spin would
probably be necessary. On the other hand, this patch has been included
in the mainline kernel since 4.8 without problems.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Incomplete

** Affects: linux (Ubuntu Xenial)
     Importance: Undecided
         Status: In Progress

** Tags: xenial

** Also affects: linux (Ubuntu Xenial)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Xenial)
       Status: New => In Progress

https://bugs.launchpad.net/bugs/1921211

Title:
  Taking a memory dump of user mode process on Xenial hosts causes
  bugcheck/kernel panic and core dump

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Xenial:
  In Progress