So far the best I could reach is get the kernel call trace but not crashing the node yet.
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.306845] INFO: task java:1932 blocked for more than 120 seconds. Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.308573] Not tainted 4.4.0-1069-aws #79-Ubuntu Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.309944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312015] java D ffff88032dc57db0 0 1932 1 0x00000000 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312019] ffff88032dc57db0 0000000000000246 ffff88041077e900 ffff88032dcae900 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312022] ffff88032dc58000 0000000000074da7 ffff88040dbdf888 ffff88040dbdf824 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312024] ffff88032dc57df8 ffff88032dc57dc8 ffffffff81823825 ffff88040dbdf800 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312027] Call Trace: Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312035] [<ffffffff81823825>] schedule+0x35/0x80 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312042] [<ffffffff812f8309>] jbd2_log_wait_commit+0xa9/0x130 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312045] [<ffffffff810c77b0>] ? wake_atomic_t_function+0x60/0x60 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312048] [<ffffffff812fae21>] __jbd2_journal_force_commit+0x61/0xb0 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312049] [<ffffffff812faeb1>] jbd2_journal_force_commit+0x21/0x30 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312053] [<ffffffff812c71c9>] ext4_force_commit+0x29/0x30 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312057] [<ffffffff8129a7c1>] ext4_sync_file+0x1a1/0x350 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312065] [<ffffffff81247bae>] vfs_fsync_range+0x4e/0xb0 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312072] [<ffffffff81247c6d>] do_fsync+0x3d/0x70 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312074] [<ffffffff81247f10>] SyS_fsync+0x10/0x20 Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312077] [<ffffffff818279ce>] entry_SYSCALL_64_fastpath+0x22/0xc1 Oct 12 15:35:01 ip-10-16-21-10 CRON[31352]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) using this simple script writing and reading same file in parallel from different process and let it run for while in my case I got the first call trace after 3h. while true do dd iflag=nocache if=/dev/zero of=/tmp/myfilesource$1 bs=1k count=$2 dd iflag=nocache if=/tmp/myfilesource$1 of=/tmp/myfileout$1 bs=1k count=$2 done root@ip-10-16-21-10:~# jobs [1] Running nohup ./mytest1.sh 1 1024000& [2]- Running nohup ./mytest1.sh 1 1024 & [3]+ Running nohup ./mytest1.sh 1 102400 & -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu. https://bugs.launchpad.net/bugs/1788035 Title: nvme: avoid cqe corruption Status in linux package in Ubuntu: In Progress Status in linux source package in Xenial: Fix Released Bug description: To address customer-reported NVMe issue with instance types (notably c5 and m5) that expose EBS volumes as NVMe devices, this commit from mainline v4.6 should be backported to Xenial: d783e0bd02e700e7a893ef4fa71c69438ac1c276 nvme: avoid cqe corruption when update at the same time as read dmesg sample: [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, aborting [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 9 QID 1 timeout, aborting [Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 21 QID 2 timeout, aborting [Wed Aug 15 01:11:32 2018] nvme 0000:00:1f.0: I/O 10 QID 1 timeout, aborting [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, reset controller [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 21 QID 2 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 2 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 2 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 8 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 9 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 41943136 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 10 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007 [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 6976 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 24 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 25 QID 1 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 2 QID 0 [Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x7 [Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: fffffffc [Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 96 [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000687 ("xlog_iodone") error 5 numblks 64 [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428 [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Log I/O Error Detected. Shutting down filesystem [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Please umount the filesystem and rectify the problem(s) [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 872, lost async page write [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5. [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_iunlink_remove: xfs_imap_to_bp returned error -5. [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 873, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 874, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 875, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 876, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 877, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 878, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 879, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 880, lost async page write [Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 881, lost async page write [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000697 ("xlog_iodone") error 5 numblks 64 [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428 [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000699 ("xlog_iodone") error 5 numblks 64 [Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428 [Wed Aug 15 01:12:20 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, aborting [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 23 QID 1 timeout, aborting [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 1 timeout, aborting [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 25 QID 1 timeout, aborting [Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 2 timeout, aborting [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2 [Wed Aug 15 01:12:50 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. [Wed Aug 15 01:12:52 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, reset controller [Wed Aug 15 01:13:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. [Wed Aug 15 01:13:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. [Wed Aug 15 01:14:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788035/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp