As I was bisecting the commits, I was trying to take advantage of parallelism: while my test kernel was building, I would deploy a clean AWS r5.metal instance. I started seeing test kernels boot that I would not have expected to boot. So as a sanity test, I deployed an r5.metal instance, let it sit idle for 20 minutes, and then installed the known problematic 4.15.0-1113-aws kernel. Sure enough, it booted fine. I tried the same thing again, letting it sit idle for 20 minutes, and it worked again. So this does appear to be a race condition, which I think also explains some of the erratic test results I have seen while looking at this bug. Fortunately, the console output gave us some definitive proof as to where the problem was occurring.
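For reference, the sanity test amounts to roughly the following. This is a sketch only: the AMI ID and key name are placeholders, and it assumes the bionic 'linux-aws' metapackage is what pulls in 4.15.0-1113-aws.

  # Deploy a clean r5.metal instance (AMI ID and key name are placeholders).
  aws ec2 run-instances --instance-type r5.metal \
      --image-id ami-XXXXXXXXXXXXXXXXX --key-name my-key

  # On the instance: installing 4.15 right away hangs at reboot, but
  # letting the instance sit idle ~20 minutes first lets it boot fine.
  sleep 1200
  sudo apt-get update && sudo apt-get install -y linux-aws   # 4.15.0-1113-aws
  sudo reboot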
With that being said, it appears I have found the offending commits:

  PCI/MSI: Enforce that MSI-X table entry is masked for update
  PCI/MSI: Enforce MSI[X] entry updates to be visible

  https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic/commit/?id=27571f5ea1dd074924b41a455c50dc2278e8c2b7
  https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic/commit/?id=2478f358c2b35fea04e005447ce99ad8dc53fd5d

More specifically, the hang is introduced by 'PCI/MSI: Enforce that MSI-X table entry is masked for update', but it is not a clean revert without also reverting the other commit, so for a quick test confirmation I reverted both. I have not yet had a chance to determine why these commits cause the problem, but with both reverted in a test build on top of 4.15.0-1113-aws, I can migrate from 5.4 to 4.15 as soon as the instance is available. I have made at least six attempts now and all have passed, while the same steps without the reverts all hang (unless I wait 20 minutes first). A rough sketch of the revert test is at the end of this message.

https://bugs.launchpad.net/bugs/1946149

Title:
  Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal

Status in linux-aws package in Ubuntu:
  New

Bug description:
  When creating an r5.metal instance on AWS, the default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws), the machine fails to boot the 4.15 kernel.

  If I remove these patches, the instance correctly boots the 4.15 kernel:
  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  With that being said, after successfully updating to 4.15 without those patches applied, I can then upgrade to a 4.15 kernel with the above patches included, and the instance boots properly.

  This problem only appears on metal instances, which use NVMe instead of xvda devices. AWS instances also use the 'discard' mount option with ext4, so I thought there might be a race condition between ext4 discard and the journal flush. I removed 'discard' from the mount options and rebooted the 5.4 kernel prior to installing the 4.15 kernel, but the instance still would not boot after the 4.15 kernel was installed.

  I have been unable to capture a stack trace using 'aws get-console-output'. After enabling kdump I was unable to replicate the failure. So there must be some sort of race with either ext4 and/or nvme.
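For anyone who wants to reproduce the confirmation, the quick revert test is roughly the following. This is a sketch only: the checkout tag for 4.15.0-1113-aws is an assumption, the revert order may need to be swapped to apply cleanly, and the build step is the generic Ubuntu kernel packaging invocation rather than anything specific to this test.

  # Sketch only: tag name is assumed; revert order may need adjusting.
  git clone https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic linux-aws-bionic
  cd linux-aws-bionic
  git checkout -b msix-revert-test Ubuntu-aws-4.15.0-1113.120  # assumed tag for 4.15.0-1113-aws

  # Revert both commits; 'masked for update' alone does not revert cleanly.
  git revert 2478f358c2b35fea04e005447ce99ad8dc53fd5d
  git revert 27571f5ea1dd074924b41a455c50dc2278e8c2b7

  # Rebuild the kernel packages and install the resulting .debs on the instance.
  fakeroot debian/rules clean
  fakeroot debian/rules binary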