Updated the system to the latest firmware for the storage controller, 4.52 (and iLO to 2.50). With that in place I upgraded to Zesty again, and everything is still working fine.
From here I enabled intel_iommu=on and it booted, which is already an improvement.

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1679208

Title:
  Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with intel_iommu=on

Status in linux package in Ubuntu:
  Triaged

Bug description:
  TL;DR
  - one of our HP ProLiant DL360 Gen9 fails to boot with intel_iommu=on
  - the disk controller fails
  - Xenial seems to work for a while but then fails
  - Zesty 100% crashes on boot
  - An identical system seems to work, so need HW replace to finally confirm

  After the reboot the HW reports this on boot:
    Embedded RAID : Smart HBA H240ar Controller - Operation Failed
     - 1719-Slot 0 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)

  I tried several things (always redeploying Zesty with MAAS in between).
  I think my debugging might be helpful, and I wanted to keep the documentation in the bug in case you go another route or others find useful information in here.

  0. I retried what I did twice; fully reproducible. That is:
     0.1 install zesty
     0.2 change grub default cmdline in /etc/default/grub.d/50- to add intel_iommu=on
     0.3 sudo update-grub
     0.4 reboot

  1. I tried a Recovery boot from the boot options in grub.
     => Failed as well

  2. iLO rebooted via "request reboot" and as well via "full system reset"
     => both Failed

  3. Reboot the system as deployed by MAAS
     # /proc/cmdline before that
     BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     The orig grub.cfg is like http://paste.ubuntu.com/24305945/
     It reboots as-is.
     => Reboot worked

  4. Without a change to anything in /etc, run update-grub
     $ sudo update-grub
     Generating grub configuration file ...
     Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
     Found linux image: /boot/vmlinuz-4.10.0-14-generic
     Found initrd image: /boot/initrd.img-4.10.0-14-generic
     Adding boot menu entry for EFI firmware configuration
     done
     There was no diff between the new grub.cfg and the one I saved.
     => Reboot worked

  5. Add the intel_iommu=on arg
     $ sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' /etc/default/grub.d/50-curtin-settings.cfg
     $ sudo update-grub
     # Diff in grub.cfg really only is the iommu setting
     => Reboot Failed

     So this doesn't seem so much of a cloud-init/curtin/maas bug anymore to me
     - maybe intel_iommu behaves differently?
     - Check grub cfg pre/post - no change but the expected one?

  6. Install Xenial and do the same
     => Reboot working

  7. Upgrade to Z
     Since the Xenial system just worked, and one can assume that almost only the kernel is active this early in the boot process, I upgraded the working system with intel_iommu=on to Zesty.
     That would be 4.4.0-71-generic to 4.10.0-1
     On this upgrade I finally saw my I/O errors again :-/
     Note: these issues are hard to miss as they mount root as read-only.
     I wonder if they only ever appear with intel_iommu=on, as this is the only combo in which I ever saw them.

  8. Redeploy and upgrade to Z without intel_iommu=on enabled
     Then enable intel_iommu=on and reboot
     => Reboot Fail

     From here I rebooted into the Xenial kernel (which, since this is an upgrade, was still there).
     Here I saw:
       Loading Linux 4.4.0-71-generic ...
       Loading initial ramdisk ...
       error: invalid video mode specification `text'.
       Booting in blind mode

     Hrm, as outlined above the "blind mode" might be a red herring, but since this kernel worked before it might still be a red herring that swims in the initrd that got regenerated on the upgrade.
     => Xenial Kernel Reboot - works !!
     So "blind mode" is a red herring of some sort.

     But this might allow finding some logs
     => No
     It appears that the failing boot never got far enough to actually write anything.
     I see:
     1. the original xenial
     2. the upgraded zesty
     3. NOT THE zesty+iommu
     4. the xenial+iommu

     $ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
     Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
     Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
     Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
     Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on

  9. Trying to avoid HW replacement if not needed
     I was afraid I might need the HW to be replaced to be 100% sure, but this very much smells broken in SW to me already.
     To avoid an RT ticket for a replacement without real need, I asked to free up another system.
     So I finally could free up an identical machine.
     I especially checked the failing HP Smart Array controller; it has the same Product Version and FW revision on both machines.
     There things seem to work, so I might be down to replacing the HW :-/

  10. Get some messages of the fail
     With the following grub cmdline I got to see the fail:
       GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"

     It looks just like the one I found on the running system when intel_iommu=on is set on the Xenial kernel, where it happens later (sometimes minutes, sometimes days, but never without intel_iommu).
     But on zesty it seems to trigger 100% on boot, so the system does not even come up.

     I'll attach a few logs of the crashes, but the heads are
       [ 33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
       [ 618.567636] DMAR: DRHD: handling fault status reg 2
       [ 618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
       DMAR:[fault reason 06] PTE Read access is not set
     Or
       [ 159.779566] hpsa 0000:03:00.0: Command timed out.
       [ 159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00

     While it might be a HW issue, I am still filing this so that it is "findable" for anyone else in case it turns out not to be HW.
     But I assign myself for now to close/confirm once I have replaced the HW.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
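
For anyone trying to compare their own machine against the bug description above, the two checks it relies on (is intel_iommu=on on the kernel command line, and do the logs contain the DMAR fault or hpsa timeout heads) could be sketched roughly as follows. This is only a minimal sketch assuming POSIX sh and grep; the function names has_intel_iommu and count_failures are hypothetical, and the patterns cover only the two log heads quoted in the report, not every way the controller might fail.

```shell
#!/bin/sh
# Hypothetical helper: print "yes" if the given kernel command line
# contains intel_iommu=on as a separate argument, otherwise "no".
# $1: a kernel command line string, e.g. the contents of /proc/cmdline
has_intel_iommu() {
    case " $1 " in
        *" intel_iommu=on "*) echo yes ;;
        *) echo no ;;
    esac
}

# Hypothetical helper: read a kernel log on stdin and print the number of
# lines matching the failure heads quoted in the bug description
# (DMAR fault lines and hpsa "Command timed out" lines).
count_failures() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so mask the exit status to keep callers using "set -e" alive.
    grep -c -E 'DMAR:.*fault|hpsa .*Command timed out' || true
}
```

On a running system this might be used as `has_intel_iommu "$(cat /proc/cmdline)"` and `dmesg | count_failures` (reading dmesg may require root); a non-zero count with intel_iommu=on would match the pattern described in the report, while a zero count says nothing about hardware health.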