I debugged this problem a bit. It stems from the initramfs attempting to
use /dev/console (which refers to the nonexistent /dev/ttyS0): its
logging functions unexpectedly return errors, and that breaks everything
that depends on them.

You may have already noticed that when this happens, 100% CPU time is
consumed. If you enable sysrq keys with sysrq_always_enabled=1, and dump
the task list (e.g. virsh send-key ubuntu18.04 KEY_LEFTALT KEY_SYSRQ
KEY_T), you'll notice that there's always a combination of
console_setup/loadkeys/setfont processes with ever-growing PIDs, which
likely means that something is running them in a tight loop.

Now, if you patch "panic()" in
/usr/share/initramfs-tools/scripts/functions so that it writes its
argument to the kernel log (echo "panic 1: " "$@" >/dev/kmsg), you'll
see that the panic reason is "filesystem on /dev/vda1 requires manual
fsck", and that it's printed in a loop. Indeed, checkfs() does contain a
loop:

checkfs()
{
        while ! _checkfs_once "$@"; do
                panic "The $2 filesystem on $1 requires a manual fsck"
        done
}

This is actually a bogus error. The filesystem is (most likely) fine.
There's no fsck included in initramfs, so what happens is that the
following fragment is executed:

        if ! command -v fsck >/dev/null 2>&1; then
                log_warning_msg "fsck not present, so skipping $NAME file system"
                return
        fi

log_warning_msg, however, returns a non-zero status because stdout is
broken. A bare "return" passes on the status of the last command, so
_checkfs_once returns a non-zero status as well.
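
To illustrate the mechanism, here is a minimal sketch, not the actual
initramfs-tools code. The names log_warning_msg_demo, checkfs_once_demo
and no_such_fsck_binary are made up for the example, and the broken
console is only simulated by closing stdout, which is an assumption
about how the failure looks to the shell.

```shell
#!/bin/sh
# Hypothetical stand-in for the real logging helper: it only writes to stdout.
log_warning_msg_demo() {
    echo "Warning: $*"
}

# Simplified stand-in for _checkfs_once: the "fsck is missing" branch
# logs a warning and then uses a bare "return".
checkfs_once_demo() {
    if ! command -v no_such_fsck_binary >/dev/null 2>&1; then
        log_warning_msg_demo "fsck not present, so skipping file system"
        return    # bare return: passes on the status of the log call above
    fi
}

# Working stdout: the warning is written and the function succeeds.
checkfs_once_demo >/dev/null
healthy_status=$?

# Broken stdout (simulated by closing fd 1): the write fails, and the
# failure leaks out of the function as its return status.
checkfs_once_demo >&- 2>/dev/null
broken_status=$?

echo "healthy=$healthy_status broken=$broken_status"
```

With a working stdout the function reports success. With a broken one
it reports failure, even though nothing is wrong with the filesystem.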

panic doesn't work correctly either: it can't spawn a shell on the
broken /dev/console and returns immediately, which is what turns the
loop in checkfs into an infinite one.


My thoughts on a solution:

First, debugging this is a PITA. Adding a serial device might be a
perfectly acceptable fix for many, but when this issue happens,
absolutely nothing on the console indicates that this is what's
missing. Even if it's necessary to leave ttyS0 as the main console,
initramfs should at least warn the user (through kmsg) that
/dev/console is broken.
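
A rough sketch of what such a warning could look like. The helpers
console_usable and warn_if_console_broken are hypothetical and not part
of initramfs-tools, and the sketch assumes that the breakage shows up
as a failure to open or write the device. Note that the probe writes a
newline to the console when it does work.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if a newline can be written to the
# given device. The braces make the 2>/dev/null also cover errors from
# the >> redirection itself.
console_usable() {
    { printf '\n' >>"$1"; } 2>/dev/null
}

# Hypothetical helper: if the console is unusable, report it to the
# given log target and return non-zero.
warn_if_console_broken() {
    console=$1
    log_target=$2
    if ! console_usable "$console"; then
        echo "initramfs: $console is not usable, check the console= kernel parameter" >>"$log_target"
        return 1
    fi
    return 0
}
```

In an initramfs script this would presumably be called early on as
warn_if_console_broken /dev/console /dev/kmsg, so that the message
lands in the kernel log regardless of the state of /dev/console.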

Second, it is a bug that an error returned by a logging function makes
_checkfs_once return an error as well. I think errors in _log_msg
should be suppressed. If you do that, the boot will succeed unless a
panic happens (which is rare).
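
A minimal sketch of the idea, assuming the logging helper boils down to
a single printf call. log_msg_plain and log_msg_quiet are hypothetical
names, not the real _log_msg, and the broken console is again simulated
by closing stdout.

```shell
#!/bin/sh
# Hypothetical logging helper without error suppression.
log_msg_plain() {
    printf '%s\n' "$*"
}

# Hypothetical variant that never fails: write errors are discarded and
# the return status is forced to success.
log_msg_quiet() {
    printf '%s\n' "$*" 2>/dev/null || true
}

# Simulate the broken console by closing stdout for both calls.
log_msg_plain "fsck not present" >&- 2>/dev/null
plain_status=$?

log_msg_quiet "fsck not present" >&-
quiet_status=$?

echo "plain=$plain_status quiet=$quiet_status"
```

The trade-off is that a failed log write goes unnoticed, which seems
acceptable here, since a lost warning is far less harmful than a boot
that never finishes.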

Third, as Grant Emsley said, maybe ttyS0 doesn't really have to be the
main console?

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1573095

Title:
  Cloud images fail to boot when a serial port is not available

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/1573095/+subscriptions
