On Friday 08 July 2011 17:55:09 Bruce Dubbs wrote:
> I've been working on bootscripts. Basically, I'm rewriting them to get
> a better understanding. I may end up throwing them out completely but I
> want to discuss the issue of error handling.
>
> There are three bootscript files that use the
>
>    read ENTER
>
> construct: checkfs, udev, and functions.

Hi Bruce,

I'll throw in my $0.02, as this read ENTER has always been a thorn in my
side and I've been patching the lfs-bootscripts for a very long time to get
around it.

What I do:

I have a dedicated partition of 100MB for storing boot failure logs. I pass
an extra argument to grub:

    rescue-logs=/dev/sda6

However, with a bit of tweaking I can use rescue-logs=LABEL=rescue-logs just
as well.

Then in /etc/rc.d/init.d/functions I have this:

    # This is the partition where logs are saved in case of boot failure
    # this partition is never used for anything else, and is never mounted rw
    RESCUE_LOGS_PARTITION=none
    for i in $(cat /proc/cmdline); do
        case ${i} in
            rescue-logs=*)
                RESCUE_LOGS_PARTITION=${i#rescue-logs=}
                ;;
        esac
    done

    print_error_msg()
    {
        echo_failure
        # $i is inherited by the rc script
        boot_mesg -n "FAILURE:\n\nYou should not be reading this error message.\n\n" ${FAILURE}
        boot_mesg -n " It means that an unforeseen error took"
        boot_mesg -n " place in ${i}, which exited with a return value of"
        boot_mesg " ${error_value}.\n"
        boot_mesg_flush
        boot_mesg -n "If you're able to track this"
        boot_mesg -n " error down to a bug in one of the files provided by"
        boot_mesg -n " the LFS book, please be so kind to inform us at"
        boot_mesg " lfs-dev@linuxfromscratch.org.\n"
        boot_mesg_flush
        boot_mesg -n "\n\nWaiting ${TIMEOUT} seconds..." ${INFO}
        boot_mesg "" ${NORMAL}
        # Now try to save the error into the rescue log
        rescue_logs "Error in ${i}!!! Error value= ${error_value}"
        sleep ${TIMEOUT}
    }

    rescue_logs()
    {
        MESSAGE="$*"
        DATE=$(date +%Y-%m-%d-%H-%M-%S)
        LOG="/media/rescue-logs/failed-${DATE}.log"
        if [ x"${RESCUE_LOGS_PARTITION}" != x"none" ]; then
            # stdout is redirected first so that stderr is silenced as well
            if mount ${RESCUE_LOGS_PARTITION} /media/rescue-logs > /dev/null 2>&1; then
                echo "=== BOOT FAILURE on ${DATE} ===" > ${LOG}
                echo "${MESSAGE}" >> ${LOG}
                echo "=== END OF BOOT FAILURE on ${DATE} ===" >> ${LOG}
                echo -e "\n\n\n" >> ${LOG}
                umount /media/rescue-logs
            fi
        fi
    }

And then for example in /etc/rc.d/init.d/udev I have this:

    boot_mesg "Populating /dev with device nodes..."
    if ! grep -q '[[:space:]]sysfs' /proc/mounts; then
        echo_failure
        boot_mesg -n "FAILURE:\n\nUnable to create" ${FAILURE}
        boot_mesg -n " devices without a SysFS filesystem"
        boot_mesg -n "\n\nAfter you press Enter, this system"
        boot_mesg -n " will be rebooted for repair." ${INFO}
        boot_mesg "" ${NORMAL}
        # Now try to save the error into the rescue log
        rescue_logs "No SysFS filesystem"
        sleep ${TIMEOUT}
        reboot -f
    fi

Now, I use a grub trick that I believe Bruce posted to this mailing list a
few years ago: a grub env variable "recordfail" is set to 1 upon every boot
and is then cleared in case of a normal boot. In case of a failed boot, grub
picks a second entry, which is a rescue mode initrd with busybox. It tries to
get a DHCP lease or, in case of no DHCP server, it tries to find a free IP on
the same network. I had it then email me this temporary IP, but I removed
this as I had to include my email password in the initrd. Maybe there's a way
to encrypt the password, but I didn't look hard enough. This works even for
an internet-facing machine and I've successfully tested logging into my
machine from a different location, as long as my internet connection is
working.

Finally, I just ssh to this mini os, check the rescue logs, fix the problem,
reset the default grub entry and reboot. So far it has worked for me.

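For the rescue-logs= argument described above, here is a minimal sketch of the parsing step in isolation. The function name parse_rescue_logs is made up for illustration and is not part of the lfs-bootscripts; it takes the command line as an argument so it can be tried without rebooting. The sketch assumes a mount that accepts a LABEL=<name> source (the one from util-linux does), in which case the LABEL= form can be passed on unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not in lfs-bootscripts): print the value of the
# rescue-logs= argument found in a kernel command line, or "none".
# In a bootscript it would be called as:
#   RESCUE_LOGS_PARTITION=$(parse_rescue_logs "$(cat /proc/cmdline)")
parse_rescue_logs() {
    result=none
    # Word splitting on $1 is intentional: one word per kernel argument
    for arg in $1; do
        case ${arg} in
            rescue-logs=*)
                # Strip only the leading "rescue-logs=", so a value such
                # as LABEL=rescue-logs keeps its own "=" intact
                result=${arg#rescue-logs=}
                ;;
        esac
    done
    echo "${result}"
}

parse_rescue_logs "root=/dev/sda1 rescue-logs=/dev/sda6 ro"          # /dev/sda6
parse_rescue_logs "root=/dev/sda1 rescue-logs=LABEL=rescue-logs ro"  # LABEL=rescue-logs
parse_rescue_logs "root=/dev/sda1 ro"                                # none
```

Using a separate loop variable (arg here) instead of i also avoids touching the $i that print_error_msg expects to inherit from the rc script.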
As a matter of fact, I created this initrd in response to this email thread:
http://linuxfromscratch.org/pipermail/lfs-dev/2004-January/041720.html

My idea was to have a "self-healing" system as much as, and if at all,
possible: an initrd which will try to fix corrupted filesystems, or at least
provide a way for you to log into the system after a failed boot and allow
you to troubleshoot and fix problems yourself. For headless/keyboardless
machines this is a good thing.

My next crazy idea is a relocatable kernel which, with some black voodoo
magic and kexec, can be loaded in case a new kernel fails to load. Also an
initrd which boots either from hard disk or from a bootable cdrom/usb
thumbdrive/usb floppy/etc in case of hard disk failure.

My goal is to ensure that I can always reach my system even in case of
serious problems (except of course loss of power or internet connectivity).

I'm not saying this is the best approach, but I submit it to your attention
in case you find it, or parts of it, interesting.

>
> In the case of functions, the construct is used in print_error_msg that
> is only called from the rc script. It is not a fatal function.
>
> In checkfs, the construct is called in three different places. In two
> places it is followed immediately by a halt and one place a reboot.
>
> In udev, the construct is called in two places. In both cases, it is
> followed by a halt.
>
> The question is how to handle these errors in a headless or keyboardless
> system. The problems identified are pretty serious and it's doubtful
> anything could be written to the disk.
>
> I'm thinking about moving the messages/halt/reboot to the functions
> script so they all can be handled in one place. If we then have the
> functions script do:
>
> [ -e /etc/sysconfig/init_params ] && . /etc/sysconfig/init_params
>
> then when we want to optionally stop for the user to read something:
>
> # Wait for the user by default
> [ "${HEADLESS=0}" = "0" ] && read ENTER

I always replace the read ENTER with sleep 20 (or more if the message is
long). And I replace shutdown with reboot, which boots into rescue mode. To
me a Linux server should never make itself unavailable, either by waiting
indefinitely for user input at the console or by shutting itself down.

>
> To disable the need for a keyboard entry, the /etc/sysconfig/init_params
> file would define the following:
>
> HEADLESS=1
>
> --------
>
> The above would only apply to LFS bootscripts. I can't think of
> anything from BLFS or a third party that would need to stop the boot
> sequence to wait for the user to read a message.
>
> Should we integrate this into the LFS bootscripts?
>
>    -- Bruce

IvanK.
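
P.S. A minimal sketch of how the HEADLESS switch and the timed wait discussed above could live in one place in the functions script. The name wait_for_user is made up for illustration, and the 20 second default is only the sleep value mentioned above; neither is part of the current lfs-bootscripts. HEADLESS and TIMEOUT are assumed to come from /etc/sysconfig/init_params.

```shell
#!/bin/sh
# Hypothetical helper (not in lfs-bootscripts): pause so an error message
# can be read, without ever blocking forever on a machine with no keyboard.
wait_for_user() {
    if [ "${HEADLESS:-0}" = "0" ]; then
        # Default: a keyboard is expected, block until Enter is pressed
        read ENTER
    else
        # Headless: leave the message on screen for a while, then continue
        sleep "${TIMEOUT:-20}"
    fi
}
```

A script such as checkfs or udev would then call wait_for_user where it currently has read ENTER, and follow it with the halt or reboot it already performs.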