Reinventing the Hurd server bootstrap

Justus Winter Fri, 26 Dec 2014 04:59:24 -0800

Hello :)

I've been thinking about how to improve the early Hurd server
bootstrap.  To be clear, this is not about `bootstrap' as in sysvinit,
but about how to start the very first Hurd servers.


Now we all know that system bootstrap is an emotional subject, so I'd
like to address any concerns you might have about my proposal.  Also,
I'd love to get you excited about my approach, because it empowers
developers to hack the bootstrap procedure and users to rescue their
systems.

Currently, this process is fragile.  The Hurd server bootstrap is
always the first thing that breaks (which isn't really surprising,
since it is the first thing that runs).  But what's much worse is
that the process is neither hackable nor interactive.

What does the current bootstrap look like ?

The bootstrap process is spread across the bootscript parser (either
in gnumach, or in boot), libdiskfs, exec, startup, proc, and auth.  All
of these components perform some tasks, start or resume other servers,
plug ports into each other, and often just wait for another server.

Why is it so fragile ?

If any of the servers breaks, the whole process most often just hangs,
displaying nothing, leaving the user puzzled and incapable of
inspecting and fixing her system.

And why is it not very hackable ?

This whole mechanism is implemented across these components.  You
cannot easily extend or modify it without being aware of the whole
bootstrap procedure.  You cannot omit any of the components.  The
process is so inflexible that we hard-coded the individual components'
names into RPCs (startup_procinit, startup_authinit).

How do we fix it ?

I propose to create a self-contained shell that executes a boot
script.  We expose enough Mach/Hurd functionality to implement the
whole bootstrap procedure, e.g. task-set-bootstrap-port,
make-send-right, task-resume, and so on.  If anything goes wrong, we
drop the user into an interactive REPL.

Wait, that sounds like serverboot, doesn't it ?

It does.  I think it was a mistake to abandon it in the first place.
Evidence for that: 1.  We now have two copies of the bootscript
parser, one in the kernel, one in boot.  2.  We moved non-essential
functionality (with lots of string parsing) into the kernel.  3.  The
current approach makes strong assumptions about the platform, e.g. it
assumes a bootloader that can load modules like GRUB does with the
multiboot protocol.

But we don't use some made-up shell-like language, we use a Scheme
interpreter.

\o/

Why is it awesome ?

(define (bootstrap)
  (log "Hurd server bootstrap:")
  (bind-root rootfs-task)
  [...]
  (log ".\n"))

(catch (panic "Hurd bootstrap failed: " last-exception "\n")
       (bootstrap))

Imagine the possibilities.  Say you're creating a live-cd.  You load a
statically-linked isofs translator, you locate the cd, point the isofs
to the device and resume it, load and run an exec server from the cd,
mach-defpager, start a tmpfs as root filesystem, populate it from the
cd, start other essential translators, hand off to sysvinit.
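
A rough sketch of how such a live-cd boot script might be laid out.
This is illustrative only: every procedure name below is hypothetical
and stubbed out so that the script runs in any Scheme.  None of them
are the prototype's actual primitives.

```scheme
;; Illustrative sketch only.  All procedure names are hypothetical
;; stand-ins; a real script would call Mach/Hurd primitives instead.

(define (step description)
  ;; The stub only logs the step it stands for.
  (display description)
  (newline))

(define (load-isofs)        (step "load statically-linked isofs"))
(define (locate-cd)         (step "locate the cd"))
(define (resume-isofs)      (step "point isofs to the device, resume it"))
(define (start-exec)        (step "load and run exec from the cd"))
(define (start-defpager)    (step "start mach-defpager"))
(define (start-tmpfs-root)  (step "start tmpfs as root filesystem"))
(define (populate-root)     (step "populate root from the cd"))
(define (start-translators) (step "start other essential translators"))
(define (hand-off)          (step "hand off to sysvinit"))

(define (live-cd-bootstrap)
  (load-isofs)
  (locate-cd)
  (resume-isofs)
  (start-exec)
  (start-defpager)
  (start-tmpfs-root)
  (populate-root)
  (start-translators)
  (hand-off))

(live-cd-bootstrap)
```

The point of the layout is that the order of the steps lives in one
short procedure, so reordering or dropping a step is a one-line change
to the script.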

Here is a screenshot from my three-night prototype:

~~~ snip ~~~
Loading kernel...
Loading bootshell...
Loading boot script...
Loading hello demo...
Loading root filesystem...
GNU Mach 1.4
ELF section header table at c00102ec
[...]
module 0: bootshell ${host-port} ${device-port} ${bootscript-task} ${hello-task} ${rootfs-task} $(task-create) $(task-resume)
module 1: runsystem.scm $(bootscript-task=task-create)
module 2: hello $(hello-task=task-create)
module 3: iso9660fs --host-priv-port=${host-port} --device-master-port=${device-port} --exec-server-task=${hello-task} -T typed device:hd2 $(rootfs-task=task-create)
4 multiboot modules
task loaded: bootshell 1 2 3 4 5
reading 854 bytes into map c82eaf40
task loaded: runsystem.scm
task loaded: hello
task loaded: iso9660fs --host-priv-port=1 --device-master-port=2 --exec-server-task=3 -T typed device:hd2

start bootshell: bootshell/TinySCHEME 1.41.
Hurd server bootstrap: rootfs hello.
Hello world.
catch_exception_raise (23, 4, 1, 2, 36): terminating task 4.

We managed to rendezvous with the root filesystem.  Feel free to roam
around with `(cd "boot")'.  You can also try `(cat "/etc/hostname")',
or `(load "more.scm")'.

Note that we did not rely on the Hurd bootstrap code build into the
rootfs translator.  The Hurd server bootstrap you saw above is
implemented in scheme.  You can inspect it right now by doing `(cat
"/boot/runsystem.scm")'.  It is so tiny that it fits this screen.

You can remaster this image with `grub-mkrescue'.  See `remaster.txt'.

You can see how it handles failures by selecting the wrong menu entry
in GRUB.


Welcome to bootshell, a scheme shell.  Type `(help)' for help.

Sorry for the lack of -lreadline.

runsystem@nonmonolithic / > (cd "etc")
   ==> ()
runsystem@nonmonolithic /etc > (cat "hostname")
nonmonolithic
   ==> #t
runsystem@nonmonolithic /etc >
~~~ snap ~~~

Note how we started the rootfs translator on its own, and we can use
it, e.g. to load the system's hostname from "/etc/hostname" to display
it in the prompt.

Also see how it handles failures:

~~~ snip ~~~
start bootshell: bootshell/TinySCHEME 1.41.
Hurd server bootstrap: rootfsiso9660fs: device:hd0: No such device or address
catch_exception_raise (20, 5, 1, 2, 36): terminating task 5.

panic: Hurd bootstrap failed: (timeout)

(emergency-shell) > ((lambda (x) (+ 1 x)) 3)
   ==> 4
(emergency-shell) >
~~~ snap ~~~

That's right.  Your bootstrap failed but you still got a shell to
interact with your system.

Awesome, can I play with it ?

Sure!  Grab http://darnassus.sceen.net/~teythoon/bootshell24.iso and
do `kvm -k en-us -cdrom bootshell24.iso'.  You can easily remaster it
to mess with the boot script.  The code is here:

http://darnassus.sceen.net/gitweb/teythoon/hurd.git/shortlog/refs/heads/bootshell23

It requires only tiny tweaks to libdiskfs, and a tiny patch to gnumach
that enables us to load non-ELF files into tasks.


Love to get your input,
Justus
