Hi Henrich,

> To be clear, fscking does not correct the issue with /bsd? It's still
> just several KB? Is this random, or do you have the ability to readily
> reproduce it?

Correct. I looked into the issue more today and believe I've achieved
100% reproducibility. However, the problem is entirely unrelated to
improper shutdown and has everything to do with /usr filling up.

To reproduce, you want to *almost* fill up /usr before the kernel relink
job runs (otherwise it may abort straight away). E.g. find out how much
space is left on /usr, subtract a few MB, then dd a dummy file to /usr.
Reboot so the reorder will run.
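
A rough sketch of that fill step, for a throwaway test machine only. The
file name /usr/fill.bin, the 4 MB margin, and the helper names are
assumptions for illustration, not something from my actual test run:

```sh
#!/bin/sh
# Sketch only: leave roughly MARGIN_MB free on /usr before the relink runs.
# FILL_FILE and MARGIN_MB are hypothetical names/values.
FILL_FILE=${FILL_FILE:-/usr/fill.bin}
MARGIN_MB=${MARGIN_MB:-4}

# Print how many KB to write so that about $2 MB stay free out of $1 KB.
fill_size_kb() {
    avail_kb=$1
    margin_kb=$(( $2 * 1024 ))
    if [ "$avail_kb" -gt "$margin_kb" ]; then
        echo $(( avail_kb - margin_kb ))
    else
        echo 0
    fi
}

# Print the available space on a filesystem in KB (POSIX df output format).
avail_kb_on() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Write the dummy file. Not called automatically; run it by hand as root.
fill_usr() {
    count=$(fill_size_kb "$(avail_kb_on /usr)" "$MARGIN_MB")
    [ "$count" -gt 0 ] && dd if=/dev/zero of="$FILL_FILE" bs=1k count="$count"
}
```

After sourcing the file, calling fill_usr and rebooting should set up the
condition described above; removing the dummy file undoes it.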

Shortly afterward, while the relink is in progress, the following is
echoed to the console:

uvn_flush: obj=<address>, offset=<address>.  error during pageout.
uvn_flush: WARNING: changes to page may be lost!

The reorder will complete without error; however, a truncated/corrupted
kernel will almost certainly be written to /bsd. Reboot and the system
will be unbootable. The symptom may be slightly different from what I
originally posted and sometimes manifests as a reboot loop (it gets part
way through loading the kernel, aborts, then the bootloader restarts).

This is repeatable: if you boot into bsd.rd, move a working kernel into
/bsd, and recalculate the sha256 so reorder will run, but do not clean up
space in /usr, the cycle repeats the next time you reboot and reorder
runs.

I found some back and forth between Theo and Alex Bluhm a few years ago
about the underlying cause of that message, but I didn't see anything
about the reorder failing and I'm not sure whether anything was committed
to fix it:

https://marc.info/?l=openbsd-tech&m=164987816425987

As for the corrupted kernels, a good kernel had a file size of 31904467.
Corrupted/truncated kernels written to /bsd had sizes of:
29931779, 31868587, 29935875 (I saw this one twice), and 11696. I think
the sizes depend on how much free space was in /usr at the time it ran.
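
To put those sizes in perspective, a small sketch that prints how many
bytes each bad kernel is missing relative to the good one (the function
name shortfall is just illustrative):

```sh
#!/bin/sh
# Sketch only: bytes missing from each bad kernel relative to the good one.
GOOD=31904467

# Print the difference between the good kernel size and the size in $1.
shortfall() {
    echo $(( GOOD - $1 ))
}

for size in 29931779 31868587 29935875 11696; do
    printf '%s bytes: %s short\n' "$size" "$(shortfall "$size")"
done
```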

More interestingly, the system is writing the hash of the bad kernel out
into /var/db/kernel.SHA256, so I suspect there is a bug somewhere in the
reorder process where no error is returned when the disk is full, it
assumes the process completed successfully, and Bob's your uncle.
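
As an illustration of the kind of check that seems to be missing, here is
a minimal sketch that refuses to record a hash unless the installed copy
is byte-identical to the freshly linked kernel. verify_copy is a
hypothetical helper, not part of the relink scripts, and it would only
catch a failure in the final copy; if the truncation happens earlier
(during ld or ctfstrip), the source file is already bad and this check
would pass.

```sh
#!/bin/sh
# Sketch only: verify_copy is a hypothetical helper, not an existing tool.
# Succeeds only if the destination is byte-identical to the source.
verify_copy() {
    src=$1
    dst=$2
    # cmp -s exits non-zero if the files differ or either cannot be read.
    if cmp -s "$src" "$dst"; then
        return 0
    fi
    echo "verify_copy: $dst does not match $src" >&2
    return 1
}

# Possible use, mirroring the last step shown in relink.log:
#   install -F -m 700 bsd /bsd && verify_copy bsd /bsd &&
#       sha256 -h /var/db/kernel.SHA256 /bsd
```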

Here is a copy of relink.log from a corrupted kernel:

(SHA256) /bsd: OK
LD="ld" sh makegap.sh 0xcccccccc gapdummy.o
ld -T ld.script -X --warn-common -nopie -o newbsd ${SYSTEM_HEAD} vers.o ${OBJS}
text    data    bss     dec     hex
26728562        488512  1351680 28568754        1b3ecb2
mv newbsd newbsd.gdb
ctfstrip -S -o newbsd newbsd.gdb
rm -f bsd.gdb
mv -f newbsd bsd
install -F -m 700 bsd /bsd && sha256 -h /var/db/kernel.SHA256 /bsd

Kernel has been relinked and is active on next reboot.

SHA256 (/bsd) = 75600b28045794fa983d0823435c3a07c276b13dbbf11dc01dca71a2d4fe8d6d

The size of this corrupt kernel was 29935875 bytes.

Regards
Lloyd
