I broke my TV.

My TV is a monitor powered by a laptop running OpenBSD, and in trying
to diagnose a problem which turned out to be in the NAS, I managed to
fry the disklabel.

How?

Well, being unimportant, the machine is also the guinea pig for
snapshot builds and other experiments, so I thought I might try
removing the problem (thus confirming it's to do with running a
snapshot) by reverting to 7.2 stable.

I copied bsd.rd from a stable machine, booted into that and installed
over the top of the snapshot.

The install worked fine but nothing much ran and / filled up with
core dumps. Bummer.

Oh well, easy enough to fix, just move the system files out of the
way and reinstall. It's getting late but the OpenBSD installer's a
breeze and it should be all done and config files merged in 15
minutes.

Well it wasn't 15 minutes. I did the wrong thing and blasted the
disklabel.

The partitions weren't formatted. I knew this because I'd popped
into the installer's shell and replaced /sbin/newfs with an empty
file (what /bin/true used to be). The files were safe.

After stepping back and using another machine to get the music going
again I remembered that scan_ffs can find partitions when the
disklabel is lost. Using that I could either recover the label in
full, or at least find /var, which holds the disklabel backup.

Well the numbers scan_ffs gave me were gibberish. The manual warns
that it only looks for ffs1 partitions, not ffs2, but I ran it
anyway and tried poking variations on the numbers it gave me into
disklabel. That didn't work.

In the end I opened up scan_ffs.c to look at how it does its scan.
It reads the disk in 512K chunks and, at each 512-byte disk block
within a chunk, applies the data to a 'struct fs' as defined in
/usr/include/ufs/ffs/fs.h. Unfortunately, as the manual states, it
only considers ffs1 partitions, marked by FS_MAGIC aka FS_UFS1_MAGIC.
While there's a FS_UFS2_MAGIC, printing the location in which it was
found didn't give me the result I expected...
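
For illustration, the same magic test can be approximated from the
shell against a disk image. This is only a sketch, not what scan_ffs
does internally: it assumes fs_magic lies 1372 bytes into the
superblock and that the filesystem was written little-endian, and
has_ufs2_magic is a made-up helper name.

```shell
#!/bin/sh
# Sketch: test one 512-byte sector of a disk image for the UFS2 magic.
# Assumptions (not from the original post): fs_magic sits 1372 bytes
# into the superblock, and the filesystem is little-endian, so
# FS_UFS2_MAGIC (0x19540119) appears on disk as bytes 19 01 54 19.

has_ufs2_magic() {
        image=$1
        sector=$2
        # Byte position of the magic if a superblock starts at this sector.
        offset=$((sector * 512 + 1372))
        # Read the four bytes there and render them as bare hex digits.
        bytes=$(dd if="$image" bs=1 skip="$offset" count=4 2>/dev/null |
            od -An -tx1 | tr -d '[:space:]')
        [ "$bytes" = "19015419" ]
}

# Usage: has_ufs2_magic disk.img 192 && echo "ufs2 @ 192"
```

Reading four bytes per sector with dd is far too slow for a whole
disk; it is only meant for spot-checking a handful of candidates.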

By this time, since I was figuring out scan_ffs and not looking for
my missing /var, I was running it over a small disk image where I
knew there was a partition at block 64, but the modified scan_ffs
said the first partition was on block 192.

I thought that was strange but maybe there's a ffs1-like non-super-block
which points to the real ffs2 block later on. Hoping this was the
case I ran a scan over the whole disk printing each block that had
a FS_UFS2_MAGIC signature, offset by the amount to feed to disklabel.

Armed with a list of matching blocks (there were around 300 when I
stopped scanning after I was confident /var was found or missed) I
wrote a script to delete and recreate a partition beginning at each
potential block (the length doesn't matter) and try to mount it
(read only!). Since there were only a few blocks to check I scanned
the output by eye, found /var and copied the disklabel backup out
of it. With that it was a simple matter to restore the correct
disklabel, check the partitions and recover the system in full.

To obtain the list of blocks I added this clause after the main
test in scan_ffs.c:

        else if (sb->fs_magic == FS_UFS2_MAGIC) {
                /* blk counts 512-byte blocks; n is a byte offset */
                printf("ufs2 @ %lld\n", (long long)(blk*512+n)/512 - 128);
        }

This script fiddled with the disklabel to find a partition which
worked (this changes the real disklabel):

        while read maybe; do
                echo 'd d\na d\n'$maybe'\n\n\nw\n' |
                    disklabel -E sd0 >/dev/null 2>&1
                echo -n "$maybe: "
                mount -r /dev/sd0d /mnt && ls /mnt /mnt/moved
                umount /mnt 2>/dev/null
        done

There are certainly better ways.
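
One small refinement, as a sketch: build the editor commands with
printf instead of echo, since echo's handling of '\n' differs between
shells. The command sequence is the same one the loop above feeds to
disklabel; label_cmds is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: emit the disklabel -E commands that replace partition d with
# one starting at the given block. label_cmds is a hypothetical name.

label_cmds() {
        # d d: delete partition d; a d: add it back at offset $1.
        # The two blank lines accept the defaults for the prompts that
        # follow the offset; w writes the label.
        printf 'd d\na d\n%s\n\n\nw\n' "$1"
}

# Usage (this rewrites the real disklabel, exactly as before):
#   label_cmds 64 | disklabel -E sd0 >/dev/null 2>&1
```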

Restoring the disklabel is described in the manual:

        disklabel -R sd0 /tmp/disklabel.sd0.current

The fully-integrated build system made testing changes to scan_ffs
a breeze even though my dev box is on the snapshot and the recovery
system was stable. Putting the snapshot back on the telly took much
less than 15 minutes.

It's possible scan_ffs could be simply extended to print potential
ffs2 partitions (there are a few more checks it could make to whittle
down the result) if the 128-block offset is constant across platforms.
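
For what it's worth, the 128 looks like SBLOCK_UFS2 from fs.h (65536
bytes) divided by the 512-byte block size, which would also explain
why the test image's partition at block 64 was reported at 192. A
minimal sketch of that arithmetic, assuming 512-byte blocks and that
the hit is the primary superblock rather than a backup copy
(part_start is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: turn the block where a UFS2 superblock magic was found into
# the likely start of its partition. Assumes 512-byte blocks and that
# the hit is the primary superblock at SBLOCK_UFS2, not a backup copy.

SBLOCK_UFS2=65536       # byte offset of the UFS2 superblock, per fs.h
SECTOR=512              # assumed block size

part_start() {
        echo $(($1 - SBLOCK_UFS2 / SECTOR))
}

# Usage: part_start 192
```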

Cheers,

Matthew
