Hi,

On 11/09/2024, Andy Smith <a...@strugglers.net> wrote:

> Since booting from sdb wasn't working in any case, I thought I'd
> experiment a bit. I copied the first 446 bytes of sda to sdb. This
> made matters worse! Instead of a "grub> " prompt, I just got a blank
> screen.
>
> I then rebooted from sda and did:

I believe “sda” and “sdb” are swapped with respect to your first
message. Of course, these names are not expected to be stable across
reboots, but it makes things a bit confusing for me here.

(...)

> This does leave me wondering however, if the boot code in the MBR of
> sdb is now set to believe that this is "the second drive", I suppose
> (hd1) in grub terms? With the implication that should sda fail or be
> removed, this machine may still not boot because its boot code looks
> for something on a drive that no longer exists (sdb now being (hd0))?

I believe this is not necessarily the case. I've tried to read some of
the GRUB 2 stage 1 code from the grub2 2.12-5 package. I'm far from
being able to claim I understand everything, but... let's see.

My impression is that the “drive number” that is written to the MBR can
be of two kinds:

  (a) an actual number, typically 0x80, 0x81, etc. for hard disks
      (it is the BIOS drive number for INT 13h, cf. [1]);

  (b) or the special value 0xFF (thus, the 128th hard disk is not
      available for case (a); too bad if you have that many disks!).

The special value 0xFF is the one you had on both of your drives and
means “use the boot drive” (the one the BIOS booted from, whose number
is in register DL when the BIOS transfers control to the MBR code loaded
at physical address 0x7C00):

(From grub2-2.12/grub-core/boot/i386/pc/boot.S)

        .org GRUB_BOOT_MACHINE_BOOT_DRIVE
boot_drive:
        .byte 0xff      /* the disk to load kernel from */
                        /* 0xff means use the boot drive */

(...)

.org GRUB_BOOT_MACHINE_DRIVE_CHECK
(...)   ← fixup of DL in case it was incorrectly set by the BIOS

        /*
         *  Check if we have a forced disk reference here
         */
        movb   boot_drive, %al
        cmpb    $0xff, %al
        je      1f
        movb    %al, %dl
1:
        /* save drive reference first thing! */
        pushw   %dx
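
The branch above boils down to a two-way choice. As a rough model of it
(a Python sketch, not GRUB code: the name select_drive is made up for
illustration, and the elided DL fixup is deliberately left out):

```python
USE_BOOT_DRIVE = 0xFF  # special value stored in the boot_drive byte


def select_drive(boot_drive: int, bios_dl: int) -> int:
    """Return the drive number stage 1 ends up with in DL.

    boot_drive: the byte stored at GRUB_BOOT_MACHINE_BOOT_DRIVE in the MBR.
    bios_dl:    the value the BIOS left in DL when it jumped to 0x7C00.
    """
    if boot_drive == USE_BOOT_DRIVE:
        # "je 1f": no forced drive, keep whatever the BIOS passed.
        return bios_dl
    # "movb %al, %dl": a drive number was forced at install time.
    return boot_drive
```

With 0xFF stored in the MBR, select_drive(0xFF, 0x80) and
select_drive(0xFF, 0x81) simply hand back the BIOS value, whereas a
stored 0x81 would win over whatever the BIOS passed.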

[ One may find it “interesting” that the “jmp 3f” from line 216 of
  boot.S may be overwritten by the internal “grub-setup” code (cf.
  “grub-bios-setup” in grub-install.c), from grub2-2.12/util/setup.c:

    boot_drive_check = (grub_uint8_t *) (boot_img
                                          + GRUB_BOOT_MACHINE_DRIVE_CHECK);

    (...)

    /* If DEST_DRIVE is a hard disk, enable the workaround, which is
       for buggy BIOSes which don't pass boot drive correctly. Instead,
       they pass 0x00 or 0x01 even when booted from 0x80.  */
    if (!allow_floppy && !grub_util_biosdisk_is_floppy (dest_dev->disk))
      {
        /* Replace the jmp (2 bytes) with double nop's.  */
        boot_drive_check[0] = 0x90;
        boot_drive_check[1] = 0x90;
      }
]

In your case, I claim that the drive setting stored in the MBR (as shown
in your first message for both drives) was “use the boot drive” for both
sda and sdb, because from grub2-2.12/include/grub/i386/pc/boot.h:

/* The offset of BOOT_DRIVE.  */
#define GRUB_BOOT_MACHINE_BOOT_DRIVE    0x64

and both of your MBRs had 0xff at this offset:

00000060: 0000 0000 fffa 9090 f6c2 8074 05f6 c270  ...........t...p

You can also see right here the two NOPs (9090) at offset
GRUB_BOOT_MACHINE_DRIVE_CHECK (i.e. 0x66), which replace the
aforementioned “jmp 3f” from boot.S line 216 because this stage1 code
was written to a hard disk.
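
If you want to repeat this check without reading hex by eye, here is a
small Python sketch. The two offsets are the ones quoted above; the
function name describe_stage1 and the sample buffer are my own, and the
sample only contains the 16 bytes of the dump line, padded with zeros in
front.

```python
BOOT_DRIVE_OFFSET = 0x64   # GRUB_BOOT_MACHINE_BOOT_DRIVE, from boot.h
DRIVE_CHECK_OFFSET = 0x66  # GRUB_BOOT_MACHINE_DRIVE_CHECK, as used above


def describe_stage1(mbr: bytes) -> dict:
    """Decode the two stage1 fields discussed in this message."""
    if len(mbr) < DRIVE_CHECK_OFFSET + 2:
        raise ValueError("need at least the first 0x68 bytes of the MBR")
    boot_drive = mbr[BOOT_DRIVE_OFFSET]
    check = mbr[DRIVE_CHECK_OFFSET:DRIVE_CHECK_OFFSET + 2]
    return {
        "boot_drive": boot_drive,
        "uses_bios_boot_drive": boot_drive == 0xFF,
        "jmp_replaced_by_nops": check == b"\x90\x90",
    }


# The 16 bytes shown at offset 0x60 in the dump, after 0x60 zero bytes.
row_0x60 = bytes.fromhex("00000000fffa9090f6c2807405f6c270")
sample = bytes(0x60) + row_0x60
info = describe_stage1(sample)
```

On a real machine you would pass it the first 512 bytes of the disk
instead of the sample, e.g. read as root from /dev/sda.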

Conclusion: in your case, the option was “load the next stage from the
drive the BIOS booted from” for both MBRs. Therefore, AFAIUI, assuming
everything else was good (incl. the offset for finding the next stage),
it should still have been able to boot with only one of the drives
present in the machine.

> The grub.cfg itself (and later, the fstab) finds its drives by UUID
> so I'm not worried about that part.
>
> I just have dim memories about having to do grub-install to sdb but
> trick it somehow that this was (hd0)…

Yep... AFAIUI, (hd0) is the name GRUB uses when it talks to the BIOS (at
boot) and corresponds to drive 0x80 (on x86 machines). When running
grub-install or the internal grub-bios-setup, however, GRUB has to guess
how the BIOS is going to number the devices you gave it in Linux-speak
(/dev/sda, /dev/sdb, etc.), and that guess may be unreliable. As far as
I recall, the GRUB 1.x documentation clearly said so, and therefore
recommended (in the 2000s), as the bullet-proof recipe, installing GRUB
to the hard disk *from GRUB itself* using the (hd0), (hd1), etc.
notation, e.g. after booting from a GRUB floppy disk.

> I do also wonder why my simple dd of the first 446 bytes did not
> work, as the /boot partition is at the same position on both drives
> and is an MDADM RAID1 so should have its stage2 at the same LBA.
> After doing the "dpkg-reconfigure grub-pc" the first 446 bytes of
> both sda and sdb are (still) identical so something else somewhere
> else must have been changed.

GRUB is a complex beast; available documentation may be a bit confusing
when it comes to stage 1.5 and stage 2, e.g. [3]:

  Version 0 (GRUB Legacy)
  ~~~~~~~~~~~~~~~~~~~~~~~

  Stage 1 can load stage 2 directly, but it is normally set up to load the
  stage 1.5., located in the first 30 KiB of hard disk immediately
  following the MBR and before the first partition. (...) The stage 1.5
  image contains file system drivers, enabling it to directly load stage 2
  from any known location in the filesystem, for example from /boot/grub.

  Version 2 (GRUB 2)
  ~~~~~~~~~~~~~~~~~~

  [Different description]

In any case, stage 1 can load some “stage 1.5” from “empty sectors (if
available) between the MBR and the first partition”. These sectors
wouldn't be synchronized by MD RAID, unless you're using it on the whole
drives (as opposed to partition by partition). I don't claim that “this
is it”, but this might explain some difference between your drives'
booting behavior, even with identical:
  - stage1 code+data in the MBR;
  - boot partitions' start offset and contents.
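
One way to test this hypothesis would be to compare the gap after the
MBR on both drives. Below is a Python sketch for that; the function
names are made up, the 512-byte sector size is an assumption, and I
haven't run it against real disks.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def differing_sectors(a: bytes, b: bytes, first_sector: int = 1) -> list:
    """List the sector numbers at which two equal-length dumps differ.

    Sector 0 (the MBR) is skipped by default, so only the gap after it
    is compared.
    """
    if len(a) != len(b):
        raise ValueError("dumps must have the same length")
    diffs = []
    for offset in range(first_sector * SECTOR_SIZE, len(a), SECTOR_SIZE):
        if a[offset:offset + SECTOR_SIZE] != b[offset:offset + SECTOR_SIZE]:
            diffs.append(offset // SECTOR_SIZE)
    return diffs


def read_gap(device: str, sectors: int) -> bytes:
    """Read the first `sectors` sectors of a device or image file."""
    with open(device, "rb") as f:
        return f.read(sectors * SECTOR_SIZE)
```

As root, something like differing_sectors(read_gap("/dev/sda", 2048),
read_gap("/dev/sdb", 2048)) would list the sectors that differ, where
2048 is only a placeholder for the start sector of your first partition
(fdisk -l shows the real value). An empty list would mean the gap is
identical on both drives and the explanation lies elsewhere.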

> Not understanding quite what is going on is worrying to me, even if
> things do now work. 🙁

I just hope I didn't confuse you more. :-)

Regards

[1] https://en.wikipedia.org/wiki/INT_13H#List_of_INT_13h_services
[2] https://wiki.osdev.org/MBR_(x86)#MBR_Bootstrap
[3] https://en.wikipedia.org/wiki/GNU_GRUB

-- 
Florent
