Package: mdadm Version: 3.4-4+b1 Severity: important Tags: patch When an array has disk(s) marked write-mostly, mdadm ignores them when finding the most recent metadata. That can mean that it assembles and starts a degraded array using only the broken disk, rather than only the good but write-mostly one. This happened to me.
I have set the severity to `important' because this could cause data loss. TBH I think `serious' might be warranted but I will leave that to you. To reproduce: run the attached `repro' script. You will need to change `vg' at the top to refer to some volume group on your system. It will need a few tens of Mb of space. The repro script will make two LVs to play in, and not otherwise disturb your VG. To tidy up afterwards, add `exit 0' before `: ----- create -----'. I have repro'd this with mdadm from jessie and from stretch. The patches are against jessie's version of mdadm. If you agree with my analysis, I'd appreciate it if you would advise whether you think a stable update would be warranted. Ian.
#!/bin/bash set -ex vg=vg-zealot2 : ----- tidy ----- echo '0 1 zero' |dmsetup load mdtest-a ||: dmsetup resume mdtest-a ||: mdadm --stop /dev/md/mdtest-md ||: echo '0 1 zero' |dmsetup load mdtest-zero ||: dmsetup resume mdtest-zero ||: dmsetup remove mdtest-a ||: lvremove -f /dev/$vg/mdtest-a-base ||: lvremove -f /dev/$vg/mdtest-b ||: dmsetup remove mdtest-empty ||: : ----- create ----- yes | lvcreate -W y -Z y -L 10M -n mdtest-a-base $vg yes | lvcreate -W y -Z y -L 10M -n mdtest-b $vg size=$(dmsetup table /dev/$vg/mdtest-a-base | perl -ne ' my ($size, $device, $offset) = m{^0 (\d+) linear (\S+) (\d+)$} or die $!; print $size, "\n" or die $!; ') dmsetup create mdtest-empty <<END 0 $(( $size + 1 )) error END ab=/dev/$vg/mdtest-a-base <<END tee /dev/stderr | dmsetup create mdtest-a 0 $size linear $ab 0 END a=/dev/mapper/mdtest-a b=/dev/$vg/mdtest-b m=/dev/md/mdtest-md mdadm -C --auto=md --raid-devices=2 -lmirror -e1.1 --assume-clean \ $m $a -W $b eval "$(blkid -o export $b)" echo hi | dd of=$m showstate () { set +e mdadm -E $a mdadm -E $b mdadm -QD $m ls -al $a $b $m set -e } showstate : ----- break a ----- # linear # # Parameters: <dev path> <offset> # <dev path>: Full pathname to the underlying block-device, or a # "major:minor" device-number. # <offset>: Starting sector within the device. # dm-delay # ======== # ... # Parameters: # <device> <offset> <delay> [<write_device> <write_offset> <write_delay>] dmsetup suspend mdtest-a <<END tee /dev/stderr | dmsetup --checks load mdtest-a 0 $size delay $ab 0 0 /dev/mapper/mdtest-empty 0 0 END dmsetup resume mdtest-a showstate dmsetup table mdtest-a echo ho | dd of=$m showstate : ----- wrong reassembly mdadm --stop $m showstate nodename=$(uname -n) cat >config <<END ARRAY $m level=raid1 metadata=1.1 num-devices=2 UUID=$UUID name=$nodename:mdtest-md END mdadm --assemble --config=config --auto=md $m # mdadm --assemble --scan --auto=md $m showstate
>From 4f11973f921b0fb9c5ea0eef240c28bae0584bb3 Mon Sep 17 00:00:00 2001 From: Ian Jackson <ijack...@chiark.greenend.org.uk> Date: Sun, 12 Aug 2018 15:32:12 +0100 Subject: [PATCH 1/3] Assemble.c: Use MD_* names when checking state for most_recent The literal constant `6' is pretty opaque here. Source-diving tells me that .disk.state is a bitmask of bits numbered according to MD_*. No functional change. Signed-off-by: Ian Jackson <ijack...@chiark.greenend.org.uk> --- Assemble.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Assemble.c b/Assemble.c index cdcdb0f8..84db272d 100644 --- a/Assemble.c +++ b/Assemble.c @@ -705,7 +705,7 @@ static int load_devices(struct devs *devices, char *devmap, devices[devcnt].i.disk.major = major(stb.st_rdev); devices[devcnt].i.disk.minor = minor(stb.st_rdev); - if (devices[devcnt].i.disk.state == 6) { + if (devices[devcnt].i.disk.state == (1<<MD_DISK_ACTIVE | 1<<MD_DISK_SYNC)) { if (most_recent < 0 || devices[devcnt].i.events > devices[most_recent].i.events) { -- 2.11.0
>From 2b77820e50c7520b84f3d8d97ced1d0f2c3b9370 Mon Sep 17 00:00:00 2001 From: Ian Jackson <ijack...@chiark.greenend.org.uk> Date: Sun, 12 Aug 2018 15:38:37 +0100 Subject: [PATCH 2/3] Assemble.c: Do not disregard write-mostly disks The most recent metadata might well be on a write-mostly disk. Consider for example my situation: a mirror with an RRD and an SSD. The SSD has just failed in such a way that it rejects writes (which is not uncommon for an SSD). We need the metadata from the RRD. The fact that it's marked write-mostly is a performance optimisation, and is not relevant for correctness and should not be seen as a reason to disregard that disk. Without this patch, the array is assembled using only the failed SSD! Signed-off-by: Ian Jackson <ijack...@chiark.greenend.org.uk> --- Assemble.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Assemble.c b/Assemble.c index 84db272d..7bd66108 100644 --- a/Assemble.c +++ b/Assemble.c @@ -705,7 +705,8 @@ static int load_devices(struct devs *devices, char *devmap, devices[devcnt].i.disk.major = major(stb.st_rdev); devices[devcnt].i.disk.minor = minor(stb.st_rdev); - if (devices[devcnt].i.disk.state == (1<<MD_DISK_ACTIVE | 1<<MD_DISK_SYNC)) { + if ((devices[devcnt].i.disk.state & ~(1<<MD_DISK_WRITEMOSTLY)) == + (1<<MD_DISK_ACTIVE | 1<<MD_DISK_SYNC)) { if (most_recent < 0 || devices[devcnt].i.events > devices[most_recent].i.events) { -- 2.11.0
>From 6f0a7d466aacc39972ae05dd096ffb757b5e1471 Mon Sep 17 00:00:00 2001 From: Ian Jackson <ijack...@chiark.greenend.org.uk> Date: Sun, 12 Aug 2018 15:50:07 +0100 Subject: [PATCH 3/3] Assemble.c: Do not disregard to-be-replaced disks The most recent metadata might well be on a to-be-replaced disk. I am not 100% sure, but I think MD_DISK_REPLACEMENT needs to be treated the same as MD_DISK_WRITEMOSTLY. AFAICT this flag means that the drive is marked for replacement when a spare is available. This would not mean it hasn't got the best metadata. In some particularly bad case, maybe it has the only good copy. The reason I am not 100% sure is that I'm concerned I may have the meaning of MD_DISK_REPLACEMENT wrong. The code is constantly converting these disk states/flags between different representations and I may not have followed them all. Signed-off-by: Ian Jackson <ijack...@chiark.greenend.org.uk> --- Assemble.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Assemble.c b/Assemble.c index 7bd66108..bb51ccbc 100644 --- a/Assemble.c +++ b/Assemble.c @@ -705,7 +705,7 @@ static int load_devices(struct devs *devices, char *devmap, devices[devcnt].i.disk.major = major(stb.st_rdev); devices[devcnt].i.disk.minor = minor(stb.st_rdev); - if ((devices[devcnt].i.disk.state & ~(1<<MD_DISK_WRITEMOSTLY)) == + if ((devices[devcnt].i.disk.state & ~(1<<MD_DISK_WRITEMOSTLY | 1<<MD_DISK_REPLACEMENT)) == (1<<MD_DISK_ACTIVE | 1<<MD_DISK_SYNC)) { if (most_recent < 0 || devices[devcnt].i.events -- 2.11.0
-- Ian Jackson <ijack...@chiark.greenend.org.uk> These opinions are my own. If I emailed you from an address @fyvzl.net or @evade.org.uk, that is a private address which bypasses my fierce spamfilter.