Package: mdadm
Version: 3.4-4+b1
Severity: important
Tags: patch

When an array has disk(s) marked write-mostly, mdadm ignores them when
finding the most recent metadata.  That can mean that it assembles and
starts a degraded array using only the broken disk, rather than only
the good but write-mostly one.  This happened to me.

I have set the severity to `important' because this could cause data
loss.  TBH I think `serious' might be warranted but I will leave that
to you.

To reproduce: run the attached `repro' script.  You will need to
change `vg' at the top to refer to some volume group on your system.
It will need a few tens of Mb of space.  The repro script will make
two LVs to play in, and not otherwise disturb your VG.
To tidy up afterwards, add `exit 0' before `: ----- create -----'.

I have repro'd this with mdadm from jessie and from stretch.

The patches are against jessie's version of mdadm.  If you agree with
my analysis, I'd appreciate it if you would advise whether you think a
stable update would be warranted.

Ian.

#!/bin/bash
set -ex

vg=vg-zealot2

: ----- tidy -----

echo '0 1 zero' |dmsetup load mdtest-a ||:
dmsetup resume mdtest-a ||:

mdadm --stop /dev/md/mdtest-md ||:

echo '0 1 zero' |dmsetup load mdtest-zero ||:
dmsetup resume mdtest-zero ||:

dmsetup remove mdtest-a ||:

lvremove -f /dev/$vg/mdtest-a-base ||:
lvremove -f /dev/$vg/mdtest-b ||:

dmsetup remove mdtest-empty ||:

: ----- create -----

yes | lvcreate -W y -Z y -L 10M -n mdtest-a-base $vg
yes | lvcreate -W y -Z y -L 10M -n mdtest-b $vg

size=$(dmsetup table /dev/$vg/mdtest-a-base | perl -ne '
        my ($size, $device, $offset) = 
                m{^0 (\d+) linear (\S+) (\d+)$}
                or die $!;
        print $size, "\n" or die $!;
')

dmsetup create mdtest-empty <<END
0 $(( $size + 1 )) error
END

ab=/dev/$vg/mdtest-a-base

<<END tee /dev/stderr | dmsetup create mdtest-a 
0 $size linear $ab 0
END

a=/dev/mapper/mdtest-a
b=/dev/$vg/mdtest-b
m=/dev/md/mdtest-md

mdadm -C --auto=md --raid-devices=2 -lmirror -e1.1 --assume-clean \
        $m $a -W $b

eval "$(blkid -o export $b)"

echo hi | dd of=$m

showstate () {
        set +e
        mdadm -E $a
        mdadm -E $b
        mdadm -QD $m
        ls -al $a $b $m
        set -e
}
showstate

: ----- break a -----

# linear
#
# Parameters: <dev path> <offset>
#     <dev path>: Full pathname to the underlying block-device, or a
#                 "major:minor" device-number.
#     <offset>: Starting sector within the device.

# dm-delay
# ========
# ...
# Parameters:
#     <device> <offset> <delay> [<write_device> <write_offset> <write_delay>]

dmsetup suspend mdtest-a
<<END tee /dev/stderr | dmsetup --checks load mdtest-a
0 $size delay $ab 0 0 /dev/mapper/mdtest-empty 0 0
END
dmsetup resume mdtest-a

showstate

dmsetup table mdtest-a

echo ho | dd of=$m

showstate

: ----- wrong reassembly

mdadm --stop $m

showstate

nodename=$(uname -n)

cat >config <<END
ARRAY $m level=raid1 metadata=1.1 num-devices=2 UUID=$UUID 
name=$nodename:mdtest-md
END

mdadm --assemble --config=config --auto=md $m

# mdadm --assemble --scan --auto=md $m

showstate
>From 4f11973f921b0fb9c5ea0eef240c28bae0584bb3 Mon Sep 17 00:00:00 2001
From: Ian Jackson <ijack...@chiark.greenend.org.uk>
Date: Sun, 12 Aug 2018 15:32:12 +0100
Subject: [PATCH 1/3] Assemble.c: Use MD_* names when checking state for
 most_recent

The literal constant `6' is pretty opaque here.  Source-diving tells
me that .disk.state is a bitmask of bits numbered according to MD_*.

No functional change.

Signed-off-by: Ian Jackson <ijack...@chiark.greenend.org.uk>
---
 Assemble.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Assemble.c b/Assemble.c
index cdcdb0f8..84db272d 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -705,7 +705,7 @@ static int load_devices(struct devs *devices, char *devmap,
                devices[devcnt].i.disk.major = major(stb.st_rdev);
                devices[devcnt].i.disk.minor = minor(stb.st_rdev);
 
-               if (devices[devcnt].i.disk.state == 6) {
+               if (devices[devcnt].i.disk.state == (1<<MD_DISK_ACTIVE | 
1<<MD_DISK_SYNC)) {
                        if (most_recent < 0 ||
                            devices[devcnt].i.events
                            > devices[most_recent].i.events) {
-- 
2.11.0

>From 2b77820e50c7520b84f3d8d97ced1d0f2c3b9370 Mon Sep 17 00:00:00 2001
From: Ian Jackson <ijack...@chiark.greenend.org.uk>
Date: Sun, 12 Aug 2018 15:38:37 +0100
Subject: [PATCH 2/3] Assemble.c: Do not disregard write-mostly disks

The most recent metadata might well be on a write-mostly disk.

Consider for example my situation: a mirror with an RRD and an SSD.
The SSD has just failed in such a way that it rejects writes (which is
not uncommon for an SSD).  We need the metadata from the RRD.  The
fact that it's marked write-mostly is a performance optimisation, and
is not relevant for correctness and should not be seen as a reason to
disregard that disk.  Without this patch, the array is assembled using
only the failed SSD!

Signed-off-by: Ian Jackson <ijack...@chiark.greenend.org.uk>
---
 Assemble.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Assemble.c b/Assemble.c
index 84db272d..7bd66108 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -705,7 +705,8 @@ static int load_devices(struct devs *devices, char *devmap,
                devices[devcnt].i.disk.major = major(stb.st_rdev);
                devices[devcnt].i.disk.minor = minor(stb.st_rdev);
 
-               if (devices[devcnt].i.disk.state == (1<<MD_DISK_ACTIVE | 
1<<MD_DISK_SYNC)) {
+               if ((devices[devcnt].i.disk.state & ~(1<<MD_DISK_WRITEMOSTLY)) 
==
+                   (1<<MD_DISK_ACTIVE | 1<<MD_DISK_SYNC)) {
                        if (most_recent < 0 ||
                            devices[devcnt].i.events
                            > devices[most_recent].i.events) {
-- 
2.11.0

>From 6f0a7d466aacc39972ae05dd096ffb757b5e1471 Mon Sep 17 00:00:00 2001
From: Ian Jackson <ijack...@chiark.greenend.org.uk>
Date: Sun, 12 Aug 2018 15:50:07 +0100
Subject: [PATCH 3/3] Assemble.c: Do not disregard to-be-replaced disks

The most recent metadata might well be on a to-be-replaced disk.

I am not 100% sure, but I think MD_DISK_REPLACEMENT needs to be
treated the same as MD_DISK_WRITEMOSTLY.  AFAICT this flag means that
the drive is marked for replacement when a spare is available.  This
would not mean it hasn't got the best metadata.  In some particularly
bad case, maybe it has the only good copy.

The reason I am not 100% sure is that I'm concerned I may have the
meaning of MD_DISK_REPLACEMENT wrong.  The code is constantly
converting these disk states/flags between different representations
and I may not have followed them all.

Signed-off-by: Ian Jackson <ijack...@chiark.greenend.org.uk>
---
 Assemble.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Assemble.c b/Assemble.c
index 7bd66108..bb51ccbc 100644
--- a/Assemble.c
+++ b/Assemble.c
@@ -705,7 +705,7 @@ static int load_devices(struct devs *devices, char *devmap,
                devices[devcnt].i.disk.major = major(stb.st_rdev);
                devices[devcnt].i.disk.minor = minor(stb.st_rdev);
 
-               if ((devices[devcnt].i.disk.state & ~(1<<MD_DISK_WRITEMOSTLY)) 
==
+               if ((devices[devcnt].i.disk.state & ~(1<<MD_DISK_WRITEMOSTLY | 
1<<MD_DISK_REPLACEMENT)) ==
                    (1<<MD_DISK_ACTIVE | 1<<MD_DISK_SYNC)) {
                        if (most_recent < 0 ||
                            devices[devcnt].i.events
-- 
2.11.0



-- 
Ian Jackson <ijack...@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.

Reply via email to