On 22/07/10 12:38, Gregory Seidman wrote:
I have a RAID1 (using md) running on two USB disks. (I'm working on moving
to eSATA, but it's USB for now.) That means I don't have any insight using
SMART. Meanwhile, I've been getting occasional fail events. Unfortunately,
I don't get any information on which disk is failing.
When the system comes up, it seems to be entirely random which disk comes
up as /dev/sda and which comes up as /dev/sdb. In fact, since my root disk
is on SATA, at least one time it came up as /dev/sda and the USB drives
came up as /dev/sdb and /dev/sdc, though I think that was under a different
kernel version. When I get a failure email, it tells me that it might be
due to /dev/sda1 failing -- except when it tells me that it might be due to
/dev/sdb1 failing. When things are working, mdadm -D /dev/md0 looks like
this:
/dev/md0:
Version : 00.90
Creation Time : Wed Feb 22 20:50:29 2006
Raid Level : raid1
Array Size : 312496256 (298.02 GiB 320.00 GB)
Used Dev Size : 312496256 (298.02 GiB 320.00 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu Jul 22 07:30:46 2010
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : e4feee4a:6b6be6d2:013f88ab:1b80cac5
Events : 0.17961786
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 1 1 active sync /dev/sda1
When it fails, however, the device names disappear and it just tells me
it's clean, degraded and shows an active disk, a removed disk, and a faulty
spare without any device names.
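Since the kernel names shuffle between boots, the /dev/disk/by-id symlinks (whose names usually embed the model and serial number) can pin each member to a physical disk before it fails. A minimal sketch of the idea follows; a temp directory with made-up entries stands in for /dev/disk/by-id so it runs anywhere, and the model/serial strings are hypothetical — real entries depend on the USB bridge:

```shell
#!/bin/sh
# Sketch: map persistent by-id names (which usually embed model and
# serial) onto the kernel names (sda, sdb, ...) that mdadm reports.
# A temp directory with fake symlinks stands in for /dev/disk/by-id
# so the sketch is self-contained; the serials below are made up.
map_by_id() {
    for link in "$1"/usb-*; do
        # readlink -f resolves the symlink to the device node it points at
        printf '%s -> %s\n' "$(basename "$link")" \
            "$(basename "$(readlink -f "$link")")"
    done
}

BYID=$(mktemp -d)            # stand-in for /dev/disk/by-id
mkdir "$BYID/devs"
touch "$BYID/devs/sda" "$BYID/devs/sdb"
ln -s devs/sda "$BYID/usb-ST3320620A_5QF00001"   # hypothetical entry
ln -s devs/sdb "$BYID/usb-ST3320620A_5QF00002"   # hypothetical entry

map_by_id "$BYID"
```

On the real system, `ls -l /dev/disk/by-id/` (or the same loop over that directory) shows the serial-bearing name next to whichever of sda/sdb the disk currently is, so a failing member can be tied to a physical drive without opening the case.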
I even tried doing dd if=/dev/md0 of=/dev/null to see if I could get the
light flickering on one and not the other, but I just get I/O errors. Once
a disk fails, the RAID seems to go into a nasty state where it reads
properly through the crypto loop and LVM I have on top of it, but the
filesystems become read-only and the block devices just give errors. Worse,
the first indication (even before the mdadm email) that something is wrong
is a message to console that an ext3 journal write failed.
What I've been doing (which makes me tremendously uncomfortable since I
know a disk is failing) is to reboot and bring everything back up. This has
been working, but I know it's just a matter of time before the failing disk
becomes a failed disk. I could wait until then, since presumably I'll then
know which is which, but who knows what data corruption is possible between
now and then?
So, um, help?
--Greg
cat /proc/mdstat can help, but you need to get the serial numbers. Do this:
~# hdparm -i /dev/sda
/dev/sda:
 Model=ST31000340AS, FwRev=SD15, SerialNo=9QJ1TRWK
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=?16?
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953523055
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-4,5,6,7
* signifies the current active mode
You can see the SerialNo field in that output. Each HDD also has its
serial number printed on the casing somewhere, though it's often hard to
read, so get a label machine out and clearly label each HDD with its
serial number. When one dies, do a cat /proc/mdstat to see which drive
has failed. Say /dev/sda has failed: run that command to get the serial
number of /dev/sda, open the case, pull it out, stick a new HDD in
(making sure you label this one with its serial number too), boot up,
and rebuild, etc.
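In /proc/mdstat a faulted member is tagged with "(F)" after its name, so the failed drive can be picked out mechanically. A small sketch of that filter follows; a captured sample (with hypothetical device names) stands in for /proc/mdstat so it runs anywhere:

```shell
#!/bin/sh
# Sketch: pull the failed member out of /proc/mdstat. The kernel marks
# a faulted device with "(F)" after its name, e.g. "sdb1[0](F)".
# A captured sample stands in for the real file so this is
# self-contained; on a live system, pipe /proc/mdstat in instead.
failed_member() {
    # one token per line, keep the (F)-tagged one, strip the [n](F) suffix
    tr ' ' '\n' | grep '(F)' | sed 's/\[.*//'
}

sample='md0 : active raid1 sdb1[0](F) sda1[1]
      312496256 blocks [2/1] [_U]'

echo "$sample" | failed_member    # prints: sdb1
```

On a live system the same filter is just `cat /proc/mdstat | failed_member`. From there the replacement goes roughly (device names hypothetical): `mdadm /dev/md0 --fail /dev/sdb1`, `mdadm /dev/md0 --remove /dev/sdb1`, swap the physical disk, then `mdadm /dev/md0 --add /dev/sdb1` and let it resync.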
--
To UNSUBSCRIBE, email to debian-user-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4c483f88.70...@sharescope.co.uk