On 11/09/17 13:04, deloptes wrote:
> David Christensen wrote:
>> What RAID technology are you using?
> Linux software raid - kernel is 4.12.10
Most people call it 'mdadm', after the command-line tool. I am running
the same, but on Debian "stable":
2017-11-09 14:00:32 root@dipsy ~
# dpkg-query --show mdadm
mdadm 3.4-4+b1
2017-11-09 14:00:40 root@dipsy ~
# cat /etc/debian_version
9.2
2017-11-09 14:01:00 root@dipsy ~
# uname -a
Linux dipsy 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64
GNU/Linux
>> Take a look at:
>> # smartctl --xall /dev/sdg
> This is nothing spectacular - see attachment.
I'll comment on the information I think I understand...
> Device Model: ST3500630AS
I deal with eight ST31500341AS drives, which I believe are of the same
vintage. They all seem good.
> SMART overall-health self-assessment test result: PASSED
That is good.
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate POSR-- 105 095 006 - 0
> 10 Spin_Retry_Count PO--C- 100 100 097 - 0
> 187 Reported_Uncorrect -O--CK 100 100 000 - 0
> 198 Offline_Uncorrectable ----C- 100 100 000 - 0
A RAW_VALUE of 0 for these attributes is good.
> 199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 7
7 is low, but the two drives in my file server both show 0 here.
Check your cable connections -- they should be fully engaged and not
loose. Otherwise swap the cable. (I wrote a serial number on all of my
SATA cables with Sharpie and track which cable is where.)
> 9 Power_On_Hours -O--CK 034 034 000 - 58404
If 58404 means ~6.6 years (and I think it does), that is a lot of time.
But, I would not worry based on just this value.
> 7 Seek_Error_Rate POSR-- 088 060 030 - 747385748
> 195 Hardware_ECC_Recovered -O-RC- 064 056 000 - 179548239
I don't know how to interpret these raw values; STFW shows I am not alone.
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
That is good.
> In fact I think it does not write to this disk at all, as the partition
> in the raid setup points to a disk with the same id.
> I think the problem is that blkid reports the same ID for both, and that
> somehow RAID is using this information rather than one of the other
> mechanisms - UUID or UDEV Maker/Model/Serial - which can be found
> under /dev/disk.
As I understand it, when mdadm creates an array, it writes a metadata
header onto each member device that identifies both the array and that
device's role as a member.
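You can look at that metadata directly. A sketch, assuming your member
partitions are /dev/sdf1 and /dev/sdg1 (substitute your actual devices):
# mdadm --examine /dev/sdf1
# mdadm --examine /dev/sdg1
With 1.x metadata, the "Array UUID" and "Device UUID" lines are how mdadm
tells the array and each member apart.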
When the system boots, mdadm reads /etc/mdadm/mdadm.conf for array
specifications, scans all devices for mdadm metadata, and then assembles
the specified arrays using the devices it finds (as best it can).
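To compare what mdadm detects against what is configured (these are the
Debian default locations):
# mdadm --detail --scan
# cat /etc/mdadm/mdadm.conf
The ARRAY line(s) printed by the first command should match the ARRAY
line(s) in the second.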
It looks like you partitioned your drives with one large partition on
each drive, and then created the array on the partitions.
The matching PTUUID values for both drives, and matching UUID and
PARTUUID values for both partitions, indicate that one drive was cloned
onto the other at some point after creating the array. I agree that this
was probably a mistake, and it is likely to confuse mdadm.
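You can put the collision side by side with something like this (again,
substitute your device names):
# blkid /dev/sdf /dev/sdg /dev/sdf1 /dev/sdg1
# ls -l /dev/disk/by-id/
blkid should show the duplicated PTUUID/UUID/PARTUUID values, while the
/dev/disk/by-id/ names are built from model and serial number and so stay
unique per drive.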
If you learn smartctl well enough, capture reports on a schedule
(weekly?), and look for trends, you might be able to predict failure.
STFW for information on this approach.
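A minimal version of that is a weekly root cron job; the log path and
device name below are only examples:
# crontab -e
0 3 * * 0 /usr/sbin/smartctl --xall /dev/sdg >> /var/log/smart-sdg.log 2>&1
Each report begins with a "Local Time is:" line, so appending to one file
still lets you diff week against week. smartd (also in the smartmontools
package) can run scheduled self-tests and send mail when it sees trouble.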
>> Download the bootable CD image of Seagate Seatools and run it:
>> https://www.seagate.com/support/downloads/seatools/
> might do that,
You want that CD as part of your tool kit -- it makes running the SMART
tests easy, lets you know if everything passed, and helps you understand
anything that is questionable.
> but I think the problem is in raid itself, as it does not indicate
> activity on the second disk, and blkid reports the same id for the two
> disks - I really might need to look into the raid code to see if blkid
> is used in any way.
Another alternative to crawling code would be to build another array on
a pair of USB flash drives using the same process as you used for your
500 GB drives, and then see what blkid(8) says about the USB drives.
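A rough sketch of that experiment, assuming the sticks show up as
/dev/sdx and /dev/sdy and that your array is a two-disk mirror (check the
names with lsblk first -- mdadm --create destroys whatever is on them):
# lsblk
(partition each stick with one large partition, as you did on the 500 GB
drives, then)
# mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/sdx1 /dev/sdy1
# blkid /dev/sdx1 /dev/sdy1 /dev/md9
If blkid reports distinct IDs for the fresh members, that points at the
cloning, not at mdadm or blkid.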
Do you have the console session from when you built the array?
Be sure to keep a console session of any and all mdadm commands you
issue from now on.
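The script(1) utility makes that easy -- it records everything typed and
everything printed to the terminal (the file name is just an example):
# script /root/mdadm-session.txt
  ... run your mdadm commands ...
# exit
The transcript ends up in /root/mdadm-session.txt.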
> [the drives] are in a server that runs virtually 24/7, and indeed I have
> replaced many over the years. In fact most of the old disks are gone.
> The Seagate is the oldest there ... the only one left, so I think I'll
> just replace it so that I may sleep well ... the problem is I don't know
> which disk is really being written to; it might be the Seagate, with the
> WD not operational ... I think it is best to be on the safe side :)
If the array is working, leave it alone. Back up/archive, build a
replacement array, rsync the data over, validate, migrate services to
the new array, validate services, and back up again (to validate your
backup process). Once the new array has been up and running for a
while, tear down the old array and pull its drives.
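The copy step could look something like this; the mount points are
placeholders:
# rsync -aHAXv /mnt/oldarray/ /mnt/newarray/
# rsync -aHAXv --dry-run --delete /mnt/oldarray/ /mnt/newarray/
-aHAX preserves permissions, hard links, ACLs, and extended attributes;
the second, dry-run pass is a cheap check that nothing was missed. Note
the trailing slash on the sources -- it copies the directory's contents
rather than the directory itself.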
David