On Monday, 20 January 2020 2:08:44 AM AEDT Craig Sanders via luv-main wrote:
> On Sun, Jan 19, 2020 at 05:34:46PM +1100, [email protected] wrote:
> > I generally agree that RAID-1 is the way to go.  But if you can't do that
> > then BTRFS "dup" and ZFS "copies=2" are good options, especially with SSD.
> 
> I don't see how that's the case, how it can help much (if at all). Making a
> second copy of the data on the same drive that's failing doesn't add much
> redundancy, but does add significantly to the drive's workload (increasing
> the risk of failure).
> 
> It might be ok on a drive with only a few bad sectors or in conjunction with
> some kind of RAID, but it's not a substitute for RAID.

Having a storage device fail entirely seems like a rare occurrence.  The only 
time it happened to me in the last 5 years is an SSD that stopped accepting 
writes (reads still mostly worked OK).

I've had a couple of SSDs and a lot of hard drives develop checksum errors 
recently.  Checksum errors (where the drive returns what it considers good 
data but BTRFS or ZFS regards as bad data) are by far the most common failures 
I've seen recently across the 40+ storage devices I'm running.

BTRFS "dup" and ZFS "copies=2" would cover almost all storage hardware issues 
that I've seen in the last 5+ years.
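For anyone wanting to try those settings, this is roughly how they're enabled 
(the device and mount point names are just examples):

```shell
# BTRFS: keep two copies of both data and metadata on a single device.
mkfs.btrfs -m dup -d dup /dev/sdX

# Or convert an existing BTRFS filesystem in place:
btrfs balance start -dconvert=dup -mconvert=dup /mnt/point

# ZFS: store two copies of every block in a dataset.  Note this only
# applies to data written after the property is set.
zfs set copies=2 tank/important
```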

> > So far I have not seen a SSD entirely die, the worst I've seen is a SSD
> > stop
> 
> I haven't either, but I've heard & read of it.  Andrew's rootfs SSD seems to
> have died (or possibly just corrupted so badly it can't be mounted. i'm not
> sure)
> 
> I've seen LOTS of HDDs die.  Even at home I've had dozens die on me over the
> years - I've got multiple stacks of dead drives of various ages and sizes
> cluttering up shelves (mostly waiting for me to need another fridge magnet
> or shiny coffee-cup coaster :)

I've seen them die in the past.  But recently they seem to just give 
increasing error counts.  Maybe a disk that was giving ZFS or BTRFS checksum 
errors would die entirely if I ran it for another few years, but I generally 
discard or drastically repurpose such disks after they reach ~40 checksum 
errors.
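Those error counts are easy to keep an eye on (the mount point is an 
example):

```shell
# ZFS: per-device read/write/checksum error counts for every pool.
zpool status -v

# BTRFS: cumulative per-device error counters for a mounted filesystem.
btrfs device stats /mnt/point
```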

> > For hard drives also I haven't seen a total failure (like stiction) for
> > many years.  The worst hard drive problem I've seen was about 12,000 read
> > errors, that sounds like a lot but is a very small portion of a 3TB disk
> > and "dup" or "copies=2" should get most of your data back in that
> > situation.
> If a drive is failing, all the read or write re-tries kill performance on a
> zpool, and that drive will eventually be evicted from the pool. Lose enough
> drives, and your pool goes from "DEGRADED" to "FAILED", and your data goes
> with it.

So far I haven't seen that happen on my ZFS servers.  I have replaced at least 
20 disks in zpools due to excessive checksum errors.
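For reference, replacing one of those disks is typically just (pool and 
device names are examples):

```shell
# Swap the failing disk for a new one; ZFS resilvers automatically.
zpool replace tank /dev/sdX /dev/sdY

# Reset the error counters once the replacement has resilvered.
zpool clear tank
```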

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
