2012-01-24 19:50, Stefan Ring wrote:
> After having read this mailing list for a little while, I get the
> impression that there are at least some people who regularly
> experience on-disk corruption that ZFS should be able to report and
> handle. I’ve been running a raidz1 on three 1TB consumer disks for
> approx. 2 years now (about 90% full), and I scrub the pool every 3-4
> weeks and have never had a single error. From the oft-quoted 10^14
> error rate that consumer disks are rated at, I should have seen an
> error by now -- the scrubbing process is not the only activity on the
> disks, after all, and the data transfer volume from that alone clocks
> in at almost exactly 10^14 by now.
> Not that I’m worried, of course, but it comes at a slight surprise to
> me. Or does the 10^14 rating just reflect the strength of the on-disk
> ECC algorithm?
I maintained several dozen storage servers for about
12 years, and I've seen quite a few drive deaths as
well as automatically triggered RAID array rebuilds.
But usually these were "infant deaths" in the first
year, and the drives that passed that early test often
gave no noticeable problems for the next decade.
Several 2-4 disk systems have been running as
OpenSolaris SXCE servers with ZFS pools for root and
data for years now, and they also show no problems.
However, most of these are branded systems and disks
from Sun.
I think we've only had one or two drives die, but we
happened to have cold spares on hand due to
over-ordering ;)
I do have a suspiciously high error rate on my home
NAS, which was thrown together from whatever pieces I
had at home at the time I left for an overseas trip.
The box has been nearly unmaintained since then, and
it can suffer from physical problems known and
unknown, such as the SATA cabling (varied and quite
possibly bad), non-ECC memory, dust and overheating,
etc.
It is also possible that aging components such as the
CPU and motherboard, which have seen about 5 years of
active duty (including an overclocked past), contribute
to the error rate.
The old 80GB root drive has had some bad sectors (READ
errors during scrubs and data access), and rpool has
been recreated with copies=2 a few times now, thanks
to a LiveUSB, but the main data pool had no substantial
errors until the CKSUM errors reported this winter
(metadata:0x0 and then the dozen or so in-file checksum
mismatches). Since one of the drives got itself lost
soon after, and only reappeared after all the cables
were replugged, I still tend to blame SATA cabling as
the most probable root cause.
I do not have an up-to-date SMART error report, and
the box is not accessible at the moment, so I can't
comment on lower-level errors in the main pool drives.
They were new at the time I put the box together (almost
a year ago now).
However, so far I am bothered much more by this box's
tendency to lock up and/or reboot after somewhat
repeatable actions (such as destroying large snapshots
of deduped datasets, etc.) than by the on-disk CKSUM
errors it has found, however those came to be. I tend
to write this off as shortcomings of the OS (i.e.
memory hunger and lockups in scanrate hell as the most
frequent cause), and this bothers me much more now,
causing lots of downtime until some friend comes over
to that apartment to reboot the box.
> Or does the 10^14 rating just reflect the strength
> of the on-disk ECC algorithm?
I am not sure how much the algorithms differ between
"enterprise" and "consumer" disks, while the UBER is
said to differ by a factor of about 100. It might also
have to do with the quality of materials (better steel
in the ball bearings, etc.) as well as better
firmware/processors which optimize mechanical workloads
and reduce mechanical wear. Maybe so, at least...
Finally, this is statistics. It does not "guarantee"
that for some 90 Tbits of transferred data you will
certainly see an error (and just one, for that matter).
The drives that died young hopefully also count in the
overall stats, moving the bar a bit higher for their
better-made brethren.
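
To put a number on that, here is a minimal
back-of-the-envelope sketch in Python (nothing
ZFS-specific; the 1e-14 and 1e-15 figures are just the
commonly quoted consumer and "enterprise" UBER ratings,
taken as assumptions):

import math

def p_at_least_one_error(bits_read, uber):
    # Treat each bit read as an independent failure with probability UBER:
    # P(>=1 unrecoverable error) = 1 - (1 - UBER)^N ~= 1 - exp(-UBER * N)
    return 1.0 - math.exp(-uber * bits_read)

bits = 1e14                  # ~12.5 TB read, about the volume quoted above
for uber in (1e-14, 1e-15):  # consumer vs. enterprise-class rating
    print("UBER %.0e: expected errors %.2f, P(>=1) %.0f%%"
          % (uber, uber * bits, 100 * p_at_least_one_error(bits, uber)))

Even at the rated 1e-14 that comes out to roughly a 63%
chance of having seen at least one error after 1e14
bits read, so a clean run is nothing unusual; at 1e-15
it drops to about 10%.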
Also, disk UBER concerns media failures and the ability
of the disk's cache, firmware and ECC to deal with
them. After the disk sends the "correct" sector down
the wire, many things can still happen: noise in bad
connectors, electromagnetic interference from all the
motors in your computer onto the data cable, the
ability (or lack thereof) of the data protocol (IDE,
ATA, SCSI) to detect and/or recover from such random
bits injected between disk and HBA, errors in HBA chips
and code, noise in old rusty PCI* connector slots,
bitflips in non-ECC RAM or overheated CPUs, power
surges from the PSU... There is a lot of stuff that can
break :)
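
This, by the way, is exactly the class of problem that
end-to-end checksumming is meant to catch: ZFS keeps a
fletcher or sha256 checksum with each block pointer and
verifies it on read. As a toy sketch in Python (purely
illustrative, not what ZFS actually runs), a single bit
flipped anywhere between the media and the host changes
the block's checksum:

import hashlib

block = bytearray(131072)                     # one 128K data block, all zeros
stored = hashlib.sha256(block).hexdigest()    # checksum saved at write time

block[4096] ^= 0x01                           # one bit flipped "on the wire"
readback = hashlib.sha256(block).hexdigest()  # recomputed when the block is read

print(stored == readback)  # False: the corruption is detected, and a redundant
                           # copy (mirror, raidz parity, copies=2) can repair it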
//Jim Klimov