On 05/13/2013 06:13 AM, Edward Ned Harvey (lopser) wrote:
From: [email protected] [mailto:[email protected]]
On Behalf Of Skylar Thompson
>> Second, we depend on LTO's data validation while data are being written to
>> tape.
> I don't want to say "all devices," but I'll say all hard drives include data
> integrity, in the form of FEC, built into the hardware. If you get data off the disk,
> it means it already passed the hardware checksum. The same is true for TCP. Yet things
> like ZFS layer additional integrity checking on top of that ... and in a lot of
> circumstances, it's wise to validate files transferred over a network too.
>
> I can't count the number of times I've discovered corrupt data by checking
> an md5 or scrubbing the filesystem. So the hardware and TCP checksumming is
> extremely useful but, at least by my measure, not good enough.
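For what it's worth, that kind of end-to-end check is cheap to script. Here's a
minimal Python sketch, assuming you recorded a digest on the source side before
the transfer (the function names are just for illustration):

import hashlib

def file_md5(path, chunk_size=1 << 20):
    # Read in 1 MiB chunks so large files never have to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(source_digest, destination_path):
    # A mismatch means corruption crept in somewhere between source and
    # destination, regardless of what the disk FEC or TCP checksums said.
    return source_digest == file_md5(destination_path)

MD5 is plenty for catching accidental corruption; swap in hashlib.sha256 if you
also care about deliberate tampering, since the structure is identical.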
One thing that works in our favor is that much of our data comes from
sources that are already known to be lossy (gene sequencers, mass
spectrometers, etc.). There's already QA going on to correct for that,
and in many cases it can correct for lossiness in storage as well -
sequencing frequently runs at 30x coverage, so any given region of the
genome is processed 30 times. If one of those copies is bad (whether on
the sequencer or in storage), QA can catch it.
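As a toy illustration of why that redundancy helps (this isn't our actual QA
pipeline, and it assumes the redundant copies can be lined up position by
position), a simple majority vote both recovers the likely value and flags
every position where any copy disagrees:

from collections import Counter

def consensus_with_flags(reads):
    # reads: equal-length strings covering the same region. Returns the
    # per-position majority call plus the positions where at least one copy
    # disagreed; those are what QA would flag as possible sequencer or
    # storage errors.
    consensus = []
    suspect_positions = []
    for i, column in enumerate(zip(*reads)):
        call, votes = Counter(column).most_common(1)[0]
        consensus.append(call)
        if votes < len(reads):
            suspect_positions.append(i)
    return "".join(consensus), suspect_positions

# With 30x coverage, a single corrupted copy is simply outvoted:
reads = ["ACGT"] * 29 + ["ACTT"]      # one copy has a flipped base
print(consensus_with_flags(reads))    # -> ('ACGT', [2])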
The final results, of course, need to be better protected, but they're
also much smaller than the raw data. The raw data is good to have backed
up for a few months, though, since each run (lasting 4-8 days, and we
might have 20 of them going at once) can cost upwards of $20k.
Skylar