On 05/13/2013 06:13 AM, Edward Ned Harvey (lopser) wrote:
From: [email protected] [mailto:[email protected]]
On Behalf Of Skylar Thompson

>> Second, we depend on LTO's data validation while data are being written to
>> tape.
> I don't want to say "all devices," but I'll say all hard drives include data
> integrity, in the form of FEC, built into the hardware.  If you get data off the disk,
> it means it already passed the hardware checksum.  The same is true for TCP.  Yet things
> like ZFS layer additional integrity checking on top of that ... and in a lot of
> circumstances it's wise to validate files transferred over a network too.

> I certainly can't count the number of times I've discovered corrupt
> data by checking the md5 or scrubbing the filesystem.  So the hardware and TCP
> checksumming is extremely useful, but at least by my measure, not good enough.
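As a minimal sketch of the kind of end-to-end check described above (Python, with hypothetical file paths), note that the verification runs entirely above the disk FEC and TCP checksum layers:

    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        """Hash a file in chunks so large files never have to fit in memory."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # Hypothetical paths: compare the source file against the copy that
    # arrived over the network.  The disk FEC and TCP checksums both passed,
    # yet this can still catch corruption they miss (bad RAM, flaky
    # controllers, software bugs).
    if md5sum("/data/run42.bin") != md5sum("/mnt/remote/run42.bin"):
        raise RuntimeError("transfer verification failed: checksums differ")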

One thing that works in our favor is that much of our data comes from sources that are already known to be lossy (gene sequencers, mass spectrometers, etc.). There's already QA going on to correct for that, and in many cases it can correct for storage lossiness as well - sequencing frequently runs at 30x coverage, so any given region of the genome is read 30 times. If one of those copies is bad (whether on the sequencer or in the storage), the QA can catch it.
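A toy sketch of that idea (Python, with made-up reads; real pipelines use quality scores and far more sophisticated statistics): with redundant coverage of the same region, a per-position majority vote simply outvotes a single corrupted copy:

    from collections import Counter

    def consensus(reads):
        """Per-position majority vote across aligned reads of equal length."""
        return "".join(
            Counter(bases).most_common(1)[0][0] for bases in zip(*reads)
        )

    # Three hypothetical copies of one region; the third has a corrupted base.
    reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]
    print(consensus(reads))  # -> ACGTACGT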

The final results, of course, need to be better protected, but they're also much smaller than the raw data. The raw data are still worth backing up for a few months, though, since each run (lasting 4-8 days, with perhaps 20 of them going at once) can cost upwards of $20k.

Skylar
