On 05/13/2013 05:50 AM, Edward Ned Harvey (lopser) wrote:
From: Skylar Thompson [mailto:skylar.thomp...@gmail.com]

I've been a TSM admin for years, so I admit I'm biased, but at scale I
don't think there is much competition to TSM. The advantage of the
progressive incremental backup (basically, incremental-forever w/o ever
doing a full) outweighs the licensing costs.
This really gets at the heart of what I was thinking when I posted the question.  When I'm asking 
you guys what you think about reliability, I'm asking whether you have strong faith that these tools 
perform as designed, or whether you think bugs, etc., might cause failures with unexpected results.  
(Worse yet, undetected failures.)  It's understood that if you were using "cp" or 
"tar" or basically any of these other tools for backup purposes, you'd need to wrap 
it up with some kind of script or something.  And if you want validation, you're creating and 
storing your own checksums, etc.
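
To make the "wrap it in a script and keep your own checksums" point concrete, here is a rough, untested sketch (Python; the paths and file names are placeholders I made up, not anything anyone actually runs):

import hashlib
import subprocess
import sys

def sha256_of(path, bufsize=1 << 20):
    # Hex SHA-256 of a file, read in chunks so large archives don't blow up memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def backup(source_dir, archive_path):
    # Create the archive with tar, then record its checksum alongside it.
    subprocess.run(["tar", "-czf", archive_path, source_dir], check=True)
    with open(archive_path + ".sha256", "w") as f:
        f.write(sha256_of(archive_path) + "\n")

def verify(archive_path):
    # Later (or on the backup target), recompute and compare.
    with open(archive_path + ".sha256") as f:
        expected = f.read().strip()
    if sha256_of(archive_path) != expected:
        sys.exit("CHECKSUM MISMATCH for %s" % archive_path)
    print("%s OK" % archive_path)

if __name__ == "__main__":
    backup("/home", "/backups/home-20130513.tar.gz")
    verify("/backups/home-20130513.tar.gz")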

When you have incremental-forever (such as rdiff-backup), it seems to me you have 
a somewhat higher risk of data corruption than otherwise.  But it's a good starting place 
if you ensure the "current" image is the base standard, and you only have to follow 
the incrementals to go further back in time.  It's definitely true that recent 
backups get restored more frequently than old backups.
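
A toy model of why that tradeoff works (my own sketch, not rdiff-backup's actual format -- a "delta" here is just a dict of prior file contents): the newest backup restores directly, while older points in time require walking a chain of reverse deltas, so a corrupt delta anywhere in the chain hurts only the history behind it.

def restore(current_mirror, reverse_deltas, steps_back):
    # Reconstruct the backup as it looked `steps_back` sessions ago.
    state = dict(current_mirror)                # most recent backup restores as-is
    for delta in reverse_deltas[:steps_back]:   # newest delta first
        for path, old_contents in delta.items():
            if old_contents is None:            # file didn't exist back then
                state.pop(path, None)
            else:
                state[path] = old_contents
    return state

current = {"notes.txt": "v3", "new.txt": "hi"}
deltas = [
    {"notes.txt": "v2", "new.txt": None},   # undoes the most recent session
    {"notes.txt": "v1"},                    # undoes the one before that
]
print(restore(current, deltas, 0))   # latest: cheap, no deltas applied
print(restore(current, deltas, 2))   # oldest: must walk the whole chain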

I do believe ZFS gets it right.  The parent block contains the hashes of all its 
children, and since the grandparent already validated the hash of the parent, 
you can safely assume the parent's hashes are correct.  If a child's hash doesn't 
match, the problem is the child, and not a corrupt hash stored in the parent.  
As Andrew said, end-to-end checksums are the way to go.
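
In code, that reasoning looks roughly like this (a toy Merkle-style check of my own, not ZFS's actual on-disk layout): because the parent is verified against the hash stored above it before its own child hashes are consulted, a mismatch at the child level can only be blamed on the child.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Leaf (child) blocks and a parent block that stores their hashes.
children = [b"block A", b"block B", b"block C"]
parent = b"".join(h(c) for c in children)   # parent records child hashes
root_hash = h(parent)                       # the level above records the parent's hash

def verify(children, parent, root_hash):
    # Top-down: first check the parent against the hash stored above it...
    if h(parent) != root_hash:
        return "parent block is corrupt"
    # ...then check each child against the (now-trusted) hash in the parent.
    stored = [parent[i*32:(i+1)*32] for i in range(len(children))]
    for i, child in enumerate(children):
        if h(child) != stored[i]:
            return "child %d is corrupt (its stored hash is trusted)" % i
    return "all blocks verify"

print(verify(children, parent, root_hash))
children[1] = b"bit rot!"                   # silently corrupt a child
print(verify(children, parent, root_hash))  # the blame lands on the child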

TSM, along with the underlying storage, actually has multiple strategies to mitigate the chance of data corruption. First, even though there is no end-to-end checksumming, you have the option of checksumming data at each tier in the storage hierarchy. For us, we have checksumming enabled at the disk pool level, since no other data validation strategies are available at that level.

Second, we depend on LTO's data validation while data are being written to tape. An LTO drive's read heads are positioned ahead of the write heads, which allows the drive to continuously check that the data written out match the data still in the drive's buffers.

Third, TSM is continuously shuffling data, through a process called space reclamation, from tapes that used to be full but are emptying out as data expire, onto non-full ("filling") and scratch tapes. This ensures that data do not just sit idle on tape, but are re-read and re-written over time. Due to our low change rate (10TB/day on 6.5PB of total data), reclamation alone doesn't touch tapes very often, so we're planning on augmenting it with our own refresh operations, based on tape age rather than percent-full.
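
The age-based refresh selection we have in mind is simple enough to sketch (hypothetical Python with made-up volume names and an assumed three-year policy; this is not actual TSM configuration or syntax):

from datetime import date, timedelta

REFRESH_AFTER = timedelta(days=3 * 365)   # assumed policy: rewrite roughly every 3 years

volumes = [
    {"volser": "A00001", "last_written": date(2009, 6, 1),   "pct_full": 95},
    {"volser": "A00002", "last_written": date(2012, 11, 20), "pct_full": 40},
    {"volser": "A00003", "last_written": date(2008, 2, 14),  "pct_full": 100},
]

def needs_refresh(vol, today):
    # Age-based, not percent-full based: even a 100% full tape gets rewritten
    # once its data have sat unread for too long.
    return today - vol["last_written"] > REFRESH_AFTER

for vol in volumes:
    if needs_refresh(vol, today=date(2013, 5, 13)):
        print("queue a data move/refresh for", vol["volser"])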

Fourth, if all else fails, we have at least one copy pool held offsite for each of our primary data pools. This allows us to recover in the event of media failure. For our critical data (e.g. important databases, home directories, etc.) we have two copy pools.

While I would love to have end-to-end checksumming like ZFS, the costs, both in money (ZFS doesn't exist for tape yet, which means we would be paying 3-4x as much for disk) and in resource usage (we would have to read all the data we scan for backups just to compute a checksum), have made it prohibitive so far.
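
Just to put a rough number on that read cost (the throughput figure is purely my own assumption): reading 6.5PB at an aggregate 10GB/s would be 6,500,000 GB / 10 GB/s = 650,000 seconds, or roughly a week of sustained full-speed reads for every checksum pass, on top of the normal backup workload.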

Skylar
