On 05/13/2013 05:50 AM, Edward Ned Harvey (lopser) wrote:
From: Skylar Thompson [mailto:skylar.thomp...@gmail.com]

I've been a TSM admin for years, so I admit I'm biased, but at scale I
don't think there is much competition to TSM. The advantage of the
progressive incremental backup (basically, incremental-forever w/o ever
doing a full) outweighs the licensing costs.
This really gets at the heart of what I was thinking when I posted the question.  When I'm asking 
you guys what you think about reliability, I'm asking whether you have strong faith that these tools 
perform as designed, or whether you think bugs, etc., might cause failures with unexpected results.  
(Worse yet, undetected failures.)  It's understood that if you were using "cp" or 
"tar" or basically any of these other tools for backup purposes, you'd need to wrap 
it up with some kind of script or something.  And if you want validation, you're creating and 
storing your own checksums, etc.
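
To make the "wrap it in a script and keep your own checksums" point concrete, here is a rough, untested sketch (Python; the paths and file names are placeholders I made up, not anything anyone actually runs):

import hashlib
import subprocess
import sys

def sha256_of(path, bufsize=1 << 20):
    # Hex SHA-256 of a file, read in chunks so large archives don't blow up memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def backup(source_dir, archive_path):
    # Create the archive with tar, then record its checksum alongside it.
    subprocess.run(["tar", "-czf", archive_path, source_dir], check=True)
    with open(archive_path + ".sha256", "w") as f:
        f.write(sha256_of(archive_path) + "\n")

def verify(archive_path):
    # Later (or on the backup target), recompute and compare.
    with open(archive_path + ".sha256") as f:
        expected = f.read().strip()
    if sha256_of(archive_path) != expected:
        sys.exit("CHECKSUM MISMATCH for %s" % archive_path)
    print("%s OK" % archive_path)

if __name__ == "__main__":
    backup("/home", "/backups/home-20130513.tar.gz")
    verify("/backups/home-20130513.tar.gz")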

When you have incremental-forever (such as rdiff-backup), it seems to me you have 
a somewhat higher risk of data corruption than otherwise.  But it's a good starting place 
if you ensure the "current" image is the base standard, and you only have to follow 
the incrementals to go further back in time.  It's definitely true that recent 
backups get restored more frequently than old backups.
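
A toy model of why that tradeoff works (my own sketch, not rdiff-backup's actual format -- a "delta" here is just a dict of prior file contents): the newest backup restores directly, while older points in time require walking a chain of reverse deltas, so a corrupt delta anywhere in the chain hurts only the history behind it.

def restore(current_mirror, reverse_deltas, steps_back):
    # Reconstruct the backup as it looked `steps_back` sessions ago.
    state = dict(current_mirror)                # most recent backup restores as-is
    for delta in reverse_deltas[:steps_back]:   # newest delta first
        for path, old_contents in delta.items():
            if old_contents is None:            # file didn't exist back then
                state.pop(path, None)
            else:
                state[path] = old_contents
    return state

current = {"notes.txt": "v3", "new.txt": "hi"}
deltas = [
    {"notes.txt": "v2", "new.txt": None},   # undoes the most recent session
    {"notes.txt": "v1"},                    # undoes the one before that
]
print(restore(current, deltas, 0))   # latest: cheap, no deltas applied
print(restore(current, deltas, 2))   # oldest: must walk the whole chain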

I do believe ZFS gets it right.  The parent block contains the hashes of all its 
children, and since the grandparent already validated the hash of the parent, 
you can safely assume the parent's hashes are correct.  If a child's hash doesn't 
match, the problem is the child, and not a corrupt hash stored in the parent.  
As Andrew said, end-to-end checksums are the way to go.
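
In code, that reasoning looks roughly like this (a toy Merkle-style check of my own, not ZFS's actual on-disk layout): because the parent is verified against the hash stored above it before its own child hashes are consulted, a mismatch at the child level can only be blamed on the child.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Leaf (child) blocks and a parent block that stores their hashes.
children = [b"block A", b"block B", b"block C"]
parent = b"".join(h(c) for c in children)   # parent records child hashes
root_hash = h(parent)                       # the level above records the parent's hash

def verify(children, parent, root_hash):
    # Top-down: first check the parent against the hash stored above it...
    if h(parent) != root_hash:
        return "parent block is corrupt"
    # ...then check each child against the (now-trusted) hash in the parent.
    stored = [parent[i*32:(i+1)*32] for i in range(len(children))]
    for i, child in enumerate(children):
        if h(child) != stored[i]:
            return "child %d is corrupt (its stored hash is trusted)" % i
    return "all blocks verify"

print(verify(children, parent, root_hash))
children[1] = b"bit rot!"                   # silently corrupt a child
print(verify(children, parent, root_hash))  # the blame lands on the child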

TSM, along with the underlying storage, actually has multiple strategies to mitigate the chance of data corruption. First, even though there is no end-to-end checksumming, you have the option of checksumming data at each tier in the storage hierarchy. For us, we have checksumming enabled at the disk pool level, since no other data validation strategies are available at that level.

Second, we depend on LTO's data validation while data are being written to tape. An LTO drive's read heads are positioned ahead of the write heads, which allows the drive to continuously check that the data written out match the data still in the drive's buffers.

Third, TSM is continuously shuffling data, through a process called space reclamation, from tapes that used to be full but are emptying out as data expire, onto non-full ("filling") and scratch tapes. This ensures that data do not just sit idle on tape, but are re-read and re-written over time. Due to our low change rate (10TB/day on 6.5PB of total data), reclamation alone doesn't touch tapes very often, so we're planning on augmenting it with our own refresh operations, based on tape age rather than percent-full.
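
The age-based refresh selection we have in mind is simple enough to sketch (hypothetical Python with made-up volume names and an assumed three-year policy; this is not actual TSM configuration or syntax):

from datetime import date, timedelta

REFRESH_AFTER = timedelta(days=3 * 365)   # assumed policy: rewrite roughly every 3 years

volumes = [
    {"volser": "A00001", "last_written": date(2009, 6, 1),   "pct_full": 95},
    {"volser": "A00002", "last_written": date(2012, 11, 20), "pct_full": 40},
    {"volser": "A00003", "last_written": date(2008, 2, 14),  "pct_full": 100},
]

def needs_refresh(vol, today):
    # Age-based, not percent-full based: even a 100% full tape gets rewritten
    # once its data have sat unread for too long.
    return today - vol["last_written"] > REFRESH_AFTER

for vol in volumes:
    if needs_refresh(vol, today=date(2013, 5, 13)):
        print("queue a data move/refresh for", vol["volser"])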

Fourth, if all else fails, we have at least one copy pool held offsite for each of our primary data pools. This allows us to recover in the event of media failure. For our critical data (e.g. important databases, home directories, etc.) we have two copy pools.

While I would love to have end-to-end checksumming like ZFS, the costs, both in money (ZFS doesn't exist for tape yet, which means we would be paying 3-4x as much for disk) and in resource usage (we would have to read all the data we scan for backups just to compute a checksum), have made it prohibitive so far.
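
Just to put a rough number on that read cost (the throughput figure is purely my own assumption): reading 6.5PB at an aggregate 10GB/s would be 6,500,000 GB / 10 GB/s = 650,000 seconds, or roughly a week of sustained full-speed reads for every checksum pass, on top of the normal backup workload.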

Skylar
