On 05/13/2013 05:50 AM, Edward Ned Harvey (lopser) wrote:
From: Skylar Thompson [mailto:skylar.thomp...@gmail.com]
I've been a TSM admin for years, so I admit I'm biased, but at scale I
don't think there is much competition to TSM. The advantage of the
progressive incremental backup (basically, incremental-forever w/o ever
doing a full) outweighs the licensing costs.
This really gets at the heart of what I was thinking when I posted the question. When I'm asking
you guys what you think about reliability, I'm asking whether you have strong faith that these tools
perform as designed, or whether you think bugs, etc., might cause failures with unexpected results.
(Worse yet, undetected failures.) It's understood that if you were using "cp" or
"tar" or basically any of these other tools for backup purposes, you'd need to wrap
them up with some kind of script or something. And if you want validation, you're creating and
storing your own checksums, etc.
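
A minimal sketch of the kind of wrapper that means in practice, assuming tar as the
copy tool and SHA-256 for validation (the paths and file names here are invented
purely for illustration):

import hashlib
import subprocess
from pathlib import Path

# Hypothetical paths, purely for illustration.
SOURCE = "/home/projects"
ARCHIVE = Path("/backup/projects-2013-05-13.tar.gz")
MANIFEST = ARCHIVE.with_suffix(".sha256")

def sha256_of(path):
    """Stream the file through SHA-256 so large archives don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# 1. Create the archive (tar here, but any copy tool would do).
subprocess.run(["tar", "czf", str(ARCHIVE), SOURCE], check=True)

# 2. Record the checksum alongside the archive.
MANIFEST.write_text(sha256_of(ARCHIVE) + "  " + ARCHIVE.name + "\n")

# 3. Later, before trusting a restore, verify against the stored checksum.
expected = MANIFEST.read_text().split()[0]
if sha256_of(ARCHIVE) != expected:
    raise SystemExit("checksum mismatch on " + str(ARCHIVE) + " -- do not restore from it")

The point is simply that the checksum travels with the archive and gets re-checked
before anyone trusts a restore.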
When you have incremental-forever (such as rdiff-backup), it seems to me you have
somewhat higher risk of data corruption than otherwise. But it's a good starting point,
if you ensure the "current" image is the baseline and you follow
the incrementals to go further back in time. It's definitely true that recent
backups get restored more frequently than old backups.
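
For what it's worth, the "current image is the base, follow the incrementals to go
back" idea looks roughly like this in miniature (not rdiff-backup's actual code or
on-disk format, just the shape of a reverse-incremental restore):

def restore(current_mirror, reverse_deltas, steps_back):
    """Reconstruct an older version by applying reverse deltas to the mirror.

    current_mirror -- {path: contents} for the most recent backup
    reverse_deltas -- newest-first; each delta maps path -> older contents,
                      or path -> None if the file did not exist back then
    steps_back     -- 0 returns the mirror itself, 1 the previous version, ...
    """
    state = dict(current_mirror)
    for delta in reverse_deltas[:steps_back]:
        for path, old_contents in delta.items():
            if old_contents is None:
                state.pop(path, None)       # file didn't exist in the older version
            else:
                state[path] = old_contents  # roll the file back
    return state

# Recent restores touch few deltas, but every delta in the chain has to be
# intact to reach an old version -- which is where the corruption worry comes in.
mirror = {"notes.txt": "v3"}
deltas = [{"notes.txt": "v2"}, {"notes.txt": "v1"}]
assert restore(mirror, deltas, 2) == {"notes.txt": "v1"}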
I do believe ZFS gets it right. The parent block contains the hash of all the
children, and since the grandparent already validated the hash of the parent,
you can safely assume the parent hash is correct. If a child hash doesn't
match, the problem is the child, and not a corrupt hash stored in the parent.
As Andrew said, end-to-end checksumming is the way to go.
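
To make the parent-validates-child point concrete, here's a toy Merkle-style sketch
(nothing like ZFS's actual on-disk layout, just the validation order):

import hashlib

def node_hash(node):
    """Hash of a node = hash over its payload plus the child hashes it stores."""
    h = hashlib.sha256(node["data"])
    for child_hash, _child in node["children"]:
        h.update(child_hash)
    return h.digest()

def verify(node, expected_hash):
    """Top-down check: the parent vouches for each child's hash, so a mismatch
    at any level pins the blame on that block, not on the stored hash."""
    if node_hash(node) != expected_hash:
        return False
    return all(verify(child, child_hash) for child_hash, child in node["children"])

# Trust is anchored at the top (ZFS anchors it in the uberblock); everything
# below is validated against the hash its parent already committed to.
leaf = {"data": b"file contents", "children": []}
root = {"data": b"metadata", "children": [(node_hash(leaf), leaf)]}
root_hash = node_hash(root)            # what the grandparent would store
assert verify(root, root_hash)
leaf["data"] = b"bit-rotted contents"  # corrupt the child after the fact
assert not verify(root, root_hash)     # the mismatch shows up at the child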
TSM, along with the underlying storage, actually has multiple
strategies to mitigate the chance of data corruption. First, even though
there is no end-to-end checksumming, you have the option of checksumming
data at each tier in the storage hierarchy. For us, we have checksumming
enabled at the disk pool level, since no other data validation
strategies are available at that level.
Second, we depend on LTO's data validation while data are being written to
tape. An LTO drive's read heads are positioned immediately after the write heads,
which allows the drive to continuously check that the data just written to tape
match the data still in the drive's buffers.
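
The drive does this in firmware, of course, but the principle is plain
read-after-write verification; a toy illustration (nothing LTO-specific here, and
on a real filesystem the read-back would likely be served from cache, so treat it
as a sketch of the idea only):

import os

def write_and_verify(f, still_buffered):
    """Write a block, read it straight back, and compare it against the copy
    still held in the buffer -- the same check an LTO drive does in hardware."""
    pos = f.tell()
    f.write(still_buffered)
    f.flush()
    os.fsync(f.fileno())
    f.seek(pos)
    read_back = f.read(len(still_buffered))
    if read_back != still_buffered:
        raise IOError("read-after-write mismatch: rewrite the block elsewhere")
    f.seek(pos + len(still_buffered))   # carry on writing after the verified block

with open("/tmp/raw-demo.bin", "w+b") as f:
    write_and_verify(f, b"block of backup data")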
Third, TSM continuously shuffles data from tapes that used to be full,
but are emptying out as data expire, onto non-full ("filling") and
scratch tapes, through a process called space reclamation. This
ensures that data do not just sit idle on tape, but are re-read and
re-written over time. Due to our low change rate (10 TB/day against
6.5 PB of total data), we're planning to augment this with our own
refresh operations, based on tape age rather than percent-full.
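
The age-based refresh is something we'd have to script ourselves; roughly the shape
I have in mind (the inventory, threshold, and volume names below are placeholders,
and the actual data movement would be driven through TSM rather than this code):

from datetime import date, timedelta

# Placeholder inventory; in practice this would come from the TSM database,
# not a hard-coded list.
tape_inventory = [
    {"volume": "A00001L5", "last_written": date(2010, 3, 2), "pct_full": 97},
    {"volume": "A00002L5", "last_written": date(2013, 4, 20), "pct_full": 41},
]

MAX_AGE = timedelta(days=3 * 365)  # made-up policy: refresh anything not rewritten in ~3 years

def tapes_needing_refresh(inventory, today=None):
    """Pick tapes by age regardless of how full they are -- the point of an
    age-based policy is to catch full tapes that reclamation never touches."""
    today = today or date.today()
    return [t["volume"] for t in inventory if today - t["last_written"] > MAX_AGE]

for vol in tapes_needing_refresh(tape_inventory):
    # Here we'd kick off whatever actually moves the data onto fresh media;
    # printing the volume name stands in for that step.
    print("refresh", vol)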
Fourth, if all else fails, we have at least one copy pool held
offsite for our primary data pools. This allows us to recover in the
event of media failure. For our critical data (e.g., important databases
and home directories), we have two copy pools.
While I would love to have end-to-end checksumming like ZFS, the cost,
both in money (ZFS doesn't exist for tape yet, which means we would be
paying 3-4x as much for disk) and in resource usage (we would have to read
all the data we scan for backups just to compute a checksum), makes it
prohibitive so far.
Skylar