Since this message is pretty pro-de-dupe, I want to mention that I don't sell any of this stuff. I'm just excited about the technology, have many customers large and small using it, and want to make sure it's accurately represented.
>"We don't need tape, because disk is cheap!" >[...hiatus...] >"We have to save disk! Buy (and integrate, and manage) a new product!" I would put that history slightly differently. I don't know anyone who knew what they were doing that was saying "we don't need tape!" What they were saying is: "Tape drives are now way too fast! We have to stage to disk to backup to stream the drives. Wouldn't it be cool if we could also do away with tape onsite, but we still need it for offsite." [...hiatus...] "Holy crap! VTLs are expensive! Forget the store all onsite backups on disk part. Let's just do staging. That requires a much smaller amount of disk." [...hiatus...] "De-dupe is here. Using that, we can take the amount of disk that we would have bought just for staging and store all our onsite backups on it. Wow." >I think a back-end de-dup (de do da da) would still offer advantages >to TSM: if you've got mumblety-hundred (e.g.) Win2K boxen, then most >of their system and app space would be identical. This could, >concievably, end up as close to one system-images' worth of space on >the back end. In a fantasy. :) This is not a fantasy. There are products that have been GA for 3+ years that are doing just this. These products also notice when a file has been modified multiple times and just backs up the new blocks that were changed each time. In addition, these products also notice users' files that are common between the filesystem and sitting inside Exchange inboxes and Sent Items folders, for example. They notice attachments that were sent to multiple remote offices that have already been backed up. All of tis is reality, is GA, and is being used by many companies, many of them very, very large. >However, the server would need to do an awful lot of work to correlate >all these data. It's not easy, but it's not as hard as you may think. The main work comes from two things: computing a SHA-1 hash on each block of data and looking up that hash in a big hash table. The first is only performed by each client (speaking of source de-dupe) on new or changed files, so it's not as bad as you might think. The second can handle quite a few clients simultaneously without being a bottleneck. At some point, you may need multiple hash tables and servers to handle the lookup, but the workload can be distributed. For example, install a second lookup server and each server handles lookups for half of the total list of hashes. As to how fast de-dupe backup software is, it's definitely fast enough to keep up with remote offices and medium-sized datacenters. Once we start getting into many TBs of LOCAL data (i.e. a large datacenter), there are much more efficient ways to back it up. But if the data is remote, de-dupe backup software is hard to beat. (These last few comments were about de-dupe backup software -- not to be confused with de-dupe VTLs. Those actually go VERY fast and can handle the largest of environments.)