>As others have noted, different vendors dedup at different levels of
>granularity.
I think I'd put it slightly differently. I'd say that they each approach it
differently, and those different approaches may have advantages and
disadvantages with different data types.

>When I spoke to Diligent at the Gartner conference over
>a year ago, they were very tight-lipped about their actual
>algorithm.

The patent was filed. It's not that secret. ;) They are quite different in
their approach, and it's a little difficult to grok. But based on what I know
about their approach, the scenario that started the discussion may indeed be
a limitation. (Or all the vendors may have this limitation; I have some
questions out to them.)

>The[y] would, however, state that they were able to dedup
>parts of two files that had similar data, but were not
>identical. I.e., if data was inserted at the beginning of the file,
>some parts of the end of the file could still be deduped. Neat trick
>if it's true.

Any de-dupe vendor is able to claim that. If it weren't true, they wouldn't
be seeing the de-dupe rates they're seeing. They can also identify blocks
that are common between a file in the file system and the same file emailed
via Exchange.

>Other vendors dedup at the file or block (or chunk) level.

If a vendor doesn't do subfile de-dupe, then they're not a de-dupe vendor;
they're a CAS vendor. File-level de-dupe is CAS (e.g., Centera, Archivas),
and the de-dupe is not really pitched as the main feature there. It's about
using the signature as a way to provide immutability of data stored in the
CAS array.

>I've not been able to gather much more detail about the specific
>dedup algorithms, but hope to get some more info this fall, as [I] take a
>closer look at these products. If anyone has more details, I'd love
>to hear them.

I wrote this article, which may help: http://tinyurl.com/3588fb . I also blog
about de-dupe quite a bit at www.backupcentral.com.
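
For anyone following along, here is a minimal sketch of content-defined
(variable-length) chunking, which is the general idea behind being able to
de-dupe the tail of a file after bytes are inserted at the front. This is not
any vendor's actual (patented) algorithm; the window size, hash, divisor, and
chunk sizes are made-up illustrative values.

    import hashlib
    import os

    WINDOW = 48          # bytes in the rolling-hash window (illustrative)
    BASE = 257           # polynomial hash base
    MOD = (1 << 61) - 1  # large prime modulus
    DIVISOR = 2048       # boundary when hash % DIVISOR == 0 -> ~2 KB chunks
    MIN_CHUNK = 256      # avoid pathologically small chunks

    def chunks(data):
        """Yield content-defined chunks; boundaries come from the data itself."""
        pow_w = pow(BASE, WINDOW, MOD)   # for dropping the byte leaving the window
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = (h * BASE + byte) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * pow_w) % MOD
            if i + 1 - start >= MIN_CHUNK and h % DIVISOR == 0:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    def signatures(data):
        """One SHA-1 fingerprint per chunk -- what a de-dupe index would store."""
        return [hashlib.sha1(c).hexdigest() for c in chunks(data)]

    if __name__ == "__main__":
        original = os.urandom(500_000)
        shifted = b"some bytes inserted at the front" + original
        a, b = signatures(original), signatures(shifted)
        shared = len(set(a) & set(b))
        print(f"{shared} of {len(a)} chunks still de-dupe after the insert")

Because the chunk boundaries are chosen from the data itself rather than at
fixed offsets, the inserted bytes only disturb the first chunk or two; every
later chunk hashes to the same signature and would be found in the de-dupe
index. Fixed-block or whole-file hashing (the CAS approach) loses that
alignment as soon as anything shifts.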