[EMAIL PROTECTED] wrote on 07/22/2008 11:48:30 AM:

> Chris Cosby wrote:
> >
> > On Tue, Jul 22, 2008 at 11:19 AM, <[EMAIL PROTECTED]> wrote:
> >
> >     [EMAIL PROTECTED] wrote on 07/22/2008 09:58:53 AM:
> >
> > > To do dedup properly, it seems like there would have to be some overly complicated methodology for a sort of delayed dedup of the data. For speed, you'd want your writes to go straight into the cache and get flushed out as quickly as possible, keeping everything as ACID as possible. Then, a dedup scrubber would take what was written, do the voodoo magic of checksumming the new data, scanning the tree to see if there are any matches, locking the duplicates, running the usage counters up or down for that block of data, swapping out inodes, and marking the duplicate data as free space.
> >
> > I agree, but what you are describing is file-based dedup. ZFS already has the groundwork for dedup in the system (block-level checksumming and pointers).
> >
> > > It's a lofty goal, but one that is doable. I guess this is only necessary if deduplication is done at the file level. If done at the block level, it could possibly be done on the fly, what with the already implemented checksumming at the block level,
> >
> > Exactly -- that is why it is attractive for ZFS: so much of the groundwork is done and needed for the fs/pool already.
> >
> > > but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs.
> >
> > I don't know that you can make this statement without some study of an actual implementation on real-world data -- and then, because it is block based, you should see varying degrees of this dedup-flack-frag depending on data/usage.
> >
> > It's just a NonScientificWAG. I agree that most of the duplicated blocks will in most cases be part of identical files anyway, and thus lined up exactly as you'd want them. I was just free thinking and typing.
>
> No, you are right to be concerned over block-level dedup seriously impacting seeks. The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections; internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has the potential to save at least some reasonable amount of space, yet will absolutely kill performance.
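For reference before replying point by point: the bookkeeping described in the first quoted message above boils down to a checksum-keyed table with reference counts. Here is a toy sketch in Python -- purely illustrative, with made-up names, and nothing to do with the actual ZFS code -- where a write whose checksum already exists just runs the counter up and reuses the existing block, and freeing runs it down until the block can be reclaimed:

    import hashlib

    class DedupTable:
        """Toy block-level dedup bookkeeping: checksum -> block, with refcounts.
        Purely illustrative; not how ZFS stores or names anything."""

        def __init__(self):
            self.by_hash = {}    # checksum -> block id
            self.refcount = {}   # block id -> number of references
            self.blocks = {}     # block id -> data (stand-in for the on-disk block)
            self.next_id = 0

        def write(self, data):
            """Return a block id, reusing an existing block when the checksum matches."""
            h = hashlib.sha256(data).hexdigest()
            if h in self.by_hash:            # duplicate: run the usage counter up
                bid = self.by_hash[h]
                self.refcount[bid] += 1
                return bid
            bid = self.next_id               # new data: allocate a fresh block
            self.next_id += 1
            self.by_hash[h] = bid
            self.refcount[bid] = 1
            self.blocks[bid] = data
            return bid

        def free(self, bid):
            """Run the usage counter down; reclaim the block when it hits zero."""
            self.refcount[bid] -= 1
            if self.refcount[bid] == 0:
                data = self.blocks.pop(bid)
                del self.by_hash[hashlib.sha256(data).hexdigest()]
                del self.refcount[bid]

    t = DedupTable()
    first = t.write(b"identical payload")
    second = t.write(b"identical payload")    # second write dedups onto the first block
    print(first == second, t.refcount[first]) # True 2

The interesting engineering is everything this sketch ignores: where the table lives, how it is cached, and how it interacts with the write path.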
While you may have a point on some data sets, actual testing of this type of data (28,000+ actual end-user doc files) using xdelta with 4k and 8k block sizes shows that the similar blocks in these files are in the 2% range (~6% for 4k). That means a full read of each file on average would require < 6% seeks to other disk areas. That is not bad, and it is the worst-case picture, as those duplicate blocks would need to live at the same offsets and have the same block boundaries to "match" under the proposed algo. To me this means Word docs are not a good candidate for dedup at the block level -- but the actual cost to dedup them anyway seems small. Of course you could come up with data that is pathologically bad for these benchmarks, but I do not believe it would be nearly as bad as you are making it out to be on real-world data.

> > For instance, I would imagine that in many scenarios much of the dedup data blocks would belong to the same or very similar files. In this case the blocks were written as best they could on the first write, so the deduped blocks would point to a pretty sequential line of blocks. Now on some files there may be duplicate headers or similar portions of data -- these may cause you to jump around the disk, but I do not know how much this would be hit or impact real-world usage.
> >
> > > Deduplication is going to require the judicious application of hallucinogens and man hours. I expect that someone is up to the task.
> >
> > I would prefer the coder(s) not be seeing "pink elephants" while writing this, but yes, it can and will be done. It (I believe) will be easier after the grow/shrink/evac code paths are in place, though. Also, the grow/shrink/evac path allows (if it is done right) for other cool things, like a base on which to build a roaming defrag that takes into account snaps, clones, live data and the like. I know that some feel that the grow/shrink/evac code is more important for home users, but I think that it is super important for most of these additional features.
> >
> > The elephants are just there to keep the coders company. There are tons of benefits for dedup, both for home and non-home users. I'm happy that it's going to be done. I expect the first complaints will come from those people who don't understand it, when their df and du numbers look different than their zpool status ones. Perhaps df/du will just have to be faked out for those folks, or we just apply the same hallucinogens to them instead.
>
> I'm still not convinced that dedup is really worth it for anything but very limited, constrained usage. Disk is just so cheap that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered.

If you can dedup 30% of your data, your disk just became 30% cheaper. Depending on the workflow, the cost of disk is the barrier -- not CPU cycles or write/read speed.

> This in many ways reminds me of last year's discussion over file versioning in the filesystem. It sounds like a cool idea, but it's not a generally good idea. I tend to think that this kind of problem is better served by applications handling it, if they are concerned about it.

Snapping a full filesystem for versions is expensive when you are dealing with one file changing. Doing dedup in ZFS is inexpensive versus a follow-the-writes queue.
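One more note on the xdelta numbers near the top of this reply: a delta tool matches runs at any offset, while fixed-offset block dedup only counts blocks that line up exactly, so even a one-byte shift defeats it. A toy demonstration of that boundary effect (hypothetical 4k block size, md5 as a stand-in checksum):

    import hashlib
    import os

    BLOCK = 4096  # hypothetical dedup block size

    def block_hashes(data, block=BLOCK):
        # Hash fixed-offset blocks the way a block-level dedup would see them.
        return {hashlib.md5(data[i:i + block]).hexdigest()
                for i in range(0, len(data), block)}

    original = os.urandom(64 * 1024)   # 64 KiB of stand-in file content
    shifted = b"x" + original          # same content, offset by a single byte

    common = block_hashes(original) & block_hashes(shifted)
    print("fixed-offset blocks in common:", len(common))   # almost certainly 0

That is why the 2-6% xdelta figures are an upper bound on what a same-boundary scheme would actually find.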
> Pretty much, here's what I've heard:
>
> Dedup Advantages:
>
> (1) Save space relative to the amount of duplication. This is highly dependent on workload, and ranges from 0% to 99%, but the distribution of possibilities isn't a bell curve (i.e. the average space saved isn't 50%).
>
> Dedup Disadvantages:
>
> (1) Increased codebase complexity, in both cases of dedup during write and ex-post-facto batched dedup.

Yes, but the code path is optional.

> (2) Noticeable write performance penalty (assuming block-level dedup on write), with potential write cache issues.

There is a cost, but smart use of hash lookups and caching should absorb most of it. Most of the cost comes from using a better hashing algo instead of fletcher2/4.

> (3) Very significant post-write dedup time, at least on the order of 'zfs scrub'. Also, during such a post-write scenario, it more or less takes the zpool out of usage.

Post-write, while not as bad as a separate dedup app, reduces the value of tying it to ZFS. It should be done inline.

> (4) If dedup is done at the block level, not at the file level, it kills read performance, effectively turning all dedup'd files from a sequential read into a random read. That is, block-level dedup drastically accelerates filesystem fragmentation.

Again, this is completely dependent on the implementation and the data sets. Looking at our real-world data on a 14TB user file store shows that most of the dedup that would happen (using 4, 8, 16 and 128k blocks) happens on files that are identical at the binary level; a small percentage of dedup happens on other data if a static block seek is used (no sliding delta window).

> (5) Something no one has talked about, but is of concern. By removing duplication, you increase the likelihood that loss of the "master" segment will corrupt many more files. Yes, ZFS has self-healing and such. But, particularly in the case where there is no ZFS pool redundancy (or pool-level redundancy has been compromised), loss of one block can thus be many times more severe.

I assume that no one has talked about that because it seems obvious. Your blocks become N times more "valuable", where N is the number of block pointers that reference that block for dedup. A lost block on ZFS can therefore affect N files + X snapshots + Y clones, or the entire filesystem if it was holding one of a few key ZFS structures.

> We need to think long and hard about what the real widespread benefits of dedup are before committing to a filesystem-level solution, rather than an application-level one. In particular, we need some real-world data on the actual level of duplication under a wide variety of circumstances.

There was already a post that shows how to exploit the ZFS block checksums to gather similar-block stats. An issue I have with that is that the ZFS default hashing is pretty collision-prone, so the data seems suspect. I can probably post the Perl scripts I used to gather data on my systems. The hash lookup tables that they generate are pretty damn huge, but the reporting part could display the relative info in a compact way for posting. Assumptions I made were fixed block seeks (slurping in the largest block of data on each read and acting on it as all block sizes in the test phase, to be efficient), and md5 match = bin match (pretty safe, but a real system would do a bit-level compare on a hash match).
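For anyone who wants to gather the same kind of numbers, here is a rough re-sketch in Python of that sort of fixed-block survey -- not the Perl scripts mentioned above, and the 8k block size is just one arbitrary pick -- using the same md5-match-equals-binary-match shortcut:

    #!/usr/bin/env python
    """Rough fixed-block duplicate survey (illustrative only; not the scripts
    mentioned above). A hash match is treated as a binary match, as described."""
    import hashlib
    import os
    import sys
    from collections import Counter

    BLOCK = 8192   # arbitrary block size to test; 4k/16k/128k are equally valid picks

    def survey(root, block=BLOCK):
        counts = Counter()   # block md5 -> number of times that block was seen
        total = 0
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while True:
                            chunk = f.read(block)
                            if not chunk:
                                break
                            counts[hashlib.md5(chunk).hexdigest()] += 1
                            total += 1
                except OSError:
                    continue   # unreadable file: skip it
        duplicates = total - len(counts)   # blocks that could collapse onto an existing copy
        return total, duplicates

    if __name__ == "__main__":
        total, duplicates = survey(sys.argv[1] if len(sys.argv) > 1 else ".")
        pct = 100.0 * duplicates / total if total else 0.0
        print("%d blocks scanned, %d duplicates (%.1f%% potentially reclaimable)"
              % (total, duplicates, pct))

Point it at a directory tree and it prints how many fixed-offset blocks it read and what fraction already existed elsewhere in the tree.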
-Wade

> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss