On Mon, Feb 11, 2013 at 03:50:10AM +0100, Benoît Canet wrote:
> As you can read, dedup keeps me awake at night.
>
> I still think that there is a need for a deduplication implementation
> that would perform nearly as fast as regular qcow2.
>
> I thought about this: http://en.wikipedia.org/wiki/Normal_distribution.
>
> Not all blocks are equal for deduplication.
> Some will deduplicate well and some won't.
>
> My idea would be to run a filter periodically on the in-RAM tree in order
> to drop the less performing and less promising blocks.
>
> The less performing blocks involved in a deduplication operation since the
> last run of the filter would be kept because they are promising, so they
> would survive and have a chance to climb among the top performers.
>
> The less performing blocks not involved in a deduplication operation since
> the last run of the filter would be definitively dropped from the HashNode
> tree since they are losers.
>
> The center of the bell curve would be kept since they are champions.
>
> This way this RAM-based implementation could offer speed while keeping its
> memory usage limited.
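
If I understand the idea correctly, the periodic filter could look roughly
like the following untested sketch. The HashNode fields, the thresholds, and
the flat array store are made up here for illustration; they are not the
actual qcow2-dedup patch code:

/* Illustrative sketch only -- field names, thresholds and the flat array
 * store are assumptions for this example, not the qcow2-dedup patches. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t hash[32];        /* content hash of the cluster (lookup omitted) */
    uint64_t dedup_count;    /* how many dedup hits this hash produced */
    uint64_t birth_gen;      /* filter generation when the node was inserted */
    uint64_t last_hit_gen;   /* filter generation of the most recent hit */
    bool in_use;
} HashNode;

#define MAX_NODES      1024  /* hard bound on in-RAM entries */
#define YOUNG_GENS     2     /* grace period for young hashes */
#define MIN_DEDUP_HITS 2     /* score below this counts as "low performing" */

static HashNode nodes[MAX_NODES];
static uint64_t current_gen;

/* Record a dedup hit against an existing node (hash lookup code omitted). */
static void hash_node_hit(HashNode *n)
{
    n->dedup_count++;
    n->last_hit_gen = current_gen;
}

/* Periodic filter: drop low-scoring nodes that were not hit since the last
 * run, unless they are still young enough to deserve a chance. */
static void filter_hash_nodes(void)
{
    for (int i = 0; i < MAX_NODES; i++) {
        HashNode *n = &nodes[i];
        if (!n->in_use) {
            continue;
        }
        bool young = current_gen - n->birth_gen < YOUNG_GENS;
        bool recently_hit = n->last_hit_gen == current_gen;
        bool low_score = n->dedup_count < MIN_DEDUP_HITS;

        if (low_score && !recently_hit && !young) {
            n->in_use = false;   /* loser: evict from the in-RAM store */
        }
        /* champions (high score) and promising nodes (recent hit or young)
         * survive until the next run */
    }
    current_gen++;
}

int main(void)
{
    nodes[0] = (HashNode){ .birth_gen = 0, .in_use = true }; /* never hit */
    nodes[1] = (HashNode){ .birth_gen = 0, .in_use = true }; /* hit each run */

    for (int gen = 0; gen < 4; gen++) {
        hash_node_hit(&nodes[1]);
        filter_hash_nodes();
    }
    printf("node 0: %s, node 1: %s\n",
           nodes[0].in_use ? "kept" : "evicted",
           nodes[1].in_use ? "kept" : "evicted");
    return 0;
}

The memory bound then comes from the fixed size of the store, and the grace
period gives freshly inserted hashes a few filter runs to prove themselves
before they can be evicted.
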
This means inline dedup is opportunistic and not guaranteed to catch every
dedup. There needs to be a trade-off between a hash's dedup score and its
age. Young hashes are allowed to stay for a while even with low dedup scores,
so they have a chance to accumulate dedups.

I still think a lookup data structure that spills to disk is better, but
perhaps you have data that shows it's reasonable to expect decent dedup rates
with the opportunistic approach?

Stefan