On 21 Jan 2010, at 22:55, Daniel Carosone wrote:

> On Thu, Jan 21, 2010 at 05:04:51PM +0100, erik.ableson wrote:
>
>> What I'm trying to get a handle on is how to estimate the memory
>> overhead required for dedup on that amount of storage.
>
> We'd all appreciate better visibility of this. This requires:
>  - time and observation and experience, and
>  - better observability tools and (probably) data exposed for them
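As a rough illustration of the kind of estimate being asked for here, something like the following back-of-napkin sketch can frame the numbers. The ~320 bytes of core per DDT entry and the 64k average block size are assumptions (figures along the lines of what gets quoted on this list, not measurements); zdb -DD against a real pool should give actual entry counts and sizes.

# Back-of-napkin DDT memory estimate (Python sketch).
# ASSUMED, not measured: ~320 bytes of core per DDT entry and a 64k
# average block size; compare against zdb -DD output for a real pool.

BYTES_PER_DDT_ENTRY = 320        # assumed in-core size of one dedup-table entry
AVG_BLOCK_SIZE = 64 * 1024       # assumed average block size

def ddt_memory_bytes(unique_data_bytes,
                     avg_block_size=AVG_BLOCK_SIZE,
                     entry_size=BYTES_PER_DDT_ENTRY):
    """RAM (or L2ARC) needed to keep the whole dedup table resident."""
    n_entries = unique_data_bytes // avg_block_size   # one entry per unique block
    return n_entries * entry_size

TB = 1024 ** 4
print(ddt_memory_bytes(10 * TB) / 1024 ** 3)   # ~50 GB for 10 TB of unique 64k blocks

With those assumed numbers, 10 TB of unique data at 64k blocks works out to roughly 50 GB just for the table, which is why the "does it all stay resident" question matters so much.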
I'd guess that since every written block is going to go and ask for the hash keys, this data should end up living in the ARC under the MFU ruleset. The theory being that if I can determine the maximum memory requirement for these keys, I'll know the minimum memory baseline required to guarantee that I won't be caught short.

>> So the question is how much memory or L2ARC would be necessary to
>> ensure that I'm never going back to disk to read out the hash keys.
>
> I think that's a wrong-goal for optimisation.
>
> For performance (rather than space) issues, I look at dedup as simply
> increasing the size of the working set, with a goal of reducing the
> amount of IO (avoided duplicate writes) in return.

True, but as a practical matter we've seen that overall performance drops off a cliff if you overstep your memory bounds and the system is obliged to go to disk to check a new block to be written against the hash keys. This is compounded by the fact that the ARC is already full, so the lookup goes straight to disk, further exacerbating the problem. It's this particular scenario that I'm trying to avoid, and from the business aspect of selling ZFS-based solutions (whether to a client or to an internal project) we need to be able to ensure that performance is predictable, with no surprises. Realizing, of course, that all of this is based on a slew of uncontrollable variables (size of the working set, IO profiles, ideal block sizes, etc.). The empirical approach of "give it lots and we'll see if we need to add an L2ARC later" is not really viable for many managers (despite the fact that the real world works like this).

> The trouble is that the hash function produces (we can assume) random
> hits across the DDT, so the working set depends on the amount of
> data and the rate of potentially dedupable writes as well as the
> actual dedup hit ratio.  A high rate of writes also means a large
> amount of data in ARC waiting to be written at the same time.  This
> makes analysis very hard (and pushes you very fast towards that very
> steep cliff, as we've all seen).

I don't think it would be random, since _any_ write operation on a deduplicated filesystem requires a hash check, forcing the hash keys to live in the MFU. However, I agree that a high write rate would put memory pressure on the ARC, which could result in the eviction of the hash keys. So the next factor to include in memory sizing is the maximum write rate (determined by IO availability). With a team of two GbE cards, I could conservatively say that I need to size for inbound write IO of 160 MB/s, worst case accumulated over the 30-second flush cycle, so say about 5 GB of memory (leaving aside ZIL issues, etc.); the arithmetic is sketched at the end of this mail. Noting that this is all very back-of-the-napkin estimation, and I also need to have some idea of what my physical storage is capable of ingesting, which could add to this value.

> I also think a threshold on the size of blocks to try deduping would
> help.  If I only dedup blocks (say) 64k and larger, i might well get
> most of the space benefit for much less overhead.

Well - since my primary use case is iSCSI presentation to VMware backed by zvols, and I can manually force the block size to 64k on volume creation, this reduces the unpredictability a little bit. That's based on the hypothesis that zvols use a fixed block size.
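For what it's worth, here is the write-rate arithmetic spelled out as a small sketch. Treating the teamed pair of GbE links as ~80 MB/s of usable throughput each (which is where a 160 MB/s figure comes from) and using the 30-second flush cycle are round-number assumptions, not measurements.

# Worst-case write accumulation between flushes (Python sketch).
# ASSUMED: ~80 MB/s usable per GbE link, two teamed links, 30s flush cycle.

USABLE_PER_GBE = 80 * 1024 ** 2    # assumed usable throughput per GbE link, bytes/s
N_LINKS = 2                        # teamed pair of GbE cards
FLUSH_INTERVAL = 30                # assumed seconds between flushes

inbound_rate = N_LINKS * USABLE_PER_GBE        # ~160 MB/s worst-case inbound writes
write_backlog = inbound_rate * FLUSH_INTERVAL  # bytes that may sit in ARC awaiting flush

print(write_backlog / 1024 ** 3)   # ~4.7 GB, i.e. the "about 5 GB" figure above

On this framing, that ~5 GB of in-flight write data would sit on top of whatever the hash table itself needs, so the two estimates add rather than overlap.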