On Tue, 21 Apr 2015, Russ Cox wrote:

> My paper with Sean Rhea and Alex Pesterev documents the performance
> effect of double-checking the equality in some detail.
> https://www.usenix.org/legacy/event/usenix08/tech/full_papers/rhea/rhea.pdf
Very nice paper, especially chapter 3. Reading it, my original question arises again:

****
Foundation’s CAS layer is modeled on the Venti [34] content-addressed storage server, but we have adapted the Venti algorithms for use in a single-disk system and also optionally eliminated the assumption that SHA-1 is free of collisions, producing two operating modes for Foundation: compare-by-hash and compare-by-value. (page 4)
****

And the question is answered in the paper:

****
While we originally investigated this mode [compare-by-value] due to (in our opinion, unfounded) concerns about cryptographic hash collisions (see [5, 16] for a lively debate), we were surprised to find that its overall write performance was close to that of compare-by-hash mode, despite the added comparisons. Moreover, compare-by-value is always faster for reads, as naming blocks by their log offsets completely eliminates index lookups during reads. (page 7)
****

> (Caveat: in the usual academic tradition, the paper uses "Venti" to mean
> the system described in the original paper, not the system in Plan 9 today.
> The current Plan 9 implementation is much closer to what the paper calls
> "Foundation: Compare by Hash".)

New question: if I compile it with "int verifywrites = 1", is it closer to "Compare-by-Value"? I mean, with the log offset used as the block's handle. (A toy sketch of the difference between the two write paths, as I understand it, is below, after my signature.)

> Hope this helps.

I wanted to know whether the optional compile with the full check was added in consideration of people who have concerns about the (in)correctness of compare-by-hash. One can disagree about the risk of using compare-by-hash, but one cannot disagree with the fact that people disagree. :)

I think everyone should decide for themselves whether and where to use compare-by-hash. In some applications I would even accept much more risk than compare-by-hash carries. And I find the experiments with hash functions, including compare-by-hash, interesting. I appreciate that the option of compare-by-value is there. Documentation about where compare-by-hash is used is important, so that people may decide for themselves.

Also interesting would be the possibility of easily changing the hash function (the second sketch below shows the kind of interface I mean). As you note in the paper, this is important in compare-by-value mode for increasing performance.

The problem I have with compare-by-hash is not only the probability of hash collisions, but that it seems to rely on empirical knowledge about the hash function in use. People used to analytical arguments may find empirical arguments and empirical programming gruesome. If that empirical knowledge changes, if one discovers that the hash function in use does not distribute its domain homogeneously enough over its range, then one will want (especially in the case of compare-by-hash) to replace it with a better one. Trial and error is the empirical method of solving (and making) problems.

Rodrigo.
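
P.S. To make the question concrete, here is a toy sketch of the two write paths as I read them in the paper. It is not Venti's or Foundation's actual code: the in-memory "log" and "index", the tiny block size, the FNV stand-in for SHA-1, and the names writehash/writevalue/lookup/append are all my own simplifications.

#include <stdio.h>
#include <string.h>

enum {
	Blocksize = 8,		/* absurdly small, just for the demo */
	Scoresize = 4,		/* stands in for SHA-1's 20 bytes */
	Maxblocks = 128,
};

static unsigned char blocklog[Maxblocks][Blocksize];	/* append-only log */
static unsigned char scores[Maxblocks][Scoresize];	/* score -> offset index */
static int nblocks;

/* stand-in hash (FNV-1a); a real system would use SHA-1 here */
static void
score(unsigned char *d, unsigned char *s)
{
	unsigned h = 2166136261u;
	int i;

	for(i = 0; i < Blocksize; i++)
		h = (h ^ d[i]) * 16777619u;
	memcpy(s, &h, Scoresize);
}

static int
lookup(unsigned char *s)	/* log offset of score, or -1 */
{
	int i;

	for(i = 0; i < nblocks; i++)
		if(memcmp(scores[i], s, Scoresize) == 0)
			return i;
	return -1;
}

static int
append(unsigned char *d, unsigned char *s)
{
	memcpy(blocklog[nblocks], d, Blocksize);
	memcpy(scores[nblocks], s, Scoresize);
	return nblocks++;
}

/* compare-by-hash: an index hit alone proves the block is stored */
int
writehash(unsigned char *d)
{
	unsigned char s[Scoresize];
	int off;

	score(d, s);
	off = lookup(s);
	if(off >= 0)
		return off;		/* dedup: trust the hash */
	return append(d, s);
}

/* compare-by-value: on an index hit, compare the stored bytes too */
int
writevalue(unsigned char *d)
{
	unsigned char s[Scoresize];
	int off;

	score(d, s);
	off = lookup(s);
	if(off >= 0 && memcmp(blocklog[off], d, Blocksize) == 0)
		return off;		/* the bytes really are identical */
	return append(d, s);	/* miss, or a hash collision */
}

int
main(void)
{
	unsigned char a[Blocksize] = "block-a";

	/* the same block written twice gets the same offset back */
	printf("%d %d\n", writevalue(a), writevalue(a));
	return 0;
}

The point of the offset-as-handle question is visible here: in compare-by-value mode the offset returned by the write is the block's name, so a read is a single log access with no index lookup at all, which (if I read the paper right) is where its read speedup comes from.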
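P.P.S. And here is the kind of pluggable-hash interface I was wishing for above. As far as I know, neither Venti nor Foundation exposes anything like this; the struct, the table, and every name in it are hypothetical.

#include <stdio.h>
#include <string.h>

/* hypothetical pluggable-hash table; all names here are made up */
typedef struct Hash Hash;
struct Hash {
	char	*name;
	int	scoresize;	/* bytes in a score */
	void	(*sum)(unsigned char *d, int n, unsigned char *s);
};

/* toy stand-in; a real table would point at SHA-1, SHA-256, ... */
static void
fnv32(unsigned char *d, int n, unsigned char *s)
{
	unsigned h = 2166136261u;
	int i;

	for(i = 0; i < n; i++)
		h = (h ^ d[i]) * 16777619u;
	memcpy(s, &h, 4);
}

static Hash hashes[] = {
	{ "fnv32", 4, fnv32 },
	/* { "sha1", 20, sha1sum }, { "sha256", 32, sha256sum }, ... */
};

static Hash*
lookuphash(char *name)
{
	int i;

	for(i = 0; i < (int)(sizeof hashes / sizeof hashes[0]); i++)
		if(strcmp(hashes[i].name, name) == 0)
			return &hashes[i];
	return 0;
}

int
main(void)
{
	unsigned char s[32];
	Hash *h = lookuphash("fnv32");	/* could come from a config file */

	h->sum((unsigned char*)"hello", 5, s);
	printf("%s: %02x%02x%02x%02x\n", h->name, s[0], s[1], s[2], s[3]);
	return 0;
}

Swapping the function pointer is of course the easy part; the hard part in a real system is that the score length tends to be baked into the on-disk index and block formats.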