On Tue, 21 Apr 2015, Russ Cox wrote:

> My paper with Sean Rhea and Alex Pesterev documents the performance 
> effect of double-checking the equality in some detail.
> https://www.usenix.org/legacy/event/usenix08/tech/full_papers/rhea/rhea.pdf

Very nice paper, especially from chapter 3 onwards.

Reading it, my original question arises again:

****
  Foundation’s CAS layer is modeled on the Venti [34]
  content-addressed storage server, but we have adapted the
  Venti algorithms for use in a single-disk system and also
  optionally eliminated the assumption that SHA-1 is free
  of collisions, producing two operating modes for Foundation:
  compare-by-hash and compare-by-value. (page 4)
****


And the question is answered in the paper:

****
While we originally investigated this mode [Compare-by-value] 
due to (in our opinion, unfounded) concerns about cryptographic
hash collisions (see [5, 16] for a lively debate), we
were surprised to find that its overall write performance
was close to that of compare-by-hash mode, despite the
added comparisons. Moreover, compare-by-value is always
faster for reads, as naming blocks by their log offsets
completely eliminates index lookups during reads. (page 7)
****


> (Caveat: in the usual academic tradition, the paper uses "Venti" to mean
> the system described in the original paper, not the system in Plan 9 today.
> The current Plan 9 implementation is much closer to what the paper calls 
> "Foundation: Compare by Hash".)

New question: if I compile it with "int verifywrites = 1", is it then
closer to "compare-by-value"?

I mean using the log offset as the block handle.
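
To make the difference I am asking about concrete, here is a toy,
self-contained sketch of the two write paths as I understand them from
the paper. This is my own illustration, not venti or Foundation code;
every name in it (weakhash, scoreidx, write_by_hash, ...) is invented,
and the 8-bit "hash" is deliberately weak so that a collision actually
occurs:

/*
 * Toy illustration, NOT venti/Foundation code: an in-memory
 * append-only log plus a score index, with a deliberately weak
 * 8-bit "hash" so that a collision actually happens.
 */
#include <stdio.h>
#include <string.h>

enum { BLK = 8, MAXBLOCKS = 64 };

static unsigned char blklog[MAXBLOCKS][BLK];	/* the arena */
static int nlog;
static int scoreidx[256];	/* weak score -> log offset, -1 = empty */

static unsigned
weakhash(const unsigned char *b)
{
	unsigned h, i;

	h = 0;
	for(i = 0; i < BLK; i++)
		h += b[i];
	return h & 0xff;
}

static int
append(const unsigned char *b)
{
	memcpy(blklog[nlog], b, BLK);
	return nlog++;
}

/* compare-by-hash: an index hit alone is treated as equality */
static int
write_by_hash(const unsigned char *b)
{
	unsigned h = weakhash(b);

	if(scoreidx[h] >= 0)
		return scoreidx[h];	/* trusted, never verified */
	return scoreidx[h] = append(b);
}

/*
 * compare-by-value: an index hit is verified by reading the old
 * block back and comparing; a mismatch (collision) appends a new
 * block. The offset is the handle, so reads never need the index.
 */
static int
write_by_value(const unsigned char *b)
{
	unsigned h = weakhash(b);
	int o = scoreidx[h];

	if(o >= 0 && memcmp(blklog[o], b, BLK) == 0)
		return o;	/* verified duplicate */
	return scoreidx[h] = append(b);
}

int
main(void)
{
	/* two different blocks with the same weak score (sum = 36) */
	unsigned char a[BLK] = {1,2,3,4,5,6,7,8};
	unsigned char c[BLK] = {8,7,6,5,4,3,2,1};
	int oa, oc;

	memset(scoreidx, -1, sizeof scoreidx);
	oa = write_by_hash(a);
	oc = write_by_hash(c);
	printf("by-hash:  a->%d c->%d (collision silently deduplicated)\n", oa, oc);

	nlog = 0;
	memset(scoreidx, -1, sizeof scoreidx);
	oa = write_by_value(a);
	oc = write_by_value(c);
	printf("by-value: a->%d c->%d (collision caught, distinct blocks)\n", oa, oc);
	return 0;
}

Note that in the by-value path the returned offset doubles as the
block handle, so a read is just blklog[o] and never touches the index;
that matches the paper's explanation of why compare-by-value reads
are always faster.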


> Hope this helps.

I wanted to know whether the optional compilation with the full check
was added in consideration of people who have concerns about the
(in)correctness of compare-by-hash. One can disagree about the risk of
using compare-by-hash, but one cannot disagree with the fact that
people disagree. :)

I think everyone should decide for themselves whether and where to use
"compare-by-hash". In some applications I would even accept much more
risk than compare-by-hash entails. And I find the experiments with
hash functions, including compare-by-hash, interesting.

I appreciate that the option of "compare-by-value" is there.
Documentation about where "compare-by-hash" is used is important so
that people can decide for themselves.

It would also be interesting to be able to change the hash function
easily. As you note in the paper, this is important for increasing
performance in "compare-by-value" mode.
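
Just as a sketch of what I mean, and not the Foundation API (all names
here are invented): if the store is written against a small hash
descriptor, selecting a cheaper or a stronger function becomes a
one-line change instead of a recompile.

/*
 * Hypothetical pluggable-hash sketch, not the Foundation API.
 * The store calls through a small descriptor, so the function
 * can be chosen in one place (or from a config string).
 */
#include <stdio.h>
#include <string.h>

typedef struct Hash Hash;
struct Hash {
	char *name;
	int scorelen;	/* digest bytes */
	void (*score)(const unsigned char*, int, unsigned char*);
};

/* stand-in scorers; a real store would plug in SHA-1, SHA-256, ... */
static void
sum8(const unsigned char *d, int n, unsigned char *out)
{
	unsigned h = 0;
	int i;

	for(i = 0; i < n; i++)
		h += d[i];
	out[0] = h & 0xff;
}

static void
xor8(const unsigned char *d, int n, unsigned char *out)
{
	unsigned h = 0;
	int i;

	for(i = 0; i < n; i++)
		h ^= d[i];
	out[0] = h & 0xff;
}

static Hash hashes[] = {
	{ "sum8", 1, sum8 },
	{ "xor8", 1, xor8 },
};

int
main(void)
{
	unsigned char data[] = "some block", out[1];
	int i;

	for(i = 0; i < 2; i++){
		hashes[i].score(data, sizeof data - 1, out);
		printf("%s: %02x\n", hashes[i].name, out[0]);
	}
	return 0;
}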

The problem I have with "compare-by-hash" is not only the probability
of hash collisions, but that it seems to rely on empirical knowledge
about the hash function in use. People used to analytical arguments
may find empirical arguments and empirical programming gruesome. If
the empirical knowledge changes, if one discovers that the hash
function does not distribute its domain homogeneously enough over its
range, then one will want (especially in the case of compare-by-hash)
to replace the hash function with a better one. Trial and error is
the empirical method of solving (and creating) problems.
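
For comparison, the analytical side of the argument is the birthday
bound: if one assumes the hash behaves like a uniform random function
with a b-bit output, the collision probability over n blocks is about
1 - exp(-n*(n-1)/2^(b+1)). The little calculation below (my own, with
SHA-1's 160 bits assumed) shows how small that is; but the point of
the paragraph above is precisely that the formula is only as
trustworthy as the uniformity assumption.

/*
 * Birthday-bound estimate under the uniformity assumption:
 *   p(collision) ~= 1 - exp(-n*(n-1) / 2^(b+1))
 * computed in log space so the huge exponents don't overflow.
 * Compile with -lm.
 */
#include <stdio.h>
#include <math.h>

int
main(void)
{
	double bits = 160.0;	/* SHA-1 digest size */
	double n[] = { 1e9, 1e12, 1e15, 1e18 };
	int i;

	for(i = 0; i < 4; i++){
		double lg = 2.0*log2(n[i]) - (bits + 1.0);	/* log2 of n^2/2^(b+1) */
		double p = -expm1(-exp2(lg));	/* 1 - exp(-x) for tiny x */
		printf("n = %.0e blocks: p ~= %.3e\n", n[i], p);
	}
	return 0;
}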

Rodrigo.

