On 20.05.2012 17:41, Justin Erenkrantz wrote:
On Sun, May 20, 2012 at 4:37 AM, Stefan Fuhrmann
<[email protected]>  wrote:
Directory deltification making wordpress.org
go from 400+GB to 10GB *is* a reason.
Without stable hashes, we would need special
code for hash deltification.
Having a stable hash function sure doesn't seem like this would
account for that reduction.  Can you please elaborate?

Subversion up to and including 1.7 will serialize directories
as string->string hashes in FSFS. wordpress.org uses projects
as the top-level of its repository (just like Apache). So, every
commit writes a new version of that. At >26k projects, that's
>1.4MB per revision.

In 1.8, one may activate directory deltification. After serialization,
the resulting text will be deltified just like any other node and
the result will be zip-compressed. Many revisions are now about 2KB.
However, that hinges on successive versions of the directories
producing nearly identical serialized text. Even a random "shift" by
a larger number of entries will leave no 64 byte matches (our xdelta
granularity) within the 100k text windows used by xdelta.
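
To make the windowing effect concrete, here is a rough Python sketch. It is not Subversion's delta code: the entry format, the project names and the exact window size of 100,000 bytes are assumptions made for illustration, and the model simply assumes that a 64 byte block can only be matched against the same window of the previous version.

```python
BLOCK = 64         # xdelta match granularity mentioned above
WINDOW = 100_000   # "100k" window; the exact byte count is an assumption

def serialize(entries, order):
    # Hypothetical stand-in for a serialized directory: one line per entry.
    return "".join(f"{name} {entries[name]}\n" for name in order).encode()

def matched_fraction(base, target):
    """Fraction of 64 byte target blocks found in the same window of base."""
    matched = total = 0
    for start in range(0, len(target), WINDOW):
        base_win = base[start:start + WINDOW]
        target_win = target[start:start + WINDOW]
        # Every 64 byte substring of the base window is a possible match.
        known = {base_win[i:i + BLOCK]
                 for i in range(len(base_win) - BLOCK + 1)}
        for i in range(0, len(target_win) - BLOCK + 1, BLOCK):
            total += 1
            matched += target_win[i:i + BLOCK] in known
    return matched / total

# Made-up repository root with 10,000 project directories.
names = [f"project{i:05d}" for i in range(10_000)]
old = {name: f"dir {i}.0.r1" for i, name in enumerate(names)}
new = dict(old, project00042="dir 42.0.r2")   # one commit touches one project

base = serialize(old, names)
stable = serialize(new, names)                          # same entry order
shifted = serialize(new, names[4000:] + names[:4000])   # shifted by 4000 entries

stable_ratio = matched_fraction(base, stable)
shifted_ratio = matched_fraction(base, shifted)
print(f"stable order:  {stable_ratio:.4f}")
print(f"shifted order: {shifted_ratio:.4f}")
```

With a stable order almost every block finds a match, so the delta stays tiny. With the shifted order each entry lands in a different window than before, and in this model nothing matches at all.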
Again, these are my reasons for using svn_hash__make:

* consistent behavior of SVN across different APR versions
* give devs time to check all the 500+ places that create
   hashes throughout SVN for implicit assumptions on
   ordering and such.
* performance improvement; particularly with directory-
   or property-related operations
I don't believe the first two matter in any tangible way.

Well, I am a developer and reproducibility between test runs
*does* matter to me.

On a more general note: We don't use hashes as a means to
randomize our data. For us, they are simply containers with
an average O(1) insertion and lookup behavior. The APR interface
also allows for iterating that container - so it *has* an ordering
and it has been quite stable under different operations and
over many versions of APR.

The change in APR 1.4.6 did *not* solve the fundamental performance
problem but it makes our life harder - at least for a while.
If we want a reproducible UI behavior, we must now eliminate
the use of hashes in all relevant output functions and replace
them with e.g. sorted arrays. That may take some time.
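
As a minimal sketch of the "sorted array" idea (in Python rather than C, with a made-up format_listing helper and made-up entries): the output function sorts the entries itself, so the result no longer depends on the iteration order of the container. Python dicts iterate in insertion order, which stands in here for a hash table whose order may differ between runs or library versions.

```python
def format_listing(entries):
    """Return output lines in name order, whatever order the container has."""
    # sorted() builds a list of (name, revision) pairs ordered by name.
    return [f"{rev:>7} {name}" for name, rev in sorted(entries.items())]

# The same directory, filled in two different orders.
first_run = {"trunk": 1200, "branches": 1187, "tags": 1150}
second_run = {"tags": 1150, "trunk": 1200, "branches": 1187}

listing_a = format_listing(first_run)
listing_b = format_listing(second_run)
for line in listing_a:
    print(line)
```

The cost is an O(n log n) sort per listing, paid only in the output functions; lookups elsewhere keep their average O(1) behavior.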
And, the third point doesn't make any sense to me without a further
explanation.  -- justin

When we e.g. do an "svn ls -v" (TSVN repo browser), we will
create and fill the revprop hash for the respective revision
multiple times for each entry in the directory - just to produce
a few bytes of output. The hash function showed up in profiles
with 10% of the total runtime.

So, I tuned that. Because apr_hash_t is so popular in our code,
that very localized improvement will give (small) returns in
improved performance all over the place.

-- Stefan^2.
