On 12.02.2012 06:27, Branko Čibej wrote:
On 12.02.2012 02:52, Stefan Fuhrmann wrote:
The silly part of FSFS is that it does not optimize access
paths, yet, but stores changes individually. The challenge
is our two-dimensional key space and the fact that different
operations traverse the data along different dimensions
(e.g. log ./. checkout).
Interestingly enough, the 2D keyspace isn't that big a problem.
On the physical level, it becomes a problem as files
model a linear data stream, i.e. you have to find a
mapping onto a 1-dimensional structure that keeps
related / dependent information closely together.
The real
issue is that we don't even represent all the relevant keys, because of
the lazy copying of subtrees. That's what actually prevents us from
doing one-shot lookups of arbitrary path@rev; and even then, we'd only
really have to do a step-by-step top-down resolve if the initial fast
lookup failed.
How do you verify that path@rev even exists?
For instance, you could maintain an index of (only)
every path ever touched. As long as we want to keep
our O(1) delete operations for paths, entries in that
index cannot be updated / removed. So, we need
to always verify that there is actually a path from /@rev
to path@rev.
Question: how many entries would a direct lookup structure
need to have (i.e. path@rev -> data pointer)? Keep in
mind that may valid paths like /branches/foo/bar will never
be mentioned anywhere in a SVN repository because they
never got touched under that name. A rough estimation for
a fully expanded list of entries is
#nodesInTrunk * #openBranches * #revisions
This yields 10^9 entries for small repositories and>10^14
for KDE-sized ones. Clearly impractical.
So that's not how you do it. :)
You'd only need one reference per actual node, not per possible node
lookup paths including revisions. Obviously you have to resolve path@rev
to a concrete node before you can do anything with its attributes. With
the caveat that there's a nasty edge case involving not-yet-lazy-copied
child nodes; but given the way we currently crawl the tree, that's not
really an issue.
We simply follow the tree, i.e. one lookup per parent.
With any luck most of the intermediate nodes have
already been cached, which is exactly what makes our
differential trees so efficient: as long as a sub-tree is
not being modified, all copies and future revisions
share the *same* representation.
But maybe we are simply talking past each other here.
So, could you give a short summary of what you consider
wrong / inefficient about SVN / FSFS and how you would
like to see this addressed.
-- Stefan^2.