On 12.02.2012 17:58, Daniel Shahaf wrote: > Stefan Fuhrmann wrote on Sun, Feb 12, 2012 at 13:57:23 +0100: >> On 12.02.2012 06:27, Branko Čibej wrote: >>> On 12.02.2012 02:52, Stefan Fuhrmann wrote: >>> >>>> The silly part of FSFS is that it does not optimize access >>>> paths, yet, but stores changes individually. The challenge >>>> is our two-dimensional key space and the fact that different >>>> operations traverse the data along different dimensions >>>> (e.g. log ./. checkout). >>> Interestingly enough, the 2D keyspace isn't that big a problem. >> On the physical level, it becomes a problem as files >> model a linear data stream, i.e. you have to find a >> mapping onto a 1-dimensional structure that keeps >> related / dependent information closely together. >> > So perhaps solve _that_ problem?
Mercurial has solved that problem ages ago. Their revlog files are append-only, and their versioned tree info is stored in a revlog file. Mercurial also happens to /not/ do lazy copying of the tree, however, which simplifies the index enormously. Instead, the structure of the directory index is designed to make even whole-tree change representations reasonably compact. (This is where one could write an essay titled "lazy copying considered harmful" :) The trouble about Subversion's representation of the versioned tree is that it considers lazy copying to apply to whole nodes, and on the logical (though not physical) level treats the node's name as an inherent propery of said node, at least as far as lazy copying is concerned, so we don't have a complete path@rev index for relevant revisions. The (invalid) assumption was that if we did have concrete index entries for each path@rev, the index would grow too quickly and become unmanageable. And the result is that, instead, you have to jump through hoops every time you resolve a path@rev, and even bigger hoops when you modify a child of a copied node, since you have to figure out how to populate the directory index every time you do that instead of doing it once during the initial copy. I guess what I'm proposing is that we decouple the path@rev index from the actual file contents data; currently directory information is stored in the directory node's contents. Instead, we should keep the path@rev->node index completely separate. That way we would probably end up not having directory nodes at all, since the only interesting bit of information that a directory node would contain are the properties, and if we want to implement any kind of efficient inheritable property scheme, those props would have to become indexed separately, if not independently, of the node contents. To illustrate: Currently our tree structure looks like this: +------------+ | node (dir) | +--+---------+--+ +-------------+ | contents-> | ---> | file->, ... | | props | +-------------+ +------------+ | v +-------------+ | node (file) | +--+----------+-+ +----------------+ | contents-> | ---> | representation | | props | +----------------+ +------------+ What I'm proposing would essentially look like this: Index Contents +--------------+------------+ +---------------+ | dir@rev | props-> | -+-> | property, ... | +--------------+------------+ | +---------------+ | dir/file@rev | props-> | / +----------------+ | | contents-> | ---> | representation | +--------------+------------+ +----------------+ (That this example shows one possible representation of inheritable properties, although it's a bit too simplistic to be useful in an actual implementation.) The idea is that we'd always maintain the complete index, i.e., in order to determine if path@15 exists, one only needs to search the index for (path, rev <= 15, !deleted) -- which is trivial in a properly ordered index. (Yes, this assumes that we record deletions in the index as well as insertions.) All of this can be done with an append-only index representation. What you can't do with append-only is represent forward history links, but -- we're not representing them very well right now, are we. :) The other issue with an append-only index is that you essentially have to reconstruct the whole index before you can do a lookup. A much better structure for representing such (compressed, ordered) indexes has been known for a while now, it's called a B-tree. ... -- Brane