Re-reading this to see how it affects the fs-successor-ids work (in particular, whether it supersedes the done/planned work), one question:
Why do you refer to skip-deltas? Consider the following transformation to a repository: for noderev in FS: noderev.props['svn:contents'] = noderev.cat() noderev.contents = "" This transformation does not change the successor relations, but it does change the skip-deltas mapping. Would your algorithms for successors/copyto work correctly on a repository that has undergone the above transformation? Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200: > On 29.08.2011 18:35, Stefan Sperling wrote: > > Sorry for the late response. I have been knocked > out for a while. > >On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote: > >>>See > >>>http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README > >>>for what this is all about. > >>But the assumptions in that file are actually not valid. > >Which ones are invalid? Can you explain in detail? > It makes at least 5 implicit assumptions: > > * Noderevs are the relevant level on which user-level > questions can be answered. > -> That might not be the case. Noderevs are a purely > internal construct. In particular, a noderev might get > copied but the actual delta is empty. Thinking about > svn obliterate & friends that might actually happen > as a result of some node removal / replacement. > > * User-level queries (where got node copied to?) and > internal queries (what are the copy dependencies?) > are looking for the same information. > -> This is linked to the previous. Noderev copy-to info > can be useful to determine the dependency graph. > The point here is that we are mixing queries on two > different layers here. If we need both, they should > both exist as separate APIs. > > * Noderevs changes are relevant for all svn obliterate. > -> That may not always be true. If you only want to > remove "unused" history, the references in the skip- > delta tree express the relevant dependencies. > > * Noderev changes are the sufficient to answer the > "where has this been copied to?" question. > -> They are quite unhelpful here. Noderevs might > help find you changes on the node itself (see first > assumption). But you need to evaluate the user- > level copy operations of all parent folders for all > currently known copies. There are far less of these > copies (e.g. branch creation) than noderev changes > (typ. multiple per revision). > > * A noderev-based cache alone would reduce the > amount of data to inspect / process for these queries. > -> The noderev cache is *larger* than e.g. a log cache > due to each change also modifying all parent nodes. > To be efficient, you need a node-level copy-to cache > as well (see previous item). > > > >>* "Where did path P at rev N to M directly or indirectly > >> copied to, e.g. which releases contain a certain faulty > >> code segment; optionally list changes to targets?" > >>-> needs to scan parent paths for copies, too > >> (reversed "log", revision graph) > >Yes, the successor-id cache only gives us operation roots. > >Information for child nodes needs to be derived -- it is not > >within the scope of the cache itself. > That is unnecessarily complicated / expensive. > Using the proposed log cache, the relevant > information can be found in a *single*, highly > efficient scan. > > > >>It turns out that we can produce even humongous > >>reverse logs (50+ k nodes, thousands of branches > >>and tags) within a second by simply performing > >>a full history scan. > >> > >>A example of how the whole process can be > >>implemented efficiently, can be found here: > >> > >>https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp > >I'll take a look at that, thanks! > The gist of it is that you keep all currently known > copy targets (e.g. branches, tags) in a tree. For > each revision, you walk that tree to find the affected > sub-tree. The LRU path gets optimized to speed up > the next look-up. > > A usually fixed repo layout plus typical change > patterns result in a quasi-constant processing > time for each revision - independent from the > number of copies already found. > > Things become much simpler and faster if you > eliminate string operations and use path IDs > instead. TSVN does an "svn log" at over 10M > revs per second. > >>>Storage of successor IDs is essentially a cache of the result of the > >>>inversion of the predecessor relation between all node-revisions. > >>>This inversion is a *very* expensive operation (follow every node > >>>that ever existed in the repository from its most recent revision > >>>back to the revision it was created in). > >>Not true. At least the "*very* expensive" part. > >>Log -v takes about 1min for AFS over ra_local. > >>Building any of the index data described below > >>can be done in< 10s. > >Any of it (if so, which parts?), or all of it? > All of it, of course. The expensive part is "svn log -v" > which may be expensive on slow disks. Everything > else is just writing a few MB of data. > >>I propose a modified version of TSVN's log cache > >>data structure. The adaptations: > >> > >>* split into multiple files to reduce modification overhead > >>* remove rev-prop-dependent information (log comment, > >> author, timestamp) > >>* add the reverse copy info > >>* simplify the data structures > >This looks very interesting. > > > >What about FSFS-specific requirements? > See assumptions above, this may require a different > data structure. But I think that noderev dependencies > will turn out to be redundant if you have a log cache > and access to the skip-delta forwards dependencies. > >It sounds like you avoid those by storing data in semantics of the repos > >layer (path@revision) instead of the FS layer (node-revision-id)? > Yes. > >In this case separate implementations for FSFS and BDB aren't needed. > >This could be an advantage (e.g. third party FS implementations > >wouldn't need to change to support this). > It also eliminates on of the performance weaknesses > of SVN today: A log on some old / seldom changed > path can take a very long time. > > > >I'll think about this some more, thanks. > > > Welcome ;) > > -- Stefan^2.