Re: server-side log cache

Daniel Shahaf Tue, 20 Sep 2011 14:41:35 -0700

Re-reading this to see how it affects the fs-successor-ids work (in
particular, whether it supersedes the done/planned work), one question:



Why do you refer to skip-deltas?


Consider the following transformation to a repository:

  for noderev in FS:
    noderev.props['svn:contents'] = noderev.cat()
    noderev.contents = ""

This transformation does not change the successor relations, but it does
change the skip-deltas mapping.  Would your algorithms for
successors/copyto work correctly on a repository that has undergone the
above transformation?



Stefan Fuhrmann wrote on Mon, Sep 19, 2011 at 10:42:55 +0200:
> On 29.08.2011 18:35, Stefan Sperling wrote:
> 
> Sorry for the late response. I have been knocked
> out for a while.
> >On Sun, Aug 28, 2011 at 03:46:03PM +0200, Stefan Fuhrmann wrote:
> >>>See 
> >>>http://svn.apache.org/repos/asf/subversion/branches/fs-successor-ids/BRANCH-README
> >>>for what this is all about.
> >>But the assumptions in that file are actually not valid.
> >Which ones are invalid? Can you explain in detail?
> It makes at least 5 implicit assumptions:
> 
> * Noderevs are the relevant level on which user-level
>   questions can be answered.
> -> That might not be the case. Noderevs are a purely
>    internal construct. In particular, a noderev might get
>    copied but the actual delta is empty. Thinking about
>    svn obliterate & friends that might actually happen
>    as a result of some node removal / replacement.
> 
> * User-level queries (where got node copied to?) and
>   internal queries (what are the copy dependencies?)
>   are looking for the same information.
> -> This is linked to the previous. Noderev copy-to info
>    can be useful to determine the dependency graph.
>    The point here is that we are mixing queries on two
>    different layers here. If we need both, they should
>    both exist as separate APIs.
> 
> * Noderevs changes are relevant for all svn obliterate.
> -> That may not always be true. If you only want to
>    remove "unused" history, the references in the skip-
>    delta tree express the relevant dependencies.
> 
> * Noderev changes are the sufficient to answer the
>   "where has this been copied to?" question.
> -> They are quite unhelpful here. Noderevs might
>    help find you changes on the node itself (see first
>    assumption). But you need to evaluate the user-
>    level copy operations of all parent folders for all
>    currently known copies. There are far less of these
>    copies (e.g. branch creation) than noderev changes
>    (typ. multiple per revision).
> 
> * A noderev-based cache alone would reduce the
>   amount of data to inspect / process for these queries.
> -> The noderev cache is *larger* than e.g. a log cache
>    due to each change also modifying all parent nodes.
>    To be efficient, you need a node-level copy-to cache
>    as well (see previous item).
> >
> >>* "Where did path P at rev N to M directly or indirectly
> >>  copied to, e.g. which releases contain a certain faulty
> >>  code segment; optionally list changes to targets?"
> >>->  needs to scan parent paths for copies, too
> >>   (reversed "log", revision graph)
> >Yes, the successor-id cache only gives us operation roots.
> >Information for child nodes needs to be derived -- it is not
> >within the scope of the cache itself.
> That is unnecessarily complicated / expensive.
> Using the proposed log cache, the relevant
> information can be found in a *single*, highly
> efficient scan.
> >
> >>It turns out that we can produce even humongous
> >>reverse logs (50+ k nodes, thousands of branches
> >>and tags) within a second by simply performing
> >>a full history scan.
> >>
> >>A example of how the whole process can be
> >>implemented efficiently, can be found here:
> >>
> >>https://tortoisesvn.googlecode.com/svn/trunk/src/TortoiseProc/RevisionGraph/FullGraphBuilder.cpp
> >I'll take a look at that, thanks!
> The gist of it is that you keep all currently known
> copy targets (e.g. branches, tags) in a tree. For
> each revision, you walk that tree to find the affected
> sub-tree. The LRU path gets optimized to speed up
> the next look-up.
> 
> A usually fixed repo layout plus typical change
> patterns result in a quasi-constant processing
> time for each revision - independent from the
> number of copies already found.
> 
> Things become much simpler and faster if you
> eliminate string operations and use path IDs
> instead. TSVN does an "svn log" at over 10M
> revs per second.
> >>>Storage of successor IDs is essentially a cache of the result of the
> >>>inversion of the predecessor relation between all node-revisions.
> >>>This inversion is a *very* expensive operation (follow every node
> >>>that ever existed in the repository from its most recent revision
> >>>back to the revision it was created in).
> >>Not true. At least the "*very* expensive" part.
> >>Log -v takes about 1min for AFS over ra_local.
> >>Building any of the index data described below
> >>can be done in<  10s.
> >Any of it (if so, which parts?), or all of it?
> All of it, of course. The expensive part is "svn log -v"
> which may be expensive on slow disks. Everything
> else is just writing a few MB of data.
> >>I propose a modified version of TSVN's log cache
> >>data structure. The adaptations:
> >>
> >>* split into multiple files to reduce modification overhead
> >>* remove rev-prop-dependent information (log comment,
> >>   author, timestamp)
> >>* add the reverse copy info
> >>* simplify the data structures
> >This looks very interesting.
> >
> >What about FSFS-specific requirements?
> See assumptions above, this may require a different
> data structure. But I think that noderev dependencies
> will turn out to be redundant if you have a log cache
> and access to the skip-delta forwards dependencies.
> >It sounds like you avoid those by storing data in semantics of the repos
> >layer (path@revision) instead of the FS layer (node-revision-id)?
> Yes.
> >In this case separate implementations for FSFS and BDB aren't needed.
> >This could be an advantage (e.g. third party FS implementations
> >wouldn't need to change to support this).
> It also eliminates on of the performance weaknesses
> of SVN today: A log on some old / seldom changed
> path can take a very long time.
> >
> >I'll think about this some more, thanks.
> >
> Welcome ;)
> 
> -- Stefan^2.

Re: server-side log cache

Reply via email to