On Thu, 6 Dec 2007, Jon Loeliger wrote:
>
> On Thu, 2007-12-06 at 00:09, Linus Torvalds wrote:
> > Git also does delta-chains, but it does them a lot more "loosely". There 
> > is no fixed entity. Delta's are generated against any random other version 
> > that git deems to be a good delta candidate (with various fairly 
> > successful heursitics), and there are absolutely no hard grouping rules.
> 
> I'd like to learn more about that.  Can someone point me to
> either more documentation on it?  In the absence of that,
> perhaps a pointer to the source code that implements it?

Well, in a very real sense, what the delta code does is:
 - just list every single object in the whole repository
 - walk over each object, trying to find another object that it can be 
   written as a delta against
 - write out the result as a pack-file

That's simplified: we may not walk _all_ objects, for example: only a 
global repack does that (and most pack creations are actually for pushign 
and pulling between two repositories, so we only walk the objects that are 
in the source but not the destination repository).

The interesting phase is the "walk each object, try to find a delta" part. 
In particular, you don't want to try to find a delta by comparing each 
object to every other object out there (that would be O(n^2) in objects, 
and with a fairly high constant cost too!). So what it does is to sort the 
objects by a few heuristics (type of object, base name that object was 
found as when traversing a tree and size, and how recently it was found in 
the history).

And then over that sorted list, it tries to find deltas between entries 
that are "close" to each other (and that's where the "--window=xyz" thing 
comes in - it says how big the window is for objects being close. A 
smaller window generates somewhat less good deltas, but takes a lot less 
effort to generate).

The source is in git/builtin-pack-objects.c, with the core of it being

 - try_delta() - try to generate a *single* delta when given an object 
   pair.

 - find_deltas() - do the actual list traversal

 - prepare_pack() and type_size_sort() - create the delta sort list from 
   the list of objects.

but that whole file is probably some of the more opaque parts of git.

> I guess one question I posit is, would it be more accurate
> to think of this as a "delta net" in a weighted graph rather
> than a "delta chain"?

It's certainly not a simple chain, it's more of a set of acyclic directed 
graphs in the object list. And yes, it's weigted by the size of the delta 
between objects, and the optimization problem is kind of akin to finding 
the smallest spanning tree (well, forest - since you do *not* want to 
create one large graph, you also want to make the individual trees shallow 
enough that you don't have excessive delta depth).

There are good algorithms for finding minimum spanning trees, but this one 
is complicated by the fact that the biggest cost (by far!) is the 
calculation of the weights itself. So rather than really worry about 
finding the minimal tree/forest, the code needs to worry about not having 
to even calculate all the weights!

(That, btw, is a common theme. A lot of git is about traversing graphs, 
like the revision graph. And most of the trivial graph problems all assume 
that you have the whole graph, but since the "whole graph" is the whole 
history of the repository, those algorithms are totally worthless, since 
they are fundamentally much too expensive - if we have to generate the 
whole history, we're already screwed for a big project. So things like 
revision graph calculation, the main performance issue is to avoid having 
to even *look* at parts of the graph that we don't need to see!)

                        Linus

Reply via email to