On Thu, 6 Dec 2007, Jon Loeliger wrote: > > On Thu, 2007-12-06 at 00:09, Linus Torvalds wrote: > > Git also does delta-chains, but it does them a lot more "loosely". There > > is no fixed entity. Delta's are generated against any random other version > > that git deems to be a good delta candidate (with various fairly > > successful heursitics), and there are absolutely no hard grouping rules. > > I'd like to learn more about that. Can someone point me to > either more documentation on it? In the absence of that, > perhaps a pointer to the source code that implements it?
Well, in a very real sense, what the delta code does is: - just list every single object in the whole repository - walk over each object, trying to find another object that it can be written as a delta against - write out the result as a pack-file That's simplified: we may not walk _all_ objects, for example: only a global repack does that (and most pack creations are actually for pushign and pulling between two repositories, so we only walk the objects that are in the source but not the destination repository). The interesting phase is the "walk each object, try to find a delta" part. In particular, you don't want to try to find a delta by comparing each object to every other object out there (that would be O(n^2) in objects, and with a fairly high constant cost too!). So what it does is to sort the objects by a few heuristics (type of object, base name that object was found as when traversing a tree and size, and how recently it was found in the history). And then over that sorted list, it tries to find deltas between entries that are "close" to each other (and that's where the "--window=xyz" thing comes in - it says how big the window is for objects being close. A smaller window generates somewhat less good deltas, but takes a lot less effort to generate). The source is in git/builtin-pack-objects.c, with the core of it being - try_delta() - try to generate a *single* delta when given an object pair. - find_deltas() - do the actual list traversal - prepare_pack() and type_size_sort() - create the delta sort list from the list of objects. but that whole file is probably some of the more opaque parts of git. > I guess one question I posit is, would it be more accurate > to think of this as a "delta net" in a weighted graph rather > than a "delta chain"? It's certainly not a simple chain, it's more of a set of acyclic directed graphs in the object list. And yes, it's weigted by the size of the delta between objects, and the optimization problem is kind of akin to finding the smallest spanning tree (well, forest - since you do *not* want to create one large graph, you also want to make the individual trees shallow enough that you don't have excessive delta depth). There are good algorithms for finding minimum spanning trees, but this one is complicated by the fact that the biggest cost (by far!) is the calculation of the weights itself. So rather than really worry about finding the minimal tree/forest, the code needs to worry about not having to even calculate all the weights! (That, btw, is a common theme. A lot of git is about traversing graphs, like the revision graph. And most of the trivial graph problems all assume that you have the whole graph, but since the "whole graph" is the whole history of the repository, those algorithms are totally worthless, since they are fundamentally much too expensive - if we have to generate the whole history, we're already screwed for a big project. So things like revision graph calculation, the main performance issue is to avoid having to even *look* at parts of the graph that we don't need to see!) Linus