>>>>> "LT" == Linus Torvalds <[EMAIL PROTECTED]> writes:
LT> What do people think? I'm not so much worried about the data itself: the
LT> git architecture is _so_ damn simple that now that the size estimate has
LT> been confirmed, that I don't think it would be a problem per se to put
LT> 3.2GB into the archive. But it will bog down "rsync" horribly, so it will
LT> actually hurt synchronization untill somebody writes the rev-tree-like
LT> stuff to communicate changes more efficiently..

LT> IOW, it smells to me like we don't have the infrastructure to really work
LT> with 3GB archives, and that if we start from scratch (2.6.12-rc2), we can
LT> build up the infrastructure in parallell with starting to really need it.

LT> But it's _great_ to have the history in this format, especially since
LT> looking at CVS just reminded me how much I hated it.

LT> Comments?

I was cooking this idea before I dove into the merge stuff and
have not had time to implement it myself (Hint Hint), but I think
something along the following lines would work nicely:

 * A script git-archive-tar is used to create a "base tarball"
   that roughly corresponds to "linux-*.tar.gz".  This works as
   follows:

     $ git-archive-tar C [B1 B2...]

   This reads the named commit C, grabs the associated tree
   (i.e. its sub-tree objects and the blobs they refer to), and
   makes a tarball of ??/?????????????????????????????????????? files.
   The tarball does not have to contain any extra information to
   reproduce any ancestor of the named commit.

   When extra parameters, B1 B2..., are given, it also creates a
   "diff package" that roughly corresponds to "patch-*.gz" for
   each Bn given.  Each Bn must be an ancestor of commit C.  The
   intention is to store enough information to ensure that the
   recipient can recreate all the SHA1 files that the "base
   tarball" for each commit in (Bn, C] would contain, provided
   the recipient already has all the SHA1 files from the "base
   tarball" for Bn.

 * A script git-archive-patch is used to read such a "diff
   package".
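To make the "base tarball" contents concrete, here is a minimal sketch in Python of walking a commit's tree and listing the object files that would go into the tarball. It runs against a toy in-memory object store, not git's on-disk format; the store layout and the names `tree_closure`, `object_path` and `base_tarball_members` are all made up for illustration. Whether the commit object itself should also be included is left open here.

```python
import hashlib

# Toy in-memory object store (hypothetical layout, NOT git's real format):
#   commit: {"type": "commit", "tree": sha, "parents": [sha, ...]}
#   tree:   {"type": "tree", "entries": {name: sha}}
#   blob:   {"type": "blob", "data": bytes}


def sha1_hex(data: bytes) -> str:
    """Helper used only to build 40-character hex names for the toy store."""
    return hashlib.sha1(data).hexdigest()


def tree_closure(store, tree_sha):
    """Return the tree, its sub-trees, and every blob they refer to."""
    seen = set()
    stack = [tree_sha]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        obj = store[sha]
        if obj["type"] == "tree":
            # Descend into sub-trees; blobs are simply recorded.
            stack.extend(obj["entries"].values())
    return seen


def object_path(sha: str) -> str:
    """Map a SHA1 name to the two-character directory plus 38-character file."""
    return f"{sha[:2]}/{sha[2:]}"


def base_tarball_members(store, commit_sha):
    """List the object files for the tree of one commit (no history)."""
    tree_sha = store[commit_sha]["tree"]
    return sorted(object_path(s) for s in tree_closure(store, tree_sha))
```

Note that the walk never follows the "parents" links, which matches the point that the base tarball carries nothing needed to reproduce ancestors of C.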
So a user needs to:

 * First pick some baseline B and download the base tarball for
   commit B.  It is up to him to make trade-offs between how far
   back he wants to see the history and how much bandwidth he
   wants to waste.  Untar it to get the baseline.

 * Then periodically pick up the "diff package" for (B, C] where
   C is the latest available.  Run git-archive-patch to populate
   the rest.

 * In addition, the user can run rsync with a timestamp option to
   pick up SHA1 files created upstream since C after this
   happens.

What git-archive-tar needs to do to produce the "diff package"
for (Bn, C] is fairly obvious.

 * From rev-tree output, find all the commits that are on the
   path from Bn to C.

 * Find all the SHA1 objects that appear on this commit chain;
   subtract what is in Bn since we assume the recipient has
   them already.

 * Run diff-tree between neighboring commits [*1*] to find out
   the set of blobs that are "related".  Extract those related
   blobs and run "diff" [*2*] between them to see if it produces
   a patch smaller than the whole thing when compressed.  If
   diff+patch is a win, then we do not have to transmit the blob
   that we could reproduce by sending the diff.  Note that fact.

 * When you are all done, you have a single patch file that
   contains small edits on numerous blobs, and a set of SHA1
   files that are cheaper to transmit whole than in patch form.
   Compress the patch file and package them together to make a
   tar archive.

Given the above, the operation of git-archive-patch is also
quite obvious.  Extract the "diff package" tarball into the
objects/ directory that has (at least) the full Bn, uncompress
the patch file part, and run patch on it.

[Footnotes]

*1* Alternatively, this diff-tree can be run between Bn and each
commit in (Bn, C].  It is like an incremental dump strategy.  We
should experiment and find a good balance.

*2* This does not have to be "diff -u"; we are assuming the
patch applies exactly, so diff -e or xdelta would do.  We should
experiment and find a good diff+patch pair.
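The subtraction step and the "is diff+patch a win" test described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: zlib stands in for whatever compression is finally used, a unified diff from difflib stands in for the diff+patch pair still to be chosen (see footnote 2), and the function names and the shape of `related_pairs` are hypothetical.

```python
import difflib
import zlib


def new_objects(chain_objects, base_objects):
    """Objects on the commit chain that the recipient does not have yet."""
    return set(chain_objects) - set(base_objects)


def make_diff(old: bytes, new: bytes) -> bytes:
    """Unified diff of two blobs; a stand-in for diff -e, xdelta, etc."""
    return b"".join(
        difflib.diff_bytes(
            difflib.unified_diff,
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            b"a",
            b"b",
        )
    )


def diff_is_a_win(old: bytes, new: bytes) -> bool:
    """True if the compressed diff is smaller than the compressed blob."""
    return len(zlib.compress(make_diff(old, new))) < len(zlib.compress(new))


def plan_package(related_pairs, needed):
    """Split the needed objects into 'send as patch' and 'send whole'.

    related_pairs: iterable of (old_sha, new_sha, old_data, new_data),
                   e.g. derived from diff-tree between neighboring commits.
    needed:        set of SHA1 names the recipient is missing.
    """
    as_patch = []
    as_whole = set(needed)
    for old_sha, new_sha, old_data, new_data in related_pairs:
        if new_sha in as_whole and diff_is_a_win(old_data, new_data):
            # Note the fact: this blob can be reproduced from the diff.
            as_patch.append((old_sha, new_sha))
            as_whole.discard(new_sha)
    return as_patch, as_whole
```

A one-line change to a large blob comes out as a patch entry, while a blob with nothing in common with its predecessor stays in the "send whole" set, since its diff has to carry both the old and the new contents.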