Sorry, forgot to forward this to the list... But while I'm at it: to me "pathological" doesn't mean bad, but simply "a very rarely occurring case", the way our maths profs use it - I just realized that the expression could easily be taken the wrong way.

And: many thanks for the info about the changes in Linux! The massive rate of change is interesting news to me.

---------- Forwarded message ----------
Subject: Re: Revision control
Date: Wednesday, 04 June 2008
From: Arne Babenhauserheide <[EMAIL PROTECTED]>
To: Ivan Shmakov <[EMAIL PROTECTED]>

On Wednesday, 04 June 2008 at 11:18:57, Ivan Shmakov wrote:
> > This is bound to take a little bit more space, especially for big
> > files which are changed very often,
>
> Consider, e. g.:
>
> linux-2.6.25.y.git $ git diff --numstat v2.6.2{4,5} | wc -l
> 9738
> linux-2.6.25.y.git $
>
> I. e., 9738 files (out of 23810, about 41%) have changed between
> Linux 2.6.24 and 2.6.25. I guess that it would take a whole lot
> of disk space to store the whole history for each of them a
> second time.
>
> (The new files should have been taken into account in the above
> estimation, but I don't think these will influence the result
> significantly enough.)

Yes, that case seems quite pathological to me... I don't know many projects where 40% of the files change between two releases, but maybe I simply lack experience. And git was written for Linux, so it should be optimal for it (which it surely is).

> > but having the changes for files data stored in a structure similar
> > to the tracked files takes advantage of disk and filesystem
> > capabilities when pulling in many changes by reducing seeks.
>
> I don't think so. Wouldn't the balanced tree be better than an
> average project's source tree in terms of speed? And doesn't
> Git's objects/ make a close approximation of a balanced tree?

Wouldn't a balanced tree mean that each update touches random files on the disk? A filesystem works best when it can read and write sequentially, so a balanced tree means random access - about the worst-case access pattern for a filesystem or disk, as far as I know (second only to strided access with a stride just above the read-ahead size / block size of the filesystem or disk).

Repacked repositories should avoid this, though, by having a single pack to read. (I don't know how the changes are stored in the pack. Does git have to read the whole pack to get at some changes, or does it read only parts of it? Reading the whole pack would be most efficient for disk access, but less efficient in terms of memory.)
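From what I've read, a pack ships with an .idx index file that maps each object to its byte offset, so git should be able to seek to individual objects instead of reading the whole pack - but I haven't verified that. A rough way to poke at a pack, assuming a repacked repository (the verbose output lists sha1, type, size, size in pack, and offset in pack for each object):

    $ git gc                        # repack loose objects into a pack
    $ git count-objects -v          # loose vs. packed object counts
    $ git verify-pack -v .git/objects/pack/pack-*.idx | head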
> > It's speed vs size, I think.
>
> Perhaps, but it seems that Mercurial could be faster only when
> it comes to random access to the history of a given file. It
> seems to me that the per-file structure will have a speed
> penalty (when compared to the per-changeset one) for every other
> usage case.

It should be faster when checking many changes (just read each changed store file from position x to position y) and when transmitting many commits at once (only one access per changed file instead of one per changeset).

> It may make sense to use a kind of ``combined'' approach, but I
> don't know of any VCS that implements some.

It might get complicated as hell. Maybe that's why :)

> > Besides: Is there some easy way to have a shared push repository to
> > do garbage collection automatically?
>
> To be honest, I don't use Git on a regular basis. For my own
> projects I tend to use GNU Arch, and I typically use whatever
> VCS for the projects I do participate in.
>
> I use Git primarily to track changes in the projects which
> either use Git (e. g., Linux, Gnulib, etc.) or use some
> non-distributed VCS (like SVN, via git-svn(1).) It seems to be
> a specific enough usage case, so that I've never encountered any
> need to GC my Git repositories.

Maybe you could do a test (a rough shell sketch follows the list):
- Create 1000 files and add+commit them.
- Check the size of the repository.
- Append 100 random lines to each file, committing each file in a separate commit.
- Check the size again.
- Now garbage collect.
- Check the size one last time.
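Something along these lines should do it - untested, and the directory and file names are made up:

    #!/bin/bash
    # Untested sketch of the test above.
    mkdir gc-test && cd gc-test && git init
    for i in $(seq 1 1000); do echo "file $i" > "file$i.txt"; done
    git add . && git commit -m "add 1000 files"
    du -sh .git                 # size of the fresh repository
    for i in $(seq 1 1000); do
        for j in $(seq 1 100); do echo $RANDOM >> "file$i.txt"; done
        git commit -am "change file $i"     # one commit per file
    done
    du -sh .git                 # size with ~1000 loose commits
    git gc                      # garbage collect / repack
    du -sh .git                 # size after gc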
From my experience, garbage collection makes a huge difference.

The only place I use git in everyday work is the Hurd wiki; apart from that I prefer to work with Mercurial. It feels far cleaner to me.

> FWIW, Git repositories (containing no packs) could be cloned
> with `rsync', too, while retaining all the savings in network
> bandwidth (again, due to the per-changeset structure of the
> repository.)

I just checked, and that seems to be possible with Mercurial, too (though I hadn't thought of it before):

- http://www.selenic.com/mercurial/wiki/index.cgi/UsingRsyncToPushAndPull

I haven't tried it myself, though (I prefer plain cloning for my repositories :) ).

Best wishes,
Arne

--
To be apolitical
Means being political
Without noticing it.
- Arne Babenhauserheide ( http://draketo.de )

--
Weblog: http://blog.draketo.de

--
Infinite Hands: http://infinite-hands.draketo.de - singing a part of the history of free software.

--
Ein Würfel System: http://1w6.org - simply clean (roleplaying) rules

--
My public key (PGP/GnuPG): http://draketo.de/inhalt/ich/pubkey.txt