Sorry, forgot to forward this to the list... But while I'm at it: to me "pathological" doesn't mean bad, but simply "a very rarely occurring case", the way our maths profs use it - I just realized that the expression could easily be taken the wrong way.

And: many thanks for the info about the changes in Linux! The massive rate of change is interesting news to me.

---------- Forwarded message ----------
Subject: Re: Revision control
Date: Wednesday, 04 June 2008
From: Arne Babenhauserheide <[EMAIL PROTECTED]>
To: Ivan Shmakov <[EMAIL PROTECTED]>

On Wednesday, 04 June 2008 at 11:18:57, Ivan Shmakov wrote:
> > This is bound to take a little bit more space, especially for big
> > files which are changed very often,
>
> Consider, e. g.:
>
> linux-2.6.25.y.git $ git diff --numstat v2.6.2{4,5} | wc -l
> 9738
> linux-2.6.25.y.git $
>
> I. e., 9738 files (out of 23810, about 41%) have changed between
> Linux 2.6.24 and 2.6.25. I guess that it would take a whole lot
> of disk space to store the whole history for each of them a
> second time.
>
> (The new files should have been taken into account in the above
> estimation, but I don't think these will influence the result
> significantly enough.)

Yes, that case seems quite pathological to me... I don't know many projects where 40% of the files change between two releases, but maybe I simply lack experience. And git was written for Linux, so it should be optimal for it (which it surely is).

> > but having the changes for files data stored in a structure similar
> > to the tracked files takes advantage of disk and filesystem
> > capabilities when pulling in many changes by reducing seeks.
>
> I don't think so. Wouldn't the balanced tree be better than an
> average project's source tree in terms of speed? And doesn't
> Git's objects/ make a close approximation of a balanced tree?

Wouldn't a balanced tree mean that each update touches random files on the disk? A filesystem works best when it can read and write sequentially, so a balanced tree means random access - about the worst-case access pattern for a filesystem or disk, as far as I know (second only to strided access with a stride just above the read-ahead size / block size of the filesystem or disk).

Repacked repositories should avoid this, though, by having a single pack to read. (I don't know how the changes are stored in the pack. Does git have to read the whole pack to get at some changes, or does it read only parts of it? Reading the whole pack would be most efficient for disk access, but less efficient in terms of memory.)
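From what I've read, a pack ships with an .idx index file that maps each object to its byte offset, so git should be able to seek to individual objects instead of reading the whole pack - but I haven't verified that. A rough way to poke at a pack, assuming a repacked repository (the verbose output lists sha1, type, size, size in pack, and offset in pack for each object):

    $ git gc                        # repack loose objects into a pack
    $ git count-objects -v          # loose vs. packed object counts
    $ git verify-pack -v .git/objects/pack/pack-*.idx | head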
> > It's speed vs size, I think.
>
> Perhaps, but it seems that Mercurial could be faster only when
> it comes to random access to the history of a given file. It
> seems to me that the per-file structure will have a speed
> penalty (when compared to the per-changeset one) for every other
> usage case.

It should be faster when checking many changes (just read each changed store file from position x to position y) and when transmitting many commits at once (only one access per changed file instead of one per changeset).

> It may make sense to use a kind of ``combined'' approach, but I
> don't know of any VCS that implements some.

It might get complicated as hell. Maybe that's why :)

> > Besides: Is there some easy way to have a shared push repository to
> > do garbage collection automatically?
>
> To be honest, I don't use Git on a regular basis. For my own
> projects I tend to use GNU Arch, and I typically use whatever
> VCS for the projects I do participate in.
>
> I use Git primarily to track changes in the projects which
> either use Git (e. g., Linux, Gnulib, etc.) or use some
> non-distributed VCS (like SVN, via git-svn(1).) It seems to be
> a specific enough usage case, so that I've never encountered any
> need to GC my Git repositories.

Maybe you could do a test (a rough shell sketch follows the list):
- Create 1000 files and add+commit them.
- Check the size of the repository.
- Append 100 random lines to each file, committing each file in a separate commit.
- Check the size again.
- Now garbage collect.
- Check the size one last time.
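Something along these lines should do it - untested, and the directory and file names are made up:

    #!/bin/bash
    # Untested sketch of the test above.
    mkdir gc-test && cd gc-test && git init
    for i in $(seq 1 1000); do echo "file $i" > "file$i.txt"; done
    git add . && git commit -m "add 1000 files"
    du -sh .git                 # size of the fresh repository
    for i in $(seq 1 1000); do
        for j in $(seq 1 100); do echo $RANDOM >> "file$i.txt"; done
        git commit -am "change file $i"     # one commit per file
    done
    du -sh .git                 # size with ~1000 loose commits
    git gc                      # garbage collect / repack
    du -sh .git                 # size after gc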
From my experience, garbage collection makes a huge difference.

The only place I use git in everyday work is the Hurd wiki; apart from that I prefer to work with Mercurial. It feels far cleaner to me.

> FWIW, Git repositories (containing no packs) could be cloned
> with `rsync', too, while retaining all the savings in network
> bandwidth (again, due to the per-changeset structure of the
> repository.)

I just checked, and that seems to be possible with Mercurial, too (though I hadn't thought of it before):

- http://www.selenic.com/mercurial/wiki/index.cgi/UsingRsyncToPushAndPull

I haven't tried it myself, though (I prefer plain cloning for my repositories :) ).

Best wishes,
Arne

--
To be apolitical
Means being political
Without noticing it.
- Arne Babenhauserheide ( http://draketo.de )

--
Weblog: http://blog.draketo.de

--
Infinite Hands: http://infinite-hands.draketo.de - singing a part of the history of free software.

--
Ein Würfel System: http://1w6.org - simply clean (roleplaying) rules

--
My public key (PGP/GnuPG): http://draketo.de/inhalt/ich/pubkey.txt