On Thu, 21 Apr 2005, Chris Mason wrote:
> 
> Shrug, we shouldn't need help from the kernel for something like this.
> git as a database hits worst case scenarios for almost every FS.

I really disagree. 

> We've got:
> 
> 1) subdirectories with lots of files
> 2) wasted space for tiny files
> 3) files that are likely to be accessed together spread across the whole disk

On the other hand, git does a lot of things that are just _lovely_ for a 
filesystem:

 - it never rewrites a file. Rewriting a file is unquestionably _the_
   single worst access pattern for any filesystem. In contrast, never
   writing to a file again means that filesystems can optimize their
   layout and that things like defragmentation actually work (a rough
   sketch of this write-once object layout follows the list).

 - it caches beautifully, and efficiently. Part of it comes from never 
   modifying files after they are written (which means that any network 
   filesystem automatically breathes a huge sigh of relief), but part of 
   it is that it always reads full files, and the layout is done so that 
   it really actually _uses_ everything it reads.

   It also caches beautifully on a memory subsystem level, largely for the
   same reasons.

 - it doesn't use tons of directories.

   You say that "subdirectories with lots of files" is painful, but that's 
   not really the whole story. A _deep_ directory structure tends to 
   actually be worse in many ways, because it's much easier to optimize a 
   flat directory structure than a deep one. In other words, git ends up 
   making name hashing etc _productive_. 
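
To make the "never rewrites a file" point above concrete, here is a rough
sketch of the write-once, content-addressed object layout. Take it as an
illustration of the idea only: the command names are today's ones and may
not match the early tools exactly.

    git init -q /tmp/demo && cd /tmp/demo
    echo 'hello world' > greeting.txt
    # store the blob and print its object name (a hash of the content)
    sha=$(git hash-object -w greeting.txt)
    # the object lives under .git/objects/<first two hex chars>/<rest>
    ls -l ".git/objects/$(echo "$sha" | cut -c1-2)/$(echo "$sha" | cut -c3-)"
    # hashing the same content again maps to the same name: nothing is rewritten
    git hash-object -w greeting.txt

Because an object's name is derived from its contents, it gets written
exactly once and is never touched again, which is exactly the access
pattern filesystems love.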

So yes, it's a bit wasteful. But it's wasteful of what is absolutely the
cheapest resource around: disk space. It's not a huge downside, and in
fact I really do believe that the biggest downside _by_far_ in disk space
utilization is the _seek_ cost, not the space itself. Let's face it,
anybody who wants three years of kernel archives and thinks that 3GB of
disk is too much has some serious problems.

The _seek_ issue is real, but git actually has a very nice architecture
even there: not only does it cache really really well (and you can do a
simple "ls-tree $(cat .git/HEAD)" and populate the cache from the results),
but the low level of indirection in a git archive means that it's almost
totally prefetchable with near-perfect access patterns.

In seeking, the real cost is synchronization, and the git model actually
means that there are very few seeks that have to be synchronized. You
could literally do the "ls-tree" thing and make an absolutely trivial
prefetcher that did the prefetching with enough parallelism that the
filesystem could probably get decent IO performance out of a disk.

In other words, we really could have a "git prefetch" command that would 
populate the cache of the current head quite efficiently. Because the data 
layout supports that.
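
As a rough sketch of what such a prefetcher could look like (using today's
command names, with an arbitrary choice of parallelism, and relying on the
-P flag of GNU/BSD xargs), something along these lines would walk the
current head and warm the page cache:

    # List every blob in the tree of HEAD ("<mode> <type> <sha>\t<path>"),
    # keep just the object names, and read them with 8 parallel readers,
    # 64 objects per batch.  Reading an object pulls it into the page
    # cache; the contents are simply thrown away.
    git ls-tree -r HEAD |
        awk '{ print $3 }' |
        xargs -n 64 -P 8 sh -c 'for sha in "$@"; do
            git cat-file blob "$sha" > /dev/null
        done' sh

Whether the filesystem then turns all those reads into nicely ordered IO
is its problem, but at least the access pattern gives it every chance to.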

                Linus