There currently is a thread on the Gerrit list about how much faster cloning 
can be when using Gerrit/jgit GCed packs with bitmaps versus C git GCed packs 
with bitmaps.

One difference that came up is that jgit seems to create more bitmaps: it 
makes one for every ref under refs/heads.  Is C git doing that?  Another 
difference is that jgit creates two packs, splitting anything not reachable 
from refs/heads into its own pack.  This lets a clone use zero server-side 
CPU in the pristine case.  In the Gerrit use case this second "unreachable" 
packfile can be sizeable; I wonder if there are other use cases where that is 
also true (and whether it is slowing down clones from C git GCed repos)?
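
To make the split I am describing concrete, here is a rough sketch (plain 
Python over a made-up object graph, not anything resembling jgit's actual 
pack code): walk everything reachable from refs/heads into one set, and let 
whatever is left over fall into the second "unreachable" pack.

  # Hypothetical sketch of a jgit-style GC partition.  'refs' maps ref
  # names to tip object ids; 'links' maps an object id to the ids it
  # references (parents, trees, blobs).
  def partition(all_objects, refs, links):
      reachable = set()
      stack = [tip for name, tip in refs.items()
               if name.startswith("refs/heads/")]
      while stack:
          oid = stack.pop()
          if oid in reachable:
              continue
          reachable.add(oid)
          stack.extend(links.get(oid, ()))
      unreachable = set(all_objects) - reachable
      return reachable, unreachable   # one pack each

Everything in the first set is exactly what a pristine clone needs, which is 
why it can be streamed out without any object walking at all.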

If there is not a lot of parallelism left to squeeze out, perhaps a focus 
with better returns is doing whatever is possible to make all clones (and 
potentially any fetch use case deemed important on a particular server) take 
zero CPU?  Depending on what a server's primary mission is, I could envision 
certain admins being willing to sacrifice significant amounts of disk space 
to speed up their fetches.  Perhaps some more extreme thinking (of the kind 
that must have led to bitmaps) is worth brainstorming about to improve the 
server use cases?

What if an admin were willing to sacrifice a packfile for every use case he 
deemed important; could git be made to support that easily?  For example, 
say the admin considers a clone, or a fetch of master, to be important: 
could zero CPU be achieved regularly for those two use cases?  For clones it 
is possible if the repository is repacked jgit-style after any push to a 
head.  Is it worth exploring ways of making GC efficient enough to make that 
feasible?  Can bitmaps be leveraged to make repacking faster?  I believe 
that at least the reachability walk could potentially be sped up with 
bitmaps.  Are there any ways to get better delta reuse during repacking (not 
bitmap related), perhaps by reversing or translating the deltas of newly 
received objects instead of recalculating them, while still ending up with 
most objects deltified against the newest objects (the same packs git gc 
produces today, just faster)?  What other pieces need to be improved to make 
repacking faster?
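
On the reachability point, the idea I have in mind looks roughly like the 
toy sketch below (plain Python ints standing in for bitmaps; real git 
bitmaps are EWAH-compressed and tied to a single pack, so this is only an 
illustration of the principle): the set of objects needed for a clone falls 
out of a bitwise OR over the ref tips' bitmaps instead of a full history 
walk.

  # Toy illustration: bit i of a bitmap means "the object at position i
  # in the pack index is reachable from this commit".
  def objects_for_clone(tip_bitmaps):
      result = 0
      for bm in tip_bitmaps:
          result |= bm        # union of everything the tips can reach
      return result

  def positions(bitmap):
      # Expand a bitmap back into pack-index positions.
      pos, i = [], 0
      while bitmap:
          if bitmap & 1:
              pos.append(i)
          bitmap >>= 1
          i += 1
      return pos

If repacking could consume an answer like that directly, the expensive walk 
would be mostly gone; the open question is how much the rest of repacking 
(delta selection, writing) still dominates.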

As for the single branch fetch case, could this somehow be improved by 
allocating one or more packfiles to that use case?  The simplest single 
branch fetch is likely someone doing a git init followed by a single branch 
fetch.  I think the Android repo tool can be used in this way, so it may 
actually be a common case?  With a packfile dedicated to this branch, git 
should be able to just stream it out without any CPU.  But I think git would 
need to know this packfile exists in order to use it.  It would be nice if 
bitmaps could help here, but I believe bitmaps can so far only cover a 
single packfile.  I understand that making bitmaps span multiple packfiles 
would be very complicated, but maybe it would not be so hard to support 
bitmaps on multiple packfiles if each of them were "self contained"?  By 
self contained I mean that all objects referenced by objects in the packfile 
are themselves contained in that packfile.
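
To pin down that definition, self containment is just a closure test over 
the pack's object set; a small sketch (hypothetical helper data, not the 
real pack machinery):

  # A pack is "self contained" if nothing inside it points at an object
  # stored outside it.  'links' maps object id -> ids it references.
  def is_self_contained(pack_objects, links):
      contained = set(pack_objects)
      for oid in contained:
          for ref in links.get(oid, ()):
              if ref not in contained:
                  return False
      return True

A bitmap over such a pack would never need to name objects from any other 
pack, which is what I am hoping makes the multi-pack case tractable.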

What other still unimplemented caching techniques could be used to improve 
clone/fetch use cases? 

- Shallow clones: dedicate a special packfile to these?  What about another 
bitmap format, one that only maps the objects in a single tree, to help here?

- Small fetches (simple branch fast-forward updates): I suspect these are 
fast enough already, but if not, maybe caching some thin packs (which could 
make the requests zero CPU for many clients) would be useful?  Maybe spread 
the cached packs out exponentially over time, so that many are available for 
recent updates and fewer for older ones.  I know git normally throws thin 
packs away after receiving and resolving them, but if it kept them around 
(maybe in a special directory), it seems they could be useful for updating 
other clients with zero CPU.  A thin pack cache might be really easy to 
manage based on file timestamps; an admin may simply need to set a max cache 
size.  But how can git know which thin packs it has and what they would be 
useful for?  Perhaps name them with their starting and ending SHAs?
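
To sketch that last idea (made-up file naming and layout, nothing git does 
today): name each cached thin pack after the old and new tips it covers, so 
the server can tell at a glance which fast-forward fetches it could serve 
with zero CPU, and prune oldest-first once the directory exceeds a 
configured size.

  import os

  # Hypothetical cache: the thin pack taking a client from commit OLD to
  # commit NEW is stored as thin-<old>-<new>.pack.
  def cache_path(cache_dir, old_sha, new_sha):
      return os.path.join(cache_dir, "thin-%s-%s.pack" % (old_sha, new_sha))

  def prune(cache_dir, max_bytes):
      entries = []
      for name in os.listdir(cache_dir):
          path = os.path.join(cache_dir, name)
          st = os.stat(path)
          entries.append((st.st_mtime, st.st_size, path))
      entries.sort()                       # oldest first
      total = sum(size for _, size, _ in entries)
      for _, size, path in entries:
          if total <= max_bytes:
              break
          os.remove(path)                  # drop the oldest packs first
          total -= size

Timestamp-based pruning keeps the admin's job down to picking max_bytes, 
which matches the "set a max cache size" knob above.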

Sorry for the long-winded rant.  I suspect that some variation of each of my 
suggestions has already been proposed, but maybe they will rekindle some 
older, now useful thoughts, or inspire some new ones.  And maybe some of 
these are better to pursue than more parallelism?

-Martin

Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a 
Linux Foundation Collaborative Project

On Feb 16, 2015 8:47 AM, Jeff King <p...@peff.net> wrote:
>
> On Mon, Feb 16, 2015 at 07:31:33AM -0800, David Lang wrote: 
>
> > >Then the server streams the data to the client. It might do some light 
> > >work transforming the data as it comes off the disk, but most of it is 
> > >just blitted straight from disk, and the network is the bottleneck. 
> > 
> > Depending on how close to full the WAN link is, it may be possible to 
> > improve this with multiple connections (again, referencing bbcp), but 
> > there's also the question of if it's worth trying to use the entire WAN for 
> > a single user. The vast majority of the time the server is doing more than 
> > one thing and would rather let any individual user wait a bit and service 
> > the other users. 
>
> Yeah, I have seen clients that make multiple TCP connections to each 
> request a chunk of a file in parallel. The short answer is that this is 
> going to be very hard with git. Each clone generates the pack on the fly 
> based on what's on disk and streams it out. It should _usually_ be the 
> same, but there's nothing to guarantee byte-for-byte equality between 
> invocations. So you'd have to multiplex all of the connections into the 
> same server process. And even then it's hard; that process knows it's 
> going to send you the bytes for object X, but it doesn't know at 
> exactly which offset until it gets there, which makes sending things out 
> of order tricky. And the whole output is checksummed by a single sha1 
> over the whole stream that comes at the end. 
>
> I think the most feasible thing would be to quickly spool it to a server 
> on the LAN, and then use an existing fetch-in-parallel tool to grab it 
> from there over the WAN. 
>
> -Peff 