Daniel Barkalow wrote:
> I have a design for using http-pull on a packed repository, and it only
> requires one extra file in the repository: an append-only list of the pack
> files (because getting the directory listing is very painful and
> failure-prone).
A few comments (as I've been tinkering with a way to solve the problem
myself).
As long as the pack files are named sensibly (i.e. if they are created
by git-repack-script), it's not very error-prone to just get the
directory listing and look for matches for pack-<sha1>.idx. It seems to
work quite well (see below). It isn't beautiful in any way, but it works...
[snip]
> If an individual file is not available, figure out what packs are
> available:
> Get the list of pack files the repository has
> (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
> For any packs we don't have, get the index files.
This part might be slightly expensive for large repositories. If one
assumes that packs are named as by git-repack-script, however, one can
cache the indexes one has already seen (again, see below). Or, if you go
for the mandatory "pack-index-file", require that it has a reliable
order, so that you can fetch the most recently added index first.
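Just to illustrate the caching idea, a minimal sketch (the
objects/info/pack-index path is your proposed file and purely
hypothetical here, as are $url and $tmpd; my patch below keeps its
cache in $GIT_DIR/checked_packs instead):

	# Compare the remote append-only pack list against a local
	# cache of pack ids whose indexes we have already fetched.
	seen=$GIT_DIR/seen_packs
	wget -q -O - "$url/objects/info/pack-index" \
		> "$tmpd/remote" || exit 1
	touch "$seen"
	LANG=C sort "$tmpd/remote" > "$tmpd/r"
	LANG=C sort "$seen" > "$tmpd/s"
	# comm -23: ids present remotely that we have never examined
	comm -23 "$tmpd/r" "$tmpd/s" |
	while read pack; do
		wget -q -O "$tmpd/pack-$pack.idx" \
			"$url/objects/pack/pack-$pack.idx" || exit 1
		echo "$pack" >> "$seen"
	done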
> Keep a list of the struct packed_gits for the packs the server has
> (these are not used as places to look for objects)
> Each time we need an object, check the list for it. If it is in there,
> download the corresponding pack and report success.
Here you will need some strategy for dealing with packs that overlap
with what we've already got. Basically, small and overlapping packs
should be unpacked, and big, non-overlapping ones saved as-is (since
git-unpack-objects is painfully slow and memory-hungry...). The
idx_policy function in the patch below is one stab at such a strategy.
One could also optimize the pack download by figuring out the last
object in the pack that we need (easy enough to do from the index
file), and fetching only the part of the pack file leading up to that
object. That could be a huge win for independently packed repositories
(I don't do that in my code below, though).
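For what it's worth, a rough sketch of that computation (hypothetical:
it assumes some pack id $i, the $url and $tmpd variables as above, a
server that honors HTTP Range requests, and curl to issue them; it
also reuses the git-cat-file --missing mode from the patch below):

	# git-show-index prints "offset sha1" pairs for a pack index.
	git-show-index < "$tmpd/pack-$i.idx" | sort -n > "$tmpd/offsets"
	cut -d' ' -f2 < "$tmpd/offsets" |
	git-cat-file --missing | LANG=C sort > "$tmpd/missing"
	limit=$(awk -v missing="$tmpd/missing" '
		BEGIN { while ((getline m < missing) > 0) want[m] = 1 }
		{ off[NR] = $1; sha[NR] = $2 }
		END {
			last = 0
			for (n = 1; n <= NR; n++)
				if (sha[n] in want) last = n
			# we need everything up to the start of the
			# object after the last one we are missing
			if (last && last < NR) print off[last + 1] - 1
		}' "$tmpd/offsets")
	# empty $limit means: fetch the whole pack (or nothing at all)
	if [ -n "$limit" ]; then
		curl -s -r "0-$limit" \
			"$url/objects/pack/pack-$i.pack" \
			> "$tmpd/pack-$i.pack"
	fi

Note that the truncated pack won't carry a valid trailing checksum, so
it couldn't simply be saved under objects/pack/; the objects would have
to be extracted from it instead.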
Anyway, here's my attempt at the same thing. It introduces
"git-dumb-fetch", with usage like git-fetch-pack (except that it works
with http and rsync). And it adds some ugliness to git-cat-file for
figuring out which objects we already have.
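The new git-cat-file modes read object names on stdin. For example, to
see how much of a pack we already have (pack id made up):

	# prints "<existing> <missing>" for the objects in the index
	git-show-index < .git/objects/pack/pack-<sha1>.idx |
	cut -d' ' -f2 | git-cat-file --count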
I'm sort of using the same basic strategy as you, except that I check
the pack files first (I didn't want to mess with http-pull.c, and I
wanted something that would work with rsync as well).
The strategy is this:
 o Check if the repository has some pack files we haven't seen
   already
 o If there are new pack files, download their indexes, and see
   if they contain anything new. If so, download the pack file
   and store or unpack it. In either case, note that we have
   seen the pack file in question (I've used
   $GIT_DIR/checked_packs).
 o Then
   o if http: do the git-http-pull stuff, and we're done
   o if rsync: get a list of all object files in the
     repository, and download the ones we're still missing.
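Invocation mirrors git-fetch-pack (host and ref here are made up):

	# fetch heads/master over http, remembering the result in
	# $GIT_DIR/refs/fetch-head
	git-dumb-fetch -w fetch-head http://host.example/repo.git \
		heads/master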
Feel free to take a look, and use anything that might be useful (if
anything...)
I'm not claiming that this method is better than yours; the main
differences are the caching of seen index files, and the fact that I
download packs first.
My way is faster if the repository contains overlapping object files
and packs, and it doesn't require any new infrastructure.
On the other hand, my method risks fetching too many objects if a pack
file contains only stuff from a branch we don't want. And it requires
the git-repack-script naming convention to be used on the remote side.
/dan
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -11,6 +11,42 @@ int main(int argc, char **argv)
char type[20];
void *buf;
unsigned long size;
+ int obj_count = 0;
+ int missing_count = 0;
+ char line[1000];
+
+ if (argc == 2 && !strcmp("--count", argv[1])) {
+ while (fgets(line, sizeof(line), stdin)) {
+ if (get_sha1(line, sha1))
+ die("invalid id %s", line);
+ if (has_sha1_file(sha1))
+ ++obj_count;
+ else
+ ++missing_count;
+ }
+ printf("%i %i\n", obj_count, missing_count);
+ return 0;
+ }
+
+ if (argc == 2 && !strcmp("--existing", argv[1])) {
+ while (fgets(line, sizeof(line), stdin)) {
+ if (get_sha1(line, sha1))
+ die("invalid id %s", line);
+ if (has_sha1_file(sha1))
+ printf("%s", line);
+ }
+ return 0;
+ }
+
+ if (argc == 2 && !strcmp("--missing", argv[1])) {
+ while (fgets(line, sizeof(line), stdin)) {
+ if (get_sha1(line, sha1))
+ die("invalid id %s", line);
+ if (!has_sha1_file(sha1))
+ printf("%s", line);
+ }
+ return 0;
+ }
if (argc != 3 || get_sha1(argv[2], sha1))
usage("git-cat-file [-t | -s | tagname] <sha1>");
diff --git a/git-dumb-fetch b/git-dumb-fetch
new file mode 100755
--- /dev/null
+++ b/git-dumb-fetch
@@ -0,0 +1,194 @@
+#! /bin/sh
+
+# git-dumb-fetch pulls objects from (optionally) packed remote
+# git repositories.
+
+. git-sh-setup-script || die "Not a git archive"
+
+checked_packs=$GIT_DIR/checked_packs
+
+usage() {
+ die "usage: git-dumb-fetch [-w ref] commit-id url"
+}
+
+http_download() {
+ tmpf=$(basename "$1")
+ wget -O "$tmpd/$tmpf" "$1"
+}
+
+http_cat() {
+ wget -q -O - "$1"
+}
+
+http_list_packs() {
+ # XXX: It would be nice to be able to differentiate between failed
+ # connections and missing pack dir. For now, assume the latter.
+ pindex=$(http_cat "$1/objects/pack/") || return 0
+ # die "error getting $1"
+ echo "$pindex" |
+ sed -n 's,.*pack-\([0-9a-f]\{40\}\)\.idx.*,\1\n,gp' |
+ sed '/^$/d' | sort | uniq
+}
+
+http_pull() {
+ git-http-pull -v -a "$1" "$2/"
+}
+
+rsync_download() {
+ rsync "$1" "$tmpd/" > /dev/null
+}
+
+rsync_cat() {
+ tmpf=$(basename "$1")
+ rsync_download "$1" && cat "$tmpd/$tmpf"
+}
+
+rsync_list_packs() {
+ # list every file on the remote side. we'll use that later
+ echo "Listing remote objects" >&2
+ rsync -zr "$1/objects/" > "$tmpd/files" &&
+ LANG=C sed -n 's,.*pack/pack-\([0-9a-f]\{40\}\)\.idx.*,\1,p' < \
+ "$tmpd/files"
+}
+
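+# Fetch the loose objects we are still missing.  Use --files-from when
+# the local rsync supports it; otherwise exclude everything we have.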
+rsync_pull() {
+ LANG=C sed -n 's,.*\([0-9a-f][0-9a-f]\)/\([0-9a-f]\{38\}\).*,\1\2,p' \
+ < "$tmpd/files" |
+ git-cat-file --missing > "$tmpd/missing" &&
+ LANG=C sed 's,^..,&/,' < "$tmpd/missing" > "$tmpd/tofetch" || exit 1
+
+ [ -s "$tmpd/tofetch" ] || { echo "Nothing new to fetch" >&2; return; }
+
+ if rsync --help 2>&1 | grep -q files-from; then
+ rsync -avz --ignore-existing --whole-file \
+ --files-from="$tmpd/tofetch" \
+ "$2/objects/" "$GIT_OBJECT_DIRECTORY/" >&2
+ else
+ LANG=C sed -n \
+ 's,.*\([0-9a-f][0-9a-f]\)/\([0-9a-f]\{38\}\).*,\1\2,p' \
+ < "$tmpd/files" |
+ git-cat-file --existing > "$tmpd/got" &&
+ LANG=C sed 's,^..,&/,' < "$tmpd/got" > "$tmpd/excl" || exit 1
+ if [ -f "$checked_packs" ]; then
+ sed 's,^.*,pack/pack-&.idx,' < "$checked_packs"
+ sed 's,^.*,pack/pack-&.pack,' < "$checked_packs"
+ fi >> "$tmpd/excl"
+ rsync -avz --ignore-existing --whole-file \
+ --exclude-from="$tmpd/excl" \
+ "$2/objects/" "$GIT_OBJECT_DIRECTORY/" >&2
+ fi
+}
+
+idx_policy() {
+ # existing=$1 missing=$2
+ if [ $1 -eq 0 -a $2 -eq 0 ]; then
+ echo empty
+ elif [ $2 -eq 0 ]; then
+ echo all
+ elif [ $1 -eq 0 ]; then
+ echo none
+ else
+ if [ $2 -gt 5000 -a $1 -lt $2 ]; then
+ # It's a really big pack. Don't unpack
+ echo all
+ else
+ echo partial
+ fi
+ fi
+}
+
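+# Read a pack index on stdin; print the download policy for the pack.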
+check_idx() {
+ counts=$(git-show-index | cut -d' ' -f2 | git-cat-file --count) ||
+ exit 1
+ idx_policy $counts
+}
+
+has_pack() {
+ [ -f "$GIT_OBJECT_DIRECTORY/pack/pack-$1.idx" -a \
+ -f "$GIT_OBJECT_DIRECTORY/pack/pack-$1.pack" ] && return 0
+ [ -f "$checked_packs" ] && grep -q "$1" < "$checked_packs"
+}
+
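+# Download every index we haven't examined before; fetch the pack
+# itself only if it contains something new, then store or unpack it.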
+fetch_packs() {
+ idx=$($list_packs "$1") || exit 1
+ [ "$idx" ] || return 0
+ echo "Examining remote packs: $idx" >&2
+ for i in $idx; do
+ has_pack $i && continue
+ echo "Downloading pack $i" >&2
+ $download "$1/objects/pack/pack-$i.idx" &&
+ gotit=$(check_idx < "$tmpd/pack-$i.idx") || exit 1
+
+ case $gotit in
+ partial | none)
+ $download "$1/objects/pack/pack-$i.pack" &&
+ git-verify-pack "$tmpd/pack-$i" || die "invalid pack" ;;
+ *)
+ echo "Already got all objects in pack $i" >&2 ;;
+ esac
+
+ case $gotit in
+ partial)
+ git-unpack-objects < "$tmpd/pack-$i.pack" || exit 1 ;;
+ none)
+ mv "$tmpd/pack-$i.idx" "$tmpd/pack-$i.pack" \
+ "$GIT_OBJECT_DIRECTORY/pack/" 2>/dev/null ||
+ cp "$tmpd/pack-$i.idx" "$tmpd/pack-$i.pack" \
+ "$GIT_OBJECT_DIRECTORY/pack/" || exit 1 ;;
+ esac
+ echo $i >> "$checked_packs"
+ done
+}
+
+
+while true; do
+ case $1 in
+ -w) shift; writeref=$1 ;;
+ --) shift; break ;;
+ -*) die "unknown option: $1" ;;
+ *) break ;;
+ esac
+ shift
+done
+
+url=$1 srchead=$2
+[ -n "$srchead" -a -n "$url" ] || usage
+
+case $url in
+ http://*) proto=http ;;
+ rsync://*) proto=rsync ;;
+ *) die "don't know how to fetch from $url" ;;
+esac
+
+download=${proto}_download
+cat=${proto}_cat
+list_packs=${proto}_list_packs
+pull=${proto}_pull
+
+tmpd=$(mktemp -d "${TMPDIR:-/tmp}/dumbfetch.XXXXXX") || exit 1
+trap "rm -rf '$tmpd'" 0 1 2 3 15
+
+echo "Fetching from $url" >&2
+remoteid=$($cat "$url/refs/$srchead") || die "error reading $srchead"
+
+if [ "$previd" = "$remoteid" ]; then
+ echo "Up to date" >&2
+ exit 0
+fi
+
+fetch_packs "$url" &&
+$pull "$remoteid" "$url" || die "fetch failed"
+
+[ -n "$writeref" ] && echo "$remoteid" > "$GIT_DIR/refs/$writeref"
+echo "$remoteid"
+