On Thu, May 23 2019, Jeff King wrote:

> On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:
>
>>   2. The answer we get from a bitmap versus a regular traversal are not
>>      apples-to-apples equivalent. The regular traversal walks down to
>>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>>      UNINTERESTING, and then adds in all the interesting trees and blobs
>>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>>      answer, claiming something is interesting when it is not.
>>
>>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>>      you get the true answer. But it means we may open up a lot more
>>      trees than the equivalent traversal would.
>>
>>      So one thing I've never really experimented with (and indeed, never
>>      really thought about until writing this email) is that the bitmaps
>>      could try to do that looser style of traversal, knowing that we
>>      might err on the side of calling things interesting in a few cases.
>>      But hopefully spending a lot less time opening trees.
>>
>>      I'm not even 100% sure what that would look like in code, but just
>>      thinking about it from a high level, I don't there's a particular
>>      mathematical reason it couldn't work.
>
> I spent a while thinking and experimenting with this tonight. The result
> is the patch below. Ævar, do you still have a copy of the repo that
> misbehaved? I'd be curious to see how it fares.

No, sorry. I think we could make an artificial test to emulate it, which
would be something like:

 * ~1 million commits
 * local clone setup to fetch all branches/tags (default 'git clone')
 * local a bit ahead/behind
 * Have old "main" pack with *.bitmap, bunch of other packs/loose objects 
without that
 * push try to push the local change upstream (or to a topic branch)

I tried briefly to emulate this with git.git with:

    (
        rm -rf /tmp/git /tmp/git.old &&
        git init /tmp/git.old &&
        git clone --bare https://github.com/git/git.git /tmp/git &&
        cd /tmp/git &&
        old_commit=$(git rev-parse v2.20.0^{}) &&
        git rev-list v2.12.0..v2.21.0|parallel 'git branch topic-{} {}' &&
        cd /tmp/git.old &&
        echo /tmp/git/objects >.git/objects/info/alternates &&
        git branch master $old_commit &&
        git reset --hard master &&
        git repack -Adb &&
        rm .git/objects/info/alternates &&
        for c in {1..10}
        do
            >$c &&
            git add $c &&
            git commit -m"bump $c"
        done
    )

But didn't get anywhere, probably because here my topics are all stuff I
have already, and I just have 2x packs.

It would be really cool to have some test tool that could produce
various "shape of repo" states like that. I.e. given an end-state of a
public repo emulate various plausible local client states like that
given some assumptions about how often the local client fetches, what
the GC settings are etc.

> Finding the right trees to explore is a little tricky with bitmaps.  In
> a normal traversal, we consider the "edges" to be worth exploring.
> I.e., the places where an UNINTERESTING commit is the parent of an
> interesting one.
>
> But with bitmaps, we don't have that information in the same way,
> because we're trying to avoid walking the commits in the first place. So
> e.g., in a history like this:
>
>   A--B--C
>       \
>        D
>
> Let's imagine we're computing "C..D", and that D has a bitmap on disk
> but C does not. When we visit D, we'll see that it has a bitmap, fill in
> the results in our cumulative "want" bitmap, and then stop walking its
> parents (because we know everything they could tell us already).
>
> Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
> traversal, we can't stop walking even though there are only
> UNINTERESTING commits left, because we have to fill the complete bitmap,
> including A and B (and you can imagine there might be thousands of
> ancestors of A, too). The reason is that we skipped adding ancestors of
> D when we saw its bitmap, so no still_interesting() cutoff will work
> there.
>
> But how do we know that it's worth looking into the tree of "B"? If we
> assume we visit commits before their ancestors (because of the commit
> timestamp ordering), then we'd see that "B" is in the "want" bitmap
> already, which makes it worth visiting its tree.
>
> Unfortunately we'd then continue on to "A", and it would _also_ look
> interesting, because it's also in the "want" bitmap. We don't have an
> easy way of knowing that "A" is basically already covered by "B", and is
> therefore not worth pursuing. In this example, we could pass down a bit
> from B to A as we traverse. But you can imagine another interesting
> commit branched from A, which _would_ make A's tree worth considering.
>
> Fundamentally, as soon as we load a bitmap and stop traversing, we lose
> all information about the graph structure.
>
> So the patch below just looks at every tree that might be worth
> exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
> _worse_ than what the current code does (because it looks at all the
> trees). It appears to behave correctly, at least so far as passing the
> test suite. Running p5310 and p5311 against git.git and linux.git, it
> seems to perform worse. I'm not quite sure why that is (i.e., if I
> screwed up something in the implementation, or if the algorithm is
> fundamentally worse).
>
> There are a lot of rough edges in the patch; I was just trying to see if
> the idea was even viable. So don't bother reviewing it as a real
> proposal for inclusion. If you do read it, I'd recommend just looking at
> the post-image, since it's essentially a rewrite and the diff is a bit
> messy.
>
> -Peff
>
> ---
>  pack-bitmap.c | 398 ++++++++++++++++++++++++--------------------------
>  1 file changed, 193 insertions(+), 205 deletions(-)
>
> diff --git a/pack-bitmap.c b/pack-bitmap.c
> index 6069b2fe55..4a40f62a38 100644
> --- a/pack-bitmap.c
> +++ b/pack-bitmap.c
> @@ -12,6 +12,8 @@
>  #include "packfile.h"
>  #include "repository.h"
>  #include "object-store.h"
> +#include "blob.h"
> +#include "prio-queue.h"
>
>  /*
>   * An entry on the bitmap index, representing the bitmap for a given
> @@ -422,180 +424,22 @@ static int ext_index_add_object(struct bitmap_index 
> *bitmap_git,
>       return bitmap_pos + bitmap_git->pack->num_objects;
>  }
>
> -struct bitmap_show_data {
> -     struct bitmap_index *bitmap_git;
> -     struct bitmap *base;
> -};
> -
> -static void show_object(struct object *object, const char *name, void *data_)
> -{
> -     struct bitmap_show_data *data = data_;
> -     int bitmap_pos;
> -
> -     bitmap_pos = bitmap_position(data->bitmap_git, &object->oid);
> -
> -     if (bitmap_pos < 0)
> -             bitmap_pos = ext_index_add_object(data->bitmap_git, object,
> -                                               name);
> -
> -     bitmap_set(data->base, bitmap_pos);
> -}
> -
> -static void show_commit(struct commit *commit, void *data)
> -{
> -}
> -
> -static int add_to_include_set(struct bitmap_index *bitmap_git,
> -                           struct include_data *data,
> -                           const struct object_id *oid,
> -                           int bitmap_pos)
> -{
> -     khiter_t hash_pos;
> -
> -     if (data->seen && bitmap_get(data->seen, bitmap_pos))
> -             return 0;
> -
> -     if (bitmap_get(data->base, bitmap_pos))
> -             return 0;
> -
> -     hash_pos = kh_get_oid_map(bitmap_git->bitmaps, *oid);
> -     if (hash_pos < kh_end(bitmap_git->bitmaps)) {
> -             struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, 
> hash_pos);
> -             bitmap_or_ewah(data->base, lookup_stored_bitmap(st));
> -             return 0;
> -     }
> -
> -     bitmap_set(data->base, bitmap_pos);
> -     return 1;
> -}
> -
> -static int should_include(struct commit *commit, void *_data)
> -{
> -     struct include_data *data = _data;
> -     int bitmap_pos;
> -
> -     bitmap_pos = bitmap_position(data->bitmap_git, &commit->object.oid);
> -     if (bitmap_pos < 0)
> -             bitmap_pos = ext_index_add_object(data->bitmap_git,
> -                                               (struct object *)commit,
> -                                               NULL);
> -
> -     if (!add_to_include_set(data->bitmap_git, data, &commit->object.oid,
> -                             bitmap_pos)) {
> -             struct commit_list *parent = commit->parents;
> -
> -             while (parent) {
> -                     parent->item->object.flags |= SEEN;
> -                     parent = parent->next;
> -             }
> -
> -             return 0;
> -     }
> -
> -     return 1;
> -}
> -
> -static struct bitmap *find_objects(struct bitmap_index *bitmap_git,
> -                                struct rev_info *revs,
> -                                struct object_list *roots,
> -                                struct bitmap *seen)
> +static int add_commit_to_bitmap(struct bitmap_index *bitmap_git,
> +                             struct bitmap **base, struct commit *commit)
>  {
> -     struct bitmap *base = NULL;
> -     int needs_walk = 0;
> -
> -     struct object_list *not_mapped = NULL;
> -
> -     /*
> -      * Go through all the roots for the walk. The ones that have bitmaps
> -      * on the bitmap index will be `or`ed together to form an initial
> -      * global reachability analysis.
> -      *
> -      * The ones without bitmaps in the index will be stored in the
> -      * `not_mapped_list` for further processing.
> -      */
> -     while (roots) {
> -             struct object *object = roots->item;
> -             roots = roots->next;
> -
> -             if (object->type == OBJ_COMMIT) {
> -                     khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, 
> object->oid);
> -
> -                     if (pos < kh_end(bitmap_git->bitmaps)) {
> -                             struct stored_bitmap *st = 
> kh_value(bitmap_git->bitmaps, pos);
> -                             struct ewah_bitmap *or_with = 
> lookup_stored_bitmap(st);
> -
> -                             if (base == NULL)
> -                                     base = ewah_to_bitmap(or_with);
> -                             else
> -                                     bitmap_or_ewah(base, or_with);
> -
> -                             object->flags |= SEEN;
> -                             continue;
> -                     }
> -             }
> -
> -             object_list_insert(object, &not_mapped);
> -     }
> -
> -     /*
> -      * Best case scenario: We found bitmaps for all the roots,
> -      * so the resulting `or` bitmap has the full reachability analysis
> -      */
> -     if (not_mapped == NULL)
> -             return base;
> -
> -     roots = not_mapped;
> -
> -     /*
> -      * Let's iterate through all the roots that don't have bitmaps to
> -      * check if we can determine them to be reachable from the existing
> -      * global bitmap.
> -      *
> -      * If we cannot find them in the existing global bitmap, we'll need
> -      * to push them to an actual walk and run it until we can confirm
> -      * they are reachable
> -      */
> -     while (roots) {
> -             struct object *object = roots->item;
> -             int pos;
> -
> -             roots = roots->next;
> -             pos = bitmap_position(bitmap_git, &object->oid);
> -
> -             if (pos < 0 || base == NULL || !bitmap_get(base, pos)) {
> -                     object->flags &= ~UNINTERESTING;
> -                     add_pending_object(revs, object, "");
> -                     needs_walk = 1;
> -             } else {
> -                     object->flags |= SEEN;
> -             }
> -     }
> -
> -     if (needs_walk) {
> -             struct include_data incdata;
> -             struct bitmap_show_data show_data;
> -
> -             if (base == NULL)
> -                     base = bitmap_new();
> -
> -             incdata.bitmap_git = bitmap_git;
> -             incdata.base = base;
> -             incdata.seen = seen;
> -
> -             revs->include_check = should_include;
> -             revs->include_check_data = &incdata;
> -
> -             if (prepare_revision_walk(revs))
> -                     die("revision walk setup failed");
> +     khiter_t pos = kh_get_oid_map(bitmap_git->bitmaps, commit->object.oid);
> +     if (pos < kh_end(bitmap_git->bitmaps)) {
> +             struct stored_bitmap *st = kh_value(bitmap_git->bitmaps, pos);
> +             struct ewah_bitmap *or_with = lookup_stored_bitmap(st);
>
> -             show_data.bitmap_git = bitmap_git;
> -             show_data.base = base;
> +             if (*base == NULL)
> +                     *base = ewah_to_bitmap(or_with);
> +             else
> +                     bitmap_or_ewah(*base, or_with);
>
> -             traverse_commit_list(revs, show_commit, show_object,
> -                                  &show_data);
> +             return 1;
>       }
> -
> -     return base;
> +     return 0;
>  }
>
>  static void show_extended_objects(struct bitmap_index *bitmap_git,
> @@ -665,24 +509,122 @@ static void show_objects_for_type(
>       }
>  }
>
> -static int in_bitmapped_pack(struct bitmap_index *bitmap_git,
> -                          struct object_list *roots)
> +static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
> +                              struct object *obj,
> +                              struct bitmap *bitmap,
> +                              struct bitmap *seen,
> +                              int ignore_missing);
> +
> +static void add_tag_to_bitmap(struct bitmap_index *bitmap_git,
> +                           struct tag *tag,
> +                           struct bitmap *bitmap,
> +                           struct bitmap *seen,
> +                           int ignore_missing)
> +{
> +     if (parse_tag(tag) < 0 || !tag->tagged) {
> +             if (ignore_missing)
> +                     return;
> +             die("unable to parse tag %s", oid_to_hex(&tag->object.oid));
> +     }
> +     add_object_to_bitmap(bitmap_git, tag->tagged, bitmap, seen,
> +                          ignore_missing);
> +}
> +
> +static void add_tree_to_bitmap(struct bitmap_index *bitmap_git,
> +                            struct tree *tree,
> +                            struct bitmap *bitmap,
> +                            struct bitmap *seen,
> +                            int ignore_missing)
>  {
> -     while (roots) {
> -             struct object *object = roots->item;
> -             roots = roots->next;
> +     struct tree_desc desc;
> +     struct name_entry entry;
>
> -             if (find_pack_entry_one(object->oid.hash, bitmap_git->pack) > 0)
> -                     return 1;
> +     if (parse_tree(tree) < 0) {
> +             if (ignore_missing)
> +                     return;
> +             die("unable to parse tree %s", oid_to_hex(&tree->object.oid));
>       }
>
> -     return 0;
> +     init_tree_desc(&desc, tree->buffer, tree->size);
> +     while (tree_entry(&desc, &entry)) {
> +             if (S_ISDIR(entry.mode)) {
> +                     struct tree *t = lookup_tree(the_repository, 
> &entry.oid);
> +                     if (!t) {
> +                             die(_("entry '%s' in tree %s has tree mode, "
> +                                   "but is not a tree"),
> +                                 entry.path, oid_to_hex(&tree->object.oid));
> +                     }
> +                     add_object_to_bitmap(bitmap_git, &t->object,
> +                                          bitmap, seen, ignore_missing);
> +             } else if (!S_ISGITLINK(entry.mode)) {
> +                     struct blob *b = lookup_blob(the_repository, 
> &entry.oid);
> +                     if (!b) {
> +                             die(_("entry '%s' in tree %s has blob mode, "
> +                                   "but is not a blob"),
> +                                 entry.path, oid_to_hex(&tree->object.oid));
> +                     }
> +                     add_object_to_bitmap(bitmap_git, &b->object,
> +                                          bitmap, seen, ignore_missing);
> +             }
> +     }
> +
> +     free_tree_buffer(tree);
> +}
> +
> +static void add_object_to_bitmap(struct bitmap_index *bitmap_git,
> +                              struct object *obj,
> +                              struct bitmap *bitmap,
> +                              struct bitmap *seen,
> +                              int ignore_missing)
> +{
> +     int pos = bitmap_position(bitmap_git, &obj->oid);
> +
> +     if (pos < 0)
> +             pos = ext_index_add_object(bitmap_git, obj, NULL);
> +
> +     if (bitmap_get(bitmap, pos))
> +             return; /* already there */
> +
> +     if (seen && bitmap_get(seen, pos))
> +             return; /* seen in other bitmap; not worth digging further */
> +
> +     bitmap_set(bitmap, pos);
> +
> +     if (obj->type == OBJ_BLOB)
> +             return;
> +     else if (obj->type == OBJ_TAG)
> +             add_tag_to_bitmap(bitmap_git, (struct tag *)obj,
> +                               bitmap, seen, ignore_missing);
> +     else if (obj->type == OBJ_TREE)
> +             add_tree_to_bitmap(bitmap_git, (struct tree *)obj,
> +                                bitmap, seen, ignore_missing);
> +     else
> +             BUG("unexpected object type: %d", obj->type);
> +}
> +
> +static void add_objects_to_bitmap(struct bitmap_index *bitmap_git,
> +                               struct object_list **list,
> +                               struct bitmap *bitmap,
> +                               struct bitmap *seen,
> +                               int ignore_missing)
> +{
> +     while (*list) {
> +             struct object_list *cur = *list;
> +
> +             add_object_to_bitmap(bitmap_git, cur->item,
> +                                  bitmap, seen, ignore_missing);
> +
> +             *list = cur->next;
> +             free(cur);
> +     }
>  }
>
>  struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs)
>  {
>       unsigned int i;
>
> +     struct prio_queue commits = { compare_commits_by_commit_date };
> +
>       struct object_list *wants = NULL;
>       struct object_list *haves = NULL;
>
> @@ -714,24 +656,16 @@ struct bitmap_index *prepare_bitmap_walk(struct 
> rev_info *revs)
>                       object = parse_object_or_die(&tag->tagged->oid, NULL);
>               }
>
> -             if (object->flags & UNINTERESTING)
> -                     object_list_insert(object, &haves);
> -             else
> -                     object_list_insert(object, &wants);
> +             if (object->type == OBJ_COMMIT)
> +                     prio_queue_put(&commits, object);
> +             else {
> +                     if (object->flags & UNINTERESTING)
> +                             object_list_insert(object, &haves);
> +                     else
> +                             object_list_insert(object, &wants);
> +             }
>       }
>
> -     /*
> -      * if we have a HAVES list, but none of those haves is contained
> -      * in the packfile that has a bitmap, we don't have anything to
> -      * optimize here
> -      */
> -     if (haves && !in_bitmapped_pack(bitmap_git, haves))
> -             goto cleanup;
> -
> -     /* if we don't want anything, we're done here */
> -     if (!wants)
> -             goto cleanup;
> -
>       /*
>        * now we're going to use bitmaps, so load the actual bitmap entries
>        * from disk. this is the point of no return; after this the rev_list
> @@ -742,20 +676,74 @@ struct bitmap_index *prepare_bitmap_walk(struct 
> rev_info *revs)
>
>       object_array_clear(&revs->pending);
>
> -     if (haves) {
> -             revs->ignore_missing_links = 1;
> -             haves_bitmap = find_objects(bitmap_git, revs, haves, NULL);
> -             reset_revision_walk();
> -             revs->ignore_missing_links = 0;
> +     haves_bitmap = bitmap_new();
> +     wants_bitmap = bitmap_new();
>
> -             if (haves_bitmap == NULL)
> -                     BUG("failed to perform bitmap walk");
> -     }
> +     /*
> +      * First traverse the relevant commits as we would for a normal
> +      * traversal.
> +      */
> +     while (commits.nr) {
> +             struct commit *commit = prio_queue_get(&commits);
> +             struct bitmap **dst_bitmap;
> +             int pos;
> +             struct commit_list *parent;
> +
> +             if (commit->object.flags & UNINTERESTING)
> +                     dst_bitmap = &haves_bitmap;
> +             else
> +                     dst_bitmap = &wants_bitmap;
>
> -     wants_bitmap = find_objects(bitmap_git, revs, wants, haves_bitmap);
> +             pos = bitmap_position(bitmap_git, &commit->object.oid);
> +             if (pos >= 0 && *dst_bitmap && bitmap_get(*dst_bitmap, pos))
> +                     continue; /* already covered */
>
> -     if (!wants_bitmap)
> -             BUG("failed to perform bitmap walk");
> +             if (add_commit_to_bitmap(bitmap_git, dst_bitmap, commit))
> +                     continue; /* skip adding parents, they're covered */
> +
> +
> +             /* otherwise mark ourselves and queue dependent objects */
> +             if (pos < 0)
> +                     pos = ext_index_add_object(bitmap_git, &commit->object, 
> NULL);
> +             bitmap_set(*dst_bitmap, pos);
> +             if (parse_commit(commit)) {
> +                     if (commit->object.flags & UNINTERESTING)
> +                             continue;
> +                     die("unable to parse commit %s",
> +                         oid_to_hex(&commit->object.oid));
> +             }
> +             if (commit->object.flags & UNINTERESTING) {
> +                     /*
> +                      * If an uninteresting commit is in the "wants" bitmap,
> +                      * then our tree is worth exploring. This means we may
> +                      * miss some trees in the "haves" that are not
> +                      * ancestors of "wants" (or that are, but we missed
> +                      * because of out-of-order timestamps).
> +                      */
> +                     if (wants_bitmap && bitmap_get(wants_bitmap, pos))
> +                             
> object_list_insert(&get_commit_tree(commit)->object,
> +                                                &haves);
> +             } else {
> +                     /*
> +                      * "wants" must always be explored
> +                      */
> +                     object_list_insert(&get_commit_tree(commit)->object,
> +                                        &wants);
> +             }
> +
> +             for (parent = commit->parents; parent; parent = parent->next) {
> +                     if (commit->object.flags & UNINTERESTING)
> +                             parent->item->object.flags |= UNINTERESTING;
> +                     prio_queue_put(&commits, parent->item);
> +             }
> +     }
> +
> +     /*
> +      * At this point we've processed all of the commits, and queued items
> +      * in "haves" and "wants" that need to be marked.
> +      */
> +     add_objects_to_bitmap(bitmap_git, &haves, haves_bitmap, NULL, 1);
> +     add_objects_to_bitmap(bitmap_git, &wants, wants_bitmap, haves_bitmap, 
> 0);
>
>       if (haves_bitmap)
>               bitmap_and_not(wants_bitmap, haves_bitmap);

Reply via email to