On Mon, Jan 08 2018, Jeff King jotted:

> On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:
>
>> I.e., what if we did something like this:
>>
>> diff --git a/sha1_name.c b/sha1_name.c
>> index 611c7d24dd..04c661ba85 100644
>> --- a/sha1_name.c
>> +++ b/sha1_name.c
>> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>>      if (len == GIT_SHA1_HEXSZ || !len)
>>              return GIT_SHA1_HEXSZ;
>>
>> +    /*
>> +     * A default length of 10 implies a repository big enough that it's
>> +     * getting expensive to double check the ambiguity of each object,
>> +     * and the chance that any particular object of interest has a
>> +     * collision is low.
>> +     */
>> +    if (len >= 10)
>> +            return len;
>> +
>
> Oops, this really needs to terminate the string in addition to returning
> the length (so it was always printing 40 characters in most cases). The
> correct patch is below, but it performs the same.
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..5921298a80 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>       if (len == GIT_SHA1_HEXSZ || !len)
>               return GIT_SHA1_HEXSZ;
>
> +     /*
> +      * A default length of 10 implies a repository big enough that it's
> +      * getting expensive to double check the ambiguity of each object,
> +      * and the chance that any particular object of interest has a
> +      * collision is low.
> +      */
> +     if (len >= 10) {
> +             hex[len] = 0;
> +             return len;
> +     }
> +
>       mad.init_len = len;
>       mad.cur_len = len;
>       mad.hex = hex;

That looks much more sensible, leaving aside other potential benefits of
MIDX.

Given the argument Linus made in e6c587c733 ("abbrev: auto size the
default abbreviation", 2016-09-30) maybe we should add a small integer
to the length for good measure, i.e. something like:

        if (len >= 10) {
                int extra = 2; /* or just 1? or maybe 0 ... */
                hex[len + extra] = 0;
                return len + extra;
        }
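
One caveat with adding a margin: len + extra must not run past the 40
hex digits the buffer holds. A minimal sketch of a clamped variant,
where HEXSZ stands in for GIT_SHA1_HEXSZ and abbrev_with_margin() is a
hypothetical helper for illustration, not anything in sha1_name.c:

```c
#include <assert.h>
#include <string.h>

#define HEXSZ 40 /* assumption: stands in for GIT_SHA1_HEXSZ */

/*
 * Hypothetical helper: truncate a full hex object name in place to
 * len + extra digits, clamped to HEXSZ. The caller must provide a
 * buffer of at least HEXSZ + 1 bytes.
 */
static int abbrev_with_margin(char *hex, int len, int extra)
{
	assert(len >= 0 && extra >= 0);

	if (len > HEXSZ - extra)
		len = HEXSZ;	/* the margin would overflow; use the full name */
	else
		len += extra;

	hex[len] = '\0';
	return len;
}
```

With len = 10 and extra = 2 this returns 12; with len = 39 it returns 40
rather than writing past the end of the buffer.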

I tried running:

    git log --pretty=format:%h --abbrev=7 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr

on several large repos. The explicit --abbrev=7 forces something like
the disambiguation we had before Linus's patch. On e.g. David Turner's
2015-04-03-1M-git.git test repo the output (count, then abbreviation
length) is:

     952858 7
      44541 8
       2861 9
        168 10
         17 11
          2 12

And the default abbreviation picks 12. I haven't yet found a case where
that's too short, but if we wanted to be extra safe we could just add a
hex digit or two to the abbreviated SHA-1.
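
For a rough feel of how much headroom extra digits buy, the birthday
estimate can be sketched as below. This assumes object names are
uniformly distributed; expected_colliding_pairs() is a made-up name for
illustration, and this is not the computation git itself performs:

```c
#include <assert.h>

/*
 * Rough birthday estimate: the expected number of object pairs that
 * share the same first `digits` hex digits, given n objects with
 * uniformly distributed names. That is C(n, 2) / 16^digits.
 */
static double expected_colliding_pairs(double n, int digits)
{
	double space = 1.0;
	int i;

	assert(n >= 0 && digits >= 0);

	for (i = 0; i < digits; i++)
		space *= 16.0;	/* 16 possible values per hex digit */

	return n * (n - 1.0) / 2.0 / space;
}
```

Each extra hex digit divides the estimate by 16, so two extra digits cut
it by a factor of 256.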
