Ævar Arnfjörð Bjarmason <[email protected]> wrote:
> On Sat, May 11 2019, Eric Wong wrote:
> > +static int files_differ(FILE *fp, const char *path)
> > +{
> > +	struct stat st;
> > +	git_hash_ctx c;
> > +	struct object_id oid_old, oid_new;
> > +	struct strbuf tmp = STRBUF_INIT;
> > +	long new_len = ftell(fp);
> > +
> > +	if (new_len < 0 || stat(path, &st) < 0)
> > +		return 1;
> > +	if (!S_ISREG(st.st_mode))
> > +		return 1;
> > +	if ((off_t)new_len != st.st_size)
> > +		return 1;
> > +
> > +	rewind(fp);
> > +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_new.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_old.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	return hashcmp(oid_old.hash, oid_new.hash);
> > +}
>
> This way of doing it just seems so weirdly convoluted. Read them one at
> a time, compute the SHA-1, just to see if they're different. Why not
> something closer to a plain memcmp():
>
> static int files_differ(FILE *fp, const char *path)
> {
> 	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
> 	long new_len = ftell(fp);
> 	int diff = 1;
>
> 	if (new_len < 0)
> 		goto release_return;
> 	rewind(fp);
> 	if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
> 		goto release_return;
> 	if (strbuf_read_file(&old, path, 0) < 0)
> 		goto release_return;
>
> 	diff = strbuf_cmp(&old, &new);
>
> release_return:
> 	strbuf_release(&old);
> 	strbuf_release(&new);
>
> 	return diff;
> }
>
> I.e. optimize for code simplicity with something close to a dumb "cmp
> --silent". I'm going to make the bold claim that nobody using "dumb
> http" is going to be impacted by the performance of reading their
> for-each-ref or for-each-pack dump, hence opting for not even bothering
> to stat() to get the size before reading.
I've actually been trying to improve dumb HTTP for more cases
(since it's much cheaper than smart HTTP in server memory/CPU).
> Because really, if we were *trying* to micro-optimize this for time or
> memory use there's much better ways, e.g. reading the old file and
> memcmp() as we go and stream the "generate" callback, but I just don't
> see the point of trying in this case.
I was actually heading toward that route, but wasn't sure if this
idea would be accepted at all (and I've been trying to stay away
from non-scripting languages).
I don't like slurping all of info/refs into memory at all, so maybe
a streaming memcmp against the existing file is worth doing...
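
Something like this is what I have in mind (completely untested
sketch, plain stdio only; files_differ_stream() is a name I just
made up, and the 8K buffer size is arbitrary):

/*
 * Untested sketch: compare the newly-generated contents in "fp"
 * against the existing file at "path" without slurping either
 * one into memory.  Returns non-zero if they differ, or on any
 * error (in which case writing the new file is the safe thing
 * to do anyway).
 */
static int files_differ_stream(FILE *fp, const char *path)
{
	char buf_new[8192], buf_old[8192];
	FILE *old = fopen(path, "r");
	int diff = 1;

	if (!old)
		return 1;
	rewind(fp);
	for (;;) {
		size_t n = fread(buf_new, 1, sizeof(buf_new), fp);
		size_t o = fread(buf_old, 1, sizeof(buf_old), old);

		if (n != o || memcmp(buf_new, buf_old, n))
			break;	/* length or content mismatch */
		if (!n) {
			/* both streams ended at the same point */
			diff = ferror(fp) || ferror(old);
			break;
		}
	}
	fclose(old);
	return diff;
}
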
> >  /*
> >   * Create the file "path" by writing to a temporary file and renaming
> >   * it into place. The contents of the file come from "generate", which
> >   * should return non-zero if it encounters an error.
> >   */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> >  {
> >  	char *tmp = mkpathdup("%s_XXXXXX", path);
>
> Unrelated to this patch, but I hadn't thought about this nasty race
> condition. We recommend users run this from the "post-update" (or
> "post-receive") hook, and don't juggle the lock along with the ref
> update, thus due to the vagaries of scheduling you can end up with two
> concurrent ref updates where the "old" one wins.
>
> But I guess that brings me back to something close to "nobody with that
> sort of update rate is using 'dumb http'" :)
>
> For this to work properly we'd need to take some sort of global "ref
> update/pack update" lock, and I guess at that point this "cmp" case
> would be a helper similar to commit_lock_file_to(),
> i.e. a commit_lock_file_to_if_different().
Worth a separate patch, at some point, I think. I'm not too
familiar with the existing locking in git, actually...
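
If I'm reading lockfile.h right, such a helper might look something
like this (completely untested; files_identical() here is hypothetical,
a streaming path-vs-path comparison along the lines discussed above):

static int commit_lock_file_to_if_different(struct lock_file *lk,
					    const char *path)
{
	/*
	 * files_identical() is hypothetical: it would compare two
	 * files by path, streaming, and return non-zero if equal.
	 */
	if (files_identical(get_lock_file_path(lk), path)) {
		rollback_lock_file(lk);
		return 0;
	}
	return commit_lock_file_to(lk, path);
}
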
Along those lines, I think repack/gc should automatically
update objects/info/packs if the file already exists.
> >  	int ret = -1;
> >  	int fd = -1;
> >  	FILE *fp = NULL, *to_close;
> > +	int do_update;
> >
> >  	safe_create_leading_directories(path);
> >  	fd = git_mkstemp_mode(tmp, 0666);
> >  	if (fd < 0)
> >  		goto out;
> > -	to_close = fp = fdopen(fd, "w");
> > +	to_close = fp = fdopen(fd, "w+");
> >  	if (!fp)
> >  		goto out;
> >  	fd = -1;
> >  	ret = generate(fp);
> >  	if (ret)
> >  		goto out;
> > +
> > +	do_update = force || files_differ(fp, path);
> > [...]
> >
> > -static int update_info_refs(void)
> > +static int update_info_refs(int force)
>
> So, I was going to say "shouldn't we update the docs?" which for --force
> say "Update the info files from scratch.".
>
> But reading through it that "from scratch" wording is from c743e6e3c0
> ("Add a link from update-server-info documentation to repository
> layout.", 2005-09-04).
Yes, that wording is awkward and I can update it. But maybe making
it undocumented is sufficient and would save us the trouble
of describing it :)
"--force" might be seen as a performance optimization for cases
where you're certain the result will differ, but I'm not
sure if that's worth mentioning in the manpage.
> There it was a reference to a bug since fixed in 60d0526aaa ("Unoptimize
> info/refs creation.", 2005-09-14), and then removed from the docs in
> c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info",
> 2009-04-25).
>
> Then in b3223761c8 ("update_info_refs(): drop unused force parameter",
> 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>
> So having gone through the history I think we're better off just
> dropping the --force argument entirely, or at least changing the
> docs.
I can update the docs, or make the option undocumented. Either way,
command-line compatibility needs to remain in case there are
scripts using it.