Jeff King wrote:
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!
Thanks much for working that out!
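For anyone who hasn't looked at strbuf_read(): the third argument is only a
preallocation hint, so passing the original file's size makes git allocate
that much memory up front even when the filter's output turns out to be tiny.
Roughly (paraphrased from memory, not the exact code):

    /* rough sketch of strbuf_read()'s hint handling */
    ssize_t strbuf_read(struct strbuf *sb, int fd, size_t hint)
    {
            /* the hint only sizes the initial allocation */
            strbuf_grow(sb, hint ? hint : 8192);
            for (;;) {
                    ssize_t got = xread(fd, sb->buf + sb->len,
                                        sb->alloc - sb->len - 1);
                    if (got < 0)
                            return -1;
                    if (!got)
                            break;
                    sb->len += got;
                    strbuf_grow(sb, 8192);  /* grow as needed past the hint */
            }
            sb->buf[sb->len] = '\0';
            return sb->len;
    }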
> In other words, I think this patch fixes your problem:
>
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
>  	if (start_async(&async))
>  		return 0;	/* error was already reported */
> 
> -	if (strbuf_read(&nbuf, async.out, len) < 0) {
> +	if (strbuf_read(&nbuf, async.out, 0) < 0) {
>  		err = error(_("read from external filter '%s' failed"), cmd);
>  	}
>  	if (close(async.out)) {
Yes, confirmed that does fix it.
> though possibly we should actually continue to use the file size as a
> hint up to a certain point, which avoids reallocations for more "normal"
> filters where the input and output sizes are in the same ballpark.
>
> Just off the top of my head, something like:
>
> 	/* guess that the filtered output will be the same size as the original */
> 	hint = len;
>
> 	/* allocate 10% extra in case the clean size is slightly larger */
> 	hint *= 1.1;
>
> 	/*
> 	 * in any case, never go higher than half of core.bigfileThreshold.
> 	 * We'd like to avoid allocating more bytes than that, and that still
> 	 * gives us room for our strbuf to preemptively double if our guess is
> 	 * just a little on the low side.
> 	 */
> 	if (hint > big_file_threshold / 2)
> 		hint = big_file_threshold / 2;
>
> But to be honest, I have no idea if that would even produce measurable
> benefits over simply growing the strbuf from scratch (i.e., hint==0).
Half of 512 MB is still quite a lot of memory to default to using in
this situation. E.g., smaller VPSes often still have only a GB or two of RAM.

When the clean filter is not being used to hash large files, it will mostly
be operating on typically sized source code files. So capping the maximum
hint at around the size of a typical source code file would be plenty for
both common uses of the clean filter.
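Concretely, I'm imagining something like this (the 64 kB cap is just a
stand-in for "typical source file size", not a tuned number):

    size_t hint = len;

    /* cap the preallocation at roughly one source file's worth of memory;
     * larger filter output just grows the strbuf as it is read */
    if (hint > 64 * 1024)
            hint = 64 * 1024;

    if (strbuf_read(&nbuf, async.out, hint) < 0) {
            err = error(_("read from external filter '%s' failed"), cmd);
    }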
I did some benchmarking, using cat as the clean filter:
git status 32 kb file, hint == len
time     3.865 ms   (3.829 ms .. 3.943 ms)
         0.994 R²   (0.987 R² .. 0.999 R²)
mean     3.934 ms   (3.889 ms .. 4.021 ms)
std dev  191.8 μs   (106.8 μs .. 291.8 μs)

git status 32 kb file, hint == 0
time     3.887 ms   (3.751 ms .. 4.064 ms)
         0.992 R²   (0.986 R² .. 0.998 R²)
mean     4.002 ms   (3.931 ms .. 4.138 ms)
std dev  292.2 μs   (189.0 μs .. 498.3 μs)

git status 1 mb file, hint == len
time     3.942 ms   (3.816 ms .. 4.078 ms)
         0.995 R²   (0.991 R² .. 0.999 R²)
mean     3.969 ms   (3.916 ms .. 4.054 ms)
std dev  220.1 μs   (155.1 μs .. 304.3 μs)

git status 1 mb file, hint == 0
time     3.869 ms   (3.836 ms .. 3.911 ms)
         0.998 R²   (0.995 R² .. 1.000 R²)
mean     3.895 ms   (3.868 ms .. 3.947 ms)
std dev  112.3 μs   (47.93 μs .. 182.7 μs)

git status 1 gb file, hint == len
time     7.173 s    (6.834 s .. 7.564 s)
         0.999 R²   (0.999 R² .. 1.000 R²)
mean     7.560 s    (7.369 s .. 7.903 s)
std dev  333.2 ms   (27.65 ms .. 412.2 ms)

git status 1 gb file, hint == 0
time     7.652 s    (6.307 s .. 8.263 s)
         0.996 R²   (0.992 R² .. 1.000 R²)
mean     8.082 s    (7.843 s .. 8.202 s)
std dev  232.3 ms   (2.362 ms .. 277.1 ms)
From this, it looks like the file has to be quite large before the
preallocation makes a sizable improvement to runtime, and the
smudge/clean filters have to be used for actual content filtering
(not for hash generation purposes, the way git-annex and git-lfs use them).
An unusual edge case, I think. So hint == 0 seems fine.
--
see shy jo