On 26.04.19 at 16:51, Johannes Schindelin wrote:
> Hi Peff,
>
> On Mon, 15 Apr 2019, Jeff King wrote:
>
>> On Sun, Apr 14, 2019 at 12:01:10AM +0200, René Scharfe wrote:
>>
>>>>> As we already link to the zlib library, we can perform the compression
>>>>> without even requiring gzip on the host machine.
>>>>
>>>> Very cool. It's nice to drop a dependency, and this should be a bit more
>>>> efficient, too.
>>>
>>> Getting rid of dependencies is good, and using zlib is the obvious way to
>>> generate .tgz files. Last time I tried something like that, a separate gzip
>>> process was faster, though -- at least on Linux [1]. How does this one
>>> fare?
>>
>> I'd expect a separate gzip to be faster in wall-clock time for a
>> multi-core machine, but overall consume more CPU. I'm slightly surprised
>> that your timings show that it actually wins on total CPU, too.
My initial expectation back then was that moving data between processes
is costly and that compressing in-process would improve the overall
performance. Your expectation is more in line with what I then actually
saw. The difference in total CPU time wasn't that big, perhaps just
noise.
> If performance is really a concern, you'll be much better off using `pigz`
> than `gzip`.
Performance is always a concern, but on the other hand I haven't seen
any complaints about slow archiving so far.
>> Here are best-of-five times for "git archive --format=tar.gz HEAD" on
>> linux.git (the machine is a quad-core):
>>
>> [before, separate gzip]
>> real 0m21.501s
>> user 0m26.148s
>> sys 0m0.619s
>>
>> [after, internal gzwrite]
>> real 0m25.156s
>> user 0m25.059s
>> sys 0m0.096s
>>
>> which does show what I expect (longer overall, but less total CPU).
I get similar numbers with hyperfine:
  Benchmark #1: git archive --format=tar.gz HEAD >/dev/null
    Time (mean ± σ):     16.683 s ±  0.451 s    [User: 20.230 s, System: 0.375 s]
    Range (min … max):   16.308 s … 17.852 s    10 runs

  Benchmark #2: ~/src/git/git-archive --format=tar.gz HEAD >/dev/null
    Time (mean ± σ):     19.898 s ±  0.228 s    [User: 19.825 s, System: 0.073 s]
    Range (min … max):   19.627 s … 20.355 s    10 runs

  Benchmark #3: git archive --format=zip HEAD >/dev/null
    Time (mean ± σ):     16.449 s ±  0.075 s    [User: 16.340 s, System: 0.109 s]
    Range (min … max):   16.326 s … 16.611 s    10 runs
#1 is git v2.21.0, #2 is with the two patches applied, #3 is v2.21.0
again, but with zip output, just to put things into perspective.
>> Which one you prefer depends on your situation, of course. A user on a
>> workstation with multiple cores probably cares most about end-to-end
>> latency and using all of their available horsepower. A server hosting
>> repositories and receiving many unrelated requests probably cares more
>> about total CPU (though the differences there are small enough that it
>> may not even be worth having a config knob to un-parallelize it).
>
> I am a bit sad that this is so noticeable. Nevertheless, I think that
> dropping the dependency is worth it, in particular given that `gzip` is
> not exactly fast to begin with (you really should switch to `pigz` or to a
> faster compression if you are interested in speed).
We could import pigz verbatim, it's just 11K LOCs total. :)
>>> Doing compression in its own thread may be a good idea.
>>
>> Yeah. It might even make the patch simpler, since I'd expect it to be
>> implemented with start_async() and a descriptor, making it look just
>> like a gzip pipe to the caller. :)
>
> Sadly, it does not really look like it is simpler.
I have to agree -- at least I was unable to pull off the stdout
plumbing trick. Is there a way? But it doesn't look too bad, and
the performance is closer to using the real gzip:
  Benchmark #1: ~/src/git/git-archive --format=tar.gz HEAD >/dev/null
    Time (mean ± σ):     17.300 s ±  0.198 s    [User: 20.825 s, System: 0.356 s]
    Range (min … max):   17.042 s … 17.638 s    10 runs
This is with the following patch:
---
archive-tar.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 59 insertions(+), 4 deletions(-)
diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..c889b84c2c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,13 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif
 
+static int out_fd = 1;
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_or_die(out_fd, block, BLOCKSIZE);
 		offset = 0;
 	}
 }
@@ -66,7 +68,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_or_die(out_fd, buf, BLOCKSIZE);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +103,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_or_die(out_fd, block, BLOCKSIZE);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_or_die(out_fd, block, BLOCKSIZE);
 	}
 }
 
@@ -434,6 +436,56 @@ static int write_tar_archive(const struct archiver *ar,
 	return err;
 }
 
+static int internal_gzip(int in, int out, void *data)
+{
+	int *levelp = data;
+	gzFile gzip = gzdopen(out, "wb");
+	if (!gzip)
+		die(_("gzdopen failed"));
+	if (gzsetparams(gzip, *levelp, Z_DEFAULT_STRATEGY) != Z_OK)
+		die(_("unable to set compression level"));
+
+	for (;;) {
+		char buf[BLOCKSIZE];
+		ssize_t read = xread(in, buf, sizeof(buf));
+		if (read < 0)
+			die_errno(_("read failed"));
+		if (read == 0)
+			break;
+		if (gzwrite(gzip, buf, read) != read)
+			die(_("gzwrite failed"));
+	}
+
+	if (gzclose(gzip) != Z_OK)
+		die(_("gzclose failed"));
+	close(in);
+	return 0;
+}
+
+static int write_tar_gzip_archive(const struct archiver *ar,
+				  struct archiver_args *args)
+{
+	struct async filter;
+	int r;
+
+	memset(&filter, 0, sizeof(filter));
+	filter.proc = internal_gzip;
+	filter.data = &args->compression_level;
+	filter.in = -1;
+
+	if (start_async(&filter))
+		die(_("unable to fork off internal gzip"));
+	out_fd = filter.in;
+
+	r = write_tar_archive(ar, args);
+
+	close(out_fd);
+	if (finish_async(&filter))
+		die(_("error in internal gzip"));
+
+	return r;
+}
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -445,6 +497,9 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->data)
 		BUG("tar-filter archiver called with no filter defined");
 
+	if (!strcmp(ar->data, "gzip -cn"))
+		return write_tar_gzip_archive(ar, args);
+
 	strbuf_addstr(&cmd, ar->data);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
--
2.21.0