I've been trying to figure out why my new buildfarm animal mamba occasionally fails the pg_basebackup tests [1][2]. I've not run that to ground yet, but one thing I've found that's consistently reproducible everywhere is that pg_basebackup's --gzip switch misbehaves. The manual says, and the principle of least astonishment agrees, that that should invoke gzip with the default compression level. However, the three test cases beginning at about line 810 of 010_pg_basebackup.pl produce these output file sizes on my x86_64 Linux machine:
backup_gzip ("--compress 1"): total 3672 -rw-r-----. 1 postgres postgres 137756 Sep 2 23:38 backup_manifest -rw-r-----. 1 postgres postgres 3538992 Sep 2 23:38 base.tar.gz -rw-r-----. 1 postgres postgres 73991 Sep 2 23:38 pg_wal.tar.gz backup_gzip2 ("--gzip"): total 19544 -rw-r-----. 1 postgres postgres 137756 Sep 2 23:38 backup_manifest -rw-r-----. 1 postgres postgres 3086972 Sep 2 23:38 base.tar.gz -rw-r-----. 1 postgres postgres 16781399 Sep 2 23:38 pg_wal.tar.gz backup_gzip3 ("--compress gzip:1"): total 3672 -rw-r-----. 1 postgres postgres 137756 Sep 2 23:38 backup_manifest -rw-r-----. 1 postgres postgres 3539006 Sep 2 23:38 base.tar.gz -rw-r-----. 1 postgres postgres 73989 Sep 2 23:38 pg_wal.tar.gz It makes sense that base.tar.gz is compressed a little better with --gzip than with level-1 compression, but why is pg_wal.tar.gz not compressed at all? It looks like the problem probably boils down to which of "-1" and "0" means "default behavior" vs "no compression", with different code layers interpreting that differently. I can't find exactly where that's happening, but I did manage to stop the failures with this crude hack: diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index e90aa0ba37..edddd9b578 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -1358,7 +1358,7 @@ CreateWalTarMethod(const char *tarbase, sprintf(tar_data->tarfilename, "%s%s", tarbase, suffix); tar_data->fd = -1; tar_data->compression_algorithm = compression_algorithm; - tar_data->compression_level = compression_level; + tar_data->compression_level = compression_level > 0 ? compression_level : Z_DEFAULT_COMPRESSION; tar_data->sync = sync; #ifdef HAVE_LIBZ if (compression_algorithm == PG_COMPRESSION_GZIP) That's not right as a real fix, because it would have the effect that "--compress gzip:0" would also invoke default compression, whereas what it should do is produce the uncompressed output we're actually getting. Both cases have compression_level == 0 by the time we reach here, though. I suspect that there are related bugs in other code paths in this rat's nest of undocumented functions and dubious API abstractions; but since it's all undocumented, who can say which places are wrong and which are not? I might not ding this code quite this hard, if I hadn't had equally-unpleasant encounters with it previously (eg 248c3a937). It's a mess, and I do not find it to be up to project standards. A vaguely-related matter is that the deflateParams calls all pass "0" as the third parameter: if (deflateParams(tar_data->zp, tar_data->compression_level, 0) != Z_OK) Aside from being unreadable, that's entirely unwarranted familiarity with the innards of libz. zlib.h says you should be writing a named constant, probably Z_DEFAULT_STRATEGY. BTW, I'm fairly astonished that anyone would have thought that three complete pg_basebackup cycles testing essentially-identical options were a good use of developer time and buildfarm cycles from here to eternity. Even if digging into it did expose a bug, the test case deserves little credit for that, because it entirely failed to call attention to the problem. I had to whack the script pretty hard just to get it to not delete the evidence. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-09-01%2018%3A38%3A27 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-08-31%2011%3A46%3A09