On Fri, Mar 4, 2022 at 3:32 AM Dipesh Pandit <dipesh.pan...@gmail.com> wrote:
> GZIP manages to overcome this problem as it provides an option to turn on/off
> compression on the fly while writing a compressed archive with the help of zlib
> library function deflateParams(). The current gzip implementation for
> CreateWalTarMethod uses this library function to turn off compression just before
> step #1 and it writes the uncompressed header of size equal to TAR_BLOCK_SIZE.
> It uses the same library function to turn on the compression for writing the contents
> of the WAL file as part of step #2. It again turns off the compression just before step
> #3 to overwrite the header. The header is overwritten at the same offset with size
> equal to TAR_BLOCK_SIZE.
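
For anyone who hasn't read that code, the dance being described looks
roughly like this. This is a simplified sketch, not the actual
walmethods.c source: it assumes the caller has already set up the
z_stream with deflateInit(), and it omits error handling, the tar
checksum, and the rest of the real bookkeeping.

#include <stdio.h>
#include <zlib.h>

#define TAR_BLOCK_SIZE 512

/* Push one buffer through deflate() and write whatever comes out. */
static void
deflate_and_write(z_stream *zs, FILE *out, char *buf, size_t len, int flush)
{
	Bytef		outbuf[8192];

	zs->next_in = (Bytef *) buf;
	zs->avail_in = (uInt) len;
	do
	{
		zs->next_out = outbuf;
		zs->avail_out = sizeof(outbuf);
		deflate(zs, flush);
		fwrite(outbuf, 1, sizeof(outbuf) - zs->avail_out, out);
	} while (zs->avail_out == 0);
}

static void
write_wal_member(z_stream *zs, FILE *out, char *header,
				 char *payload, size_t payload_len)
{
	long		header_pos = ftell(out);

	/* Step 1: compression off; the dummy header goes out as a stored
	 * (uncompressed) deflate block, so its on-disk extent is known. */
	deflateParams(zs, Z_NO_COMPRESSION, Z_DEFAULT_STRATEGY);
	deflate_and_write(zs, out, header, TAR_BLOCK_SIZE, Z_FULL_FLUSH);

	/* Step 2: compression on for the WAL data itself.  The full flush
	 * byte-aligns the stream before each parameter change. */
	deflateParams(zs, Z_DEFAULT_COMPRESSION, Z_DEFAULT_STRATEGY);
	deflate_and_write(zs, out, payload, payload_len, Z_FULL_FLUSH);

	/* Step 3: compression off again, seek back, and overwrite the
	 * header at the same offset with size equal to TAR_BLOCK_SIZE,
	 * as the quoted steps describe. */
	deflateParams(zs, Z_NO_COMPRESSION, Z_DEFAULT_STRATEGY);
	fseek(out, header_pos, SEEK_SET);
	deflate_and_write(zs, out, header, TAR_BLOCK_SIZE, Z_FULL_FLUSH);
	fseek(out, 0, SEEK_END);
}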
This is a real mess. To me, it seems like a pretty big hack to use
deflateParams() to shut off compression in the middle of the compressed
data stream so that we can go back and overwrite that part of the data
later. It appears that the only reason we need that hack is that we
don't know the file size starting out. Except we kind of do know the
size, because pad_to_size specifies a minimum size for the file. It's
true that the maximum file size is unbounded, but I'm not sure why
that's important. I wonder if anyone else has an idea why we didn't
just set the file size to pad_to_size exactly when we write the tar
header the first time (see the sketch at the end of this mail), instead
of this IMHO kind of nutty approach where we back up. I'd try to figure
it out from the comments, but there basically aren't any. I also had a
look at the commit messages and didn't see anything relevant there
either. If I'm missing something, please point it out.

While I'm complaining: looking at this code, I noticed that it is
documented that "The caller must ensure that only one method is
instantiated in any given program, and that it's only instantiated
once!" As far as I can see, this is because somebody thought about
putting all of the relevant data into a struct and then decided on an
alternative strategy of storing some of it there, and the rest in a
global variable. I can't quite imagine why anyone would think that was
a good idea. There may be some reason that I can't see right now, but
here again there appear to be no relevant code comments.

I'm somewhat inclined to wonder whether we could just get rid of
walmethods.c entirely and use the new bbstreamer stuff instead. That
code also knows how to write plain files into a directory, and write
tar archives, and compress stuff, but in my totally biased opinion as
the author of most of that code, it's better code. It has no
restriction on using at most one method per program, or on
instantiating that method more than once; it already has LZ4 support;
and there's a pending patch for ZSTD support that I intend to get
committed soon as well. It also has, and I know I might be beating a
dead horse here, comments.

Now, admittedly, bbstreamer does need to know the size of each archive
member up front in order to work, so if we can't solve that problem
then we can't go this route. But if we can't solve that problem, then
we also can't add LZ4 and ZSTD support to walmethods.c, because random
access to compressed data is not really a thing, even if we hacked it
to work for gzip.

Thoughts?
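
To make the pad_to_size idea concrete, here is a sketch of what I mean.
This is illustrative only: emit() is a stand-in for "compress (or not)
and write these bytes", not the actual walmethods.c or bbstreamer API,
and the tar checksum and most other header fields are omitted. The
point is that the header is written exactly once, the tail is
zero-filled out to pad_to_size, and nothing ever seeks backward, so the
output stream can be compressed with anything.

#include <stdio.h>
#include <string.h>

#define TAR_BLOCK_SIZE 512

/* Illustrative stand-in for "compress (or not) and write these bytes". */
extern void emit(const void *buf, size_t len);

static void
write_padded_member(const char *name, const char *payload,
					size_t payload_len, size_t pad_to_size)
{
	char		header[TAR_BLOCK_SIZE] = {0};
	char		zeros[TAR_BLOCK_SIZE] = {0};
	size_t		remaining;

	/* Build the header once; the ustar size field (octal, at offset
	 * 124) is pad_to_size from the start. */
	strncpy(header, name, 100);
	snprintf(header + 124, 12, "%011zo", pad_to_size);
	emit(header, TAR_BLOCK_SIZE);

	/* Write the real data, then zero-fill to exactly pad_to_size. */
	emit(payload, payload_len);
	for (remaining = pad_to_size - payload_len; remaining > 0;)
	{
		size_t		chunk = remaining < TAR_BLOCK_SIZE
			? remaining : TAR_BLOCK_SIZE;

		emit(zeros, chunk);
		remaining -= chunk;
	}
}

--
Robert Haas
EDB: http://www.enterprisedb.com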