On 2020-04-15 17:57, Robert Haas wrote:
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com there's a proposal for a parallel backup patch which works in the way that I have always thought parallel backup would work: instead of having a monolithic command that returns a series of tarballs, you request individual files from a pool of workers. Leaving aside the quality-of-implementation issues in that patch set, I'm starting to think that the design is fundamentally wrong and that we should take a whole different approach. The problem I see is that it makes a parallel backup and a non-parallel backup work very differently, and I'm starting to realize that there are good reasons why you might want them to be similar.
That would clearly be a good goal. Non-parallel backup should ideally be parallel backup with one worker.
But it doesn't follow that the proposed design is wrong. It might just be that the design of the existing backup should change.
I think making the wire format so heavily tied to the tar format is dubious. There is nothing particularly fabulous about the tar format. If the server just sends a bunch of files with metadata for each file, the client can assemble them in any way it wants: unpacked, packed into several tarballs like now, packed all into one tarball, packed into a zip file, sent to S3, etc.
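To illustrate the kind of flexibility I mean, here is a rough sketch in Python. The message format and all names are invented; the point is only that once the server sends (metadata, contents) pairs, the assembly policy becomes a purely client-side decision:

    import io
    import os
    import tarfile
    from dataclasses import dataclass

    # Hypothetical per-file message as the server might send it;
    # the field names are made up for illustration.
    @dataclass
    class BackupFile:
        path: str       # relative path within the data directory
        mode: int       # permission bits
        size: int
        content: bytes  # in reality this would be streamed, not buffered

    def restore_unpacked(files, destdir):
        # One client policy: write the files out as a plain directory.
        for f in files:
            target = os.path.join(destdir, f.path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                out.write(f.content)
            os.chmod(target, f.mode)

    def restore_single_tar(files, tarpath):
        # Another policy: pack the very same stream into one tarball.
        with tarfile.open(tarpath, "w") as tf:
            for f in files:
                info = tarfile.TarInfo(name=f.path)
                info.size = f.size
                info.mode = f.mode
                tf.addfile(info, io.BytesIO(f.content))

A zip file or an S3 upload would just be another such function; the server wouldn't need to know or care.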
Another thing I would like to see sometime is this: Pull a minimal base backup, start recovery and possibly hot standby before you have received all the files. When you need to access a file that isn't there yet, request it from the server as a priority. If you additionally nudge the file order a little, perhaps informed by prewarm-like data, you could have a mostly functional standby without having to wait for the full base backup to finish. Pulling a file on request is a prerequisite for this.
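Sketched client logic for that (again Python, and purely hypothetical; no such request protocol exists today): the backup client works through the file list in a default order, and recovery can bump a file it needs to the front of the queue.

    import heapq
    import itertools
    import threading

    class PrioritizedFetcher:
        def __init__(self, file_list):
            self._counter = itertools.count()
            self._lock = threading.Lock()
            # Default priority 1 for every file in the planned order.
            self._heap = [(1, next(self._counter), p) for p in file_list]
            heapq.heapify(self._heap)
            self._done = set()

        def request_urgently(self, path):
            # Called when recovery hits a file that hasn't arrived yet;
            # priority 0 sorts ahead of the default priority 1.
            with self._lock:
                if path not in self._done:
                    heapq.heappush(self._heap, (0, next(self._counter), path))

        def next_file(self):
            # The fetch loop calls this and asks the server for the result.
            with self._lock:
                while self._heap:
                    _, _, path = heapq.heappop(self._heap)
                    if path not in self._done:
                        self._done.add(path)
                        return path
                return None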
> So, my new idea for parallel backup is that the server will return tarballs, but just more of them. Right now, you get base.tar and ${tablespace_oid}.tar for each tablespace. I propose that if you do a parallel backup, you should get base-${N}.tar and ${tablespace_oid}-${N}.tar for some or all values of N between 1 and the number of workers, with the server deciding which files ought to go in which tarballs.
I understand the other side of this: Why not compress or encrypt the backup already on the server side? Makes sense. But this way seems weird and complicated. If I want a backup, I want one file, not an unpredictable set of files. How do I even know I have them all? Do we need a meta-manifest?
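To make that question concrete: a meta-manifest might be nothing more than a server-produced list of all the archives with sizes and checksums, which the client checks for completeness. A sketch, with an invented JSON layout:

    import hashlib
    import json
    import os

    # Hypothetical meta-manifest format, e.g.
    # {"files": [{"name": "base-1.tar", "size": 123, "sha256": "..."}, ...]}
    def verify_backup(manifest_path, backupdir):
        with open(manifest_path) as f:
            manifest = json.load(f)
        for entry in manifest["files"]:
            path = os.path.join(backupdir, entry["name"])
            if not os.path.exists(path):
                raise RuntimeError(f"missing from backup: {entry['name']}")
            if os.path.getsize(path) != entry["size"]:
                raise RuntimeError(f"size mismatch: {entry['name']}")
            with open(path, "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != entry["sha256"]:
                    raise RuntimeError(f"checksum mismatch: {entry['name']}")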
A format such as ZIP would offer more flexibility, I think. You can build a single target file incrementally, and you can compress or encrypt each member file separately, which allows doing some of the compression or encryption work on the server while still producing one file. I'm not saying it's perfect for this, but more thought about the choice of archive format could open up some possibilities.
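For example, here is what per-member compression looks like with Python's zipfile module, which I use here only to demonstrate what the ZIP format itself permits (the data and file names are dummies):

    import zipfile

    page_data = b"\x00" * 8192           # stand-in for an 8 kB relation page
    conf_data = b"shared_buffers = '128MB'\n"

    # A single archive built incrementally, member by member, with the
    # compression method chosen independently for each member.
    with zipfile.ZipFile("backup.zip", "w") as zf:
        zf.writestr("base/16384/16385", page_data,
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("postgresql.conf", conf_data,
                    compress_type=zipfile.ZIP_DEFLATED)

Tar, by contrast, only lets you compress the whole stream after the fact, which is exactly what makes server-side compression of a tar-based wire format awkward.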
All things considered, we'll probably want more options and more ways of doing things.
--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services