On 2020-04-15 17:57, Robert Haas wrote:
> Over at http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com there's a proposal for a parallel backup patch which works in the way that I have always thought parallel backup would work: instead of having a monolithic command that returns a series of tarballs, you request individual files from a pool of workers. Leaving aside the quality-of-implementation issues in that patch set, I'm starting to think that the design is fundamentally wrong and that we should take a whole different approach. The problem I see is that it makes a parallel backup and a non-parallel backup work very differently, and I'm starting to realize that there are good reasons why you might want them to be similar.
That would clearly be a good goal. Non-parallel backup should ideally be parallel backup with one worker.
But it doesn't follow that the proposed design is wrong. It might just be that the design of the existing backup should change.
I think making the wire format so heavily tied to the tar format is dubious. There is nothing particularly fabulous about the tar format. If the server just sends a bunch of files with metadata for each file, the client can assemble them in any way it wants: unpacked, packed into several tarballs like now, packed all into one tarball, packed into a zip file, sent to S3, etc.
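To illustrate the kind of flexibility I mean, here is a rough sketch in Python. The message format and all names are invented; the point is only that once the server sends (metadata, contents) pairs, the assembly policy becomes a purely client-side decision:

    import io
    import os
    import tarfile
    from dataclasses import dataclass

    # Hypothetical per-file message as the server might send it;
    # the field names are made up for illustration.
    @dataclass
    class BackupFile:
        path: str       # relative path within the data directory
        mode: int       # permission bits
        size: int
        content: bytes  # in reality this would be streamed, not buffered

    def restore_unpacked(files, destdir):
        # One client policy: write the files out as a plain directory.
        for f in files:
            target = os.path.join(destdir, f.path)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                out.write(f.content)
            os.chmod(target, f.mode)

    def restore_single_tar(files, tarpath):
        # Another policy: pack the very same stream into one tarball.
        with tarfile.open(tarpath, "w") as tf:
            for f in files:
                info = tarfile.TarInfo(name=f.path)
                info.size = f.size
                info.mode = f.mode
                tf.addfile(info, io.BytesIO(f.content))

A zip file or an S3 upload would just be another such function; the server wouldn't need to know or care.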
Another thing I would like to see sometime is this: Pull a minimal base backup, start recovery and possibly hot standby before you have received all the files. When you need to access a file that isn't there yet, request it from the server as a priority. If you additionally nudge the file order a little, perhaps informed by prewarm-like data, you could have a mostly functional standby without having to wait for the full base backup to finish. Pulling a file on request is a prerequisite for this.
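Sketched client logic for that (again Python, and purely hypothetical; no such request protocol exists today): the backup client works through the file list in a default order, and recovery can bump a file it needs to the front of the queue.

    import heapq
    import itertools
    import threading

    class PrioritizedFetcher:
        def __init__(self, file_list):
            self._counter = itertools.count()
            self._lock = threading.Lock()
            # Default priority 1 for every file in the planned order.
            self._heap = [(1, next(self._counter), p) for p in file_list]
            heapq.heapify(self._heap)
            self._done = set()

        def request_urgently(self, path):
            # Called when recovery hits a file that hasn't arrived yet;
            # priority 0 sorts ahead of the default priority 1.
            with self._lock:
                if path not in self._done:
                    heapq.heappush(self._heap, (0, next(self._counter), path))

        def next_file(self):
            # The fetch loop calls this and asks the server for the result.
            with self._lock:
                while self._heap:
                    _, _, path = heapq.heappop(self._heap)
                    if path not in self._done:
                        self._done.add(path)
                        return path
                return None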
> So, my new idea for parallel backup is that the server will return tarballs, but just more of them. Right now, you get base.tar and ${tablespace_oid}.tar for each tablespace. I propose that if you do a parallel backup, you should get base-${N}.tar and ${tablespace_oid}-${N}.tar for some or all values of N between 1 and the number of workers, with the server deciding which files ought to go in which tarballs.
I understand the other side of this: Why not compress or encrypt the backup already on the server side? Makes sense. But this way seems weird and complicated. If I want a backup, I want one file, not an unpredictable set of files. How do I even know I have them all? Do we need a meta-manifest?
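To make that question concrete: a meta-manifest might be nothing more than a server-produced list of all the archives with sizes and checksums, which the client checks for completeness. A sketch, with an invented JSON layout:

    import hashlib
    import json
    import os

    # Hypothetical meta-manifest format, e.g.
    # {"files": [{"name": "base-1.tar", "size": 123, "sha256": "..."}, ...]}
    def verify_backup(manifest_path, backupdir):
        with open(manifest_path) as f:
            manifest = json.load(f)
        for entry in manifest["files"]:
            path = os.path.join(backupdir, entry["name"])
            if not os.path.exists(path):
                raise RuntimeError(f"missing from backup: {entry['name']}")
            if os.path.getsize(path) != entry["size"]:
                raise RuntimeError(f"size mismatch: {entry['name']}")
            with open(path, "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != entry["sha256"]:
                    raise RuntimeError(f"checksum mismatch: {entry['name']}")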
A format such as ZIP would offer more flexibility, I think. You can build a single target file incrementally, and you can compress or encrypt each member file separately, which allows doing some of the compression or encryption work on the server while still producing one file. I'm not saying it's perfect for this, but more thought about the choice of archive format could open up some possibilities.
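For example, here is what per-member compression looks like with Python's zipfile module, which I use here only to demonstrate what the ZIP format itself permits (the data and file names are dummies):

    import zipfile

    page_data = b"\x00" * 8192           # stand-in for an 8 kB relation page
    conf_data = b"shared_buffers = '128MB'\n"

    # A single archive built incrementally, member by member, with the
    # compression method chosen independently for each member.
    with zipfile.ZipFile("backup.zip", "w") as zf:
        zf.writestr("base/16384/16385", page_data,
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("postgresql.conf", conf_data,
                    compress_type=zipfile.ZIP_DEFLATED)

Tar, by contrast, only lets you compress the whole stream after the fact, which is exactly what makes server-side compression of a tar-based wire format awkward.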
All things considered, we'll probably want more options and more ways of doing things.
--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services