On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman <asifr.reh...@gmail.com> wrote:
> - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
> If the parallel option is there, then it will only do pg_start_backup, scans 
> PGDATA and sends a list of file names.

So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
list of file names. I don't think that's a good approach. It's too
confusing to have one replication command that returns totally
different things depending on whether some option is given.

> - SEND_FILES_CONTENTS (file1, file2,...) - returns the files in given list.
> pg_basebackup will then send back a list of filenames in this command. This 
> commands will be send by each worker and that worker will be getting the said 
> files.

Seems reasonable, but I think you should just pass one file name and
use the command multiple times, once per file.

> - STOP_BACKUP
> when all workers finish then, pg_basebackup will send STOP_BACKUP command.

This also seems reasonable, but surely the matching command should
then be called START_BACKUP, not BASEBACKUP PARALLEL.

> I have done a basic proof of concenpt (POC), which is also attached. I would 
> appreciate some input on this. So far, I am simply dividing the list equally 
> and assigning them to worker processes. I intend to fine tune this by taking 
> into consideration file sizes. Further to add tar format support, I am 
> considering that each worker process, processes all files belonging to a 
> tablespace in its list (i.e. creates and copies tar file), before it 
> processes the next tablespace. As a result, this will create tar files that 
> are disjointed with respect tablespace data. For example:

Instead of doing this, I suggest that you should just maintain a list
of all the files that need to be fetched and have each worker pull a
file from the head of the list and fetch it when it finishes receiving
the previous file.  That way, if some connections go faster or slower
than others, the distribution of work ends up fairly even.  If you
instead pre-distribute the work, you're guessing what's going to
happen in the future instead of just waiting to see what actually does
happen. Guessing isn't intrinsically bad, but guessing when you could
be sure of doing the right thing *is* bad.

If you want to be really fancy, you could start by sorting the files
in descending order of size, so that big files are fetched before
small ones.  Since the largest possible file is 1GB and any database
where this feature is important is probably hundreds or thousands of
GB, this may not be very important. I suggest not worrying about it
for v1.

> Say, tablespace t1 has 20 files and we have 5 worker processes and tablespace 
> t2 has 10. Ignoring all other factors for the sake of this example, each 
> worker process will get a group of 4 files of t1 and 2 files of t2. Each 
> process will create 2 tar files, one for t1 containing 4 files and another 
> for t2 containing 2 files.

This is one of several possible approaches. If we're doing a
plain-format backup in parallel, we can just write each file where it
needs to go and call it good. But, with a tar-format backup, what
should we do? I can see three options:

1. Error! Tar format parallel backups are not supported.

2. Write multiple tar files. The user might reasonably expect that
they're going to end up with the same files at the end of the backup
regardless of whether they do it in parallel. A user with this
expectation will be disappointed.

3. Write one tar file. In this design, the workers have to take turns
writing to the tar file, so you need some synchronization around that.
Perhaps you'd have N threads that read and buffer a file, and N+1
buffers.  Then you have one additional thread that reads the complete
files from the buffers and writes them to the tar file. There's
obviously some possibility that the writer won't be able to keep up
and writing the backup will therefore be slower than it would be with
approach (2).

There's probably also a possibility that approach (2) would thrash the
disk head back and forth between multiple files that are all being
written at the same time, and approach (3) will therefore win by not
thrashing the disk head. But, since spinning media are becoming less
and less popular and are likely to have multiple disk heads under the
hood when they are used, this is probably not too likely.

I think your choice to go with approach (2) is probably reasonable,
but I'm not sure whether everyone will agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to