Hi Asif,

I was looking at the patch and tried compiling it. However, I got a few
errors and warnings.
I have fixed those in the attached patch.

On Fri, Sep 27, 2019 at 9:30 PM Asif Rehman <asifr.reh...@gmail.com> wrote:
> Hi Robert,
>
> Thanks for the feedback. Please see my comments below:
>
> On Tue, Sep 24, 2019 at 10:53 PM Robert Haas <robertmh...@gmail.com>
> wrote:
>
>> On Wed, Aug 21, 2019 at 9:53 AM Asif Rehman <asifr.reh...@gmail.com>
>> wrote:
>> > - BASE_BACKUP [PARALLEL] - returns a list of files in PGDATA
>> > If the parallel option is there, then it will only do
>> > pg_start_backup, scan PGDATA, and send a list of file names.
>>
>> So IIUC, this would mean that BASE_BACKUP without PARALLEL returns
>> tarfiles, and BASE_BACKUP with PARALLEL returns a result set with a
>> list of file names. I don't think that's a good approach. It's too
>> confusing to have one replication command that returns totally
>> different things depending on whether some option is given.
>
> Sure. I will add a separate command (START_BACKUP) for parallel.
>
>> > - SEND_FILES_CONTENTS (file1, file2, ...) - returns the files in the
>> > given list. pg_basebackup will then send back a list of filenames in
>> > this command. This command will be sent by each worker, and that
>> > worker will receive the listed files.
>>
>> Seems reasonable, but I think you should just pass one file name and
>> use the command multiple times, once per file.
>
> I considered this approach initially; however, I adopted the current
> strategy to avoid multiple round trips between the server and clients
> and to save on query-processing time by issuing a single command rather
> than multiple ones. Furthermore, fetching multiple files at once will
> also aid in supporting the tar format by utilising the existing
> ReceiveTarFile() function, and will make it possible to create one
> tarball per tablespace per worker.
>
>> > - STOP_BACKUP
>> > When all workers finish, pg_basebackup will send the STOP_BACKUP
>> > command.
>>
>> This also seems reasonable, but surely the matching command should
>> then be called START_BACKUP, not BASEBACKUP PARALLEL.
>>
>> > I have done a basic proof of concept (POC), which is also attached.
>> > I would appreciate some input on this. So far, I am simply dividing
>> > the list equally and assigning the sublists to the worker processes.
>> > I intend to fine-tune this by taking file sizes into consideration.
>> > Further, to add tar format support, I am considering that each
>> > worker process processes all files belonging to a tablespace in its
>> > list (i.e. creates and copies a tar file) before it processes the
>> > next tablespace. As a result, this will create tar files that are
>> > disjoint with respect to the tablespace data. For example:
>>
>> Instead of doing this, I suggest that you should just maintain a list
>> of all the files that need to be fetched and have each worker pull a
>> file from the head of the list and fetch it when it finishes receiving
>> the previous file. That way, if some connections go faster or slower
>> than others, the distribution of work ends up fairly even. If you
>> instead pre-distribute the work, you're guessing what's going to
>> happen in the future instead of just waiting to see what actually does
>> happen. Guessing isn't intrinsically bad, but guessing when you could
>> be sure of doing the right thing *is* bad.
>>
>> If you want to be really fancy, you could start by sorting the files
>> in descending order of size, so that big files are fetched before
>> small ones. Since the largest possible file is 1GB and any database
>> where this feature is important is probably hundreds or thousands of
>> GB, this may not be very important. I suggest not worrying about it
>> for v1.
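To make the scheduling idea above concrete, here is a minimal sketch of
such a shared work list. This is illustrative only: FileEntry, FileQueue,
queue_init(), and queue_pop() are invented names, not anything from the
patch, and since the POC forks worker processes rather than spawning
threads, a real implementation would either keep this state in shared
memory or have the leader hand out file names over the existing
connections.

#include <pthread.h>
#include <stdlib.h>

/* Invented bookkeeping for the shared work list; the patch itself
 * keeps a List of pathinfo entries instead. */
typedef struct FileEntry
{
	const char *path;
	long		size;			/* collected while building the file list */
} FileEntry;

typedef struct FileQueue
{
	FileEntry  *entries;
	int			nentries;
	int			next;			/* next entry to hand out */
	pthread_mutex_t lock;
} FileQueue;

/* Biggest files first, per the suggestion above. */
static int
cmp_size_desc(const void *a, const void *b)
{
	const FileEntry *fa = (const FileEntry *) a;
	const FileEntry *fb = (const FileEntry *) b;

	if (fa->size != fb->size)
		return (fa->size < fb->size) ? 1 : -1;
	return 0;
}

/* Called once, before any worker starts fetching. */
static void
queue_init(FileQueue *q)
{
	qsort(q->entries, q->nentries, sizeof(FileEntry), cmp_size_desc);
	q->next = 0;
	pthread_mutex_init(&q->lock, NULL);
}

/* Each worker calls this whenever it finishes receiving a file;
 * NULL means there is no work left. */
static const char *
queue_pop(FileQueue *q)
{
	const char *path = NULL;

	pthread_mutex_lock(&q->lock);
	if (q->next < q->nentries)
		path = q->entries[q->next++].path;
	pthread_mutex_unlock(&q->lock);

	return path;
}

With this shape, a slow connection simply comes back for its next file
later, so the work stays balanced without any up-front guessing.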
>
> Ideally, I would like to support the tar format as well, which would be
> much easier to implement when fetching multiple files at once, since
> that would enable the existing functionality to be used without much
> change.
>
> Your idea of sorting the files in descending order of size seems very
> appealing. I think we can do this and have the files divided among the
> workers one by one, i.e. the first file in the list goes to worker 1,
> the second to worker 2, and so on.
>
>> > Say tablespace t1 has 20 files and we have 5 worker processes, and
>> > tablespace t2 has 10. Ignoring all other factors for the sake of
>> > this example, each worker process will get a group of 4 files of t1
>> > and 2 files of t2. Each process will create 2 tar files, one for t1
>> > containing 4 files and another for t2 containing 2 files.
>>
>> This is one of several possible approaches. If we're doing a
>> plain-format backup in parallel, we can just write each file where it
>> needs to go and call it good. But, with a tar-format backup, what
>> should we do? I can see three options:
>>
>> 1. Error! Tar format parallel backups are not supported.
>>
>> 2. Write multiple tar files. The user might reasonably expect that
>> they're going to end up with the same files at the end of the backup
>> regardless of whether they do it in parallel. A user with this
>> expectation will be disappointed.
>>
>> 3. Write one tar file. In this design, the workers have to take turns
>> writing to the tar file, so you need some synchronization around that.
>> Perhaps you'd have N threads that each read and buffer a file, and N+1
>> buffers. Then you have one additional thread that reads the complete
>> files from the buffers and writes them to the tar file. There's
>> obviously some possibility that the writer won't be able to keep up
>> and writing the backup will therefore be slower than it would be with
>> approach (2).
>>
>> There's probably also a possibility that approach (2) would thrash the
>> disk head back and forth between multiple files that are all being
>> written at the same time, and approach (3) will therefore win by not
>> thrashing the disk head. But, since spinning media are becoming less
>> and less popular and are likely to have multiple disk heads under the
>> hood when they are used, this is probably not too likely.
>>
>> I think your choice to go with approach (2) is probably reasonable,
>> but I'm not sure whether everyone will agree.
>
> Yes, for the tar format support, approach (2) is what I had in mind.
> Currently I'm working on the implementation and will share the patch in
> a couple of days.
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
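By the way, for anyone skimming the patch below: only fragments of
getFiles() appear in the hunks, but each worker's command is assembled
with a PQExpBuffer, roughly along these lines. This is a reconstruction
for illustration only; build_fetch_command() is an invented name, and the
exact SEND_FILES_CONTENTS quoting rules are an assumption here.

#include "pqexpbuffer.h"
#include "fe_utils/simple_list.h"

static PQExpBuffer
build_fetch_command(const SimpleStringList *files)
{
	PQExpBuffer buf = createPQExpBuffer();
	SimpleStringListCell *cell;

	appendPQExpBufferStr(buf, "SEND_FILES_CONTENTS (");
	for (cell = files->head; cell != NULL; cell = cell->next)
	{
		if (cell != files->head)
			appendPQExpBufferStr(buf, ", ");
		appendPQExpBuffer(buf, "'%s'", cell->val);
	}
	appendPQExpBufferStr(buf, " )");

	/* caller sends buf->data and frees it with destroyPQExpBuffer() */
	return buf;
}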
--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

commit 0d7433e44123b486b48f2071b24f1eaef46f4849
Author: Jeevan Chalke <jeevan.cha...@enterprisedb.com>
Date:   Thu Oct 3 12:58:55 2019 +0530

    Fix errors and warnings.

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ef55bd0..32ed160 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -346,6 +346,7 @@ perform_base_backup(basebackup_options *opt)
 	{
 		/* save backup label into temp file for now. So stop backup can send it to pg_basebackup later on. */
 		FILE *fp = AllocateFile(TMP_BACKUP_LABEL_FILE, "w");
+
 		if (!fp)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -1627,8 +1628,8 @@ StopBackup(basebackup_options *opt)
 	XLogRecPtr	endptr;
 	char	   *labelfile;
 	struct stat statbuf;
-	int			r;
 	StringInfoData buf;
+	FILE	   *lfp;
 
 	/* Disable throttling. */
 	throttling_counter = -1;
@@ -1648,7 +1649,7 @@ StopBackup(basebackup_options *opt)
 	sendFile(TMP_BACKUP_LABEL_FILE, BACKUP_LABEL_FILE, &statbuf, false, InvalidOid);
 
 	/* read backup_label file into buffer, we need it for do_pg_stop_backup */
-	FILE *lfp = AllocateFile(TMP_BACKUP_LABEL_FILE, "r");
+	lfp = AllocateFile(TMP_BACKUP_LABEL_FILE, "r");
 	if (!lfp)
 	{
 		ereport(ERROR,
@@ -1658,7 +1659,7 @@ StopBackup(basebackup_options *opt)
 	}
 
 	labelfile = palloc(statbuf.st_size + 1);
-	r = fread(labelfile, statbuf.st_size, 1, lfp);
+	fread(labelfile, statbuf.st_size, 1, lfp);
 
 	labelfile[statbuf.st_size] = '\0';
 
@@ -1692,6 +1693,7 @@ SendBackupFileList(List *tablespaces)
 {
 	StringInfoData buf;
 	ListCell   *lc;
+	pathinfo   *pi;
 	List	   *files = NIL;
 
 	foreach(lc, tablespaces)
@@ -1704,7 +1706,6 @@ SendBackupFileList(List *tablespaces)
 	}
 
 	// add backup label file
-	pathinfo *pi;
 	MAKE_PATHINFO(false, TMP_BACKUP_LABEL_FILE);
 	files = lcons(pi, files);
 
@@ -1736,15 +1737,15 @@ SendBackupFileList(List *tablespaces)
 	{
 		pathinfo   *pi = (pathinfo *) lfirst(lc);
 		char	   *path = pi->path;
+		Size		len;
 
 		/* Send one datarow message */
 		pq_beginmessage(&buf, 'D');
 		pq_sendint16(&buf, 2);	/* number of columns */
 
-		int32 isdir = pi->isdir ? 1 : 0;
-		send_int8_string(&buf, isdir);
+		send_int8_string(&buf, (pi->isdir ? 1 : 0));
 
-		Size len = strlen(path);
+		len = strlen(path);
 		pq_sendint32(&buf, len);
 		pq_sendbytes(&buf, path, len);
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1637735..a316cc6 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1864,6 +1864,7 @@ BaseBackup(void)
 {
 	SimpleStringList files = {NULL, NULL};
 	SimpleStringList **worker_files;
+	int			num_files;
 
 	/*
 	 * Get the header
@@ -1881,7 +1882,7 @@ BaseBackup(void)
 		exit(1);
 	}
 
-	int num_files = 0;
+	num_files = 0;
 	for (i = 0; i < PQntuples(res); i++)
 	{
 		bool		isdir = atoi(PQgetvalue(res, i, 0));
@@ -2537,6 +2538,7 @@ getFiles(SimpleStringList *files)
 	SimpleStringListCell *cell;
 	PGresult   *res = NULL;
 	int			i = 0;
+	PGconn	   *worker_conn;
 	PQExpBuffer buf;
 
 	buf = createPQExpBuffer();
@@ -2561,7 +2563,7 @@ getFiles(SimpleStringList *files)
 	}
 	appendPQExpBufferStr(buf, " )");
 
-	PGconn *worker_conn = GetConnection();
+	worker_conn = GetConnection();
 	if (!worker_conn)
 		return 1;
 
@@ -2623,9 +2625,10 @@ divideFilesList(SimpleStringList *files, int numFiles)
 static void
 create_workers_and_fetch(SimpleStringList **worker_files)
 {
+	int			i;
+
 	worker_process = (int*) palloc(sizeof(int) * numWorkers);
-	int			status;
-	int			pid, i;
+
 	for (i = 0; i < numWorkers; i++)
 	{
 		worker_process[i] = fork();
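The last hunk above ends mid-function, so for context, one plausible
shape for the rest of create_workers_and_fetch() is sketched below. This
is illustrative only and not the patch's actual code: getFiles(),
worker_files, worker_process, and numWorkers come from the patch, while
the waitpid() bookkeeping is an assumption (it needs <sys/wait.h>, plus
<stdio.h> and <stdlib.h> for the error paths, and uses pg_basebackup's
usual progname global).

	/* Illustrative continuation, not patch text: fork one worker per
	 * sublist and let each fetch its files over its own connection. */
	for (i = 0; i < numWorkers; i++)
	{
		worker_process[i] = fork();

		if (worker_process[i] == 0)
		{
			/* child: fetch this worker's share of files, then exit;
			 * getFiles() returns nonzero on failure */
			_exit(getFiles(worker_files[i]));
		}
		else if (worker_process[i] < 0)
		{
			fprintf(stderr, "%s: could not fork worker process\n", progname);
			exit(1);
		}
	}

	/* parent: wait for every worker and propagate failures */
	for (i = 0; i < numWorkers; i++)
	{
		int			status;

		if (waitpid(worker_process[i], &status, 0) < 0 ||
			!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		{
			fprintf(stderr, "%s: worker %d failed\n", progname, i);
			exit(1);
		}
	}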