On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <and...@anarazel.de> wrote:
> One question I have not really seen answered well:
>
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.
I spent some time today trying to understand just one part of this, which is
how long it will take to write the base backup out to disk and whether having
multiple independent processes helps. I settled on writing and fsyncing 64GB
of data, written in 8kB chunks, divided into 1, 2, 4, 8, or 16 equal size
files, with each file written by a separate process, and an fsync() at the end
before process exit. So in this test, there is no question of whether the
master can read the data fast enough, nor is there any issue of network
bandwidth. It's purely a test of whether it's faster to have one process write
a big file or whether it's faster to have multiple processes each write a
smaller file.

I tested this on EDB's cthulhu. It's an older server, but it happens to have 4
mount points available for testing, one with XFS + magnetic disks, one with
ext4 + magnetic disks, one with XFS + SSD, and one with ext4 + SSD. I did the
experiment described above on each mount point separately, and then I also
tried 4, 8, or 16 equal size files spread evenly across the 4 mount points.
To summarize the results very briefly:

1. ext4 degraded really badly with >4 concurrent writers. XFS did not.
2. SSDs were faster than magnetic disks, but you had to use XFS and >=4
   concurrent writers to get the benefit.
3. Spreading writes across the mount points works well, but the slowest mount
   point sets the pace.

Here are more detailed results, with times in seconds:

filesystem  media   1@64GB  2@32GB  4@16GB  8@8GB  16@4GB
xfs         mag         97      53      60     67      71
ext4        mag         94      68      66    335     549
xfs         ssd         97      55      33     27      25
ext4        ssd        116      70      66    227     450
spread      spread     n/a     n/a      48     42      44

The spread test with 16 files @ 4GB looks like this:

[/mnt/data-ssd/robert.haas/test14] open: 0, write: 7, fsync: 0, close: 0, total: 7
[/mnt/data-ssd/robert.haas/test10] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test2] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test6] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd2/robert.haas/test3] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test11] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test15] open: 0, write: 17, fsync: 0, close: 0, total: 17
[/mnt/data-ssd2/robert.haas/test7] open: 0, write: 18, fsync: 0, close: 0, total: 18
[/mnt/data-mag/robert.haas/test16] open: 0, write: 7, fsync: 18, close: 0, total: 25
[/mnt/data-mag/robert.haas/test4] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test12] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test8] open: 0, write: 7, fsync: 22, close: 0, total: 29
[/mnt/data-mag2/robert.haas/test9] open: 0, write: 20, fsync: 23, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test13] open: 0, write: 18, fsync: 25, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test5] open: 0, write: 19, fsync: 24, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test1] open: 0, write: 18, fsync: 25, close: 0, total: 43

The fastest write performance of any test was the 16-way XFS-SSD test, which
wrote at about 2.56 gigabytes per second. The fastest single-file test was on
ext4-magnetic, though ext4-ssd and xfs-magnetic were similar, around 0.66
gigabytes per second. Your system must be a LOT faster, because you were
seeing pg_basebackup running at, IIUC, ~3 gigabytes per second, and that would
have been a second process both writing and doing other things.
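(For reference, those rates are just 64GB divided by the elapsed times in the
table above: 64GB / 25s ≈ 2.56GB/s for the 16-way XFS-SSD case, and
64GB / 94-97s ≈ 0.66-0.68GB/s for the fastest single-file cases.)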
For comparison, some recent local pg_basebackup testing on this machine by
some of my colleagues ran at about 0.82 gigabytes per second. I suspect it
would be possible to get significantly higher numbers on this hardware by (1)
changing all the filesystems over to XFS and (2) dividing the data dynamically
based on write speed rather than writing the same amount of it everywhere. I
bet we could reach 6-8 gigabytes per second if we did all that. Now, I don't
know how much this matters. To get limited by this stuff, you'd need an
incredibly fast network - 10 or maybe 40 or 100 Gigabit Ethernet or something
like that - or to be doing a local backup. But I thought that it was
interesting and that I should share it, so here you go!

I do wonder if the apparent concurrency problems with ext4 might matter on
systems with high connection counts just in normal operation, backups aside.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
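To make "dividing the data dynamically" a bit more concrete, here is a rough
sketch - illustrative only, not the attached test program; error handling is
mostly omitted and the 128MB chunk size is arbitrary. Each writer claims the
next chunk of work from a shared counter, so a faster destination ends up
absorbing more of the data than a slower one:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define BLCKSZ 8192
#define CHUNK_BLOCKS 16384		/* 128MB work units; size is arbitrary */

int
main(int argc, char **argv)
{
	unsigned long long chunks_total;
	atomic_ullong *next_chunk;
	int			i;
	int			status;

	if (argc < 3)
	{
		fprintf(stderr, "usage: <total-chunks> <file> [<file> ...]\n");
		exit(1);
	}
	chunks_total = strtoull(argv[1], NULL, 0);

	/* shared counter from which every writer claims its next chunk */
	next_chunk = mmap(NULL, sizeof(*next_chunk), PROT_READ | PROT_WRITE,
					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	atomic_init(next_chunk, 0);

	for (i = 2; i < argc; ++i)
	{
		if (fork() == 0)
		{
			char		junk[BLCKSZ];
			int			fd = open(argv[i], O_CREAT | O_TRUNC | O_WRONLY, 0600);
			unsigned long long j;

			if (fd < 0)
			{
				perror("open");
				exit(1);
			}
			memset(junk, 'J', BLCKSZ);
			/* keep claiming 128MB chunks until the work runs out */
			while (atomic_fetch_add(next_chunk, 1) < chunks_total)
			{
				for (j = 0; j < CHUNK_BLOCKS; ++j)
				{
					if (write(fd, junk, BLCKSZ) != BLCKSZ)
					{
						perror("write");
						exit(1);
					}
				}
			}
			if (fsync(fd) != 0 || close(fd) != 0)
				exit(1);
			exit(0);
		}
	}

	while (wait(&status) >= 0)
		;
	return 0;
}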
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define BLCKSZ 8192

extern void runtest(unsigned long long blocks_total, char *filename);

/*
 * Fork one child per output file; each child writes the requested number of
 * 8kB blocks of junk to its file, fsyncs it, and prints per-phase timings.
 */
int
main(int argc, char **argv)
{
	int			i;
	int			status;

	if (argc < 3)
	{
		fprintf(stderr, "not enough arguments\n");
		exit(1);
	}

	unsigned long long blocks_total = strtoull(argv[1], NULL, 0);

	for (i = 2; i < argc; ++i)
	{
		pid_t		pid = fork();

		if (pid == 0)
		{
			runtest(blocks_total, argv[i]);
			exit(0);
		}
		else if (pid < 0)
		{
			perror("fork");
			exit(1);
		}
	}

	/* wait for all children to finish */
	while (wait(&status) >= 0)
		;
	sleep(1);

	return 0;
}

void
runtest(unsigned long long blocks_total, char *filename)
{
	char		junk[BLCKSZ];

	memset(junk, 'J', BLCKSZ);

	time_t		t0 = time(NULL);
	int			fd = open(filename, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
	{
		perror("open");
		exit(1);
	}

	time_t		t1 = time(NULL);
	unsigned long long blocks_written = 0;

	while (blocks_written < blocks_total)
	{
		int			wc = write(fd, junk, BLCKSZ);

		if (wc != BLCKSZ)
		{
			fprintf(stderr, "wc = %d\n", wc);
			perror("write");
			exit(1);
		}
		++blocks_written;
	}

	time_t		t2 = time(NULL);

	if (fsync(fd) != 0)
	{
		perror("fsync");
		exit(1);
	}

	time_t		t3 = time(NULL);

	if (close(fd) != 0)
	{
		perror("close");
		exit(1);
	}

	time_t		t4 = time(NULL);

	printf("[%s] open: %u, write: %u, fsync: %u, close: %u, total: %u\n",
		   filename,
		   (unsigned) (t1 - t0), (unsigned) (t2 - t1), (unsigned) (t3 - t2),
		   (unsigned) (t4 - t3), (unsigned) (t4 - t0));
}
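(The test program above takes the number of 8kB blocks to write per file,
followed by one output filename per writer process; the driver script below
invokes it as, for example, ./write_and_fsync 8388608
/mnt/data-ssd/robert.haas/test1 for the single-file 64GB case, or
./write_and_fsync 524288 with sixteen filenames for the 16-way case, since
8388608 * 8kB = 64GB. It builds with plain gcc and needs nothing beyond libc.)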
#!/usr/bin/perl

# 2**23 blocks * 8kB = 64GB of data in total, split across the test files.
my $nblocks = 2**23;
my @dir = qw(
	/mnt/data-mag/robert.haas
	/mnt/data-mag2/robert.haas
	/mnt/data-ssd/robert.haas
	/mnt/data-ssd2/robert.haas
);
my @parallel_degree = qw(1 2 4 8 16);
my @result;

# One run per mount point at each degree of parallelism.
for my $dir (@dir)
{
	for my $degree (@parallel_degree)
	{
		my @cmd = ('./write_and_fsync', $nblocks / $degree);

		clean_dir($dir);
		for (my $i = 1; $i <= $degree; ++$i)
		{
			push @cmd, $dir . '/test' . $i;
		}
		push @result, sprintf("%s %s %s\n", $dir, $degree, try(@cmd));
	}
}

# "Spread" runs: test files distributed round-robin across all mount points.
for my $degree (@parallel_degree)
{
	next if $degree % @dir != 0;
	print $degree, "\n";
	my @cmd = ('./write_and_fsync', $nblocks / $degree);
	clean_dir($_) for @dir;
	for (my $i = 1; $i <= $degree; ++$i)
	{
		push @cmd, $dir[$i % @dir] . '/test' . $i;
	}
	push @result, sprintf("ALLDIRS %s %s\n", $degree, try(@cmd));
}

print @result;

sub clean_dir
{
	my ($dir) = @_;

	opendir(DIR, $dir) || die "opendir: $!";
	my @f = grep { /^test\d+$/ } readdir(DIR);
	closedir(DIR);

	for my $f (@f)
	{
		unlink("$dir/$f") || die "unlink: $!";
	}
}

sub try
{
	my (@cmd) = @_;

	print "executing: @cmd\n";
	my $t0 = time();
	system @cmd;
	my $t1 = time();
	return $t1 - $t0;
}