On Mon, Apr 20, 2020 at 4:19 PM Andres Freund <and...@anarazel.de> wrote:
> One question I have not really seen answered well:
>
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.
I spent some time today trying to understand just one part of this, which is
how long it will take to write the base backup out to disk and whether having
multiple independent processes helps. I settled on writing and fsyncing 64GB
of data, written in 8kB chunks, divided into 1, 2, 4, 8, or 16 equal size
files, with each file written by a separate process, and an fsync() at the end
before process exit. So in this test, there is no question of whether the
master can read the data fast enough, nor is there any issue of network
bandwidth. It's purely a test of whether it's faster to have one process write
a big file or whether it's faster to have multiple processes each write a
smaller file.

I tested this on EDB's cthulhu. It's an older server, but it happens to have 4
mount points available for testing, one with XFS + magnetic disks, one with
ext4 + magnetic disks, one with XFS + SSD, and one with ext4 + SSD. I did the
experiment described above on each mount point separately, and then I also
tried 4, 8, or 16 equal size files spread evenly across the 4 mount points.
To summarize the results very briefly:

1. ext4 degraded really badly with >4 concurrent writers. XFS did not.
2. SSDs were faster than magnetic disks, but you had to use XFS and >=4
   concurrent writers to get the benefit.
3. Spreading writes across the mount points works well, but the slowest mount
   point sets the pace.

Here are more detailed results, with times in seconds:

filesystem  media   1@64GB  2@32GB  4@16GB  8@8GB  16@4GB
xfs         mag         97      53      60     67      71
ext4        mag         94      68      66    335     549
xfs         ssd         97      55      33     27      25
ext4        ssd        116      70      66    227     450
spread      spread     n/a     n/a      48     42      44

The spread test with 16 files @ 4GB looks like this:

[/mnt/data-ssd/robert.haas/test14] open: 0, write: 7, fsync: 0, close: 0, total: 7
[/mnt/data-ssd/robert.haas/test10] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test2] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test6] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd2/robert.haas/test3] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test11] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test15] open: 0, write: 17, fsync: 0, close: 0, total: 17
[/mnt/data-ssd2/robert.haas/test7] open: 0, write: 18, fsync: 0, close: 0, total: 18
[/mnt/data-mag/robert.haas/test16] open: 0, write: 7, fsync: 18, close: 0, total: 25
[/mnt/data-mag/robert.haas/test4] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test12] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test8] open: 0, write: 7, fsync: 22, close: 0, total: 29
[/mnt/data-mag2/robert.haas/test9] open: 0, write: 20, fsync: 23, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test13] open: 0, write: 18, fsync: 25, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test5] open: 0, write: 19, fsync: 24, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test1] open: 0, write: 18, fsync: 25, close: 0, total: 43

The fastest write performance of any test was the 16-way XFS-SSD test, which
wrote at about 2.56 gigabytes per second. The fastest single-file test was on
ext4-magnetic, though ext4-ssd and xfs-magnetic were similar, around 0.66
gigabytes per second. Your system must be a LOT faster, because you were
seeing pg_basebackup running at, IIUC, ~3 gigabytes per second, and that would
have been a second process both writing and doing other things.
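(For reference, those rates are just 64GB divided by the elapsed times in the
table above: 64GB / 25s ≈ 2.56GB/s for the 16-way XFS-SSD case, and
64GB / 94-97s ≈ 0.66-0.68GB/s for the fastest single-file cases.)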
For comparison, some recent local pg_basebackup testing on this machine by
some of my colleagues ran at about 0.82 gigabytes per second. I suspect it
would be possible to get significantly higher numbers on this hardware by (1)
changing all the filesystems over to XFS and (2) dividing the data dynamically
based on write speed rather than writing the same amount of it everywhere. I
bet we could reach 6-8 gigabytes per second if we did all that. Now, I don't
know how much this matters. To get limited by this stuff, you'd need an
incredibly fast network - 10 or maybe 40 or 100 Gigabit Ethernet or something
like that - or to be doing a local backup. But I thought that it was
interesting and that I should share it, so here you go!

I do wonder if the apparent concurrency problems with ext4 might matter on
systems with high connection counts just in normal operation, backups aside.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
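To make "dividing the data dynamically" a bit more concrete, here is a rough
sketch - illustrative only, not the attached test program; error handling is
mostly omitted and the 128MB chunk size is arbitrary. Each writer claims the
next chunk of work from a shared counter, so a faster destination ends up
absorbing more of the data than a slower one:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define BLCKSZ 8192
#define CHUNK_BLOCKS 16384		/* 128MB work units; size is arbitrary */

int
main(int argc, char **argv)
{
	unsigned long long chunks_total;
	atomic_ullong *next_chunk;
	int			i;
	int			status;

	if (argc < 3)
	{
		fprintf(stderr, "usage: <total-chunks> <file> [<file> ...]\n");
		exit(1);
	}
	chunks_total = strtoull(argv[1], NULL, 0);

	/* shared counter from which every writer claims its next chunk */
	next_chunk = mmap(NULL, sizeof(*next_chunk), PROT_READ | PROT_WRITE,
					  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	atomic_init(next_chunk, 0);

	for (i = 2; i < argc; ++i)
	{
		if (fork() == 0)
		{
			char		junk[BLCKSZ];
			int			fd = open(argv[i], O_CREAT | O_TRUNC | O_WRONLY, 0600);
			unsigned long long j;

			if (fd < 0)
			{
				perror("open");
				exit(1);
			}
			memset(junk, 'J', BLCKSZ);
			/* keep claiming 128MB chunks until the work runs out */
			while (atomic_fetch_add(next_chunk, 1) < chunks_total)
			{
				for (j = 0; j < CHUNK_BLOCKS; ++j)
				{
					if (write(fd, junk, BLCKSZ) != BLCKSZ)
					{
						perror("write");
						exit(1);
					}
				}
			}
			if (fsync(fd) != 0 || close(fd) != 0)
				exit(1);
			exit(0);
		}
	}

	while (wait(&status) >= 0)
		;
	return 0;
}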
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define BLCKSZ 8192

extern void runtest(unsigned long long blocks_total, char *filename);

/*
 * Fork one child per output file; each child writes the requested number of
 * 8kB blocks of junk to its file, fsyncs it, and prints per-phase timings.
 */
int
main(int argc, char **argv)
{
	int			i;
	int			status;

	if (argc < 3)
	{
		fprintf(stderr, "not enough arguments\n");
		exit(1);
	}

	unsigned long long blocks_total = strtoull(argv[1], NULL, 0);

	for (i = 2; i < argc; ++i)
	{
		pid_t		pid = fork();

		if (pid == 0)
		{
			runtest(blocks_total, argv[i]);
			exit(0);
		}
		else if (pid < 0)
		{
			perror("fork");
			exit(1);
		}
	}

	/* wait for all children to finish */
	while (wait(&status) >= 0)
		;
	sleep(1);

	return 0;
}

void
runtest(unsigned long long blocks_total, char *filename)
{
	char		junk[BLCKSZ];

	memset(junk, 'J', BLCKSZ);

	time_t		t0 = time(NULL);
	int			fd = open(filename, O_CREAT | O_TRUNC | O_WRONLY, 0600);

	if (fd < 0)
	{
		perror("open");
		exit(1);
	}

	time_t		t1 = time(NULL);
	unsigned long long blocks_written = 0;

	while (blocks_written < blocks_total)
	{
		int			wc = write(fd, junk, BLCKSZ);

		if (wc != BLCKSZ)
		{
			fprintf(stderr, "wc = %d\n", wc);
			perror("write");
			exit(1);
		}
		++blocks_written;
	}

	time_t		t2 = time(NULL);

	if (fsync(fd) != 0)
	{
		perror("fsync");
		exit(1);
	}

	time_t		t3 = time(NULL);

	if (close(fd) != 0)
	{
		perror("close");
		exit(1);
	}

	time_t		t4 = time(NULL);

	printf("[%s] open: %u, write: %u, fsync: %u, close: %u, total: %u\n",
		   filename,
		   (unsigned) (t1 - t0), (unsigned) (t2 - t1), (unsigned) (t3 - t2),
		   (unsigned) (t4 - t3), (unsigned) (t4 - t0));
}
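(The test program above takes the number of 8kB blocks to write per file,
followed by one output filename per writer process; the driver script below
invokes it as, for example, ./write_and_fsync 8388608
/mnt/data-ssd/robert.haas/test1 for the single-file 64GB case, or
./write_and_fsync 524288 with sixteen filenames for the 16-way case, since
8388608 * 8kB = 64GB. It builds with plain gcc and needs nothing beyond libc.)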
#!/usr/bin/perl

# 2**23 blocks * 8kB = 64GB of data in total, split across the test files.
my $nblocks = 2**23;
my @dir = qw(
	/mnt/data-mag/robert.haas
	/mnt/data-mag2/robert.haas
	/mnt/data-ssd/robert.haas
	/mnt/data-ssd2/robert.haas
);
my @parallel_degree = qw(1 2 4 8 16);
my @result;

# One run per mount point at each degree of parallelism.
for my $dir (@dir)
{
	for my $degree (@parallel_degree)
	{
		my @cmd = ('./write_and_fsync', $nblocks / $degree);

		clean_dir($dir);
		for (my $i = 1; $i <= $degree; ++$i)
		{
			push @cmd, $dir . '/test' . $i;
		}
		push @result, sprintf("%s %s %s\n", $dir, $degree, try(@cmd));
	}
}

# "Spread" runs: test files distributed round-robin across all mount points.
for my $degree (@parallel_degree)
{
	next if $degree % @dir != 0;
	print $degree, "\n";
	my @cmd = ('./write_and_fsync', $nblocks / $degree);
	clean_dir($_) for @dir;
	for (my $i = 1; $i <= $degree; ++$i)
	{
		push @cmd, $dir[$i % @dir] . '/test' . $i;
	}
	push @result, sprintf("ALLDIRS %s %s\n", $degree, try(@cmd));
}

print @result;

sub clean_dir
{
	my ($dir) = @_;

	opendir(DIR, $dir) || die "opendir: $!";
	my @f = grep { /^test\d+$/ } readdir(DIR);
	closedir(DIR);

	for my $f (@f)
	{
		unlink("$dir/$f") || die "unlink: $!";
	}
}

sub try
{
	my (@cmd) = @_;

	print "executing: @cmd\n";
	my $t0 = time();
	system @cmd;
	my $t1 = time();
	return $t1 - $t0;
}