Hi,

On 2020-04-21 14:01:28 -0400, Robert Haas wrote:
> On Tue, Apr 21, 2020 at 11:36 AM Andres Freund <and...@anarazel.de> wrote:
> > It's all CRC overhead. I don't see a difference with
> > --manifest-checksums=none anymore. We really should look for a better
> > "fast" checksum.
>
> Hmm, OK. I'm wondering exactly what you tested here. Was this over
> your 20GiB/s connection between laptop and workstation, or was this
> local TCP?
It was local TCP. The speeds I can reach are faster than the 10GiB/s
(unidirectional) I can do between the laptop & workstation, so testing
it over "actual" network isn't informative - I basically can reach line
speed between them with any method.

> Also, was the database being read from persistent storage, or was it
> RAM-cached?

It was in kernel buffer cache. But I can reach 100% utilization of
storage too (which is slightly slower than what I can do over unix
socket).

pg_basebackup --manifest-checksums=none -h /tmp/ -D- -Ft -cfast -Xnone |pv -B16M -r -a > /dev/null
2.59GiB/s

find /srv/dev/pgdev-dev/base/ -type f -exec dd if={} bs=32k status=none \; |pv -B16M -r -a > /dev/null
2.53GiB/s

find /srv/dev/pgdev-dev/base/ -type f -exec cat {} + |pv -B16M -r -a > /dev/null
2.42GiB/s

I tested this with a -s 5000 DB, FWIW.

> How do you expect to take advantage of I/O parallelism without
> multiple processes/connections?

Which kind of I/O parallelism are you thinking of? Independent
tablespaces? Or devices that can handle multiple in-flight IOs? WRT the
latter, at least linux will keep many IOs in-flight for sequential
buffered reads.

> - UNIX socket was slower than a local TCP socket, and about the same
> speed as a TCP socket with SSL.

Hm. Interesting. Wonder if that's a question of the unix socket buffer
size?

> - CRC-32C is about 10% slower than no manifest and/or no checksums in
> the manifest. SHA256 is 1.5-2x slower, but less when compression is
> also used (see below).

> - Plain format is a little slower than tar format; tar with gzip is
> typically >~5x slower, but less when the checksum algorithm is SHA256
> (again, see below).

I see about 250MB/s with -Z1 (from the source side). If I hack
pg_basebackup.c to specify a deflate level of 0 to gzsetparams, which
the zlib docs say should disable compression, I get up to 700MB/s -
still a factor of ~3.7 slower than uncompressed. This seems largely due
to zlib's crc32 computation not being hardware accelerated (a sketch of
a hardware-accelerated CRC-32C loop is further down, below the quoted
parts):

-   99.75%     0.05%  pg_basebackup  pg_basebackup      [.] BaseBackup
   - 99.95% BaseBackup
      - 81.60% writeTarData
         - gzwrite
            - gz_write
               - gz_comp.constprop.0
                  - 85.11% deflate
                     - 97.66% deflate_stored
                        + 87.45% crc32_z
                        + 9.53% __memmove_avx_unaligned_erms
                        + 3.02% _tr_stored_block
                       2.27% __memmove_avx_unaligned_erms
                  + 14.86% __libc_write
      + 18.40% pqGetCopyData3

> It seems to me that the interesting cases may involve having lots of
> available CPUs and lots of disk spindles, but a comparatively slow
> pipe between the machines.

Hm, I'm not sure I am following. If the network is the bottleneck, we'd
immediately fill the buffers, and that'd be that?

ISTM all of this is only really relevant if either pg_basebackup or
walsender is the bottleneck?

> I mean, if it takes 36 hours to read the data from disk, you can't
> realistically expect to complete a full backup in less than 36 hours.
> Incremental backup might help, but otherwise you're just dead. On the
> other hand, if you can read the data from the disk in 2 hours but it
> takes 36 hours to complete a backup, it seems like you have more
> justification for thinking that the backup software could perhaps do
> better. In such cases efficient server-side compression may help a
> lot, but even then, I wonder whether you can read the data at maximum
> speed with only a single process? I tend to doubt it, but I guess you
> only have to be fast enough to saturate the network.

Hmm.
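As referenced above, here is a minimal sketch of a CRC-32C inner loop
built directly on the SSE4.2 crc32 instruction - roughly the approach
PostgreSQL's own src/port/pg_crc32c_sse42.c takes. The function name,
the byte-at-a-time tail handling and the lack of runtime CPU detection
are illustrative simplifications; this is not actual PostgreSQL or zlib
code:

/*
 * Sketch only: CRC-32C over a buffer using the SSE4.2 CRC32
 * instruction.  Compile with -msse4.2 on x86-64.  Uses the standard
 * CRC-32C convention: initial value ~0, final bit inversion.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <nmmintrin.h>

static uint32_t
crc32c_sse42(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t    crc = 0xFFFFFFFF;

    while (len >= 8)
    {
        uint64_t    chunk;

        memcpy(&chunk, p, 8);   /* avoid unaligned access */
        crc = _mm_crc32_u64(crc, chunk);
        p += 8;
        len -= 8;
    }
    while (len > 0)
    {
        crc = _mm_crc32_u8((uint32_t) crc, *p++);
        len--;
    }
    return (uint32_t) crc ^ 0xFFFFFFFF;
}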
Well, I can do >8GByte/s of buffered reads in a single process
(obviously cached, because I don't have storage quite that fast -
uncached I can read at nearly 3GByte/s, the disk's speed). So sure,
there's a limit to what a single process can do, but I think we're
fairly far away from it.

I think it's fairly obvious that we need faster compression - and that
while we clearly can win a lot by just using a faster
algorithm/implementation than standard zlib [1], we'll likely also need
parallelism in some form. I'm doubtful that using multiple connections
and multiple backends is the best way to achieve that, but it'd be a
way.

Greetings,

Andres Freund
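[1] To make "faster algorithm/implementation than standard zlib" a bit
more concrete, a minimal sketch using LZ4's one-shot API (assumes
liblz4 is installed; link with -llz4). It is purely illustrative of the
kind of alternative meant - pg_basebackup does not use lz4 today, and
the buffer contents here are made up:

#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>

int
main(void)
{
    /* stand-in for a chunk of backup data */
    const char  src[] = "example data that would normally come from the backup stream";
    int         src_len = (int) sizeof(src);
    int         bound = LZ4_compressBound(src_len);
    char       *dst = malloc(bound);
    int         dst_len;

    if (dst == NULL)
        return 1;

    /* one-shot compression; returns compressed size, or 0 on failure */
    dst_len = LZ4_compress_default(src, dst, src_len, bound);
    if (dst_len <= 0)
    {
        fprintf(stderr, "compression failed\n");
        free(dst);
        return 1;
    }

    printf("compressed %d bytes to %d bytes\n", src_len, dst_len);
    free(dst);
    return 0;
}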