Performance degradation on concurrent COPY into a single relation in PG16.

Masahiko Sawada Sun, 02 Jul 2023 19:56:15 -0700

Hi all,

While testing PG16, I observed a big performance degradation in
concurrent COPY into a single relation with 2 - 16 clients in my
environment. I've attached a test script that measures the execution
time (in seconds) of COPYing 5GB of data in total into a single
relation while varying the number of concurrent clients, on PG16 and
PG15. Here are the results in my environment (EC2 instance, RHEL 8.6,
128 vCPUs, 512GB RAM):


* PG15 (4b15868b69)
PG15: nclients = 1, execution time = 14.181
PG15: nclients = 2, execution time = 9.319
PG15: nclients = 4, execution time = 5.872
PG15: nclients = 8, execution time = 3.773
PG15: nclients = 16, execution time = 3.202
PG15: nclients = 32, execution time = 3.023
PG15: nclients = 64, execution time = 3.829
PG15: nclients = 128, execution time = 4.111
PG15: nclients = 256, execution time = 4.158

* PG16 (c24e9ef330)
PG16: nclients = 1, execution time = 17.112
PG16: nclients = 2, execution time = 14.084
PG16: nclients = 4, execution time = 27.997
PG16: nclients = 8, execution time = 10.554
PG16: nclients = 16, execution time = 7.074
PG16: nclients = 32, execution time = 4.607
PG16: nclients = 64, execution time = 2.093
PG16: nclients = 128, execution time = 2.141
PG16: nclients = 256, execution time = 2.202

PG16 scales better at high client counts (64 or more clients), but it
took much more time than PG15 with fewer clients, especially at 1 - 16
clients.
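
To put two such listings side by side, a small helper like the one below can
print the per-client-count ratio. This is only a sketch: it assumes each run's
output was saved to its own file, and the function name compare_runs and the
file names in the usage line are hypothetical, not part of the attached script.

```shell
# compare_runs BASELINE_LOG NEW_LOG
# Both logs are assumed to hold lines in the benchmark's output format, e.g.
#   PG15: nclients = 4, execution time = 5.872
# Prints NEW time divided by BASELINE time for each client count found in both.
compare_runs() {
    awk '
        /nclients = / {
            n = $4; sub(/,$/, "", n)    # client count, trailing comma removed
            t = $NF                     # execution time in seconds
            if (FNR == NR) { base[n] = t; next }   # first file: remember baseline
            if (n in base)
                printf "nclients = %s, ratio = %.2f\n", n, t / base[n]
        }
    ' "$1" "$2"
}
```

For example, `compare_runs pg15.log pg16.log` prints one line per client
count; a ratio above 1 means the second run was slower than the first.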

The relevant commit is 00d1e02be2 "hio: Use ExtendBufferedRelBy() to
extend tables more efficiently". With commit 1cbbee0338 (the commit
immediately before 00d1e02be2), I got better numbers, though without
the improved scalability:

PG16: nclients = 1, execution time = 17.444
PG16: nclients = 2, execution time = 10.690
PG16: nclients = 4, execution time = 7.010
PG16: nclients = 8, execution time = 4.282
PG16: nclients = 16, execution time = 3.373
PG16: nclients = 32, execution time = 3.205
PG16: nclients = 64, execution time = 3.705
PG16: nclients = 128, execution time = 4.196
PG16: nclients = 256, execution time = 4.201

While investigating the cause, I found something interesting: if
mdzeroextend() uses only one of FileFallocate() or FileZero(), we get
better numbers. For example, if I always use FileZero() with the
following change:

@@ -574,7 +574,7 @@ mdzeroextend(SMgrRelation reln, ForkNumber forknum,
         * that decision should be made though? For now just use a cutoff of
         * 8, anything between 4 and 8 worked OK in some local testing.
         */
-       if (numblocks > 8)
+       if (false)
        {
            int         ret;

I got:

PG16: nclients = 1, execution time = 16.898
PG16: nclients = 2, execution time = 8.740
PG16: nclients = 4, execution time = 4.656
PG16: nclients = 8, execution time = 2.733
PG16: nclients = 16, execution time = 2.021
PG16: nclients = 32, execution time = 1.693
PG16: nclients = 64, execution time = 1.742
PG16: nclients = 128, execution time = 2.180
PG16: nclients = 256, execution time = 2.296

After further investigation, the performance degradation comes from
calling posix_fallocate() (via FileFallocate()) and pwritev() (via
FileZero()) alternately, depending on how many blocks we extend by.
It happens only on the xfs filesystem. Has anyone else observed a
similar performance issue with the attached benchmark script?
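
Since the behavior depends on the filesystem, it helps to report which
filesystem backs the data directory when sharing results. A minimal sketch,
assuming GNU coreutils stat (the -f and -c options behave differently on
BSD/macOS); the helper name fs_type is hypothetical:

```shell
# fs_type DIR: print the type of the filesystem that holds DIR.
# Relies on GNU coreutils stat: -f queries the filesystem, %T is its type name.
fs_type() {
    stat -f -c %T "$1"
}

# Check the PG16 data directory if the variable is set, else the current directory.
fs_type "${PG16_PGDATA:-.}"
```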

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

PG15_PGDATA="/home/masahiko/pgsql/15.s/data"
PG16_PGDATA="/home/masahiko/pgsql/16.s/data"
PG15_BIN="/home/masahiko/pgsql/15.s/bin"
PG16_BIN="/home/masahiko/pgsql/16.s/bin"

ROWS=$((100 * 1000 * 1000))
CLIENTS="1 2 4 8 16 32 64 128 256"

${PG15_BIN}/pg_ctl stop -D ${PG15_PGDATA} -mi
${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA} -mi
rm -rf $PG15_PGDATA $PG16_PGDATA

echo "initializing clusters ..."
${PG15_BIN}/initdb -D $PG15_PGDATA -E UTF8 --no-locale
${PG16_BIN}/initdb -D $PG16_PGDATA -E UTF8 --no-locale

cat <<EOF >> ${PG15_PGDATA}/postgresql.conf
port = 5515
max_wal_size = 50GB
shared_buffers = 20GB
max_connections = 500
EOF
cat <<EOF >> ${PG16_PGDATA}/postgresql.conf
port = 5516
max_wal_size = 50GB
shared_buffers = 20GB
max_connections = 500
EOF

if [ "$1" != "skip_file_init" ]; then
    echo "prepare load files..."
    ${PG16_BIN}/pg_ctl start -D ${PG16_PGDATA}
    for c in $CLIENTS
    do
	rm -f /tmp/tmp_${c}.data
	${PG16_BIN}/psql -d postgres -p 5516 -X -c "copy (select generate_series(1, $ROWS / $c)) to '/tmp/tmp_${c}.data'"
    done
    ${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA}
fi

echo "start benchmark ..."
#for version in PG15 PG16
for version in PG16
do
    PSQL=""
    if [ "$version" == "PG15" ]; then
	PSQL="${PG15_BIN}/psql -p 5515 -d postgres"
	${PG15_BIN}/pg_ctl start -D ${PG15_PGDATA}
    else
	PSQL="${PG16_BIN}/psql -p 5516 -d postgres"
	${PG16_BIN}/pg_ctl start -D ${PG16_PGDATA}
    fi

    ${PSQL} -c "create unlogged table test (c int) with (autovacuum_enabled = off)"

    for c in $CLIENTS
    do
	${PSQL} -c "truncate test" > /dev/null 2>&1

	children=()
	start=`date +%s.%3N`
	for i in `seq 1 $c`
	do
	    ${PSQL} -c "copy test from '/tmp/tmp_${c}.data'" > /dev/null 2>&1 &
	    children+=($!)
	done
	wait ${children[@]}
	end=`date +%s.%3N`

	exec_time=$(echo "scale=3; $end - $start" | bc)
	echo "$version: nclients = $c, execution time = $exec_time"
    done

    if [ "$version" == "PG15" ]; then
	${PG15_BIN}/pg_ctl stop -D ${PG15_PGDATA} -mi
    else
	${PG16_BIN}/pg_ctl stop -D ${PG16_PGDATA} -mi
    fi
done
