Le jeudi 28 octobre 2021, 14:31:36 CEST Michael Paquier a écrit :
> On Wed, Oct 27, 2021 at 10:11:00AM +0200, Ronan Dunklau wrote:
> > Sorry I sent an intermediary version of the patch, here is the correct
> > one.
>
> While looking at this patch, I have figured out a simple way to make
> the tests faster without impacting the coverage. The size of the WAL
> segments archived is a bit of a bottleneck as they need to be zeroed
> by pg_receivewal at creation. This finishes by being a waste of time,
> even if we don't flush the data. So I'd like to switch the test so as
> we use a segment size of 1MB, first.
>
> A second thing is that we copy too many segments than really needed
> when using the slot's restart_lsn as starting point as RESERVE_WAL
> would use the current redo location, so it seems to me that a
> checkpoint is in order before the slot creation. A third thing is
> that generating some extra data after the end LSN we want to use makes
> the test much faster at the end.
>
> With those three methods combined, the test goes down from 11s to 9s
> here. Attached is a patch I'd like to apply to make the test
> cheaper.
Interesting ideas, thanks. For the record, the time drops from ~4.5s to 3s on
average on my machine.
I think if you reduce the size of the generate_series batches, this should
probably be reduced everywhere. With what we do though, inserting a single
line should work just as well, I wonder why we insist on inserting a hundred
lines ? I updated your patch with that small modification, it also makes the
code less verbose.
>
> I also had a look at your patch. Here are some comments.
>
> +# Cleanup the previous stream directories to reuse them
> +unlink glob "'${stream_dir}/*'";
> +unlink glob "'${slot_dir}/*'";
> I think that we'd better choose a different location for the
> archives. Keeping the segments of the previous tests is useful for
> debugging if a previous step of the test failed.
Ok.
>
> +$standby->psql('',
> + "CREATE_REPLICATION_SLOT $folder_slot PHYSICAL (RESERVE_WAL)",
> + replication => 1);
> Here as well we could use a restart point to reduce the number of
> segments archived.
The restart point should be very close, as we don't generate any activity on
the primary between the backup and the slot's creation. I'm not sure adding
the complexity of triggering a checkpoint on the primary and waiting for the
standby to catch up on it would be that useful.
>
> +# Now, try to resume after the promotion, from the folder.
> +$standby->command_ok(
> + [ 'pg_receivewal', '-D', $stream_dir, '--verbose', '--endpos', $nextlsn,
> + '--slot', $folder_slot, '--no-sync'],
> + "Stream some wal after promoting, resuming from the folder's position");
> What is called the "resume-from-folder" test in the patch is IMO
> too costly and brings little extra value, requiring two commands of
> pg_receivewal (one to populate the folder and one to test the TLI jump
> from the previously-populated point) to test basically the same thing
> as when the starting point is taken from the slot. Except that
> restarting from the slot is twice cheaper. The point of the
> resume-from-folder case is to make sure that we restart from the point
> of the archive folder rather than the slot's restart_lsn, but your
> test fails this part, in fact, because the first command populating
> the archive folder also uses "--slot $folder_slot", updating the
> slot's restart_lsn before the second pg_receivewal uses this slot
> again.
>
> I think, as a whole, that testing for the case where an archive folder
> is populated (without a slot!), followed by a second command where we
> use a slot that has a restart_lsn older than the archive's location,
> to not be that interesting. If we include such a test, there is no
> need to include that within the TLI jump part, in my opinion. So I
> think that we had better drop this part of the patch, and keep only
> the case where we resume from a slot for the TLI jump.
You're right about the test not being that interesting in that case. I thought
it would be worthwhile to test both cases, but resume_from_folder doesn't
actually exercise any code that was not already called before (timeline
computation from segment name).
> The commands of pg_receivewal included in the test had better use -n
> so as there is no infinite loop on failure.
Ok.
--
Ronan Dunklau
>From ad09605b6f8b2d1cc91e67a255bf469363cad2b3 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau <ronan.dunk...@aiven.io>
Date: Thu, 28 Oct 2021 15:12:04 +0200
Subject: [PATCH v16 2/2] Speed up pg_receivewal tests.
Author: Michael Paquier
---
src/bin/pg_basebackup/t/020_pg_receivewal.pl | 21 ++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
index e0422e2242..1691054f5c 100644
--- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl
+++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
@@ -15,7 +15,7 @@ program_options_handling_ok('pg_receivewal');
umask(0077);
my $primary = PostgreSQL::Test::Cluster->new('primary');
-$primary->init(allows_streaming => 1);
+$primary->init(allows_streaming => 1, extra => ['--wal-segsize=1']);
$primary->start;
my $stream_dir = $primary->basedir . '/archive_wal';
@@ -55,8 +55,7 @@ $primary->psql('postgres', 'SELECT pg_switch_wal();');
my $nextlsn =
$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
chomp($nextlsn);
-$primary->psql('postgres',
- 'INSERT INTO test_table VALUES (generate_series(1,100));');
+$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);');
# Stream up to the given position. This is necessary to have a fixed
# started point for the next commands done in this test, with or without
@@ -85,8 +84,7 @@ SKIP:
$nextlsn =
$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
chomp($nextlsn);
- $primary->psql('postgres',
- 'INSERT INTO test_table VALUES (generate_series(100,200));');
+ $primary->psql('postgres', 'INSERT INTO test_table VALUES (1);');
# Note the trailing whitespace after the value of --compress, that is
# a valid value.
@@ -136,8 +134,7 @@ $primary->psql('postgres', 'SELECT pg_switch_wal();');
$nextlsn =
$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
chomp($nextlsn);
-$primary->psql('postgres',
- 'INSERT INTO test_table VALUES (generate_series(200,300));');
+$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);');
$primary->command_ok(
[
'pg_receivewal', '-D', $stream_dir, '--verbose',
@@ -167,7 +164,9 @@ mkdir($slot_dir);
$slot_name = 'archive_slot';
# Setup the slot, reserving WAL at creation (corresponding to the
-# last redo LSN here, actually).
+# last redo LSN here, actually, so use a checkpoint to reduce the
+# number of segments archived).
+$primary->psql('postgres', 'checkpoint;');
$primary->psql('postgres',
"SELECT pg_create_physical_replication_slot('$slot_name', true);");
@@ -181,13 +180,15 @@ my $walfile_streamed = $primary->safe_psql(
# Switch to a new segment, to make sure that the segment retained by the
# slot is still streamed. This may not be necessary, but play it safe.
-$primary->psql('postgres',
- 'INSERT INTO test_table VALUES (generate_series(1,100));');
+$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);');
$primary->psql('postgres', 'SELECT pg_switch_wal();');
$nextlsn =
$primary->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
chomp($nextlsn);
+# Add a bit more data to accelerate the end of the next commands.
+$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);');
+
# Check case where the slot does not exist.
$primary->command_fails_like(
[
--
2.33.1
>From 957177881692b772a1415316bcf0babcc7958e09 Mon Sep 17 00:00:00 2001
From: Ronan Dunklau <ronan.dunk...@aiven.io>
Date: Tue, 26 Oct 2021 10:54:12 +0200
Subject: [PATCH v16 1/2] Add a test for pg_receivewal following timeline
switch.
pg_receivewal is able to follow a timeline switch, but this was not
tested anywher. This test case verify that it works as expected both
when resuming from a replication slot and from the archive directory,
which have different methods of retrieving the timeline it left off.
---
src/bin/pg_basebackup/t/020_pg_receivewal.pl | 58 +++++++++++++++++++-
1 file changed, 57 insertions(+), 1 deletion(-)
diff --git a/src/bin/pg_basebackup/t/020_pg_receivewal.pl b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
index 092c9b6f25..e0422e2242 100644
--- a/src/bin/pg_basebackup/t/020_pg_receivewal.pl
+++ b/src/bin/pg_basebackup/t/020_pg_receivewal.pl
@@ -5,7 +5,7 @@ use strict;
use warnings;
use PostgreSQL::Test::Utils;
use PostgreSQL::Test::Cluster;
-use Test::More tests => 31;
+use Test::More tests => 35;
program_help_ok('pg_receivewal');
program_version_ok('pg_receivewal');
@@ -206,3 +206,59 @@ $primary->command_ok(
"WAL streamed from the slot's restart_lsn");
ok(-e "$slot_dir/$walfile_streamed",
"WAL from the slot's restart_lsn has been archived");
+
+# Test a timeline switch.
+# This test is split in two, using the same standby: one test check the
+# resume-from-folder case, the other the resume-from-slot one.
+
+# Setup a standby for our tests
+my $backup_name = "basebackup";
+$primary->backup($backup_name);
+my $standby = PostgreSQL::Test::Cluster->new("standby");
+$standby->init_from_backup($primary, $backup_name, has_streaming => 1);
+$standby->start;
+
+# Create a replication slot on this new standby
+my $archive_slot = "archive_slot";
+$standby->psql('',
+ "CREATE_REPLICATION_SLOT $archive_slot PHYSICAL (RESERVE_WAL)",
+ replication => 1);
+# Get a walfilename from before the promotion to make sure it is archived
+# after promotion
+my $walfile_before_promotion = $primary->safe_psql('postgres',
+ "SELECT pg_walfile_name(pg_current_wal_insert_lsn())");
+# Switch wal to make sure it is not a partial file but a complete segment.
+$primary->psql('postgres', 'INSERT INTO test_table VALUES (1);');
+$primary->psql('postgres', 'SELECT pg_switch_wal();');
+$primary->wait_for_catchup($standby, 'replay', $primary->lsn('write'));
+
+# Everything is setup, promote the standby to trigger a timeline switch
+$standby->psql(
+ 'postgres',
+ "SELECT pg_promote(wait_seconds => 300)");
+
+# Force a wal switch to make sure at least one full WAL is archived on the new
+# timeline, and fetch this walfilename.
+my $walfile_after_promotion = $standby->safe_psql('postgres',
+ "SELECT pg_walfile_name(pg_current_wal_insert_lsn())");
+$standby->psql('postgres', 'INSERT INTO test_table VALUES (1);');
+$standby->psql('postgres', 'SELECT pg_switch_wal();');
+$nextlsn =
+ $standby->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+chomp($nextlsn);
+$standby->psql('postgres', 'INSERT INTO test_table VALUES (1);');
+
+# Now try to resume from the slot after the promotion.
+my $timeline_dir = $primary->basedir . '/timeline_wal';
+mkdir($timeline_dir);
+
+$standby->command_ok(
+ [ 'pg_receivewal', '-D', $timeline_dir, '--verbose', '--endpos', $nextlsn,
+ '--slot', $archive_slot, '--no-sync', '-n'],
+ "Stream some wal after promoting, resuming from the slot's position");
+ok(-e "$timeline_dir/$walfile_before_promotion",
+ "WAL from the old timeline has been archived resuming from the slot");
+ok(-e "$timeline_dir/$walfile_after_promotion",
+ "WAL from the new timeline has been archived resuming from the slot");
+ok(-e "$timeline_dir/00000002.history",
+ "Timeline history file has been archived resuming from the slot");
--
2.33.1