Re: A problem about partitionwise join

2022-04-25 Thread Richard Guo
On Mon, Nov 22, 2021 at 3:04 PM Richard Guo  wrote:

>
> The suggested changes have already been included in v5 patch. Sorry for
> the confusion.
>
> Verified that the patch still applies and works on latest master. So I'm
> moving it to the next CF (which is Commitfest 2022-01). Please correct
> me if this is not the right thing to do.
>

Rebased the patch on the latest master. Appreciate any comments.

Thanks
Richard


v6-0001-Fix-up-partitionwise-join.patch
Description: Binary data


variable filename for psql \copy

2022-04-25 Thread Jiří Fejfar
Hi all,

I have found what may be buggy behaviour (of the psql parser?) when using psql \copy
with a psql variable used for the filename.

SQL COPY works fine:

contrib_regression=# \set afile '/writable_dir/out.csv'
contrib_regression=# select :'afile' as filename;
   filename
---
 /writable_dir/out.csv
(1 row)

contrib_regression=# copy (select 1) to :'afile';
COPY 1

but psql \copy returns an error:

contrib_regression=# \copy (select 1) to :'afile';
ERROR:  syntax error at or near "'afile'"
LINE 1: COPY  ( select 1 ) TO STDOUT 'afile';
 ^
When used without quotes it works, but it creates a file named ':afile' in the
current directory:

contrib_regression=# \copy (select 1) to :afile;
COPY 1

vagrant@nfiesta_dev_pg:~/npg$ cat :afile
1

A workaround (suggested by Pavel Stěhule) is:

contrib_regression=# \set afile '/writable_dir/out2.csv'
contrib_regression=# \set cmd '\\copy (SELECT 1) to ':afile
contrib_regression=# :cmd
COPY 1

My PG version:

contrib_regression=# select version();
version
---
 PostgreSQL 12.10 (Debian 12.10-1.pgdg110+1) on x86_64-pc-linux-gnu,
compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

Best regards, Jiří Fejfar.


RE: Perform streaming logical transactions by background workers and parallel apply

2022-04-25 Thread houzj.f...@fujitsu.com
On Friday, April 22, 2022 12:12 PM Peter Smith  wrote:
> 
> Hello Hou-san. Here are my review comments for v4-0001. Sorry, there
> are so many of them (it is a big patch); some are trivial, and others
> you might easily dismiss due to my misunderstanding of the code. But
> hopefully, there are at least some comments that can be helpful in
> improving the patch quality.

Thanks for the comments!
I think most of the comments make sense; here are explanations for
some of them.

> 24. src/backend/replication/logical/launcher.c - ApplyLauncherMain
> 
> @@ -869,7 +917,7 @@ ApplyLauncherMain(Datum main_arg)
>   wait_time = wal_retrieve_retry_interval;
> 
>   logicalrep_worker_launch(sub->dbid, sub->oid, sub->name,
> - sub->owner, InvalidOid);
> + sub->owner, InvalidOid, DSM_HANDLE_INVALID);
>   }
> Now that the logicalrep_worker_launch is returning a bool, should this
> call be checking the return value and taking appropriate action if it
> failed?

I'm not sure we should change the logic of the existing caller. I think only the new
caller added in the patch needs to check this.


> 26. src/backend/replication/logical/origin.c - acquire code
> 
> + /*
> + * We allow the apply worker to get the slot which is acquired by its
> + * leader process.
> + */
> + else if (curstate->acquired_by != 0 && acquire)
>   {
>   ereport(ERROR,
> 
> I somehow felt that this param would be better called 'skip_acquire',
> so all the callers would have to use the opposite boolean and then
> this code would say like below (which seemed easier to me). YMMV.
> 
> else if (curstate->acquired_by != 0 && !skip_acquire)
>   {
>   ereport(ERROR,

Not sure about this.


> 59. src/backend/replication/logical/worker.c - ApplyWorkerMain
> 
> @@ -3733,7 +4292,7 @@ ApplyWorkerMain(Datum main_arg)
> 
>   options.proto.logical.publication_names = MySubscription->publications;
>   options.proto.logical.binary = MySubscription->binary;
> - options.proto.logical.streaming = MySubscription->stream;
> + options.proto.logical.streaming = (MySubscription->stream != SUBSTREAM_OFF);
>   options.proto.logical.twophase = false;
>
> I was not sure why this is converting from an enum to a boolean? Is it right?

I think it's OK; "logical.streaming" is used by the publisher, which doesn't need
to know the exact type of the streaming (it only needs to know whether
streaming is enabled, for now).


> 63. src/backend/replication/logical/worker.c - ApplyBgwShutdown
> 
> +static void
> +ApplyBgwShutdown(int code, Datum arg)
> +{
> + SpinLockAcquire(&MyParallelState->mutex);
> + MyParallelState->failed = true;
> + SpinLockRelease(&MyParallelState->mutex);
> +
> + dsm_detach((dsm_segment *) DatumGetPointer(arg));
> +}
> 
> Should this do detach first and set the flag last?

Not sure about this. I think it's fine to detach this at the end.

> 76. src/backend/replication/logical/worker.c - check_workers_status
> 
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("Background worker %u exited unexpectedly",
> + wstate->pstate->n)));
> 
> Should that message also give more identifying info about the
> *current* worker doing the ERROR - e.g. the one which found that the
> other bgworker had failed? Or is the PID in the log message
> good enough?

Currently, only the main apply worker should report this error, so I'm not sure
we need to report the current worker.

> 77. src/backend/replication/logical/worker.c - check_workers_status
> 
> + if (!AllTablesyncsReady() && nfreeworkers != list_length(ApplyWorkersList))
> + {
> 
> I did not really understand this code, but isn't there a possibility
> that it will cause many restarts if the tablesyncs are taking a long
> time to complete?

I think it's OK; after restarting, we won't start a bgworker until all the tables
are READY.

Best regards,
Hou zj






Re: variable filename for psql \copy

2022-04-25 Thread Daniel Verite
Jiří Fejfar wrote:

> I have found maybe buggy behaviour (of psql parser?) when using psql \copy
> with psql variable used for filename.

While it's annoying that it doesn't work as you tried it, this behavior is 
documented, so in that sense it's not a bug.
The doc also suggests a workaround in a tip section:

From psql manpage:

  The syntax of this command is similar to that of the SQL COPY
   command. All options other than the data source/destination are as
   specified for COPY. Because of this, special parsing rules apply to
   the \copy meta-command. Unlike most other meta-commands, the entire
   remainder of the line is always taken to be the arguments of \copy,
   and neither variable interpolation nor backquote expansion are
   performed in the arguments.

   Tip
   Another way to obtain the same result as \copy ... to is to use
   the SQL COPY ... TO STDOUT command and terminate it with \g
   filename or \g |program. Unlike \copy, this method allows the
   command to span multiple lines; also, variable interpolation
   and backquote expansion can be used.
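
For example, applied to your case the documented tip works like this (just a
sketch; the path is only an example):

\set afile '/writable_dir/out.csv'
COPY (SELECT 1) TO STDOUT \g :afile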


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread Bharath Rupireddy
On Sat, Apr 23, 2022 at 4:21 AM David Christensen
 wrote:
>
> Hi -hackers,
>
> Enclosed is a patch to allow extraction/saving of FPI from the WAL
> stream via pg_waldump.
>
> Description from the commit:
>
> Extracts full-page images from the WAL stream into a target directory,
> which must be empty or not
> exist.  These images are subject to the same filtering rules as normal
> display in pg_waldump, which
> means that you can isolate the full page writes to a target relation,
> among other things.
>
> Files are saved with a filename built from the block's LSN and identity, with
> formatting to make things
> somewhat sortable; for instance:
>
> -01C0.1663.1.6117.0
> -01000150.1664.0.6115.0
> -010001E0.1664.0.6114.0
> -01000270.1663.1.6116.0
> -01000300.1663.1.6113.0
> -01000390.1663.1.6112.0
> -01000420.1663.1.8903.0
> -010004B0.1663.1.8902.0
> -01000540.1663.1.6111.0
> -010005D0.1663.1.6110.0
>
> It's noteworthy that the raw images do not have the current LSN stored
> with them in the WAL
> stream (as would be true for on-heap versions of the blocks), nor
> would the checksum be valid in
> them (though WAL itself has checksums, so there is some protection
> there).  This patch chooses to
> place the LSN and calculate the proper checksum (if non-zero in the
> source image) in the outputted
> block.  (This could perhaps be a targeted flag if we decide we don't
> always want this.)
>
> These images could be loaded/inspected via `pg_read_binary_file()` and
> used in the `pageinspect`
> suite of tools to perform detailed analysis on the pages in question,
> based on historical
> information, and may come in handy for forensics work.

Thanks for working on this. I'm just wondering if we can use these FPIs
to repair corrupted pages. I would like to understand more
detailed usages of the FPIs other than inspecting them with pageinspect.
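
For context, the pageinspect route mentioned above would look something like
this (a sketch; the saved-image path below is only an example):

-- requires the pageinspect extension and read access to the saved file
SELECT lsn, checksum, lower, upper
FROM page_header(pg_read_binary_file('/tmp/fpi_out/00000001-01000150.1664.0.6115.0'));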

Given that others have realistic use-cases (of course I would like to
know more about those), +1 for the idea. However, I would suggest
adding a function to extract raw FPI data to the pg_walinspect
extension that got recently committed in PG 15, the output of which can
directly be fed to pageinspect functions or

Few comments:
1) I think it's good to mention the stored file name format.
+ printf(_("  -W, --raw-fpi=path save found full page images to
given path\n"));
2)
+ for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
+ {
+ /* we will now extract the fullpage image from the XLogRecord and save
+ * it to a calculated filename */
+
+ if (XLogRecHasBlockImage(record, block_id))

I think we need XLogRecHasBlockRef to be true before checking
XLogRecHasBlockImage; otherwise we will see some buildfarm members failing.
I've recently seen this failure for pg_walinspect:

for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
{
if (!XLogRecHasBlockRef(record, block_id))
continue;

if (XLogRecHasBlockImage(record, block_id))
*fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
}
3) Please correct the commenting format:
+ /* we will now extract the fullpage image from the XLogRecord and save
+ * it to a calculated filename */
4) Usually we start error messages with lower-case letters, e.g. "could not ...".
+ pg_fatal("Couldn't open file for output: %s", filename);
+ pg_fatal("Couldn't write out complete FPI to file: %s", filename);
And the variable name too:
+ FILE *OPF;
5) Not sure how the FPIs of TOASTed tables get stored, but it would be
good to check.
6) Good to specify the known usages of FPIs in the documentation.
7) Isn't it good to emit an error if RestoreBlockImage returns false?
+ if (RestoreBlockImage(record, block_id, page))
+ {
8) I think I don't mind if a non-empty directory is specified - IMO
better usability is this: if the directory is non-empty, just add
the FPI files there (if an FPI file already exists, just replace it); if the
directory doesn't exist, create it and write the FPI files.
+ /* we accept an empty existing directory */
+ if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
+ {
9) Instead of the following:
+ if (XLogRecordHasFPW(xlogreader_state))
+ XLogRecordSaveFPWs(xlogreader_state, config.save_fpw_path);
I will just do this in XLogRecordSaveFPWs:
for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
{
if (!XLogRecHasBlockRef(record, block_id))
continue;

if (XLogRecHasBlockImage(record, block_id))
{

}
}
10) Along with pg_pwrite(), can we also fsync the files (optionally,
at the user's choice) so that the writes will be durable across
OS crashes?

Regards,
Bharath Rupireddy.




Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory

2022-04-25 Thread Bharath Rupireddy
On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier  wrote:
>
> On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote:
> > Right. We find enough disk space and go to write and suddenly the
> > write operations fail for some reason or the VM crashes because of a
> > reason other than disk space. I think the foolproof solution is to
> > figure out the available disk space before prepadding or compressing
> > and also use the
> > write-first-to-temp-file-and-then-rename-it-to-original-file as
> > proposed in the earlier patches in this thread.
>
> Yes, what would count here is only the amount of free space in a
> partition.  The total amount of space available becomes handy once you
> begin introducing things like percentage-based quota policies for the
> disk when archiving.  The free amount of space could be used to define
> a policy based on the maximum number of bytes you need to leave
> around, as well, but this is not perfect science as this depends on
> what FSes decide to do underneath.  There are a couple of designs
> possible here.  When I had to deal with my upthread case I have chosen
> one as I had no need to worry only about Linux, it does not mean that
> this is the best choice that would fit with the long-term community
> picture.  This comes down to how much pg_receivewal should handle
> automatically, and how it should handle it.

Thanks. I'm not sure why we are just thinking of crashes due to
out-of-disk space. Figuring out free disk space before writing a huge
file (say a WAL file) is a problem in itself for core postgres as
well, not just pg_receivewal.

I think we are a bit off-track here. Let me illustrate what the
whole problem is, and the idea:

If the node/VM on which pg_receivewal runs goes down/crashes, or a write
operation fails while padding the target WAL file (the .partial
file) with zeros, the unfilled target WAL file (let me call this file
a partially padded .partial file) will be left over, and subsequent
reads/writes to it will fail with a "write-ahead log file \"%s\"
has %zd bytes, should be 0 or %d" error, which requires manual
intervention to remove the file. In a service, this manual intervention is
what we would like to avoid. Let's not bother much right now about
compressed file writes (for now at least) as they don't have a
prepadding phase.

The proposed solution is to make the prepadding atomic: prepad the
file under a .partial.tmp name and, after the prepadding, rename it
(durably, if the sync option is chosen for pg_receivewal) to
.partial. Before prepadding .partial.tmp, delete it if it
already exists.

The above problem isn't unique to pg_receivewal alone; pg_basebackup
too uses CreateWalDirectoryMethod and dir_open_for_write via
ReceiveXlogStream.

IMHO, pg_receivewal checking for available disk space before writing
any file is better discussed separately.

Regards,
Bharath Rupireddy.




Re: tweak to a few index tests to hits ambuildempty() routine.

2022-04-25 Thread Alvaro Herrera
On 2021-Nov-29, Amul Sul wrote:

> Attached patch is doing small changes to brin, gin & gist index tests
> to use an unlogged table without changing the original intention of
> those tests and that is able to hit the ambuildempty() routine, which is
> otherwise not reachable by the current tests.

I added one change to include spgist too, which was uncovered, and
pushed this.

Thanks for the patch!

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)




Re: [PATCH] Compression dictionaries for JSONB

2022-04-25 Thread Aleksander Alekseev
Hi Zhihong,

Many thanks for your feedback!

> For src/backend/catalog/pg_dict.c, please add license header.

Fixed.

> +   elog(ERROR, "skipbytes > decoded_size - outoffset");
>
> Include the values for skipbytes, decoded_size and outoffset.

In fact, this code should never be executed, and if somehow it is,
this information will not help us much in debugging the issue. I made
corresponding changes to the error message and added comments.

Here is the 2nd version of the patch:

- Includes changes named above
- Fixes a warning reported by cfbot
- Fixes some FIXME's
- The patch includes some simple tests now
- A proper commit message was added

Please note that this is still a draft. Feedback is welcome.

-- 
Best regards,
Aleksander Alekseev


v2-0001-Compression-dictionaries-for-JSONB.patch
Description: Binary data


Re: tweak to a few index tests to hits ambuildempty() routine.

2022-04-25 Thread Alvaro Herrera
On 2022-Apr-25, Alvaro Herrera wrote:

> I added one change to include spgist too, which was uncovered, and
> pushed this.

Looking into the recoveryCheck failure in buildfarm.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: tweak to a few index tests to hits ambuildempty() routine.

2022-04-25 Thread Alvaro Herrera
On 2022-Apr-25, Alvaro Herrera wrote:

> On 2022-Apr-25, Alvaro Herrera wrote:
> 
> > I added one change to include spgist too, which was uncovered, and
> > pushed this.
> 
> Looking into the recoveryCheck failure in buildfarm.

Hmm, so 027_stream_regress.pl is not prepared to deal with any unlogged
tables that may be left in the regression database (which is what my
spgist addition did).  I first tried doing a TRUNCATE of the unlogged
table, but that doesn't work either, and it turns out that the
regression database does not have any UNLOGGED relations.  Maybe that's
something we need to cater for, eventually, but for now dropping the
table suffices.  I have pushed that.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru




Re: tweak to a few index tests to hits ambuildempty() routine.

2022-04-25 Thread Amul Sul
On Mon, Apr 25, 2022 at 7:23 PM Alvaro Herrera  wrote:
>
> On 2022-Apr-25, Alvaro Herrera wrote:
>
> > On 2022-Apr-25, Alvaro Herrera wrote:
> >
> > > I added one change to include spgist too, which was uncovered, and
> > > pushed this.
> >

Thanks for the commit with the improvement.

Regards,
Amul




Re: tweak to a few index tests to hits ambuildempty() routine.

2022-04-25 Thread Tom Lane
Alvaro Herrera  writes:
> Hmm, so 027_stream_regress.pl is not prepared to deal with any unlogged
> tables that may be left in the regression database (which is what my
> spgist addition did).  I first tried doing a TRUNCATE of the unlogged
> table, but that doesn't work either, and it turns out that the
> regression database does not have any UNLOGGED relations.  Maybe that's
> something we need to cater for, eventually, but for now dropping the
> table suffices.  I have pushed that.

It does seem like the onus should be on 027_stream_regress.pl to
deal with that, rather than restricting what the core tests can
leave behind.

Maybe we could have it look for unlogged tables and drop them
before making the dumps?  Although I don't understand why
TRUNCATE wouldn't do the job equally well.
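
Something like this would find them (just a sketch; other relkinds such as
unlogged matviews might need the same treatment):

SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relpersistence = 'u' AND c.relkind = 'r';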

regards, tom lane




Re: json_object returning jsonb reuslt different from returning json, returning text

2022-04-25 Thread Andrew Dunstan

On 2022-04-25 Mo 01:19, alias wrote:
>
> seems it's a bug around value 0.
>
> SELECT JSON_OBJECTAGG(k: v ABSENT ON NULL WITH UNIQUE KEYS RETURNING
> jsonb)
> FROM (VALUES (1, 1), (10, NULL),(4, null), (5, null),(6, null),(2, 2))
> foo(k, v);
> return:
> {"1": 1, "2": 2}
>
> SELECT JSON_OBJECTAGG(k: v ABSENT ON NULL WITH UNIQUE KEYS RETURNING
> jsonb)
> FROM (VALUES (1, 1), (0, NULL),(4, null), (5, null),(6, null),(2, 2))
> foo(k, v);
>
> return
>  {"0": null, "1": 1, "2": 2}


Thanks for the report.

I don't think there's anything special about '0' except that it sorts
first. There appears to be a bug in the uniquefying code where the first
item(s) have nulls. The attached appears to fix it. Please test and see
if you can break it.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index aa151a53d6..21d874c098 100644
--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -1959,8 +1959,18 @@ uniqueifyJsonbObject(JsonbValue *object, bool unique_keys, bool skip_nulls)
 
 	if (hasNonUniq || skip_nulls)
 	{
-		JsonbPair  *ptr = object->val.object.pairs + 1,
-   *res = object->val.object.pairs;
+		JsonbPair  *ptr, *res;
+
+		while (skip_nulls && object->val.object.nPairs > 0 &&
+			   object->val.object.pairs->value.type == jbvNull)
+		{
+			/* If skip_nulls is true, remove leading items with null */
+			object->val.object.pairs++;
+			object->val.object.nPairs--;
+		}
+
+		ptr = object->val.object.pairs + 1;
+		res = object->val.object.pairs;
 
 		while (ptr - object->val.object.pairs < object->val.object.nPairs)
 		{


An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-04-25 Thread Bharath Rupireddy
Hi,

With synchronous replication typically all the transactions (txns)
first locally get committed, then streamed to the sync standbys and
the backend that generated the transaction will wait for ack from sync
standbys. While waiting for ack, it may happen that the query or the
txn gets canceled (QueryCancelPending is true) or the waiting backend
is asked to exit (ProcDiePending is true). In either of these cases,
the wait for ack gets canceled and leaves the txn in an inconsistent
state (as in the client thinks that the txn would have replicated to
sync standbys) - "The transaction has already committed locally, but
might not have been replicated to the standby.". Upon restart after
the crash or in the next txn after the old locally committed txn was
canceled, the client will be able to see the txns that weren't
actually streamed to sync standbys. Also, if the client fails over to
one of the sync standbys after the crash (either by choice or because
of automatic failover management after crash), the locally committed
txns on the crashed primary would be lost which isn't good in a true
HA solution.

Here's a proposal (mentioned previously by Satya [1]) to avoid the
above problems:
1) Wait a configurable amount of time before canceling the sync
replication wait in the backends, i.e. delay processing of
QueryCancelPending and ProcDiePending. A new timeout GUC
synchronous_replication_naptime_before_cancel is introduced; when set, it will let
the backends wait for the ack before canceling the synchronous
replication wait so that the transaction can be available in sync standbys
as well. If the ack isn't received even within this time frame, the
backend cancels the wait and goes ahead as it does today. In
production HA environments, the GUC can be set to a reasonable value
to avoid missing transactions during failovers.
2) Wait for sync standbys to catch up upon restart after the crash or
in the next txn after the old locally committed txn was canceled. One
way to achieve this is to let the backend that's making the first
connection wait for sync standbys to catch up in ClientAuthentication
right after successful authentication. However, I'm not sure this is
the best way to do it at this point.
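
For illustration, enabling the behavior proposed in (1) would look something
like this (a sketch only - the GUC exists only with this patch applied, and
the value is arbitrary):

ALTER SYSTEM SET synchronous_replication_naptime_before_cancel = 5000;  -- milliseconds
SELECT pg_reload_conf();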

Thoughts?

Here's a WIP patch implementing (1); I'm yet to code (2). I
haven't added tests; I'm yet to figure out how to add one, as there's
no way we can delay the WAL sender so that we can reliably hit this
code. I will think more about this.

[1] 
https://www.postgresql.org/message-id/CAHg%2BQDdTdPsqtu0QLG8rMg3Xo%3D6Xo23TwHPYsUgGNEK13wTY5g%40mail.gmail.com

Regards,
Bharath Rupireddy.
From d5fe07bbd80b72dfbf06e9b039b9e4a93a7f7a06 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 24 Apr 2022 03:42:59 +
Subject: [PATCH v1] Wait specified amount of time before cancelling sync
 replication

In PostgreSQL high availability setup with synchronous replication,
typically all the transactions first locally get committed, then
streamed to the synchronous standbys and the backends that generated
the transaction will wait for acknowledgement from synchronous
standbys. While waiting for acknowledgement, it may happen that the
query or the transaction gets canceled or the backend that's waiting
for acknowledgement is asked to exit. In either of these cases, the
wait for acknowledgement gets canceled and leaves transaction in an
inconsistent state as it got committed locally but not on the
standbys. When set the GUC synchronous_replication_naptime_before_cancel
introduced in this patch, it will let the backends wait for the
acknowledgement before canceling the wait for acknowledgement so
that the transaction can be available in synchronous standbys as
well.
---
 doc/src/sgml/config.sgml  | 30 +++
 src/backend/replication/syncrep.c | 50 +++
 src/backend/utils/misc/guc.c  | 12 +
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/replication/syncrep.h |  3 ++
 5 files changed, 97 insertions(+)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 03986946a8..1681ea173f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4507,6 +4507,36 @@ ANY num_sync ( 
+  synchronous_replication_naptime_before_cancel (integer)
+  
+   synchronous_replication_naptime_before_cancel configuration parameter
+  
+  
+  
+   
+Specifies the amount of time in milliseconds to wait for synchronous
+replication before cancelling. Default value is 0, a value of -1 or 0
+disables this feature. In PostgreSQL high
+availability setup with synchronous replication, typically all the
+transactions first locally get committed, then streamed to the
+synchronous standbys and the backends that generated the transaction
+will wait for acknowledgement from synchronous standbys. While waiting
+for acknowledgement, it may happen t

Re: Estimating HugePages Requirements?

2022-04-25 Thread Magnus Hagander
On Mon, Apr 25, 2022 at 2:15 AM Michael Paquier  wrote:

> On Fri, Apr 22, 2022 at 09:49:34AM +0200, Magnus Hagander wrote:
> > I agree that thats a very narrow use case. And I'm not sure the use case
> of
> > a running server is even that important here - it's really the offline
> one
> > that's important. Or rather, the really compelling one is when there is a
> > server running but I want to check the value offline because it will
> > change. SHOW doesn't help there because it shows the value based on the
> > currently running configuration, not the new one after a restart.
>
> You mean the case of a server where one would directly change
> postgresql.conf on a running server, and use postgres -C to see how
> much the kernel settings need to be changed before the restart?
>

Yes.

AIUI that was the original use-case for this feature. It certainly was for
me :)



> Hmm. So what's the solution on windows? I guess maybe it's not as
> important
> > there because there is no limit on huge pages, but generally getting the
> > expected shared memory usage might be useful? Just significantly less
> > important.
>
> Contrary to Linux, we don't need to care about the number of large
> pages that are necessary because there is no equivalent of
> vm.nr_hugepages on Windows (see [1]), do we?  If that were the case,
> we'd have a use case for huge_page_size, additionally.
>

Right, for this one in particular -- that's what I meant with my comment
about there not being a limit. But this feature works for other settings as
well, not just the huge pages one.  Exactly what the use-cases are can
vary, but surely they would have the same problems wrt redirects?


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread David Christensen
On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier  wrote:
>
> On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote:
> > Hi Matthias, great point.  Enclosed is a revised version of the patch
> > that adds the fork identifier to the end if it's a non-main fork.
>
> Like Alvaro, I have seen cases where this would have been really
> handy.  So +1 from me, as well, to have more tooling like what you are
> proposing.  Fine for me to use one file for each block with a name
> like what you are suggesting for each one of them.
>
> +   /* we accept an empty existing directory */
> +   if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
> +   {
> I don't think that there is any need to rely on a new logic if there
> is already some code in place able to do the same work.  See
> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example,
> that relies on pg_check_dir().  I think that you'd better rely at
> least on what pgcheckdir.c offers.

Yeah, though I am tending towards what another user had suggested and
just accepting an existing directory rather than requiring it to be
empty, so thinking I might just skip this one. (Will review the file
you've pointed out to see if there is a relevant function though.)

> +   {"raw-fpi", required_argument, NULL, 'W'},
> I think that we'd better rename this option.  "fpi", that is not used
> much in the user-facing docs, is additionally not adapted when we have
> an other option called -w/--fullpage.  I can think of
> --save-fullpage.

Makes sense.

> +   PageSetLSN(page, record->ReadRecPtr);
> +   /* if checksum field is non-zero then we have checksums 
> enabled,
> +* so recalculate the checksum with new LSN (yes, this is 
> a hack)
> +*/
> Yeah, that looks like a hack, but putting in place a page on a cluster
> that has checksums enabled would be more annoying with
> zero_damaged_pages enabled if we don't do that, so that's fine by me
> as-is.  Perhaps you should mention that FPWs don't have their
> pd_checksum updated when written.

Can make a mention; are you thinking just a comment in the code is
sufficient, or do you want user-visible docs as well?

> +   /* we will now extract the fullpage image from the XLogRecord and save
> +* it to a calculated filename */
> The format of this comment is incorrect.
>
> +The LSN of the record with this block, formatted
> +as %08x-%08X instead of the
> +conventional %X/%X due to filesystem naming
> +limits
> The last part of the sentence about %X/%X could just be removed.  That
> could be confusing, at worse.

Makes sense.

> +   PageSetLSN(page, record->ReadRecPtr);
> Why is pd_lsn set?

This was to make the page as extracted equivalent to the effect of
applying the WAL record block on replay (the LSN and checksum both);
since recovery calls this function to mark the LSN where the page came
from, I did the same here.  (I do see perhaps a case for
--raw output that doesn't munge the page whatsoever; this just made
comparisons easier.)

> git diff --check complains a bit.

Can look into this.

> This stuff should include some tests.  With --end, the tests can
> be cheap.

Yeah, makes sense, will include some in the next version.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread David Christensen
On Mon, Apr 25, 2022 at 2:00 AM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 4/25/22 8:11 AM, Michael Paquier wrote:
> > On Sat, Apr 23, 2022 at 01:43:36PM -0500, David Christensen wrote:
> >> Hi Matthias, great point.  Enclosed is a revised version of the patch
> >> that adds the fork identifier to the end if it's a non-main fork.
> > Like Alvaro, I have seen cases where this would have been really
> > handy.  So +1 from me, as well, to have more tooling like what you are
> > proposing.
>
> +1 on the idea.
>
> FWIW, there is an extension doing this [1] but having the feature
> included in pg_waldump would be great.

Cool, glad to see there is some interest; there's definitely some overlap in
forensics both inside and outside the database, as there are different
use cases for each.

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread David Christensen
On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy
 wrote:
> Thanks for working on this. I'm just thinking if we can use these FPIs
> to repair the corrupted pages? I would like to understand more
> detailed usages of the FPIs other than inspecting with pageinspect.

My main use case was for being able to look at potential corruption,
either in the WAL stream, on heap, or in tools associated with the WAL
stream.  I suppose you could use the page images to replace corrupted
on-disk pages (and in fact I think I've heard of a tool or two that
try to do that), though I don't know that I consider this the primary
purpose (and having TOAST tables and the like, as well as clog, would
make it potentially hard to just drop in a page version without
issues).  Might help in extreme situations though.

> Given that others have realistic use-cases (of course I would like to
> know more about those), +1 for the idea. However, I would suggest
> adding a function to extract raw FPI data to the pg_walinspect
> extension that got recently committed in PG 15, the out of which can
> directly be fed to pageinspect functions or

Yeah, makes sense to have some overlap here; will review what is there
and see if there is some shared code base we can utilize.  (ISTR some
work towards getting these two tools using more of the same code, and
this seems like another such instance.)

> Few comments:
> 1) I think it's good to mention the stored file name format.
> + printf(_("  -W, --raw-fpi=path save found full page images to
> given path\n"));

+1, though I've also thought there could be uses for having multiple
possible output formats here (most immediately, there may be cases
where we want *each* FPI for a block vs. the *latest*, so file names
with/without the LSN component seem the easiest way forward here).
That would introduce some additional complexity though, so might need
to see if others think that makes any sense.

> 2)
> + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> + {
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */
> +
> + if (XLogRecHasBlockImage(record, block_id))
>
> I think we need XLogRecHasBlockRef to be true to check
> XLogRecHasBlockImage otherwise, we will see some build farms failing,
> recently I've seen this failure for pg_walinspect..
>
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (!XLogRecHasBlockRef(record, block_id))
> continue;
>
> if (XLogRecHasBlockImage(record, block_id))
> *fpi_len += XLogRecGetBlock(record, block_id)->bimg_len;
> }

Good point; my previous patch that got committed here (127aea2a65)
probably also needed this treatment.

> 3) Please correct the commenting format:
> + /* we will now extract the fullpage image from the XLogRecord and save
> + * it to a calculated filename */

Ack.

> 4) Usually we start errors with lower case letters "could not ."
> + pg_fatal("Couldn't open file for output: %s", filename);
> + pg_fatal("Couldn't write out complete FPI to file: %s", filename);
> And the variable name too:
> + FILE *OPF;

Ack.

> 5) Not sure how the FPIs of TOASTed tables get stored, but it would be
> good to check.

What would be different here? Are there issues you can think of, or
just more from the pageinspect side of things?

> 6) Good to specify the known usages of FPIs in the documentation.

Ack. Prob good to get additional info/use cases from others, as mine
is fairly short. :-)

> 7) Isn't it good to emit an error if RestoreBlockImage returns false?
> + if (RestoreBlockImage(record, block_id, page))
> + {

Ack.

> 8) I think I don't mind if a non-empty directory is specified - IMO
> better usability is this - if the directory is non-empty, just go add
> the FPI files if FPI file exists just replace it, if the directory
> isn't existing, create and write the FPI files.
> + /* we accept an empty existing directory */
> + if (stat(config.save_fpw_path, &st) == 0 && S_ISDIR(st.st_mode))
> + {

Agreed; was mainly trying to prevent accidental expansion inside
`pg_wal` when an earlier version of the patch implied `.` as the
current dir with an optional path, but I've since made the path
non-optional and agree that this is unnecessarily restrictive.

>  9) Instead of following:
> + if (XLogRecordHasFPW(xlogreader_state))
> + XLogRecordSaveFPWs(xlogreader_state, config.save_fpw_path);
> I will just do this in XLogRecordSaveFPWs:
> for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
> {
> if (!XLogRecHasBlockRef(record, block_id))
> continue;
>
> if (XLogRecHasBlockImage(record, block_id))
> {
>
> }
> }

Yeah, a little redundant.

> 10) Along with pg_pwrite(), can we also fsync the files (of course
> users can choose it optionally) so that the writes will be durable for
> the OS crashes?

Can add; you thinking a separate flag to disable thi

Move Section 9.27.7 (Data Object Management Functions) to System Information Chapter

2022-04-25 Thread David G. Johnston
Hi,

Both the location and name of the linked to section make no sense to me:

https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT

Neither of the tables listed there manages (causes a change to) anything.  They
are purely informational functions - size and path of objects, respectively.
The section belongs in the previous chapter "System Information Functions and
Operators", with a different name.

David J.


bogus: logical replication rows/cols combinations

2022-04-25 Thread Alvaro Herrera
I just noticed that publishing tables on multiple publications with
different row filters and column lists has somewhat surprising behavior.
To wit: if a column is published in any row-filtered publication, then
the values for that column are sent to the subscriber even for rows that
don't match the row filter, as long as the row matches the row filter
for any other publication, even if that other publication doesn't
include the column.

Here's an example.

Publisher:

create table uno (a int primary key, b int, c int);
create publication uno for table uno (a, b) where (a > 0);
create publication dos for table uno (a, c) where (a < 0);

Here, we specify: publish columns a,b for rows with positive a, and
publish columns a,c for rows with negative a.

What happened next will surprise you!  Well, maybe not.  On subscriber:

create table uno (a int primary key, b int, c int);
create subscription sub_uno connection 'port=55432 dbname=alvherre' publication 
uno,dos;

Publisher:
insert into uno values (1, 2, 3), (-1, 3, 4);

Publication 'uno' only has columns a and b, so row with a=1 should not
have value c=3.  And publication 'dos' only has columns a and c, so row
with a=-1 should not have value b=3.  But, on subscriber:

table uno;
 a  │ b │ c 
────┼───┼───
  1 │ 2 │ 3
 -1 │ 3 │ 4

q.e.d.

I think this results from a too-simplistic view of how to mix multiple
publications with row filters and column lists.  IIRC we are saying "if
column X appears in *any* publication, then the value is published",
period, and we don't stop to evaluate the row filter corresponding to each
of those publications.

The desired result on subscriber is:

table uno;
 a  │ b │ c 
────┼───┼───
  1 │ 2 │
 -1 │   │ 4


Thoughts?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: variable filename for psql \copy

2022-04-25 Thread David G. Johnston
On Mon, Apr 25, 2022 at 1:24 AM Jiří Fejfar  wrote:

> contrib_regression=# copy (select 1) to :'afile';
>

Hopefully you realize that COPY is going to place that file on the server,
not send it to the psql client to be placed on the local machine.

The best way to do copy in psql is:
\set afile '...'
\o :'afile'
copy ... to stdout; --or the variant where you one-shot the \o ( \g with
arguments )

Not only do you get variable expansion but you can write the COPY command
on multiple lines just like any other SQL command.

Additionally, we have a list, and even an online form, for submitting bug
reports.  That would have been the more appropriate place to direct this
email.

David J.


Re: [Proposal] vacuumdb --schema only

2022-04-25 Thread Nathan Bossart
On Mon, Apr 25, 2022 at 08:50:09AM +0200, Gilles Darold wrote:
> Can I change the commitfest status to ready for committers?

I've marked it as ready-for-committer.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-04-25 Thread Nathan Bossart
On Mon, Apr 25, 2022 at 07:51:03PM +0530, Bharath Rupireddy wrote:
> With synchronous replication typically all the transactions (txns)
> first locally get committed, then streamed to the sync standbys and
> the backend that generated the transaction will wait for ack from sync
> standbys. While waiting for ack, it may happen that the query or the
> txn gets canceled (QueryCancelPending is true) or the waiting backend
> is asked to exit (ProcDiePending is true). In either of these cases,
> the wait for ack gets canceled and leaves the txn in an inconsistent
> state (as in the client thinks that the txn would have replicated to
> sync standbys) - "The transaction has already committed locally, but
> might not have been replicated to the standby.". Upon restart after
> the crash or in the next txn after the old locally committed txn was
> canceled, the client will be able to see the txns that weren't
> actually streamed to sync standbys. Also, if the client fails over to
> one of the sync standbys after the crash (either by choice or because
> of automatic failover management after crash), the locally committed
> txns on the crashed primary would be lost which isn't good in a true
> HA solution.

This topic has come up a few times recently [0] [1] [2].

> Thoughts?

I think this will require a fair amount of discussion.  I'm personally in
favor of just adding a GUC that can be enabled to block canceling
synchronous replication waits, but I know folks have concerns with that
approach.  When I looked at this stuff previously [2], it seemed possible
to handle the other data loss scenarios (e.g., forcing failover whenever
the primary shut down, turning off restart_after_crash).  However, I'm not
wedded to this approach.

[0] https://postgr.es/m/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
[1] https://postgr.es/m/cac4b9df-92c6-77aa-687b-18b86cb13728%40stratox.cz
[2] https://postgr.es/m/FDE157D7-3F35-450D-B927-7EC2F82DB1D6%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: proposal: possibility to read dumped table's name from file

2022-04-25 Thread Pavel Stehule
Hi

fresh rebase

Regards

Pavel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index c946755737..3711959fa2 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -779,6 +779,80 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Specify a filename from which to read patterns for objects to include
+or exclude from the dump. The patterns are interpreted according to the
+same rules as the corresponding options:
+-t/--table for tables,
+-n/--schema for schemas,
+--include-foreign-data for data on foreign servers and
+--exclude-table-data for table data.
+To read from STDIN use - as the
+filename.  The --filter option can be specified in
+conjunction with the above listed options for including or excluding
+objects, and can also be specified more than once for multiple filter
+files.
+   
+
+   
+The file lists one object pattern per row, with the following format:
+
+{ include | exclude } { table | schema | foreign_data | data } PATTERN
+
+   
+
+   
+The first keyword specifies whether the objects matched by the pattern
+are to be included or excluded. The second keyword specifies the type
+of object to be filtered using the pattern:
+
+ 
+  
+   table: tables, works like
+   -t/--table
+  
+ 
+ 
+  
+   schema: schemas, works like
+   -n/--schema
+  
+ 
+ 
+  
+   foreign_data: data on foreign servers, works like
+   --include-foreign-data. This keyword can only be
+   used with the include keyword.
+  
+ 
+ 
+  
+   data: table data, works like
+   --exclude-table-data. This keyword can only be
+   used with the exclude keyword.
+  
+ 
+
+   
+
+   
+Lines starting with # are considered comments and
+are ignored. Comments can be placed after filter as well. Blank lines
+are also ignored. See  for how to
+perform quoting in patterns.
+   
+
+   
+Example files are listed below in the 
+section.
+   
+
+  
+ 
+
  
   --if-exists
   
@@ -1119,6 +1193,7 @@ PostgreSQL documentation
 schema (-n/--schema) and
 table (-t/--table) qualifier
 match at least one extension/schema/table in the database to be dumped.
+This also applies to filters used with --filter.
 Note that if none of the extension/schema/table qualifiers find
 matches, pg_dump will generate an error
 even without --strict-names.
@@ -1528,6 +1603,19 @@ CREATE DATABASE foo WITH TEMPLATE template0;
 
 
 $ pg_dump -t "\"MixedCaseName\"" mydb > mytab.sql
+
+
+  
+   To dump all tables with names starting with mytable, except for table
+   mytable2, specify a filter file
+   filter.txt like:
+
+include table mytable*
+exclude table mytable2
+
+
+
+$ pg_dump --filter=filter.txt mydb > db.sql
 
 
  
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 786d592e2b..bb00e4fd4b 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -55,10 +55,12 @@
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
+#include "common/string.h"
 #include "dumputils.h"
 #include "fe_utils/option_utils.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
+#include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
 #include "pg_backup_db.h"
@@ -96,6 +98,29 @@ typedef enum OidOptions
 	zeroAsNone = 4
 } OidOptions;
 
+/*
+ * State data for reading filter items from stream
+ */
+typedef struct
+{
+	FILE	   *fp;
+	const char *filename;
+	int			lineno;
+	StringInfoData linebuff;
+}			FilterStateData;
+
+/*
+ * List of objects that can be specified in filter file
+ */
+typedef enum
+{
+	FILTER_OBJECT_TYPE_NONE,
+	FILTER_OBJECT_TYPE_TABLE,
+	FILTER_OBJECT_TYPE_SCHEMA,
+	FILTER_OBJECT_TYPE_FOREIGN_DATA,
+	FILTER_OBJECT_TYPE_DATA
+}			FilterObjectType;
+
 /* global decls */
 static bool dosync = true;		/* Issue fsync() to make dump durable on disk. */
 
@@ -317,6 +342,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(const TableInfo *tbinfo);
+static void getFiltersFromFile(const char *filename, DumpOptions *dopt);
 
 
 int
@@ -389,6 +415,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"filter", required_argument, NULL, 12},
 		{"if-exists"

Re: WIP: WAL prefetch (another approach)

2022-04-25 Thread Tom Lane
I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]:

diff -U3 
/home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out 
/home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out   
2022-04-10 03:05:15.972622440 +0200
+++ 
/home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out  
2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok 
-
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok 
 
@@ -46,11 +42,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok 
-
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 -- ===
 -- Test for filtering out WAL records of a particular table
 -- ===


I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself).  I obtained the
contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 , 
segment_open = 0x593b44 , 
segment_close = 0x593d38 }, system_identifier = 0, 
  private_data = 0x0, ReadRecPtr = 26230672, EndRecPtr = 26230752, 
  abortedRecPtr = 26230752, missingContrecPtr = 26230784, 
  overwrittenRecPtr = 0, DecodeRecPtr = 26230672, NextRecPtr = 26230752, 
  PrevRecPtr = 0, record = 0x0, decode_buffer = 0xf25428 "\240", 
  decode_buffer_size = 65536, free_decode_buffer = true, 
  decode_buffer_head = 0xf25428 "\240", decode_buffer_tail = 0xf25428 "\240", 
  decode_queue_head = 0x0, decode_queue_tail = 0x0, 
  readBuf = 0xf173f0 "\020\321\005", readLen = 0, segcxt = {
ws_dir = '\000' , ws_segsize = 16777216}, seg = {
ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0, 
  latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752, 
  currTLI = 1, currTLIValidUntil = 0, nextTLI = 0, 
  readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960, 
  errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any
suggestions?

What I do know so far is that while the failure reproduces fairly
reliably under "make check" (more than half the time, which squares
with topminnow's history), it doesn't reproduce at all under "make
installcheck" (after removing NO_INSTALLCHECK), which seems odd.
Maybe it's dependent on how much WAL history the installation has
accumulated?

It could be that this is a bug in pg_walinspect or a fault in its
test case; hard to tell since that got committed at about the same
time as the prefetch changes.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47




Re: WIP: WAL prefetch (another approach)

2022-04-25 Thread Tom Lane
Oh, one more bit of data: here's an excerpt from pg_waldump output after
the failed test:

rmgr: Btree   len (rec/tot): 72/72, tx:727, lsn: 
0/01903BC8, prev 0/01903B70, desc: INSERT_LEAF off 111, blkref #0: rel 
1663/16384/2673 blk 9
rmgr: Btree   len (rec/tot): 72/72, tx:727, lsn: 
0/01903C10, prev 0/01903BC8, desc: INSERT_LEAF off 141, blkref #0: rel 
1663/16384/2674 blk 7
rmgr: Standby len (rec/tot): 42/42, tx:727, lsn: 
0/01903C58, prev 0/01903C10, desc: LOCK xid 727 db 16384 rel 16391 
rmgr: Transaction len (rec/tot):437/   437, tx:727, lsn: 
0/01903C88, prev 0/01903C58, desc: COMMIT 2022-04-25 20:16:03.374197 CEST; 
inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 
catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 
catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16391
rmgr: Heaplen (rec/tot): 59/59, tx:728, lsn: 
0/01903E40, prev 0/01903C88, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 
1663/16384/16391 blk 0
rmgr: Heaplen (rec/tot): 59/59, tx:728, lsn: 
0/01903E80, prev 0/01903E40, desc: INSERT off 2 flags 0x00, blkref #0: rel 
1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/34, tx:728, lsn: 
0/01903EC0, prev 0/01903E80, desc: COMMIT 2022-04-25 20:16:03.379323 CEST
rmgr: Heaplen (rec/tot): 59/59, tx:729, lsn: 
0/01903EE8, prev 0/01903EC0, desc: INSERT off 3 flags 0x00, blkref #0: rel 
1663/16384/16391 blk 0
rmgr: Heaplen (rec/tot): 59/59, tx:729, lsn: 
0/01903F28, prev 0/01903EE8, desc: INSERT off 4 flags 0x00, blkref #0: rel 
1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/34, tx:729, lsn: 
0/01903F68, prev 0/01903F28, desc: COMMIT 2022-04-25 20:16:03.381720 CEST

The error is complaining about not being able to read 0/01903E40,
which AFAICT is from the first "INSERT INTO sample_tbl" command,
which most certainly ought to be down to disk at this point.

Also, I modified the test script to see what WAL LSNs it thought
it was dealing with, and got

+\echo 'wal_lsn1 = ' :wal_lsn1
+wal_lsn1 =  0/1903E40
+\echo 'wal_lsn2 = ' :wal_lsn2
+wal_lsn2 =  0/1903EE8

confirming that idea of where 0/01903E40 is in the WAL history.
So this is sure looking like a bug somewhere in xlogreader.c,
not in pg_walinspect.

regards, tom lane




Re: pg_walcleaner - new tool to detect, archive and delete the unneeded wal files (was Re: pg_archivecleanup - add the ability to detect, archive and delete the unneeded wal files on the primary)

2022-04-25 Thread Stephen Frost
Greetings,

* Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> On Mon, Apr 18, 2022 at 8:48 PM Stephen Frost  wrote:
> > * Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> > > On Mon, Apr 18, 2022 at 7:41 PM Stephen Frost  wrote:
> > > > * Bharath Rupireddy (bharath.rupireddyforpostg...@gmail.com) wrote:
> > > > > Thanks for the comments. Here's a new tool called pg_walcleaner which
> > > > > basically deletes (optionally archiving before deletion) the unneeded
> > > > > WAL files.
> > > > >
> > > > > Please provide your thoughts and review the patches.
> > > >
> > > > Alright, I spent some more time thinking about this and contemplating
> > > > what the next steps are... and I feel like the next step is basically
> > > > "add a HINT when the server can't start due to being out of disk space
> > > > that one should consider running pg_walcleaner" and at that point... why
> > > > aren't we just, uh, doing that?  This is all still quite hand-wavy, but
> > > > it sure would be nice to be able to avoid downtime due to a broken
> > > > archiving setup.  pgbackrest has a way of doing this and while we, of
> > > > course, discourage the use of that option, as it means throwing away
> > > > WAL, it's an option that users have.  PG could have a similar option.
> > > > Basically, to archive_command/library what max_slot_wal_keep_size is for
> > > > slots.
> > >
> > > Thanks. I get your point. The way I see it is that the postgres should
> > > be self-aware of the about-to-get-full disk (probably when the data
> > > directory size is 90%(configurable, of course) of total disk size) and
> > > then freeze the new write operations (may be via new ALTER SYSTEM SET
> > > READ-ONLY or setting default_transaction_read_only GUC) and then go
> > > clean the unneeded WAL files by just invoking pg_walcleaner tool
> > > perhaps. I think, so far, this kind of work has been done outside of
> > > postgres. Even then, we might get into out-of-disk situations
> > > depending on how frequently we check the data directory size to
> > > compute the 90% configurable limit. Detecting the disk size is the KEY
> > > here. Hence we need an offline invokable tool like pg_walcleaner.
> >
> > Ugh, last I checked, figuring out if a given filesystem is near being
> > full is a pain to do in a cross-platform way.  Why not just do exactly
> > what we already are doing for replication slots, but for
> > archive_command?
> 
> Do you mean to say that if the archive_command fails, say, for "some
> time" or "some number of attempts", just let the server not bother
> about it and checkpoint delete the WAL files instead of going out of
> disk? If this is the thought, then it's more dangerous as we might end
> up losing the WAL forever. For invalidating replication slots, it's
> okay because the required WAL can exist somewhere (either on the
> primary or on the archive location).

I was thinking more specifically along the lines of "if there's > X GB
of WAL that hasn't been archived, give up on archiving anything new"
(which is how the pgbackrest option works).

As archiving with this command is optional, it does present the same
risk too.  Perhaps if we flipped it around to require the
archive_command be provided then it'd be a bit better, though we would
also need a way for users to ask for us to just delete the WAL without
archiving it.  There again though ... the server already has a way of
both archiving and removing archived WAL and also has now grown the
archive_library option, something that this tool would be pretty hard to
replicate, I feel like, as it wouldn't be loading the library into a PG
backend anymore.  As we don't have any real archive libraries yet, it's
hard to say if that's going to be an actual issue or not.  Something to
consider though.

> > > > That isn't to say that we shouldn't also have a tool like this, but it
> > > > generally feels like we're taking a reactive approach here rather than a
> > > > proactive one to addressing the root issue.
> > >
> > > Agree. The offline tool like pg_walcleaner can help greatly even with
> > > some sort of above internal/external disk space monitoring tools.
> >
> > See, this seems like a really bad idea to me.  I'd be very concerned
> > about people mis-using this tool in some way and automating its usage
> > strikes me as absolutely exactly that..  Are we sure that we can
> > guarantee that we don't remove things we shouldn't when this ends up
> > getting run against a running cluster from someone's automated tooling?
> > Or when someone sees that it refuses to run for $reason and tries to..
> > "fix" that?  Seems quite risky to me..  I'd probably want to put similar
> > caveats around using this tool as I do around pg_resetwal when doing
> > training- that is, don't ever, ever, ever use this, heh.
> 
> The initial version of the patch doesn't check if the server crashed
> or not before running it. I was thinking of looking at the
> postmaster.pid or pg

Re: pgsql: Allow db.schema.table patterns, but complain about random garbag

2022-04-25 Thread Andrew Dunstan


On 2022-04-24 Su 15:37, Andrew Dunstan wrote:
> On 2022-04-24 Su 14:19, Noah Misch wrote:
>
>> Even if MinGW has
>> some magic to make that work, I suspect we'll want a non-header home.  
>> Perhaps
>> src/common/exec.c?  It's best to keep this symbol out of libpq and other
>> DLLs, though I bet exports.txt would avoid functional problems.
>
> exec.c looks like it should work fine.
>
>


OK, in the absence of further comment I'm going to do it that way.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages

2022-04-25 Thread Nathan Bossart
It's been a few weeks, so I'm marking the commitfest entry as
waiting-on-author.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add --{no-,}bypassrls flags to createuser

2022-04-25 Thread Nathan Bossart
On Thu, Apr 21, 2022 at 01:21:57PM -0700, David G. Johnston wrote:
> I'm ok with -m/--member as well (like with --role, only one role can be
> specified per switch instance, so member, not membership, the latter meaning,
> at least for me, the collective).
> 
> That -m doesn't match --role-to is no worse than -g not matching --role; a
> short option seems worthwhile, and the -m (membership) mnemonic should be
> simple to pick up.
> 
> I don't see the addition of "-name" to the option name being beneficial.
> 
> Yes, the standard doesn't use the "TO" prefix for "ROLE" - but taking that
> liberty for consistency here is very appealing and there isn't another SQL
> clause that it would be confused with.

+1 for "member".  It might not be perfect, but IMO it's the clearest
option.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-04-25 Thread Nathan Bossart
On Thu, Mar 24, 2022 at 03:22:11PM +0530, Bharath Rupireddy wrote:
>> > > > > Both seem still very long. I still am doubtful this level of detail 
>> > > > > is
>> > > > > appropriate. Seems more like a thing for a tracepoint or such. How 
>> > > > > about just
>> > > > > printing the time for the logical decoding operations in aggregate, 
>> > > > > without
>> > > > > breaking down into files, adding LSNs etc?
>> > > >
>> > > > The distinction that the patch makes right now is for snapshot and
>> > > > rewrite mapping files and it makes sense to have them separately.
>> > >
>> > > -1. The line also needs to be readable...
>> >
>> > IMHO, that's subjective. Even now, the existing
>> > "checkpoint/restartpoint complete" message has a good amount of info
>> > which makes it unreadable for some.
>> >
>> > The number of logical decoding files(snapshot and mapping) the
>> > checkpoint processed is a good metric to have in server logs along
>> > with the time it took for removing/syncing them. Thoughts?

I took another look at the example output, and I think I agree that logging
the total time for logical decoding operations is probably the best path
forward.  This information would be enough to clue an administrator into
the possible causes of lengthy checkpoints, but it also wouldn't disrupt
the readability of the log statement too much.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Cryptohash OpenSSL error queue in FIPS enabled builds

2022-04-25 Thread Daniel Gustafsson
> On 25 Apr 2022, at 02:50, Michael Paquier  wrote:
> On Sat, Apr 23, 2022 at 11:40:19PM +0200, Daniel Gustafsson wrote:
>> On 22 Apr 2022, at 19:01, Tom Lane  wrote:

>>> It seems like that solution means you're leaving an extra error in the 
>>> queue to
>>> break unrelated code.  Wouldn't it be better to clear after grabbing the 
>>> error?
>>> (Or maybe do both.)
>> 
>> That's a very good point, doing it in both ends of the operation is better
>> here.
> 
> Error queues are cleaned with ERR_clear_error() before specific SSL
> calls in the frontend and the backend, never after the fact.  If we
> assume that multiple errors can be stacked in the OpenSSL error queue,
> shouldn't we worry about cleaning up the error queue in code paths
> like pgtls_read/write(), be_tls_read/write() and be_tls_open_server()?
> So it seems to me that SSLerrmessage() should be treated the same way
> for the backend and the frontend.  Any opinions?

Well, clearing the queue before calling into OpenSSL is the programming pattern
that is quite universally followed, so I'm not sure we need to litter the
codepaths with calls that clear the queue as we leave.

In this particular codepath I think we can afford clearing it on the way out,
with a comment explaining why.  It's easily reproducible, and adding a call and
a comment is good documentation for ourselves of this OpenSSL behavior.  That
being said, clearing on the way in is the important bit.
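
To make the pattern concrete, here is a minimal sketch (not the actual
cryptohash code; the digest choice and the reporting are just placeholders):

#include <stdio.h>
#include <openssl/err.h>
#include <openssl/evp.h>

/*
 * Sketch only: clear the queue on the way in, and drain it again after a
 * failure so a stale entry cannot confuse later, unrelated OpenSSL calls.
 */
static int
digest_init_sketch(EVP_MD_CTX *ctx)
{
	ERR_clear_error();			/* clear on the way in */

	if (EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) != 1)
	{
		unsigned long ecode = ERR_get_error();	/* consume the first error */
		char	buf[256];

		ERR_error_string_n(ecode, buf, sizeof(buf));
		fprintf(stderr, "digest init failed: %s\n", buf);

		ERR_clear_error();		/* clear on the way out, too */
		return -1;
	}
	return 0;
}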

> pgcrypto's openssl.c has the same problem under FIPS as it includes
> EVP calls.  Saying that, putting a cleanup in pg_cryptohash_create()
> before the fact, and one in SSLerrmessage() after consuming the error
> code should be fine to keep a clean queue.

pgcrypto doesn't really consume or even inspect the OpenSSL errors, but passes on
a PXE error based on the context of the operation.  We could clear the queue as
we leave, but as you say we already clear it before calling in other places, so
it's not clear that it's useful.  We've had EVP in pgcrypto for some time
without seeing issues from error queues; one error left isn't that different
from two when not consumed.

The attached 0002 does however correctly (IMO) report the error as an init
error instead of the nondescript generic error, which isn't really all that
helpful.  I think that also removes the last consumer of the generic error, but
I will take another look with fresh eyes to confirm that.

0003 removes what I think is a weirdly placed question mark from the message,
which makes it seem strangely ambiguous.  This needs an update of the test answer
files as well, but I first wanted to float the idea before doing that.

--
Daniel Gustafsson   https://vmware.com/



v2-0003-pgcrypto-remove-questionmark-from-error-message.patch
Description: Binary data


v2-0002-pgcrypto-report-init-errors-as-PXE_CIPHER_INIT.patch
Description: Binary data


v2-0001-Clear-the-OpenSSL-error-queue-before-cryptohash-o.patch
Description: Binary data


Re: bogus: logical replication rows/cols combinations

2022-04-25 Thread Tomas Vondra
On 4/25/22 17:48, Alvaro Herrera wrote:
> I just noticed that publishing tables on multiple publications with
> different row filters and column lists has somewhat surprising behavior.
> To wit: if a column is published in any row-filtered publication, then
> the values for that column are sent to the subscriber even for rows that
> don't match the row filter, as long as the row matches the row filter
> for any other publication, even if that other publication doesn't
> include the column.
> 
> Here's an example.
> 
> Publisher:
> 
> create table uno (a int primary key, b int, c int);
> create publication uno for table uno (a, b) where (a > 0);
> create publication dos for table uno (a, c) where (a < 0);
> 
> Here, we specify: publish columns a,b for rows with positive a, and
> publish columns a,c for rows with negative a.
> 
> What happened next will surprise you!  Well, maybe not.  On subscriber:
> 
> create table uno (a int primary key, b int, c int);
> create subscription sub_uno connection 'port=55432 dbname=alvherre' 
> publication uno,dos;
> 
> Publisher:
> insert into uno values (1, 2, 3), (-1, 3, 4);
> 
> Publication 'uno' only has columns a and b, so row with a=1 should not
> have value c=3.  And publication 'dos' only has columns a and c, so row
> with a=-1 should not have value b=3.  But, on subscriber:
> 
> table uno;
>  a  │ b │ c 
> ┼───┼───
>   1 │ 2 │ 3
>  -1 │ 3 │ 4
> 
> q.e.d.
> 
> I think this results from a too-simplistic view of how to mix multiple
> publications with row filters and column lists.  IIRC we are saying "if
> column X appears in *any* publication, then the value is published",
> period, and don't stop to evaluate the row filter corresponding to each
> of those publications.
> 

Right.

> The desired result on subscriber is:
> 
> table uno;
>  a  │ b │ c 
> ┼───┼───
>   1 │ 2 │
>  -1 │   │ 4
> 
> 
> Thoughts?
> 

I'm not quite sure which of the two behaviors is more "desirable". In a
way, it's somewhat similar to publish_as_relid, which is also calculated
without considering which of the row filters match?

But maybe you're right and it should behave the way you propose ... the
example I have in mind is a use case replicating a table with two types of
rows - sensitive and non-sensitive. For sensitive rows we replicate only
some of the columns; for non-sensitive rows we replicate everything. This
could be implemented as two publications:

create publication sensitive_rows
   for table t (a, b) where (is_sensitive);

create publication non_sensitive_rows
   for table t where (not is_sensitive);

But the way it's implemented now, we'll always replicate all columns,
because the second publication has no column list.

Changing this to behave the way you expect would be quite difficult,
because at the moment we build a single OR expression from all the row
filters. We'd have to keep the individual expressions, so that we can
build a column list for each of them (in order to ignore those that
don't match).

We'd have to remove various other optimizations - for example we can't
just discard row filters if we find a "no_filter" publication. Or more
precisely, we'd have to consider column lists too.

In other words, we'd have to merge pgoutput_column_list_init into
pgoutput_row_filter_init, and then modify pgoutput_row_filter to
evaluate the row filters one by one, and build the column list.

I can take a stab at it, but it seems strange to not apply the same
logic to evaluation of publish_as_relid. I wonder what Amit thinks about
this, as he wrote the row filter stuff.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Cryptohash OpenSSL error queue in FIPS enabled builds

2022-04-25 Thread Tom Lane
Daniel Gustafsson  writes:
> In this particular codepath I think we can afford clearing it on the way out,
> with a comment explaining why.

Yeah.  It seems out of the ordinary for an OpenSSL call to stack
two error conditions, so treating a known case of that specially
seems reasonable.  Patches seem sane from here.

regards, tom lane




Re: add checkpoint stats of snapshot and mapping files of pg_logical dir

2022-04-25 Thread Michael Paquier
On Mon, Apr 25, 2022 at 01:34:38PM -0700, Nathan Bossart wrote:
> I took another look at the example output, and I think I agree that logging
> the total time for logical decoding operations is probably the best path
> forward.  This information would be enough to clue an administrator into
> the possible causes of lengthy checkpoints, but it also wouldn't disrupt
> the readability of the log statement too much. 

+   /* translator: the placeholders after first %s show 
restartpoint/checkpoint options */
+   (errmsg("%s starting:%s%s%s%s%s%s%s%s",
+   restartpoint ?
_("restartpoint") : _("checkpoint"),

0001 breaks translation, as "checkpoint/restartpoint" and "starting"
would be treated as separate terms to translate.  That would not matter
for English, but it does in French, where we'd say "début du
checkpoint".  You could fix that by adding "starting" to each
refactored term or by building a string.  0002 does the latter, so my take
is that you should begin using a StringInfo in 0001.
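
To make the suggestion concrete, a rough sketch of what that could look like,
assuming a restartpoint boolean and the checkpoint flags are in scope (this is
not the patch's code, just an illustration):

	StringInfoData buf;

	initStringInfo(&buf);
	/* keep "checkpoint starting"/"restartpoint starting" a single translatable unit */
	appendStringInfoString(&buf,
						   restartpoint ? _("restartpoint starting:") :
						   _("checkpoint starting:"));
	if (flags & CHECKPOINT_IS_SHUTDOWN)
		appendStringInfoString(&buf, " shutdown");
	if (flags & CHECKPOINT_IMMEDIATE)
		appendStringInfoString(&buf, " immediate");
	/* ... remaining flags appended the same way ... */

	ereport(LOG, (errmsg_internal("%s", buf.data)));
	pfree(buf.data);
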
--
Michael


signature.asc
Description: PGP signature


Re: Estimating HugePages Requirements?

2022-04-25 Thread Michael Paquier
On Mon, Apr 25, 2022 at 04:55:25PM +0200, Magnus Hagander wrote:
> AIUI that was the original use-case for this feature. It certainly was for
> me :)

Perhaps we'd be fine with relaxing the requirements here knowing that
the control file should never be larger than PG_CONTROL_MAX_SAFE_SIZE
(aka the read should be atomic so it could be made lockless).  At the
end of the day, to be absolutely correct in the shmem size estimation,
I think that we are going to need what's proposed here or the sizing
may not be right depending on how extensions adjust GUCs after they
load their _PG_init():
https://www.postgresql.org/message-id/20220419154658.GA2487941@nathanxps13

That's a bit independent, but not completely unrelated either
depending on how exact you want your number of estimated huge pages to
be.  Just wanted to mention it.

>> Contrary to Linux, we don't need to care about the number of large
>> pages that are necessary because there is no equivalent of
>> vm.nr_hugepages on Windows (see [1]), do we?  If that were the case,
>> we'd have a use case for huge_page_size, additionally.
> 
> Right, for this one in particular -- that's what I meant with my comment
> about there not being a limit. But this feature works for other settings as
> well, not just the huge pages one.  Exactly what the use-cases are can
> vary, but surely they would have the same problems wrt redirects?

Yes, the redirection issue would apply to all the run-time GUCs.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping schema changes in publication

2022-04-25 Thread Peter Smith
On Sat, Apr 23, 2022 at 2:09 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 22, 2022 at 12:39 PM vignesh C  wrote:
> >
> > Hi,
> >
> > This feature adds an option to skip changes of all tables in specified
> > schema while creating publication.
> > This feature is helpful for use cases where the user wants to
> > subscribe to all the changes except for the changes present in a few
> > schemas.
> > Ex:
> > CREATE PUBLICATION pub1 FOR ALL TABLES SKIP ALL TABLES IN SCHEMA s1,s2;
> > OR
> > ALTER PUBLICATION pub1 ADD SKIP ALL TABLES IN SCHEMA s1,s2;
> >
> > A new column pnskip is added to table "pg_publication_namespace", to
> > maintain the schemas that the user wants to skip publishing through
> > the publication. Modified the output plugin (pgoutput) to skip
> > publishing the changes if the relation is part of skip schema
> > publication.
> > As a continuation to this, I will work on implementing skipping tables
> > from all tables in schema and skipping tables from all tables
> > publication.
> >
> > Attached patch has the implementation for this.
> > This feature is for the pg16 version.
> > Thoughts?
>
> The feature seems to be useful especially when there are lots of
> schemas in a database. However, I don't quite like the syntax. Do we
> have 'SKIP' identifier in any of the SQL statements in SQL standard?
> Can we think of adding skip_schema_list as an option, something like
> below?
>
> CREATE PUBLICATION foo FOR ALL TABLES (skip_schema_list = 's1, s2');
> ALTER PUBLICATION foo SET (skip_schema_list = 's1, s2'); - to set
> ALTER PUBLICATION foo SET (skip_schema_list = ''); - to reset
>

I had been wondering for some time if there was any way to introduce
more flexible pattern matching into PUBLICATION but without bloating
the syntax. Maybe your idea to use an option for the "skip" gives a
way to do it...

For example, if we could use a regex (for schema.table patterns) for the
option value, then:

~~

e.g.1. Exclude certain tables:

// do NOT publish any tables of schemas s1,s2
CREATE PUBLICATION foo FOR ALL TABLES (exclude_match = '(s1\..*)|(s2\..*)');

// do NOT publish my secret tables (those called "mysecretXXX")
CREATE PUBLICATION foo FOR ALL TABLES (exclude_match = '(.*\.mysecret.*)');

~~

e.g.2. Only allow certain tables.

// ONLY publish my tables (those called "mytableXXX")
CREATE PUBLICATION foo FOR ALL TABLES (subset_match = '(.*\.mytable.*)');

// So following is equivalent to FOR ALL TABLES IN SCHEMA s1
CREATE PUBLICATION foo FOR ALL TABLES (subset_match = '(s1\..*)');

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Cryptohash OpenSSL error queue in FIPS enabled builds

2022-04-25 Thread Michael Paquier
On Tue, Apr 26, 2022 at 12:07:32AM +0200, Daniel Gustafsson wrote:
> In this particular codepath I think we can afford clearing it on the way out,
> with a comment explaining why.  It's easily reproducible and adding a call and
> a comment is a good documentation for ourselves of this OpenSSL behavior.  
> That
> being said, clearing on the way in is the important bit.

+ * consumed an error, but cipher initialization can in FIPS enabled
It seems to me that this comment needs a hyphen, as in
"FIPS-enabled".

I am a bit reluctant to assume that a localized
ERR_clear_error() in the error code path of the init() call is the
only problem that would occur, just because that's the first one we'd
see in a hash computation.  So my choice would be to call
ERR_get_error() within SSLerrmessage() and clear the queue after
fetching the error code, for both
cryptohash_openssl.c and hmac_openssl.c, but I won't fight hard
against both of you on this point, either.

Perhaps this should be reported to the upstream folks?  We'd still
need this code for already released versions, but getting two errors
looks like a mistake.

> pgcrypto doesn't really consume or even inspect the OpenSSL errors, but pass 
> on
> a PXE error based on the context of the operation.  We could clear the queue 
> as
> we leave, but as you say we already clear it before calling in other places so
> it's not clear that it's useful.  We've had EVP in pgcrypto for some time
> without seeing issues from error queues, one error left isn't that different
> from two when not consumed.

Okay.  I did not recall the full error stack used in pgcrypto.  It is
annoying not to get the details of the error from OpenSSL, though.
With FIPS enabled, someone computing a hash with pgcrypto would just know
about the initialization error, but would miss why the computation
failed.  It looks like we could use a new error code to tell
px_strerror() to look at the OpenSSL error queue instead of one of the
hardcoded strings.  Just saying.

> The attached 0002 does however correctly (IMO) report the error as an init
> error instead of the non-descript generic error, which isn't really all that
> helpful.  I think that also removes the last consumer of the generic error, 
> but
> I will take another look with fresh eyes to confirm that.
>
> 0003 removes what I think is a weirdly placed questionmark from the message
> that make it seem strangely ambiguous.  This needs to update the test answer
> files as well, but I first wanted to float the idea before doing that.

Good catches.
--
Michael


signature.asc
Description: PGP signature


Re: Building Postgres with lz4 on Visual Studio

2022-04-25 Thread Michael Paquier
On Wed, Apr 13, 2022 at 05:21:41PM +0300, Melih Mutlu wrote:
> I tried to build Postgres from source using Visual Studio 19. It worked all
> good.
> Then I wanted to build it with some dependencies, started with the ones
> listed here [1]. But I'm having some issues with lz4.
> 
> First I downloaded the latest release of lz4 from this link [2].
> Modified the src\tools\msvc\config.pl file as follows:

Yeah, that's actually quite an issue because there is no official
release of liblz4.lib, so one has to compile the code themselves to
get hold of liblz4.lib.  zstd is similar in terms of its release
contents.
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] vacuumdb --schema only

2022-04-25 Thread Michael Paquier
On Mon, Apr 25, 2022 at 09:18:53AM -0700, Nathan Bossart wrote:
> I've marked it as ready-for-committer.

The refactoring logic to build the queries is clear to follow.  I have
a few comments about the shape of the patch, though.

case 'a':
-   alldb = true;
+   check_objfilter(OBJFILTER_ALL_DBS);
break;
The cross-option checks are usually done after all the option
switches are checked.  Why does this need to be different?  It does not
strike me as a huge problem to do one filter check at the end.

+void
+check_objfilter(VacObjFilter curr_option)
+{
+   objfilter |= curr_option;
+
+   if ((objfilter & OBJFILTER_ALL_DBS) &&
+   (objfilter & OBJFILTER_DATABASE))
+   pg_fatal("cannot vacuum all databases and a specific one at the same 
time");
The addition of more OBJFILTER_* values (unlikely to happen, but who
knows) would make it hard to know which options should not interact
with each other.  Wouldn't it be better to use a kind of compatibility
table for that?  As each OBJFILTER_* matches one option, you could
reduce the number of strings in need of translation by switching to
an error message like "cannot use options %s and %s together", or
something like that?
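
As a sketch of what I have in mind (the conflict pairs, the schema-related
OBJFILTER_* names and the helper name are only illustrative; OBJFILTER_ALL_DBS
and OBJFILTER_DATABASE are the ones from the patch):

typedef struct
{
	VacObjFilter	opt1;
	const char	   *name1;
	VacObjFilter	opt2;
	const char	   *name2;
} ObjFilterConflict;

static const ObjFilterConflict conflicts[] = {
	{OBJFILTER_ALL_DBS, "-a/--all", OBJFILTER_DATABASE, "-d/--dbname"},
	{OBJFILTER_SCHEMA, "-n/--schema", OBJFILTER_SCHEMA_EXCLUDE, "-N/--exclude-schema"},
};

static void
check_objfilter_conflicts(VacObjFilter objfilter)
{
	/* run once, after all options have been parsed */
	for (int i = 0; i < lengthof(conflicts); i++)
	{
		if ((objfilter & conflicts[i].opt1) &&
			(objfilter & conflicts[i].opt2))
			pg_fatal("cannot use options %s and %s together",
					 conflicts[i].name1, conflicts[i].name2);
	}
}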

+$node->command_fails(
+   [ 'vacuumdb', '-a', '-d', 'postgres' ],
+   'cannot use options -a and -d at the same time');
This set of tests had better use command_fails_like() to make sure
that the correct error patterns from check_objfilter() show up?
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread Michael Paquier
On Mon, Apr 25, 2022 at 10:24:52AM -0500, David Christensen wrote:
> On Mon, Apr 25, 2022 at 6:03 AM Bharath Rupireddy
>  wrote:
>> Thanks for working on this. I'm just thinking if we can use these FPIs
>> to repair the corrupted pages? I would like to understand more
>> detailed usages of the FPIs other than inspecting with pageinspect.
> 
> My main use case was for being able to look at potential corruption,
> either in the WAL stream, on heap, or in tools associated with the WAL
> stream.  I suppose you could use the page images to replace corrupted
> on-disk pages (and in fact I think I've heard of a tool or two that
> try to do that), though I don't know that I consider this the primary
> purpose (and having toast tables and the like, as well as clog, would
> make it potentially hard to just drop in a page version without
> issues).  Might help in extreme situations though.

You could do a bunch of things with those images, even make things
worse if you are not careful enough.

>> 10) Along with pg_pwrite(), can we also fsync the files (of course
>> users can choose it optionally) so that the writes will be durable for
>> the OS crashes?
> 
> Can add; you thinking a separate flag to disable this with default true?

We expect data generated by tools like pg_dump, pg_receivewal
(depending on the use of --synchronous) or pg_basebackup to be consistent
when we exit from the call.  FWIW, flushing this data does not seem
like a strong requirement for something aimed at page-level
surgery or lookups, because the WAL segments should still be around
even if the host holding the archives is unplugged.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-04-25 Thread Michael Paquier
On Mon, Apr 25, 2022 at 10:11:10AM -0500, David Christensen wrote:
> On Mon, Apr 25, 2022 at 1:11 AM Michael Paquier  wrote:
>> I don't think that there is any need to rely on a new logic if there
>> is already some code in place able to do the same work.  See
>> verify_dir_is_empty_or_create() in pg_basebackup.c, as one example,
>> that relies on pg_check_dir().  I think that you'd better rely at
>> least on what pgcheckdir.c offers.
> 
> Yeah, though I am tending towards what another user had suggested and
> just accepting an existing directory rather than requiring it to be
> empty, so thinking I might just skip this one. (Will review the file
> you've pointed out to see if there is a relevant function though.)

OK.  FWIW, pg_check_dir() is used in initdb and pg_basebackup because
these care about the behavior to use when specifying a target path.
You could reuse it, but apply a different policy depending on its
return value, as you see fit in this case.

>> +   PageSetLSN(page, record->ReadRecPtr);
>> +   /* if checksum field is non-zero then we have checksums 
>> enabled,
>> +* so recalculate the checksum with new LSN (yes, this 
>> is a hack)
>> +*/
>> Yeah, that looks like a hack, but putting in place a page on a cluster
>> that has checksums enabled would be more annoying with
>> zero_damaged_pages enabled if we don't do that, so that's fine by me
>> as-is.  Perhaps you should mention that FPWs don't have their
>> pd_checksum updated when written.
> 
> Can make a mention; you thinking just a comment in the code is
> sufficient, or want there to be user-visible docs as well?

I was thinking about a comment, at least.

> This was to make the page as extracted equivalent to the effect of
> applying the WAL record block on replay (the LSN and checksum both);
> since recovery calls this function to mark the LSN where the page came
> from this is why I did this as well.  (I do see perhaps a case for
> --raw output that doesn't munge the page whatsoever, just made
> comparisons easier.)

Hm.  Okay.  The argument goes both ways, I guess, depending on what we
want to do with those raw pages.  Still, you should not need pd_lsn if
the point is to be able to stick the pages back in place to attempt to
get back as much data as possible when loading them back into shared
buffers?
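
For reference, the whole adjustment under discussion amounts to something
like this sketch (assuming the page buffer and its block number are at hand;
this is not the patch itself):

#include "postgres_fe.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

/*
 * Sketch only: stamp an extracted full-page image the way recovery would,
 * and recompute the checksum if the cluster has them enabled (a non-zero
 * pd_checksum is used as the hint, as in the patch).
 */
static void
fixup_extracted_page(Page page, XLogRecPtr lsn, BlockNumber blkno)
{
	PageSetLSN(page, lsn);

	if (((PageHeader) page)->pd_checksum != 0)
		((PageHeader) page)->pd_checksum =
			pg_checksum_page((char *) page, blkno);
}
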
--
Michael


signature.asc
Description: PGP signature


Re: [Proposal] vacuumdb --schema only

2022-04-25 Thread Nathan Bossart
On Tue, Apr 26, 2022 at 11:36:02AM +0900, Michael Paquier wrote:
> The refactoring logic to build the queries is clear to follow.  I have
> a few comments about the shape of the patch, though.

Thanks for taking a look!

> case 'a':
> -   alldb = true;
> +   check_objfilter(OBJFILTER_ALL_DBS);
> break;
> The cross-option checks are usually done after all the options
> switches are check.  Why does this need to be different?  It does not
> strike me as a huge problem to do one filter check at the end.

Makes sense.  I fixed this in v13.

> +void
> +check_objfilter(VacObjFilter curr_option)
> +{
> +   objfilter |= curr_option;
> +
> +   if ((objfilter & OBJFILTER_ALL_DBS) &&
> +   (objfilter & OBJFILTER_DATABASE))
> +   pg_fatal("cannot vacuum all databases and a specific one at the same 
> time");
> The addition of more OBJFILTER_* (unlikely going to happen, but who
> knows) would make it hard to know which option should not interact
> with each other.  Wouldn't it be better to use a kind of compatibility
> table for that?  As one OBJFILTER_* matches with one option, you could
> simplify the number of strings in need of translation by switching to
> an error message like "cannot use options %s and %s together", or
> something like that?

I think this might actually make things more complicated.  In addition to
the compatibility table, we'd need to define the strings to use in the
error message somewhere.  I can give this a try if you feel strongly about
it.

> +$node->command_fails(
> +   [ 'vacuumdb', '-a', '-d', 'postgres' ],
> +   'cannot use options -a and -d at the same time');
> This set of tests had better use command_fails_like() to make sure
> that the correct error patterns from check_objfilter() show up?

Yes.  I did this in v13.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7559e6a289058a36eb3b10414b929e52b0bc5cbf Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 22 Apr 2022 22:34:38 -0700
Subject: [PATCH v13 1/1] Add --schema and --exclude-schema options to
 vacuumdb.

These two new options can be used to either process all tables in
specific schemas or to skip processing all tables in specific
schemas.  This change also refactors the handling of invalid
combinations of command-line options to a new helper function.

Author: Gilles Darold
Reviewed-by: Justin Pryzby, Nathan Bossart
Discussion: https://postgr.es/m/929fbf3c-24b8-d454-811f-1d5898ab3e91%40migops.com
---
 doc/src/sgml/ref/vacuumdb.sgml|  66 
 src/bin/scripts/t/100_vacuumdb.pl |  42 
 src/bin/scripts/vacuumdb.c| 172 +++---
 3 files changed, 239 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 956c0f01cb..841aced3bd 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -39,6 +39,40 @@ PostgreSQL documentation
dbname
   
 
+  
+   vacuumdb
+   connection-option
+   option
+
+   
+
+ 
+   
+
+ 
+  -n
+  --schema
+ 
+ schema
+
+   
+
+   
+
+ 
+  -N
+  --exclude-schema
+ 
+ schema
+
+   
+ 
+
+   
+
+   dbname
+  
+
   
vacuumdb
connection-option
@@ -244,6 +278,30 @@ PostgreSQL documentation
   
  
 
+ 
+  -n schema
+  --schema=schema
+  
+   
+Clean or analyze all tables in
+schema only.  Multiple
+schemas can be vacuumed by writing multiple -n switches.
+   
+  
+ 
+
+ 
+  -N schema
+  --exclude-schema=schema
+  
+   
+Do not clean or analyze any tables in
+schema.  Multiple schemas
+can be excluded by writing multiple -N switches.
+   
+  
+ 
+
  
   --no-index-cleanup
   
@@ -619,6 +677,14 @@ PostgreSQL documentation
 $ vacuumdb --analyze --verbose --table='foo(bar)' xyzzy
 
 
+   
+To clean all tables in the foo and bar schemas
+in a database named xyzzy:
+
+$ vacuumdb --schema='foo' --schema='bar' xyzzy
+
+
+
  
 
  
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index 96a818a3c1..e5343774fe 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -103,6 +103,8 @@ $node->safe_psql(
   CREATE TABLE funcidx (x int);
   INSERT INTO funcidx VALUES (0),(1),(2),(3);
   CREATE INDEX i0 ON funcidx ((f1(x)));
+  CREATE SCHEMA "Foo";
+  CREATE TABLE "Foo".bar(id int);
 |);
 $node->command_ok([qw|vacuumdb -Z --table="need""q(uot"(")x") postgres|],
 	'column list');
@@ -146,5 +148,45 @@ $node->issues_sql_like(
 	[ 'vacuumdb', '--min-xid-age', '2147483001', 'postgres' ],
 	qr/GREATEST.*relfrozenxid.*2147483001/,
 	'vacuumdb --table --min-xid-age');
+$node->issues_sql_like(
+	[ 'vacuumdb', '--schema', '"Foo"', 'postgres' ],
+	qr/VACUUM "Foo".bar/,
+

Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.

2022-04-25 Thread Julien Rouhaud
Hi,

Please keep the list in copy, especially if it's about Windows specifics, as
I'm definitely not very knowledgeable about it.

On Fri, Apr 01, 2022 at 09:18:03AM +, Wilm Hoyer wrote:
> 
> If you don't wanna go the manifest way, maybe the RtlGetVersion function is 
> the one you need:
> https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlgetversion?redirectedfrom=MSDN

Thanks for the info!  I tried to use the function but trying to include either
wdm.h or Ntddk.h errors out.  Unfortunately I don't know how to look for a file
in Windows so I don't even know if those files are present.

I searched a bit and apparently some people are using this function by directly
opening some DLL, which seems wrong.

> Another Idea on windows machines would be to use the commandline to execute
> ver in a separate Process and store the result in a file.

That also seems hackish, I don't think that we want to rely on something like
that.

> >> While winver.exe on the same vm says windows 11, version 21H2, build 
> >> 22000.493.
> 
> > So, what GetVersionEx returns is actually "it depends", and this is 
> > documented:
> 
> >> With the release of Windows 8.1, the behavior of the GetVersionEx API 
> >> has changed in the value it will return for the operating system 
> >> version. The value returned by the GetVersionEx function now depends 
> >> on how the application is manifested.
> >>
> >> Applications not manifested for Windows 8.1 or Windows 10 will return 
> >> the Windows 8 OS version value (6.2). Once an application is 
> >> manifested for a given operating system version, GetVersionEx will 
> >> always return the version that the application is manifested for in 
> >> future releases. To manifest your applications for Windows 8.1 or 
> >> Windows 10, refer to Targeting your application for Windows.
> 
> The documentation is a bit unclear - with the correct functions you should
> get:
> Minimum( actualOS-Version, Maximum(Manifested OS Versions))
> The idea behind it, as I understand it, is to better support virtualization and
> backward compatibility - if you manifest only Windows 8.1, then you always get
> a system that behaves like Windows 8.1 in every aspect.  (Not true in every
> aspect in some corner cases, due to security patches.)

Well, it clearly does *NOT* behave like Windows 8.1, even if for some reason
large pages rely on security patches.

Their API is entirely useless, so I'm still of the opinion that we should
unconditionally use the FILE_MAP_LARGE_PAGES flag if it's defined and call it a
day.
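
Concretely, that would look something like this sketch (not the actual
win32_shmem.c code; error handling and the mapping setup are omitted):

#include <windows.h>

/*
 * Sketch only: request large pages in the mapping flags whenever the SDK
 * we build against knows about FILE_MAP_LARGE_PAGES, instead of trying to
 * detect the Windows version at runtime.
 */
static void *
map_shmem_sketch(HANDLE hmap, void *requested_addr, SIZE_T size)
{
	DWORD	flags = FILE_MAP_READ | FILE_MAP_WRITE;

#ifdef FILE_MAP_LARGE_PAGES
	flags |= FILE_MAP_LARGE_PAGES;
#endif

	return MapViewOfFileEx(hmap, flags, 0, 0, size, requested_addr);
}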




Re: pgsql: Add contrib/pg_walinspect.

2022-04-25 Thread Michael Paquier
Hi Jeff,

On Fri, Apr 08, 2022 at 07:27:44AM +, Jeff Davis wrote:
> Add contrib/pg_walinspect.
> 
> Provides similar functionality to pg_waldump, but from a SQL interface
> rather than a separate utility.

The tests of pg_walinspect look unstable:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47

 SELECT COUNT(*) >= 0 AS ok FROM
 pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok
-
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40

This points at ReadNextXLogRecord(), though I would not blame this
test suite, as you create a physical replication slot beforehand.
Could this be an issue related to the addition of the circular WAL
decoding buffer, aka 3f1ce97?

I am adding an open item about that.

Thanks,
--
Michael


signature.asc
Description: PGP signature


Re: CLUSTER on partitioned index

2022-04-25 Thread Michael Paquier
On Sat, Apr 16, 2022 at 08:58:50PM +0900, Michael Paquier wrote:
> Well, I am a bit annoyed that we don't actually check that a CLUSTER
> command does not block when doing a CLUSTER on a partitioned table
> while a lock is held on one of its partitions.  So, attached is a
> proposal of patch to improve the test coverage in this area.  While on
> it, I have added a test with a normal table.  You can see the
> difference once you remove the ACL check added recently in
> get_tables_to_cluster_partitioned().  What do you think?

This was the last reason why this was listed as an open item, so,
hearing nothing, I have applied this patch to add those extra tests
and marked the item as fixed.
--
Michael


signature.asc
Description: PGP signature


Re: Making JIT more granular

2022-04-25 Thread David Rowley
(This is an old thread. See [1] if you're missing the original email.)

On Tue, 4 Aug 2020 at 14:01, David Rowley  wrote:
> At the moment JIT compilation, if enabled, is applied to all
> expressions in the entire plan.  This can sometimes be a problem as
> some expressions may be evaluated lots and warrant being JITted, but
> others may only be evaluated just a few times, or even not at all.
>
> This problem tends to become large when table partitioning is involved
> as the number of expressions in the plan grows with each partition
> present in the plan.  Some partitions may have many rows and it can be
> useful to JIT expression, but others may have few rows or even no
> rows, in which case JIT is a waste of effort.

This patch recently came up again in [2], where Magnus proposed we add
a new GUC [3] to warn users when JIT compilation takes longer than the
specified fraction of execution time. Over there I mentioned that I
think it might be better to have a go at making the JIT costing better
so that it's more aligned to the amount of JITing work there is to do
rather than the total cost of the plan without any consideration about
how much there is to JIT compile.

In [4], Andres reminded me that I need to account for the number of
times a given plan is (re)scanned rather than just the total_cost of
the Plan node. There followed some discussion about how that might be
done.

I've loosely implemented this in the attached patch. In order to get
the information about the expected number of "loops" a given Plan node
will be subject to, I've modified create_plan() so that it passes this
value down recursively while creating the plan.  Nodes such as Nested
Loop multiply the "est_calls" by the number of outer rows. For nodes
such as Material, I've made the estimated calls = 1.0.  Memoize must
take into account the expected cache hit ratio, which I've had to
record as part of MemoizePath so that create_plan knows about that.
Altogether, this is a fair bit of churn for createplan.c, and it's
still only part of the way there.  When planning subplans, we do
create_plan() right away and since we plan subplans before the outer
plans, we've no idea how many times the subplan will be rescanned. So
to make this work fully I think we'd need to modify the planner so
that we delay the create_plan() for subplans until sometime after
we've planned the outer query.
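
The end result that the est_calls value feeds into can be sketched roughly
like this (purely illustrative -- not the patch's actual code, and where the
decision ends up living is still open):

/*
 * Sketch only: judge JIT-worthiness per plan node by scaling the node's
 * cost by the estimated number of times it will be (re)scanned, instead
 * of comparing the plan's single total_cost against jit_above_cost once.
 */
static bool
jit_worthwhile_sketch(Plan *plan, double est_calls)
{
	return jit_enabled &&
		plan->total_cost * est_calls >= jit_above_cost;
}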

The reason that I'm posting about this now is mostly because I did say
I'd come back to this patch for v16, and I'm also feeling bad that I
-1'd Magnus' patch, which likely resulted in making zero forward
progress in improving JIT and its costing situation for v15.

The reason I've not completed this patch to fix the deficiencies
regarding subplans is that that's quite a bit of work and I don't
really want to do that right now.  We might decide that JIT costing
should work in a completely different way that does not require
estimating how many times a plan node will be rescanned.  I think
there's enough patch here to allow us to test this and then decide if
it's any good or not.

There's also maybe some controversy in the patch. I ended up modifying
EXPLAIN so that it shows loops=N as part of the estimated costs.  I
understand there's likely to be fallout from doing that, as there are
various tools around that this would likely break.  I added that for a
couple of reasons: 1) I think it would be tricky to figure out why JIT
was or was not enabled without showing that in EXPLAIN, and 2) I
needed to display it somewhere for my testing so I could figure out if
I'd done something wrong when calculating the value during
create_plan().

This basically looks like:

postgres=# explain select * from h, h h1, h h2;
QUERY PLAN
--
 Nested Loop  (cost=0.00..12512550.00 rows=10 width=12)
   ->  Nested Loop  (cost=0.00..12532.50 rows=100 width=8)
 ->  Seq Scan on h  (cost=0.00..15.00 rows=1000 width=4)
 ->  Materialize  (cost=0.00..20.00 rows=1000 width=4 loops=1000)
   ->  Seq Scan on h h1  (cost=0.00..15.00 rows=1000 width=4)
   ->  Materialize  (cost=0.00..20.00 rows=1000 width=4 loops=100)
 ->  Seq Scan on h h2  (cost=0.00..15.00 rows=1000 width=4)
(7 rows)

Just the same as EXPLAIN ANALYZE, I've coded loops= to only show when
there's more than 1 loop. You can also see that the node below
Materialize is not expected to be scanned multiple times. Technically
it could be when a parameter changes, but right now it seems more
trouble than it's worth to estimate that during
create_plan(). There's also some variation between the expected loops and
the actual ones with regard to parallel workers. In the estimate, this is just
the number of times an average worker is expected to invoke the plan,
whereas the actual "loops" is the sum of each worker's invocations.

The other slight controversy that I can see in the patch is

Re: pgsql: Add contrib/pg_walinspect.

2022-04-25 Thread Tom Lane
Michael Paquier  writes:
> Could this be an issue related to the addition of the circular WAL
> decoding buffer, aka 3f1ce97?

I already whined about this at [1].

I've been wondering if the issue could be traced to topminnow's unusual
hardware properties, specifically that it has MAXALIGN 8 even though
it's only a 32-bit machine per sizeof(void *).  I think the only
other active buildfarm animal like that is my gaur ... but I've
failed to reproduce it on gaur.  Best guess at the moment is that
it's a timing issue that topminnow manages to reproduce often.

Anyway, as I said in the other thread, I can reproduce it on
topminnow's host.  Let me know if you have ideas about how to
home in on the cause.

regards, tom lane

[1] https://www.postgresql.org/message-id/111657.1650910309%40sss.pgh.pa.us




Re: CLUSTER on partitioned index

2022-04-25 Thread Alvaro Herrera
On 2022-Apr-26, Michael Paquier wrote:

> On Sat, Apr 16, 2022 at 08:58:50PM +0900, Michael Paquier wrote:
> > Well, I am a bit annoyed that we don't actually check that a CLUSTER
> > command does not block when doing a CLUSTER on a partitioned table
> > while a lock is held on one of its partitions.  So, attached is a
> > proposal of patch to improve the test coverage in this area.
> 
> This was the last reason why this was listed as an open item, so,
> hearing nothing, I have applied this patch to add those extra tests,
> and switched the item as fixed.

Thank you!

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: pgsql: Add contrib/pg_walinspect.

2022-04-25 Thread Michael Paquier
On Tue, Apr 26, 2022 at 01:25:14AM -0400, Tom Lane wrote:
> I've been wondering if the issue could be traced to topminnow's unusual
> hardware properties, specifically that it has MAXALIGN 8 even though
> it's only a 32-bit machine per sizeof(void *).  I think the only
> other active buildfarm animal like that is my gaur ... but I've
> failed to reproduce it on gaur.  Best guess at the moment is that
> it's a timing issue that topminnow manages to reproduce often.

I have managed to miss your message.  Let's continue the discussion
there, then.

> Anyway, as I said in the other thread, I can reproduce it on
> topminnow's host.  Let me know if you have ideas about how to
> home in on the cause.

Nice.  I have not been able to do so, but based on the lack of
reports from the buildfarm, that's not surprising.
--
Michael


signature.asc
Description: PGP signature


RE: Skipping schema changes in publication

2022-04-25 Thread osumi.takami...@fujitsu.com
On Thursday, April 21, 2022 12:15 PM vignesh C  wrote:
> Updated patch by changing the syntax to use EXCEPT instead of SKIP.
Hi


This is my review comments on the v2 patch.

(1) gram.y

I think we can make a unified function that merges
preprocess_alltables_pubobj_list with check_except_in_pubobj_list.

With regard to preprocess_alltables_pubobj_list,
we don't use the 2nd argument "location" in this function.

(2) create_publication.sgml

+  
+   Create a publication that publishes all changes in all the tables except for
+   the changes of users and
+   departments table;

This sentence should end with ":", not ";".

(3) publication.out & publication.sql

+-- fail - can't set except table to schema  publication
+ALTER PUBLICATION testpub_forschema SET EXCEPT TABLE testpub_tbl1;

There is one unnecessary space in the comment.
Kindly change from "schema  publication" to "schema publication".

(4) pg_dump.c & describe.c

In your first email of this thread, you explained that this feature
is for PG16. Don't we need an additional version branch for PG16?

@@ -6322,6 +6328,21 @@ describePublications(const char *pattern)
}
}

+   if (pset.sversion >= 15)
+   {


@@ -4162,7 +4164,7 @@ getPublicationTables(Archive *fout, TableInfo tblinfo[], 
int numTables)
/* Collect all publication membership info. */
if (fout->remoteVersion >= 15)
appendPQExpBufferStr(query,
-"SELECT tableoid, oid, 
prpubid, prrelid, "
+"SELECT tableoid, oid, 
prpubid, prrelid, prexcept,"


(5) psql-ref.sgml

+If + is appended to the command name, the tables,
+except tables and schemas associated with each publication are shown as
+well.

I'm not sure if "except tables" is a good description.
I suggest "excluded tables". This applies to the entire patch,
in case if this is reasonable suggestion.


Best Regards,
Takamichi Osumi



Re: WIP: WAL prefetch (another approach)

2022-04-25 Thread Thomas Munro
On Tue, Apr 26, 2022 at 6:11 AM Tom Lane  wrote:
> I believe that the WAL prefetch patch probably accounts for the
> intermittent errors that buildfarm member topminnow has shown
> since it went in, eg [1]:
>
> diff -U3 
> /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out 
> /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out

Hmm, maybe but I suspect not.  I think I might see what's happening here.

> +ERROR:  could not read WAL at 0/1903E40

> I've reproduced this manually on that machine, and confirmed that the
> proximate cause is that XLogNextRecord() is returning NULL because
> state->decode_queue_head == NULL, without bothering to provide an errormsg
> (which doesn't seem very well thought out in itself).  I obtained the

Thanks for doing that.  After several hours of trying I also managed
to reproduce it on that gcc23 system (not at all sure why it doesn't
show up elsewhere; MIPS 32 bit layout may be a factor), and added some
trace to get some more clues.  Still looking into it, but here is the
current hypothesis I'm testing:

1.  The reason there's a messageless ERROR in this case is because
there is new read_page callback logic introduced for pg_walinspect,
called via read_local_xlog_page_no_wait(), which is like the old
read_local_xlog_page() except that it returns -1 if you try to read
past the current "flushed" LSN, and we have no queued message.  An
error is then reported by XLogReadRecord(), and appears to the user.

2.  The reason pg_walinspect tries to read WAL data past the flushed
LSN is because its GetWALRecordsInfo() function keeps calling
XLogReadRecord() until EndRecPtr >= end_lsn, where end_lsn is taken
from a snapshot of the flushed LSN, but I don't see where it takes
into account that the flushed LSN might momentarily fall in the middle
of a record.  In that case, xlogreader.c will try to read the next
page, which fails because it's past the flushed LSN (see point 1).

I will poke some more tomorrow to try to confirm this and try to come
up with a fix.




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-04-25 Thread Laurenz Albe
On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> With synchronous replication typically all the transactions (txns)
> first locally get committed, then streamed to the sync standbys and
> the backend that generated the transaction will wait for ack from sync
> standbys. While waiting for ack, it may happen that the query or the
> txn gets canceled (QueryCancelPending is true) or the waiting backend
> is asked to exit (ProcDiePending is true). In either of these cases,
> the wait for ack gets canceled and leaves the txn in an inconsistent
> state [...]
> 
> Here's a proposal (mentioned previously by Satya [1]) to avoid the
> above problems:
> 1) Wait a configurable amount of time before the backends cancel sync
> replication, i.e. delay processing of
> QueryCancelPending and ProcDiePending.  Introduced a new timeout GUC
> synchronous_replication_naptime_before_cancel; when set, it will let
> the backends wait for the ack before canceling the synchronous
> replication so that the transaction can be available on sync standbys
> as well.
> 2) Wait for sync standbys to catch up upon restart after the crash or
> in the next txn after the old locally committed txn was canceled.

While this may mitigate the problem, I don't think it will deal with
all the cases which could cause a transaction to end up committed locally,
but not on the synchronous standby.  I think that only using the full
power of two-phase commit can make this bulletproof.

Is it worth adding additional complexity that is not a complete solution?

Yours,
Laurenz Albe