Re: avoid multiple hard links to same WAL file after a crash
On Tue, Apr 26, 2022 at 01:09:35PM -0700, Nathan Bossart wrote:
> Here is an attempt at creating something that can be back-patched. 0001
> simply replaces calls to durable_rename_excl() with durable_rename() and is
> intended to be back-patched. 0002 removes the definition of
> durable_rename_excl() and is _not_ intended for back-patching. I imagine
> 0002 will need to be held back for v16devel.

I would not mind applying 0002 on HEAD now to avoid more uses of this API, and I can get behind 0001 after thinking more about it.

> I think back-patching 0001 will encounter a couple of small obstacles. For
> example, the call in basic_archive won't exist on most of the
> back-branches, and durable_rename_excl() was named durable_link_or_rename()
> before v13. I don't mind producing a patch for each back-branch if needed.

I am not sure that we have any need to backpatch this change, based on the unlikeliness of the problem, TBH. One thing that is itching me a bit, like Robert upthread, is that we no longer check that the newfile does not exist in these code paths, because we never expect one. It is possible to use stat() for that, but access() within a simple assertion would be simpler? Say something like:

Assert(access(path, F_OK) != 0 && errno == ENOENT);

The case for basic_archive is limited, as the comment of the patch states, but that would be helpful for the two calls in timeline.c and the one in xlog.c in the long term. And this has no need to be part of fd.c; it can be added before the durable_rename() calls. What do you think?
--
Michael
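For illustration, a minimal sketch of that suggestion at a durable_rename() call site (placeholder paths and error level; not the actual timeline.c code):

    /*
     * Sketch of the suggestion above, not actual source: assert that the
     * rename target does not already exist, then do the plain durable
     * rename.  With elevel = ERROR, durable_rename() ereports on failure.
     */
    Assert(access(path, F_OK) != 0 && errno == ENOENT);
    durable_rename(tmppath, path, ERROR);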
Re: Fix NULL pointer reference in _outPathTarget()
On 22.04.22 16:18, Tom Lane wrote:
> Peter Eisentraut writes:
>> On 20.04.22 18:53, Tom Lane wrote:
>>> Yeah, that's another way to do it. I think though that the unresolved
>>> question is whether or not we want the field name to appear in the output
>>> when the field is null. I believe that I intentionally made it not appear
>>> originally, so that that case could readily be distinguished. You could
>>> argue that that would complicate life greatly for a _readPathTarget()
>>> function, which is true, but I don't foresee that we'll need one.
>>
>> We could adapt the convention to print NULL values as "<>", like ...
>
> Works for me.

done
Fix primary crash continually with invalid checkpoint after promote
A newly promoted primary may be left with an invalid checkpoint. In CreateRestartPoint(), the control file is updated and old WAL files are removed. But in some situations the control file is not updated while old WAL files are still removed, which produces an invalid checkpoint pointing at nonexistent WAL. The telltale log messages are "invalid primary checkpoint record" and "could not locate a valid checkpoint record".

The following timeline reproduces the situation:

tl1: the standby begins to create a restart point (time- or WAL-triggered).
tl2: the standby promotes, and the control file state is updated to DB_IN_PRODUCTION. The control file will not be updated (xlog.c:9690), but old WAL files are still removed (xlog.c:9719).
tl3: the standby becomes primary. The primary may crash before the next complete checkpoint (OOM in my situation), after which it crashes continually with an invalid checkpoint.

The attached patch reproduces this problem using a standard PostgreSQL Perl test; you can run it with:

./configure --enable-tap-tests; make -j; make -C src/test/recovery/ check PROVE_TESTS=t/027_invalid_checkpoint_after_promote.pl

The attached patch also fixes the problem by ensuring that old WAL files are removed only after the control file is updated.

0001-Fix-primary-crash-continually-with-invalid-checkpoint-after-promote.patch
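To make the proposed ordering concrete, here is a rough sketch (simplified, with placeholder variable names; not the attached patch itself):

    /*
     * Rough sketch of the fix described above, not the attached patch:
     * only remove old WAL segments if the control file was actually
     * updated to reference the new restart point.
     */
    bool        updated = false;

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        UpdateControlFile();
        updated = true;
    }
    LWLockRelease(ControlFileLock);

    /* With the fix, WAL removal is skipped when the update was skipped. */
    if (updated)
        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);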
Wrong rows count in EXPLAIN
When I create a new table and then evaluate the execution of a SELECT query, I see a strange row count in EXPLAIN:

CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
ANALYZE test1;
EXPLAIN SELECT * FROM test1;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
(1 row)

The table is empty but rows=2550. Seems like it was calculated from some default values. Is this normal behavior or a bug? Can it lead to a poor choice of the plan of a query in general?
Re: Fix primary crash continually with invalid checkpoint after promote
On Tue, Apr 26, 2022 at 03:16:13PM +0800, Zhao Rui wrote: > In function CreateRestartPoint, control file is updated and old wals are > removed. But in some situations, control file is not updated, old wals are > still removed. Thus produces an invalid checkpoint with nonexistent wal. > Crucial log: "invalid primary checkpoint record", "could not locate a valid > checkpoint record". I think this is the same issue tracked here: [0]. [0] https://postgr.es/m/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Tue, Apr 26, 2022 at 12:54:35PM +0800, Julien Rouhaud wrote:
> I searched a bit and apparently some people are using this function directly,
> opening some dll, which seems wrong.

I was wondering about this whole business, and the manifest approach is a *horrible* design for an API whose goal is to know if your run-time environment is greater than a given threshold.

>> Another idea on Windows machines would be to use the command line to execute
>> ver in a separate process and store the result in a file.
>
> That also seems hackish, I don't think that we want to rely on something like
> that.

Hmm. That depends on the dependency set, I guess. We do that to some extent on Linux for large pages in sysv_shmem.c. Perhaps this could work for Win10 if it avoids the extra loopholes with the manifests.

> Their API is entirely useless,

I agree with this.

> so I'm still of the opinion that we should unconditionally use the
> FILE_MAP_LARGE_PAGES flag if it's defined and call it a day.

Are we sure that this is not going to cause failures in environments where the flag is not supported?
--
Michael
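For reference, the "opening some dll" trick mentioned above usually looks roughly like the following; this is only an illustration of what people do, not an endorsement:

    #include <windows.h>
    #include <stdbool.h>

    typedef LONG (WINAPI * RtlGetVersionFn) (PRTL_OSVERSIONINFOW);

    /*
     * Illustration of the "load ntdll directly" approach discussed above;
     * RtlGetVersion is not subject to the manifest-based version lie, but
     * it is a kernel-mode API and dubious to use from user-mode code.
     */
    static bool
    get_real_windows_version(RTL_OSVERSIONINFOW *info)
    {
        HMODULE     ntdll = GetModuleHandle(TEXT("ntdll.dll"));
        RtlGetVersionFn fn;

        if (ntdll == NULL)
            return false;
        fn = (RtlGetVersionFn) GetProcAddress(ntdll, "RtlGetVersion");
        if (fn == NULL)
            return false;
        info->dwOSVersionInfoSize = sizeof(*info);
        return fn(info) == 0;       /* 0 == STATUS_SUCCESS */
    }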
Re: pgsql: Add contrib/pg_walinspect.
On Wed, Apr 27, 2022 at 8:57 AM Bharath Rupireddy wrote:
>
> On Wed, Apr 27, 2022 at 8:45 AM Tom Lane wrote:
> >
> > I wrote:
> > > Thomas Munro writes:
> > >> BTW If you had your local change from debug.patch (upthread), that'd
> > >> defeat the patch. I mean this:
> > >> +if(!*errormsg)
> > >> + *errormsg = "decode_queue_head is null";
> > > Oh! Okay, I'll retry without that.
> >
> > I've now done several runs with your patch and not seen the test failure.
> > However, I think we ought to rethink this API a bit rather than just
> > apply the patch as-is. Even if it were documented, relying on
> > errormsg = NULL to mean something doesn't seem like a great plan.
>
> Sorry for being late in the game, occupied with other stuff.
>
> How about using private_data of XLogReaderState for
> read_local_xlog_page_no_wait, something like this?
>
> typedef struct ReadLocalXLogPageNoWaitPrivate
> {
>     bool end_of_wal;
> } ReadLocalXLogPageNoWaitPrivate;
>
> In read_local_xlog_page_no_wait:
>
> /* If asked, let's not wait for future WAL. */
> if (!wait_for_wal)
> {
>     private_data->end_of_wal = true;
>     break;
> }
>
> /*
>  * Opaque data for callbacks to use. Not used by XLogReader.
>  */
> void *private_data;

I found an easy way to reproduce this consistently (I think on any server). I basically generated a huge WAL record (I used a fun extension that I wrote - https://github.com/BRupireddy/pg_synthesize_wal - but one can use pg_logical_emit_message as well):

session 1:
select * from pg_synthesize_wal_record(1*1024*1024); --> generate a 1 MB WAL record first and make a note of the output lsn (lsn1)

session 2:
select * from pg_get_wal_records_info_till_end_of_wal(lsn1);
\watch 1

session 1:
select * from pg_synthesize_wal_record(1000*1024*1024); --> generate a ~1 GB WAL record, and we see ERROR: could not read WAL at X in session 2.

Delay the checkpoint (set checkpoint_timeout to 1hr) so that the WAL files are not recycled while we run the pg_walinspect functions; no other changes are required from the default initdb settings on the server.

And Thomas's patch fixes the issue.

Regards,
Bharath Rupireddy.
Re: Wrong rows count in EXPLAIN
Hi!

> On 26 Apr 2022, at 13:45, Пантюшин Александр Иванович wrote:
>
> When I create a new table, and then I evaluate the execution of the SELECT
> query, I see a strange rows count in EXPLAIN
> CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
> ANALYZE test1;
> EXPLAIN SELECT * FROM test1;
>                        QUERY PLAN
> ---------------------------------------------------------
>  Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
> (1 row)
>
> Table is empty but rows=2550. Seem like it was calculated from some default
> values.
> Is this normal behavior or a bug? Can it lead to a poor choice of the plan
> of a query in general?

Which Postgres version do you use? I observe:

postgres=# CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
CREATE TABLE
postgres=# ANALYZE test1;
ANALYZE
postgres=# EXPLAIN SELECT * FROM test1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Only Scan using test1_pkey on test1  (cost=0.12..8.14 rows=1 width=4)
(1 row)

postgres=# select version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 15devel on x86_64-apple-darwin19.6.0, compiled by Apple clang version 11.0.3 (clang-1103.0.32.62), 64-bit
(1 row)

Without "ANALYZE test1;", table_block_relation_estimate_size() assumes the relation size is 10 blocks.

Best regards, Andrey Borodin.
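To make the arithmetic concrete, here is a small standalone sketch; the overhead constants are assumptions for illustration, not the server's exact values:

    #include <math.h>
    #include <stdio.h>

    /*
     * Standalone sketch, not PostgreSQL source: a never-analyzed,
     * zero-page relation is assumed to span 10 blocks, and the row
     * estimate is derived from the tuple width.
     */
    static double
    default_row_estimate(int data_width)
    {
        const double blcksz = 8192.0;       /* default block size */
        const double page_overhead = 24.0;  /* assumed page header size */
        const double tuple_overhead = 28.0; /* assumed per-tuple overhead */
        double      density = (blcksz - page_overhead) /
                              (tuple_overhead + data_width);

        return rint(density * 10.0);        /* 10 assumed pages */
    }

    int
    main(void)
    {
        /* For a 4-byte column this lands near the rows=2550 seen above. */
        printf("%.0f\n", default_row_estimate(4));
        return 0;
    }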
Re: bogus: logical replication rows/cols combinations
On 2022-Apr-26, Tomas Vondra wrote:

> I'm not quite sure which of the two behaviors is more "desirable". In a
> way, it's somewhat similar to publish_as_relid, which is also calculated
> not considering which of the row filters match?

I grepped doc/src/sgml for `publish_as_relid` and found no hits, so I suppose it's not a user-visible feature as such.

> But maybe you're right and it should behave the way you propose ... the
> example I have in mind is a use case replicating table with two types of
> rows - sensitive and non-sensitive. For sensitive, we replicate only
> some of the columns, for non-sensitive we replicate everything.

Exactly. If we blindly publish row/column values that aren't in *any* publications, this may lead to leaking protected values.

> Changing this to behave the way you expect would be quite difficult,
> because at the moment we build a single OR expression from all the row
> filters. We'd have to keep the individual expressions, so that we can
> build a column list for each of them (in order to ignore those that
> don't match).

I think we should do that, yeah.

> I can take a stab at it, but it seems strange to not apply the same
> logic to evaluation of publish_as_relid. I wonder what Amit thinks about
> this, as he wrote the row filter stuff.

By grepping publicationcmds.c, it seems that publish_as_relid refers to the ancestor partitioned table that is used for column list and rowfilter determination, when a partition is being published as part of it. I don't think these things are exactly parallel.

... In fact I think they are quite orthogonal: probably you should be able to publish a partitioned table in two publications, with different rowfilters and different column lists (which can come from the topmost partitioned table), and each partition should still work in the way I describe above.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere." (Lamar Owen)
Re: bogus: logical replication rows/cols combinations
On 2022-Apr-27, Amit Kapila wrote: > On Tue, Apr 26, 2022 at 4:00 AM Tomas Vondra > wrote: > > I can take a stab at it, but it seems strange to not apply the same > > logic to evaluation of publish_as_relid. > > Yeah, the current behavior seems to be consistent with what we already > do. Sorry, this argument makes no sense to me. The combination of both features is not consistent, and both features are new. 'publish_as_relid' is an implementation detail. If the implementation fails to follow the feature design, then the implementation must be fixed ... not the design! IMO, we should first determine how we want row filters and column lists to work when used in conjunction -- for relations (sets of rows) in a general sense. After we have done that, then we can use that design to drive how we want partitioned tables to be handled for it. Keep in mind that when users see a partitioned table, what they first see is a table. They want all their tables to work in pretty much the same way -- partitioned or not partitioned. The fact that a table is partitioned should affect as little as possible the way it interacts with other features. Now, another possibility is to say "naah, this is too hard", or even "naah, there's no time to write all that for this release". That might be okay, but in that case let's add an implementation restriction to ensure that we don't paint ourselves in a corner regarding what is reasonable behavior. For example, an easy restriction might be: if a table is in multiple publications with mismatching row filters/column lists, then a subscriber is not allowed to subscribe to both publications. (Maybe this restriction isn't exactly what we need so that it actually implements what we need, not sure). Then, if/when in the future we implement this correctly, we can lift the restriction. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Tue, Apr 26, 2022, 05:55 Julien Rouhaud wrote:
> Hi,
>
> Please keep the list in copy, especially if that's about Windows specifics, as
> I'm definitely not very knowledgeable about it.
>
> On Fri, Apr 01, 2022 at 09:18:03AM +0000, Wilm Hoyer wrote:
> >
> > If you don't wanna go the manifest way, maybe the RtlGetVersion function
> > is the one you need:
> > https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-rtlgetversion?redirectedfrom=MSDN
>
> Thanks for the info! I tried to use the function, but trying to include either
> wdm.h or Ntddk.h errors out. Unfortunately I don't know how to look for a file
> in Windows, so I don't even know if those files are present.

That's a kernel API for use in drivers, not in applications. You need the device driver development kit to get to the headers. It's not supposed to be used from a user-land application. But note the documentation comment that says: "RtlGetVersion is the kernel-mode equivalent of the user-mode GetVersionEx function in the Windows SDK."

TL;DR: user-mode applications are supposed to use GetVersionEx().

/Magnus
RE: bogus: logical replication rows/cols combinations
On Wednesday, April 27, 2022 12:56 PM, Amit Kapila wrote:
> On Tue, Apr 26, 2022 at 4:00 AM Tomas Vondra wrote:
> >
> > On 4/25/22 17:48, Alvaro Herrera wrote:
> > >
> > > The desired result on subscriber is:
> > >
> > > table uno;
> > >   a │ b │ c
> > > ────┼───┼───
> > >   1 │ 2 │
> > >  -1 │   │ 4
> > >
> > > Thoughts?
> >
> > I'm not quite sure which of the two behaviors is more "desirable". In a
> > way, it's somewhat similar to publish_as_relid, which is also calculated
> > not considering which of the row filters match?
>
> Right, or in other words, we check all publications to decide it and
> similar is the case for publication actions which are also computed
> independently for all publications.
>
> > But maybe you're right and it should behave the way you propose ... the
> > example I have in mind is a use case replicating table with two types of
> > rows - sensitive and non-sensitive. For sensitive, we replicate only
> > some of the columns, for non-sensitive we replicate everything. Which
> > could be implemented as two publications
> >
> > create publication sensitive_rows
> >    for table t (a, b) where (is_sensitive);
> >
> > create publication non_sensitive_rows
> >    for table t where (not is_sensitive);
> >
> > But the way it's implemented now, we'll always replicate all columns,
> > because the second publication has no column list.
> >
> > Changing this to behave the way you expect would be quite difficult,
> > because at the moment we build a single OR expression from all the row
> > filters. We'd have to keep the individual expressions, so that we can
> > build a column list for each of them (in order to ignore those that
> > don't match).
> >
> > We'd have to remove various other optimizations - for example we can't
> > just discard row filters if we found "no_filter" publication.
>
> I don't think that is the right way. We need some way to combine
> expressions and I feel the current behavior is sane. I mean to say
> that even if there is one publication that has no filter (column/row),
> we should publish all rows with all columns. Now, as mentioned above,
> combining row filters or column lists for all publications appears to
> be consistent with what we already do and seems correct behavior to me.
>
> To me, it appears that the method used to decide whether a particular
> table is published or not is also similar to what we do for row
> filters or column lists. Even if there is one publication that
> publishes all tables, we consider the current table to be published
> irrespective of whether other publications have published that table
> or not.
>
> > Or more
> > precisely, we'd have to consider column lists too.
> >
> > In other words, we'd have to merge pgoutput_column_list_init into
> > pgoutput_row_filter_init, and then modify pgoutput_row_filter to
> > evaluate the row filters one by one, and build the column list.
>
> Hmm, I think even if we want to do something here, we also need to
> think about how to achieve similar behavior for initial tablesync
> which will be more tricky.

I think it could be difficult to make the initial tablesync behave the same. Currently, we use a "COPY" command to do the table sync, and I am not sure how to change the "COPY" query to achieve the expected behavior here.

BTW, for the use case mentioned here:

"""
replicating table with two types of rows - sensitive and non-sensitive. For
sensitive, we replicate only some of the columns, for non-sensitive we
replicate everything.
"""

One approach to do this is to create two subscriptions and two publications, which seems like a workaround.

---
create publication uno for table uno (a, b) where (a > 0);
create publication dos for table uno (a, c) where (a < 0);
create subscription sub_uno connection 'port=55432 dbname=alvherre' publication uno;
create subscription sub_dos connection 'port=55432 dbname=alvherre' publication dos;
---

Best regards,
Hou zj
Re: pgsql: Add contrib/pg_walinspect.
On Wed, Apr 27, 2022 at 1:47 PM Bharath Rupireddy wrote:
> [... repro steps quoted in full in the previous message ...]
>
> Delay the checkpoint (set checkpoint_timeout to 1hr) so that the WAL
> files are not recycled while we run the pg_walinspect functions; no
> other changes are required from the default initdb settings on the
> server.
>
> And Thomas's patch fixes the issue.

Here's a v2 patch (on top of Thomas's v1 at [1]) using private_data to set the end-of-WAL flag. Please have a look at it.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com

Regards,
Bharath Rupireddy.

v2-0001-Fix-pg_walinspect-race-against-flush-LSN.patch
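For readers following along, the private_data approach amounts to roughly this; the names are taken from the sketch upthread and may differ from the final patch:

    /*
     * Rough outline of the private_data idea from upthread; assumed names,
     * not necessarily what the committed patch does.
     */
    typedef struct ReadLocalXLogPageNoWaitPrivate
    {
        bool        end_of_wal;     /* set once we decline to wait for WAL */
    } ReadLocalXLogPageNoWaitPrivate;

    /* Caller side: */
    ReadLocalXLogPageNoWaitPrivate priv = {0};
    XLogReaderState *xlogreader =
        XLogReaderAllocate(wal_segment_size, NULL,
                           XL_ROUTINE(.page_read = &read_local_xlog_page_no_wait,
                                      .segment_open = &wal_segment_open,
                                      .segment_close = &wal_segment_close),
                           &priv);

    for (;;)
    {
        char       *errormsg;
        XLogRecord *record = XLogReadRecord(xlogreader, &errormsg);

        if (record == NULL)
        {
            if (priv.end_of_wal)
                break;          /* flushed end of WAL: stop cleanly */
            if (errormsg)
                ereport(ERROR,
                        (errmsg("could not read WAL: %s", errormsg)));
        }
        /* ... decode and emit the record ... */
    }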
Re: bogus: logical replication rows/cols combinations
On Wed, Apr 27, 2022 at 3:13 PM Alvaro Herrera wrote: > > On 2022-Apr-26, Tomas Vondra wrote: > > > I'm not quite sure which of the two behaviors is more "desirable". In a > > way, it's somewhat similar to publish_as_relid, which is also calculated > > not considering which of the row filters match? > > I grepped doc/src/sgml for `publish_as_relid` and found no hits, so > I suppose it's not a user-visible feature as such. > `publish_as_relid` is computed based on 'publish_via_partition_root' setting of publication which is a user-visible feature. > > But maybe you're right and it should behave the way you propose ... the > > example I have in mind is a use case replicating table with two types of > > rows - sensitive and non-sensitive. For sensitive, we replicate only > > some of the columns, for non-sensitive we replicate everything. > > Exactly. If we blindly publish row/column values that aren't in *any* > publications, this may lead to leaking protected values. > > > Changing this to behave the way you expect would be quite difficult, > > because at the moment we build a single OR expression from all the row > > filters. We'd have to keep the individual expressions, so that we can > > build a column list for each of them (in order to ignore those that > > don't match). > > I think we should do that, yeah. > This can hit the performance as we need to evaluate each expression for each row. > > I can take a stab at it, but it seems strange to not apply the same > > logic to evaluation of publish_as_relid. I wonder what Amit thinks about > > this, as he wrote the row filter stuff. > > By grepping publicationcmds.c, it seems that publish_as_relid refers to > the ancestor partitioned table that is used for column list and > rowfilter determination, when a partition is being published as part of > it. > Yeah, this is true when the corresponding publication has set 'publish_via_partition_root' as true. > I don't think these things are exactly parallel. > Currently, when the subscription has multiple publications, we combine the objects, and actions of those publications. It happens for 'publish_via_partition_root', publication actions, tables, column lists, or row filters. I think the whole design works on this idea even the initial table sync. I think it might need a major change (which I am not sure about at this stage) if we want to make the initial sync also behave similar to what you are proposing. I feel it would be much easier to create two different subscriptions as mentioned by Hou-San [1] for the case you are talking about if the user really needs something like that. > ... In fact I think they are quite orthogonal: probably you should be > able to publish a partitioned table in two publications, with different > rowfilters and different column lists (which can come from the > topmost partitioned table), and each partition should still work in the > way I describe above. > We consider the column lists or row filters for either the partition (on which the current operation is performed) or partitioned table based on 'publish_via_partition_root' parameter of publication. [1] - https://www.postgresql.org/message-id/OS0PR01MB5716B82315A067F1D78F247E94FA9%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
Re: Wrong rows count in EXPLAIN
> On 27 Apr 2022, at 15:17, Пантюшин Александр Иванович wrote:
>
> Hi,
>
>> Which Postgres version do you use?
> I checked this on PG 11
> ...
> and on PG 13

Yes, I think before 3d351d91 [0] it was impossible to distinguish between an actually empty and a never-analyzed table. But now it is working just as you would expect. There's an interesting relevant discussion linked in the commit message.

Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/commit/3d351d916b20534f973eda760cde17d96545d4c4
Re: Wrong rows count in EXPLAIN
On Wed, 27 Apr 2022 at 21:08, Andrey Borodin wrote:
> Which Postgres version do you use?

3d351d91 changed things so we could tell the difference between a relation which was analyzed and is empty vs a relation that's never been analyzed. That's why you're not seeing the same behaviour as the OP.

Tom's commit message [1] also touches on the "safety measure". Here he's referring to the 2550 estimate, or more accurately, 10 pages filled with tuples of that width. This is intended so that newly created tables that are quickly loaded with data and then queried before auto-analyze gets a chance to run are not assumed to be empty. If we assumed these non-analyzed tables were empty, the planner would likely choose plans containing nodes like Seq Scans and non-parameterized Nested Loops rather than, say, Index Scans and Merge or Hash joins. The 10-page default aims to keep the planner from making that mistake. Generally, the planner underestimating the number of rows causes worse problems than overestimating them. So 10 seems much better than 0.

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3d351d916b20534f973eda760cde17d96545d4c4
Re: bogus: logical replication rows/cols combinations
On 2022-Apr-27, Amit Kapila wrote: > On Wed, Apr 27, 2022 at 3:13 PM Alvaro Herrera > wrote: > > > Changing this to behave the way you expect would be quite difficult, > > > because at the moment we build a single OR expression from all the row > > > filters. We'd have to keep the individual expressions, so that we can > > > build a column list for each of them (in order to ignore those that > > > don't match). > > > > I think we should do that, yeah. > > This can hit the performance as we need to evaluate each expression > for each row. So we do things because they are easy and fast, rather than because they work correctly? > > ... In fact I think they are quite orthogonal: probably you should be > > able to publish a partitioned table in two publications, with different > > rowfilters and different column lists (which can come from the > > topmost partitioned table), and each partition should still work in the > > way I describe above. > > We consider the column lists or row filters for either the partition > (on which the current operation is performed) or partitioned table > based on 'publish_via_partition_root' parameter of publication. OK, but this isn't relevant to what I wrote. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: bogus: logical replication rows/cols combinations
On Wed, Apr 27, 2022 at 4:27 PM Alvaro Herrera wrote:
>
> On 2022-Apr-27, Amit Kapila wrote:
>
> > On Wed, Apr 27, 2022 at 3:13 PM Alvaro Herrera wrote:
> > > > Changing this to behave the way you expect would be quite difficult,
> > > > because at the moment we build a single OR expression from all the row
> > > > filters. We'd have to keep the individual expressions, so that we can
> > > > build a column list for each of them (in order to ignore those that
> > > > don't match).
> > >
> > > I think we should do that, yeah.
> >
> > This can hit the performance as we need to evaluate each expression
> > for each row.
>
> So we do things because they are easy and fast, rather than because they
> work correctly?

The point is that I am not sure whether what you are saying is better behavior than the current one, but if others feel it is better, then we can try to do something for it. In the above sentence, I just wanted to say that it will impact performance, but if that is required then sure, we should do it that way.

--
With Regards,
Amit Kapila.
[PATCH v1] remove redundant check of item pointer
In ItemPointerEquals(), ItemPointerGetBlockNumber() already checks that the ItemPointer is valid; there is no need to check it again when fetching the offset, so use ItemPointerGetOffsetNumberNoCheck() instead.

Signed-off-by: Junwang Zhao
---
 src/backend/storage/page/itemptr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/page/itemptr.c b/src/backend/storage/page/itemptr.c
index 9011337aa8..61ad727b1d 100644
--- a/src/backend/storage/page/itemptr.c
+++ b/src/backend/storage/page/itemptr.c
@@ -37,8 +37,8 @@ ItemPointerEquals(ItemPointer pointer1, ItemPointer pointer2)
 
 	if (ItemPointerGetBlockNumber(pointer1) ==
 		ItemPointerGetBlockNumber(pointer2) &&
-		ItemPointerGetOffsetNumber(pointer1) ==
-		ItemPointerGetOffsetNumber(pointer2))
+		ItemPointerGetOffsetNumberNoCheck(pointer1) ==
+		ItemPointerGetOffsetNumberNoCheck(pointer2))
 		return true;
 	else
 		return false;
-- 
2.33.0
Re: Skipping schema changes in publication
On Tue, Apr 26, 2022 at 11:32 AM osumi.takami...@fujitsu.com wrote:
>
> On Thursday, April 21, 2022 12:15 PM vignesh C wrote:
> > Updated patch by changing the syntax to use EXCEPT instead of SKIP.
> Hi
>
> These are my review comments on the v2 patch.
>
> (1) gram.y
>
> I think we can make a unified function that merges
> preprocess_alltables_pubobj_list with check_except_in_pubobj_list.
>
> With regard to preprocess_alltables_pubobj_list,
> we don't use the 2nd argument "location" in this function.

Removed location and made a unified function.

> (2) create_publication.sgml
>
> + Create a publication that publishes all changes in all the tables except for
> + the changes of users and
> + departments table;
>
> This sentence should end with ":" not ";".

Modified.

> (3) publication.out & publication.sql
>
> +-- fail - can't set except table to schema  publication
> +ALTER PUBLICATION testpub_forschema SET EXCEPT TABLE testpub_tbl1;
>
> There is one unnecessary space in the comment.
> Kindly change from "schema  publication" to "schema publication".

Modified.

> (4) pg_dump.c & describe.c
>
> In your first email of this thread, you explained this feature
> is for PG16. Don't we need an additional branch for PG16?
>
> @@ -6322,6 +6328,21 @@ describePublications(const char *pattern)
> 		}
> 	}
>
> +	if (pset.sversion >= 15)
> +	{
>
> @@ -4162,7 +4164,7 @@ getPublicationTables(Archive *fout, TableInfo tblinfo[], int numTables)
> 	/* Collect all publication membership info. */
> 	if (fout->remoteVersion >= 15)
> 		appendPQExpBufferStr(query,
> -							 "SELECT tableoid, oid, prpubid, prrelid, "
> +							 "SELECT tableoid, oid, prpubid, prrelid, prexcept,"

Modified by adding a comment saying "FIXME: 15 should be changed to 16 later for PG16."

> (5) psql-ref.sgml
>
> +If + is appended to the command name, the tables,
> +except tables and schemas associated with each publication are shown as
> +well.
>
> I'm not sure if "except tables" is a good description.
> I suggest "excluded tables". This applies to the entire patch,
> in case this is a reasonable suggestion.

Modified it in most of the places where it was applicable. I felt the usage was OK in a few places.

Thanks for the comments; the attached v3 patch has the changes for the same.

Regards,
Vignesh

From f8a8dff478638f822377f2515f69df9c6cce501e Mon Sep 17 00:00:00 2001
From: Vigneshwaran C
Date: Wed, 20 Apr 2022 11:19:50 +0530
Subject: [PATCH v3] Skip publishing the tables specified in EXCEPT TABLE.

A new option "EXCEPT TABLE" in Create/Alter Publication allows one or more tables to be excluded; the publisher will exclude sending the data of the excluded tables to the subscriber. The new syntax allows specifying excluded tables. For example:

CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT TABLE t1,t2;
OR
ALTER PUBLICATION pub1 ADD EXCEPT TABLE t1,t2;

A new column prexcept is added to table "pg_publication_rel", to maintain the relations that the user wants to exclude publishing through the publication. Modified the output plugin (pgoutput) to exclude publishing the changes of the excluded tables. Updates pg_dump to identify and dump the excluded tables of the publications. Updates the \d family of commands to display excluded tables of the publications; the \dRp+ variant will now display associated except tables if any.

Bump catalog version.
---
 doc/src/sgml/catalogs.sgml                  |   9 ++
 doc/src/sgml/logical-replication.sgml       |   8 +-
 doc/src/sgml/ref/alter_publication.sgml     |  14 ++-
 doc/src/sgml/ref/create_publication.sgml    |  29 -
 doc/src/sgml/ref/psql-ref.sgml              |   5 +-
 src/backend/catalog/pg_publication.c        |  36 --
 src/backend/commands/publicationcmds.c      | 106 +++---
 src/backend/commands/tablecmds.c            |   4 +-
 src/backend/parser/gram.y                   |  78 +++--
 src/backend/replication/pgoutput/pgoutput.c |  25 ++---
 src/backend/utils/cache/relcache.c          |  17 ++-
 src/bin/pg_dump/pg_dump.c                   |  45 ++--
 src/bin/pg_dump/pg_dump.h                   |   1 +
 src/bin/pg_dump/pg_dump_sort.c              |   7 ++
 src/bin/pg_dump/t/002_pg_dump.pl            |  23
 src/bin/psql/describe.c                     |  52 +++--
 src/bin/psql/tab-complete.c                 |  15 ++-
 src/include/catalog/pg_publication.h        |   7 +-
 src/include/catalog/pg_publication_rel.h    |   1 +
 src/include/commands/publicationcmds.h      |   4 +-
 src/include/nodes/parsenodes.h              |   2 +
 src/test/regress/expected/publication.out   |  81 -
 src/test/regress/sql/publication.sql        |  40 ++-
 .../t/033_rep_changes_except_tabl
Re: Wrong rows count in EXPLAIN
Пантюшин Александр Иванович writes:
> When I create a new table, and then I evaluate the execution of the SELECT
> query, I see a strange rows count in EXPLAIN
> CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
> ANALYZE test1;
> EXPLAIN SELECT * FROM test1;
>                        QUERY PLAN
> ---------------------------------------------------------
>  Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
> (1 row)
>
> Table is empty but rows=2550.

This is intentional, arising from the planner's unwillingness to assume that a table is empty. It assumes that such a table actually contains (from memory) 10 pages, and then backs into a rowcount estimate from that depending on the data-type-dependent width of the table rows.

Without this provision, we'd produce very bad plans for cases where a newly-populated table hasn't been analyzed yet.

regards, tom lane
Re: Wrong rows count in EXPLAIN
On Wed, Apr 27, 2022 at 09:44:21AM -0400, Tom Lane wrote:
> Пантюшин Александр Иванович writes:
>> When I create a new table, and then I evaluate the execution of the SELECT
>> query, I see a strange rows count in EXPLAIN
>> CREATE TABLE test1(f INTEGER PRIMARY KEY NOT NULL);
>> ANALYZE test1;
>> EXPLAIN SELECT * FROM test1;
>>                        QUERY PLAN
>> ---------------------------------------------------------
>>  Seq Scan on test1  (cost=0.00..35.50 rows=2550 width=4)
>> (1 row)
>
>> Table is empty but rows=2550.
>
> This is intentional, arising from the planner's unwillingness to
> assume that a table is empty. It assumes that such a table actually
> contains (from memory) 10 pages, and then backs into a rowcount
> estimate from that depending on the data-type-dependent width of
> the table rows.
>
> Without this provision, we'd produce very bad plans for cases
> where a newly-populated table hasn't been analyzed yet.

We could have a notice mode that warns when a table without statistics is used.

--
Bruce Momjian  https://momjian.us
EDB            https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson
cirrus: run macos with COPY_PARSE_PLAN_TREES etc
fork: <20220325000933.vgazz7pjk2ytj...@alap3.anarazel.de>

On Thu, Mar 24, 2022 at 05:09:33PM -0700, Andres Freund wrote:
> On 2022-03-24 18:51:30 -0400, Andrew Dunstan wrote:
> > I wonder if we should add these compile flags to the cfbot's setup?
>
> Yes, I think we should. There's a bit of discussion of that in and below
> https://postgr.es/m/20220213051937.GO31460%40telsasoft.com - that veered a bit
> off course, so I haven't done anything about it yet. Perhaps one build with
> COPY_PARSE_PLAN_TREES and RAW_EXPRESSION_COVERAGE_TEST, another with
> WRITE_READ_PARSE_PLAN_TREES? We should add the slower ones to the macos build,
> that's plenty fast, and I'm intending to slow the linux test by using ubsan,
> which works better on linux.

Why would you put them on different tasks? To avoid slowing down one task too much? That doesn't seem to be an issue, at least for those three defines. What about adding RELCACHE_FORCE_RELEASE, too? Even with that, macos is only ~1min slower. https://cirrus-ci.com/task/5456727205216256

commit 53480b8db63b5cd2476142e28ed3f9fe8480f9f3
Author: Justin Pryzby
Date:   Thu Apr 14 06:27:07 2022 -0500

    cirrus/macos: enable various runtime checks

    cirrus CI can take a while to be scheduled on macos, but the instance
    always has many cores, so this is a good platform to enable options
    which will slow it down.

    See:
    https://www.postgresql.org/message-id/20211217193159.pwrelhiyx7kev...@alap3.anarazel.de
    https://www.postgresql.org/message-id/20211213211223.vkgg3wwiss2tragj%40alap3.anarazel.de
    https://www.postgresql.org/message-id/CAH2-WzmevBhKNEtqX3N-Tkb0gVBHH62C0KfeTxXzqYES_PiFiA%40mail.gmail.com
    https://www.postgresql.org/message-id/20220325000933.vgazz7pjk2ytj...@alap3.anarazel.de

    ci-os-only: macos

diff --git a/.cirrus.yml b/.cirrus.yml
index e0264929c74..4a655fc 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -337,6 +337,7 @@ task:
       CLANG="ccache ${brewpath}/llvm/bin/ccache" \
       CFLAGS="-Og -ggdb" \
       CXXFLAGS="-Og -ggdb" \
+      CPPFLAGS="-DRELCACHE_FORCE_RELEASE -DCOPY_PARSE_PLAN_TREES -DWRITE_READ_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST" \
       \
       LLVM_CONFIG=${brewpath}/llvm/bin/llvm-config \
       PYTHON=python3
AW: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Tue, Apr 26, 2022 at 12:54:35PM +0800, Julien Rouhaud wrote:
>> I searched a bit and apparently some people are using this function
>> directly opening some dll, which seems wrong.
> I was wondering about this whole business, and the manifest approach is a
> *horrible* design for an API where the goal is to know if your run-time
> environment is greater than a given threshold.

Agreed for the use case at hand, where you want to use one core API function or another depending on the OS version. One blog post from Microsoft, as I remember, said that one reason for the change was the increase of false installation error messages ("Install Error - Your system does not meet the minimum supported operating system and service pack level.") where the software in question was written for Windows XP and the user tried to install it on, say, Windows 8. That is just a developer error, where the developers forgot to anticipate future OS versions and, instead of checking for at least a given version, were checking for version equality against all Windows versions known at design time. Since you can develop only against OS APIs known at design time, and Microsoft claims to be pretty good at maintaining backward-compatible facades for their APIs, there is some reason in that decision (to only see the versions and APIs you told the OS about via the manifest, i.e., the ones you knew about at compile time).

The core problem at hand is that MS broke the promise of backward compatibility, since the function in question works differently depending on the Windows version, although with the above reasoning we should get the exact same behavior on Windows 10 as on Windows 8.1 (as PostgreSQL, per default, only claims to know about Windows 8.1 features). That said, I can understand the design decision. Personally, I still don't like it a bit, since developers should be allowed to make some stupid mistakes.

>>> Another idea on windows machines would be to use the commandline to
>>> execute ver in a separate process and store the result in a file.
>>
>> That also seems hackish, I don't think that we want to rely on
>> something like that.
>
> Hmm. That depends on the dependency set, I guess. We do that on Linux to
> some extent for large pages in sysv_shmem.c. Perhaps this could work for
> Win10 if this avoids the extra loopholes with the manifests.

I used the following hack to get the "real" major and minor version of Windows. It's in C# (.NET) and needs to be adjusted (you can compile as x64 and use a long long as the return value) to return the service number too, and translated into C. I share it anyway, as it might help - please be kind, as it really is a little hack.

Situation: the main application can't or is not willing to add a manifest file into its resources.

Solution: start a small executable (which has a manifest file compiled into its resources), let it read the OS version, and encode the version into its return code. CInt32 is basically an integer redefinition where one can access the lower and higher Int16 separately.
The main program eventually calls this (locate the executable, adjust the process startup to be minimal, call the executable as a separate process, and interpret the return value as a version):

private static Version ReadModernOsVersionInternal()
{
    String codeBase = Assembly.GetExecutingAssembly().CodeBase;
    Uri uri = new Uri(codeBase);
    String localPath = uri.LocalPath;
    String pathDirectory = Path.GetDirectoryName(localPath);
    if (pathDirectory != null)
    {
        String fullCombinePath = Path.Combine(pathDirectory, "Cf.Utilities.ReadModernOSVersion");
        ProcessStartInfo processInfo = new ProcessStartInfo
        {
            FileName = fullCombinePath,
            CreateNoWindow = true,
            UseShellExecute = false
        };
        Process process = new Process { StartInfo = processInfo };
        process.Start();
        // Note: the original used TimeSpan.FromSeconds(1).Milliseconds here,
        // which is 0; TotalMilliseconds gives the intended 1-second timeout.
        if (process.WaitForExit((Int32) TimeSpan.FromSeconds(1).TotalMilliseconds))
        {
            CInt32 versionInteger = process.ExitCode;
            return new Version(versionInteger.HighValue, versionInteger.LowValue);
        }
    }
    return new Version();
}

The small version-check executable:

static Int32 Main(String[] args)
{
    return OsVersionErmittler.ErmittleOsVersion();
}

and

static class OsVersionErmittler
{
    /// <summary>
    /// Determines the OS version and returns it as high and low word.
    /// </summary>
    public static CInt32 ErmittleOsVersion()
    {
        OperatingSystem version = Environment.OSVersion;
        if (version.Platform == PlatformID.Win32NT && version.Version >= new Version(6, 3))
        {
            return new CInt32((Int16) version.Version.Major, (Int16) version.Version.Minor);
        }
        return 0;
    }
}

The shortened manifest of the small executable:
remove redundant check of item pointer
In ItemPointerEquals(), ItemPointerGetBlockNumber() already checks that the ItemPointer is valid; there is no need to check it again when fetching the offset, so use ItemPointerGetOffsetNumberNoCheck() instead.

--
Regards
Junwang Zhao

v1-0001-remove-redundant-check-of-item-pointer.patch
Re: remove redundant check of item pointer
Junwang Zhao writes: > In function ItemPointerEquals, the ItemPointerGetBlockNumber > already checked the ItemPointer if valid, there is no need > to check it again in ItemPointerGetOffset, so use > ItemPointerGetOffsetNumberNoCheck instead. I do not think this change is worth making. The point of ItemPointerGetOffsetNumberNoCheck is not to save some cycles, it's to be able to fetch the offset field in cases where it might validly be zero. The assertion will be compiled out anyway in production builds --- and even in assert-enabled builds, I'd kind of expect the compiler to optimize away the duplicated tests. regards, tom lane
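For context, Assert() is defined approximately like this in c.h (paraphrased, not verbatim), which is why the duplicate check vanishes entirely in production builds:

    /* Approximate shape of the definitions in c.h: */
    #ifdef USE_ASSERT_CHECKING
    #define Assert(condition) \
        do { \
            if (!(condition)) \
                ExceptionalCondition(#condition, "FailedAssertion", \
                                     __FILE__, __LINE__); \
        } while (0)
    #else
    #define Assert(condition)   ((void) true)   /* compiled out entirely */
    #endif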
Re: Wrong rows count in EXPLAIN
Hi,

> Which Postgres version do you use?

I checked this on PG 11:

postgres=# select version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 11.5 on x86_64-w64-mingw32, compiled by x86_64-w64-mingw32-gcc.exe (x86_64-win32-seh-rev0, Built by MinGW-W64 project) 8.1.0, 64-bit
(1 row)

and on PG 13:

postgres=# select version();
                                  version
------------------------------------------------------------------------------
 PostgreSQL 13.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, 64-bit
(1 row)

Both versions show the same strange row counting in EXPLAIN.

From: Andrey Borodin
Sent: 27 April 2022 13:08:22
To: Пантюшин Александр Иванович
Cc: pgsql-hack...@postgresql.org; Тарасов Георгий Витальевич
Subject: Re: Wrong rows count in EXPLAIN

[quoted message trimmed; see Andrey Borodin's reply earlier in the thread]
Unstable tests for recovery conflict handling
[ starting a new thread cuz the shared-stats one is way too long ] Andres Freund writes: > Add minimal tests for recovery conflict handling. It's been kind of hidden by other buildfarm noise, but 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. Three of those failures look like [11:08:46.806](105.129s) ok 1 - buffer pin conflict: cursor with conflicting pin established Waiting for replication conn standby's replay_lsn to pass 0/33EF190 on primary [12:01:49.614](3182.807s) # poll_query_until timed out executing this query: # SELECT '0/33EF190' <= replay_lsn AND state = 'streaming' # FROM pg_catalog.pg_stat_replication # WHERE application_name IN ('standby', 'walreceiver') # expecting this output: # t # last actual query output: # f # with stderr: timed out waiting for catchup at t/031_recovery_conflict.pl line 123. In each of these examples we can see in the standby's log that it detected the expected buffer pin conflict: 2022-04-27 11:08:46.353 UTC [1961604][client backend][2/2:0] LOG: statement: BEGIN; 2022-04-27 11:08:46.416 UTC [1961604][client backend][2/2:0] LOG: statement: DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT b FROM test_recovery_conflict_table1; 2022-04-27 11:08:46.730 UTC [1961604][client backend][2/2:0] LOG: statement: FETCH FORWARD FROM test_recovery_conflict_cursor; 2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] LOG: recovery still waiting after 13.367 ms: recovery conflict on buffer pin 2022-04-27 11:08:47.825 UTC [1961298][startup][1/0:0] CONTEXT: WAL redo at 0/33E6E80 for Heap2/PRUNE: latestRemovedXid 0 nredirected 0 ndead 1; blkref #0: rel 1663/16385/16386, blk 0 but then nothing happens until the script times out and kills the test. I think this is showing us a real bug, ie we sometimes fail to cancel the conflicting query. The other example [3] looks different: [01:02:43.582](2.357s) ok 1 - buffer pin conflict: cursor with conflicting pin established Waiting for replication conn standby's replay_lsn to pass 0/342C000 on primary done [01:02:43.747](0.165s) ok 2 - buffer pin conflict: logfile contains terminated connection due to recovery conflict [01:02:43.804](0.057s) not ok 3 - buffer pin conflict: stats show conflict on standby [01:02:43.805](0.000s) [01:02:43.805](0.000s) # Failed test 'buffer pin conflict: stats show conflict on standby' # at t/031_recovery_conflict.pl line 295. [01:02:43.805](0.000s) # got: '0' # expected: '1' Not sure what to make of that --- could there be a race condition in the reporting of the conflict? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-27%2007%3A16%3A51 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-04-21%2021%3A20%3A15 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-04-13%2022%3A45%3A30 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-04-11%2005%3A40%3A41
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Wed, Apr 27, 2022 at 05:13:12PM +0900, Michael Paquier wrote: > On Tue, Apr 26, 2022 at 12:54:35PM +0800, Julien Rouhaud wrote: > > so I'm still on the opinion that we should > > unconditionally use the FILE_MAP_LARGE_PAGES flag if it's defined and call > > it a > > day. > > Are we sure that this is not going to cause failures in environments > where the flag is not supported? I don't know for sure as I have no way to test, but it would be very lame for an OS to provide a #define explicitly intended for one use case if that use case can't handle that flag yet.
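Concretely, the proposal amounts to something like the following in the Windows shared-memory code; this is a sketch under that assumption, not a tested patch, and huge_pages_active is a placeholder name:

    /*
     * Sketch of the proposal above, not a tested patch: pass the flag
     * only when the SDK we build against defines it, and only when huge
     * pages are in use.
     */
    DWORD       flags = FILE_MAP_ALL_ACCESS;

    #ifdef FILE_MAP_LARGE_PAGES
        if (huge_pages_active)      /* hypothetical flag for this sketch */
            flags |= FILE_MAP_LARGE_PAGES;
    #endif

    memAddress = MapViewOfFileEx(hmap, flags, 0, 0, 0, NULL);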
Re: BUG #17448: In Windows 10, version 1703 and later, huge_pages doesn't work.
On Wed, Apr 27, 2022 at 03:04:23PM +, Wilm Hoyer wrote: > > I used the following hack to get the "real" Major and Minor Version of > Windows - it's in C# (.Net) and needs to be adjusted (you can compile as x64 > and use a long-long as return value ) to return the Service Number too and > translated it into C. > I share it anyways, as it might help - please be kind, as it really is a > little hack. > > Situation: > Main Application can't or is not willing to add a manifest file into its > resources. > > Solution: > Start a small executable (which has a manifest file compiled into its > resources), let it read the OS Version and code the Version into its return > Code. Thanks for sharing. Having to compile another tool just for that seems like a very high price to pay, especially since we don't have any C# code in the tree. I'm not even sure that compiling this wouldn't need additional requirements and/or if it would work on our oldest supported Windows versions.
Re: Unstable tests for recovery conflict handling
> On Apr 27, 2022, at 9:45 AM, Tom Lane wrote: > > [ starting a new thread cuz the shared-stats one is way too long ] > > Andres Freund writes: >> Add minimal tests for recovery conflict handling. > > It's been kind of hidden by other buildfarm noise, but > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > > Three of those failures look like Interesting, I have been getting failures on REL_14_STABLE: t/012_subtransactions.pl . 11/12 # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby' # at t/012_subtransactions.pl line 211. # got: '3' # expected: '0' t/012_subtransactions.pl . 12/12 # Looks like you failed 1 test of 12. t/012_subtransactions.pl . Dubious, test returned 1 (wstat 256, 0x100) Failed 1/12 subtests And the logs, tmp_check/log/regress_log_012_subtransactions, showing: ### Enabling streaming replication for node "primary" ### Starting node "primary" # Running: pg_ctl -D /Users/mark.dilger/recovery_test/postgresql/src/test/recovery/tmp_check/t_012_subtransactions_primary_data/pgdata -l /Users/mark.dilger/recovery_test/postgresql/src/test/recovery/tmp_check/log/012_subtransactions_primary.log -o --cluster-name=primary start waiting for server to start done server started # Postmaster PID for node "primary" is 46270 psql::1: ERROR: prepared transaction with identifier "xact_012_1" does not exist not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby' # at t/012_subtransactions.pl line 211. # got: '3' # expected: '0' This is quite consistent for me, but only when I configure with --enable-coverage and --enable-dtrace. (I haven't yet tried one of those without the other.) I wasn't going to report this yet, having not yet completely narrowed this down, but I wonder if anybody else is seeing this? I'll try again on master — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgsql: Add contrib/pg_walinspect.
On Wed, 2022-04-27 at 13:47 +0530, Bharath Rupireddy wrote: > I found an easy way to reproduce this consistently (I think on any > server): > > I basically generated huge WAL record (I used a fun extension that I > wrote - https://github.com/BRupireddy/pg_synthesize_wal, but one can > use pg_logical_emit_message as well) Thank you Bharath for creating the extension and the simple test case. Thomas's patch solves the issue for me as well. Tom, the debug patch you posted[0] seems to be setting the error message if it's not already set. Thomas's patch uses the lack of a message as a signal that we've reached the end of WAL. That explains why you are still seeing the problem. Obviously, that's a sign that Thomas's patch is not the cleanest solution. But other approaches would be more invasive. I guess the question is whether that's a good enough solution for now, and hopefully we could improve the API later; or whether we need to come up with something better. When reviewing, I considered the inability to read old WAL and the inability to read flushed-in-the-middle-of-a-record WAL as similar kinds of errors that the user would need to deal with. But they are different: the former can be avoided by creating a slot; the latter can't be easily avoided, only retried. Depending on the intended use cases, forcing the user to retry might be reasonable, in which case we could consider this a test problem rather than a real problem, and we might be able to do something simpler to just stabilize the test. Regards, Jeff Davis [0] https://postgr.es/m/295868.1651024...@sss.pgh.pa.us
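In other words, with Thomas's patch a caller distinguishes the two cases roughly like this (a paraphrase of the approach, not the exact diff):

    /* Paraphrase of the approach in Thomas's patch, not the exact diff: */
    record = XLogReadRecord(xlogreader, &errormsg);
    if (record == NULL)
    {
        if (errormsg)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not read WAL at %X/%X: %s",
                            LSN_FORMAT_ARGS(xlogreader->EndRecPtr),
                            errormsg)));
        break;      /* no message: taken to mean end of available WAL */
    }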
Re: Unstable tests for recovery conflict handling
> On Apr 27, 2022, at 10:11 AM, Mark Dilger > wrote: > > I'll try again on master Still with coverage and dtrace enabled, I get the same thing, except that master formats the logs a bit differently: # Postmaster PID for node "primary" is 19797 psql::1: ERROR: prepared transaction with identifier "xact_012_1" does not exist [10:26:16.314](1.215s) not ok 11 - Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby [10:26:16.314](0.000s) [10:26:16.314](0.000s) # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction on promoted standby' [10:26:16.314](0.000s) # at t/012_subtransactions.pl line 208. [10:26:16.314](0.000s) # got: '3' # expected: '0' With coverage but not dtrace enabled, I still get the error, though the log leading up to the error now has a bunch of coverage noise lines like: profiling: /Users/mark.dilger/recovery_test/postgresql/src/backend/utils/sort/tuplesort.gcda: cannot merge previous GCDA file: corrupt arc tag The error itself looks the same except the timing numbers differ a little. With neither enabled, all tests pass. I'm inclined to think that either the recovery code or the test have a race condition, and that enabling coverage causes the race to come out differently. I'll keep poking — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Unstable tests for recovery conflict handling
I wrote: > It's been kind of hidden by other buildfarm noise, but > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > ... > I think this is showing us a real bug, ie we sometimes fail to cancel > the conflicting query. After digging around in the code, I think this is almost certainly some manifestation of the previously-complained-of problem [1] that RecoveryConflictInterrupt is not safe to call in a signal handler, leading the conflicting backend to sometimes decide that it's not the problem. That squares with the observation that skink is more prone to show this than other animals: you'd have to get the SIGUSR1 while the target backend isn't idle, so a very slow machine ought to show it more. We don't seem to have that issue on the open items list, but I'll go add it. Not sure if the 'buffer pin conflict: stats show conflict on standby' failure could trace to a similar cause. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
Re: Possible corruption by CreateRestartPoint at promotion
On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote:
> On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote:
>> On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote:
>>> +	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
>>> +	ControlFile->minRecoveryPointTLI = 0;
>>> +
>>> +	/* also update local copy */
>>> +	LocalMinRecoveryPoint = InvalidXLogRecPtr;
>>> +	LocalMinRecoveryPointTLI = 0;
>>
>> Should this be handled by the code that changes the control file state to
>> DB_IN_PRODUCTION instead? It looks like this is ordinarily done in the
>> next checkpoint. It's not clear to me why it is done this way.
>
> Anyway, that would be the work of the end-of-recovery checkpoint
> requested at the end of StartupXLOG() once a promotion happens, or of
> the checkpoint requested by PerformRecoveryXLogAction() in the second
> case, no? So, I don't quite see why we need to update
> minRecoveryPoint and minRecoveryPointTLI in the control file here, as
> much as this does not have to be part of the end-of-recovery code
> that switches the control file to DB_IN_PRODUCTION.

+1. We probably don't need to reset minRecoveryPoint here.

> -	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
> -		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> -	{
>
> 7ff23c6 has removed the last call to CreateCheckpoint() outside the
> checkpointer, meaning that there is one less concurrent race to worry
> about, but I have to admit that this change, to update the control
> file's checkPoint and checkPointCopy even if we don't check that
> ControlFile->checkPointCopy.redo < lastCheckPoint.redo, would make the
> code less robust in ~14. So I am questioning whether a backpatch
> is actually worth the risk here.

IMO we should still check this before updating ControlFile to be safe.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: avoid multiple hard links to same WAL file after a crash
On Wed, Apr 27, 2022 at 04:09:20PM +0900, Michael Paquier wrote: > I am not sure that have any need to backpatch this change based on the > unlikeliness of the problem, TBH. One thing that is itching me a bit, > like Robert upthread, is that we don't check anymore that the newfile > does not exist in the code paths because we never expect one. It is > possible to use stat() for that. But access() within a simple > assertion would be simpler? Say something like: > Assert(access(path, F_OK) != 0 && errno == ENOENT); > > The case for basic_archive is limited as the comment of the patch > states, but that would be helpful for the two calls in timeline.c and > the one in xlog.c in the long-term. And this has no need to be part > of fd.c, this can be added before the durable_rename() calls. What do > you think? Here is a new patch set with these assertions added. I think at least the xlog.c change ought to be back-patched. The problem may be unlikely, but AFAICT the possible consequences include WAL corruption. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 8ffc337621f8a287350a7a55256b58b0585f7a1f Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Tue, 26 Apr 2022 11:56:50 -0700 Subject: [PATCH v4 1/2] Replace calls to durable_rename_excl() with durable_rename(). durable_rename_excl() attempts to avoid overwriting any existing files by using link() and unlink(), but it falls back to rename() on some platforms (e.g., Windows), which offers no such overwrite protection. Most callers use durable_rename_excl() just in case there is an existing file, but in practice there shouldn't be one. basic_archive used it to avoid overwriting an archive concurrently created by another server, but as mentioned above, it will still overwrite files on some platforms. Furthermore, failures during durable_rename_excl() can result in multiple hard links to the same file. My testing demonstrated that it was possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting. This seems likely to lead to WAL corruption. This change replaces all calls to durable_rename_excl() with durable_rename(). This removes the protection against accidentally overwriting an existing file, but some platforms are already living without it, and ordinarily there shouldn't be one. The function itself is left around in case any extensions are using it. It will be removed in v16 via a follow-up commit. Back-patch to all supported versions. Before v13, durable_rename_excl() was named durable_link_or_rename(). Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220418182336.GA2298576%40nathanxps13 --- contrib/basic_archive/basic_archive.c | 5 +++-- src/backend/access/transam/timeline.c | 16 src/backend/access/transam/xlog.c | 9 +++-- 3 files changed, 10 insertions(+), 20 deletions(-) diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index e7efbfb9c3..ed33854c57 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path) /* * Sync the temporary file to disk and move it to its final destination. - * This will fail if destination already exists. 
+ * Note that this will overwrite any existing file, but this is only + * possible if someone else created the file since the stat() above. */ - (void) durable_rename_excl(temp, destination, ERROR); + (void) durable_rename(temp, destination, ERROR); ereport(DEBUG1, (errmsg("archived \"%s\" via basic_archive", file))); diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index be21968293..f3a8e53aa4 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -441,12 +441,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI, * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, newTLI); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existing file (there shouldn't be one). - */ - durable_rename_excl(tmppath, path, ERROR); + Assert(access(path, F_OK) != 0 && errno == ENOENT); + durable_rename(tmppath, path, ERROR); /* The history file can be archived immediately. */ if (XLogArchivingActive()) @@ -519,12 +515,8 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size) * Now move the completed history file into place with its final name. */ TLHistoryFilePath(path, tli); - - /* - * Perform the rename using link if available, paranoidly trying to avoid - * overwriting an existi
Re: [RFC] building postgres with meson -v8
Here is a patch that adds in NLS. There are some opportunities to improve this. For example, we could move the list of languages from the meson.build files into separate LINGUAS files, which could be shared with the makefile-based build system. I need to research this a bit more. Also, this only covers the build and install phases of the NLS process. The xgettext and msgmerge aspects I haven't touched at all. There is more to research there as well. The annoying thing is that the i18n module doesn't appear to have a way to communicate with feature options or dependencies, so there isn't a way to tell it to only do its things when some option is enabled, or conversely to check whether the module found the things it needs and to enable or disable an option based on that. So right now for example if you explicitly disable the 'nls' option, the binaries are built without NLS but the .mo files are still built and installed. In any case, this works for the main use cases and gets us a step forward, so it's worth considering.From ccfbff1beed568ef3ebe2e1af0701802d96b9017 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 27 Apr 2022 17:57:04 +0200 Subject: [PATCH 7/7] meson: NLS support --- meson.build| 23 +++--- meson_options.txt | 3 +++ src/backend/meson.build| 2 ++ src/backend/po/meson.build | 3 +++ src/bin/initdb/meson.build | 2 ++ src/bin/initdb/po/meson.build | 3 +++ src/bin/pg_amcheck/meson.build | 2 ++ src/bin/pg_amcheck/po/meson.build | 3 +++ src/bin/pg_archivecleanup/meson.build | 2 ++ src/bin/pg_archivecleanup/po/meson.build | 3 +++ src/bin/pg_basebackup/meson.build | 2 ++ src/bin/pg_basebackup/po/meson.build | 3 +++ src/bin/pg_checksums/meson.build | 2 ++ src/bin/pg_checksums/po/meson.build| 3 +++ src/bin/pg_config/meson.build | 2 ++ src/bin/pg_config/po/meson.build | 3 +++ src/bin/pg_controldata/meson.build | 2 ++ src/bin/pg_controldata/po/meson.build | 3 +++ src/bin/pg_ctl/meson.build | 2 ++ src/bin/pg_ctl/po/meson.build | 3 +++ src/bin/pg_dump/meson.build| 2 ++ src/bin/pg_dump/po/meson.build | 3 +++ src/bin/pg_resetwal/meson.build| 2 ++ src/bin/pg_resetwal/po/meson.build | 3 +++ src/bin/pg_rewind/meson.build | 2 ++ src/bin/pg_rewind/po/meson.build | 3 +++ src/bin/pg_test_fsync/meson.build | 2 ++ src/bin/pg_test_fsync/po/meson.build | 3 +++ src/bin/pg_test_timing/meson.build | 2 ++ src/bin/pg_test_timing/po/meson.build | 3 +++ src/bin/pg_upgrade/meson.build | 2 ++ src/bin/pg_upgrade/po/meson.build | 3 +++ src/bin/pg_verifybackup/meson.build| 2 ++ src/bin/pg_verifybackup/po/meson.build | 3 +++ src/bin/pg_waldump/meson.build | 2 ++ src/bin/pg_waldump/po/meson.build | 3 +++ src/bin/psql/meson.build | 2 ++ src/bin/psql/po/meson.build| 3 +++ src/bin/scripts/meson.build| 2 ++ src/bin/scripts/po/meson.build | 3 +++ src/interfaces/ecpg/ecpglib/meson.build| 2 ++ src/interfaces/ecpg/ecpglib/po/meson.build | 3 +++ src/interfaces/ecpg/preproc/meson.build| 2 ++ src/interfaces/ecpg/preproc/po/meson.build | 3 +++ src/interfaces/libpq/meson.build | 5 - src/interfaces/libpq/po/meson.build| 3 +++ src/pl/plperl/meson.build | 2 ++ src/pl/plperl/po/meson.build | 3 +++ src/pl/plpgsql/src/meson.build | 2 ++ src/pl/plpgsql/src/po/meson.build | 3 +++ src/pl/plpython/meson.build| 2 ++ src/pl/plpython/po/meson.build | 3 +++ src/pl/tcl/meson.build | 2 ++ src/pl/tcl/po/meson.build | 3 +++ 54 files changed, 155 insertions(+), 4 deletions(-) create mode 100644 src/backend/po/meson.build create mode 100644 src/bin/initdb/po/meson.build create mode 100644 src/bin/pg_amcheck/po/meson.build create mode 100644 
src/bin/pg_archivecleanup/po/meson.build create mode 100644 src/bin/pg_basebackup/po/meson.build create mode 100644 src/bin/pg_checksums/po/meson.build create mode 100644 src/bin/pg_config/po/meson.build create mode 100644 src/bin/pg_controldata/po/meson.build create mode 100644 src/bin/pg_ctl/po/meson.build create mode 100644 src/bin/pg_dump/po/meson.build create mode 100644 src/bin/pg_resetwal/po/meson.build create mode 100644 src/bin/pg_rewind/po/meson.build create mode 100644 src/bin/pg_test_fsync/po/meson.build create mode 100644 src/bin/pg_test_timing/po/meson.build create mode 100644 src/bin/pg_upgrade/po/meson.build create mode 100644 src/bin/pg_verifyb
Re: bogus: logical replication rows/cols combinations
Hi, so I've been looking at tweaking the code so that the behavior matches Alvaro's expectations. It passes check-world, but I'm not claiming it's anywhere near committable - the purpose is mostly to give a better idea of how invasive the change is etc.

As described earlier, this abandons the idea of building a single OR expression from all the row filters (per action), and replaces that with a list of per-publication info (struct PublicationInfo), combining info about both row filters and column lists. This means we can't initialize the row filters and column lists separately; they have to be initialized at the same time. So pgoutput_row_filter_init was modified to initialize both, and pgoutput_column_list_init was removed. With this info, we can calculate column lists only for publications with matching row filters, which is what the modified pgoutput_row_filter does (the calculated column list is returned through a parameter).

This however does not remove the 'columns' from RelationSyncEntry entirely. We still need that "superset" column list when sending schema. Imagine two publications, one replicating (a,b) and the other (a,c), maybe depending on row filter. send_relation_and_attrs() needs to send info about all three attributes (a,b,c), i.e. about any attribute that might end up being replicated.

We might try to be smarter and send the exact schema needed by the next operation, i.e. when inserting (a,b) we'd make sure the last schema we sent was (a,b) and invalidate/resend it otherwise. But that might easily result in "thrashing" where we send the schema and the next operation invalidates it right away because it needs a different schema.

But there's another reason to do it like this - it seems desirable to actually reset columns that don't match the calculated column list. Using Alvaro's example, it seems reasonable to expect these two transactions to produce the same result on the subscriber: 1) insert (a,b) + update to (a,c) insert into uno values (1, 2, 3); update uno set a = -1 where a = 1; 2) insert (a,c) insert into uno values (-1, 2, 3); But to do this, the update actually needs to send (-1,NULL,3). So in this case we'll have the (a,b,c) column list in RelationSyncEntry, and only attributes on this list will be sent as part of schema. And for DML actions we'll calculate either (a,b) or (a,c) depending on the row filter, and missing attributes will be replicated as NULL.

I haven't done any tests of how this affects performance, but I have a couple of thoughts regarding that: a) I kinda doubt the optimizations would really matter in practice, because how likely is it that one relation is in many publications (in the same subscription)? b) Did anyone actually do some benchmarks that I could repeat, to see how much worse this is? c) AFAICS we could optimize this in at least some common cases. For example we could combine the entries with matching row filters, and/or column filters.
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index ff8513e2d2..41c5e3413f 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -33,7 +33,9 @@ static void logicalrep_write_attrs(StringInfo out, Relation rel, Bitmapset *columns); static void logicalrep_write_tuple(StringInfo out, Relation rel, TupleTableSlot *slot, - bool binary, Bitmapset *columns); + bool binary, + Bitmapset *schema_columns, + Bitmapset *columns); static void logicalrep_read_attrs(StringInfo in, LogicalRepRelation *rel); static void logicalrep_read_tuple(StringInfo in, LogicalRepTupleData *tuple); @@ -412,7 +414,8 @@ logicalrep_read_origin(StringInfo in, XLogRecPtr *origin_lsn) */ void logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel, - TupleTableSlot *newslot, bool binary, Bitmapset *columns) + TupleTableSlot *newslot, bool binary, + Bitmapset *schema_columns, Bitmapset *columns) { pq_sendbyte(out, LOGICAL_REP_MSG_INSERT); @@ -424,7 +427,8 @@ logicalrep_write_insert(StringInfo out, TransactionId xid, Relation rel, pq_sendint32(out, RelationGetRelid(rel)); pq_sendbyte(out, 'N'); /* new tuple follows */ - logicalrep_write_tuple(out, rel, newslot, binary, columns); + logicalrep_write_tuple(out, rel, newslot, binary, + schema_columns, columns); } /* @@ -457,7 +461,8 @@ logicalrep_read_insert(StringInfo in, LogicalRepTupleData *newtup) void logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel, TupleTableSlot *oldslot, TupleTableSlot *newslot, - bool binary, Bitmapset *columns) + bool binary, Bitmapset *schema_columns, + Bitmapset *columns) { pq_sendbyte(out, LOGICAL_REP_MSG_UPDATE); @@ -478,11 +483,12 @@ logicalrep_write_update(StringInfo out, TransactionId xid, Relation rel, pq_sendbyte(out, 'O')
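For readers following the (a,b) and (a,c) example above, a hypothetical pair of publications using the PostgreSQL 15 column-list and row-filter syntax on Alvaro's table uno(a, b, c) might look like this (the predicates and object names are invented for illustration):

CREATE PUBLICATION pub_ab FOR TABLE uno (a, b) WHERE (a > 0);
CREATE PUBLICATION pub_ac FOR TABLE uno (a, c) WHERE (a < 0);

-- One subscription consuming both publications:
CREATE SUBSCRIPTION sub_uno CONNECTION 'dbname=postgres'
    PUBLICATION pub_ab, pub_ac;

With the patch sketched above, the schema sent for uno would cover (a,b,c), and each replicated row would carry NULL for whichever attributes fall outside the column list of the publication whose row filter matched.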
Multi-Master Logical Replication
MULTI-MASTER LOGICAL REPLICATION

1.0 BACKGROUND Let’s assume that a user wishes to set up a multi-master environment so that a set of PostgreSQL instances (nodes) use logical replication to share tables with every other node in the set. We define this as a multi-master logical replication (MMLR) node-set.

1.1 ADVANTAGES OF MMLR - Increases write scalability (e.g., all nodes can write arbitrary data). - Allows load balancing - Allows rolling updates of nodes (e.g., logical replication works between different major versions of PostgreSQL). - Improves the availability of the system (e.g., no single point of failure) - Improves performance (e.g., lower latencies for geographically local nodes)

2.0 MMLR AND POSTGRESQL It is already possible to configure a kind of MMLR set in PostgreSQL 15 using PUB/SUB, but it is very restrictive because it can only work when no two nodes operate on the same table. This is because when two nodes try to share the same table, a circular recursive problem arises: Node1 replicates data to Node2, which is then replicated back to Node1, and so on. To prevent the circular recursive problem Vignesh is developing a patch [1] that introduces new SUBSCRIPTION options "local_only" (for publishing only data originating at the publisher node) and "copy_data=force". Using this patch, we have created a script [2] demonstrating how to set up all the above multi-node examples. An overview of the necessary steps is given in the next section.

2.1 STEPS – Adding a new node N to an existing node-set step 1. Prerequisites – Apply Vignesh’s patch [1]. All nodes in the set must be visible to each other by a known CONNECTION. All shared tables must already be defined on all nodes. step 2. On node N do CREATE PUBLICATION pub_N FOR ALL TABLES step 3. All other nodes then CREATE SUBSCRIPTION to PUBLICATION pub_N with "local_only=on, copy_data=on" (this will replicate initial data from the node N tables to every other node). step 4. On node N, temporarily ALTER PUBLICATION pub_N to prevent replication of 'truncate', then TRUNCATE all tables of node N, then re-allow replication of 'truncate'. step 5. On node N do CREATE SUBSCRIPTION to the publications of all other nodes in the set 5a. Specify "local_only=on, copy_data=force" for exactly one of the subscriptions (this will make the node N tables now have the same data as the other nodes) 5b. Specify "local_only=on, copy_data=off" for all other subscriptions. step 6. Result - Now changes to any table on any node should be replicated to every other node in the set. (A SQL sketch of these steps appears after section 3.0 below.) Note: Steps 4 and 5 need to be done within the same transaction to avoid loss of data in case of some command failure. (Because we can't perform create subscription in a transaction, we need to create the subscription in a disabled mode first and then enable it in the transaction).

2.2 DIFFICULTIES Notice that it becomes increasingly complex to configure MMLR manually as the number of nodes in the set increases. There are also some difficulties such as - dealing with initial table data - coordinating the timing to avoid concurrent updates - getting the SUBSCRIPTION options for copy_data exactly right.

3.0 PROPOSAL To make the MMLR setup simpler, we propose to create a new API that will hide all the step details and remove from the user the burden of getting them right without mistakes.
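For illustration, the procedure of section 2.1 amounts to SQL along the following lines for a new node N joining a set that already contains node A. The object names and connection strings are invented, and the "local_only" and "copy_data = force" option syntax comes from the in-development patch [1], so it may change:

-- Step 2, on node N:
CREATE PUBLICATION pub_N FOR ALL TABLES;

-- Step 3, on each existing node (node A shown):
CREATE SUBSCRIPTION sub_A_N CONNECTION 'host=node_N dbname=postgres'
    PUBLICATION pub_N WITH (local_only = on, copy_data = on);

-- Step 4, on node N: suspend 'truncate' replication while clearing data:
ALTER PUBLICATION pub_N SET (publish = 'insert, update, delete');
TRUNCATE tbl1;
ALTER PUBLICATION pub_N SET (publish = 'insert, update, delete, truncate');

-- Step 5, on node N: subscribe to every other node; exactly one
-- subscription (5a) uses copy_data = force, the rest (5b) use off:
CREATE SUBSCRIPTION sub_N_A CONNECTION 'host=node_A dbname=postgres'
    PUBLICATION pub_A WITH (local_only = on, copy_data = force);

As the note in section 2.1 says, steps 4 and 5 would really be run in one transaction, with the subscriptions created disabled and enabled afterwards.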
3.1 MOTIVATION - MMLR (sharing the same tables) is not currently possible - Vignesh's patch [1] makes MMLR possible, but the manual setup is still quite difficult - An MMLR implementation can solve the timing problems (e.g., using Database Locking)

3.2 API Preferably the API would be implemented as new SQL functions in PostgreSQL core; however, implementation using a contrib module or some new SQL syntax may also be possible. The SQL functions would be like the following: - pg_mmlr_set_create = create a new set, and give it a name - pg_mmlr_node_attach = attach the current node to a specified set - pg_mmlr_node_detach = detach a specified node from a specified set - pg_mmlr_set_delete = delete a specified set For example, internally the pg_mmlr_node_attach API function would execute the equivalent of all the CREATE PUBLICATION, CREATE SUBSCRIPTION, and TRUNCATE steps described above (see the usage sketch at the end of this post). Notice this proposal has some external API similarities with the BDR extension [3] (which also provides multi-master logical replication), although we plan to implement it entirely using PostgreSQL’s PUB/SUB.

4.0 ACKNOWLEDGEMENTS The following people have contributed to this proposal – Hayato Kuroda, Vignesh C, Peter Smith, Amit Kapila.

5.0 REFERENCES [1] https://www.postgresql.org/message-id/flat/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPvY2P%3DUL-X6maMA5QxFKdcdciRRCKDH3j%3D_hO8u2OyRYg%40mail.gmail.com [3] https://www.enterprisedb.com/docs/bdr/latest/ [END] ~~~ One of my colleagues will post more detailed information later. ---
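As a rough usage sketch of the proposal: the function names below are from section 3.2, but no signatures have been proposed yet, so the argument lists are pure guesses:

-- On the first node: create and name the set.
SELECT pg_mmlr_set_create('myset');

-- On each node that should participate (connection string guessed):
SELECT pg_mmlr_node_attach('myset', 'host=node_N dbname=postgres');

-- Later, to shrink or dismantle the set:
SELECT pg_mmlr_node_detach('myset', 'node_N');
SELECT pg_mmlr_set_delete('myset');

Internally, each call would drive the CREATE PUBLICATION / CREATE SUBSCRIPTION / TRUNCATE choreography of section 2.1 with the correct copy_data settings and timing.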
Re: pgsql: Add contrib/pg_walinspect.
On Wed, Apr 27, 2022 at 10:22 PM Bharath Rupireddy wrote: > Here's v2 patch (up on Thomas's v1 at [1]) using private_data to set > the end of the WAL flag. Please have a look at it. I don't have a strong view on whether it's better to use a NULL error for this private communication between pg_walinspect.c and the read_page callback it installs, or install a custom private_data to signal this condition, or to give up on all this stuff completely and just wait (see below for thoughts). I'd feel better about both no-wait options if the read_page callback in question were actually in the contrib module, and not in the core code. On the other hand, I'm not too hung up about it because I'm really hoping to see the get-rid-of-all-these-callbacks-and-make-client-code-do-the-waiting scheme proposed by Horiguchi-san and Heikki developed further, to rip much of this stuff out in a future release. If you go with the private_data approach, a couple of small comments: if (record == NULL) { +ReadLocalXLogPageNoWaitPrivate *private_data; + +/* return NULL, if end of WAL is reached */ +private_data = (ReadLocalXLogPageNoWaitPrivate *) +xlogreader->private_data; + +if (private_data->end_of_wal) +return NULL; + if (errormsg) ereport(ERROR, (errcode_for_file_access(), errmsg("could not read WAL at %X/%X: %s", LSN_FORMAT_ARGS(first_record), errormsg))); -else -ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not read WAL at %X/%X", -LSN_FORMAT_ARGS(first_record; } I think you should leave that second ereport() call in, in this variant of the patch, no? I don't know if anything else raises errors with no message, but if we're still going to treat them as silent end-of-data conditions then you might as well go with the v1 patch. Another option might be to abandon this whole no-wait concept and revert 2258e76f's changes to xlogutils.c. pg_walinspect already does preliminary checks that LSNs are in range, so you can't enter a value that will wait indefinitely, and it's interruptible (it's a 1ms sleep/check loop, not my favourite programming pattern but that's pre-existing code). If you're unlucky enough to hit the case where the LSN is judged to be valid but is in the middle of a record that hasn't been totally flushed yet, it'll just be a bit slower to return as we wait for the inevitable later flush(es) to happen. The rest of your record will *surely* be flushed pretty soon (or the flushing backend panics the whole system and time ends). I don't imagine this is performance critical work, so maybe that'd be acceptable?
Re: Possible corruption by CreateRestartPoint at promotion
On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote: > On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote: >> - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && >> - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) >> - { >> 7ff23c6 has removed the last call to CreateCheckpoint() outside the >> checkpointer, meaning that there is one less concurrent race to worry >> about, but I have to admit that this change, to update the control >> file's checkPoint and checkPointCopy even if we don't check after >> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the >> code less robust in ~14. So I am questioning whether a backpatch >> is actually worth the risk here. > > IMO we should still check this before updating ControlFile to be safe. Sure. Fine by me to play it safe. -- Michael signature.asc Description: PGP signature
trivial comment fix
Hi, While reading worker.c, I noticed that the referred SQL command was wrong. ALTER SUBSCRIPTION ... REFRESH PUBLICATION instead of ALTER TABLE ... REFRESH PUBLICATION. Trivial fix attached. -- Euler Taveira EDB https://www.enterprisedb.com/ diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index 4171371296..7da7823c35 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -93,7 +93,7 @@ * ReorderBufferFinishPrepared. * * If the subscription has no tables then a two_phase tri-state PENDING is - * left unchanged. This lets the user still do an ALTER TABLE REFRESH + * left unchanged. This lets the user still do an ALTER SUBSCRIPTION REFRESH * PUBLICATION which might otherwise be disallowed (see below). * * If ever a user needs to be aware of the tri-state value, they can fetch it
RE: Data is copied twice when specifying both child and parent table in publication
On Sun, Apr 24, 2022 2:16 PM Wang, Wei/王 威 wrote: > > Attach the new patches.[suggestions by Amit-San] > The patch for HEAD: > 1. Add a new function to get tables info by a publications array. > The patch for REL14: > 1. Use an alias to make the statement understandable. BTW, I adjusted the > alignment. > 2. Improve the test cast about the column list and row filter to cover this > bug. > Thanks for your patches. Here's a comment on the patch for REL14. + appendStringInfo(&cmd, "SELECT DISTINCT ns.nspname, c.relname\n" +" FROM pg_catalog.pg_publication_tables t\n" +" JOIN pg_catalog.pg_namespace ns\n" +" ON ns.nspname = t.schemaname\n" +" JOIN pg_catalog.pg_class c\n" +" ON c.relname = t.tablename AND c.relnamespace = ns.oid\n" +" WHERE t.pubname IN (%s)\n" +" AND (c.relispartition IS FALSE\n" +" OR NOT EXISTS\n" +"( SELECT 1 FROM pg_partition_ancestors(c.oid) as relid\n" +" WHERE relid IN\n" +"(SELECT DISTINCT (schemaname || '.' || tablename)::regclass::oid\n" +" FROM pg_catalog.pg_publication_tables t\n" +" WHERE t.pubname IN (%s))\n" +" AND relid != c.oid))\n", +pub_names.data, pub_names.data); I think we can use an alias like 'pa' for pg_partition_ancestors, and modify the SQL as follows. + appendStringInfo(&cmd, "SELECT DISTINCT ns.nspname, c.relname\n" +" FROM pg_catalog.pg_publication_tables t\n" +" JOIN pg_catalog.pg_namespace ns\n" +" ON ns.nspname = t.schemaname\n" +" JOIN pg_catalog.pg_class c\n" +" ON c.relname = t.tablename AND c.relnamespace = ns.oid\n" +" WHERE t.pubname IN (%s)\n" +" AND (c.relispartition IS FALSE\n" +" OR NOT EXISTS\n" +"( SELECT 1 FROM pg_partition_ancestors(c.oid) pa\n" +" WHERE pa.relid IN\n" +"(SELECT DISTINCT (t.schemaname || '.' || t.tablename)::regclass::oid\n" +" FROM pg_catalog.pg_publication_tables t\n" +" WHERE t.pubname IN (%s))\n" +" AND pa.relid != c.oid))\n", +pub_names.data, pub_names.data); Regards, Shi yu
Re: Unstable tests for recovery conflict handling
Hi, On 2022-04-27 10:11:53 -0700, Mark Dilger wrote: > > > > On Apr 27, 2022, at 9:45 AM, Tom Lane wrote: > > > > [ starting a new thread cuz the shared-stats one is way too long ] > > > > Andres Freund writes: > >> Add minimal tests for recovery conflict handling. > > > > It's been kind of hidden by other buildfarm noise, but > > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > > > > Three of those failures look like > > Interesting, > > I have been getting failures on REL_14_STABLE: > > t/012_subtransactions.pl . 11/12 > > # Failed test 'Rollback of PGPROC_MAX_CACHED_SUBXIDS+ prepared transaction > on promoted standby' > # at t/012_subtransactions.pl line 211. > # got: '3' > # expected: '0' > t/012_subtransactions.pl . 12/12 # Looks like you failed 1 test > of 12. > t/012_subtransactions.pl . Dubious, test returned 1 (wstat 256, > 0x100) > Failed 1/12 subtests > > And the logs, tmp_check/log/regress_log_012_subtransactions, showing: I'm a bit confused - what's the relation of that failure to this thread / the tests / this commit? Greetings, Andres Freund
Re: Unstable tests for recovery conflict handling
Hi, On 2022-04-27 14:08:45 -0400, Tom Lane wrote: > I wrote: > > It's been kind of hidden by other buildfarm noise, but > > 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4]. > > ... > > I think this is showing us a real bug, ie we sometimes fail to cancel > > the conflicting query. > > After digging around in the code, I think this is almost certainly > some manifestation of the previously-complained-of problem [1] that > RecoveryConflictInterrupt is not safe to call in a signal handler, > leading the conflicting backend to sometimes decide that it's not > the problem. I think at least some may actually be because of https://postgr.es/m/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de rather than RecoveryConflictInterrupt itself. I'll go and finish up the comment bits that I still need to clean up in my bugfix and commit that... Greetings, Andres Freund
Re: remove redundant check of item pointer
got it, thanks for the explanation. On Wed, Apr 27, 2022 at 11:34 PM Tom Lane wrote: > > Junwang Zhao writes: > > In function ItemPointerEquals, the ItemPointerGetBlockNumber > > already checked the ItemPointer if valid, there is no need > > to check it again in ItemPointerGetOffset, so use > > ItemPointerGetOffsetNumberNoCheck instead. > > I do not think this change is worth making. The point of > ItemPointerGetOffsetNumberNoCheck is not to save some cycles, > it's to be able to fetch the offset field in cases where it might > validly be zero. The assertion will be compiled out anyway in > production builds --- and even in assert-enabled builds, I'd kind > of expect the compiler to optimize away the duplicated tests. > > regards, tom lane -- Regards Junwang Zhao
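For reference, the two accessors differ only in the validity assertion; their definitions in src/include/storage/itemptr.h are approximately:

#define ItemPointerGetOffsetNumberNoCheck(pointer) \
( \
	(pointer)->ip_posid \
)

#define ItemPointerGetOffsetNumber(pointer) \
( \
	AssertMacro(ItemPointerIsValid(pointer)), \
	ItemPointerGetOffsetNumberNoCheck(pointer) \
)

So in a production build the two expand to identical code, and the checked form exists to catch callers that should never see an offset of zero.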
Re: Handle infinite recursion in logical replication setup
Here are my review comments for v10-0002 (TAP tests part only)

File: src/test/subscription/t/032_localonly.pl == 1. +# Detach node C and clean the table contents. +sub detach_node_clean_table_data +{ 1a. Maybe say "Detach node C from the node-group of (A, B, C) and clean the table contents from all nodes" 1b. Actually, I wondered whether you need to TRUNCATE on both A and B (maybe only truncate 1 is OK since those nodes are still using MMC). OTOH maybe your explicit way makes the test simpler.

~~~ 2. +# Subroutine for verify the data is replicated successfully. +sub verify_data +{ 2a. Typo: "for verify" -> "to verify" 2b. The messages in this function are maybe not very appropriate. They say 'Inserted successfully without leading to infinite recursion in circular replication setup', but really the function is only testing that all the data is the same as 'expected'. So it could be the result of any operation - not just Insert.

~~~ 3. SELECT ORDER BY? $result = $node_B->safe_psql('postgres', "SELECT * FROM tab_full;"); is($result, qq(11 12 13), 'Node_C data replicated to Node_B' ); I am not sure whether these are OK like this, or if *every* SELECT should use ORDER BY to make sure the data is in the same qq expected order. There are multiple cases like this. (BTW, I think this comment applies to the v10-0001 patch, maybe not v10-0002).

~~~ 4. +### +# Specifying local_only 'on' which indicates that the publisher should only +# replicate the changes that are generated locally from node_B, but in +# this case since the node_B is also subscribing data from node_A, node_B can +# have data originated from node_A, so throw an error in this case to prevent +# node_A data being replicated to the node_C. +### There is something wrong with the description because there is no "node_C" in this test. You are just creating a 2nd subscription on node A.

~~~ 5. +($result, $stdout, $stderr) = $node_A->psql( + 'postgres', " +CREATE SUBSCRIPTION tap_sub_A3 +CONNECTION '$node_B_connstr application_name=$subname_AB' +PUBLICATION tap_pub_B +WITH (local_only = on, copy_data = on)"); +like( It seemed strange to call this 2nd subscription "tap_sub_A3". Wouldn't it be better to call it "tap_sub_A2"?

~~~ 6. +### +# Join 3rd node (node_C) to the existing 2 nodes(node_A & node_B) bidirectional +# replication setup when the existing nodes (node_A & node_B) has no data and +# the new node (node_C) some pre-existing data. +### +$node_C->safe_psql('postgres', "INSERT INTO tab_full VALUES (31);"); + +$result = $node_A->safe_psql('postgres', "SELECT * FROM tab_full order by 1;"); +is( $result, qq(), 'Check existing data'); + +$result = $node_B->safe_psql('postgres', "SELECT * FROM tab_full order by 1;"); +is( $result, qq(), 'Check existing data'); + +$result = $node_C->safe_psql('postgres', "SELECT * FROM tab_full order by 1;"); +is($result, qq(31), 'Check existing data'); + +create_subscription($node_A, $node_C, $subname_AC, $node_C_connstr, + 'tap_pub_C', 'copy_data = force, local_only = on'); +create_subscription($node_B, $node_C, $subname_BC, $node_C_connstr, + 'tap_pub_C', 'copy_data = force, local_only = on'); + Because node_C does not yet have any subscriptions, aren't these cases where you didn't really need to use "force"? -- Kind Regards, Peter Smith. Fujitsu Australia
Re: trivial comment fix
On Thu, Apr 28, 2022 at 7:27 AM Euler Taveira wrote: > > Hi, > > While reading worker.c, I noticed that the referred SQL command was wrong. > ALTER SUBSCRIPTION ... REFRESH PUBLICATION instead of ALTER TABLE ... REFRESH > PUBLICATION. Trivial fix attached. Pushed, thanks! -- John Naylor EDB: http://www.enterprisedb.com
Re: Possible corruption by CreateRestartPoint at promotion
At Wed, 27 Apr 2022 14:16:01 +0900, Michael Paquier wrote in > On Tue, Apr 26, 2022 at 08:26:09PM -0700, Nathan Bossart wrote: > > On Wed, Apr 27, 2022 at 10:43:53AM +0900, Kyotaro Horiguchi wrote: > >> At Tue, 26 Apr 2022 11:33:49 -0700, Nathan Bossart > >> wrote in > >>> I suspect we'll start seeing this problem more often once end-of-recovery > >>> checkpoints are removed [0]. Would you mind creating a commitfest entry > >>> for this thread? I didn't see one. > >> > >> I'm not sure the patch makes any change here, because restart points > >> don't run while crash recovery, since no checkpoint records seen > >> during a crash recovery. Anyway the patch doesn't apply anymore so > >> rebased, but only the one for master for the lack of time for now. > > > > Thanks for the new patch! Yeah, it wouldn't affect crash recovery, but > > IIUC Robert's patch also applies to archive recovery. > > > >> + /* Update pg_control */ > >> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); > >> + > >> /* > >> * Remember the prior checkpoint's redo ptr for > >> * UpdateCheckPointDistanceEstimate() > >> */ > >> PriorRedoPtr = ControlFile->checkPointCopy.redo; > > > > nitpick: Why move the LWLockAcquire() all the way up here? > > Yeah, that should not be necessary. InitWalRecovery() is the only > place outside the checkpointer that would touch this field, but that > happens far too early in the startup sequence to matter with the > checkpointer.

Yes, it is not necessary. I just wanted to make it visually apparent that ControlFile is never accessed outside ControlFileLock. So I won't be opposed to reverting it since, as you say, it is *actually* safe.

> >> + Assert (PriorRedoPtr < RedoRecPtr); > > > > I think this could use a short explanation. > > That's just to make sure that the current redo LSN is always older > than the one prior that. It does not seem really necessary to me to > add that.

Just after that assertion we call UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr). Do we really not need any safeguard against giving a wrapped-around (in other words, tremendously large) value to the function? It doesn't seem to happen now, but I'm not confident that it never ever happens. That being said, I'm in the minority here, so I removed it.

> >> + /* > >> + * Aarchive recovery has ended. Crash recovery ever after should > >> + * always recover to the end of WAL > >> + */ > > s/Aarchive/Archive/.

Oops! Fixed.

> >> + ControlFile->minRecoveryPoint = InvalidXLogRecPtr; > >> + ControlFile->minRecoveryPointTLI = 0; > >> + > >> + /* also update local copy */ > >> + LocalMinRecoveryPoint = InvalidXLogRecPtr; > >> + LocalMinRecoveryPointTLI = 0; > > > > Should this be handled by the code that changes the control file state to > > DB_IN_PRODUCTION instead? It looks like this is ordinarily done in the > > next checkpoint. It's not clear to me why it is done this way. > > Anyway, that would be the work of the end-of-recovery checkpoint > requested at the end of StartupXLOG() once a promotion happens or of > the checkpoint requested by PerformRecoveryXLogAction() in the second > case, no? So, I don't quite see why we need to update > minRecoveryPoint and minRecoveryPointTLI in the control file here, as > much as this does not have to be part of the end-of-recovery code > that switches the control file to DB_IN_PRODUCTION.

Eventually the work is done by StartupXLOG(), so we don't need to do that at all even in CreateCheckPoint(). But if we expect the end-of-recovery checkpoint to clear these fields, that won't happen when the last restart point takes so long that the end-of-recovery checkpoint request is ignored. If DB_IN_ARCHIVE_RECOVERY ended while the restart point was running, it is quite possible that the end-of-recovery checkpoint trigger is ignored; in that case the values are cleared at the next checkpoint. In short, if we want to reset them at the so-called end-of-recovery checkpoint, we should do that in CreateRestartPoint as well. So, it is not removed in this version.

> - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && > - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) > - { > 7ff23c6 has removed the last call to CreateCheckpoint() outside the > checkpointer, meaning that there is one less concurrent race to worry > about, but I have to admit that this change, to update the control > file's checkPoint and checkPointCopy even if we don't check after > ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the > code less robust in ~14.

Sure. In the very early stage, that was the reasoning for removing the code. The reason for proposing to back-patch the same to older versions came from the further investigation, and I'm not fully confident about it for the earlier versions.

> So I am questioning whether a backpatch > is actually worth the risk here.

Agreed.

regards. -- Kyotaro Horiguchi N
Re: Possible corruption by CreateRestartPoint at promotion
At Thu, 28 Apr 2022 09:12:13 +0900, Michael Paquier wrote in > On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote: > > On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote: > >> - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && > >> - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) > >> - { > >> 7ff23c6 has removed the last call to CreateCheckpoint() outside the > >> checkpointer, meaning that there is one less concurrent race to worry > >> about, but I have to admit that this change, to update the control > >> file's checkPoint and checkPointCopy even if we don't check after > >> ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the > >> code less robust in ~14. So I am questioning whether a backpatch > >> is actually worth the risk here. > > > > IMO we should still check this before updating ControlFile to be safe. > > Sure. Fine by me to play it safe. Why do we consider concurrent check/restart points here while we don't consider the same for ControlFile->checkPointCopy? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Possible corruption by CreateRestartPoint at promotion
At Wed, 27 Apr 2022 01:31:55 -0400, Tom Lane wrote in > Michael Paquier writes: > > On Wed, Apr 27, 2022 at 12:36:10PM +0800, Rui Zhao wrote: > >> Do you have interest in adding a test like one in my patch? > > > I have studied the test case you are proposing, and I am afraid that > > it is too expensive as designed. > > That was my feeling too. It's certainly a useful test for verifying > that we fixed the problem, but that doesn't mean that it's worth the > cycles to add it as a permanent fixture in check-world, even if we > could get rid of the timing assumptions. My first feeling is the same. And I can't find a way to cause this cheaply and reliably without inserting dedicated debugging-aid code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pgsql: Add contrib/pg_walinspect.
On Thu, 2022-04-28 at 12:11 +1200, Thomas Munro wrote: > > Another option might be to abandon this whole no-wait concept and > revert 2258e76f's changes to xlogutils.c. pg_walinspect already does > preliminary checks that LSNs are in range, so you can't enter a value > that will wait indefinitely, and it's interruptible (it's a 1ms > sleep/check loop, not my favourite programming pattern but that's > pre-existing code). If you're unlucky enough to hit the case where > the LSN is judged to be valid but is in the middle of a record that > hasn't been totally flushed yet, it'll just be a bit slower to return > as we wait for the inevitable later flush(es) to happen. The rest of > your record will *surely* be flushed pretty soon (or the flushing > backend panics the whole system and time ends). I don't imagine this > is performance critical work, so maybe that'd be acceptable? I'm inclined toward this option. I was a bit iffy on those xlogutils.c changes to begin with, and they are causing a problem now, so I'd like to avoid layering on more workarounds. The time when we need to wait is very narrow, only in this case where it's earlier than the flush pointer, and the flush pointer is in the middle of a record that's not fully flushed. And as you say, we won't be waiting very long in that case, because once we start to write a WAL record it better finish soon. Bharath, thoughts? When you originally introduced the nowait behavior, I believe that was to solve the case where someone specifies an LSN range well in the future, but we can still catch that and throw an error if we see that it's beyond the flush pointer. Do you see a problem with just doing that and getting rid of the nowait changes? Regards, Jeff Davis
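The up-front check being suggested could be as small as the following sketch (not from any posted patch; the variable names are invented, though GetFlushRecPtr() and LSN_FORMAT_ARGS() are existing APIs):

/* Reject an end LSN beyond what has been flushed, instead of waiting. */
XLogRecPtr	flush_lsn = GetFlushRecPtr(NULL);

if (end_lsn > flush_lsn)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("end LSN %X/%X is ahead of the WAL flush point %X/%X",
					LSN_FORMAT_ARGS(end_lsn),
					LSN_FORMAT_ARGS(flush_lsn))));

With such a check at the function entry points, the nowait changes in xlogutils.c could then be reverted, as suggested above.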
Re: bogus: logical replication rows/cols combinations
On Thu, Apr 28, 2022 at 3:26 AM Tomas Vondra wrote: > > so I've been looking at tweaking the code so that the behavior matches > Alvaro's expectations. It passes check-world, but I'm not claiming it's > anywhere near committable - the purpose is mostly to give a better idea of > how invasive the change is etc. > I was just skimming through the patch and didn't find anything related to initial sync handling. I feel the behavior should be the same for initial sync and replication. -- With Regards, Amit Kapila.
Re: avoid multiple hard links to same WAL file after a crash
On Wed, Apr 27, 2022 at 11:42:04AM -0700, Nathan Bossart wrote: > Here is a new patch set with these assertions added. I think at least the > xlog.c change ought to be back-patched. The problem may be unlikely, but > AFAICT the possible consequences include WAL corruption.

Okay, so I have applied this stuff this morning to see what the buildfarm had to say, and we have finished with a set of failures in various buildfarm members: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kestrel&dt=2022-04-28%2002%3A13%3A27 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2022-04-28%2002%3A14%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2022-04-28%2002%3A59%3A26

None of them liked the part where we assume that a TLI history file written by a WAL receiver should not exist beforehand, but as 025_stuck_on_old_timeline.pl is showing, a standby may attempt to retrieve a TLI history file via streaming even after it has already been fetched from the archives. I was analyzing the whole thing, and it looks like a race condition. Per the buildfarm logs, we have less than 5ms between the moment the startup process retrieves the history file of TLI 2 from the archives and the moment the WAL receiver decides to check if this TLI file exists. If it does not exist, it would then retrieve it from the primary via streaming. So I guess that the sequence of events is: - In WalRcvFetchTimeLineHistoryFiles(), the WAL receiver checks the existence of the history file for TLI 2, does not find it. - The startup process retrieves the file from the archives. - The WAL receiver goes through the internal loop of WalRcvFetchTimeLineHistoryFiles(), retrieves the history file from the primary's stream.

Switching from durable_rename_excl() to durable_rename() would mean that we'd overwrite what's been retrieved from the archives with the TLI file received from the primary's stream. That does not strike me as an issue in itself and that should be safe, so the comment is misleading, and we can live without the assertion in writeTimeLineHistoryFile() called by the WAL receiver. Now, I think that we'd better keep some belts in writeTimeLineHistory() called by the startup process at end-of-recovery, as we should never ever have a TLI file generated when selecting a new timeline. Perhaps this should be an elog(ERROR) at least, with a check on the file existence before calling durable_rename()? Anyway, my time is constrained next week due to the upcoming Japanese Golden Week and the buildfarm has to be stable, so I have reverted the change for now. -- Michael signature.asc Description: PGP signature
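The extra belt suggested for writeTimeLineHistory() could be as simple as this sketch (an assumption of what the check might look like, not a committed change):

/* The new timeline was just chosen; its history file cannot exist yet. */
TLHistoryFilePath(path, newTLI);
if (access(path, F_OK) == 0)
	elog(ERROR, "timeline history file \"%s\" already exists", path);
durable_rename(tmppath, path, ERROR);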
Re: Add --{no-,}bypassrls flags to createuser
Thank you for the reviews! On 2022-04-26 05:19, Nathan Bossart wrote: - printf(_(" -g, --role=ROLE new role will be a member of this role\n")); + printf(_(" -g, --role=ROLEnew role will be a member of this role\n")); This looks lik an unexpected change. I fixed it. I'm ok with -m/--member as well (like with --role only one role can be specified per switch instance so member, not membership, the later meaning, at least for me, the collective). That -m doesn't match --role-to is no worse than -g not matching --role, a short option seems worthwhile, and the -m (membership) mnemonic should be simple to pick-up. I don't see the addition of "-name" to the option name being beneficial. Yes, the standard doesn't use the "TO" prefix for "ROLE" - but taking that liberty for consistency here is very appealing and there isn't another SQL clause that it would be confused with. +1 for "member". It might not be perfect, but IMO it's the clearest option. Thanks! I changed the option "--membership" to "--member". For now, I also think "-m / --member" is the best choice, although it is ambiguous:( I'd like to hear others' opinions. regards -- Shinya Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATIONdiff --git a/doc/src/sgml/ref/createuser.sgml b/doc/src/sgml/ref/createuser.sgml index 17579e50af..189ca5bb67 100644 --- a/doc/src/sgml/ref/createuser.sgml +++ b/doc/src/sgml/ref/createuser.sgml @@ -76,6 +76,20 @@ PostgreSQL documentation + + -a role + --admin=role + + + The new role will be added immediately as a member with admin option + of this role. + Multiple roles to which new role will be added as a member with admin + option can be specified by writing multiple + -a switches. + + + + -c number --connection-limit=number @@ -204,6 +218,19 @@ PostgreSQL documentation + + -m role + --member=role + + + The new role will be added immediately as a member of this role. + Multiple roles to which new role will be added as a member + can be specified by writing multiple + -m switches. + + + + -P --pwprompt @@ -258,6 +285,17 @@ PostgreSQL documentation + + -v timestamp + --valid-until=timestamp + + +Set a timestamp after which the role's password is no longer valid. +The default is to set no expiration. + + + + -V --version @@ -290,6 +328,28 @@ PostgreSQL documentation + + --bypassrls + + +The new user will have the BYPASSRLS privilege, +which is described more fully in the documentation for . + + + + + + --no-bypassrls + + +The new user will not have the BYPASSRLS +privilege, which is described more fully in the documentation for . + + + + -? --help diff --git a/src/bin/scripts/createuser.c b/src/bin/scripts/createuser.c index bfba0d09d1..73cd1b479e 100644 --- a/src/bin/scripts/createuser.c +++ b/src/bin/scripts/createuser.c @@ -51,6 +51,11 @@ main(int argc, char *argv[]) {"connection-limit", required_argument, NULL, 'c'}, {"pwprompt", no_argument, NULL, 'P'}, {"encrypted", no_argument, NULL, 'E'}, + {"bypassrls", no_argument, NULL, 4}, + {"no-bypassrls", no_argument, NULL, 5}, + {"valid-until", required_argument, NULL, 'v'}, + {"member", required_argument, NULL, 'm'}, + {"admin", required_argument, NULL, 'a'}, {NULL, 0, NULL, 0} }; @@ -69,6 +74,9 @@ main(int argc, char *argv[]) int conn_limit = -2; /* less than minimum valid value */ bool pwprompt = false; char *newpassword = NULL; + SimpleStringList members = {NULL, NULL}; + SimpleStringList admins = {NULL, NULL}; + char *timestamp = NULL; /* Tri-valued variables. 
*/ enum trivalue createdb = TRI_DEFAULT, @@ -76,7 +84,8 @@ main(int argc, char *argv[]) createrole = TRI_DEFAULT, inherit = TRI_DEFAULT, login = TRI_DEFAULT, -replication = TRI_DEFAULT; +replication = TRI_DEFAULT, +bypassrls = TRI_DEFAULT; PQExpBufferData sql; @@ -89,7 +98,7 @@ main(int argc, char *argv[]) handle_help_version_opts(argc, argv, "createuser", help); - while ((c = getopt_long(argc, argv, "h:p:U:g:wWedDsSrRiIlLc:PE", + while ((c = getopt_long(argc, argv, "h:p:U:g:wWedDsSrRiIlLc:PEv:m:a:", long_options, &optindex)) != -1) { switch (c) @@ -165,6 +174,21 @@ main(int argc, char *argv[]) case 3: interactive = true; break; + case 4: +bypassrls = TRI_YES; +break; + case 5: +bypassrls = TRI_NO; +break; + case 'v': +timestamp = pg_strdup(optarg); +break; + case 'm': +simple_string
Re: Defer selection of asynchronous subplans until the executor initialization stage
On Mon, Apr 25, 2022 at 1:29 PM Etsuro Fujita wrote: > I modified mark_async_capable_plan() a bit further; 1) adjusted code > in the ProjectionPath case, just for consistency with other cases, and > 2) tweaked/improved comments a bit. Attached is a new version of the > patch (“prevent-async-2.patch”). > > As mentioned before, v14 has the same issue, so I created a fix for > v14, which I’m attaching as well (“prevent-async-2-v14.patch”). In > the fix I modified is_async_capable_path() the same way as > mark_async_capable_plan() in HEAD, renaming it to > is_async_capable_plan(), and updated some comments. > > Barring objections, I’ll push/back-patch these. Done. Best regards, Etsuro Fujita
Re: Implementing Incremental View Maintenance
Hello Greg, On Sat, 23 Apr 2022 08:18:01 +0200 Greg Stark wrote: > I'm trying to figure out how to get this feature more attention. Everyone > agrees it would be a huge help but it's a scary patch to review. > > I wonder if it would be helpful to have a kind of "readers guide" > explanation of the patches to help a reviewer understand what the point of > each patch is and how the whole system works? I think Andres and Robert > have both taken that approach before with big patches and it really helped > imho. Thank you very much for your suggestion! Following your advice, I am going to write a readers guide referring to the past posts of Andres and Robert. Regards, Yugo Nagata -- Yugo NAGATA
Re: Multi-Master Logical Replication
On Thu, 2022-04-28 at 09:49 +1000, Peter Smith wrote: > To prevent the circular recursive problem Vignesh is developing a > patch [1] that introduces new SUBSCRIPTION options "local_only" (for > publishing only data originating at the publisher node) and > "copy_data=force". Using this patch, we have created a script [2] > demonstrating how to set up all the above multi-node examples. An > overview of the necessary steps is given in the next section. I am missing a discussion of how replication conflicts are handled, to prevent replication from breaking or the databases from drifting apart. Yours, Laurenz Albe
Re: Possible corruption by CreateRestartPoint at promotion
On Thu, Apr 28, 2022 at 11:43:57AM +0900, Kyotaro Horiguchi wrote: > At Thu, 28 Apr 2022 09:12:13 +0900, Michael Paquier > wrote in >> On Wed, Apr 27, 2022 at 11:09:45AM -0700, Nathan Bossart wrote: >>> On Wed, Apr 27, 2022 at 02:16:01PM +0900, Michael Paquier wrote: - if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && - ControlFile->checkPointCopy.redo < lastCheckPoint.redo) - { 7ff23c6 has removed the last call to CreateCheckpoint() outside the checkpointer, meaning that there is one less concurrent race to worry about, but I have to admit that this change, to update the control file's checkPoint and checkPointCopy even if we don't check after ControlFile->checkPointCopy.redo < lastCheckPoint.redo would make the code less robust in ~14. So I am questioning whether a backpatch is actually worth the risk here. >>> >>> IMO we should still check this before updating ControlFile to be safe. >> >> Sure. Fine by me to play it safe. > > Why do we consider concurrent check/restart points here while we don't > consider the same for ControlFile->checkPointCopy?

I am not sure what you mean here. FWIW, I interpret Nathan's suggestion as splitting the existing check in CreateRestartPoint() that we are discussing here into two if blocks, instead of just one: - Move the update of checkPoint and checkPointCopy into its own if block, controlled only by the check on (ControlFile->checkPointCopy.redo < lastCheckPoint.redo) - Keep the code updating minRecoveryPoint and minRecoveryPointTLI mostly as-is, but just do the update when the control file state is DB_IN_ARCHIVE_RECOVERY. Of course we need to keep the check on (minRecoveryPoint < lastCheckPointEndPtr). v5 is mostly doing that. -- Michael signature.asc Description: PGP signature
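In code form, the two if blocks might look like the following sketch (variable names as in CreateRestartPoint(); this outlines the idea rather than quoting the actual v5 patch):

LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
{
	/* Advance the checkpoint position whenever the new one is later. */
	ControlFile->checkPoint = lastCheckPointRecPtr;
	ControlFile->checkPointCopy = lastCheckPoint;
}
if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
	ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
{
	/* Only move minRecoveryPoint forward while still in archive recovery. */
	ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
	ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;
}
UpdateControlFile();
LWLockRelease(ControlFileLock);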