Re: make MaxBackends available in _PG_init

2022-01-25 Thread Michael Paquier
On Mon, Jan 10, 2022 at 07:01:32PM +, Bossart, Nathan wrote:
> On 1/7/22, 12:27 PM, "Robert Haas"  wrote:
>> On Fri, Jan 7, 2022 at 1:09 PM Bossart, Nathan  wrote:
>>> While that approach would provide a way to safely retrieve the value,
>>> I think it would do little to prevent the issue in practice.  If the
>>> size of the patch is a concern, we could also convert MaxBackends into
>>> a macro for calling GetMaxBackends().  This could also be a nice way
>>> to avoid breaking extensions that are using it correctly while
>>> triggering ERRORs for extensions that are using it incorrectly.  I've
>>> attached a new version of the patch that does it this way.
>>
>> That's too magical for my taste.
> 
> Fair point.  v4 [0] is the less magical version.

So, where are we on this patch?  It looks like there is an agreement
that MaxBackends is used widely enough to justify the use of a
separate function to set and get the computed value.  There may
be other parameters that could use a brush-up, but most known cases
would be addressed here.  v4 looks rather straightforward, at a quick
glance.
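
For illustration, a minimal sketch of the set/get function approach being
discussed (the names and error text below are assumptions, not necessarily
what the v4 patch does):

/* Sketch only: a setter used once during startup, and a getter that refuses
 * to hand out an uninitialized value, e.g. when called too early from a
 * preloaded library's _PG_init(). */
static int	MaxBackendsValue = 0;

void
SetMaxBackends(int max_backends)
{
	if (MaxBackendsValue != 0)
		elog(ERROR, "MaxBackends already initialized");
	MaxBackendsValue = max_backends;
}

int
GetMaxBackends(void)
{
	if (MaxBackendsValue == 0)
		elog(ERROR, "MaxBackends not yet initialized");
	return MaxBackendsValue;
}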

(I'd also prefer the less magical version.)
--
Michael




Re: missing indexes in indexlist with partitioned tables

2022-01-25 Thread Amit Langote
Hi,

On Mon, Jan 24, 2022 at 9:30 PM Arne Roland  wrote:
> The comment at my end goes on:
>
> /*
> * Don't add partitioned indexes to the indexlist, since they are
> * not usable by the executor. If they are unique add them to the
> * partindexlist instead, to use for further pruning. If they
> * aren't that either, simply skip them.
> */

"partindexlist" really made me think about a list of "partial indexes"
for some reason.  I think maybe "partedindexlist" is what you are
looking for; "parted" is commonly used as short for "partitioned" when
naming variables.

The comment only mentions "further pruning" as to what partitioned
indexes are to be remembered in RelOptInfo, but it's not clear what
that means.  It may help to be more specific.

Finally, I don't understand why we need a separate field to store
indexes found in partitioned base relations.  AFAICS, nothing but the
sites you are interested in (relation_has_unique_index_for() and
rel_supports_distinctness()) would ever bother to look at a
partitioned base relation's indexlist.  Do you think putting them into
indexlist might break something?

> Regarding the semantics: This is sort of what the statement checks for (skip 
> for inhparent, if not unique or not a partitioned index), i.e. it checks for 
> the case where the index shouldn't be added to either list.
>
> Side note: I personally think the name inhparent is mildly confusing, since 
> it's not really about inheritance. I don't have a significantly better idea 
> though.

Partitioned tables are "inheritance parents", so they share the same code
that traditional inheritance parents have always used for planning.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-25 Thread David Rowley
On Tue, 25 Jan 2022 at 20:03, David Rowley  wrote:
>
> On Tue, 25 Jan 2022 at 17:35, Yura Sokolov  wrote:
> > And another attempt to fix tests volatility.
>
> FWIW, I had not really seen the point in adding a test for this.   I
> did however see a point in it with your original patch. It seemed
> useful there to verify that Gather and GatherMerge did what we
> expected with 1 worker.

I ended up pushing just the last patch I sent.

The reason I didn't think it was worth adding a new test was that no
tests were added in the original commit.  Existing tests did cover it,
but here we're just restoring the original behaviour for one simple
case.  The test in your patch just seemed a bit more hassle than it
was worth. I struggle to imagine how we'll break this again.

David




libpq async duplicate error results

2022-01-25 Thread Peter Eisentraut


This issue was discovered by Fabien in the SHOW_ALL_RESULTS thread.  I'm 
posting it here separately, because I think it ought to be addressed in 
libpq rather than with a workaround in psql, as proposed over there.


When using PQsendQuery() + PQgetResult() and the server crashes during 
the execution of the command, PQgetResult() then returns two result sets 
with partially duplicated error messages, like this from the attached 
test program:


command = SELECT 'before';
result 1 status = PGRES_TUPLES_OK
error message = ""

command = SELECT pg_terminate_backend(pg_backend_pid());
result 1 status = PGRES_FATAL_ERROR
error message = "FATAL:  terminating connection due to administrator command
"
result 2 status = PGRES_FATAL_ERROR
error message = "FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
"

command = SELECT 'after';
PQsendQuery() error: no connection to the server


This is hidden in normal use because PQexec() throws away all but the 
last result set, but the extra one is still generated internally.
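
(For reference, a rough, simplified sketch of what PQexec() effectively does
on top of the async API, ignoring COPY and other special cases, which is why
the duplicated error result normally goes unnoticed:)

PGresult   *lastResult = NULL;
PGresult   *result;

while ((result = PQgetResult(conn)) != NULL)
{
	PQclear(lastResult);
	lastResult = result;
}
/* lastResult is roughly what PQexec() hands back to the caller */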


Apparently, this has changed between PG13 and PG14.  In PG13 and 
earlier, the output is


command = SELECT 'before';
result 1 status = PGRES_TUPLES_OK
error message = ""

command = SELECT pg_terminate_backend(pg_backend_pid());
result 1 status = PGRES_FATAL_ERROR
error message = "FATAL:  terminating connection due to administrator command
"
result 2 status = PGRES_FATAL_ERROR
error message = "server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
"

command = SELECT 'after';
PQsendQuery() error: no connection to the server

In PG13, PQexec() concatenates all the error messages from multiple 
results, so a user of PQexec() sees the same output before and after. 
But for users of the lower-level APIs, things have become a bit more 
confusing.


Also, why are there multiple results being generated in the first place?


[0]: 
https://www.postgresql.org/message-id/alpine.DEB.2.22.394.2112230703530.2668598@pseudo

#include <stdio.h>
#include <libpq-fe.h>

static void
run_query(PGconn *conn, const char *command)
{
    int         rc;
    PGresult   *res;
    int         count = 0;

    fprintf(stderr, "command = %s\n", command);
    rc = PQsendQuery(conn, command);
    if (rc != 1)
    {
        char       *error = PQerrorMessage(conn);

        fprintf(stderr, "PQsendQuery() error: %s", error);
        return;
    }

    while ((res = PQgetResult(conn)))
    {
        ExecStatusType status;

        status = PQresultStatus(res);
        fprintf(stderr, "result %d status = %s\n", ++count,
                PQresStatus(status));
        fprintf(stderr, "error message = \"%s\"\n",
                PQresultErrorMessage(res));
        PQclear(res);
    }

    fprintf(stderr, "\n");
    return;
}

int
main()
{
    PGconn     *conn;

    conn = PQconnectdb("");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "Connection to database failed: %s\n",
                PQerrorMessage(conn));
        return 1;
    }

    run_query(conn, "SELECT 'before';");
    run_query(conn, "SELECT pg_terminate_backend(pg_backend_pid());");
    run_query(conn, "SELECT 'after';");

    PQfinish(conn);
    return 0;
}


Re: Make message at end-of-recovery less scary.

2022-01-25 Thread Kyotaro Horiguchi
At Mon, 24 Jan 2022 14:23:33 +0400, Pavel Borisov  
wrote in 
> >
> > d2ddfa681db27a138acb63c8defa8cc6fa588922 removed global variables
> > ReadRecPtr and EndRecPtr. This is rebased version that reads the LSNs
> > directly from xlogreader instead of the removed global variables.
> >
> 
> Hi, hackers!
> 
> I've checked the latest version of the patch. It applies cleanly, check-world
> passes and CI is also in the green state.
> The proposed messages seem good to me, but probably it would be better to have
> a test on the conditions where "reached end of WAL..." is emitted.
> Then, I believe it can be set as 'ready for committer'.

Thanks for checking that, and the comment!

I thought that we usually don't test log messages, but I found that I
needed to here.  That is because I found another mode of end-of-WAL, and,
in passing, a bug that emits a spurious message...

This v8 is changed in...

- Added tests to 011_crash_recovery.pl

- Fixed a bug where the server emits "end-of-wal" messages even if it has
  emitted an error message for the same LSN.

- Changed XLogReaderValidatePageHeader() so that it recognizes an
  empty page as end-of-WAL.

- Made pg_waldump conscious of end-of-wal.

While doing the last item, I noticed that pg_waldump shows the wrong
LSN as the error position.  Concretely, it emits the LSN of the last
valid WAL record as the error position.  I will post a bug-fix patch
for the issue after confirmation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0f1024bdfba9d1926465351fa1b7125698a21e8d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v8] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message like
"invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary: "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c |  91 +-
 src/backend/access/transam/xlogreader.c   |  64 +
 src/backend/replication/walreceiver.c |   3 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 ++-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +-
 6 files changed, 254 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 58922f7ede..c08b9554b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 	for (;;)
 	{
 		char	   *errormsg;
+		XLogRecPtr	ErrRecPtr = InvalidXLogRecPtr;
 
 		record = XLogReadRecord(xlogreader, &errormsg);
 		if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			{
 abortedRecPtr = xlogreader->abortedRecPtr;
 missingContrecPtr = xlogreader->missingContrecPtr;
+ErrRecPtr = abortedRecPtr;
+			}
+			else
+			{
+/*
+ * NULL ReadRecPtr means we could not read a record at the
+ * beginning. In that case EndRecPtr is storing the LSN of the
+ * record we tried to read.
+ */
+ErrRecPtr =
+	xlogreader->ReadRecPtr ?
+	xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
 			}
 
 			if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			}
 
 			/*
-			 * We only end up here without a message when XLogPageRead()
-			 * failed - in that case we already logged something. In
-			 * StandbyMode that only happens if we have been triggered, so we
-			 * shouldn't loop anymore in that case.
+			 * If we get here for other than end-of-wal, emit the error message
+			 * right now. Otherwise the message if any is shown as a part of
+			 * the end-of-WAL message below.
 			 */
-			if (errormsg)
-ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+			if (!xlogreader->EndOfWAL && errormsg)
+ereport(emode_for_corrupt_record(emode, ErrRecPtr),
 		(errmsg_internal("%s", errormsg) /* already translated */ ));
 		}
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			/* Great, got a record */
 			return record;
 		}
-		else
+
+		/* No valid record available from this source */
+		lastSourceFailed = true;
+
+		if (!fetching_ckpt)
 		{
-			/* No valid record available from this source */
-			lastSourceFailed = true;
-
 			/*
 			 * If archive recovery was requested, but we were still doing
 			 * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 			 * we'd have no idea how far we'd have to replay to reach
 			 * consistency.  So err on the safe side and give up.
 			 */
-			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-!fetc

Re: Schema variables - new implementation for Postgres 15

2022-01-25 Thread Pavel Stehule
On Tue, Jan 25, 2022 at 6:18, Julien Rouhaud wrote:

> Hi,
>
> On Mon, Jan 24, 2022 at 12:33:11PM +0100, Pavel Stehule wrote:
> >
> > here is updated patch with locking support
>
> Thanks for updating the patch!
>
> While the locking is globally working as intended, I found a few problems
> with
> it.
>
> First, I don't think that acquiring the lock in
> get_session_variable_type_typmod_collid() and
> prepare_variable_for_reading() is
> the correct approach.  In transformColumnRef() and transformLetStmt() you
> first
> call IdentifyVariable() to check if the given name is a variable without
> locking it and later try to lock the variable if you get a valid Oid.
> This is
> bug prone as any other backend could drop the variable between the two
> calls
> and you would end up with a cache lookup failure.  I think the lock should
> be
> acquired during IdentifyVariable.  It should probably be optional as one
> codepath only needs the information to raise a warning when a variable is
> shadowed, so a concurrent drop isn't a problem there.
>

There is a problem: before the IdentifyVariable call I don't know
whether the variable will be shadowed or not.

If I lock a variable inside IdentifyVariable, then I need to remember whether
I took the lock there or the variable was already locked, and if the variable
is shadowed and the lock is fresh, then I can unlock the variable.


Regards

Pavel


Re: Schema variables - new implementation for Postgres 15

2022-01-25 Thread Julien Rouhaud
Hi,

On Tue, Jan 25, 2022 at 09:35:09AM +0100, Pavel Stehule wrote:
> On Tue, Jan 25, 2022 at 6:18, Julien Rouhaud wrote:
> 
> > I think the lock should be
> > acquired during IdentifyVariable.  It should probably be optional as one
> > codepath only needs the information to raise a warning when a variable is
> > shadowed, so a concurrent drop isn't a problem there.
> >
> 
> There is a problem, because before the IdentifyVariable call I don't know
> if the variable will be shadowed or not.
> 
> If I lock a variable inside IdentifyVariable, then I need to remember if I
> did lock there, or if the variable was locked already, and If the variable
> is shadowed and if lock is fresh, then I can unlock the variable.

But in transformColumnRef() you already know if you found a matching column or
not when calling IdentifyVariable(), so you know if an existing variable will
shadow it right?

Couldn't you call something like

lockit = node == NULL;
varid = IdentifyVariable(cref->fields, &attrname, &not_unique, lockit);

The only other caller is transformLetStmt(), which should always lock the
variable anyway.




Re: Schema variables - new implementation for Postgres 15

2022-01-25 Thread Pavel Stehule
On Tue, Jan 25, 2022 at 9:48, Julien Rouhaud wrote:

> Hi,
>
> On Tue, Jan 25, 2022 at 09:35:09AM +0100, Pavel Stehule wrote:
> > On Tue, Jan 25, 2022 at 6:18, Julien Rouhaud wrote:
> >
> > > I think the lock should be
> > > acquired during IdentifyVariable.  It should probably be optional as
> one
> > > codepath only needs the information to raise a warning when a variable
> is
> > > shadowed, so a concurrent drop isn't a problem there.
> > >
> >
> > There is a problem, because before the IdentifyVariable call I don't know
> > if the variable will be shadowed or not.
> >
> > If I lock a variable inside IdentifyVariable, then I need to remember if
> I
> > did lock there, or if the variable was locked already, and If the
> variable
> > is shadowed and if lock is fresh, then I can unlock the variable.
>
> But in transformColumnRef() you already know if you found a matching
> column or
> not when calling IdentifyVariable(), so you know if an existing variable
> will
> shadow it right?
>

Yes, you are right.

Thank you



>
> Couldn't you call something like
>
> lockit = node == NULL;
> varid = IdentifyVariable(cref->fields, &attrname, &not_unique,
> lockit);
>
> The only other caller is transformLetStmt(), which should always lock the
> variable anyway.
>


RE: row filtering for logical replication

2022-01-25 Thread houzj.f...@fujitsu.com
On Monday, January 24, 2022 5:36 AM Peter Smith 
> 
> FYI - I noticed the cfbot is reporting a failed test case [1] for the latest 
> v69 patch
> set.
> 
> [21:09:32.183] # Failed test 'check replicated inserts on subscriber'
> [21:09:32.183] # at t/025_rep_changes_for_schema.pl line 202.
> [21:09:32.183] # got: '21|1|2139062143'
> [21:09:32.183] # expected: '21|1|21'
> [21:09:32.183] # Looks like you failed 1 test of 13.
> [21:09:32.183] [21:08:49] t/025_rep_changes_for_schema.pl 

The test passed for the latest v70 patch set. I will keep an eye on the cfbot
and if the error happens again in the future, I will continue to investigate
this.

Best regards,
Hou zj


Re: sequences vs. synchronous replication

2022-01-25 Thread Peter Eisentraut



On 15.01.22 23:57, Tomas Vondra wrote:
This approach (and also my previous proposal) seems to assume that the 
value returned from nextval() should not be used until the transaction 
executing that nextval() has been committed successfully. But I'm not 
sure how many applications follow this assumption. Some applications 
might use the return value of nextval() immediately, before issuing the 
commit command. Some might use the return value of nextval() executed 
in a rolled-back transaction.




IMO any application that assumes data from uncommitted transactions is 
outright broken and we should not try to fix that because it's quite 
futile (and likely will affect well-behaving applications).


The issue I'm trying to fix in this thread is much narrower - we don't 
actually meet the guarantees for committed transactions (that only did 
nextval without generating any WAL).


The wording in the SQL standard is:

"Changes to the current base value of a sequence generator are not 
controlled by SQL-transactions; therefore, commits and rollbacks of 
SQL-transactions have no effect on the current base value of a sequence 
generator."


This implies the well-known behavior that consuming a sequence value is 
not rolled back.  But it also appears to imply that committing a 
transaction has no impact on the validity of a sequence value produced 
during that transaction.  In other words, this appears to imply that 
making use of a sequence value produced in a rolled-back transaction is 
valid.


A very strict reading of this would seem to imply that every single 
nextval() call needs to be flushed to WAL immediately, which is of 
course impractical.





Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-25 Thread Yura Sokolov
On Tue, 25/01/2022 at 21:20 +1300, David Rowley wrote:
> On Tue, 25 Jan 2022 at 20:03, David Rowley  wrote:
> > On Tue, 25 Jan 2022 at 17:35, Yura Sokolov  wrote:
> > > And another attempt to fix tests volatility.
> > 
> > FWIW, I had not really seen the point in adding a test for this.   I
> > did however see a point in it with your original patch. It seemed
> > useful there to verify that Gather and GatherMerge did what we
> > expected with 1 worker.
> 
> I ended up pushing just the last patch I sent.
> 
> The reason I didn't think it was worth adding a new test was that no
> tests were added in the original commit.  Existing tests did cover it,

Existing tests didn't catch the issue. It is a pity the fix was merged
without the test case it fixes.

> but here we're just restoring the original behaviour for one simple
> case.  The test in your patch just seemed a bit more hassle than it
> was worth. I struggle to imagine how we'll break this again.

Thank you for your attention and for the fix.

regards,
Yura Sokolov.





Re: Support tab completion for upper character inputs in psql

2022-01-25 Thread Julien Rouhaud
Hi,

On Tue, Jan 25, 2022 at 05:22:32AM +, tanghy.f...@fujitsu.com wrote:
> 
> I tried to add a flag (casesensitive) in _complete_from_query().
> Now the attached patch passes all the added TAP tests.

Thanks for updating the patch.  When you do so, please check and update the
commitfest entry accordingly to make sure that people know it's ready for
review.  I'm switching the entry to Needs Review.




Re: Parameter for planner estimate of recursive queries

2022-01-25 Thread Peter Eisentraut

On 31.12.21 15:10, Simon Riggs wrote:

The factor 10 is a reasonably safe assumption and helps avoid worst
case behavior in bigger graph queries. However, the factor 10 is way
too large for many types of graph query, such as where the path
through the data is tight, and/or the query is written to prune bushy
graphs, e.g. shortest path queries. The factor 10 should not be
hardcoded in the planner, but should be settable, just as
cursor_tuple_fraction is.

If you think this should be derived without parameters, then we would
want a function that starts at 1 for 1 input row and gets much larger
for larger input. The thinking here is that Graph OLTP is often a
shortest path between two nodes, whereas for Graph Analytics the
worktable will get much bigger.


On the one hand, this smells like a planner hint.  But on the other 
hand, it doesn't look like we will come up with a proper graph-aware 
selectivity estimation system any time soon, so just having all graph 
OLTP queries suck until then because the planner hint is hardcoded 
doesn't seem like a better solution.  So I think this setting can be ok.
I think the way you have characterized it makes sense, too: for graph 
OLAP, you want a larger value; for graph OLTP, you want a smaller value.
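
For concreteness, a settable factor of this kind would be wired up like any
other real-valued planner parameter.  A rough sketch of a guc.c entry, modeled
on cursor_tuple_fraction; the variable name, default, and range below are
purely illustrative and not taken from the patch:

/* Hypothetical variable; the real one would live with the other planner
 * cost parameters. */
static double recursive_worktable_estimate_factor = 10.0;

/* Hypothetical entry in guc.c's ConfigureNamesReal[]: */
{
	{"recursive_worktable_estimate_factor", PGC_USERSET, QUERY_TUNING_OTHER,
		gettext_noop("Sets the planner's estimate of the size of a recursive "
					 "query's working table, relative to its initial size."),
		NULL,
		GUC_EXPLAIN
	},
	&recursive_worktable_estimate_factor,
	10.0, 1.0, 1000000.0,
	NULL, NULL, NULL
},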






Reg. evaluation of expression in HashCond

2022-01-25 Thread Vignesh K
Hi,



  I recently noticed that expressions involved on either side of the hash 
condition in a HashJoin are not being pushed down to the foreign scan. This 
leads to the same expression being evaluated multiple times (for hash value 
computation from the hash keys, for hash condition evaluation, and for 
projection). I am not sure whether it is intended behavior to not push down 
expressions in the hash condition. Kindly clarify this case. I have attached 
a sample plan for reference.

 contrib_regression=# explain verbose select x_vec.a*2, y_vec.a*2 as a from x_vec, y_vec where x_vec.a*2 = y_vec.a*2 and x_vec.a*2 != 10;

                                 QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=2.09..4.40 rows=4 width=12)
   Output: (x_vec.a * 2), (y_vec.a * 2)
   Hash Cond: ((x_vec.a * 2) = (y_vec.a * 2))
   ->  Foreign Scan on public.x_vec  (cost=0.00..2.18 rows=12 width=4)
         Output: x_vec.a, x_vec.b
         Filter: ((x_vec.a * 2) <> 10)
         CStore Dir: /home/test/postgres/datasets11/cstore_fdw/452395/453195
         CStore Table Size: 28 kB
   ->  Hash  (cost=2.04..2.04 rows=4 width=8)
         Output: y_vec.a
         ->  Foreign Scan on public.y_vec  (cost=0.00..2.04 rows=4 width=8)
               Output: y_vec.a
               CStore Dir: /home/test/postgres/datasets11/cstore_fdw/452395/453068
               CStore Table Size: 28 kB
(14 rows)


  Here the same expression is being used in the hash condition and the 
projection. Since it is not being pushed down to the scan, it is evaluated 
multiple times: for the hash value, the hash condition, and the projection.
I have used a simple expression as an example. If the expression is complex, 
query execution slows down because of this.


The same happens even when the expression is used at multiple levels.

contrib_regression=# explain verbose select * from (select x_vec.a*2 as xa2, y_vec.a*2 as ya2 from x_vec, y_vec where x_vec.a*2 = y_vec.a*2) q1 join a on q1.xa2 = a.a;

                                     QUERY PLAN
------------------------------------------------------------------------------------
 Hash Join  (cost=4.37..8.51 rows=2 width=28)
   Output: (x_vec.a * 2), (y_vec.a * 2), a.a, a.b
   Hash Cond: (a.a = (x_vec.a * 2))
   ->  Foreign Scan on public.a  (cost=0.00..4.07 rows=7 width=16)
         Output: a.a, a.b
         CStore Dir: /home/test/postgres/datasets11/cstore_fdw/452395/453149
         CStore Table Size: 28 kB
   ->  Hash  (cost=4.32..4.32 rows=4 width=12)
         Output: x_vec.a, y_vec.a
         ->  Hash Join  (cost=2.09..4.32 rows=4 width=12)
               Output: x_vec.a, y_vec.a
               Hash Cond: ((x_vec.a * 2) = (y_vec.a * 2))
               ->  Foreign Scan on public.x_vec  (cost=0.00..2.12 rows=12 width=4)
                     Output: x_vec.a, x_vec.b
                     CStore Dir: /home/test/postgres/datasets11/cstore_fdw/452395/453195
                     CStore Table Size: 28 kB
               ->  Hash  (cost=2.04..2.04 rows=4 width=8)
                     Output: y_vec.a
                     ->  Foreign Scan on public.y_vec  (cost=0.00..2.04 rows=4 width=8)
                           Output: y_vec.a
                           CStore Dir: /home/test/postgres/datasets11/cstore_fdw/452395/453068
                           CStore Table Size: 28 kB
(22 rows)





Thanks and regards,

Vignesh K.

Re: [PATCH] Allow multiple recursive self-references

2022-01-25 Thread Peter Eisentraut

On 14.01.22 15:55, Denis Hirn wrote:

There is nothing in there that says that certain branches of the UNION in a 
recursive query mean certain things. In fact, it doesn't even require the query 
to contain a UNION at all.  It just says to iterate on evaluating the query 
until a fixed point is reached.  I think this supports my claim that the 
associativity and commutativity of a UNION in a recursive query still apply.

This is all very complicated, so I don't claim this to be authoritative, but I 
just don't see anything in the spec that supports what you are saying.


I disagree. In SQL:2016, it's discussed in 7.16 <query expression> syntax rule 
3) j) i), which defines:

[actually 7.17]



3) Among the WQEi, ..., WQEk of a given stratum, there shall be at least one 
<query expression body>, say WQEj, such that:
 A) WQEj is a <query expression body> that immediately contains UNION.
 B) WQEj has one operand that does not contain a <query name> referencing 
any of WQNi, ..., WQNk. This operand is said to be the non-recursive operand 
of WQEj.
 C) WQEj is said to be an anchor expression, and WQNj an anchor name.


Where <query expression body> is defined as:

<query expression body> ::=
    <query term>
  | <query expression body> UNION [ALL | DISTINCT]
    [ <corresponding spec> ] <query term>

<query term> ::=
    <query primary>
  | ...

<query primary> ::= ...
  | <left paren> <query expression body> <right paren> ...


This definition pretty much sums up what I have called RUNION.

The SQL standard might not impose a strict order on the UNION branches.
But you have to be able to uniquely identify the anchor expression.


Right, the above text does not impose any ordering of the UNION.  It 
just means that it has to have an operand that is not recursive.  I 
think that is what I was trying to say.


I don't understand what your RUNION examples are meant to show.




Re: UNIQUE null treatment option

2022-01-25 Thread Maxim Orlov
Since cfbot failed with an error probably unrelated to the patch itself
(see https://cirrus-ci.com/task/5330150500859904)
and a repeated check did not start automatically, I reattach patch v3 to
restart cfbot on this patch.



-- 
Best regards,
Maxim Orlov.


v3-0001-Add-UNIQUE-null-treatment-option.patch
Description: Binary data


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-01-25 Thread Julien Rouhaud
Hi,

On Mon, Jan 24, 2022 at 08:16:06PM +0300, Anton A. Melnikov wrote:
> Hi, Andrey!
> 
> I've checked the 5th version of the patch and there are some remarks.
> 
> >I've created a new view named pg_stat_statements_aux. But for now both
> >views are using the same function pg_stat_statements which returns all
> >fields. It seems reasonable to me - if sampling solution will need all
> >values it can query the function.
> 
> Agreed, it might be useful in some cases.
> 
> >But it seems "stats_reset" term is not quite correct in this case. The
> >"stats_since" field holds the timestamp of hashtable entry, but not the
> >reset time. The same applies to aux_stats_since - for new statement it
> >holds its entry time, but in case of reset it will hold the reset time.
> 
> Thanks for the clarification. Indeed if we mean the word 'reset' as the
> removal of all the hashtable entries during pg_stat_statements_reset() then
> we shouldn't use it for the first occurrence timestamp in the struct
> pgssEntry.
> So with the stats_since field everything is clear.
> On the other hand aux_stats_since field can be updated for two reasons:
> 1) The same as for stats_since due to first occurrence of entry in the
> hashtable. And it will be equal to stats_since timestamp in that case.
> 2) Due to an external reset from an upper level sampler.
> I think it's not very important how to name this field, but it would be
> better to mention both these reasons in the comment.

Are those 4 new counters really worth it?

The min/max were added to make pg_stat_statements a bit more useful if you
have nothing else using that extension.  But once you set up a tool that
snapshots the metrics regularly, do you really need to know e.g. "what was the
maximum execution time in the last 3 years", which will typically be what
people will retrieve from the non-aux min/max counters, rather than simply
using your additional tool for better and more precise information?




Re: [PATCH] Allow multiple recursive self-references

2022-01-25 Thread Peter Eisentraut
The explanations below are satisfactory to me.  I think the executor 
changes in this patch are ok.  But I would like someone else who has 
more experience in this particular area to check it too; I'm not going 
to take committer responsibility for this by myself without additional 
review.


As I said earlier, I think semantically/mathematically, the changes 
proposed by this patch set are okay.


The rewriting in the parse analysis is still being debated, but it can 
be tackled in separate patches/commits.



On 11.01.22 12:33, Denis Hirn wrote:

I wonder why in ExecWorkTableScan() and ExecReScanWorkTableScan() you call
tuplestore_copy_read_pointer() instead of just tuplestore_select_read_pointer().

The WorkTableScan reads the working_table tuplestore of the parent 
RecursiveUnion
operator. But after each iteration of the RecursiveUnion operator, the following
operations are performed:


122/* done with old working table ... */
123tuplestore_end(node->working_table);  -- (1)
124
125/* intermediate table becomes working table */
126node->working_table = node->intermediate_table;   -- (2)
127
128  /* create new empty intermediate table */
129  node->intermediate_table = tuplestore_begin_heap(false, false,
130   work_mem); -- (3)

https://doxygen.postgresql.org/nodeRecursiveunion_8c_source.html#l00122

In step (1), the current working_table is released. Therefore, all read pointers
that we had additionally allocated are gone, too. The intermediate table becomes
the working table in step (2), and finally a new empty intermediate table is
created (3).

Because of step (1), we have to allocate new unique read pointers for each 
worktable
scan again. Just using tuplestore_select_read_pointer() would be incorrect.


What is the special role of read pointer 0 that you are copying.  Before your
changes, it was just the implicit read pointer, but now that we have several,
it would be good to explain their relationship.

To allocate a new read pointer, tuplestore_alloc_read_pointer() could also be 
used.
However, we would have to know about the eflags parameter – which the worktable
scan has no information about.

The important thing about read pointer 0 is that it always exists, and it is 
initialized correctly.
Therefore, it is safe to copy read pointer 0 instead of creating a new one from 
scratch.



Also, the comment you deleted says "Therefore, we don't need a private read 
pointer
for the tuplestore, nor do we need to tell tuplestore_gettupleslot to copy."
You addressed the first part with the read pointer handling, but what about the
second part?  The tuplestore_gettupleslot() call in WorkTableScanNext() still
has copy=false.  Is this an oversight, or did you determine that copying is
still not necessary?

That's an oversight. Copying is still not necessary. Copying would only be 
required if additional writes to the tuplestore occur. But that cannot happen 
here.






Re: Non-decimal integer literals

2022-01-25 Thread Peter Eisentraut

On 24.01.22 19:53, Robert Haas wrote:

On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut wrote:

Rebased patch set


What if someone finds this new behavior too permissive?


Which part exactly?  There are several different changes proposed here.




Re: WIN32 pg_import_system_collations

2022-01-25 Thread Peter Eisentraut

On 24.01.22 22:23, Dmitry Koval wrote:

+/*
+ * Windows will use hyphens between language and territory, where POSIX
+ * uses an underscore. Simply make it POSIX looking.
+ */
+ hyphen = strchr(localebuf, '-');
+ if (hyphen)
+    *hyphen = '_';

After this block, the modified collation name is used in the function

GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version)

(see the win32_read_locale() -> CollationFromLocale() -> CollationCreate()
call). Is it correct to use (wide_collcollate = "en_NZ") instead of
(wide_collcollate = "en-NZ") in the GetNLSVersionEx() function?


I don't really know if this is necessary anyway.  Just create the 
collations with the names that the operating system presents.  There is 
no requirement to make the names match POSIX.


If you want to make them match POSIX for some reason, you can also just 
change the object name but leave the collcollate/collctype fields the 
way they came from the OS.





Re: GUC flags

2022-01-25 Thread Peter Eisentraut

On 25.01.22 02:07, Justin Pryzby wrote:

+CREATE TABLE pg_settings_flags AS SELECT name, category,
+   'NO_SHOW_ALL'   =ANY(flags) AS no_show_all,
+   'NO_RESET_ALL'  =ANY(flags) AS no_reset_all,
+   'NOT_IN_SAMPLE' =ANY(flags) AS not_in_sample,
+   'EXPLAIN'   =ANY(flags) AS guc_explain,
+   'COMPUTED'  =ANY(flags) AS guc_computed
+   FROM pg_show_all_settings();


Does this stuff have any value for users?  I'm worried we are exposing a 
bunch of stuff that is really just for internal purposes.  Like, what 
value does showing "not_in_sample" have?  On the other hand, 
"guc_explain" might be genuinely useful, since that is part of a 
user-facing feature.  (I don't like the "guc_*" naming though.)


Your patch doesn't contain a documentation change, so I don't know how 
and to what extent this is supposed to be presented to users.





Re: PublicationActions - use bit flags.

2022-01-25 Thread Peter Eisentraut

On 25.01.22 07:14, Greg Nancarrow wrote:


On Tue, Jan 25, 2022 at 7:31 AM Peter Eisentraut wrote:

 >
 > Why can't GetRelationPublicationActions() have the PublicationActions as
 > a return value, instead of changing it to an output argument?

That would be OK too, for now, for the current (small size, typically 
4-byte) PublicationActions struct.
But if that function was extended in the future to return more 
publication information than just the PublicationActions struct (and I'm 
seeing that in the filtering patches [1]), then using return-by-value 
won't be as efficient as pass-by-reference, and I'd tend to stick with 
pass-by-reference in that case.


[1] 
https://postgr.es/m/OS0PR01MB5716B899A66D2997EF28A1B3945F9%40OS0PR01MB5716.jpnprd01.prod.outlook.com 



By itself, this refactoring doesn't seem worth it.  The code is actually 
longer at the end, and we haven't made it any more extensible or 
anything.  And AFAICT, this is not called in a performance-sensitive way.


The proposed changes in [1] change this function more significantly, so 
adopting the present change wouldn't really help there either except 
create the need for one more rebase.


So I think we should leave this alone here and let [1] make the changes 
it needs.





Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-25 Thread Julien Rouhaud
Hi,

On Mon, Jan 24, 2022 at 10:33:43AM +0300, Michail Nikolaev wrote:
> 
> Thanks for your attention.
> After some investigation, I think I have found the problem. It is
> caused by XLOG_RUNNING_XACTS at an undetermined moment (some test
> parts rely on it).
> 
> Now test waits for XLOG_RUNNING_XACTS to happen (maximum is 15s) and
> proceed forward.
> 
> I'll move entry back to "Ready for Committer" once it passes tests.

It looks like you didn't fetch the latest upstream commits in a while as this
version is still conflicting with 7a5f6b474 (Make logical decoding a part of
the rmgr) from 6 days ago.

I rebased the patchset in the attached v9.  Please double check that I didn't
miss anything in the rebase.
>From 5022571ba9b95cc86715e7f34acc37f99b5e0153 Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Sat, 15 Jan 2022 16:21:51 +0300
Subject: [PATCH v9 1/3] code

---
 src/backend/access/common/bufmask.c  | 25 
 src/backend/access/gist/gistget.c| 43 +++--
 src/backend/access/gist/gistxlog.c   | 15 +
 src/backend/access/hash/hash.c   |  4 +-
 src/backend/access/hash/hash_xlog.c  | 17 +
 src/backend/access/hash/hashsearch.c | 18 --
 src/backend/access/hash/hashutil.c   | 33 +-
 src/backend/access/heap/heapam.c | 42 +---
 src/backend/access/heap/heapam_handler.c |  5 +-
 src/backend/access/index/genam.c | 20 +++---
 src/backend/access/index/indexam.c   | 81 +---
 src/backend/access/nbtree/nbtinsert.c| 22 +--
 src/backend/access/nbtree/nbtree.c   |  4 +-
 src/backend/access/nbtree/nbtsearch.c| 14 +++-
 src/backend/access/nbtree/nbtutils.c | 33 +-
 src/backend/access/nbtree/nbtxlog.c  | 16 +
 src/backend/access/table/tableam.c   |  4 +-
 src/backend/access/transam/rmgr.c|  4 +-
 src/backend/access/transam/xlogutils.c   |  6 ++
 src/backend/storage/ipc/standby.c|  6 ++
 src/bin/pg_rewind/parsexlog.c|  2 +-
 src/bin/pg_waldump/rmgrdesc.c|  2 +-
 src/include/access/bufmask.h |  1 +
 src/include/access/gist.h|  5 ++
 src/include/access/gistxlog.h|  1 +
 src/include/access/hash.h|  2 +
 src/include/access/hash_xlog.h   |  1 +
 src/include/access/heapam.h  |  2 +-
 src/include/access/nbtree.h  |  2 +
 src/include/access/nbtxlog.h |  1 +
 src/include/access/relscan.h | 15 -
 src/include/access/rmgr.h|  2 +-
 src/include/access/rmgrlist.h| 46 +++---
 src/include/access/tableam.h | 14 ++--
 src/include/access/xlog_internal.h   |  4 ++
 35 files changed, 422 insertions(+), 90 deletions(-)

diff --git a/src/backend/access/common/bufmask.c 
b/src/backend/access/common/bufmask.c
index 4e953bfd61..22026482ad 100644
--- a/src/backend/access/common/bufmask.c
+++ b/src/backend/access/common/bufmask.c
@@ -128,3 +128,28 @@ mask_page_content(Page page)
memset(&((PageHeader) page)->pd_upper, MASK_MARKER,
   sizeof(uint16));
 }
+
+/*
+ * mask_lp_dead
+ *
+ * In some index AMs, line pointer flags can be modified without emitting any
+ * WAL record. Sometimes it is required to mask LP_DEAD flags set on primary to
+ * set own values on standby.
+ */
+void
+mask_lp_dead(Page page)
+{
+   OffsetNumber offnum,
+maxoff;
+
+   maxoff = PageGetMaxOffsetNumber(page);
+   for (offnum = FirstOffsetNumber;
+offnum <= maxoff;
+offnum = OffsetNumberNext(offnum))
+   {
+   ItemId  itemId = PageGetItemId(page, offnum);
+
+   if (ItemIdHasStorage(itemId) && ItemIdIsDead(itemId))
+   itemId->lp_flags = LP_NORMAL;
+   }
+}
diff --git a/src/backend/access/gist/gistget.c 
b/src/backend/access/gist/gistget.c
index adbf622c83..1905c04c51 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,6 +14,7 @@
  */
 #include "postgres.h"
 
+#include "access/bufmask.h"
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/relscan.h"
@@ -49,6 +50,7 @@ gistkillitems(IndexScanDesc scan)
Assert(so->curBlkno != InvalidBlockNumber);
Assert(!XLogRecPtrIsInvalid(so->curPageLSN));
Assert(so->killedItems != NULL);
+   Assert(so->numKilled > 0);
 
buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
if (!BufferIsValid(buffer))
@@ -62,8 +64,13 @@ gistkillitems(IndexScanDesc scan)
 * If page LSN differs it means that the page was modified since the 
last
 * read. killedItems could be not valid so LP_DEAD hints applying is not
 * safe.
+*
+* Another case - standby was promoted after start of current 
transaction.
+* It is not required for correctness, but it is better to just skip
+* everything.
  

Re: Design of pg_stat_subscription_workers vs pgstats

2022-01-25 Thread Masahiko Sawada
Hi,

On Tue, Jan 25, 2022 at 3:31 PM Andres Freund  wrote:
>
> Hi,
>
> I was looking the shared memory stats patch again. The rebase of which
> collided fairly heavily with the addition of pg_stat_subscription_workers.
>
> I'm concerned about the design of pg_stat_subscription_workers. The view was
> introduced in
>
>
> commit 8d74fc96db5fd547e077bf9bf4c3b67f821d71cd
> Author: Amit Kapila 
> Date:   2021-11-30 08:54:30 +0530
>
> Add a view to show the stats of subscription workers.
>
> This commit adds a new system view pg_stat_subscription_workers, that
> shows information about any errors which occur during the application of
> logical replication changes as well as during performing initial table
> synchronization. The subscription statistics entries are removed when the
> corresponding subscription is removed.
>
> It also adds an SQL function pg_stat_reset_subscription_worker() to reset
> single subscription errors.
>
> The contents of this view can be used by an upcoming patch that skips the
> particular transaction that conflicts with the existing data on the
> subscriber.
>
> This view can be extended in the future to track other xact related
> statistics like the number of xacts committed/aborted for subscription
> workers.
>
> Author: Masahiko Sawada
> Reviewed-by: Greg Nancarrow, Hou Zhijie, Tang Haiying, Vignesh C, Dilip 
> Kumar, Takamichi Osumi, Amit Kapila
> Discussion: 
> https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xjfuvihn...@mail.gmail.com
>
>
> I tried to skim-read the discussion leading to its introduction, but it's
> extraordinarily long: 474 messages in [1], 131 messages in [2], as well as a
> few other associated threads.
>
>
> From the commit message alone I am concerned that this appears to be intended
> to be used to store important state in pgstats. For which pgstats is
> fundamentally unsuitable (pgstat can lose state during normal operation,
> always loses state during crash restarts, the state can be reset).

The information in the pg_stat_subscription_workers view, especially
last_error_xid, can be used to specify the XID for the "ALTER SUBSCRIPTION ...
SKIP (xid = XXX)" command which is proposed on the same thread, but it
doesn't mean that the new SKIP command relies on this information. The
failing XID is written in the server logs as well, and the user can
specify the XID manually.

>
> I don't really understand the name "pg_stat_subscription_workers" - what
> workers are stats kept about exactly? The columns don't obviously refer to a
> single worker or such? From the contents it should be name
> pg_stat_subscription_table_stats or such. But no, that'd not quite right,
> because apply errors are stored per-susbcription, while initial sync stuff is
> per-(subscription, table).

This stores stats for subscription workers, namely the apply and tablesync
workers, hence the name pg_stat_subscription_workers.

Also, there is another proposal to add transaction statistics for
logical replication subscribers[1], and it's reasonable to merge those
statistics and this error information rather than having separate
views[2]. There was also an idea to add the transaction statistics to
the pg_stat_subscription view, but that doesn't seem like a good idea
because pg_stat_subscription shows dynamic statistics whereas the
transaction statistics are cumulative statistics[3].

>
> The pgstat entries are quite wide (292 bytes), because of the error message
> stored. That's nearly twice the size of PgStat_StatTabEntry. And as far as I
> can tell, once there was an error, we'll just keep the stats entry around
> until the subscription is dropped.

We can drop the particular statistics with the
pg_stat_reset_subscription_worker() function.

> And that includes stats for long dropped
> tables, as far as I can see - except that they're hidden from view, due to a
> join to pg_subscription_rel.

We are planning to drop this after a successful apply [4].

> To me this looks like it's using pgstat as an extremely poor IPC mechanism.
>
>
> Why isn't this just storing data in pg_subscription_rel?

These need to be updated on error, which means in a failed xact, and we
don't want to update the system catalog in that state. There would be
some challenges in the case where updating pg_subscription_rel also
failed (what to report to the user, etc.). Moreover, we don't
want to consume space for temporary information in the system catalog.

Regards,

[1] 
https://www.postgresql.org/message-id/OSBPR01MB48887CA8F40C8D984A6DC00CED199%40OSBPR01MB4888.jpnprd01.prod.outlook.com
[2] 
https://www.postgresql.org/message-id/CAD21AoDF7LmSALzMfmPshRw_xFcRz3WvB-me8T2gO6Ht%3D3zL2w%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CAA4eK1JqwpsvjhLxV8CMYQ3NrhimZ8AFhWHh0Qn1FrL%3DLXfY6Q%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/CAA4eK1%2B9yXkWkJSNtWYV2rG7QNAnoAt%2BeNH0PexoSP9ZQmXKPg%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.c

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-01-25 Thread Andrei Zubkov
Hi Julien,
On Tue, 2022-01-25 at 18:08 +0800, Julien Rouhaud wrote:
> 
> Are those 4 new counters really worth it?
> 
> The min/max were added to make pg_stat_statements a bit more useful
> if you
> have nothing else using that extension.  But once you setup a tool
> that
> snapshots the metrics regularly, do you really need to know e.g.
> "what was the
> maximum execution time in the last 3 years", which will typically be
> what
> people will retrieve from the non-aux min/max counters, rather than
> simply
> using your additional tool for better and more precise information?

Of course we can replace the old min/max metrics with the new "aux" min/max
metrics. It seems reasonable to me because they will have the same
behavior until we touch reset_aux. I think we can assume that min/max
data is saved somewhere if reset_aux was performed, but what about
backward compatibility?
There may be some monitoring solutions that don't expect min/max
stats to be reset independently of other statement statistics.
It seems highly unlikely to me, because min/max stats for "the last
3 years" are obviously unusable, but maybe someone uses them as a sign of
something?
Do we need to worry about that?

Also, there is one more dramatic consequence of such a decision...
What min/max values should be returned after the auxiliary reset but
before the next statement execution?
NULL values seem reasonable, but there were no NULLs before, so
backward compatibility seems broken. Another approach is to return the
old values of the min/max stats and the old aux_stats_since value until the
next statement execution, but it seems strange that the reset was
performed and yet it doesn't affect any stats instantly.

regards,
Andrei Zubkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






[PATCH] Add TOAST support for several system tables

2022-01-25 Thread Sofia Kopikova

Hi, hackers

ACL lists in tables may potentially be large in size. I suggest adding TOAST 
support for the system tables that include ACL columns, namely pg_class, 
pg_attribute and pg_largeobject_metadata.

In commit 96cdeae0 these tables were missed because of possible conflicts with 
VACUUM FULL and a problem with pg_upgrade seeing a non-empty new cluster. 
The given patch fixes both expected bugs. The patch also adds a new regression 
test checking that TOAST works properly with ACL attributes. The suggested 
feature has been used in Postgres Pro Enterprise for a long time, and it helps 
with large ACLs.
 
The VACUUM FULL bug is detailed here:
 
http://postgr.es/m/CAPpHfdu3PJUzHpQrvJ5RC9bEX_bZ6LwX52kBpb0EiD_9e3Np5g%40mail.gmail.com
 
Looking forward to your comments
 
--
Sofia Kopikova
Postgres Professional:  http://www.postgrespro.com
The Russian Postgres CompanyFrom 0a99d94a1ec83d0621bb9805e87d8953a0ba4000 Mon Sep 17 00:00:00 2001
From: Sofia Kopikova 
Date: Tue, 25 Jan 2022 14:44:14 +0300
Subject: [PATCH] Add TOAST support for several system tables

ACL lists may have large size. Using TOAST for system tables pg_class,
pg_attribute and pg_largeobject_metadata with ACL columns.

Bugs with conflicting VACUUM FULL and TOAST and pg_upgrade seeing
a non-empty new cluster are fixed.
---
 src/backend/access/heap/heapam.c  | 67 +--
 src/backend/access/heap/pruneheap.c   |  4 ++
 src/backend/catalog/catalog.c |  2 +
 src/backend/commands/cluster.c| 66 +++---
 src/bin/pg_upgrade/check.c|  3 +-
 src/include/catalog/pg_attribute.h|  2 +
 src/include/catalog/pg_class.h|  2 +
 src/include/catalog/pg_largeobject_metadata.h |  2 +
 src/include/commands/cluster.h|  1 +
 src/test/regress/expected/misc_sanity.out | 27 +++-
 .../regress/expected/pg_catalog_toast1.out| 25 +++
 src/test/regress/parallel_schedule|  3 +
 src/test/regress/sql/misc_sanity.sql  |  7 +-
 src/test/regress/sql/pg_catalog_toast1.sql| 20 ++
 14 files changed, 178 insertions(+), 53 deletions(-)
 create mode 100644 src/test/regress/expected/pg_catalog_toast1.out
 create mode 100644 src/test/regress/sql/pg_catalog_toast1.sql

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6ec57f3d8b2..e49bf3fcb53 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5958,6 +5958,53 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 	pgstat_count_heap_delete(relation);
 }
 
+/*
+ * Prepare tuple for inplace update.  TOASTed attributes can't be modified
+ * by in-place upgrade.  Simultaneously, new tuple is flatten.  So, we just
+ * replace TOASTed attributes with their values of old tuple.
+ */
+static HeapTuple
+heap_inplace_update_prepare_tuple(Relation relation,
+  HeapTuple oldtup,
+  HeapTuple newtup)
+{
+	TupleDesc	desc = relation->rd_att;
+	HeapTuple	result;
+	Datum	   *oldvals,
+			   *newvals;
+	bool	   *oldnulls,
+			   *newnulls;
+	int			i,
+natts = desc->natts;
+
+	oldvals = (Datum *) palloc(sizeof(Datum) * natts);
+	newvals = (Datum *) palloc(sizeof(Datum) * natts);
+	oldnulls = (bool *) palloc(sizeof(bool) * natts);
+	newnulls = (bool *) palloc(sizeof(bool) * natts);
+
+	heap_deform_tuple(oldtup, desc, oldvals, oldnulls);
+	heap_deform_tuple(newtup, desc, newvals, newnulls);
+
+	for (i = 0; i < natts; i++)
+	{
+		Form_pg_attribute att = &desc->attrs[i];
+
+		if (att->attlen == -1 &&
+			!oldnulls[i] &&
+			VARATT_IS_EXTENDED(oldvals[i]))
+		{
+			Assert(!newnulls[i]);
+			newvals[i] = oldvals[i];
+		}
+	}
+
+	result = heap_form_tuple(desc, newvals, newnulls);
+
+	result->t_self = newtup->t_self;
+
+	return result;
+}
+
 /*
  * heap_inplace_update - update a tuple "in place" (ie, overwrite it)
  *
@@ -5987,6 +6034,8 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	HeapTupleHeader htup;
 	uint32		oldlen;
 	uint32		newlen;
+	HeapTupleData oldtup;
+	HeapTuple	newtup;
 
 	/*
 	 * For now, we don't allow parallel updates.  Unlike a regular update,
@@ -6012,16 +6061,26 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 
 	htup = (HeapTupleHeader) PageGetItem(page, lp);
 
+	oldtup.t_tableOid = RelationGetRelid(relation);
+	oldtup.t_data = htup;
+	oldtup.t_len = ItemIdGetLength(lp);
+	oldtup.t_self = tuple->t_self;
+#ifdef XID_IS_64BIT
+	HeapTupleCopyBaseFromPage(&oldtup, page);
+#endif
+
+	newtup = heap_inplace_update_prepare_tuple(relation, &oldtup, tuple);
+
 	oldlen = ItemIdGetLength(lp) - htup->t_hoff;
-	newlen = tuple->t_len - tuple->t_data->t_hoff;
-	if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
+	newlen = newtup->t_len - newtup->t_data->t_hoff;
+	if (oldlen != newlen || htup->t_hoff != newtup->t_data->t_hoff)
 		elog(ERROR, "wrong tuple length");
 
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
 	memc

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-01-25 Thread Julien Rouhaud
Hi Andrei,

On Tue, Jan 25, 2022 at 02:58:17PM +0300, Andrei Zubkov wrote:
> 
> Of course we can replace old min/max metrics with the new "aux" min/max
> metrics. It seems reasonable to me because they will have the same
> behavior until we touch reset_aux. I think we can assume that min/max
> data is saved somewhere if reset_aux was performed, but how about the
> backward compatibility?
> There may be some monitoring solutions that doesn't expect min/max
> stats reset independently of other statement statistics.
> It seems highly unlikely to me, because the min/max stats for "the last
> 3 years" is obvious unusable but maybe someone uses them as a sign of
> something?

To be honest I don't see how any monitoring solution could make any use of
those fields as-is.  For that reason in PoWA we unfortunately have to entirely
ignore them.  There was a previous attempt to provide a way to reset those
counters only (see [1]), but it was returned with feedback due to lack of TLC
from the author.

> Are we need to worry about that?

I don't think it's a problem, as once you have a solution on top of
pg_stat_statements, you get information an order of magnitude better from that
solution than from pg_stat_statements.  And if that's a problem, well, either
don't reset those counters, or don't use the external solution if it does it
automatically and you're not ok with it.

> Also, there is one more dramatic consequence of such decision...
> What min/max values should be returned after the auxiliary reset but
> before the next statement execution?
> The NULL values seems reasonable but there was not any NULLs before and
> backward compatibility seems broken. Another approach is to return the
> old values of min/max stats and the old aux_stats_since value until the
> next statement execution but it seems strange when the reset was
> performed and it doesn't affected any stats instantly.

If you're worried about some external table having a NOT NULL constraint for
those fields, how about returning NaN instead?  That's a valid value for a
double precision data type.
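
For illustration, filling such a column with NaN on the extension side could
look roughly like this (get_float8_nan() and Float8GetDatum() are existing
backend helpers; the condition and field name are placeholders, not the real
pg_stat_statements code):

/* Sketch only: report NaN instead of NULL for a min/max column that has no
 * data since the last auxiliary reset. */
if (no_data_since_aux_reset)
	values[i++] = Float8GetDatum(get_float8_nan());
else
	values[i++] = Float8GetDatum(min_exec_time);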

[1] https://www.postgresql.org/message-id/1762890.8ARNpCrDLI@peanuts2




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2022-01-25 Thread Julien Rouhaud
Hi,

On Tue, Jan 25, 2022 at 07:21:01PM +0800, Julien Rouhaud wrote:
> > 
> > I'll move entry back to "Ready for Committer" once it passes tests.
> 
> It looks like you didn't fetch the latest upstream commits in a while as this
> version is still conflicting with 7a5f6b474 (Make logical decoding a part of
> the rmgr) from 6 days ago.
> 
> I rebased the pathset in attached v9.  Please double check that I didn't miss
> anything in the rebase.

FTR the cfbot is now happy with this version:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/2947.

I will let you mark the patch as Ready for Committer once you validate that the
rebase was ok.




Re: Use generation context to speed up tuplesorts

2022-01-25 Thread Ronan Dunklau
On Thursday, January 20, 2022 at 08:36:46 CET, Ronan Dunklau wrote:
> On Thursday, January 20, 2022 at 02:31:15 CET, David Rowley wrote:
> > On Tue, 18 Jan 2022 at 19:45, Julien Rouhaud  wrote:
> > > I'm also a bit confused about which patch(es) should actually be
> > > reviewed
> > > in that thread.  My understanding is that the original patch might not
> > > be
> > > relevant anymore but I might be wrong.  The CF entry should probably be
> > > updated to clarify that, with an annotation/ title change / author
> > > update
> > > or something.
> > > 
> > > In the meantime I will switch the entry to Waiting on Author.
> > 
> > I think what's going on here is a bit confusing. There's quite a few
> > patches on here that aim to do very different things.
> > 
> > The patch I originally proposed was just to make tuplesort use
> > generation memory contexts. This helps speed up tuplesort because
> > generation contexts allow palloc() to be faster than the standard
> > allocator. Additionally, there's no power-of-2 wastage with generation
> > contexts like there is with the standard allocator. This helps fit
> > more tuples on fewer CPU cache lines.
> > 
> > I believe Andres then mentioned that the fixed block size for
> > generation contexts might become a problem. With the standard
> > allocator the actual size of the underlying malloc() grows up until a
> > maximum size.  With generation contexts this is always the same, so we
> > could end up with a very large number of blocks which means lots of
> > small mallocs which could end up slow.  Tomas then posted a series of
> > patches to address this.
> > 
> > I then posted another patch that has the planner make a choice on the
> > size of the blocks to use for the generation context based on the
> > tuple width and number of tuple estimates. My hope there was to
> > improve performance further by making a better choice for how large to
> > make the blocks in the first place.  This reduces the need for Tomas'
> > patch, but does not eliminate the need for it.
> > 
> > As of now, I still believe we'll need Tomas' patches to allow the
> > block size to grow up to a maximum size.  I think those patches are
> > likely needed before we think about making tuplesort use generation
> > contexts.  The reason I believe this is that we don't always have good
> > estimates for the number of tuples we're going to sort.  nodeSort.c is
> > fairly easy as it just fetches tuples once and then sorts them.  The
> > use of tuplesort for nodeIncrementalSort.c is much more complex as
> > we'll likely be performing a series of smaller sorts which are much
> > harder to get size estimates for. This means there's likely no magic
> > block size that we can use for the generation context that's used to
> > store the tuples in tuplesort.  This is also the case for order by
> > aggregates and probably most other usages of tuplesort.
> 
> You left out the discussion I started about glibc's malloc specific
> behaviour.
> 
> I tried to demonstrate that a big part of the performance gain you were
> seeing were not in fact due to our allocator by itself, but by the way
> different block sizes allocations interact with this malloc implementation
> and how it handles releasing memory back to the system. I also tried to
> show that we can give DBAs more control over how much memory to "waste" as
> the price for faster allocations.
> 
> I agree that the generation context memory manager is useful in the
> tuplesort context, if only for the fact that we fall back to disk less
> quickly due to the reduced wastage of memory, but I would be wary of
> drawing conclusions too quickly based on a specific malloc implementation
> which shows threshold effects, which are compounded by our growing block
> sizes code in general.
> 
> I have on my TODO list to run the same kind of benchmarks using jemalloc, to
> better isolate the performance gains expected from the generation allocator
> itself from the side effect of avoiding glibc's pattern of releasing memory
> back to the kernel too quickly.
> 

I've run the same 1-to-32-columns sort benchmark, using both glibc malloc and 
jemalloc, with the standard aset memory manager or with David's generation v2 
patch.  To use jemalloc, I just use a LD_PRELOAD env variable before starting 
postgres. 
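
For anyone who wants to reproduce this, a minimal way to do it is something
like the following; the library path is an assumption and varies by
distribution:

    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
        pg_ctl -D "$PGDATA" -l logfile start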

I have not tried to measure jemalloc's memory footprint like I did for glibc's 
malloc in the previous benchmark, as I'm not trying to evaluate jemalloc as a 
glibc's malloc replacement. 

Please find the results attached. As I suspected, most of the gain observed
with David's patch comes from working around glibc's malloc idiosyncrasies, but
the good news is that it also helps (to a lesser degree) with jemalloc.

I'm not sure how that would behave with growing block sizes though: as Tomas'
patch and stress-testing benchmarks showed, we can hit some particular
thresholds (and regressions) in glibc's self-tuning behaviour.

-- 
Ronan Dunklau

bench_jemalloc.ods
Description: applica

Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Peter Eisentraut

On 25.01.22 03:54, Amit Kapila wrote:

I don't think this functionality allows a nonprivileged user to do
anything they couldn't otherwise do.  You can create inconsistent data
in the sense that you can choose not to apply certain replicated data.


I thought this will be the only primary way to skip applying certain
transactions. The other could be via pg_replication_origin_advance().
Or are you talking about the case where we skip applying update/delete
where the corresponding rows are not found?

I see the point that if we can allow the owner to skip applying
updates/deletes in certain cases then probably this should also be
okay. Kindly let us know if you have something else in mind as well?


Let's start this again: The question at hand is whether ALTER 
SUBSCRIPTION ... SKIP should be allowed for subscription owners that are 
not superusers.  The argument raised against that was that this would 
allow the owner to create "inconsistent" data.  But it hasn't been 
explained what that actually means or why it is dangerous.





Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Peter Eisentraut

On 25.01.22 06:18, Amit Kapila wrote:

I think to avoid this we can send a message to clear this (at least to
clear XID in the view) after skipping the xact but there is no
guarantee that it will be received by the stats collector.
Additionally, the worker can periodically (say after every N (100,
500, etc) successful transaction) send a clear message after
successful apply. This will ensure that eventually the error entry
will be cleared.


Well, I think we need *some* solution for now.  We can't leave a footgun 
where you say, "skip transaction 700", somehow transaction 700 doesn't 
happen, the whole thing gets forgotten, but then 3 months later, the 
next transaction 700 mysteriously gets dropped.





Re: refactoring basebackup.c

2022-01-25 Thread Dagfinn Ilmari Mannsåker
"Shinoda, Noriyoshi (PN Japan FSIP)"  writes:

> Hi, 
> Thank you for committing a great feature. I have tested the committed 
> features. 
> The attached small patch fixes the output of the --help message. In the
> previous commit, only gzip and none were output, but in the attached
> patch, client-gzip and server-gzip are added.

I think it would be better to write that as `[{client,server}-]gzip`,
especially as we add more compression algorithms, where it would
presumably become `[{client,server}-]METHOD` (assuming all methods are
supported on both the client and server side).

I also noticed that in the docs, the `client` and `server` are marked up
as replaceable parameters, when they are actually literals, plus the
hyphen is misplaced.  The `--checkpoint` option also has the `fast` and
`spread` literals marked up as parameters.

All of these are fixed in the attached patch.

- ilmari

>From 8e3d191917984a6d17f2c72212d90c96467463b0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 25 Jan 2022 13:04:05 +
Subject: [PATCH] pg_basebackup documentation and help fixes

Don't mark up literals as replaceable parameters and indicate alternatives
correctly with {...|...}.
---
 doc/src/sgml/ref/pg_basebackup.sgml   | 6 +++---
 src/bin/pg_basebackup/pg_basebackup.c | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 1d0df346b9..98c89751b3 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -400,7 +400,7 @@
   -Z level
   -Z method[:level]
   --compress=level
-  --compress=[[{client|server-}]method[:level]
+  --compress=[[{client|server}-]method[:level]
   

 Requests compression of the backup. If client or
@@ -441,8 +441,8 @@
 
 
  
-  -c fast|spread
-  --checkpoint=fast|spread
+  -c {fast|spread}
+  --checkpoint={fast|spread}
   

 Sets checkpoint mode to fast (immediate) or spread (the default)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 72c27c78d0..46f6f53e9b 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -391,7 +391,7 @@ usage(void)
 	printf(_("  -X, --wal-method=none|fetch|stream\n"
 			 " include required WAL files with specified method\n"));
 	printf(_("  -z, --gzip compress tar output\n"));
-	printf(_("  -Z, --compress={gzip,none}[:LEVEL] or [LEVEL]\n"
+	printf(_("  -Z, --compress={[{client,server}-]gzip,none}[:LEVEL] or [LEVEL]\n"
 			 " compress tar output with given compression method or level\n"));
 	printf(_("\nGeneral options:\n"));
 	printf(_("  -c, --checkpoint=fast|spread\n"
-- 
2.30.2



Re: refactoring basebackup.c

2022-01-25 Thread Dagfinn Ilmari Mannsåker
Dagfinn Ilmari Mannsåker  writes:

> "Shinoda, Noriyoshi (PN Japan FSIP)"  writes:
>
>> Hi, 
>> Thank you for committing a great feature. I have tested the committed 
>> features. 
>> The attached small patch fixes the output of the --help message. In the
>> previous commit, only gzip and none were output, but in the attached
>> patch, client-gzip and server-gzip are added.
>
> I think it would be better to write that as `[{client,server}-]gzip`,
> especially as we add more compression algorithms, where it would
> presumably become `[{client,server}-]METHOD` (assuming all methods are
> supported on both the client and server side).
>
> I also noticed that in the docs, the `client` and `server` are marked up
> as replaceable parameters, when they are actually literals, plus the
> hyphen is misplaced.  The `--checkpoint` option also has the `fast` and
> `spread` literals marked up as parameters.
>
> All of these are fixed in the attached patch.

I just noticed there was a superfluous [ in the SGML documentation, and
that the short form was missing the [{client|server}-] part.  Updated
patch attached.

- ilmari

>From 2164f1a9fc97a5f88f57c7cc9cdafa67398dcc0e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Tue, 25 Jan 2022 13:04:05 +
Subject: [PATCH v2] pg_basebackup documentation and help fixes

Don't mark up literals as replaceable parameters and indicate alternatives
correctly with {...|...}, and add missing [{client,server}-] to the
-Z form.
---
 doc/src/sgml/ref/pg_basebackup.sgml   | 8 
 src/bin/pg_basebackup/pg_basebackup.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 1d0df346b9..a5e03d2c66 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -398,9 +398,9 @@
 
  
   -Z level
-  -Z method[:level]
+  -Z [{client|server}-]method[:level]
   --compress=level
-  --compress=[[{client|server-}]method[:level]
+  --compress=[{client|server}-]method[:level]
   

 Requests compression of the backup. If client or
@@ -441,8 +441,8 @@
 
 
  
-  -c fast|spread
-  --checkpoint=fast|spread
+  -c {fast|spread}
+  --checkpoint={fast|spread}
   

 Sets checkpoint mode to fast (immediate) or spread (the default)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 72c27c78d0..46f6f53e9b 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -391,7 +391,7 @@ usage(void)
 	printf(_("  -X, --wal-method=none|fetch|stream\n"
 			 " include required WAL files with specified method\n"));
 	printf(_("  -z, --gzip compress tar output\n"));
-	printf(_("  -Z, --compress={gzip,none}[:LEVEL] or [LEVEL]\n"
+	printf(_("  -Z, --compress={[{client,server}-]gzip,none}[:LEVEL] or [LEVEL]\n"
 			 " compress tar output with given compression method or level\n"));
 	printf(_("\nGeneral options:\n"));
 	printf(_("  -c, --checkpoint=fast|spread\n"
-- 
2.30.2



Re: Document atthasmissing default optimization avoids verification table scan

2022-01-25 Thread James Coleman
On Sat, Jan 22, 2022 at 10:28 AM David G. Johnston
 wrote:
>
>
>
> On Saturday, January 22, 2022, James Coleman  wrote:
>>
>> On Sat, Jan 22, 2022 at 12:35 AM David G. Johnston
>>  wrote:
>> >
>> > On Fri, Jan 21, 2022 at 5:14 PM James Coleman  wrote:
>> >>
>> >>
>> >> > Really?  That's horrid, because that's directly useful advice.
>> >>
>> >> Remedied, but rewritten a bit to better fit with the new style/goal of
>> >> that tip).
>> >>
>> >> Version 3 is attached.
>> >>
>> >
>> > Coming back to this after a respite I think the tip needs to be moved just 
>> > like everything else.  For much the same reason (though this may only be a 
>> > personal bias), I know what SQL Commands do the various things that DDL 
>> > encompasses (especially the basics like adding a column) and so the DDL 
>> > section is really just a tutorial-like chapter that I will generally 
>> > forget about because I will go straight to the official source which is 
>> > the SQL Command Reference.  My future self would want the tip to show up 
>> > there.  If we put the tip after the existing paragraph that starts: 
>> > "Adding a column with a volatile DEFAULT or changing the type of an 
>> > existing column..." the need to specify an example function in the tip 
>> > goes away - though maybe it should be moved to the notes paragraph 
>> > instead: "with a volatile DEFAULT (e.g., clock_timestamp()) or  changing 
>> > the type of an existing column..."
>>
>> In my mind that actually might be a reason to keep it that way. I
>> expect someone who's somewhat experienced to know there are things
>> (like table rewrites and scans) you need to consider and therefore go
>> to the ALTER TABLE page and read the details. But for someone newer
>> the tutorial page needs to introduce them to the idea that those
>> gotchas exist.
>>
>
> Readers of the DDL page are given a hint of the issues and directed to 
> additional, arguably mandatory, reading.  They can not worry about the 
> nuances during their learning phase but instead can defer that reading until 
> they actually have need to alter a (large) table.  But expecting them to read 
> the command reference page is reasonable and is IMO the more probable place 
> they will look when they start doing stuff in earnest.  For the inexperienced 
> reader breaking this up in this manner based upon depth of detail feels right 
> to me.

Here's a version that looks like that. I'm not convinced it's an
improvement over the previous version: again, I expect more advanced
users to already understand this concept, and I think moving it to the
ALTER TABLE page could very well have the effect of burying (amidst
the ton of detail on the ALTER TABLE page) a concept that would be
useful to learn early on in a tutorial like the DDL page. But if
people really think this is an improvement, then I can acquiesce.

Thanks,
James Coleman
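
For context, the case these doc patches are about is the combination of a
fast (non-volatile) default with a NOT NULL constraint in a single ADD COLUMN;
a minimal illustration, with made-up table and column names:

    -- no table rewrite and, per the doc change below, no verification scan either:
    ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'new';
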
From ffca825ca27cffc70c7eb39385545a76fa0d9e2d Mon Sep 17 00:00:00 2001
From: James Coleman 
Date: Fri, 24 Sep 2021 09:59:27 -0400
Subject: [PATCH v4 1/2] Document atthasmissing default avoids verification
 table scan

When PG11 added the ability for ALTER TABLE ADD COLUMN to set a constant
default value without rewriting the table the doc changes did not note
how the new feature interplayed with ADD COLUMN DEFAULT NOT NULL.
Since adding a NOT NULL constraint requires a verification table scan to
ensure no values are null, users want to know that the combined
operation also avoids the table scan.
---
 doc/src/sgml/ref/alter_table.sgml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index a76e2e7322..1dde16fa39 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -1355,7 +1355,9 @@ WITH ( MODULUS numeric_literal, REM
 evaluated at the time of the statement and the result stored in the
 table's metadata.  That value will be used for the column for all existing
 rows.  If no DEFAULT is specified, NULL is used.  In
-neither case is a rewrite of the table required.
+neither case is a rewrite of the table required.  A NOT NULL
+constraint may be added to the new column in the same statement without
+requiring scanning the table to verify the constraint.

 

-- 
2.17.1

From b1019154c749991a3e23cd8e5f82b31acbbdddc9 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Fri, 21 Jan 2022 18:50:39 +
Subject: [PATCH v4 2/2] Don't double document ADD COLUMN optimization details

Instead point people to the source of truth. This also avoids using
different language ("constant" versus "non-volatile").
---
 doc/src/sgml/ddl.sgml | 22 --
 doc/src/sgml/ref/alter_table.sgml | 11 ++-
 2 files changed, 14 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 22f6c5c7ab..0ec1b7cd39 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1406,24 +1406,10 @

Re: Foreign join search stops on the first try

2022-01-25 Thread Ashutosh Bapat
This code was written long ago. So I may have some recollection
errors. But AFAIR, the reasons we wanted to avoid repeated
estimation/planning for the same foreign join rel were
1. If use_remote_estimate = true, we fetch EXPLAIN output from the
foreign server for various pathkeys. Fetching EXPLAIN output is
expensive. Irrespective of the join order being considered locally, we
expect the foreign server to give us the same cost since the join is
the same. So we avoid running EXPLAIN again and again.
2. If use_remote_estimate = false, the logic to estimate a foreign
join locally is independent of the join order so should yield same
cost again and again. For some reason that doesn't seem to be the case
here.
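
For reference, use_remote_estimate is a postgres_fdw option that can be set
per server or per foreign table, e.g.:

    ALTER SERVER foreign_server OPTIONS (ADD use_remote_estimate 'true');
    ALTER FOREIGN TABLE stock OPTIONS (ADD use_remote_estimate 'false');

The server and table names above are only examples.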

On Tue, Jan 25, 2022 at 1:26 PM Alexander Pyhalov
 wrote:

>
> Without patch:
>
> explain analyze verbose SELECT * FROM order_line, stock, district WHERE
> ol_d_id = 1 AND d_id = 1 AND (ol_o_id < d_next_o_id) AND ol_o_id >=
> (d_next_o_id - 20) AND s_i_id = ol_i_id AND s_quantity < 11;
... clipped
> test.stock WHERE ((s_quantity < 11))
>   Planning Time: 1.812 ms
>   Execution Time: 8.534 ms
> (15 rows)
>
> With patch:
>
> explain analyze verbose SELECT * FROM order_line, stock, district WHERE
> ol_d_id = 1 AND d_id = 1 AND (ol_o_id < d_next_o_id) AND ol_o_id >=
> (d_next_o_id - 20) AND s_i_id = ol_i_id AND s_quantity < 11;
>
... clipped
>   Planning Time: 0.928 ms
>   Execution Time: 4.511 ms

It is surprising that the planning time halves with the patch. I
expected it to increase slightly since we will compute estimates
thrice instead of once.

What is use_remote_estimate? Is it ON/OFF?

If we want to proceed along this line, we should take care not to fire
more EXPLAIN queries on the foreign server.

-- 
Best Wishes,
Ashutosh Bapat




Re: fix crash with Python 3.11

2022-01-25 Thread Peter Eisentraut

On 16.01.22 23:53, Tom Lane wrote:

I think a possible fix is:

1. Before entering the PG_TRY block, check for active subtransaction(s)
and immediately throw a Python error if there is one.  (This corresponds
to the existing errors "cannot commit while a subtransaction is active"
and "cannot roll back while a subtransaction is active".  The point is
to reduce the number of system states we have to worry about below.)

2. In the PG_CATCH block, after collecting the error data do
AbortOutOfAnyTransaction();
StartTransactionCommand();
which gets us into a good state with no active subtransactions.

I'm not sure that those two are the best choices of xact.c
entry points, but there's precedent for that in autovacuum.c
among other places.


AFAICT, AbortOutOfAnyTransaction() also aborts subtransactions, so why 
do you suggest the separate handling of subtransactions?






Re: Add connection active, idle time to pg_stat_activity

2022-01-25 Thread Julien Rouhaud
Hi,

On Wed, Jan 12, 2022 at 02:16:35PM +0800, Julien Rouhaud wrote:
> 
> On Mon, Nov 29, 2021 at 11:04 PM Kuntal Ghosh
>  wrote:
> >
> > You also need to update the documentation.
> 
> You also need to update rules.sql: https://cirrus-ci.com/task/6145265819189248

There has been multiple comments in the last two months that weren't addressed
since, and also the patch doesn't pass the regression tests anymore.

Rafia, do you plan to send a new version soon?  Without update in the next few
days this patch will be closed as Returned with Feedback, per the commitfest
rules.




Re: DELETE CASCADE

2022-01-25 Thread Julien Rouhaud
Hi,

On Wed, Jan 12, 2022 at 04:57:27PM +0800, Julien Rouhaud wrote:
> 
> The cfbot reports that this patch doesn't apply anymore:
> http://cfbot.cputube.org/patch_36_3195.log
> 
> > patching file src/backend/utils/adt/ri_triggers.c
> > Hunk #1 succeeded at 93 (offset 3 lines).
> > Hunk #2 FAILED at 181.
> > Hunk #3 succeeded at 556 (offset 5 lines).
> > Hunk #4 succeeded at 581 (offset 5 lines).
> > Hunk #5 succeeded at 755 (offset 5 lines).
> > Hunk #6 succeeded at 776 (offset 5 lines).
> > 1 out of 6 hunks FAILED -- saving rejects to file 
> > src/backend/utils/adt/ri_triggers.c.rej
> 
> Are you currently working on a possibly different approach and/or grammar?  If
> not, could you send a rebased patch?  In the meantime I will switch the cf
> entry to Waiting on Author.

It's been almost 4 months since your last email, and almost 2 weeks since the
notice that this patch doesn't apply anymore.  Without update in the next
couple of days this patch will be closed as Returned with Feedback per the
commitfest rules.




Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread David G. Johnston
On Tue, Jan 25, 2022 at 5:52 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 25.01.22 06:18, Amit Kapila wrote:
> > I think to avoid this we can send a message to clear this (at least to
> > clear XID in the view) after skipping the xact but there is no
> > guarantee that it will be received by the stats collector.
> > Additionally, the worker can periodically (say after every N (100,
> > 500, etc) successful transaction) send a clear message after
> > successful apply. This will ensure that eventually the error entry
> > will be cleared.
>
> Well, I think we need *some* solution for now.  We can't leave a footgun
> where you say, "skip transaction 700", somehow transaction 700 doesn't
> happen, the whole thing gets forgotten, but then 3 months later, the
> next transaction 700 mysteriously gets dropped.
>

This is indeed part of why I feel that the xid being skipped should be
validated.  As the feature is presented the user is supposed to read the
xid from the system (the new stat view or the error log) and supply it and
then the worker, when it goes to skip, should find that the very first
transaction xid it encounters is the one it is being told to skip.  It
skips that transaction, clears the skipxid, and puts the system back into
normal operating mode.  If that first transaction xid isn't the one being
specified to skip the worker should error with "skipping transaction
failed, xid 123 expected but 456 found".
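
Concretely, the workflow being described would be along these lines (the
subscription name is made up, and the syntax is the one under discussion in
this thread):

    -- after reading the failing xid from the error log / stats view:
    ALTER SUBSCRIPTION mysub SKIP (xid = 700);
    -- and to cancel a pending skip before it is consumed:
    ALTER SUBSCRIPTION mysub SKIP (xid = NONE);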

This whole lack of guarantees about the availability and accuracy of the
data that this process relies upon needs to be engineered away.

David J.


Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Masahiko Sawada
On Tue, Jan 25, 2022 at 11:35 PM David G. Johnston
 wrote:
>
> On Tue, Jan 25, 2022 at 5:52 AM Peter Eisentraut 
>  wrote:
>>
>> On 25.01.22 06:18, Amit Kapila wrote:
>> > I think to avoid this we can send a message to clear this (at least to
>> > clear XID in the view) after skipping the xact but there is no
>> > guarantee that it will be received by the stats collector.
>> > Additionally, the worker can periodically (say after every N (100,
>> > 500, etc) successful transaction) send a clear message after
>> > successful apply. This will ensure that eventually the error entry
>> > will be cleared.
>>
>> Well, I think we need *some* solution for now.  We can't leave a footgun
>> where you say, "skip transaction 700", somehow transaction 700 doesn't
>> happen, the whole thing gets forgotten, but then 3 months later, the
>> next transaction 700 mysteriously gets dropped.
>
>
> This is indeed part of why I feel that the xid being skipped should be 
> validated.  As the feature is presented the user is supposed to read the xid 
> from the system (the new stat view or the error log) and supply it and then 
> the worker, when it goes to skip, should find that the very first transaction 
> xid it encounters is the one it is being told to skip.  It skips that 
> transaction, clears the skipxid, and puts the system back into normal 
> operating mode.  If that first transaction xid isn't the one being specified 
> to skip the worker should error with "skipping transaction failed, xid 123 
> expected but 456 found".

Yeah, I think it's a good idea to clear the subskipxid after the first
transaction regardless of whether the worker skipped it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: WIN32 pg_import_system_collations

2022-01-25 Thread Juan José Santamaría Flecha
On Tue, Jan 25, 2022 at 11:40 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 24.01.22 22:23, Dmitry Koval wrote:
>

Thanks for looking into this.


> > +/*
> > + * Windows will use hyphens between language and territory, where POSIX
> > + * uses an underscore. Simply make it POSIX looking.
> > + */
> > + hyphen = strchr(localebuf, '-');
> > + if (hyphen)
> > +*hyphen = '_';
> >
> > After this block modified collation name is used in function
> >
> > GetNLSVersionEx(COMPARE_STRING, wide_collcollate, &version)
> >
> > (see win32_read_locale() -> CollationFromLocale() -> CollationCreate()
> > call). Is it correct to use (wide_collcollate = "en_NZ") instead of
> > (wide_collcollate = "en-NZ") in GetNLSVersionEx() function?
>

The problem that David Rowley addressed was coming from Windows collations
in the shape of "English_New Zealand"; GetNLSVersionEx() will work with
both "en_NZ" and "en-NZ". You can check collversion in pg_collation in the
patched version.

>
> I don't really know if this is necessary anyway.  Just create the
> collations with the names that the operating system presents.  There is
> no requirement to make the names match POSIX.
>
> If you want to make them match POSIX for some reason, you can also just
> change the object name but leave the collcollate/collctype fields the
> way they came from the OS.
>

I think there is some value in making collation names consistent across
different platforms, e.g. making user scripts more portable. So, I'm doing
that in the attached version, just changing the object name.

Regards,

Juan José Santamaría Flecha


v4-0001-WIN32-pg_import_system_collations.patch
Description: Binary data


Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread David G. Johnston
On Tue, Jan 25, 2022 at 7:47 AM Masahiko Sawada 
wrote:

> Yeah, I think it's a good idea to clear the subskipxid after the first
> transaction regardless of whether the worker skipped it.
>
>
So basically instead of stopping the worker with an error you suggest
having the worker continue applying changes (after resetting subskipxid,
and - arguably - the ?_error_* fields).  Log the transaction xid mis-match
as a warning in the log file as opposed to an error.

I was proposing to make it an error and have the worker stop again, since in
a system where the xid is verified and the code is bug-free I would expect
the situation to be a "can't happen" one and I'd rather error in that
circumstance than warn.  The DBA will have to go and ALTER SUBSCRIPTION
SKIP (xid = NONE) to get the worker working again but I find that
acceptable in this case.

David J.


Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Masahiko Sawada
On Tue, Jan 25, 2022 at 11:58 PM David G. Johnston
 wrote:
>
> On Tue, Jan 25, 2022 at 7:47 AM Masahiko Sawada  wrote:
>>
>> Yeah, I think it's a good idea to clear the subskipxid after the first
>> transaction regardless of whether the worker skipped it.
>>
>
> So basically instead of stopping the worker with an error you suggest having 
> the worker continue applying changes (after resetting subskipxid, and - 
> arguably - the ?_error_* fields).  Log the transaction xid mis-match as a 
> warning in the log file as opposed to an error.

Agreed, I think it's better to log a warning than to raise an error.
In the case where the user specified the wrong XID, the worker should
fail again due to the same error.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread David G. Johnston
On Tue, Jan 25, 2022 at 8:09 AM Masahiko Sawada 
wrote:

> On Tue, Jan 25, 2022 at 11:58 PM David G. Johnston
>  wrote:
> >
> > On Tue, Jan 25, 2022 at 7:47 AM Masahiko Sawada 
> wrote:
> >>
> >> Yeah, I think it's a good idea to clear the subskipxid after the first
> >> transaction regardless of whether the worker skipped it.
> >>
> >
> > So basically instead of stopping the worker with an error you suggest
> having the worker continue applying changes (after resetting subskipxid,
> and - arguably - the ?_error_* fields).  Log the transaction xid mis-match
> as a warning in the log file as opposed to an error.
>
> Agreed, I think it's better to log a warning than to raise an error.
> In the case where the user specified the wrong XID, the worker should
> fail again due to the same error.
>
>
If it remains possible for the system to accept a wrongly specified XID I
would agree that this behavior is preferable.  At least when the user
wonders why the skip didn't work and they are seeing the same error again
they will have a log entry warning telling them their XID choice was
incorrect.  I would prefer that the system not accept a wrongly specified
XID and the user be told directly and sooner that their XID choice was
incorrect.

David J.


Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Masahiko Sawada
On Wed, Jan 26, 2022 at 12:14 AM David G. Johnston
 wrote:
>
>
> On Tue, Jan 25, 2022 at 8:09 AM Masahiko Sawada  wrote:
>>
>> On Tue, Jan 25, 2022 at 11:58 PM David G. Johnston
>>  wrote:
>> >
>> > On Tue, Jan 25, 2022 at 7:47 AM Masahiko Sawada  
>> > wrote:
>> >>
>> >> Yeah, I think it's a good idea to clear the subskipxid after the first
>> >> transaction regardless of whether the worker skipped it.
>> >>
>> >
>> > So basically instead of stopping the worker with an error you suggest 
>> > having the worker continue applying changes (after resetting subskipxid, 
>> > and - arguably - the ?_error_* fields).  Log the transaction xid mis-match 
>> > as a warning in the log file as opposed to an error.
>>
>> Agreed, I think it's better to log a warning than to raise an error.
>> In the case where the user specified the wrong XID, the worker should
>> fail again due to the same error.
>>
>
> If it remains possible for the system to accept a wrongly specified XID I 
> would agree that this behavior is preferable.  At least when the user wonders 
> why the skip didn't work and they are seeing the same error again they will 
> have a log entry warning telling them their XID choice was incorrect.

Yes.

>  I would prefer that the system not accept a wrongly specified XID and the 
> user be told directly and sooner that their XID choice was incorrect.

Given that we cannot rely on the pg_stat_subscription_workers view
for this purpose, we would need either a new sub-system that tracks
each logical replication worker's status so the system can set the error
XID as subskipxid, or to wait for the shared-memory based stats collector.
While I agree that ideally we need such a sub-system, I'm not sure that
everyone will agree to add that complexity for this feature. That having
been said, if there is a significant need for it, we can implement it
as an improvement.



Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: fix crash with Python 3.11

2022-01-25 Thread Tom Lane
Peter Eisentraut  writes:
> On 16.01.22 23:53, Tom Lane wrote:
>> I think a possible fix is:
>> 
>> 1. Before entering the PG_TRY block, check for active subtransaction(s)
>> and immediately throw a Python error if there is one.  (This corresponds
>> to the existing errors "cannot commit while a subtransaction is active"
>> and "cannot roll back while a subtransaction is active".  The point is
>> to reduce the number of system states we have to worry about below.)

> AFAICT, AbortOutOfAnyTransaction() also aborts subtransactions, so why 
> do you suggest the separate handling of subtransactions?

We don't want these operations to be able to cancel subtransactions,
do we?  The existing errors certainly suggest not.

regards, tom lane
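
For readers following along, a rough sketch of the control flow being proposed
could look like the following.  This is not the actual PL/Python code;
raise_python_error() is a placeholder for however the error is turned into a
Python exception.

    MemoryContext oldcontext = CurrentMemoryContext;

    /* 1. refuse to operate while a subtransaction is open */
    if (IsSubTransaction())
        raise_python_error("cannot commit while a subtransaction is active");

    PG_TRY();
    {
        SPI_commit();
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        /* 2. get back to a sane state with no (sub)transactions open */
        AbortOutOfAnyTransaction();
        StartTransactionCommand();

        raise_python_error(edata->message);
    }
    PG_END_TRY();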




Re: CREATEROLE and role ownership hierarchies

2022-01-25 Thread Mark Dilger



> On Jan 24, 2022, at 10:55 PM, Fujii Masao  wrote:
> 
> +1
> 
> One of "mischiefs" I'm thinking problematic is that users with CREATEROLE can 
> give any predefined role that they don't have, to other users including 
> themselves. For example, users with CREATEROLE can give 
> pg_execute_server_program to themselves and run any OS commands by COPY 
> PROGRAM. This would be an issue when providing something like PostgreSQL 
> cloud service that wants to prevent end users from running OS commands but 
> allow them to create/drop roles. Does the proposed patch fix also this issue?

Yes, the patch restricts CREATEROLE privilege from granting any privilege they 
themselves lack.  There is a regression test in the patch set which 
demonstrates this.  See src/test/regress/expected/create_role.out.  The diffs 
from v6-0004-Restrict-power-granted-via-CREATEROLE.patch are quoted here for 
ease of viewing:

--- ok, having CREATEROLE is enough to create roles in privileged roles
+-- fail, having CREATEROLE is not enough to create roles in privileged roles
 CREATE ROLE regress_read_all_data IN ROLE pg_read_all_data;
+ERROR:  must have admin option on role "pg_read_all_data"
 CREATE ROLE regress_write_all_data IN ROLE pg_write_all_data;
+ERROR:  must have admin option on role "pg_write_all_data"
 CREATE ROLE regress_monitor IN ROLE pg_monitor;
+ERROR:  must have admin option on role "pg_monitor"
 CREATE ROLE regress_read_all_settings IN ROLE pg_read_all_settings;
+ERROR:  must have admin option on role "pg_read_all_settings"
 CREATE ROLE regress_read_all_stats IN ROLE pg_read_all_stats;
+ERROR:  must have admin option on role "pg_read_all_stats"
 CREATE ROLE regress_stat_scan_tables IN ROLE pg_stat_scan_tables;
+ERROR:  must have admin option on role "pg_stat_scan_tables"
 CREATE ROLE regress_read_server_files IN ROLE pg_read_server_files;
+ERROR:  must have admin option on role "pg_read_server_files"
 CREATE ROLE regress_write_server_files IN ROLE pg_write_server_files;
+ERROR:  must have admin option on role "pg_write_server_files"
 CREATE ROLE regress_execute_server_program IN ROLE pg_execute_server_program;
+ERROR:  must have admin option on role "pg_execute_server_program"
 CREATE ROLE regress_signal_backend IN ROLE pg_signal_backend;
+ERROR:  must have admin option on role "pg_signal_backend"

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: refactoring basebackup.c

2022-01-25 Thread tushar

On 1/22/22 12:03 AM, Robert Haas wrote:

I committed the base backup target patch yesterday, and today I
updated the remaining code in light of Michael Paquier's commit
5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch.

Thanks Robert,  I tested against the latest PG Head and found a few issues -

A) Getting a syntax error if -z is used along with -t

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/data902 -z -Xfetch
pg_basebackup: error: could not initiate base backup: ERROR:  syntax error

OR

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/t2 
--compress=server-gzip:9 -Xfetch -v -z

pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: error: could not initiate base backup: ERROR:  syntax error

B) No information about "client-gzip" or "server-gzip" appears under the
"--compress" option/method in the ./pg_basebackup --help output.


C) The -R option is silently ignored

[edb@centos7tushar bin]$  ./pg_basebackup  -Z 4  -v  -t server:/tmp/pp 
-Xfetch -R

pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/3028 on timeline 1
pg_basebackup: write-ahead log end point: 0/3100
pg_basebackup: base backup completed
[edb@centos7tushar bin]$

Go to the /tmp/pp folder and extract it - there is no "standby.signal" file,
and if we start a cluster against this data directory, it will not be in
standby mode.


If this is not supported, then I think we should throw an error.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-25 Thread Tom Lane
Yura Sokolov  writes:
> В Вт, 25/01/2022 в 21:20 +1300, David Rowley пишет:
>> The reason I didn't think it was worth adding a new test was that no
>> tests were added in the original commit.  Existing tests did cover it,

> Existing tests didn't catch the issue. It is a pity the fix was merged
> without the test case it fixes.

I share David's skepticism about the value of a test case.  The
failure mode that seems likely to me is some other code path making
the same mistake, which a predetermined test would not catch.

Therefore, what I think could be useful is some very-late-stage
assertion check (probably in createplan.c) verifying that the
child of a Gather is parallel-aware.  Or maybe the condition
needs to be more general than that, but anyway the idea is for
the back end of the planner to verify that we didn't build a
silly plan.

regards, tom lane
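
One possible shape for such a check, purely as a sketch (the helper and its
placement are assumptions, not existing code, and subplans/initplans are
ignored for brevity):

    static bool
    plan_has_parallel_aware(Plan *plan)
    {
        if (plan == NULL)
            return false;
        if (plan->parallel_aware)
            return true;
        return plan_has_parallel_aware(plan->lefttree) ||
               plan_has_parallel_aware(plan->righttree);
    }

    /* ... and in create_gather_plan(), once the child plan is built ... */
    Assert(gather_plan->single_copy ||
           plan_has_parallel_aware(outerPlan(gather_plan)));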




Re: Foreign join search stops on the first try

2022-01-25 Thread Alexander Pyhalov

Ashutosh Bapat wrote 2022-01-25 17:08:

This code was written long ago. So I may have some recollection
errors. But AFAIR, the reasons we wanted to avoid repeated
estimation/planning for the same foreign join rel were
1. If use_remote_estimate = true, we fetch EXPLAIN output from the
foreign server for various pathkeys. Fetching EXPLAIN output is
expensive. Irrespective of the join order being considered locally, we
expect the foreign server to give us the same cost since the join is
the same. So we avoid running EXPLAIN again and again.
2. If use_remote_estimate = false, the logic to estimate a foreign
join locally is independent of the join order so should yield same
cost again and again. For some reason that doesn't seem to be the case
here.


Hi.
use_remote_estimate was set to false in our case, and yes, it fixed this 
issue.
The problem is that if use_remote_estimate = false, the logic to
estimate a foreign join locally is not independent of the join order.

In the above example, without the patch we see a plan with cost:
cost=382.31..966.86 rows=2 width=37


If we avoid exiting on (joinrel->fdw_private), we can see in gdb the 
following cases, when joining all 3 relations:


case 1:
outerrel:relids (stock, order_line), startup_cost = 100, total_cost = 
2415.92001, rel_startup_cost = 0, rel_total_cost = 2315.5, 
retrieved_rows = 21
innerrel: relid (district) startup_cost = 100, total_cost = 
101.145001, rel_startup_cost = 0, rel_total_cost = 1.125, 
retrieved_rows = 1

joinrel: startup_cost = 100, total_cost = 2416.875, retrieved_rows = 2

case 2:
outerrel: relids (district, order_line), startup_cost = 100, total_cost 
= 281.419996, rel_total_cost = 180, retrieved_rows = 71
innerrel: relid (stock), startup_cost = 100, total_cost = 
683.285008, rel_startup_cost = 0, rel_total_cost = 576.625, 
retrieved_rows = 333

joinrel:  startup_cost = 100, total_cost = 974.88, retrieved_rows = 2


So, (stock join order_line) join district has different cost from 
(district join order_line) join stock.




On Tue, Jan 25, 2022 at 1:26 PM Alexander Pyhalov
 wrote:


It is surprising that the planning time halves with the patch. I
expected it to increase slightly since we will compute estimates
thrice instead of once.


I wouldn't look at the estimated planning times here too precisely (and would
look at costs instead). The real example where we found this had 100 times
more data, but the effect was the same. Here, some differences in planning
time could be related to restarting instances with or without the patches.




What is use_remote_estimate? Is it ON/OFF?



Yes, it was off.


If we want to proceed along this line, we should take care not to fire
more EXPLAIN queries on the foreign server.


You are correct. I fixed the patch to avoid the extensive join search when
use_remote_estimate is true.


--
Best regards,
Alexander Pyhalov,
Postgres Professional
From 732e89ef198c0bad713b1a18446902b0132aa72c Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Mon, 24 Jan 2022 18:28:12 +0300
Subject: [PATCH] Look through all possible foreign join orders

---
 contrib/postgres_fdw/postgres_fdw.c | 50 +
 1 file changed, 44 insertions(+), 6 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index bf3f3d9e26e..703e5df1753 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5950,7 +5950,8 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 			JoinType jointype,
 			JoinPathExtraData *extra)
 {
-	PgFdwRelationInfo *fpinfo;
+	PgFdwRelationInfo *fpinfo,
+			   *oldfpinfo;
 	ForeignPath *joinpath;
 	double		rows;
 	int			width;
@@ -5960,9 +5961,11 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
  * EvalPlanQual gets triggered. */
 
 	/*
-	 * Skip if this join combination has been considered already.
+	 * Skip if this join combination has been considered already and rejected
+	 * or if this join uses remote estimates.
 	 */
-	if (joinrel->fdw_private)
+	oldfpinfo = (PgFdwRelationInfo *) joinrel->fdw_private;
+	if (oldfpinfo && (!oldfpinfo->pushdown_safe || oldfpinfo->use_remote_estimate))
 		return;
 
 	/*
@@ -6002,6 +6005,12 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 		epq_path = GetExistingLocalJoinPath(joinrel);
 		if (!epq_path)
 		{
+			if (oldfpinfo)
+			{
+joinrel->fdw_private = oldfpinfo;
+pfree(fpinfo);
+			}
+
 			elog(DEBUG3, "could not push down foreign join because a local path suitable for EPQ checks was not found");
 			return;
 		}
@@ -6011,6 +6020,12 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 
 	if (!foreign_join_ok(root, joinrel, jointype, outerrel, innerrel, extra))
 	{
+		if (oldfpinfo)
+		{
+			joinrel->fdw_private = oldfpinfo;
+			pfree(fpinfo);
+		}
+
 		/* Free path required for EPQ if we copied one; we don't need it now */
 		if (epq_path)
 			pfree(epq_path);
@@ -6044,14 +6059,37 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* Estimate costs for bare join

Re: CREATEROLE and role ownership hierarchies

2022-01-25 Thread Mark Dilger



> On Jan 24, 2022, at 2:21 PM, Stephen Frost  wrote:
> 
> To push back on the original “tenant” argument, consider that one of the 
> bigger issues in cloud computing today is exactly the problem that the cloud 
> managers can potentially gain access to the sensitive data of their tenants 
> and that’s not generally viewed as a positive thing.

+1.  This is a real problem.  I have been viewing this problem as separate from 
the one which role ownership is intended to fix.  Do you have a suggestion 
about how to tackle the problems together with less work than tackling them 
separately?

>  This change would make it so that every landlord can go and SELECT from the 
> tables of their tenants without so much as a by-your-leave.

I would expect that is already true.  A user with CREATEROLE can do almost 
everything.  This patch closes some CREATEROLE related security problems, but 
not this one you mention.

>  The tenants likely don’t like that idea

+1

> , and almost as likely the landlords in many cases aren’t thrilled with it 
> either.

+1

>  Should the landlords be able to DROP the tenant due to the tenant not paying 
> their bill?  Of course, and that should then eliminate the tenant’s tables 
> and other objects which take up resources, but that’s not the same thing as 
> saying that a landlord should be able to unlock a tenant’s old phone that 
> they left behind (and yeah, maybe the analogy falls apart a bit there, but 
> the point I’m trying to get at is that it’s not as simple as it’s being made 
> out to be here and we should think about these things and not just implicitly 
> grant all access to the owner because that’s an easy thing to do- and is 
> exactly what viewing owners as “mini superusers” does and leads to many of 
> the same issues we already have with superusers).

This is a pretty interesting argument.  I don't believe it will work to do as 
you say unconditionally, as there is still a need to have CREATEROLE users who 
have privileges on their created roles' objects, even if for no other purpose 
than to be able to REASSIGN OWNED BY those objects before dropping roles.  But 
maybe there is also a need to have CREATEROLE users who lack that privilege?  
Would that be a privilege bit akin to (but not the same as!) the INHERIT 
privilege?  Should I redesign for something like that?

I like that the current patch restricts CREATEROLE users from granting 
privileges they themselves lack.  Would such a new privilege bit work the same 
way?  Imagine that you, "stephen", have CREATEROLE but not this new bit, and 
you create me, "mark" as a tenant with CREATEROLE.  Can you give me the bit?  
Or does the fact that you lack the bit mean you can't give it to me, either?

Other suggestions?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_upgrade should truncate/remove its logs before running

2022-01-25 Thread Justin Pryzby
On Mon, Jan 24, 2022 at 10:59:40AM +0900, Michael Paquier wrote:
> On Thu, Jan 20, 2022 at 07:51:37PM +0900, Michael Paquier wrote:
> > Neat idea.  That would work fine for my case.  So I am fine to stick
> > with this suggestion. 
> 
> I have been looking at this idea, and the result is quite nice, being
> simpler than anything that has been proposed on this thread yet.  We
> get a simpler removal logic, and there is no need to perform any kind
> of sanity checks with the output path provided as long as we generate
> the paths and the dirs after adjust_data_dir().
> 
> Thoughts?

Andrew: you wanted to accommodate any change on the build client, right ?

-- 
Justin




Re: Collecting statistics about contents of JSONB columns

2022-01-25 Thread Mahendra Singh Thalor
On Tue, 25 Jan 2022 at 03:50, Tomas Vondra 
wrote:
>
> On 1/23/22 01:24, Nikita Glukhov wrote:
> > Hi!
> >
> > I am glad that you found my very old patch interesting and started to
> > work on it.  We failed to post it in 2016 mostly because we were not
> > satisfied with JSONB storage.  Also we decided to wait for completion
> > of work on extended statistics as we thought that it could help us.
> > But in early 2017 we switched to SQL/JSON and forgot about this patch.
> >
>
> Understood. Let's see how feasible this idea is and if we can move this
> forward.
>
> >
> > I think custom datatype is necessary for better performance. With a
> > plain JSONB we need to do a lot of work for extraction of path stats:
> >   - iterate through MCV/Histogram JSONB arrays
> >   - cast numeric values to float, string to text etc.
> >   - build PG arrays from extracted datums
> >   - form pg_statistic tuple.
> >
> > With a custom data type we could store pg_statistic tuple unmodified
> > and use it without any overhead.  But then we need modify a bit
> > VariableStatData and several functions to pass additional nullfrac
> > corrections.
> >
>
> I'm not against evaluating/exploring alternative storage formats, but my
> feeling is the real impact on performance will be fairly low. At least I
> haven't seen this as very expensive while profiling the patch so far. Of
> course, I may be wrong, and it may be more significant in some cases.
>
> > Maybe simple record type (path text, data pg_statistic, ext_data jsonb)
> > would be enough.
> >
> Maybe, but then you still need to store a bunch of those, right? So
> either an array (likely toasted) or a 1:M table. I'm not sure it's going
> to be much cheaper than JSONB.
>
> I'd suggest we focus on what we need to store first, which seems like
> the primary question, and worry about the exact storage format then.
>
>
> > Also there is an idea to store per-path separately in pg_statistic_ext
> > rows using expression like (jb #> '{foo,bar}') as stxexprs.  This could
> > also help user to select which paths to analyze simply by using some
> > sort of CREATE STATISTICS.  But it is really unclear how to:
> >   * store pg_statistic_ext rows from typanalyze
> >   * form path expressions for array elements (maybe add new jsonpath
> > operator)
> >   * match various -> operator chains to stxexprs
> >   * jsonpath-like languages need simple API for searching by stxexprs
> >
>
> Sure, you can do statistics on expressions, right? Of course, if that
> expression produces JSONB value, that's not very useful at the moment.
> Maybe we could have two typanalyze functions - one for regular analyze,
> one for extended statistics?
>
> That being said, I'm not sure extended stats are a good match for this.
> My feeling was we should collect these stats for all JSONB columns,
> which is why I argued for putting that in pg_statistic.
>
> >
> >
> > Per-path statistics should only be collected for scalars.  This can be
> > enabled by flag JsonAnalyzeContext.scalarsOnly.  But there are is a
> > problem: computed MCVs and histograms will be broken and we will not be
> > able to use them for queries like (jb > const) in general case.  Also
> > we will not be and use internally in scalarineqsel() and var_eq_const()
> > (see jsonSelectivity()).  Instead, we will have to implement own
> > estimator functions for JSONB comparison operators that will correctly
> > use our hacked MCVs and histograms (and maybe not all cases will be
> > supported; for example, comparison to scalars only).
> >
>
> Yeah, but maybe that's an acceptable trade-off? I mean, if we can
> improve estimates for most clauses, and there's some number of clauses
> that are estimated just like without stats, that's still an improvement,
> right?
>
> > It's possible to replace objects and arrays with empty ones when
> > scalarsOnly is set to keep correct frequencies of non-scalars.
> > But there is an old bug in JSONB comparison: empty arrays are placed
> > before other values in the JSONB sort order, although they should go
> > after all scalars.  So we need also to distinguish empty and non-empty
> > arrays here.
> >
>
> Hmmm ...
> >
> >
> > I tried to fix a major part of places marked as XXX and FIXME, the new
> > version of the patches is attached.  There are a lot of changes, you
> > can see them in a step-by-step form in the corresponding branch
> > jsonb_stats_20220122 in our GitHub repo [1].
> >
>
> Thanks! I'll go through the changes soon.
>
>

Thanks, Nikita and Tomas for these patches.

For the last few days, I was trying to understand these patches, and based
on Tomas's suggestion, I was doing some performance tests.

With the attached .SQL file, I can see that analyze is taking more time
with these patches.

*Setup: *
autovacuum=off
rest all are default settings.

Insert attached file with and without the patch to compare the time taken
by analyze.

*With json patches:*
postgres=# analyze test ;
ANALYZE
Time: *28464.062 ms (00:28.

Re: Replace uses of deprecated Python module distutils.sysconfig

2022-01-25 Thread Tom Lane
I wrote:
> It's a little bit too soon to decide that INCLUDEPY is reliably equal
> to that, but if it still looks that way tomorrow, I'll be satisfied.

As of now, 92 buildfarm animals have reported results from f032f63e7.
Every single one of them reports that all the different methods you
tested give the same answer.  So it looks to me like we should just
go with get_config_var('INCLUDEPY') and be happy.

I guess next steps are to revert f032f63e7 and then retry e0e567a10
with that change.  Who's going to do the honors?

regards, tom lane
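
For reference, that value can be checked from a shell with:

    python3 -c "import sysconfig; print(sysconfig.get_config_var('INCLUDEPY'))"

which prints something like /usr/include/python3.10, depending on the
installation.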




Re: Correct error message for end-of-recovery record TLI

2022-01-25 Thread Bossart, Nathan
On 1/24/22, 8:42 PM, "Amul Sul"  wrote:
> On Tue, Jan 25, 2022 at 10:08 AM Michael Paquier  wrote:
>>
>> On Wed, Dec 01, 2021 at 07:09:34PM +, Bossart, Nathan wrote:
>> > The patch no longer applied, so I went ahead and rebased it.
>>
>> This was on the CF stack for some time, so applied.  I have also
>> changed the messages produced for the shutdown and online checkpoint
>> records as they used the same messages so as one can get more
>> context depending on the record types.
>
> A ton of thanks for the improvement & the commit.

+1, thanks!

Nathan



Re: GUC flags

2022-01-25 Thread Justin Pryzby
On Tue, Jan 25, 2022 at 11:47:14AM +0100, Peter Eisentraut wrote:
> On 25.01.22 02:07, Justin Pryzby wrote:
> > +CREATE TABLE pg_settings_flags AS SELECT name, category,
> > +   'NO_SHOW_ALL'   =ANY(flags) AS no_show_all,
> > +   'NO_RESET_ALL'  =ANY(flags) AS no_reset_all,
> > +   'NOT_IN_SAMPLE' =ANY(flags) AS not_in_sample,
> > +   'EXPLAIN'   =ANY(flags) AS guc_explain,
> > +   'COMPUTED'  =ANY(flags) AS guc_computed
> > +   FROM pg_show_all_settings();
> 
> Does this stuff have any value for users?  I'm worried we are exposing a
> bunch of stuff that is really just for internal purposes.  Like, what value
> does showing "not_in_sample" have?  On the other hand, "guc_explain" might
> be genuinely useful, since that is part of a user-facing feature.  (I don't
> like the "guc_*" naming though.)
> 
> Your patch doesn't contain a documentation change, so I don't know how and
> to what extent this is supposed to be presented to users.

I want to avoid putting this in pg_settings.

The two options discussed so far are:
 - to add a function to return the flags;
 - to add the flags to pg_show_all_settings(), but not show it in pg_settings 
view;

I interpreted Michael's suggestion as adding it to pg_get_all_settings(), but
*not* including it in the pg_settings view.  Now it seems like I misunderstood,
and Michael wants to add it to the view.

But, even if we only handle the 5 flags we have an immediate use for, it makes
the user-facing view too "wide", just to accommodate this internal use.

If it were in the pg_settings view, I think it ought to have *all* the flags
(not just the flags that help us to retire ./check_guc).  That's much too much.
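
For concreteness, option one above could end up looking something like this;
the function name and return type are hypothetical:

    -- return the flags of a single GUC as a text array
    SELECT pg_settings_get_flags('work_mem');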

-- 
Justin




Re: pgsql: Server-side gzip compression.

2022-01-25 Thread Robert Haas
On Tue, Jan 25, 2022 at 4:20 AM David Rowley  wrote:
> On Tue, 25 Jan 2022 at 09:14, Robert Haas  wrote:
> > src/backend/replication/basebackup_gzip.c | 309 
> > ++
>
> This could do with the attached.  MSVC compilers need a bit more
> reassurance that ereport/elog ERRORs don't return.

Err, well, if we need it, we need it. It surprises me, though:
wouldn't this same consideration apply to a very large number of other
places in the code base?
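
The usual idiom for this elsewhere in the tree is a dummy return, or
pg_unreachable(), after the elog/ereport call; a hypothetical fragment,
not the actual hunk from the attached patch:

    static int
    check_compress_level(int level)
    {
        if (level < 1 || level > 9)
        {
            elog(ERROR, "unexpected compression level %d", level);
            return 0;   /* not reached; keeps MSVC from warning */
        }
        return level;
    }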

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-decimal integer literals

2022-01-25 Thread Robert Haas
On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut
 wrote:
> On 24.01.22 19:53, Robert Haas wrote:
> > On Mon, Jan 24, 2022 at 3:41 AM Peter Eisentraut
> >  wrote:
> >> Rebased patch set
> >
> > What if someone finds this new behavior too permissive?
>
> Which part exactly?  There are several different changes proposed here.

I was just going based on the description of the feature in your
original post. If someone is hoping that int4in() will accept only
^\d+$ then they will be disappointed by this patch.

Maybe nobody is hoping that, though.
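
If I'm reading the proposal right, the kind of input that would newly be
accepted is, for example:

    SELECT int4 '0x2a';   -- presumably accepted with the patch, yielding 42
    SELECT 1_000_000;     -- underscore-separated literal from the lexer changes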

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Non-decimal integer literals

2022-01-25 Thread Alvaro Herrera
On 2022-Jan-24, Peter Eisentraut wrote:

> +decinteger   {decdigit}(_?{decdigit})*
> +hexinteger   0[xX](_?{hexdigit})+
> +octinteger   0[oO](_?{octdigit})+
> +bininteger   0[bB](_?{bindigit})+

I think there should be test cases for literals that these seemingly
strange expressions reject, which are a number with trailing _ (0x123_),
and one with consecutive underscores __ (0x12__34).
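
Something like this, under the proposed grammar, where the first two should
parse and the last two should be rejected:

    SELECT 0x1234;      -- ok
    SELECT 0x12_34;     -- ok, a single separating underscore
    SELECT 0x123_;      -- error: trailing underscore
    SELECT 0x12__34;    -- error: consecutive underscores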

I like the idea of these literals.  I would have found them useful on
many occasions.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: CREATEROLE and role ownership hierarchies

2022-01-25 Thread Robert Haas
On Mon, Jan 24, 2022 at 5:21 PM Stephen Frost  wrote:
> This is an argument to drop the role ownership concept, as I view it.  
> Privileges are driven by membership today and inventing some new independent 
> way to do that is increasing confusion, not improving things.  I disagree 
> that adding role ownership should necessarily change how the regular GRANT 
> privilege system works or throw away basic concepts of that system which have 
> been in place for decades.  Increasing the number of independent ways to 
> answer the question of “what users have what rights on object X” is an active 
> bad thing.  Anything that cares about object access will now also have to 
> address role ownership to answer that question, while if we don’t include 
> this one change then they don’t need to directly have any concern for 
> ownership because regular object privileges still work the same way they did 
> before.

It really feels to me like you just keep moving the goalposts. We
started out with a conversation where Mark said he'd like to be able
to grant permissions on GUCs to non-superusers.[1] You argued
repeatedly that we really needed to do something about CREATEROLE
[2,3,4]. Mark argued that this was an unrelated problem[5] but you
argued that unless it were addressed, users would still be able to
break out of the sandbox[6] which must mean either the OS user, or at
least PostgreSQL users other than the ones they were supposed to be
able to control.

That led *directly* to the patch at hand, which solves the problem by
inventing the notion of role ownership, so that you can distinguish
the roles you can administer from the ones you drop. You are now
proposing that we get rid of that concept, a concept that was added
four months ago[7] as a direct response to your previous feedback.
It's completely unfair to make an argument that results in the
addition of a complex piece of machinery to a body of work that was
initially on an only marginally related topic and then turn around and
argue, quite close to the end of the release cycle, for the removal of
that exact same mechanism.

And your argument about whether the privileges should be able to be
exercised without SET ROLE is also just completely baffling to me
given the previous conversation. It seems 100% clear from the previous
discussion that we were talking about service provider environments
and trying to deliver a good user experience to "lead tenants" in such
environments. Regardless of the technical details of how INHERIT or
anything else work, an actual superuser would not be subject to a
restriction similar to the one you're talking about, so arguing that
it ought to be present here for some technical reason is placing
technicalities ahead of what seemed at the time to be a shared goal.
There's a perfectly good argument to be made that the superuser role
should not work the way it does, but it's too late to relitigate that.
And I can't imagine why any service provider would find any value in a
new role that requires all of the extra push-ups you're trying to
impose on it.

I just can't shake the feeling that you're trying to redesign this
patch out of (a) getting committed and (b) solving any of the problems
it intends to solve, problems with which you largely seemed to agree.
I assume that is not actually your intention, but I can't think of
anything you'd be doing differently here if it were.

[1] 
https://www.postgresql.org/message-id/F9408A5A-B20B-42D2-9E7F-49CD3D1547BC%40enterprisedb.com
[2] 
https://www.postgresql.org/message-id/20210726200542.GX20766%40tamriel.snowman.net
[3] 
https://www.postgresql.org/message-id/20210726205433.GA20766%40tamriel.snowman.net
[4] 
https://www.postgresql.org/message-id/20210823181351.GB17906%40tamriel.snowman.net
[5] 
https://www.postgresql.org/message-id/92AA9A52-A644-42FE-B699-8ECAEE12E635%40enterprisedb.com
[6] 
https://www.postgresql.org/message-id/20210823195130.GF17906%40tamriel.snowman.net
[7] 
https://www.postgresql.org/message-id/67BB2F92-704B-415C-8D47-149327CA8F4B%40enterprisedb.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2022-01-25 Thread Justin Pryzby
On Sun, Jan 02, 2022 at 01:07:29PM +0100, Fabien COELHO wrote:
> One liner doc improvement to tell that creation time is only available on 
> windows.
> It is indeed not available on Linux.

The change is about the "isflag" flag, not creation time.

 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).

> # part 03
> ISTM that the "if (1) { if (2) continue; } else if(3) { if (4) continue; }" 
> structure"
> may be simplified to "if (1 && 2) continue; if (3 && 4) continue;", at least 
> if
> IS_DIR and IS_REG are incompatible?

No, what you suggested is not the same;

We talked about this before:
https://www.postgresql.org/message-id/20200315212729.gc26...@telsasoft.com

> Otherwise, at least "else if (3 & 4) continue"?

I could write the *final* "else if" like that, but then it would be different
from the previous case.  Which would be confusing and prone to mistakes.

If I wrote it like this, I think it'd just provoke suggestions from someone
else to change it differently:

/* Skip dirs or special files? */
if (S_ISDIR(attrib.st_mode) && !(flags & LS_DIR_SKIP_DIRS))
continue;
if (!S_ISDIR(attrib.st_mode) && !S_ISREG(attrib.st_mode) && !(flags & LS_DIR_SKIP_SPECIAL))
continue;

...
<< Why don't you use "else if" instead of "if (a){} if (!a && b){}" >>

I'm going to leave it up to a committer.

> The ifdef WIN32 (which probably detects windows 64 bits…) overwrites 
> values[3]. ISTM
> it could be reordered so that there is no overwrite, and simpler single 
> assignments.
> 
>   #ifndef WIN32
> v = ...;
>   #else
> v = ... ? ... : ...;
>   #endif

I changed this but without using nested conditionals.

> Add a new "isdir" column to "pg_ls_tmpdir" output.  This is a small behavioral
> change.  I'm ok with it, however I'm unsure why we would not jump directly to
> the "type" char column done later in the patch series.

Because that depends on lstat().

> ISTM all such functions
> should be extended the same way for better homogeneity? That would also impact
> "waldir", "archive_status", "logical_*", "replslot" variants. "make check" ok.

I agree that makes sense, however others have expressed the opposite opinion.
https://www.postgresql.org/message-id/CALj2ACWtrt5EkHrY4WAZ4Cv42SidXAwpeQJU021bxaKpjmbGfA%40mail.gmail.com

The original motive for the patch was that pg_ls_tmpdir doesn't show shared
filesets.  This fixes that essential problem without immediately dragging
everything else along.  I think it's more likely that a committer would merge
them both.  But I don't know, and it's easy to combine patches if desired.

> This patch applies my previous advice:-) ISTM that parts 4 and 5 should be one
> single patch. The test changes show that only waldir has a test. Would it be
> possible to add minimal tests to other variants as well?  "make check" ok.

I have added tests, although some are duplicative.

> This part extends and adds a test for pg_ls_logdir. ISTM that it should
> be merged with the previous patches.  "make check" is ok.

It's separate to allow writing a separate commit message since it does
something unrelated to the other patches.  What other patch would it be
merged with?
| v32-0006-pg_ls_logdir-to-ignore-error-if-initial-top-dir-.patch 

> ISTM that the documentation should be clear about windows vs unix/cygwin 
> specific
> data provided (change/creation).

I preferred to refer to pg_stat_file rather than repeat it for all 7 functions
currently in v15, (and future functions added for new, toplevel dirs).

> # part 11
> 
> This part adds a recurse option. Why not. However, the true value does not
> seem to be tested? "make check" is ok.

WDYM the true value?  It's tested like:

+-- Exercise recursion
+select path, filename, type from pg_ls_dir_metadata('.', true, false, true) 
where
+path in ('base', 'base/pgsql_tmp', 'global', 'global/pg_control', 
'global/pg_filenode.map', 'PG_VERSION', 'pg_multixact', 'pg_multixact/members', 
'pg_multixact/offsets', 'pg_wal', 'pg_wal/archive_status')
+-- (type='d' or 
path~'^(global/.*|PG_VERSION|postmaster\.opts|postmaster\.pid|pg_logical/replorigin_checkpoint)$')
 and filename!~'[0-9]'
+order by path collate "C", filename collate "C";
+  path  |filename | type 
++-+--
+ PG_VERSION | PG_VERSION  | -
+ base   | base| d
+ base/pgsql_tmp | pgsql_tmp   | d
...

-- 
Justin
>From 366edbf62b85a5471f5c3ebeb28a45f2692f9815 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Mar 2020 14:12:55 -0500
Subject: [PATCH v33 01/11] Document historic behavior of links to
 direct

Re: autovacuum prioritization

2022-01-25 Thread Robert Haas
On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar  wrote:
> I think we need some more parameters to compare bloat vs wraparound.
> I mean in one of your examples in the 2nd paragraph we can say that
> the need-to-start of table A is earlier than table B so it's kind of
> simple.  But when it comes to wraparound vs bloat we need to add some
> weightage to compute how much bloat is considered as bad as
> wraparound.  I think the amount of bloat can not be an absolute number
> but it should be relative w.r.t the total database size or so.  I
> don't think it can be computed w.r.t to the table size because if the
> table is e.g. just 1 GB size and it is 5 times bloated then it is not
> as bad as another 1 TB table which is just 2 times bloated.

Thanks for writing back.

I don't think that I believe the last part of this argument, because
it seems to suppose that the big problem with bloat is that it might
use up disk space, whereas in my experience the big problem with bloat
is that it slows down access to your data. Yet the dead space in some
other table will not have much impact on the speed of access to the
current table. In fact, if most accesses to the table are index scans,
even dead space in the current table may not have much effect, but
sequential scans are bound to notice. It's true that, on a
cluster-wide basis, every dead page is one more page that can
potentially take up space in cache, so in that sense the performance
consequences are global to the whole cluster. However, that effect is
more indirect and takes a long time to become a big problem. The
direct effect of having to read more pages to execute the same query
plan causes problems a lot sooner.

But your broader point that we need to consider how much bloat
represents a problem is a really good one. In the past, one rule that
I've thought about is: if we're vacuuming a table and we're not going
to finish before it needs to be vacuumed again, then we should vacuum
faster (i.e. in effect, increase the cost limit on the fly). That
might still not result in good behavior, but it would at least result
in behavior that is less bad. However, it doesn't really answer the
question of how we decide when to start the very first VACUUM. I don't
really know the answer to that question. The current heuristics result
in estimates of acceptable bloat that are too high in some cases and
too low in others. I've seen tables that got bloated vastly beyond
what autovacuum is configured to tolerate before they caused any real
difficulty, and I know there are other cases where users start to
suffer long before those thresholds are reached.

At the moment, the best idea I have is to use something like the
current algorithm, but treat it as a deadline (keep bloat below this
amount) rather than an initiation criterion (start when you reach this
amount).  But I think that idea is a bit weak; maybe there's something
better out there.

> I think we should be thinking of dynamically adjusting priority as
> well.  Because it is possible that when autovacuum started we
> prioritize the table based on some statistics and estimation but
> vacuuming process can take long time and during that some priority
> might change so during the start of the autovacuum if we push all
> table to some priority queue and simply vacuum in that order then we
> might go wrong somewhere.

Yep. I think we should reassess what to do next after each table.
Possibly making some exception for really small tables - e.g. if we
last recomputed priorities less than 1 minute ago, don't do it again.

> I think we need to make different priority
> queues based on different factors, for example 1 queue for wraparound
> risk and another for bloat risk.

I don't see why we want multiple queues. We have to answer the
question "what should we do next?" which requires us, in some way, to
funnel everything into a single prioritization.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: make MaxBackends available in _PG_init

2022-01-25 Thread Bossart, Nathan
On 1/25/22, 12:01 AM, "Michael Paquier"  wrote:
> So, where are we on this patch?  It looks like there is an agreement
> that MaxBackends is used widely enough that it justifies the use of a
> separate function to set and get a better value computed.  There may
> be other parameters that could use a brush up, but most known cases
> would be addressed here.  v4 looks rather straight-forward, at quick
> glance.

I think the patch is in decent shape.  There may be a few remaining
places where GetMaxBackends() is called repeatedly in the same
function, but IIRC v4 already clears up the obvious ones.  I don't
know if this is worth worrying about too much, but I can create a new
version if you think it is important.

Nathan



Re: Collecting statistics about contents of JSONB columns

2022-01-25 Thread Greg Stark
On Thu, 6 Jan 2022 at 14:56, Tomas Vondra  wrote:
>
>
> Not sure I understand. I wasn't suggesting any user-defined filtering,
> but something done by default, similarly to what we do for regular MCV
> lists, based on frequency. We'd include frequent paths while excluding
> rare ones.
>
> So no need for a user interface.

Not sure, but I think he was agreeing with you: that we should figure
out the baseline behaviour and get it as useful as possible first, then
later look at adding some way to customize it. I agree -- I don't
think the user interface will be hard technically but I think it will
require some good ideas and there could be lots of bikeshedding. And a
lot of users will never even use it anyways so it's important to get
the defaults as useful as possible.

> Similarly for the non-scalar values - I don't think we can really keep
> regular statistics on such values (for the same reason why it's not
> enough for whole JSONB columns), so why to build/store that anyway.

For a default behaviour I wonder if it wouldn't be better to just
flatten and extract all the scalars. So if there's no matching path
then at least we have some way to estimate how often a scalar appears
anywhere in the json document.

That amounts to assuming the user knows the right path to find a given
scalar and there isn't a lot of overlap between keys. So it would at
least do something useful if you have something like {gender: female,
name: {first: nancy, last: reagan}, state: california, country: usa}.
It might get things slightly wrong if you have some people named
"georgia" or have names that can be first or last names.

But it would generally be doing something more or less useful as long
as they look for "usa" in the country field and "male" in the gender
field. If they looked for "male" in $.name.first path it would give
bad estimates but assuming they know their data structure they won't
be doing that.

-- 
greg




Re: refactoring basebackup.c

2022-01-25 Thread Robert Haas
On Tue, Jan 25, 2022 at 8:42 AM Dagfinn Ilmari Mannsåker
 wrote:
> I just noticed there was a superfluous [ in the SGML documentation, and
> that the short form was missing the [{client|server}-] part.  Updated
> patch attached.

Committed, thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Refactoring of compression options in pg_basebackup

2022-01-25 Thread Robert Haas
On Sat, Jan 22, 2022 at 12:47 AM Michael Paquier  wrote:
> Also, having this enum in walmethods.h is perhaps not the best place
> either, even more if you plan to use that in pg_basebackup for the
> server-side compression.  One idea is to rename this enum to
> DataCompressionMethod, moving it into a new header, like common.h as
> of the attached.

Well, we also have CompressionAlgorithm competing for the same job.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add spin_delay() implementation for Arm in s_lock.h

2022-01-25 Thread Blake, Geoff
Hi Tom, Andres,

Any additional feedback for this patch?

Thanks,
Geoff Blake



Re: autovacuum prioritization

2022-01-25 Thread Peter Geoghegan
On Tue, Jan 25, 2022 at 11:30 AM Robert Haas  wrote:
> But your broader point that we need to consider how much bloat
> represents a problem is a really good one. In the past, one rule that
> I've thought about is: if we're vacuuming a table and we're not going
> to finish before it needs to be vacuumed again, then we should vacuum
> faster (i.e. in effect, increase the cost limit on the fly).

That seems reasonable, but I doubt that that's a huge issue in
practice, now that the default cost limits are more sensible.

> That might still not result in good behavior, but it would at least result
> in behavior that is less bad. However, it doesn't really answer the
> question of how we decide when to start the very first VACUUM. I don't
> really know the answer to that question. The current heuristics result
> in estimates of acceptable bloat that are too high in some cases and
> too low in others. I've seen tables that got bloated vastly beyond
> what autovacuum is configured to tolerate before they caused any real
> difficulty, and I know there are other cases where users start to
> suffer long before those thresholds are reached.

ISTM that the easiest thing that could be done to improve this is to
give some consideration to page-level characteristics. For example, a
page that has 5 dead heap-only tuples is vastly different to a similar
page that has 5 LP_DEAD items instead -- and yet our current approach
makes no distinction. Chances are very high that if the only dead
tuples are heap-only tuples, then things are going just fine on that
page -- opportunistic pruning is actually keeping up. Page-level
stability over time seems to be the thing that matters most -- we must
make sure that the same "logical rows" that were inserted around the
same time remain on the same block for as long as possible, without
mixing in other unrelated tuples needlessly. In other words, preserve
natural locality.

This is related to the direction of things, and the certain knowledge
that VACUUM alone can deal with line pointer bloat. The current state
of individual pages hints at the direction of things even without
tracking how things change directly. But tracking the change over time
in ANALYZE seems better still: if successive ANALYZE operations notice
a consistent pattern where pages that had a non-zero number of LP_DEAD
items last time now have a significantly higher number, then it's a
good idea to err in the direction of more aggressive vacuuming.
*Growing* concentrations of LP_DEAD items signal chaos. I think that
placing a particular emphasis on pages with non-zero LP_DEAD items as
a qualitatively distinct category of page might well make sense --
relatively few blocks with a growing number of LP_DEAD items seems
like it should be enough to make autovacuum run aggressively.

As I pointed out not long ago, ANALYZE does a terrible job of
accurately counting dead tuples/LP_DEAD items when they aren't
uniformly distributed in the table -- which is often a hugely
important factor, with a table that is append-mostly with updates and
deletes. That's why I suggested bringing the visibility map into it.
In general I think that the statistics that drive autovacuum are
currently often quite wrong, even on their own simplistic,
quantitative terms.

> I don't see why we want multiple queues. We have to answer the
> question "what should we do next?" which requires us, in some way, to
> funnel everything into a single prioritization.

Even busy production DBs should usually only be vacuuming one large
table at a time. Also might make sense to strategically align the work
with the beginning of a new checkpoint.

-- 
Peter Geoghegan




Re: autovacuum prioritization

2022-01-25 Thread John Naylor
On Tue, Jan 25, 2022 at 2:30 PM Robert Haas  wrote:
>
> On Mon, Jan 24, 2022 at 11:14 PM Dilip Kumar  wrote:

> > I think we need to make different priority
> > queues based on different factors, for example 1 queue for wraparound
> > risk and another for bloat risk.
>
> I don't see why we want multiple queues. We have to answer the
> question "what should we do next?" which requires us, in some way, to
> funnel everything into a single prioritization.

I was thinking along the same lines as Dilip: If the anti-wraparound
risk is really far in the future, there might not be much eligible
freezing work to do. Dead tuples can be removed as soon as visibility
rules allow it. With a separate bloat queue, there might always be
some work to do. Maybe "bloat queue" is too specific, because
insert-only tables can use more vacuuming for the VM even if they have
not reached the configured threshold.

So a worker would check the wraparound queue, and if nothing's there
grab something from the other queue. Maybe low-priority work would
have a low cost limit.

Probably the best way to do scheduling, at least at first, is whatever
is least complex. I'm not yet sure what that is...
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: CREATEROLE and role ownership hierarchies

2022-01-25 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Jan 24, 2022, at 2:21 PM, Stephen Frost  wrote:
> > Being able to create and drop users is, in fact, effectively a 
> > superuser-only task today.  We could throw out the entire idea of role 
> > ownership, in fact, as being entirely unnecessary when talking about that 
> > specific task.
> 
> Wow, that's totally contrary to how I see this patch.  The heart and soul of 
> this patch is to fix the fact that CREATEROLE is currently overpowered.  
> Everything else is gravy.

I agree that CREATEROLE is overpowered and that the goal of this should
be to provide a way for roles to be created and dropped that doesn't
give the user who has that power everything that CREATEROLE currently
does.  The point I was making is that the concept of role ownership
isn't intrinsically linked to that and is, therefore, as you say, gravy.
That isn't to say that I'm entirely against the role ownership idea but
I'd want it to be focused on the goal of providing ways of creating and
dropping users and otherwise performing that kind of administration and
that doesn't require the specific change to make owners be members of
all roles they own and automatically have all privileges of those roles
all the time.

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Jan 24, 2022, at 2:21 PM, Stephen Frost  wrote:
> > 
> > Superuser is a problem specifically because it gives people access to do 
> > absolutely anything, both for security and safety concerns. Disallowing a 
> > way to curtail that same risk when it comes to role ownership invites 
> > exactly those same problems.
> 
> Before the patch, users with CREATEROLE can do mischief.  After the patch, 
> users with CREATEROLE can do mischief.  The difference is that the mischief 
> that can be done after the patch is a proper subset of the mischief that can 
> be done before the patch.  (Counter-examples highly welcome.)
> 
> Specifically, I claim that before the patch, non-superuser "bob" with 
> CREATEROLE can interfere with *any* non-superuser.  After the patch, 
> non-superuser "bob" with CREATEROLE can interfere with *some* non-superusers; 
> specifically, with non-superusers he created himself, or which have had 
> ownership transferred to him.
> 
> Restricting the scope of bob's mischief is a huge win, in my view.
> 
> The argument about whether owners should always implicitly inherit privileges 
> from roles they own is a bit orthogonal to my point about mischief-making.  
> Do we at least agree on the mischief-abatement aspect of this patch set?  

I don't know how many bites at this particular apple we're going to get,
but I doubt folks are going to be happy if we change our minds every
release.  Further, I suspect we'll be better off going too far in the
direction of 'mischief reduction' than not far enough.  If we restrict
things too far then we can provide ways to add those things back, but
it's harder to remove things we didn't take away.

This particular case is even an oddity on that spectrum though-
CREATEROLE users, today, don't have access to all the objects created by
roles which they create.  Yes, they can get such access if they go
through some additional hoops, but that could then be caught by someone
auditing the logs, a consideration that I don't think we appreciate
enough today.

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Jan 24, 2022, at 2:21 PM, Stephen Frost  wrote:
> > 
> > To push back on the original “tenant” argument, consider that one of the 
> > bigger issues in cloud computing today is exactly the problem that the 
> > cloud managers can potentially gain access to the sensitive data of their 
> > tenants and that’s not generally viewed as a positive thing.
> 
> +1.  This is a real problem.  I have been viewing this problem as separate 
> from the one which role ownership is intended to fix.  Do you have a 
> suggestion about how to tackle the problems together with less work than 
> tackling them separately?

I don't know about less work or not, but in this particular case I was
asking for a few lines to be removed from the patch.  I can believe that
doing so would create some issues in terms of the use-cases that you
want to solve with this and if we agree on those being sensible cases to
address then we'd need to implement something to address those, though
it's also possibly not the case and maybe removing those few lines
doesn't impact anything beyond then allowing owners to not automatically
inherit the rights of the roles they own if they don't wish to.

Instead of talking about those cases concretely though, it seems like
we've shifted to abstractly talking about ownership and landlords.
Maybe some of that is helpful, but it seems to increasingly be an area
that's causing more division than helping to move forward towards a
mutually agreeable result.

> >  This change would make it so that every landlord can go and SELECT from 
> > the tables of th

Re: pgsql: Server-side gzip compression.

2022-01-25 Thread David Rowley
On Wed, 26 Jan 2022 at 07:12, Robert Haas  wrote:
> wouldn't this same consideration apply to a very large number of other
> places in the code base?

All of the other places are handled. See locations with "keep compiler quiet".

This one is the only one that generates a warning:

basebackup_gzip.c(90): warning C4715: 'bbsink_gzip_new': not all
control paths return a value

David




Re: CREATEROLE and role ownership hierarchies

2022-01-25 Thread Mark Dilger



> On Jan 25, 2022, at 12:44 PM, Stephen Frost  wrote:
> 
> As I mentioned in the patch review, having a particular bit set doesn't
> necessarily mean you should be able to pass it on- the existing object
> GRANT system distinguishes those two and it seems like we should too.
> In other words, I'm saying that we should be able to explicitly say just
> what privileges a CREATEROLE user is able to grant to some other role
> rather than basing it on what that user themselves has.

I like the way you are thinking, but I'm not sure I agree with the facts you 
are asserting.

I agree that "CREATE ROLE.. ROLE .." differs from "CREATE ROLE .. ADMIN ..", 
and "GRANT..WITH GRANT OPTION" differs from "GRANT..", but those only cover 
privileges tracked in an aclitem array.  The privileges CREATEDB, CREATEROLE, 
REPLICATION, and BYPASSRLS don't work that way.  There isn't a with/without 
grant option distinction for them.  So I'm forced to say that a role without 
those privileges must not give them away.

I'd be happier if we could get rid of all privileges of that kind, leaving only 
those that can be granted with/without grant option, tracked in an aclitem, and 
use that to determine if the user creating the role can give them away.  But 
that's a bigger redesign of the system.  Just touching how CREATEROLE works 
entails backwards compatibility problems.  I'd hate to try to change all these 
other things; we'd be breaking a lot more, and features that appear more 
commonly used.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Fix BUG #17335: Duplicate result rows in Gather node

2022-01-25 Thread David Rowley
On Wed, 26 Jan 2022 at 05:32, Tom Lane  wrote:
> Therefore, what I think could be useful is some very-late-stage
> assertion check (probably in createplan.c) verifying that the
> child of a Gather is parallel-aware.  Or maybe the condition
> needs to be more general than that, but anyway the idea is for
> the back end of the planner to verify that we didn't build a
> silly plan.

Yeah, it would be nice to have something like this.  I think to do it,
we might need to invent some sort of path traversal function that can
take a custom callback function.  The problem is that the parallel
aware path does not need to be directly below the gather/gathermerge.

For example (from select_distinct.out)

 Unique
   ->  Sort
 Sort Key: four
 ->  Gather
   Workers Planned: 2
   ->  HashAggregate
 Group Key: four
 ->  Parallel Seq Scan on tenk1

For this case, the custom callback would check that there's at least 1
parallel_aware subpath below the Gather/GatherMerge.

There are probably some other rules that we could Assert are true.  I
think any parallel_aware paths (unless they're scans) must contain
only parallel_aware subpaths. For example, parallel hash join must
have a parallel aware inner and outer.

David




Re: Replace uses of deprecated Python module distutils.sysconfig

2022-01-25 Thread Andres Freund
Hi,

On 2022-01-25 12:45:15 -0500, Tom Lane wrote:
> As of now, 92 buildfarm animals have reported results from f032f63e7.
> Every single one of them reports that all the different methods you
> tested give the same answer.  So it looks to me like we should just
> go with get_config_var('INCLUDEPY') and be happy.
> 
> I guess next steps are to revert f032f63e7 and then retry e0e567a10
> with that change.  Who's going to do the honors?

I assume Peter is done working for the day. I'm stuck in meetings and stuff
for another 2-3 hours.  I can give it a go after that, unless you do so
before.

Greetings,

Andres Freund




Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread David G. Johnston
On Tue, Jan 25, 2022 at 8:33 AM Masahiko Sawada 
wrote:

> Given that we cannot use rely on the pg_stat_subscription_workers view
> for this purpose, we would need either a new sub-system that tracks
> each logical replication status so the system can set the error XID to
> subskipxid, or to wait for shared-memory based stats collector.
>

I'm reading over the monitoring-stats page to try and get my head around
all of this.  First of all, it defines two kinds of views:

1. PostgreSQL's statistics collector is a subsystem that supports
collection and reporting of information about server activity.
2. PostgreSQL also supports reporting dynamic information ... This facility
is independent of the collector process.

In then has two tables:

28.1 Dynamic Statistics Views (describing #2 above)
28.2 Collected Statistics Views (describing #1 above)

Apparently the "collector process" is UDP-like, not reliable.  The
documentation fails to mention this fact.  I'd argue that this is a
documentation bug.

I do see that the pg_stat_subscription_workers view is correctly placed in
Table 28.2

Reviewing the other views listed in that table, only pg_stat_archiver abuses
the statistics collector in a similar fashion.  All of the others are
actually metric oriented.

I don't care for the specification: "will contain one row per subscription
worker on which errors have occurred, for workers applying logical
replication changes and workers handling the initial data copy of the
subscribed tables."

I would much rather have this behave similar to pg_stat_activity (which, of
course, is a Dynamic Statistics View...) in that it shows only and all
workers that are presently working.  The tablesync workers should go away
when they have finished synchronizing.  I should not have to manually
intervene to get rid of unreliable expired data.  The log file feels like a
superior solution to this monitoring view.

Alternatively, if the tablesync workers are done but we've been
accumulating real statistics for them, then by all means keep them included
in the view - but regardless of whether they encountered an error.  But
maybe the view can right join in pg_stat_subscription and show a column for
"(pid is not null) AS is_active".

Maybe we need to add a track_finished_tablesync_workers GUC so the DBA can
decide whether to devote storage and processing resources to that
historical information.

If you had kept the original view name, "pg_stat_subscription_error", this
whole issue goes away.  But you decided to make it more generic and call it
"pg_stat_subscription_workers" - which means you need to get rid of the
error-specific condition in the WHERE clause for the view.  Show all
workers - I can filter on is_active.  Showing only active workers is also
acceptable.  You won't get to change your mind so decide whether this wants
to show only current and running state or whether historical statistics for
now defunct tablesync workers are desired.  Personally, I would just show
active workers and if someone wants to add the feature they can add a
track_tablesync_worker_stats GUC and a matching view.

From that, every apply worker should be sending a statistics message to the
collector periodically.  If error info is not present and the state is "all
is well", clear out any existing error info from the view.  The attempt to
include an actual statistic field here doesn't seem useful nor redeeming.
I would add a "state" field in its place (well, after subrelid).  And I
would still rename the columns to current_error_* and note that these
should be null unless the status field shows error (there may be some
additional complexity here).  Just get rid of last_error_count.

David J.

P.S.  I saw the discussion regarding pg_dump'ing the subskipxid field.  I
didn't notice any discussion around creating and restoring a basebackup.
It seems like during server startup subskipxid should just be cleared out.
Then it doesn't matter what one does during backup.


Re: Support for NSS as a libpq TLS backend

2022-01-25 Thread Jacob Champion
On Wed, 2022-01-19 at 10:01 +0100, Daniel Gustafsson wrote:
> > On 18 Jan 2022, at 17:37, Jacob Champion  wrote:
> > 
> > On Wed, 2021-12-15 at 23:10 +0100, Daniel Gustafsson wrote:
> > > I've attached a v50 which fixes the issues found by Joshua upthread, as 
> > > well as
> > > rebases on top of all the recent SSL and pgcrypto changes.
> > 
> > I'm currently tracking down a slot leak. When opening and closing large
> > numbers of NSS databases, at some point we appear to run out of slots
> > and then NSS starts misbehaving, even though we've closed all of our
> > context handles.
> 
> Interesting, are you able to share a reproducer for this so I can assist in
> debugging it?

(This was in my spam folder, sorry for the delay...) Let me see if I
can minimize my current reproduction case and get it ported out of
Python.

--Jacob


JSONB docs patch

2022-01-25 Thread Mikhail Dobrinin
Hello,

I have come across some missing documentation that I think could benefit
the community.

Several functions like `jsonb_exists`, `jsonb_exists_any`,
`jsonb_exists_all` have existed for many PG versions but were not
documented. They are equivalent to `?`, `?|`, and `?&` operators. But some
JDBC drivers have issues with native queries containing these operators
(see
https://stackoverflow.com/questions/38370972/how-do-i-use-postgresql-jsonb-operators-containing-a-question-mark-via-jdb),
so it is useful for users of PG to know the function equivalents of these
operators.
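
For illustration, each pair below returns the same result (these calls
use the existing, currently-undocumented functions, so nothing here
depends on the patch):

SELECT '{"a":1, "b":2}'::jsonb ? 'a';                              -- true
SELECT jsonb_exists('{"a":1, "b":2}'::jsonb, 'a');                 -- true

SELECT '{"a":1, "b":2}'::jsonb ?| array['a','x'];                  -- true
SELECT jsonb_exists_any('{"a":1, "b":2}'::jsonb, array['a','x']);  -- true

SELECT '{"a":1, "b":2}'::jsonb ?& array['a','x'];                  -- false
SELECT jsonb_exists_all('{"a":1, "b":2}'::jsonb, array['a','x']);  -- false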

I have attached the patch as an attachment to this email. The documentation
builds correctly without any lint errors after applying the patch locally.
This is my first time contributing, so let me know if there is anything
else I should do (add to commitfest etc).

Cheers!
Mikhail Dobrinin


jsonb-docs-v1.diff
Description: Binary data


Concurrent deadlock scenario with DROP INDEX on partitioned index

2022-01-25 Thread Jimmy Yih
Hello pgsql-hackers,

When dropping a partitioned index, the locking order starts with the
partitioned table, then its partitioned index, then the partition
indexes dependent on that partitioned index, and finally the dependent
partition indexes' parent tables. This order allows a deadlock
scenario to happen if for example an ANALYZE happens on one of the
partition tables which locks the partition table and then blocks when
it attempts to lock its index (the DROP INDEX has the index lock but
cannot get the partition table lock).

Deadlock Reproduction
==
Initial setup:
CREATE TABLE foopart (a int) PARTITION BY RANGE (a);
CREATE TABLE foopart_child PARTITION OF foopart FOR VALUES FROM (1) TO (5);
CREATE INDEX foopart_idx ON foopart USING btree(a);

Attach debugger to session 1 and put breakpoint on function deleteObjectsInList:
1: DROP INDEX foopart_idx;

While session 1 is blocked by debugger breakpoint, run the following:
2: ANALYZE foopart_child;

After a couple seconds, detach the debugger from session 1 and see deadlock.

In order to prevent this deadlock scenario, we should find and lock
all the sub-partition and partition tables beneath the partitioned
index's partitioned table before attempting to lock the sub-partition
and partition indexes.

Attached is a patch (+isolation test) which fixes the issue. We
observed this on latest head of Postgres.

Regards,
Jimmy Yih and Gaurab Dey


0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-on-.patch
Description:  0001-Fix-concurrent-deadlock-scenario-with-DROP-INDEX-on-.patch


Re: JSONB docs patch

2022-01-25 Thread David G. Johnston
On Tue, Jan 25, 2022 at 3:38 PM Mikhail Dobrinin 
wrote:

> Hello,
>
> I have come across some missing documentation that I think could benefit
> the community.
>
> Several functions like `jsonb_exists`, `jsonb_exists_any`,
> `jsonb_exists_all` have existed for many PG versions but were not
> documented. They are equivalent to `?`, `?|`, and `?&` operators. But some
> JDBC drivers have issues with native queries containing these operators
> (see
> https://stackoverflow.com/questions/38370972/how-do-i-use-postgresql-jsonb-operators-containing-a-question-mark-via-jdb),
> so it is useful for users of PG to know the function equivalents of these
> operators.
>
> I have attached the patch as an attachment to this email. The
> documentation builds correctly without any lint errors after applying
> the patch locally. This is my first time contributing, so let me know if
> there is anything else I should do (add to commitfest etc).
>
>
I'm doubtful that encouraging use of these functions for JDBC-users is
better than them learning to write queries using the proper operator.  The
reality is that the usage of indexes depends upon operators being used in
query text, not function names (unless you define a functional index, which
doesn't happen).  Your SO post says as much and does mention that ?? is
indeed the coding that is required.

What I think we should do in light of this reality, though, is indeed
prohibit "??" as (or within) an operator in PostgreSQL.  Since core is not
presently using that operator its prohibition should be reasonably simple -
though maybe extension authors got too creative?

-1 to this patch on the above grounds.  As for the patch itself:
The parentheticals you wrote might be appropriate for a commit message but
do not belong in the documentation.  Mentioning JDBC is simply a no-no; and
we don't document "why" we decided to document something. We also don't go
around pointing out what functions and operators perform the same behavior
(mostly because we generally just don't do that, see above).

I didn't actually review the material parts of the table.  Nothing seems
obviously incorrect there though.

David J.


Re: Support tab completion for upper character inputs in psql

2022-01-25 Thread Tom Lane
I spent some time contemplating my navel about the concerns I raised
upthread about double-quoted identifiers.  I concluded that the reason
things don't work well in that area is that we're trying to get all the
work done by applying quote_ident() on the backend side and then
ignoring quoting considerations in tab-complete itself.  That sort of
works, but not terribly well.  The currently proposed patch is sticking
a toe into the water of dealing with quoting/downcasing in tab-complete,
but we need to go a lot further.  I propose that we ought to drop the
use of quote_ident() in the tab completion queries altogether, instead
having the backend return names as-is, and doing all the dequoting and
requoting work in tab-complete.

Attached is a very-much-WIP patch along these lines.  I make no
pretense that it's complete; no doubt some of the individual
queries are broken or don't return quite the results we want.
But it seems to act the way I think it should for relation names.

One thing I'm particularly unsure what to do with is the queries
for type names, which want to match against the output of
format_type, which'll already have applied quote_ident.  We can
probably hack something up there, but I ran out of time to mess
with that for today.

Anyway, I wanted to post this just to see what people think of
going in this direction.

regards, tom lane

PS: I omitted the proposed regression test changes here.
Many of them are not at all portable --- different versions
of readline/libedit will produce different control character
sequences for backspacing, for example.  I got a lot of
failures when I tried to use those tests with this patch;
I've not run down which ones are test portability problems,
which are due to intentional behavior changes in this patch,
and which are due to breakage I've not fixed yet.

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 502b5c5751..2dadf7d945 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -47,6 +47,7 @@
 #include "catalog/pg_class_d.h"
 #include "common.h"
 #include "libpq-fe.h"
+#include "mb/pg_wchar.h"
 #include "pqexpbuffer.h"
 #include "settings.h"
 #include "stringutils.h"
@@ -148,8 +149,8 @@ typedef struct SchemaQuery
 	const char *namespace;
 
 	/*
-	 * Result --- the appropriately-quoted name to return, in the case of an
-	 * unqualified name.  For example, "pg_catalog.quote_ident(c.relname)".
+	 * Result --- the (unquoted) name to return, in the case of an unqualified
+	 * name.  For example, "c.relname".
 	 */
 	const char *result;
 
@@ -315,7 +316,7 @@ do { \
 		completion_info_charp = _completion_type; \
 		completion_info_charp2 = _completion_schema; \
 	} \
-	matches = rl_completion_matches(text, complete_from_query); \
+	matches = rl_completion_matches(text, complete_from_query_verbatim); \
 } while (0)
 
 #define COMPLETE_WITH_FUNCTION_ARG(function) \
@@ -357,14 +358,14 @@ static const SchemaQuery Query_for_list_of_aggregates[] = {
 		.selcondition = "p.prokind = 'a'",
 		.viscondition = "pg_catalog.pg_function_is_visible(p.oid)",
 		.namespace = "p.pronamespace",
-		.result = "pg_catalog.quote_ident(p.proname)",
+		.result = "p.proname",
 	},
 	{
 		.catname = "pg_catalog.pg_proc p",
 		.selcondition = "p.proisagg",
 		.viscondition = "pg_catalog.pg_function_is_visible(p.oid)",
 		.namespace = "p.pronamespace",
-		.result = "pg_catalog.quote_ident(p.proname)",
+		.result = "p.proname",
 	}
 };
 
@@ -378,7 +379,7 @@ static const SchemaQuery Query_for_list_of_datatypes = {
 	.viscondition = "pg_catalog.pg_type_is_visible(t.oid)",
 	.namespace = "t.typnamespace",
 	.result = "pg_catalog.format_type(t.oid, NULL)",
-	.qualresult = "pg_catalog.quote_ident(t.typname)",
+	.qualresult = "t.typname",
 };
 
 static const SchemaQuery Query_for_list_of_composite_datatypes = {
@@ -390,7 +391,7 @@ static const SchemaQuery Query_for_list_of_composite_datatypes = {
 	.viscondition = "pg_catalog.pg_type_is_visible(t.oid)",
 	.namespace = "t.typnamespace",
 	.result = "pg_catalog.format_type(t.oid, NULL)",
-	.qualresult = "pg_catalog.quote_ident(t.typname)",
+	.qualresult = "t.typname",
 };
 
 static const SchemaQuery Query_for_list_of_domains = {
@@ -398,7 +399,7 @@ static const SchemaQuery Query_for_list_of_domains = {
 	.selcondition = "t.typtype = 'd'",
 	.viscondition = "pg_catalog.pg_type_is_visible(t.oid)",
 	.namespace = "t.typnamespace",
-	.result = "pg_catalog.quote_ident(t.typname)",
+	.result = "t.typname",
 };
 
 /* Note: this intentionally accepts aggregates as well as plain functions */
@@ -409,13 +410,13 @@ static const SchemaQuery Query_for_list_of_functions[] = {
 		.selcondition = "p.prokind != 'p'",
 		.viscondition = "pg_catalog.pg_function_is_visible(p.oid)",
 		.namespace = "p.pronamespace",
-		.result = "pg_catalog.quote_ident(p.proname)",
+		.result = "p.proname",
 	},
 	{
 		.catname = "pg_catalog.pg_proc p",
 		.viscondition = "pg_catalog.pg_function_is_visible(p.oid)

ssl_passphrase_callback failure in REL_14_STABLE

2022-01-25 Thread Thomas Munro
Hi,

I don't know what's going on here yet, but I noticed a $SUBJECT about
half the time on my machine eelpout, starting 2-3 weeks ago.  It fails
like:

Bailout called.  Further testing stopped:  system initdb failed

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL_14_STABLE

Unfortunately when I 'check' in
src/test/modules/ssl_passphrase_callback on that host it doesn't
fail, there doesn't seem to be anything special about that particular
initdb invocation, no commit is jumping out at me in that time window,
and the BF logs don't show any initdb output...




Re: Replace uses of deprecated Python module distutils.sysconfig

2022-01-25 Thread Tom Lane
Andres Freund  writes:
> On 2022-01-25 12:45:15 -0500, Tom Lane wrote:
>> I guess next steps are to revert f032f63e7 and then retry e0e567a10
>> with that change.  Who's going to do the honors?

> I assume Peter is done working for the day. I'm stuck in meetings and stuff
> for another 2-3 hours.  I can give it a go after that, unless you do so
> before.

OK, will do.

regards, tom lane




Re: ssl_passphrase_callback failure in REL_14_STABLE

2022-01-25 Thread Tom Lane
Thomas Munro  writes:
> I don't know what's going on here yet, but I noticed a $SUBJECT about
> half the time on my machine eelpout, starting 2-3 weeks ago.  It fails
> like:

> Bailout called.  Further testing stopped:  system initdb failed

> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=eelpout&br=REL_14_STABLE

> Unfortunately when I 'check' in
> src/test/modules/ssl_passphrase_callback on that host it doesn't
> fail, there doesn't seem to be anything special about that particular
> initdb invocation, no commit is jumping out at me in that time window,
> and the BF logs don't show any initdb output...

Any possibility that it's out-of-disk-space?

regards, tom lane




Re: Non-decimal integer literals

2022-01-25 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 25, 2022 at 5:34 AM Peter Eisentraut
>  wrote:
>> Which part exactly?  There are several different changes proposed here.

> I was just going based on the description of the feature in your
> original post. If someone is hoping that int4in() will accept only
> ^\d+$ then they will be disappointed by this patch.

Maybe I misunderstood, but I thought this was about what you could
write as a SQL literal, not about the I/O behavior of the integer
types.  I'd be -0.1 on changing the latter.

regards, tom lane




Re: sequences vs. synchronous replication

2022-01-25 Thread Tomas Vondra

On 1/25/22 10:18, Peter Eisentraut wrote:


On 15.01.22 23:57, Tomas Vondra wrote:
This approach (and also my previous proposal) seems to assume that 
the value returned from nextval() should not be used until the 
transaction executing that nextval() has been committed successfully. 
But I'm not sure how many applications follow this assumption. Some 
application might use the return value of nextval() instantly before 
issuing commit command. Some might use the return value of nextval() 
executed in rollbacked transaction.




IMO any application that assumes data from uncommitted transactions is 
outright broken and we should not try to fix that because it's quite 
futile (and likely will affect well-behaving applications).


The issue I'm trying to fix in this thread is much narrower - we don't 
actually meet the guarantees for committed transactions (that only did 
nextval without generating any WAL).


The wording in the SQL standard is:

"Changes to the current base value of a sequence generator are not 
controlled by SQL-transactions; therefore, commits and rollbacks of 
SQL-transactions have no effect on the current base value of a sequence 
generator."


This implies the well-known behavior that consuming a sequence value is 
not rolled back.  But it also appears to imply that committing a 
transaction has no impact on the validity of a sequence value produced 
during that transaction.  In other words, this appears to imply that 
making use of a sequence value produced in a rolled-back transaction is 
valid.


A very strict reading of this would seem to imply that every single 
nextval() call needs to be flushed to WAL immediately, which is of 
course impractical.


I'm not an expert in reading standards, but I'd not interpret it that 
way. I think it simply says the sequence must not go back, no matter 
what happened to the transaction.


IMO interpreting this as "must not lose any increments from uncommitted 
transactions" is maybe a bit too strict, and as you point out it's also 
impractical because it'd mean calling nextval() repeatedly flushes WAL 
all the time. Not great for batch loads, for example.


I don't think we need to flush WAL for every nextval() call, if we don't 
write WAL for every increment - I think we still can batch WAL for 32 
increments just like we do now (AFAICS that'd not contradict even this 
quite strict interpretation of the standard).


OTOH the flush would have to happen immediately, we can't delay that 
until the end of the transaction. Which is going to affect even cases 
that generate WAL for other reasons (e.g. doing insert), which was 
entirely unaffected by the previous patches.


And the flush would have to happen even for sessions that didn't write 
WAL (which was what started this thread) - we could use page LSN and 
flush only to that (so we'd flush once and then it'd be noop until the 
sequence increments 32-times and writes another WAL record).


Of course, it's not enough to just flush WAL, we have to wait for the 
sync replica too :-(


I don't have any benchmark results quantifying this yet, but I'll do 
some tests in the next day or two. But my expectation is this is going 
to be pretty expensive, and considering how concerned we were about 
affecting current workloads, making the impact worse seems wrong.



My opinion is we should focus on fixing this given the current (weaker) 
interpretation of the standard, i.e. accepting the loss of increments 
observed only by uncommitted transactions. The page LSN patch seems like 
the best way to do that so far.



We may try reworking this to provide the stronger guarantees (i.e. not 
losing even increments from uncommitted transactions) in the future, of 
course. But considering (a) we're not sure that's really what the SQL 
standard requires, (b) no one complained about that in years, and (c) 
it's going to make sequences way more expensive, I doubt that's really 
desirable.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 27cb6307581..0ab95bd3d8e 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -37,6 +37,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "parser/parse_type.h"
+#include "replication/syncrep.h"
 #include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
@@ -812,6 +813,22 @@ nextval_internal(Oid relid, bool check_permissions)
 		PageSetLSN(page, recptr);
 	}
 
+	/*
+	 * Make sure the WAL for the sequence is properly flushed, before returning
+	 * the nextval result. For sessions that did not generate WAL (and so use a
+	 * cached value), this ensures the WAL generated by other sessions reaches
+	 * disk. Also, wait for sync replica.
+	 * 
+	 */
+	if (RelationNeedsWAL(seqrel))
+	{
+		XLogRecPtr	recptr = PageGetLSN(page);
+
+		XLogFlush(recptr);
+
+		SyncRepWaitForLSN(recptr, tr

Re: ssl_passphrase_callback failure in REL_14_STABLE

2022-01-25 Thread Thomas Munro
On Wed, Jan 26, 2022 at 12:59 PM Tom Lane  wrote:
> Any possibility that it's out-of-disk-space?

Erm, yeah that's embarrassing, it could well be that.  I've just given
it some more room.




Re: pg_upgrade should truncate/remove its logs before running

2022-01-25 Thread Michael Paquier
On Tue, Jan 25, 2022 at 10:45:29AM -0600, Justin Pryzby wrote:
> Andrew: you wanted to accommodate any change on the build client, right ?

Yes, this is going to need an adjustment of @logfiles in
TestUpgrade.pm, with the addition of
"$tmp_data_dir/pg_update_output.d/log/*.log" to be consistent with the
data fetched for the tests of older branches.
--
Michael


signature.asc
Description: PGP signature


Re: GUC flags

2022-01-25 Thread Michael Paquier
On Tue, Jan 25, 2022 at 12:07:51PM -0600, Justin Pryzby wrote:
> On Tue, Jan 25, 2022 at 11:47:14AM +0100, Peter Eisentraut wrote:
>> Does this stuff have any value for users?  I'm worried we are exposing a
>> bunch of stuff that is really just for internal purposes.  Like, what value
>> does showing "not_in_sample" have?  On the other hand, "guc_explain" might
>> be genuinely useful, since that is part of a user-facing feature.  (I don't
>> like the "guc_*" naming though.)

EXPLAIN is useful to know which parameters could be part of an EXPLAIN
query, as that information is not provided now, even if the category
provides a hint.  COMPUTED is also useful for the purpose of postgres
-C in my opinion.  I am reserved about the rest in terms of user
experience, but the other ones are useful to automate the checks
check_guc was doing, which is still the main goal of this patch if we
remove this script.  And experience has proved lately that people
often forget to mark GUCs correctly.

> I interpreted Michael's suggestion as adding it to pg_get_all_settings(), but
> *not* including it in the pg_settings view.  Now it seems like I 
> misunderstood,
> and Michael wants to add it to the view.

Yeah, I meant to add that in the view, as it is already wide.  I'd be
fine with a separate SQL function at the end, but putting that in
pg_show_all_settings() without considering pg_settings would not be
consistent.  There is the argument that one could miss an update of
system_views.sql if adding more data to pg_show_all_settings(), even
if that's not really going to happen.

> But, even if we only handle the 5 flags we have an immediate use for, it makes
> the user-facing view too "wide", just to accommodate this internal use.

short_desc and extra_desc count for most of the bloat already, so that
would not change much, but I am fine to discard my point to not make
things worse.
--
Michael


signature.asc
Description: PGP signature


Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-01-25 Thread Michael Paquier
On Tue, Jan 25, 2022 at 07:20:05PM +, Bossart, Nathan wrote:
> I looked into removing the "shutdown" variable in favor of using
> "flags" everywhere, but the patch was quite messy and repetitive.  I
> think another way to make things less confusing is to replace
> "shutdown" with an inverse variable called "online."  The attached
> patch does it this way.

Yeah, that sounds like a good compromise.  At least, I find the whole
a bit easier to follow.

Heikki was planning to commit a large refactoring of xlog.c, so we'd
better wait for that to happen before concluding here.
--
Michael


signature.asc
Description: PGP signature


Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Amit Kapila
On Tue, Jan 25, 2022 at 6:18 PM Peter Eisentraut
 wrote:
>
> On 25.01.22 03:54, Amit Kapila wrote:
> >> I don't think this functionality allows a nonprivileged user to do
> >> anything they couldn't otherwise do.  You can create inconsistent data
> >> in the sense that you can choose not to apply certain replicated data.
> >>
> > I thought this will be the only primary way to skip applying certain
> > transactions. The other could be via pg_replication_origin_advance().
> > Or are you talking about the case where we skip applying update/delete
> > where the corresponding rows are not found?
> >
> > I see the point that if we can allow the owner to skip applying
> > updates/deletes in certain cases then probably this should also be
> > okay. Kindly let us know if you have something else in mind as well?
>
> Let's start this again: The question at hand is whether ALTER
> SUBSCRIPTION ... SKIP should be allowed for subscription owners that are
> not superusers.  The argument raised against that was that this would
> allow the owner to create "inconsistent" data.  But it hasn't been
> explained what that actually means or why it is dangerous.
>

There are two reasons in my mind: (a) Because the entire transaction is
skipped, we also skip unrelated data changes that are not the direct
cause of the conflict. It is possible that this unintentionally skips
actual insert/update/delete/truncate changes to some relations, which
can then make future changes conflict or fail to apply. For example,
after a TRUNCATE is skipped, INSERTs in the following transactions can
fail with a "duplicate key .." error (see the sketch below); similarly,
if some INSERT is skipped, a following UPDATE/DELETE won't find the
corresponding row to operate on. (b) Users can specify some random XID;
the discussion below is trying to detect this and raise a WARNING/ERROR,
but it could still cause some valid transaction (one that wouldn't
generate any conflict/error) to be skipped.

Either way, this can lead to missing data on the subscriber that the
user might not have expected.
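
To make the TRUNCATE example in (a) concrete, a minimal sketch (the
table, key, and values are made up for illustration; assume tab has a
primary key on its single column):

    -- Subscriber state before the conflict: tab already contains 1.

    -- Publisher transaction 1: conflicts on the subscriber and is
    -- skipped in its entirety, so the subscriber keeps its old row.
    BEGIN;
    TRUNCATE tab;
    INSERT INTO tab VALUES (2);
    COMMIT;

    -- Publisher transaction 2: valid on the publisher (tab was
    -- truncated there), but applying it fails on the subscriber with
    -- ERROR:  duplicate key value violates unique constraint "tab_pkey"
    INSERT INTO tab VALUES (1);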

-- 
With Regards,
Amit Kapila.




Re: Refactoring of compression options in pg_basebackup

2022-01-25 Thread Michael Paquier
On Tue, Jan 25, 2022 at 03:14:13PM -0500, Robert Haas wrote:
> On Sat, Jan 22, 2022 at 12:47 AM Michael Paquier  wrote:
>> Also, having this enum in walmethods.h is perhaps not the best place
>> either, even more if you plan to use that in pg_basebackup for the
>> server-side compression.  One idea is to rename this enum to
>> DataCompressionMethod, moving it into a new header, like common.h as
>> of the attached.
> 
> Well, we also have CompressionAlgorithm competing for the same job.

Sure, but I don't think that it is a good idea to unify that yet, at
least not until pg_dump is able to handle LZ4 as an option.  The main
benefit we'd gain here is being able to change the code to a
switch/case without a default, so that code paths requiring an update
are detected when support for a new option is added.
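
For reference, the kind of pattern I have in mind looks roughly like
this (an illustrative sketch; the helper and the exact enum members
are assumptions):

    typedef enum DataCompressionMethod
    {
        COMPRESSION_NONE,
        COMPRESSION_GZIP,
        COMPRESSION_LZ4
    } DataCompressionMethod;

    static const char *
    compression_suffix(DataCompressionMethod method)
    {
        switch (method)
        {
            case COMPRESSION_NONE:
                return "";
            case COMPRESSION_GZIP:
                return ".gz";
            case COMPRESSION_LZ4:
                return ".lz4";
                /* no default: the compiler flags any method added later */
        }
        return "";              /* keep the compiler quiet */
    }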
--
Michael




Re: refactoring basebackup.c

2022-01-25 Thread Michael Paquier
On Tue, Jan 25, 2022 at 03:54:52AM +, Shinoda, Noriyoshi (PN Japan FSIP) 
wrote:
> Michael, I am proposing that we remove this message as part of
> this commit: 
> 
> -pg_log_info("no value specified for compression
> level, switching to default");
> 
> I think most people won't want to specify a compression level, so
> emitting a message when they don't seems too verbose. 

(Just noticed this message as I am not in CC.)
Removing this message is fine by me, thanks!
--
Michael




Re: refactoring basebackup.c

2022-01-25 Thread Michael Paquier
On Tue, Jan 25, 2022 at 09:52:12PM +0530, tushar wrote:
> C) -R option is silently ignored
> 
> go to the /tmp/pp folder and extract it - there is no "standby.signal" file,
> and if we start a cluster against this data directory, it will not be in
> standby mode.

Yeah, I don't think it's good to silently ignore the option, and we
should not generate the file on the server side.  Rather than erroring
out in this case, it would be better to add the file to the existing
compressed archive of the base data directory on the client side.
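
In the meantime, the manual equivalent of -R on such a backup is
roughly the following (the paths and archive name come from the report
above, and the connection string is a placeholder):

    # Extract the server-side compressed base backup, then set up
    # standby mode by hand, which is what -R would normally have done.
    cd /tmp/pp
    mkdir -p data
    tar -xzf base.tar.gz -C data
    touch data/standby.signal
    # -R would also have appended the connection settings for us.
    echo "primary_conninfo = 'host=primary port=5432 user=replicator'" \
        >> data/postgresql.auto.conf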

This makes me wonder whether we should begin tracking any open items
for v15..  We don't want to lose track of any issue with features
committed already in the tree.
--
Michael




RE: Support tab completion for upper character inputs in psql

2022-01-25 Thread tanghy.f...@fujitsu.com
On Tuesday, January 25, 2022 6:44 PM, Julien Rouhaud  wrote:
> Thanks for updating the patch.  When you do so, please check and update the
> commitfest entry accordingly to make sure that people knows it's ready for
> review.  I'm switching the entry to Needs Review.
> 

Thanks for your reminder. I'll watch out for the status change as you suggested.

Regards,
Tang




Re: pg_upgrade should truncate/remove its logs before running

2022-01-25 Thread Michael Paquier
On Wed, Jan 26, 2022 at 09:44:48AM +0900, Michael Paquier wrote:
> Yes, this is going to need an adjustment of @logfiles in
> TestUpgrade.pm, with the addition of
> "$tmp_data_dir/pg_upgrade_output.d/log/*.log" to be consistent with the
> data fetched for the tests of older branches.

Bleh.  This would point to the old data directory, so this needs to be
"$self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*.log"
to point to the upgraded cluster.
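
Roughly, something like this on the buildfarm side (a sketch; the
exact structure of the @logfiles handling in TestUpgrade.pm is an
assumption here):

    # Also collect the pg_upgrade server logs from the upgraded cluster.
    push @logfiles,
      glob("$self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*.log");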
--
Michael




Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread David G. Johnston
On Mon, Jan 24, 2022 at 12:59 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

>
> > 5(out). wait for the user to manually restart the replication stream
>>
>> Do you mean that there always is user intervention after error so the
>> replication stream can resume?
>>
>
> That is my working assumption.  It doesn't seem like the system would
> auto-resume without a DBA doing something (I'll attribute a server crash to
> the DBA for convenience).
>
> Apparently I need to read more about how the system works today to
> understand how this varies from and integrates with today's user experience.
>
>
I've done some code reading.  My understanding is that a background worker
for the main apply of a given subscription is created from the launcher
code (not reviewed) which is initialized at server startup (or as needed
sometime thereafter).  This goes into a for(;;) loop in LogicalRepApplyLoop
under a PG_TRY in ApplyWorkerMain.  When a message is applied that provokes
an error, the PG_CATCH() in ApplyWorkerMain takes over and then this worker
dies.  While in that PG_CATCH() we have an aborted transaction and so are
limited in what we can change.  We PG_RE_THROW() back to the background
worker infrastructure and let it perform logging and cleanup, which
includes destroying this instance of the background worker.  The
background worker that is destroyed is replaced and its replacement is
identical to the original so far as the statistics collector is concerned.

I haven't traced out when the replacement apply worker gets recreated.  It
seems like recreating it immediately, only for it to hit the same error
again, would be an undesirable choice, so I've assumed it does not.  But I
also wasn't expecting the apply worker to PG_RE_THROW(); I expected it to
continue running in a different for(;;) loop, waiting for some signal from
the system that something has changed that may avoid the error that put it
in timeout.

So my more detailed goal would be to get rid of PG_RE_THROW(); (I assume
doing so would entail transaction rollback) and stay in the worker.  Update
pg_subscription with the error information (having removed PG_RE_THROW we
have new things to consider re: pg_stat_subscription_workers). Go into a
for(;;) loop, maybe polling pg_subscription for an indication that it is OK
to retry applying the last transaction. (can an inter-process signal be
sent from a normal backend process to a background worker process?).  The
SKIP command then matches XID values on pg_subscription; the resumption
sees the subskipxid, updates pg_subscription to remove the error info and
subskipxid, skips the next transaction assuming it has the matching XID, and
then continues applying as normal.  Adapt to deal with crash conditions as
needed though clearing before reapplying seems like a safe default.  Again,
upon worker startup maybe they should be cleared too (making pg_dump and
other backup considerations moot - as noted in my P.S. in the previous
email).
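
In rough pseudo-C, the flow I have in mind is something like this (the
lower-case helper names are invented for the sketch and do not exist in
worker.c):

    for (;;)
    {
        PG_TRY();
        {
            apply_next_transaction();   /* invented name */
        }
        PG_CATCH();
        {
            AbortCurrentTransaction();
            record_apply_error_in_pg_subscription();    /* invented */

            /* Instead of PG_RE_THROW(), wait here for the DBA. */
            for (;;)
            {
                CHECK_FOR_INTERRUPTS();
                if (skip_xid_matches_failed_xid())      /* invented */
                {
                    clear_skip_xid_and_error_info();    /* invented */
                    skip_failed_transaction();          /* invented */
                    break;
                }
                (void) WaitLatch(MyLatch,
                                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                                 1000L, PG_WAIT_EXTENSION);
                ResetLatch(MyLatch);
            }
        }
        PG_END_TRY();
    }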

I'm not sure we are paranoid enough regarding the locking of
pg_subscription for purposes of reading and writing subskipxid.  I'd
probably rather serialize access to it, and maybe even not allow changing
from one non-zero XID to another non-zero XID.  It shouldn't be needed in
practice (moreso if the XID has to be the one that is present from
current_error_xid) and the user can always reset first.

In worker.c I was and still am confused as to the meaning of 'c' and 'w' in
LogicalRepApplyLoop.  In apply_dispatch in that file, enums are used to
compare against the message byte; it would be helpful for the inexperienced
reader if 'c' and 'w' were defined as enums as well.
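
For example, even something minimal along these lines would help (a
sketch; the identifiers and the meanings in the comments are my
assumptions, not the real protocol definitions):

    typedef enum ApplyLoopMsgType
    {
        APPLY_LOOP_MSG_WAL_DATA = 'w',  /* assumed: streamed WAL payload */
        APPLY_LOOP_MSG_COPY_DATA = 'c'  /* assumed: copy-protocol data byte */
    } ApplyLoopMsgType;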

David J.


Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Amit Kapila
On Tue, Jan 25, 2022 at 8:39 PM Masahiko Sawada  wrote:
>
> On Tue, Jan 25, 2022 at 11:58 PM David G. Johnston
>  wrote:
> >
> > On Tue, Jan 25, 2022 at 7:47 AM Masahiko Sawada  
> > wrote:
> >>
> >> Yeah, I think it's a good idea to clear the subskipxid after the first
> >> transaction regardless of whether the worker skipped it.
> >>
> >
> > So basically instead of stopping the worker with an error you suggest 
> > having the worker continue applying changes (after resetting subskipxid, 
> > and - arguably - the ?_error_* fields).  Log the transaction xid mis-match 
> > as a warning in the log file as opposed to an error.
>
> Agreed, I think it's better to log a warning than to raise an error.
> In the case where the user specified the wrong XID, the worker should
> fail again due to the same error.
>

IIUC, the proposal is to compare skip_xid with the very first
transaction the apply worker receives to apply, raise a warning if it
doesn't match skip_xid, and then continue. This seems like a
reasonable idea, but can we guarantee that the first transaction is
always the one we want to skip? We only guarantee that we won't get
something again once it is written durably/flushed on the subscriber
side. I guess it can happen that before the errored transaction there
is some empty xact, or maybe part of the stream (consider streaming
transactions) of some xact, or there could be other cases as well
where the server will send those xacts again.

Now, if the above reasoning is correct then I think your proposal to
clear the skip_xid in the catalog as soon as we have applied the first
transaction successfully seems reasonable to me.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Amit Kapila
On Wed, Jan 26, 2022 at 7:31 AM David G. Johnston
 wrote:
>
> On Mon, Jan 24, 2022 at 12:59 AM David G. Johnston 
>  wrote:
>>
>
> So my more detailed goal would be to get rid of PG_RE_THROW();
>

I don't think that will be possible, consider the FATAL/PANIC error
case. Also, there are reasons why we always restart apply worker on
ERROR even without this work. If we want to change that, we might need
to redesign the apply side mechanism which I don't think we should try
to do as part of this patch.

-- 
With Regards,
Amit Kapila.




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-01-25 Thread Bharath Rupireddy
On Wed, Jan 26, 2022 at 6:28 AM Michael Paquier  wrote:
>
> On Tue, Jan 25, 2022 at 07:20:05PM +, Bossart, Nathan wrote:
> > I looked into removing the "shutdown" variable in favor of using
> > "flags" everywhere, but the patch was quite messy and repetitive.  I
> > think another way to make things less confusing is to replace
> > "shutdown" with an inverse variable called "online."  The attached
> > patch does it this way.
>
> Yeah, that sounds like a good compromise.  At least, I find the whole
> thing a bit easier to follow.

v3 LGTM.

> Heikki was planning to commit a large refactoring of xlog.c, so we'd
> better wait for that to happen before concluding here.

Will that patch set fix the issue reported here?

Regards,
Bharath Rupireddy.




Re: Skipping logical replication transactions on subscriber side

2022-01-25 Thread Masahiko Sawada
On Wed, Jan 26, 2022 at 11:28 AM Amit Kapila  wrote:
>
> On Tue, Jan 25, 2022 at 8:39 PM Masahiko Sawada  wrote:
> >
> > On Tue, Jan 25, 2022 at 11:58 PM David G. Johnston
> >  wrote:
> > >
> > > On Tue, Jan 25, 2022 at 7:47 AM Masahiko Sawada  
> > > wrote:
> > >>
> > >> Yeah, I think it's a good idea to clear the subskipxid after the first
> > >> transaction regardless of whether the worker skipped it.
> > >>
> > >
> > > So basically instead of stopping the worker with an error you suggest 
> > > having the worker continue applying changes (after resetting subskipxid, 
> > > and - arguably - the ?_error_* fields).  Log the transaction xid 
> > > mis-match as a warning in the log file as opposed to an error.
> >
> > Agreed, I think it's better to log a warning than to raise an error.
> > In the case where the user specified the wrong XID, the worker should
> > fail again due to the same error.
> >
>
> IIUC, the proposal is to compare skip_xid with the very first
> transaction the apply worker receives to apply, raise a warning if it
> doesn't match skip_xid, and then continue. This seems like a
> reasonable idea, but can we guarantee that the first transaction is
> always the one we want to skip? We only guarantee that we won't get
> something again once it is written durably/flushed on the subscriber
> side. I guess it can happen that before the errored transaction there
> is some empty xact, or maybe part of the stream (consider streaming
> transactions) of some xact, or there could be other cases as well
> where the server will send those xacts again.

Good point.

I guess that in the situation where the worker has entered an error loop,
we can guarantee that the worker fails while applying the first non-empty
transaction after logical replication starts, and that transaction is the
one we'd like to skip. If a transaction that can be applied without an
error is resent after a restart, that is a problem of logical replication
itself. As you pointed out, it's possible that there are some empty
transactions before the transaction in question, since we don't advance
the replication origin LSN if the transaction is empty. The same is
probably also true for a streamed transaction that is rolled back, or for
ROLLBACK PREPARED transactions. So, could we also skip clearing subskipxid
if the transaction is empty? That is, we make sure to clear it only after
applying the first non-empty transaction. We would need to think about
this solution carefully, otherwise ALTER SUBSCRIPTION SKIP ends up not
working at all in some cases.
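
Something like the following rough sketch is what I mean; the variable
and helper names are invented here since the catalog side is still
under discussion:

    /*
     * Clear subskipxid only once a non-empty transaction has been
     * processed, so that empty transactions (or replayed stream
     * fragments) resent by the server do not consume the setting.
     */
    static void
    maybe_clear_subskipxid(TransactionId xid, bool xact_was_empty)
    {
        if (xact_was_empty)
            return;             /* keep subskipxid for the next transaction */

        if (!TransactionIdIsValid(subscription_skip_xid))   /* invented */
            return;

        if (subscription_skip_xid != xid)
            ereport(WARNING,
                    (errmsg("skip_xid %u does not match the applied transaction %u",
                            subscription_skip_xid, xid)));

        clear_subskipxid_in_catalog();  /* invented helper */
    }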

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



