Re: Improve description of XLOG_RUNNING_XACTS
At Fri, 9 Sep 2022 09:48:05 +0900, Masahiko Sawada wrote in > On Tue, Aug 23, 2022 at 11:53 AM Kyotaro Horiguchi > wrote: > > > > At Mon, 15 Aug 2022 11:16:56 +0900, Masahiko Sawada > > wrote in > > > Or we can output the "subxid overflowed" first. > > > > (I prefer this, as that doesn't change the output in the normal case > > but the anomaly will be easily seen if it happens.) > > > > Updated the patch accordingly. Thanks! Considering the discussion so far, how about adding a comment like this? + appendStringInfoString(buf, "; subxid overflowed"); + ++ /* ++ * subxids and subxid_overflow are mutually exclusive, but we deliberately ++ * print both simultaneously in case the record is broken. ++ */ + if (xlrec->subxcnt > 0) + { + appendStringInfo(buf, "; %d subxacts:", xlrec->subxcnt); + for (i = 0; i < xlrec->subxcnt; i++) + appendStringInfo(buf, " %u", xlrec->xids[xlrec->xcnt + i]); + } regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Perform streaming logical transactions by background workers and parallel apply
Here are my review comments for the v28-0001 patch: (There may be some overlap with other people's review comments and/or some fixes already made). == 1. Commit Message In addition, the patch extends the logical replication STREAM_ABORT message so that abort_time and abort_lsn can also be sent which can be used to update the replication origin in parallel apply worker when the streaming transaction is aborted. ~ Should this also mention that, because this message extension is needed to support parallel streaming, parallel streaming is not supported for publications on servers < PG16? == 2. doc/src/sgml/config.sgml Specifies maximum number of logical replication workers. This includes -both apply workers and table synchronization workers. +apply leader workers, parallel apply workers, and table synchronization +workers. "apply leader workers" -> "leader apply workers" ~~~ 3. max_logical_replication_workers (integer) Specifies maximum number of logical replication workers. This includes apply leader workers, parallel apply workers, and table synchronization workers. Logical replication workers are taken from the pool defined by max_worker_processes. The default value is 4. This parameter can only be set at server start. ~ I did not really understand why the default is 4. Because the default tablesync workers is 2, and the default parallel workers is 2, but what about accounting for the apply worker? Therefore, shouldn't max_logical_replication_workers default be 5 instead of 4? == 4. src/backend/commands/subscriptioncmds.c - defGetStreamingMode + } + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("%s requires a Boolean value or \"parallel\"", + def->defname))); + return SUBSTREAM_OFF; /* keep compiler quiet */ +} Some whitespace before the ereport and the return might be tidier. == 5. src/backend/libpq/pqmq.c + { + if (IsParallelWorker()) + SendProcSignal(pq_mq_parallel_leader_pid, +PROCSIG_PARALLEL_MESSAGE, +pq_mq_parallel_leader_backend_id); + else + { + Assert(IsLogicalParallelApplyWorker()); + SendProcSignal(pq_mq_parallel_leader_pid, +PROCSIG_PARALLEL_APPLY_MESSAGE, +pq_mq_parallel_leader_backend_id); + } + } This code can be simplified if you want to. For example, { ProcSignalReason reason; Assert(IsParallelWorker() || IsLogicalParallelApplyWorker()); reason = IsParallelWorker() ? PROCSIG_PARALLEL_MESSAGE : PROCSIG_PARALLEL_APPLY_MESSAGE; SendProcSignal(pq_mq_parallel_leader_pid, reason, pq_mq_parallel_leader_backend_id); } == 6. src/backend/replication/logical/applyparallelworker.c Is there a reason why this file is called applyparallelworker.c instead of parallelapplyworker.c? Now this name is out of step with names of all the new typedefs etc. ~~~ 7. +/* + * There are three fields in each message received by parallel apply worker: + * start_lsn, end_lsn and send_time. Because we have updated these statistics + * in leader apply worker, we could ignore these fields in parallel apply + * worker (see function LogicalRepApplyLoop). + */ +#define SIZE_STATS_MESSAGE (2 * sizeof(XLogRecPtr) + sizeof(TimestampTz)) SUGGESTION (Just added the word "the" and changed "could" -> "can") There are three fields in each message received by the parallel apply worker: start_lsn, end_lsn and send_time. Because we have updated these statistics in the leader apply worker, we can ignore these fields in the parallel apply worker (see function LogicalRepApplyLoop). ~~~ 8. +/* + * List that stores the information of parallel apply workers that were + * started. 
Newly added worker information will be removed from the list at the + * end of the transaction when there are enough workers in the pool. Besides, + * exited workers will be removed from the list after being detected. + */ +static List *ParallelApplyWorkersList = NIL; Perhaps this comment can give more explanation of what is meant by the part that says "when there are enough workers in the pool". ~~~ 9. src/backend/replication/logical/applyparallelworker.c - parallel_apply_can_start + /* + * Don't start a new parallel worker if not in streaming parallel mode. + */ + if (MySubscription->stream != SUBSTREAM_PARALLEL) + return false; "streaming parallel mode." -> "parallel streaming mode." ~~~ 10. + /* + * For streaming transactions that are being applied using parallel apply + * worker, we cannot decide whether to apply the change for a relation that + * is not in the READY state (see should_apply_changes_for_rel) as we won't + * know remote_final_lsn by that time. So, we don't start the new parallel + * apply worker in this case. + */ + if (!AllTablesyncsReady()) + return false; "using parallel apply worker" -> "using a parallel apply worker" ~~~ 11. + /* + * Do not allow parallel apply worker to be started in the parallel apply + * worker. + */ + if (am_parallel_apply_worker()) + return false; I guess the
Re: [BUG] wrong FK constraint name when colliding name on ATTACH
On Fri, Sep 9, 2022 at 4:54 AM Tom Lane wrote: > Andres Freund writes: > > Something here doesn't look to be quite right. Starting with this commit CI > > [1] started to fail on freebsd (stack trace [2]), and in the meson branch > > I've > > also seen the crash on windows (CI run[3], stack trace [4]). > > The crash seems 100% reproducible if I remove the early-exit optimization > from GetForeignKeyActionTriggers: Indeed, reproduced here. > diff --git a/src/backend/commands/tablecmds.c > b/src/backend/commands/tablecmds.c > index 53b0f3a9c1..112ca77d97 100644 > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -10591,8 +10591,6 @@ GetForeignKeyActionTriggers(Relation trigrel, > Assert(*updateTriggerOid == InvalidOid); > *updateTriggerOid = trgform->oid; > } > - if (OidIsValid(*deleteTriggerOid) && OidIsValid(*updateTriggerOid)) > - break; > } > > if (!OidIsValid(*deleteTriggerOid)) > > With that in place, it's probabilistic whether the Asserts notice anything > wrong, and mostly they don't. But there are multiple matching triggers: > > regression=# select oid, tgconstraint, tgrelid,tgconstrrelid, tgtype, tgname > from pg_trigger where tgconstraint = 104301; > oid | tgconstraint | tgrelid | tgconstrrelid | tgtype |tgname > +--+-+---++--- > 104302 | 104301 | 104294 |104294 | 9 | > RI_ConstraintTrigger_a_104302 > 104303 | 104301 | 104294 |104294 | 17 | > RI_ConstraintTrigger_a_104303 > 104304 | 104301 | 104294 |104294 | 5 | > RI_ConstraintTrigger_c_104304 > 104305 | 104301 | 104294 |104294 | 17 | > RI_ConstraintTrigger_c_104305 > (4 rows) > > I suspect that the filter conditions being applied are inadequate > for the case of a self-referential FK, which this evidently is > given that tgrelid and tgconstrrelid are equal. Yes, the loop in GetForeignKeyActionTriggers() needs this: + /* Only ever look at "action" triggers on the PK side. */ + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_PK) + continue; Likewise, GetForeignKeyCheckTriggers() needs this: + /* Only ever look at "check" triggers on the FK side. */ + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_FK) + continue; We evidently missed this in f4566345cf40b0. > I'd counsel dropping the early-exit optimization; it doesn't > save much I expect, and it evidently hides bugs. Or maybe > make it conditional on !USE_ASSERT_CHECKING. While neither of these functions are called in hot paths, I am inclined to keep the early-exit bit in non-assert builds. Attached a patch. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com action-check-trigger-filter.patch Description: Binary data
Re: proposal: possibility to read dumped table's name from file
On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson wrote: > [v3] Note that the grammar has shift-reduce conflicts. If you run a fairly recent Bison, you can show them like this: bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr] filterparse.y: warning: shift/reduce conflict on token C_INCLUDE [-Wcounterexamples] Example: • C_INCLUDE include_object pattern Shift derivation Filters ↳ 3: Filter ↳ 4: • C_INCLUDE include_object pattern Reduce derivation Filters ↳ 2: Filters Filter ↳ 1: ε • ↳ 4: C_INCLUDE include_object pattern filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE [-Wcounterexamples] Example: • C_EXCLUDE exclude_object pattern Shift derivation Filters ↳ 3: Filter ↳ 5: • C_EXCLUDE exclude_object pattern Reduce derivation Filters ↳ 2: Filters Filter ↳ 1: ε • ↳ 5: C_EXCLUDE exclude_object pattern -- John Naylor EDB: http://www.enterprisedb.com
Possible crash on standby
Hello. While I played with some patch, I met an assertion failure. #2 0x00b350e0 in ExceptionalCondition ( conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()", errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c", lineNumber=4190) at assert.c:69 #3 0x00586f9c in XLogFileRead (segno=61, emode=13, tli=1, source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190 #4 0x005871d2 in XLogFileReadAnyTLI (segno=61, emode=13, source=XLOG_FROM_ANY) at xlogrecovery.c:4296 #5 0x0058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360, randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1, replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727 This is reproducible by the following steps. 1. insert a sleep(1) in WaitForWALToBecomeAvailable(). >* WAL that we restore from archive. >*/ > + sleep(1); > if (WalRcvStreaming()) > XLogShutdownWalRcv(); 2. create a primary with archiving enabled. 3. create a standby recovering from the primary's archive and with an unconnectable primary_conninfo. 4. start the primary. 5. switch wal on the primary. 6. Kaboom. This is because WaitForWALToBecomeAvailable doesn't call XLogShutdownWalRcv() when walreceiver has been stopped before we reach the WalRcvStreaming() call cited above. But we need to set InstallXLogFileSegmentActive to false even in that case, since no one other than the startup process does that. Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might need to correct the dependencies between the flag and walreceiver state, but it is not mandatory because XLogShutdownWalRcv() is designed so that it can be called even after walreceiver is stopped. I don't have a clear memory about why we did that at the time, though, but recovery check runs successfully with this. This code was introduced at PG12. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 12100bb9c041c660eaad675d8e6ed69b49270009 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 9 Sep 2022 17:06:47 +0900 Subject: [PATCH v1] Do not skip calling XLogShutdownWalRcv() XLogShutdownWalRcv() not only stops the wal receiver but also flips down the InstallXLogFileSegmentActive flag. The function is not called when entering archive recovery mode with walreceiver already stopped. Thus archive recovery afterwards faces an assertion failure on the flag in that case. Fix this by always calling the function to keep InstallXLogFileSegmentActive consistent. --- src/backend/access/transam/xlogrecovery.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 9a80084a68..20bd1f1d2e 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -3532,8 +3532,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * walreceiver is not active, so that it won't overwrite * WAL that we restore from archive. */ - if (WalRcvStreaming()) - XLogShutdownWalRcv(); + XLogShutdownWalRcv(); /* * Before we sleep, re-scan for possible new timelines if -- 2.31.1
Re: proposal: possibility to read dumped table's name from file
> On Sep 9, 2022, at 5:53 PM, John Naylor wrote: > > On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson wrote: >> [v3] > > Note that the grammar has shift-reduce conflicts. If you run a fairly > recent Bison, you can show them like this: > > bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y > > filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr] > filterparse.y: warning: shift/reduce conflict on token C_INCLUDE > [-Wcounterexamples] > Example: • C_INCLUDE include_object pattern > Shift derivation >Filters >↳ 3: Filter > ↳ 4: • C_INCLUDE include_object pattern > Reduce derivation >Filters >↳ 2: Filters Filter > ↳ 1: ε • ↳ 4: C_INCLUDE include_object pattern > filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE > [-Wcounterexamples] > Example: • C_EXCLUDE exclude_object pattern > Shift derivation >Filters >↳ 3: Filter > ↳ 5: • C_EXCLUDE exclude_object pattern > Reduce derivation >Filters >↳ 2: Filters Filter > ↳ 1: ε • ↳ 5: C_EXCLUDE exclude_object pattern > Looks like the last rule for Filters should not be there. I do wonder whether we should be using bison/flex here, seems like using a sledgehammer to crack a nut. Cheers Andrew
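If the redundant alternative is indeed the problem, the conflict-free rule would presumably be plain left recursion with an empty base case — a sketch only, since the exact rule bodies here are inferred from the bison counterexample output above rather than from the posted patch:

	Filters:
			/* EMPTY */
			| Filters Filter
			;

With just these two alternatives there is only one way to derive a sequence of Filter items, so the shift/reduce conflicts on C_INCLUDE and C_EXCLUDE should go away.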
RE: Perform streaming logical transactions by background workers and parallel apply
On Friday, September 9, 2022 3:02 PM Peter Smith wrote: > > Here are my review comments for the v28-0001 patch: > > (There may be some overlap with other people's review comments and/or > some fixes already made). > Thanks for the comments. > 3. > > max_logical_replication_workers (integer) > Specifies maximum number of logical replication workers. This > includes apply leader workers, parallel apply workers, and table > synchronization workers. > Logical replication workers are taken from the pool defined by > max_worker_processes. > The default value is 4. This parameter can only be set at server start. > > ~ > > I did not really understand why the default is 4. Because the default > tablesync workers is 2, and the default parallel workers is 2, but > what about accounting for the apply worker? Therefore, shouldn't > max_logical_replication_workers default be 5 instead of 4? The parallel apply is disabled by default, so it's not a must to increase this global default value as discussed[1] [1] https://www.postgresql.org/message-id/CAD21AoCwaU8SqjmC7UkKWNjDg3Uz4FDGurMpis3zw5SEC%2B27jQ%40mail.gmail.com > 6. src/backend/replication/logical/applyparallelworker.c > > Is there a reason why this file is called applyparallelworker.c > instead of parallelapplyworker.c? Now this name is out of step with > names of all the new typedefs etc. It was suggested which is consistent with the "vacuumparallel.c", but I am fine with either name. I can change this if more people think parallelapplyworker.c is better. > 16. src/backend/replication/logical/applyparallelworker.c - > parallel_apply_find_worker > > + /* Return the cached parallel apply worker if valid. */ > + if (stream_apply_worker != NULL) > + return stream_apply_worker; > > Perhaps 'cur_stream_parallel_apply_winfo' is a better name for this var? This looks a bit long to me. > /* Now wait until it attaches. */ > - WaitForReplicationWorkerAttach(worker, generation, bgw_handle); > + return WaitForReplicationWorkerAttach(worker, generation, bgw_handle); > > The comment feels a tiny bit misleading, because there is a chance > that this might not attach at all and return false if something goes > wrong. I feel it might be better to fix this via a separate patch. > Now that there is an apply_action enum I felt it is better for this > code to be using a switch instead of all the if/else. Furthermore, it > might be better to put the switch case in a logical order (e.g. same > as the suggested enums value order of #29a). I'm not sure whether switch case is better than if/else here. But if more people prefer, I can change this. > 23. src/backend/replication/logical/launcher.c - logicalrep_worker_launch > > + nparallelapplyworkers = logicalrep_parallel_apply_worker_count(subid); > + > + /* > + * Return silently if the number of parallel apply workers reached the > + * limit per subscription. > + */ > + if (is_subworker && nparallelapplyworkers >= > max_parallel_apply_workers_per_subscription) > + { > + LWLockRelease(LogicalRepWorkerLock); > + return false; > } > I’m not sure if this is a good idea to be so silent. How will the user > know if they should increase the GUC parameter or not if it never > tells them that the value is too low ? It's like what we do for table sync worker. Besides, I think user is likely to intentionally limit the parallel apply worker number to leave free workers for other purposes. And we do report a WARNING later if there is no free worker slots errmsg("out of logical replication worker slots"). > 41. 
src/backend/replication/logical/worker.c - store_flush_position > > + /* Skip if not the leader apply worker */ > + if (am_parallel_apply_worker()) > + return; > + > > Code might be better to implement/use a new function so it can check > something like !am_leader_apply_worker() Based on the existing code, both leader and table sync worker could enter this function. Using !am_leader_apply_worker() seems will disallow table sync worker to enter this function which might be not good although . > 47. src/backend/storage/ipc/procsignal.c - procsignal_sigusr1_handler > > @@ -657,6 +658,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) > if (CheckProcSignal(PROCSIG_LOG_MEMORY_CONTEXT)) > HandleLogMemoryContextInterrupt(); > > + if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE)) > + HandleParallelApplyMessageInterrupt(); > + > > I wasn’t sure about the placement of this new code because those > CheckProcSignal don’t seem to have any particular order. I think this > belongs adjacent to the PROCSIG_PARALLEL_MESSAGE since it has the most > in common with that one. I'm not very sure, I just followed the way we used to add new SignalReason (e.g. add the new reason at the last but before the Recovery conflict reasons). And the parallel apply is not very similar to parallel query in detail. > I thought it might be worthwhile to also add another function like > am_leader_apply_worker(). I noticed at
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 11:44 AM Tom Lane wrote: > > Amit Kapila writes: > > To avoid these confusions, we have disallowed adding a table if its > > schema is already part of the publication and vice-versa. > > Really? > > Is there logic in ALTER TABLE SET SCHEMA that rejects the command > dependent on the contents of the publication tables? > Yes, it has. For example, postgres=# create schema s1; CREATE SCHEMA postgres=# create table s1.t1(c1 int); CREATE TABLE postgres=# create schema s2; CREATE SCHEMA postgres=# create publication pub1 for all tables in schema s2, table s1.t1; CREATE PUBLICATION postgres=# Alter table s1.t1 set schema s2; ERROR: cannot move table "t1" to schema "s2" DETAIL: The schema "s2" and same schema's table "t1" cannot be part of the same publication "pub1". > If so, are > there locks taken in both ALTER TABLE SET SCHEMA and the > publication-modifying commands that are sufficient to prevent > race conditions in such changes? > Good point. I have checked it and found that ALTER TABLE SET SCHEMA takes AccessExclusiveLock on relation and AccessShareLock on the schema which it is going to set. The alter publication command takes ShareUpdateExclusiveLock on relation for dropping/adding a table to publication which will prevent any race condition with ALTER TABLE SET SCHEMA. However, the alter publication command takes AccessShareLock for dropping/adding schema which won't block with ALTER TABLE SET SCHEMA command. So, I think we need to change the lock mode for it in alter publication command. > This position sounds quite untenable from here, even if I found > your arguments-in-support convincing, which I don't really. > ISTM the rule should be along the lines of "table S.T should > be published either if schema S is published or S.T itself is". > There's no obvious need to interconnect the two conditions. > This rule is currently followed when a subscription has more than one publication. It is just that we didn't allow it in the same publication because of a fear that it may cause confusion for some of the users. The other thing to look at here is that the existing case of a "FOR ALL TABLES" publication also follows a similar rule such that it doesn't allow adding individual tables if the publication is for all tables. For example, postgres=# create publication pub1 for all tables; CREATE PUBLICATION postgres=# alter publication pub1 add table t1; ERROR: publication "pub1" is defined as FOR ALL TABLES DETAIL: Tables cannot be added to or dropped from FOR ALL TABLES publications. So, why shouldn't a "FOR ALL TABLES IN SCHEMA" publication follow a similar behavior? -- With Regards, Amit Kapila.
Re: making relfilenodes 56 bits
On Thu, Sep 8, 2022 at 4:10 PM Dilip Kumar wrote: > On a separate note, while reviewing the latest patch I see there is some risk > of using the unflushed relfilenumber in GetNewRelFileNumber() function. > Basically, in the current code, the flushing logic is tightly coupled with > the logging new relfilenumber logic and that might not work with all the > values of the VAR_RELNUMBER_NEW_XLOG_THRESHOLD. So the idea is we need to > keep the flushing logic separate from the logging, I am working on the idea > and I will post the patch soon. I have fixed the issue, so now we will track nextRelFileNumber, loggedRelFileNumber and flushedRelFileNumber. So whenever nextRelFileNumber is just VAR_RELNUMBER_NEW_XLOG_THRESHOLD behind the loggedRelFileNumber we will log VAR_RELNUMBER_PER_XLOG more relfilenumbers. And whenever nextRelFileNumber reaches the flushedRelFileNumber then we will do XlogFlush for WAL upto the last loggedRelFileNumber. Ideally flushedRelFileNumber should always be VAR_RELNUMBER_PER_XLOG number behind the loggedRelFileNumber so we can avoid tracking the flushedRelFileNumber. But I feel keeping track of the flushedRelFileNumber looks cleaner and easier to understand. For more details refer to the code in GetNewRelFileNumber(). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From d04432467d34ebdc95934ab85409b1fad037c6cd Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Fri, 26 Aug 2022 10:20:18 +0530 Subject: [PATCH v17] Widen relfilenumber from 32 bits to 56 bits MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently relfilenumber is 32 bits wide and that has a risk of wraparound so the relfilenumber can be reused. And to guard against the relfilenumber reuse there is some complicated hack which leaves a 0-length tombstone file around until the next checkpoint. And when we allocate a new relfilenumber we also need to loop to check the on disk conflict. As part of this patch we are making the relfilenumber 56 bits wide and there will be no provision for wraparound. So after this change we will be able to get rid of the 0-length tombstone file and the loop for checking the on-disk conflict of the relfilenumbers. The reason behind making it 56 bits wide instead of directly making 64 bits wide is that if we make it 64 bits wide then the size of the BufferTag will be increased which will increase the memory usage and that may also impact the performance. So in order to avoid that, inside the buffer tag, we will use 8 bits for the fork number and 56 bits for the relfilenumber. 
--- contrib/pg_buffercache/Makefile| 4 +- .../pg_buffercache/pg_buffercache--1.3--1.4.sql| 30 contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 39 +++- contrib/pg_prewarm/autoprewarm.c | 4 +- contrib/pg_walinspect/expected/pg_walinspect.out | 4 +- contrib/pg_walinspect/sql/pg_walinspect.sql| 4 +- contrib/test_decoding/expected/rewrite.out | 2 +- contrib/test_decoding/sql/rewrite.sql | 2 +- doc/src/sgml/catalogs.sgml | 2 +- doc/src/sgml/pgbuffercache.sgml| 2 +- doc/src/sgml/storage.sgml | 5 +- src/backend/access/gin/ginxlog.c | 2 +- src/backend/access/rmgrdesc/gistdesc.c | 2 +- src/backend/access/rmgrdesc/heapdesc.c | 2 +- src/backend/access/rmgrdesc/nbtdesc.c | 2 +- src/backend/access/rmgrdesc/seqdesc.c | 2 +- src/backend/access/rmgrdesc/xlogdesc.c | 21 ++- src/backend/access/transam/README | 5 +- src/backend/access/transam/varsup.c| 199 - src/backend/access/transam/xlog.c | 50 ++ src/backend/access/transam/xlogprefetcher.c| 14 +- src/backend/access/transam/xlogrecovery.c | 6 +- src/backend/access/transam/xlogutils.c | 12 +- src/backend/backup/basebackup.c| 2 +- src/backend/catalog/catalog.c | 95 -- src/backend/catalog/heap.c | 25 +-- src/backend/catalog/index.c| 11 +- src/backend/catalog/storage.c | 8 + src/backend/commands/tablecmds.c | 12 +- src/backend/commands/tablespace.c | 2 +- src/backend/nodes/gen_node_support.pl | 4 +- src/backend/replication/logical/decode.c | 1 + src/backend/replication/logical/reorderbuffer.c| 2 +- src/backend/storage/file/reinit.c | 28 +-- src/backend/storage/freespace/fsmpage.c| 2 +- src/backend/storage/lmgr/lwlocknames.txt | 1 + src/backend/storage/smgr/md.c | 7 + src/backend/storage/smgr/smg
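To restate the allocation scheme described above in pseudocode (an illustrative sketch only — the variable and macro names are taken from the description, the helper calls are not the actual patch code):

	/* running low on pre-logged values? WAL-log a new upper bound */
	if (loggedRelFileNumber - nextRelFileNumber <= VAR_RELNUMBER_NEW_XLOG_THRESHOLD)
	{
		loggedRelFileNumber += VAR_RELNUMBER_PER_XLOG;
		recptr = LogNextRelFileNumber(loggedRelFileNumber);
	}

	/* never hand out a value whose covering WAL record may not be on disk yet */
	if (nextRelFileNumber >= flushedRelFileNumber)
	{
		XLogFlush(recptr);
		flushedRelFileNumber = loggedRelFileNumber;
	}

	return nextRelFileNumber++;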
Re: [BUG] wrong FK constraint name when colliding name on ATTACH
On 2022-Sep-09, Amit Langote wrote: > Yes, the loop in GetForeignKeyActionTriggers() needs this: > > + /* Only ever look at "action" triggers on the PK side. */ > + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_PK) > + continue; > > Likewise, GetForeignKeyCheckTriggers() needs this: > > + /* Only ever look at "check" triggers on the FK side. */ > + if (RI_FKey_trigger_type(trgform->tgfoid) != RI_TRIGGER_FK) > + continue; > > We evidently missed this in f4566345cf40b0. Ouch. Thank you, pushed. > On Fri, Sep 9, 2022 at 4:54 AM Tom Lane wrote: > > I'd counsel dropping the early-exit optimization; it doesn't > > save much I expect, and it evidently hides bugs. Or maybe > > make it conditional on !USE_ASSERT_CHECKING. > > While neither of these functions are called in hot paths, I am > inclined to keep the early-exit bit in non-assert builds. I kept it that way. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Re: SI-read predicate locks on materialized views
On Tue, Jul 26, 2022 at 3:31 PM Richard Guo wrote: > > > On Tue, Jul 26, 2022 at 3:44 PM Yugo NAGATA wrote: >> >> If such two transactions run concurrently, a write skew anomaly occurs, >> and the result of order_summary refreshed in T1 will not contain the >> record inserted in T2. Yes we do have write skew anomaly. I think the patch looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Implementing Incremental View Maintenance
Hello huyajun, I'm sorry for the delay in my response. On Tue, 26 Jul 2022 12:00:26 +0800 huyajun wrote: > I read your patch and think this processing is great, but there is a risk of > deadlock. > Although I have not thought of a suitable processing method for the time > being, > it is also acceptable for truncate scenarios. The deadlock scene is as follows: > > Mv define is: select * from base_a,base_b; > S1: truncate base_a; ― only AccessExclusiveLock base_a and not run into after > trigger > S2: insert into base_b; ― The update has been completed and the incremental > refresh is started in the after trigger, RowExclusive on base_b and > ExclusiveLock on mv > S1: continue truncate mv, wait for AccessExclusiveLock on mv, wait for S2 > S2: continue refresh mv, wait for AccessShareLock on base_a, wait for S1 > So deadlock occurred Hmm, this deadlock scenario is possible, indeed. One idea to resolve it is to acquire RowExclusive locks on all base tables in the BEFORE trigger. If so, S2 cannot make progress because it waits for a RowExclusive lock on base_b, and it cannot acquire ExclusiveLock on mv before S1 finishes. > I also found some new issues that I would like to discuss with you Thank you so much for your massive bug reports! > 1. Concurrent DML causes imv data error, case like below > Setup: > Create table t( a int); > Insert into t select 1 from generate_series(1,3); > create incremental materialized view s as select count(*) from t; > > S1: begin;delete from t where ctid in (select ctid from t limit 1); > S2: begin;delete from t where ctid in (select ctid from t limit 1 offset 1); > S1: commit; > S2: commit; > > After this, the count data of s becomes 2 but the correct data is 1. > I found out that the problem is probably due to our use of ctid update > Consider user behavior unrelated to imv: > > Create table t( a int); > Insert into t select 1; > s1: BEGIN > s1: update t set a = 2 where ctid in (select ctid from t); -- UPDATE 1 > s2: BEGIN > s2: update t set a = 3 where ctid in (select ctid from t); -- wait row lock > s1: COMMIT > s2: -- UPDATE 0 -- ctid changed so can't UPDATE any rows > So we lost the s2 update > > 2. Sometimes it will crash when the columns of the created materialized view > do not match > Create table t( a int); > create incremental materialized view s(z) as select sum(1) as a, sum(1) as b > from t; > > The problem should be that colNames in rewriteQueryForIMMV does not consider > this situation > > 3. Sometimes no error when the columns of the created materialized view do > not match > Create table t( a int); > create incremental materialized view s(y,z) as select count(1) as b from t; > > But the hidden column of IMV is overwritten to z which will cause refresh > to fail. > > The problem should be that in checkRuleResultList we should only skip the imv hidden > columns check > > 4. A unique index should not be created in the case of a Cartesian product > > create table base_a (i int primary key, j varchar); > create table base_b (i int primary key, k varchar); > INSERT INTO base_a VALUES > (1,10), > (2,20), > (3,30), > (4,40), > (5,50); > INSERT INTO base_b VALUES > (1,101), > (2,102), > (3,103), > (4,104); > CREATE incremental MATERIALIZED VIEW s as > select base_a.i,base_a.j from base_a,base_b; ― create error because of unique > index I am working on the above issues (#1-#4) now, and I'll respond to each later. > 5. Besides, I would like to ask you if you have considered implementing an > IMV with delayed refresh? 
> The advantage of delayed refresh is that it will not have much impact on > write performance Yes, I've been thinking to implement deferred maintenance since the beginning of this IVM project. However, we've decided to start from immediate maintenance, and will plan to propose deferred maintenance to the core after the current patch is accepted. (I plan to implement this feature in pg_ivm extension module first, though.) > I probably have some ideas about it now, do you think it works? > 1. After the base table is updated, the delayed IMV's after trigger is used > to record the delta > information in another table similar to the incremental log of the base table > 2. When incremental refresh, use the data in the log instead of the data in > the trasient table > of the after trigger > 3. We need to merge the incremental information in advance to ensure that the > base_table > after transaction filtering UNION ALL old_delta is the state before the base > table is updated > Case like below: > Create table t( a int); > ―begin to record log > Insert into t select 1; ― newlog: 1 oldlog: empty > Delete from t; ―newlog:1, oldlog:1 > ― begin to incremental refresh > Select * from t where xmin < xid or (xmin = xid and cmin < cid); ― empty > So this union all oldlog is not equal to before the base table is updated > We need merge the incremental log in advance to make newlog: empty, oldlog: > empty > > If implemented, incre
Re: Bump MIN_WINNT to 0x0600 (Vista) as minimal runtime in 16~
On Fri, Sep 09, 2022 at 10:55:55AM +0900, Michael Paquier wrote: > Thanks. This set of simplifications is too good to let go, and I have > a window to look after the buildfarm today and tomorrow, which should > be enough to take action if need be. Hence, I have applied the > patch. Now, let's see what the buildfarm tells us ;) Based on what I can see, the Windows animals seem to have digested 47bd0b3 (cygwin, MinGW and MSVC), so I think that we are good. -- Michael signature.asc Description: PGP signature
Re: New docs chapter on Transaction Management and related changes
On Thu, 8 Sept 2022 at 08:42, Alvaro Herrera wrote: > > On 2022-Sep-06, Simon Riggs wrote: > > > On Tue, 6 Sept 2022 at 17:19, Erik Rijkers wrote: > > > > > > Op 06-09-2022 om 17:16 schreef Simon Riggs: > > > > New chapter on transaction management, plus a few related changes. > > > > > > > > Markup and links are not polished yet, so please comment initially on > > > > the topics, descriptions and wording. > > I think the concept of XID epoch should be described also, in or near > the paragraph that talks about wrapping around and switching between > int32 and int64. Right now it's a bit unclear why/how that works. Will do -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Doc fix and adjustment for MERGE command
On 2022-Sep-08, Vik Fearing wrote: > On 9/7/22 22:51, Vitaly Burovoy wrote: > > Hello hackers! > > > > Reading docs for the MERGE statement I've found a little error: a > > semicolon in middle of a statement and absence of a semicolon in the end > > of it. > > > > Key words in subqueries are written in uppercase everywhere in the docs > > but not in an example for MERGE. I think it should be adjusted too. > I agree with both of these patches (especially the semicolon part which is > not subjective). OK, pushed both together. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El sabio habla porque tiene algo que decir; el tonto, porque tiene que decir algo" (Platon).
Re: [PATCH]Feature improvement for MERGE tab completion
On 2022-09-09 11:18, bt22kawamotok wrote: I created a patch for improving MERGE tab completion. Currently there is a problem that when "MERGE INTO dst as d Using src as s ON d.key = s.key WHEN " is typed, "MATCHED" and "NOT MATCHED" are not completed. There is also a problem that typing "MERGE INTO a AS " completes "USING". This patch solves the above problems. Thanks for the patch! else if (TailMatches("USING", MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); else if (TailMatches("USING", MatchAny, "AS", MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); else if (TailMatches("USING", MatchAny, MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); I thought it would be better to describe this section as follows, summarizing the conditions: else if (TailMatches("USING", MatchAny, "ON", MatchAny, "WHEN") || TailMatches("USING", MatchAny, "AS", MatchAny, "ON", MatchAny, "WHEN") || TailMatches("USING", MatchAny, MatchAny, "ON", MatchAny, "WHEN")) COMPLETE_WITH("MATCHED", "NOT MATCHED"); There are similar redundancies in the tab completion of the MERGE statement, so why not fix those as well? -- Regards, -- Shinya Kato Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: MERGE and parsing with prepared statements
On 2022-Aug-12, Simon Riggs wrote: > Sorry, but I disagree with this chunk in the latest commit, > specifically, changing the MATCHED from after to before the NOT > MATCHED clause. > > The whole point of the second example was to demonstrate that the > order of the MATCHED/NOT MATCHED clauses made no difference. > > By changing the examples so they are the same, the sentence at line > 573 now makes no sense. Not sure how to fix this. We could add put the WHEN clauses in the order they were, and add another phrase to that paragraph to explain more explicitly that the order is not relevant (while at the same time explaining that if you have multiple WHEN MATCHED clauses, the relative order of those *is* relevant). -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "At least to kernel hackers, who really are human, despite occasional rumors to the contrary" (LWN.net)
Re: Doc fix and adjustment for MERGE command
On 2022-09-09 11:54Z, Alvaro Herrera wrote: On 2022-Sep-08, Vik Fearing wrote: On 9/7/22 22:51, Vitaly Burovoy wrote: Hello hackers! Reading docs for the MERGE statement I've found a little error: a semicolon in middle of a statement and absence of a semicolon in the end of it. Key words in subqueries are written in uppercase everywhere in the docs but not in an example for MERGE. I think it should be adjusted too. I agree with both of these patches (especially the semicolon part which is not subjective). OK, pushed both together. Thank you! =) -- Best regards, Vitaly Burovoy
Re: confirmed_flush_lsn shows LSN of the data that has not yet been received by the logical subscriber.
On Thu, Sep 8, 2022 at 8:32 PM Ashutosh Sharma wrote: > > On Thu, Sep 8, 2022 at 6:23 PM Ashutosh Bapat > wrote: > > > > On Thu, Sep 8, 2022 at 4:14 PM Ashutosh Sharma > > wrote: > > > > > > Hi All, > > > > > > The logically decoded data are sent to the logical subscriber at the time > > > of transaction commit, assuming that the data is small. However, before > > > the transaction commit is performed, the LSN representing the data that > > > is yet to be received by the logical subscriber appears in the > > > confirmed_flush_lsn column of pg_replication_slots catalog. Isn't the > > > information seen in the confirmed_flush_lsn column while the transaction > > > is in progress incorrect ? esp considering the description given in the > > > pg doc for this column. > > > > > > Actually, while the transaction is running, the publisher keeps on > > > sending keepalive messages containing LSN of the last decoded data saved > > > in reorder buffer and the subscriber responds with the same LSN as the > > > last received LSN which is then updated as confirmed_flush_lsn by the > > > publisher. I think the LSN that we are sending with the keepalive message > > > should be the one representing the transaction begin message, not the LSN > > > of the last decoded data which is yet to be sent. Please let me know if I > > > am missing something here. > > > > The transactions with commit lsn < confirmed_flush_lsn are confirmed > > to be received (and applied by the subscriber. Setting LSN > > corresponding to a WAL record within a transaction in progress as > > confirmed_flush should be ok. Since the transactions are interleaved > > in WAL stream, it's quite possible that LSNs of some WAL records of an > > inflight transaction are lesser than commit LSN of some another > > transaction. So setting commit LSN of another effectively same as > > setting it to any of the LSNs of any previous WAL record irrespective > > of the transaction that it belongs to. > > Thank you Ashutosh for the explanation. I still feel that the > documentation on confirmed_flush_lsn needs some improvement. It > actually claims that all the data before the confirmed_flush_lsn has > been received by the logical subscriber, but that's not the case. It > actually means that all the data belonging to the transactions with > commit lsn < confirmed_flush_lsn has been received and applied by the > subscriber. So setting confirmed_flush_lsn to the lsn of wal records > generated by running transaction might make people think that the wal > records belonging to previous data of the same running transaction has > already been received and applied by the subscriber node, but that's > not true. > Can you please point to the documentation. It's true that it needs to be clarified. But what you are saying may not be entirely true in case of streamed transaction. In that case we might send logically decoded changes of an ongoing transaction as well. They may even get applied but not necessarily committed. It's a bit complicated. :) -- Best Wishes, Ashutosh Bapat
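(For anyone following along, the value being discussed can be watched with an ordinary catalog query while a large transaction is being decoded; nothing here is specific to any patch:

	SELECT slot_name, confirmed_flush_lsn, restart_lsn
	FROM pg_replication_slots
	WHERE slot_type = 'logical';

Comparing confirmed_flush_lsn against pg_current_wal_lsn() while the transaction is still open shows the advancing behaviour described above.)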
Re: Summary function for pg_buffercache
Hi hackers, I also added documentation changes into the patch. You can find it attached. I would appreciate any feedback about this pg_buffercache_summary function. Best, Melih From 82e92d217dd240a9b6c1184cf29d4718343558b8 Mon Sep 17 00:00:00 2001 From: Melih Mutlu Date: Tue, 9 Aug 2022 16:42:23 +0300 Subject: [PATCH] Added pg_buffercache_summary function Adds pg_buffercache_summary() function into pg_buffercache extension for retrieving summary information about overall shared_buffer usage. --- contrib/pg_buffercache/Makefile | 3 +- .../expected/pg_buffercache.out | 9 ++ .../pg_buffercache--1.3--1.4.sql | 13 +++ contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 80 - contrib/pg_buffercache/sql/pg_buffercache.sql | 5 + doc/src/sgml/pgbuffercache.sgml | 110 +- 7 files changed, 215 insertions(+), 7 deletions(-) create mode 100644 contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile index d74b3e853c..d6b58d4da9 100644 --- a/contrib/pg_buffercache/Makefile +++ b/contrib/pg_buffercache/Makefile @@ -7,7 +7,8 @@ OBJS = \ EXTENSION = pg_buffercache DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \ - pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql + pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \ + pg_buffercache--1.3--1.4.sql PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time" REGRESS = pg_buffercache diff --git a/contrib/pg_buffercache/expected/pg_buffercache.out b/contrib/pg_buffercache/expected/pg_buffercache.out index 138556efc9..6994761d0a 100644 --- a/contrib/pg_buffercache/expected/pg_buffercache.out +++ b/contrib/pg_buffercache/expected/pg_buffercache.out @@ -8,3 +8,12 @@ from pg_buffercache; t (1 row) +select used_buffers + unused_buffers > 0, +dirty_buffers < used_buffers, +pinned_buffers < used_buffers +from pg_buffercache_summary(); + ?column? | ?column? | ?column? +--+--+-- + t| t| t +(1 row) + diff --git a/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql b/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql new file mode 100644 index 00..02800dbf18 --- /dev/null +++ b/contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql @@ -0,0 +1,13 @@ +/* contrib/pg_buffercache/pg_buffercache--1.3--1.4.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.4'" to load this file. \quit + +CREATE FUNCTION pg_buffercache_summary() +RETURNS TABLE (used_buffers int4, unused_buffers int4, dirty_buffers int4, +pinned_buffers int4, avg_usagecount real) +AS 'MODULE_PATHNAME', 'pg_buffercache_summary' +LANGUAGE C PARALLEL SAFE; + +-- Don't want these to be available to public. 
+REVOKE ALL ON FUNCTION pg_buffercache_summary() FROM PUBLIC; diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control index 8c060ae9ab..a82ae5f9bb 100644 --- a/contrib/pg_buffercache/pg_buffercache.control +++ b/contrib/pg_buffercache/pg_buffercache.control @@ -1,5 +1,5 @@ # pg_buffercache extension comment = 'examine the shared buffer cache' -default_version = '1.3' +default_version = '1.4' module_pathname = '$libdir/pg_buffercache' relocatable = true diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index c5754ea9fa..89f8a2e834 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -17,6 +17,7 @@ #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8 #define NUM_BUFFERCACHE_PAGES_ELEM 9 +#define NUM_BUFFERCACHE_SUMMARY_ELEM 5 PG_MODULE_MAGIC; @@ -43,7 +44,6 @@ typedef struct int32 pinning_backends; } BufferCachePagesRec; - /* * Function context for data persisting over repeated calls. */ @@ -53,12 +53,12 @@ typedef struct BufferCachePagesRec *record; } BufferCachePagesContext; - /* * Function returning data from the shared buffer cache - buffer number, * relation node/tablespace/database/blocknum and dirty indicator. */ PG_FUNCTION_INFO_V1(pg_buffercache_pages); +PG_FUNCTION_INFO_V1(pg_buffercache_summary); Datum pg_buffercache_pages(PG_FUNCTION_ARGS) @@ -237,3 +237,79 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) else SRF_RETURN_DONE(funcctx); } + +Datum +pg_buffercache_summary(PG_FUNCTION_ARGS) +{ + Datum result; + TupleDesc tupledesc; + HeapTuple tuple; + Datum values[NUM_BUFFERCACHE_SUMMARY_ELEM]; + bool nulls[NUM_BUFFERCACHE_SUMMARY_ELEM]; + + int32 used_buffers = 0; + int32 unused_buffers = 0; + int32 dirty_buffers = 0; + int32 pinned_buffers = 0; + float avg_usagecount = 0; + + /* Construct a tuple descriptor for the result rows. */ + tupledesc = CreateTemplateTupleDesc(NUM_BUFFERCACHE_SUMM
Avoid overhead with fprintf related functions
Based on work in [1]. According to https://cplusplus.com/reference/cstdio/fprintf/, fprintf is meant for generating a string based on a format, which should be something more than a plain "%s". fprintf has overhead from parsing the "format" parameter, plus all the trouble of checking the va_arg parameters, so I think this is one of the low-hanging fruits available and easy to reap. By replacing fprintf with its equivalents, fputs and fputc, we avoid that overhead and increase security [2] and [3]. The downside is a large amount of code churn, which unfortunately will occur. But, IMHO, the advantages are worth it. Note that behavior remains the same, since fputs and fputc do not change the expected behavior of fprintf. A small performance gain is expected, mainly for the client, since there are several occurrences in some critical places, such as (usr/src/fe_utils/print.c). Patch attached. This passes check-world. regards, Ranier Vilela [1] https://www.postgresql.org/message-id/CAApHDvp2THseLvCc%2BTcYFBC7FKHpHTs1JyYmd2JghtOVhb5WGA%40mail.gmail.com [2] https://stackoverflow.com/questions/20837989/fprintf-stack-buffer-overflow [3] https://bufferoverflows.net/format-string-vulnerability-what-when-and-how/ fprintf_fixes.patch Description: Binary data
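To illustrate the kind of mechanical replacement being proposed (an illustrative diff, not lines taken from the attached patch):

	- fprintf(fout, "%s", str);
	+ fputs(str, fout);

	- fprintf(fout, "\n");
	+ fputc('\n', fout);

The output is byte-for-byte the same; only the run-time parsing of the format string is skipped. Note that the argument order differs between fprintf (stream first) and fputs (stream last), which is where most of the mechanical churn comes from.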
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 5:21 AM Amit Kapila wrote: > So, why shouldn't a "FOR ALL TABLES IN SCHEMA" publication follow a > similar behavior? Surely that is not the same case at all. If you're publishing everything, there's no point in also having a specific list of things that you want published, but when you're publishing only some things, there is. If my wife tells me to wash everything in the laundry basket and also my nice pants, and I discover that my nice pants already happen to be in the laundry basket, I do not tell her: ERROR: my nice pants are already in the laundry basket It feels like a mistake to me that there's any catalog representation at all for a table that is published because it is part of a schema. The way a feature like this should work is that the schema should be labelled as published, and we should discover which tables are part of it at any given time as we go. We shouldn't need separate catalog entries for each table in the schema just because the schema is published. But if we do have such catalog entries, surely there should be a difference between the catalog entry that gets created when the table is individually published and the one that gets created when the containing schema is published. We have such tracking in other cases (coninhcount, conislocal; attinhcount, attislocal). In my opinion, this shouldn't have been committed working the way it does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: HOT chain validation in verify_heapam()
On Wed, Sep 7, 2022 at 2:49 AM Robert Haas wrote: > > But here's one random idea: add a successor[] array and an lp_valid[] > array. In the first loop, set lp_valid[offset] = true if it passes the > check_lp() checks, and set successor[A] = B if A redirects to B or has > a CTID link to B, without matching xmin/xmax. Then, in a second loop, > iterate over the successor[] array. If successor[A] = B && lp_valid[A] > && lp_valid[B], then check whether A.xmax = B.xmin; if so, then > complain if predecessor[B] is already set, else set predecessor[B] = > A. Then, in the third loop, iterate over the predecessor array just as > you're doing now. Then it's clear that we do the lp_valid checks > exactly once for every offset that might need them, and in order. And > it's also clear that the predecessor-based checks can never happen > unless the lp_valid checks passed for both of the offsets involved. > > > Approach of introducing a successor array is good but I see one overhead with having both successor and predecessor array, that is, we will traverse each offset on page thrice(one more for original loop on offset) and with each offset we have to retrieve /reach an ItemID(PageGetItemId) and Item(PageGetItem) itself. This is not much overhead as they are all preprocessors but there will be some overhead. How about having new array(initialised with LP_NOT_CHECKED) of enum LPStatus as below typedef enum LPStatus { LP_NOT_CHECKED, LP_VALID, LP_NOT_VALID }LPStatus; and validating and setting with proper status at three places 1) while validating Redirect Tuple 2) while validating populating predecessor array and 3) at original place of "sanity check" something like: " if (lpStatus[rdoffnum] == LP_NOT_CHECKED) { ctx.offnum = rdoffnum; if (!check_lp(&ctx, ItemIdGetLength(rditem), ItemIdGetOffset(rditem))) { lpStatus[rdoffnum] = LP_NOT_VALID; continue; } lpStatus[rdoffnum] = LP_VALID; } else if (lpStatus[rdoffnum] == LP_NOT_VALID) continue; " thoughts? -- Regards, Himanshu Upadhyaya EnterpriseDB: http://www.enterprisedb.com
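For comparison, a rough sketch of the successor[]/lp_valid[] scheme quoted at the top (illustrative pseudocode only, not verify_heapam() code; the helper names are made up):

	/* Pass 1: line-pointer sanity checks, remember raw successor links. */
	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		lp_valid[off] = check_lp(&ctx, ...);
		if (item is redirected or its tuple has a t_ctid link)
			successor[off] = target_offset;
	}

	/* Pass 2: keep only links whose xmax matches the successor's xmin. */
	for (a = FirstOffsetNumber; a <= maxoff; a++)
	{
		b = successor[a];
		if (b == InvalidOffsetNumber || !lp_valid[a] || !lp_valid[b])
			continue;
		if (TransactionIdEquals(xmax_of(a), xmin_of(b)))
		{
			if (predecessor[b] != InvalidOffsetNumber)
				report_corruption(...);		/* two tuples claim the same successor */
			else
				predecessor[b] = a;
		}
	}

	/* Pass 3: the existing predecessor-based HOT chain checks. */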
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
"Drouvot, Bertrand" writes: > Agree that it seems unlikely but maybe we could add a new GUC to turn > the regex usage on the hba file on/off (and use off as the default)? I think that will just add useless complication. regards, tom lane
Re: [PATCH v1] fix potential memory leak in untransformRelOptions
On 2022-Sep-01, Tom Lane wrote: > Junwang Zhao writes: > > result = lappend(result, makeDefElem(pstrdup(s), val, -1)); > > + pfree(s); > > I wonder why it's pstrdup'ing s in the first place. Yeah, I think both the pstrdups in that function are useless. The DefElems can just point to the correct portion of the (already pstrdup'd by TextDatumGetCString) copy of optiondatums[i]. We modify that copy to install \0 in the place where the = is, and that copy is not freed anywhere. diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 609329bb21..0aa4b334ab 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -1357,9 +1357,9 @@ untransformRelOptions(Datum options) if (p) { *p++ = '\0'; - val = (Node *) makeString(pstrdup(p)); + val = (Node *) makeString(p); } - result = lappend(result, makeDefElem(pstrdup(s), val, -1)); + result = lappend(result, makeDefElem(s, val, -1)); } return result; I think these pstrdups were already not necessary when the function was added in 265f904d8f25, because textout() was already known to return a palloc'ed copy of its input; but later 220db7ccd8c8 made this contract even more explicit. Keeping 's' and removing the pstrdups better uses memory, because we have a single palloc'ed chunk per option rather than two. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
RE: why can't a table be part of the same publication as its schema
On Friday, September 9, 2022 9:57 PM Robert Haas wrote: > > On Fri, Sep 9, 2022 at 5:21 AM Amit Kapila wrote: > > So, why shouldn't a "FOR ALL TABLES IN SCHEMA" publication follow a > > similar behavior? Hi > > It feels like a mistake to me that there's any catalog representation at all for a > table that is published because it is part of a schema. > The way a feature like this should work is that the schema should be labelled as > published, and we should discover which tables are part of it at any given time as > we go. We shouldn't need separate catalog entries for each table in the schema > just because the schema is published. IIRC, the feature currently works almost the same as you described. It doesn't create entries for tables that are published via their schema; it only records the published schema and checks which tables are part of it. Sorry if I misunderstand your points or missed something. Best regards, Hou zj
Re: Introduce wait_for_subscription_sync for TAP tests
Amit Kapila writes: > Pushed. Recently a number of buildfarm animals have failed at the same place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: # Failed test '2x3000 rows in t' # at t/100_bugs.pl line 149. # got: '9000' # expected: '6000' # Looks like you failed 1 test of 7. [09:30:56] t/100_bugs.pl .. This was the last commit to touch that test script. I'm thinking maybe it wasn't adjusted quite correctly? On the other hand, since I can't find any similar failures before the last 48 hours, maybe there is some other more-recent commit to blame. Anyway, something is wrong there. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2022-09-09%2012%3A03%3A46 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2022-09-09%2011%3A16%3A36 [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-09-09%2010%3A33%3A19 [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2022-09-08%2010%3A56%3A59
Re: Summary function for pg_buffercache
Hi Melih, > I would appreciate any feedback/comment on this change. Another benefit of pg_buffercache_summary() you didn't mention is that it allocates much less memory than pg_buffercache_pages() does. Here is v3 where I added this to the documentation. The patch didn't apply to the current master branch with the following error: ``` pg_buffercache_pages.c:286:19: error: no member named 'rlocator' in 'struct buftag' if (bufHdr->tag.rlocator.relNumber != InvalidOid) ~~~ ^ 1 error generated. ``` I fixed this too. Additionally, the patch was pgindent'ed and some typos were fixed. However I'm afraid you can't examine BufferDesc's without taking locks. This is explicitly stated in buf_internals.h: """ Buffer header lock (BM_LOCKED flag) must be held to EXAMINE or change TAG, state or wait_backend_pgprocno fields. """ Let's consider this code again (this is after my fix): ``` if (RelFileNumberIsValid(BufTagGetRelNumber(bufHdr))) { /* ... */ } ``` When somebody modifies relNumber concurrently (e.g. calls ClearBufferTag()) this will cause an undefined behaviour. I suggest we focus on saving the memory first and then think about the performance, if necessary. -- Best regards, Aleksander Alekseev v3-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
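A minimal sketch of how that loop could examine the header safely, using the existing helpers from buf_internals.h (this is an assumption about how the patch might be reworked, not code taken from it):

	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);
		uint32		buf_state = LockBufHdr(bufHdr);	/* needed before examining tag/state */

		if (buf_state & BM_VALID)
		{
			used_buffers++;
			if (buf_state & BM_DIRTY)
				dirty_buffers++;
			if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
				pinned_buffers++;
			usagecount_total += BUF_STATE_GET_USAGECOUNT(buf_state);
		}
		else
			unused_buffers++;

		UnlockBufHdr(bufHdr, buf_state);
	}

Whether taking the header spinlock for every buffer is acceptable here, or whether an intentionally unlocked (and documented as approximate) read is preferable for a summary view, is of course part of the question.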
Re: HOT chain validation in verify_heapam()
On Fri, Sep 9, 2022 at 10:00 AM Himanshu Upadhyaya wrote: > Approach of introducing a successor array is good but I see one overhead with > having both successor and predecessor array, that is, we will traverse each > offset on page thrice(one more for original loop on offset) and with each > offset we have to retrieve > /reach an ItemID(PageGetItemId) and Item(PageGetItem) itself. This is not > much overhead as they are all preprocessors but there will be some overhead. > How about having new array(initialised with LP_NOT_CHECKED) of enum LPStatus > as below > > typedef enum LPStatus > { > LP_NOT_CHECKED, > LP_VALID, > LP_NOT_VALID > }LPStatus; > > and validating and setting with proper status at three places > 1) while validating Redirect Tuple > 2) while validating populating predecessor array and > 3) at original place of "sanity check" Well, having to duplicate the logic in three places doesn't seem all that clean to me. Admittedly, I haven't tried implementing my proposal, so maybe that doesn't come out very clean either. I don't know. But I think having the code be clearly correct here is the most important thing, not shaving a few CPU cycles here or there. It's not even clear to me that your way would be cheaper, because an "if" statement is certainly not free, and in fact is probably more expensive than an extra call to PageGetItem() or PageGetItemId(). Branches are usually more expensive than math. But actually I doubt that it matters much either way. I think the way to figure out whether we have a performance problem is to write the code in the stylistically best way and then test it. There may be no problem at all, and if there is a problem, it may not be where we think it will be. In short, let's apply Knuth's optimization principle. -- Robert Haas EDB: http://www.enterprisedb.com
Re: configure --with-uuid=bsd fails on NetBSD
Hi, On 8/26/22 19:21, Tom Lane wrote: Nazir Bilal Yavuz writes: Based on these discussions, I attached a patch. I think the right fix is to call uuid_create and then actually check the version field of the result. This avoids breaking what need not be broken, and it'd also guard against comparable problems on other platforms (so don't blame NetBSD specifically in the message, either). I updated my patch. I checked version field in 'uuid_generate_internal' function instead of checking it in 'uuid_generate_v1' and 'uuid_generate_v1mc' functions, but I have some questions: 1 - Should it be checked only for '--with-uuid=bsd' option? 1.1 - If it is needed to be checked only for '--with-uuid=bsd', should just NetBSD be checked? 2 - Should it error out without including current UUID version in the error message? General error message could mask if the 'uuid_create' function starts to produce UUIDs other than version-4. Regards, Nazir Bilal Yavuz diff --git a/contrib/uuid-ossp/expected/uuid_ossp_1.out b/contrib/uuid-ossp/expected/uuid_ossp_1.out new file mode 100644 index 00..e9db9d9b1b --- /dev/null +++ b/contrib/uuid-ossp/expected/uuid_ossp_1.out @@ -0,0 +1,111 @@ +CREATE EXTENSION "uuid-ossp"; +SELECT uuid_nil(); + uuid_nil +-- + ---- +(1 row) + +SELECT uuid_ns_dns(); + uuid_ns_dns +-- + 6ba7b810-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +SELECT uuid_ns_url(); + uuid_ns_url +-- + 6ba7b811-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +SELECT uuid_ns_oid(); + uuid_ns_oid +-- + 6ba7b812-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +SELECT uuid_ns_x500(); + uuid_ns_x500 +-- + 6ba7b814-9dad-11d1-80b4-00c04fd430c8 +(1 row) + +-- some quick and dirty field extraction functions +-- this is actually timestamp concatenated with clock sequence, per RFC 4122 +CREATE FUNCTION uuid_timestamp_bits(uuid) RETURNS varbit AS +$$ SELECT ('x' || substr($1::text, 15, 4) || substr($1::text, 10, 4) || + substr($1::text, 1, 8) || substr($1::text, 20, 4))::bit(80) + & x'0FFF3FFF' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_version_bits(uuid) RETURNS varbit AS +$$ SELECT ('x' || substr($1::text, 15, 2))::bit(8) & '' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_reserved_bits(uuid) RETURNS varbit AS +$$ SELECT ('x' || substr($1::text, 20, 2))::bit(8) & '1100' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_multicast_bit(uuid) RETURNS bool AS +$$ SELECT (('x' || substr($1::text, 25, 2))::bit(8) & '0001') != '' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_local_admin_bit(uuid) RETURNS bool AS +$$ SELECT (('x' || substr($1::text, 25, 2))::bit(8) & '0010') != '' $$ +LANGUAGE SQL STRICT IMMUTABLE; +CREATE FUNCTION uuid_node(uuid) RETURNS text AS +$$ SELECT substr($1::text, 25) $$ +LANGUAGE SQL STRICT IMMUTABLE; +-- Ideally, the multicast bit would never be set in V1 output, but the +-- UUID library may fall back to MC if it can't get the system MAC address. +-- Also, the local-admin bit might be set (if so, we're probably inside a VM). +-- So we can't test either bit here. +SELECT uuid_version_bits(uuid_generate_v1()), + uuid_reserved_bits(uuid_generate_v1()); +ERROR: uuid_create function is not generating version-1 UUIDs +-- Although RFC 4122 only requires the multicast bit to be set in V1MC style +-- UUIDs, our implementation always sets the local-admin bit as well. 
+SELECT uuid_version_bits(uuid_generate_v1mc()), + uuid_reserved_bits(uuid_generate_v1mc()), + uuid_multicast_bit(uuid_generate_v1mc()), + uuid_local_admin_bit(uuid_generate_v1mc()); +ERROR: uuid_create function is not generating version-1 UUIDs +-- timestamp+clock sequence should be monotonic increasing in v1 +SELECT uuid_timestamp_bits(uuid_generate_v1()) < uuid_timestamp_bits(uuid_generate_v1()); +ERROR: uuid_create function is not generating version-1 UUIDs +SELECT uuid_timestamp_bits(uuid_generate_v1mc()) < uuid_timestamp_bits(uuid_generate_v1mc()); +ERROR: uuid_create function is not generating version-1 UUIDs +-- Ideally, the node value is stable in V1 addresses, but OSSP UUID +-- falls back to V1MC behavior if it can't get the system MAC address. +SELECT CASE WHEN uuid_multicast_bit(uuid_generate_v1()) AND + uuid_local_admin_bit(uuid_generate_v1()) THEN + true -- punt, no test + ELSE + uuid_node(uuid_generate_v1()) = uuid_node(uuid_generate_v1()) + END; +ERROR: uuid_create function is not generating version-1 UUIDs +-- In any case, V1MC node addresses should be random. +SELECT uuid_node(uuid_generate_v1()) <> uuid_node(uuid_generate_v
Remove redundant code in pl_exec.c
Hi, hackers I found there are some redundant code in pl_exec.c, plpgsql_param_eval_generic_ro is same as plpgsql_param_eval_generic except it invokes MakeExpandedObjectReadOnly. IMO, we can invoke plpgsql_param_eval_generic in plpgsql_param_eval_generic_ro to avoid the redundant. Is there something I missed? Any thoughts? -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c index 7bd2a9fff1..543419d3da 100644 --- a/src/pl/plpgsql/src/pl_exec.c +++ b/src/pl/plpgsql/src/pl_exec.c @@ -6673,34 +6673,7 @@ static void plpgsql_param_eval_generic_ro(ExprState *state, ExprEvalStep *op, ExprContext *econtext) { - ParamListInfo params; - PLpgSQL_execstate *estate; - int dno = op->d.cparam.paramid - 1; - PLpgSQL_datum *datum; - Oid datumtype; - int32 datumtypmod; - - /* fetch back the hook data */ - params = econtext->ecxt_param_list_info; - estate = (PLpgSQL_execstate *) params->paramFetchArg; - Assert(dno >= 0 && dno < estate->ndatums); - - /* now we can access the target datum */ - datum = estate->datums[dno]; - - /* fetch datum's value */ - exec_eval_datum(estate, datum, - &datumtype, &datumtypmod, - op->resvalue, op->resnull); - - /* safety check -- needed for, eg, record fields */ - if (unlikely(datumtype != op->d.cparam.paramtype)) - ereport(ERROR, -(errcode(ERRCODE_DATATYPE_MISMATCH), - errmsg("type of parameter %d (%s) does not match that when preparing the plan (%s)", - op->d.cparam.paramid, - format_type_be(datumtype), - format_type_be(op->d.cparam.paramtype; + plpgsql_param_eval_generic(state, op, econtext); /* force the value to read-only */ *op->resvalue = MakeExpandedObjectReadOnly(*op->resvalue,
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 10:29 AM houzj.f...@fujitsu.com wrote: > IIRC, the feature currently works almost the same as you described. It doesn't > create entry for tables that are published via its schema level, it only > record > the published schema and check which tables are part of it. Oh, well if that's the case, that is great news. But then I don't understand Amit's comment from before: > Yes, because otherwise, there was confusion while dropping the objects > from publication. Consider in the above case, if we would have allowed > it and then the user performs ALTER PUBLICATION p1 DROP ALL TABLES IN > SCHEMA s1, then (a) shall we remove both schema s1 and a table that is > separately added (s1.t1) from that schema, or (b) just remove schema > s1? I believe that (b) is the correct behavior, so I assumed that this issue must be some difficulty in implementing it, like a funny catalog representation. Things might be clearer if we'd made the syntax "ALTER PUBLICATION p1 { ADD | DROP } { TABLE | SCHEMA } name". I don't understand why we used this ALL TABLES IN SCHEMA language. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Remove redundant code in pl_exec.c
Japin Li writes: > I found there are some redundant code in pl_exec.c, > plpgsql_param_eval_generic_ro is same as plpgsql_param_eval_generic > except it invokes MakeExpandedObjectReadOnly. Which is exactly why it's NOT redundant. > IMO, we can invoke plpgsql_param_eval_generic in plpgsql_param_eval_generic_ro > to avoid the redundant. I don't like this particularly --- it puts way too much premium on the happenstance that the MakeExpandedObjectReadOnly call is the very last step in the callback function. If that needed to change, we'd have a mess. regards, tom lane
Re: is_superuser is not documented
On Fri, Sep 9, 2022, at 2:28 AM, bt22kawamotok wrote: > is_superuser function checks whether a user is a superuser or not, and > is commonly used. However, is_superuser is not documented and is set to > UNGROUPED in guc.c. I think is_superuser should be added to the > documentation and set to PRESET OPTIONS.What are you thought on this? There is no such function. Are you referring to the GUC? I agree that it should be added to the documentation. The main reason is that it is reported by ParameterStatus along with some other GUCs described in the Preset Options section. postgres=# \dfS is_superuser List of functions Schema | Name | Result data type | Argument data types | Type +--+--+-+-- (0 rows) postgres=# SHOW is_superuser; is_superuser -- on (1 row) Do you mind writing a patch? -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Remove redundant code in pl_exec.c
On Fri, 09 Sep 2022 at 23:34, Tom Lane wrote: > Japin Li writes: >> IMO, we can invoke plpgsql_param_eval_generic in >> plpgsql_param_eval_generic_ro >> to avoid the redundant. > > I don't like this particularly --- it puts way too much premium on > the happenstance that the MakeExpandedObjectReadOnly call is the > very last step in the callback function. If that needed to change, > we'd have a mess. > Sorry, I don't quite follow. Could you explain it a bit more? Thanks in advance! -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: pg_upgrade generated files in subdir follow-up
On Wed, Aug 31, 2022 at 04:41:24PM +0200, Daniel Gustafsson wrote: > > On 31 Aug 2022, at 15:59, Tom Lane wrote: > > > > Daniel Gustafsson writes: > >> Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate > >> subdir, but there are a few left which are written to cwd. The thread > >> resulting in that patch doesn't discuss these files specifically so it > >> seems > >> they are just an oversight. Unless I'm missing something. > > > >> Should something the attached be applied to ensure all generated files are > >> placed in the subdirectory? > > > > It certainly looks inconsistent ATM. I wondered if maybe the plan was to > > put routine output into the log directory but problem-reporting files > > into cwd --- but that isn't what's happening now. The script files are intended to stay where they are, and the error files are intended to move under the subdir, to allow for their easy removal, per Tom's request. > Right, check_proper_datallowconn and check_for_isn_and_int8_passing_mismatch > and a few other check functions already place error reporting in the subdir. It looks like I may have grepped for fprintf or similar, and missed checking output_path. I updated your patch to put the logic inside check_for_data_types_usage(), which is shorter, and seems to simplify doing what's intended in the future. -- Justin >From 36aaf5470d5b0bb0ad2a322880e526b8c5f87066 Mon Sep 17 00:00:00 2001 From: Daniel Gustafsson Date: Wed, 31 Aug 2022 14:09:06 +0200 Subject: [PATCH 1/2] pg_upgrade generated files in subdir follow-up Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate subdir, but there are a few left which are written to cwd. The thread resulting in that patch doesn't discuss these files specifically so it seems they are just an oversight. Unless I'm missing something. Should something the attached be applied to ensure all generated files are placed in the subdirectory?
-- Daniel Gustafsson https://vmware.com/ --- src/bin/pg_upgrade/check.c | 12 +--- src/bin/pg_upgrade/version.c | 12 +--- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c index f4969bcdad7..f1bc1e68868 100644 --- a/src/bin/pg_upgrade/check.c +++ b/src/bin/pg_upgrade/check.c @@ -1218,7 +1218,9 @@ check_for_composite_data_type_usage(ClusterInfo *cluster) prep_status("Checking for system-defined composite types in user tables"); - snprintf(output_path, sizeof(output_path), "tables_using_composite.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_composite.txt"); /* * Look for composite types that were made during initdb *or* belong to @@ -1275,7 +1277,9 @@ check_for_reg_data_type_usage(ClusterInfo *cluster) prep_status("Checking for reg* data types in user tables"); - snprintf(output_path, sizeof(output_path), "tables_using_reg.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_reg.txt"); /* * Note: older servers will not have all of these reg* types, so we have @@ -1328,7 +1332,9 @@ check_for_jsonb_9_4_usage(ClusterInfo *cluster) prep_status("Checking for incompatible \"jsonb\" data type"); - snprintf(output_path, sizeof(output_path), "tables_using_jsonb.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_jsonb.txt"); if (check_for_data_type_usage(cluster, "pg_catalog.jsonb", output_path)) { diff --git a/src/bin/pg_upgrade/version.c b/src/bin/pg_upgrade/version.c index 064d23797c5..dc19fc6ec84 100644 --- a/src/bin/pg_upgrade/version.c +++ b/src/bin/pg_upgrade/version.c @@ -183,7 +183,9 @@ old_9_3_check_for_line_data_type_usage(ClusterInfo *cluster) prep_status("Checking for incompatible \"line\" data type"); - snprintf(output_path, sizeof(output_path), "tables_using_line.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_line.txt"); if (check_for_data_type_usage(cluster, "pg_catalog.line", output_path)) { @@ -221,7 +223,9 @@ old_9_6_check_for_unknown_data_type_usage(ClusterInfo *cluster) prep_status("Checking for invalid \"unknown\" user columns"); - snprintf(output_path, sizeof(output_path), "tables_using_unknown.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_unknown.txt"); if (check_for_data_type_usage(cluster, "pg_catalog.unknown", output_path)) { @@ -364,7 +368,9 @@ old_11_check_for_sql_identifier_data_type_usage(ClusterInfo *cluster) prep_status("Checking for invalid \"sql_identifier\" user columns"); - snprintf(output_path, sizeof(output_path), "tables_using_sql_identifier.txt"); + snprintf(output_path, sizeof(output_path), "%s/%s", + log_opts.basedir, + "tables_using_sql_identifier.txt"); if (check_for_data_type_usage(cluster, "information_schema.sql_ide
Re: Avoid overhead with fprintf related functions
On Fri, Sep 09, 2022 at 10:45:37AM -0300, Ranier Vilela wrote: > Based on work in [1]. > According to https://cplusplus.com/reference/cstdio/fprintf/ > The use of fprintf is related to the need to generate a string based on a > format, which should be different from "%s". > Since fprintf has overhead when parsing the "format" parameter, plus all > the trouble of checking the va_arg parameters. > I think this is one of the low fruits available and easy to reap. > By replacing fprintf with its equivalents, fputs and fputc, > we avoid overhead and increase security [2] and [3]. > > The downside is a huge big churm, which unfortunately will occur. > But, IHMO, I think the advantages are worth it. > Note that behavior remains the same, since fputs and fputc do not change > the expected behavior of fprintf. > > A small performance gain is expected, mainly for the client, since there > are several occurrences in some critical places, such as > (usr/src/fe_utils/print.c). I agree with David [0]. But if you can demonstrate a performance gain, perhaps it's worth considering a subset of these changes in hot paths. [0] https://postgr.es/m/CAApHDvp2THseLvCc%2BTcYFBC7FKHpHTs1JyYmd2JghtOVhb5WGA%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
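For illustration, the kind of substitution being proposed is along these lines (a plain sketch; the stream and string names are made up):

```c
#include <stdio.h>

static void
emit_footer(FILE *fout, const char *footer)
{
    /* Before: the format string is parsed even though it is only "%s". */
    fprintf(fout, "%s", footer);
    fprintf(fout, "\n");

    /* After: identical output, but no format parsing or varargs handling. */
    fputs(footer, fout);
    fputc('\n', fout);
}
```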
Re: Add tracking of backend memory allocated to pg_stat_activity
Greetings, * Drouvot, Bertrand (bdrou...@amazon.com) wrote: > On 9/1/22 3:28 AM, Kyotaro Horiguchi wrote: > >At Wed, 31 Aug 2022 12:05:55 -0500, Justin Pryzby > >wrote in > >>On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > >>>Attached is a patch to > >>>Add tracking of backend memory allocated > > Thanks for the patch. > > + 1 on the idea. Glad folks are in support of the general idea. > >>>+ proargmodes => > >>>'{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', > >>In the past, there was concern about making pg_stat_activity wider by > >>adding information that's less-essential than what's been there for > >>years. This is only an int64, so it's not "wide", but I wonder if > >>there's another way to expose this information? Like adding backends to > >The view looks already too wide to me. I don't want the numbers for > >metrics are added to the view. > > +1 for a dedicated view. A dedicated view with a single column in it hardly seems sensible. I'd also argue that this particular bit of information is extremely useful and therefore worthy of being put directly into pg_stat_activity. I could see a dedicated view possibly *also* being added later if/when we provide a more detailed break-down of how the memory is being used but that's a whole other thing and I'm not even 100% sure we'll ever actually get there, as you can already poke a backend and have it dump out the memory context-level information on an as-needed basis. > While we are at it, what do you think about also recording the max memory > allocated by a backend? (could be useful and would avoid sampling for which > there is no guarantee to sample the max anyway). What would you do with that information..? By itself, it doesn't strike me as useful. Perhaps it'd be interesting to grab the max required for a particular query in pg_stat_statements or such but again, that's a very different thing. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] Fix alter subscription concurrency errors
On 2022-Aug-26, Jelte Fennema wrote: > I indeed don't think this problem is unique to subscriptions, but it seems > better to at least have this problem in a few places less (not making perfect > the enemy of good). > > If someone has a more generic way of solving this for other commands too, > then that sounds great, but if not then slowly chipping away at these cases > seems better than keeping the status quo. > > Attached is a new patch where ALTER SUBSCRIPTION ... OWNER TO ... can > now also be executed concurrently with the other subscription commands. Would it work to use get_object_address() instead? That would save having to write a lookup-and-lock function with a retry loop for each object type. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
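A rough sketch of what that could look like for ALTER SUBSCRIPTION, assuming the existing get_object_address() signature (the surrounding statement handling is elided):

```c
ObjectAddress addr;
Relation    rel;

/*
 * Look up the subscription by name and lock it, so a concurrent
 * ALTER/DROP SUBSCRIPTION cannot change it under us.
 */
addr = get_object_address(OBJECT_SUBSCRIPTION,
                          (Node *) makeString(stmt->subname),
                          &rel,
                          AccessExclusiveLock,
                          false);   /* missing_ok */

/* addr.objectId is the subscription's OID, still valid while locked */
```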
Re: Add tracking of backend memory allocated to pg_stat_activity
Greetings, * Kyotaro Horiguchi (horikyota@gmail.com) wrote: > At Tue, 06 Sep 2022 17:10:49 -0400, Reid Thompson > wrote in > > I'm open to guidance on testing for performance degradation. I did > > note some basic pgbench comparison numbers in the thread regarding > > limiting backend memory allocations. > > Yeah.. That sounds good.. > > (I have a patch that is stuck at benchmarking on slight possible > degradation caused by a branch (or indirect call) on a hot path > similary to this one. The test showed fluctuation that is not clearly > distinguishable between noise and degradation by running the target > functions in a busy loop..) Just to be clear- this path is (hopefully) not *super* hot as we're only tracking actual allocations (that is- malloc() calls), this isn't changing anything for palloc() calls that aren't also needing to do a malloc(), and we already try to reduce the amount of malloc() calls we're doing by allocating more and more each time we run out in a given context. While I'm generally supportive of doing some benchmarking around this, I don't think the bar is as high as it would be if we were actually changing the cost of routine palloc() or such calls. Thanks, Stephen signature.asc Description: PGP signature
Re: ICU for global collation
In pg14: |postgres=# create database a LC_COLLATE C LC_CTYPE C LOCALE C; |ERROR: conflicting or redundant options |DETAIL: LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE. In pg15: |postgres=# create database a LC_COLLATE "en_US.UTF-8" LC_CTYPE "en_US.UTF-8" LOCALE "en_US.UTF-8" ; |CREATE DATABASE f2553d430 actually relaxed the restriction by removing this check: - if (dlocale && (dcollate || dctype)) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), -errmsg("conflicting or redundant options"), -errdetail("LOCALE cannot be specified together with LC_COLLATE or LC_CTYPE."))); But isn't the right fix to do the corresponding thing in createdb (relaxing the frontend restriction rather than reverting its relaxation in the backend). diff --git a/src/bin/scripts/createdb.c b/src/bin/scripts/createdb.c index e523e58b218..5b80e56dfd9 100644 --- a/src/bin/scripts/createdb.c +++ b/src/bin/scripts/createdb.c @@ -159,15 +159,10 @@ main(int argc, char *argv[]) exit(1); } - if (locale) - { - if (lc_ctype) - pg_fatal("only one of --locale and --lc-ctype can be specified"); - if (lc_collate) - pg_fatal("only one of --locale and --lc-collate can be specified"); + if (locale && !lc_ctype) lc_ctype = locale; + if (locale && !lc_collate) lc_collate = locale; - } if (encoding) { BTW it's somewhat crummy that it uses a string comparison, so if you write "UTF8" without a dash, it says this; it took me a few minutes to see the difference... postgres=# create database a LC_COLLATE "en_US.UTF8" LC_CTYPE "en_US.UTF8" LOCALE "en_US.UTF8"; ERROR: new collation (en_US.UTF8) is incompatible with the collation of the template database (en_US.UTF-8)
Re: configure --with-uuid=bsd fails on NetBSD
Nazir Bilal Yavuz writes: > I updated my patch. I checked version field in 'uuid_generate_internal' > function instead of checking it in 'uuid_generate_v1' and > 'uuid_generate_v1mc' functions, but I have some questions: Yeah, that seems like the right place. I tweaked the code to check strbuf not str just so we aren't making unnecessary assumptions about the length of what is returned. strbuf[14] is guaranteed to exist, str[14] maybe not. > 1 - Should it be checked only for '--with-uuid=bsd' option? > 1.1 - If it is needed to be checked only for '--with-uuid=bsd', > should just NetBSD be checked? I don't see any reason not to check in the BSD code path --- it's a cheap enough test. On the other hand, in the other code paths there is no evidence that it's necessary, and we'd find out soon enough if it becomes necessary. > 2 - Should it error out without including current UUID version in the > error message? General error message could mask if the 'uuid_create' > function starts to produce UUIDs other than version-4. Yeah, I thought reporting the actual version digit was worth doing, and made it do so. Pushed with those changes and doc updates. I did not push the variant expected-file. I think the entire point here is that we are *not* deeming the new NetBSD implementation acceptable, so allowing it to pass regression tests is the wrong thing. regards, tom lane
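For reference, the check being described amounts to something like the following in uuid_generate_internal()'s BSD code path (a sketch only; the exact error wording and error code in the pushed commit may differ):

```c
/* strbuf[] holds the 36-character text form of the newly created UUID */
if (strbuf[14] != '1')
    ereport(ERROR,
            (errmsg("uuid_create() did not produce a version-1 UUID (got version %c)",
                    strbuf[14])));
```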
Re: Add the ability to limit the amount of memory that can be allocated to backends.
Greetings, * David Rowley (dgrowle...@gmail.com) wrote: > On Thu, 1 Sept 2022 at 04:52, Reid Thompson > wrote: > > Add the ability to limit the amount of memory that can be allocated to > > backends. > > Are you aware that relcache entries are stored in backend local memory > and that once we've added a relcache entry for a relation that we have > no current code which attempts to reduce the memory consumption used > by cache entries when there's memory pressure? Short answer to this is yes, and that's an issue, but it isn't this patch's problem to deal with- that's an issue that the relcache system needs to be changed to address. > It seems to me that if we had this feature as you propose that a > backend could hit the limit and stay there just from the memory > requirements of the relation cache after some number of tables have > been accessed from the given backend. It's not hard to imagine a > situation where the palloc() would start to fail during parse, which > might make it quite infuriating for anyone trying to do something > like: Agreed that this could happen but I don't imagine it to be super likely- and even if it does, this is probably a better position to be in as the backend could then be disconnected from and would then go away and its memory free'd, unlike the current OOM-killer situation where we crash and go through recovery. We should note this in the documentation though, sure, so that administrators understand how this can occur and can take action to address it. > I think a better solution to this problem would be to have "memory > grants", where we configure some amount of "pool" memory that backends > are allowed to use for queries. The planner would have to add the > expected number of work_mem that the given query is expected to use > and before that query starts, the executor would have to "checkout" > that amount of memory from the pool and return it when finished. If > there is not enough memory in the pool then the query would have to > wait until enough memory is available. This creates a deadlocking > hazard that the deadlock detector would need to be made aware of. Sure, that also sounds great and a query acceptance system would be wonderful. If someone is working on that with an expectation of it landing before v16, great. Otherwise, I don't see it as relevant to the question about if we should include this feature or not, and I'm not even sure that we'd refuse this feature even if we already had an acceptance system as a stop-gap should we guess wrong and not realize it until it's too late. Thanks, Stephen signature.asc Description: PGP signature
Re: is_superuser is not documented
"Euler Taveira" writes: > On Fri, Sep 9, 2022, at 2:28 AM, bt22kawamotok wrote: >> is_superuser function checks whether a user is a superuser or not, and >> is commonly used. However, is_superuser is not documented and is set to >> UNGROUPED in guc.c. I think is_superuser should be added to the >> documentation and set to PRESET OPTIONS.What are you thought on this? > There is no such function. Are you referring to the GUC? I agree that it > should > be added to the documentation. If you look at guc.c, it kind of seems intentional that it's undocumented: /* Not for general use --- used by SET SESSION AUTHORIZATION */ {"is_superuser", PGC_INTERNAL, UNGROUPED, gettext_noop("Shows whether the current user is a superuser."), NULL, GUC_REPORT | GUC_NO_SHOW_ALL | GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE }, &session_auth_is_superuser, false, NULL, NULL, NULL On the other hand, it seems pretty silly that it's GUC_REPORT if we want to consider it private. I've not checked the git history, but I bet that flag was added later with no thought about context. If we are going to document this then we should at least remove the GUC_NO_SHOW_ALL flag and rewrite the comment. I wonder whether the GUC_NO_RESET_ALL flag is needed either --- seems like the PGC_INTERNAL context protects it sufficiently. regards, tom lane
Re: Remove redundant code in pl_exec.c
Japin Li writes: > On Fri, 09 Sep 2022 at 23:34, Tom Lane wrote: >> I don't like this particularly --- it puts way too much premium on >> the happenstance that the MakeExpandedObjectReadOnly call is the >> very last step in the callback function. If that needed to change, >> we'd have a mess. > Sorry, I don't get your mind. Could you explain it more? Thanks in advance! This refactoring cannot support the situation where there is more code to execute after MakeExpandedObjectReadOnly. regards, tom lane
Re: Switching XLog source from archive to streaming when primary available
On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote: > On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi > wrote: >> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart >> wrote in >> > My general point is that we should probably offer some basic preventative >> > measure against flipping back and forth between streaming and archive >> > recovery while making zero progress. As I noted, maybe that's as simple as >> > having WaitForWALToBecomeAvailable() attempt to restore a file from archive >> > at least once before the new parameter forces us to switch to streaming >> > replication. There might be other ways to handle this. >> >> +1. > > Hm. In that case, I think we can get rid of timeout based switching > mechanism and have this behaviour - the standby can attempt to switch > to streaming mode from archive, say, after fetching 1, 2 or a > configurable number of WAL files. In fact, this is the original idea > proposed by Satya in this thread. IMO the timeout approach would be more intuitive for users. When it comes to archive recovery, "WAL segment" isn't a standard unit of measure. WAL segment size can differ between clusters, and WAL files can have different amounts of data or take different amounts of time to replay. So I think it would be difficult for the end user to decide on a value. However, even the timeout approach has this sort of problem. If your parameter is set to 1 minute, but the current archive takes 5 minutes to recover, you won't really be testing streaming replication once a minute. That would likely need to be documented. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add tracking of backend memory allocated to pg_stat_activity
On Fri, Sep 09, 2022 at 12:34:15PM -0400, Stephen Frost wrote: > > While we are at it, what do you think about also recording the max memory > > allocated by a backend? (could be useful and would avoid sampling for which > > there is no guarantee to sample the max anyway). FYI, that's already kind-of available from getrusage: $ psql ts -c "SET log_executor_stats=on; SET client_min_messages=debug; SELECT a, COUNT(1) FROM generate_series(1,99) a GROUP BY 1;" |wc LOG: EXECUTOR STATISTICS ... ! 194568 kB max resident size Note that max rss counts things allocated outside postgres (like linked libraries). > What would you do with that information..? By itself, it doesn't strike > me as useful. Perhaps it'd be interesting to grab the max required for > a particular query in pg_stat_statements or such but again, that's a > very different thing. log_executor_stats is at level "debug", so it's not great to enable it for a single session, and painful to think about enabling it globally. This would be a lot friendlier. Storing the maxrss per backend somewhere would be useful (and avoid the issue of "sampling" with top), and I agree that it ought to be exposed in a view. For example, it might help to determine whether (and which!) backends are using a large multiple of work_mem, and then whether that can be increased. If/when we had a "memory budget allocator", this would help to determine how to set its GUCs, maybe to see "which backends are using the work_mem that are precluding this other backend from using an efficient query plan". I wonder if it's better to combine these two threads into one. The 0001 patch of course can be considered independently from the 0002 patch, as usual. Right now, there are different parties on both threads ... -- Justin
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On Sat, Sep 03, 2022 at 11:40:03PM -0400, Reid Thompson wrote: > > > + 0, 0, INT_MAX, > > > + NULL, NULL, NULL > > I think this needs a maximum like INT_MAX/1024/1024 > > Is this noting that we'd set a ceiling of 2048MB? The reason is that you're later multiplying it by 1024*1024, so you need to limit it to avoid overflowing. Compare with min_dynamic_shared_memory, Log_RotationSize, maintenance_work_mem, autovacuum_work_mem. typo: Explicitely +errmsg("request will exceed postgresql.conf defined max_total_backend_memory limit (%lu > %lu)", I wouldn't mention postgresql.conf - it could be in postgresql.auto.conf, or an include file, or a -c parameter. Suggest: allocation would exceed max_total_backend_memory limit... + ereport(LOG, errmsg("decrease reduces reported backend memory allocated below zero; setting reported to 0")); Suggest: deallocation would decrease backend memory below zero; + {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM, Should this be PGC_SU_BACKEND to allow a superuser to set a higher limit (or no limit)? There's a compilation warning under mingw cross compile due to sizeof(long). See d914eb347 and other recent commits which I guess is the current way to handle this. http://cfbot.cputube.org/reid-thompson.html For performance testing, you'd want to check what happens with a large number of max_connections (and maybe a large number of clients). TPS isn't the only thing that matters. For example, a utility command might sometimes do a lot of allocations (or deallocations), or a "parameterized nested loop" may loop over many outer tuples and reset for each. There are also a lot of places that reset to a "per-tuple" context. I started looking at its performance, but nothing to show yet. Would you keep people copied on your replies ("reply all")? Otherwise I (at least) may miss them. I think that's what's typical on these lists (and the list tool is smart enough not to send duplicates to people who are direct recipients). -- Justin
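To make the overflow point concrete, the idea is to bound the GUC so that the later megabyte-to-byte conversion cannot wrap (a sketch only; the backing variable name here is made up, and the GUC name follows the patch under review rather than anything final):

```c
{
    {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM,
        gettext_noop("Limits the total memory, in MB, available to backends."),
        NULL,
        GUC_UNIT_MB
    },
    &max_total_bkend_mem,
    0,
    0,
    INT_MAX / 1024 / 1024,  /* so (value * 1024 * 1024) cannot overflow an int */
    NULL, NULL, NULL
},
```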
Re: is_superuser is not documented
I wrote: > On the other hand, it seems pretty silly that it's GUC_REPORT if > we want to consider it private. I've not checked the git history, > but I bet that flag was added later with no thought about context. > > If we are going to document this then we should at least remove > the GUC_NO_SHOW_ALL flag and rewrite the comment. I wonder whether > the GUC_NO_RESET_ALL flag is needed either --- seems like the > PGC_INTERNAL context protects it sufficiently. BTW, "session_authorization" has a subset of these same issues: /* Not for general use --- used by SET SESSION AUTHORIZATION */ {"session_authorization", PGC_USERSET, UNGROUPED, gettext_noop("Sets the session user name."), NULL, GUC_IS_NAME | GUC_REPORT | GUC_NO_SHOW_ALL | GUC_NO_RESET_ALL | GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE | GUC_NOT_WHILE_SEC_REST }, &session_authorization_string, NULL, check_session_authorization, assign_session_authorization, NULL I wonder why this one is marked USERSET where the other is not. You'd think both of them need similar special-casing about how to handle SET. regards, tom lane
Re: Possible crash on standby
On Fri, Sep 09, 2022 at 05:29:49PM +0900, Kyotaro Horiguchi wrote: > This is because WaitForWALToBecomeAvailable doesn't call > XLogSHutdownWalRcv() when walreceiver has been stopped before we reach > the WalRcvStreaming() call cited above. But we need to set > InstasllXLogFileSegmentActive to false even in that case, since no one > other than startup process does that. Nice find. > Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might > need to correct the dependencies between the flag and walreceiver > state, but it not mandatory because XLogShutdownWalRcv() is designed > so that it can be called even after walreceiver is stopped. I don't > have a clear memory about why we do that at the time, though, but > recovery check runs successfully with this. I suppose the alternative would be to set InstallXLogFileSegmentActive to false in an 'else' block, but that doesn't seem necessary if XLogShutdownWalRcv() is safe to call unconditionally. So, unless there is a bigger problem that I'm not seeing, +1 for your patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
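In code terms, the proposed change in WaitForWALToBecomeAvailable() is roughly this (a sketch, not the patch itself):

```c
/*
 * Before: InstallXLogFileSegmentActive could stay set if the walreceiver
 * had already exited, because the shutdown call was skipped.
 */
if (WalRcvStreaming())
    XLogShutdownWalRcv();

/*
 * After: call it unconditionally.  XLogShutdownWalRcv() is designed to be
 * safe to call when the walreceiver is already stopped, and it always
 * clears InstallXLogFileSegmentActive, which only the startup process
 * resets.
 */
XLogShutdownWalRcv();
```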
Re: Error "initial slot snapshot too large" in create replication slot
On Mon, Jul 18, 2022 at 10:55 PM Kyotaro Horiguchi wrote: > Thanks! Just rebased. Hi, I mentioned this patch to Andres in conversation, and he expressed a concern that there might be no guarantee that we retain enough CLOG to look up XIDs. Presumably this wouldn't be an issue when the snapshot doesn't get marked suboverflowed, but what if it does? Adding Andres in the hopes that he may comment further. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Possible crash on standby
On Fri, Sep 9, 2022 at 2:00 PM Kyotaro Horiguchi wrote: > > Hello. > > While I played with some patch, I met an assertion failure. > > #2 0x00b350e0 in ExceptionalCondition ( > conditionName=0xbd8970 "!IsInstallXLogFileSegmentActive()", > errorType=0xbd6e11 "FailedAssertion", fileName=0xbd6f28 "xlogrecovery.c", > lineNumber=4190) at assert.c:69 > #3 0x00586f9c in XLogFileRead (segno=61, emode=13, tli=1, > source=XLOG_FROM_ARCHIVE, notfoundOk=true) at xlogrecovery.c:4190 > #4 0x005871d2 in XLogFileReadAnyTLI (segno=61, emode=13, > source=XLOG_FROM_ANY) at xlogrecovery.c:4296 > #5 0x0058656f in WaitForWALToBecomeAvailable (RecPtr=1023410360, > randAccess=false, fetching_ckpt=false, tliRecPtr=1023410336, replayTLI=1, > replayLSN=1023410336, nonblocking=false) at xlogrecovery.c:3727 > > This is replayable by the following steps. > > 1. insert a sleep(1) in WaitForWALToBecomeAvailable(). > >* WAL that we restore from archive. > >*/ > > + sleep(1); > > if (WalRcvStreaming()) > > XLogShutdownWalRcv(); > > 2. create a primary with archiving enabled. > > 3. create a standby with recovering from the primary's archive and > unconnectable primary_conninfo. > > 4. start the primary. > > 5. switch wal on the primary. > > 6. Kaboom. > > This is because WaitForWALToBecomeAvailable doesn't call > XLogSHutdownWalRcv() when walreceiver has been stopped before we reach > the WalRcvStreaming() call cited above. But we need to set > InstasllXLogFileSegmentActive to false even in that case, since no one > other than startup process does that. > > Unconditionally calling XLogShutdownWalRcv() fixes it. I feel we might > need to correct the dependencies between the flag and walreceiver > state, but it not mandatory because XLogShutdownWalRcv() is designed > so that it can be called even after walreceiver is stopped. I don't > have a clear memory about why we do that at the time, though, but > recovery check runs successfully with this. > > This code was introduced at PG12. I think it is a duplicate of [1]. I have tested the above use-case with the patch at [1] and it fixes the issue. [1] https://www.postgresql.org/message-id/CALj2ACXPn_xePphnh88qmoQqqW%2BE2KEOdxGL%2BD-o9o7_XNGkkw%40mail.gmail.com -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Summary function for pg_buffercache
On Fri, Sep 09, 2022 at 05:36:45PM +0300, Aleksander Alekseev wrote: > However I'm afraid you can't examine BufferDesc's without taking > locks. This is explicitly stated in buf_internals.h: Yeah, when I glanced at this patch earlier, I wondered about this. > I suggest we focus on saving the memory first and then think about the > performance, if necessary. +1 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Possible crash on standby
On Fri, Sep 09, 2022 at 10:51:10PM +0530, Bharath Rupireddy wrote: > I think it is a duplicate of [1]. I have tested the above use-case > with the patch at [1] and it fixes the issue. I added this thread to the existing commitfest entry. Thanks for pointing this out. https://commitfest.postgresql.org/39/3814 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Switching XLog source from archive to streaming when primary available
On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart wrote: > > On Fri, Sep 09, 2022 at 12:14:25PM +0530, Bharath Rupireddy wrote: > > On Fri, Sep 9, 2022 at 10:57 AM Kyotaro Horiguchi > > wrote: > >> At Thu, 8 Sep 2022 10:53:56 -0700, Nathan Bossart > >> wrote in > >> > My general point is that we should probably offer some basic preventative > >> > measure against flipping back and forth between streaming and archive > >> > recovery while making zero progress. As I noted, maybe that's as simple > >> > as > >> > having WaitForWALToBecomeAvailable() attempt to restore a file from > >> > archive > >> > at least once before the new parameter forces us to switch to streaming > >> > replication. There might be other ways to handle this. > >> > >> +1. > > > > Hm. In that case, I think we can get rid of timeout based switching > > mechanism and have this behaviour - the standby can attempt to switch > > to streaming mode from archive, say, after fetching 1, 2 or a > > configurable number of WAL files. In fact, this is the original idea > > proposed by Satya in this thread. > > IMO the timeout approach would be more intuitive for users. When it comes > to archive recovery, "WAL segment" isn't a standard unit of measure. WAL > segment size can differ between clusters, and WAL files can have different > amounts of data or take different amounts of time to replay. How about the amount of WAL bytes fetched from the archive after which a standby attempts to connect to primary or enter streaming mode? Of late, we've changed some GUCs to represent bytes instead of WAL files/segments, see [1]. > So I think it > would be difficult for the end user to decide on a value. However, even > the timeout approach has this sort of problem. If your parameter is set to > 1 minute, but the current archive takes 5 minutes to recover, you won't > really be testing streaming replication once a minute. That would likely > need to be documented. If we have configurable WAL bytes instead of timeout for standby WAL source switch from archive to primary, we don't have the above problem right? [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3fe108c025e4a080315562d4c15ecbe3f00405e -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: START_REPLICATION SLOT causing a crash in an assert build
On Wed, Sep 07, 2022 at 12:39:08PM -0700, Andres Freund wrote: > Hi, > > On 2022-09-06 18:40:49 -0500, Jaime Casanova wrote: > > I'm not sure what is causing this, but I have seen this twice. The > > second time without activity after changing the set of tables in a > > PUBLICATION. > > Can you describe the steps to reproduce? > I'm still trying to determine that > Which git commit does this happen on? > 6e55ea79faa56db85a2b6c5bf94cee8acf8bfdb8 (Stamp 15beta4) > > > gdb says that debug_query_string contains: > > > > """ > > START_REPLICATION SLOT "sub_pgbench" LOGICAL 0/0 (proto_version '3', > > publication_names '"pub_pgbench"')START_REPLICATION SLOT "sub_pgbench" > > LOGICAL 0/0 (proto_version '3', publication_names '"pub_pgbench"') > > """ > > > > attached the backtrace. > > > > > #2 0x5559bfd4f0ed in ExceptionalCondition ( > > conditionName=0x5559bff30e20 "namestrcmp(&statent->slotname, > > NameStr(slot->data.name)) == 0", errorType=0x5559bff30e0d > > "FailedAssertion", fileName=0x5559bff30dbb "pgstat_replslot.c", > > lineNumber=89) at assert.c:69 > > what are statent->slotname and slot->data.name? > slot->data.name seems to be the replication_slot record, and statent->slotname comes from the in shared memory stats for that slot. And the assert happens when &statent->slotname.data comes empty, which is not frequent but it happens from time to time btw, while I'm looking at this I found that we can drop a publication while there are active subscriptions pointing to it, is that something we should allow? anyway, that is not the cause of this because the replication slot actually exists. -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: small windows psqlrc re-wording
I wrote: > On testing that in HEAD, I read > Both the system-wide startup file and the user's personal startup file > can be made psql-version-specific by appending a dash and the > PostgreSQL major or minor release number to the file name, for example > ~/.psqlrc-16 or ~/.psqlrc-16devel. > That's a little confusing but it's actually accurate, because what > process_psqlrc_file appends is the string PG_VERSION, so in a devel > branch or beta release there's a non-numeric "minor release". > I'm inclined to go ahead and do it like that. I decided that what I found jarring about that was the use of "release number" with a non-numeric version, so I changed it to "release identifier" and pushed. regards, tom lane
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Sat, Aug 06, 2022 at 04:09:28PM -0700, Peter Geoghegan wrote: > What are the chances that both thresholds will be crossed at *exactly* > (not approximately) the same time in a real world case, where the > table isn't tiny (tiny relative to the "autovacuum_naptime quantum")? > This is a very narrow case. The threshold wouldn't need to be crossed within autovacuum_naptime, since all the workers might be busy. Consider autovacuum delay, analyze on long/wide tables, multiple extended stats objects, or autovacuum parameters which are changed off-hours by a cronjob. > It might make sense to *always* show how close we were to hitting each > of the thresholds, including the ones that we didn't end up hitting > (we may come pretty close quite often, which seems like it might > matter). But showing multiple conditions together just because the > planets aligned (we hit multiple thresholds together) emphasizes the > low-level mechanism, which is pretty far removed from anything that > matters. You might as well pick either threshold at random once this > happens -- even an expert won't be able to tell the difference. I don't have strong feelings about it; I'm just pointing out that the two of the conditions aren't actually exclusive. It seems like it could be less confusing to show both. Consider someone who is trying to *reduce* how often autovacuum runs, or give priority to some tables by raising the thresholds for other tables. -- Justin
Re: why can't a table be part of the same publication as its schema
> On Sep 9, 2022, at 8:18 AM, Robert Haas wrote: > > Things might be clearer if we'd made the syntax "ALTER PUBLICATION p1 > { ADD | DROP } { TABLE | SCHEMA } name". I don't understand why we > used this ALL TABLES IN SCHEMA language. The conversation, as I recall, was that "ADD SCHEMA foo" would only mean all tables in foo, until publication of other object types became supported, at which point "ADD SCHEMA foo" would suddenly mean more than it did before. People might find that surprising, so the "ALL TABLES IN" was intended to future-proof against surprising behavioral changes. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Making autovacuum logs indicate if insert-based threshold was the triggering condition
On Fri, Sep 9, 2022 at 11:11 AM Justin Pryzby wrote: > > It might make sense to *always* show how close we were to hitting each > > of the thresholds, including the ones that we didn't end up hitting > > (we may come pretty close quite often, which seems like it might > > matter). But showing multiple conditions together just because the > > planets aligned (we hit multiple thresholds together) emphasizes the > > low-level mechanism, which is pretty far removed from anything that > > matters. You might as well pick either threshold at random once this > > happens -- even an expert won't be able to tell the difference. > > I don't have strong feelings about it; I'm just pointing out that the > > two of the conditions aren't actually exclusive. Fair enough. I'm just pointing out that the cutoffs are continuous for all practical purposes, even if there are cases where they seem kind of discrete, due only to implementation details (e.g. autovacuum_naptime stuff). Displaying only one reason for triggering an autovacuum is consistent with the idea that the cutoffs are continuous. It's not literally true that they're continuous, but it might as well be. I think of it as similar to how it's not literally true that a coin toss is always either heads or tails, though it might as well be true. Sure, even a standard fair coin toss might result in the coin landing on its side. That'll probably never happen even once, but if it does: just flip the coin again! The physical coin toss was never the important part. > It seems like it could be less confusing to show both. Consider someone who > is > trying to *reduce* how often autovacuum runs, or give priority to some tables > by raising the thresholds for other tables. My objection to that sort of approach is that it suggests a difference in what each VACUUM actually does -- as if autovacuum.c actually decided on a particular runtime behavior for the VACUUM up front, based on its own considerations that come from statistics about the workload. I definitely want to avoid creating that false impression in the minds of users. -- Peter Geoghegan
Re: preserve timestamps when installing headers
On 11/01/2022 00:03, Tom Lane wrote: Peter Eisentraut writes: I don't think preserving timestamps should be the default behavior, but I would support organizing things so that additional options can be passed to "install" to make it do whatever the user prefers. But that won't work if some installations don't go through install. +1. We just bumped into this with Neon, where we have a build script that generates Rust bindings from the PostgreSQL header files. The build script runs "make install", and because that changes the mtime even if there were no changes to the headers, the bindings are also regenerated every time. So I fear we're optimizing for a case that stopped being mainstream a decade or more back. I could get behind switching the code back to using $(INSTALL) for this, and then offering some way to inject user-selected switches into the $(INSTALL) invocations. That wouldn't need much more than another gmake macro. (Does there need to be a way to inject such switches only into header installations, or is it OK to do it across the board?) Here's a patch to switch back to $(INSTALL). With that, you can do: ./configure INSTALL="/usr/bin/install -C" [ wanders away wondering how this'd affect the meson conversion project ] If anything, I guess this will help, by making the Makefile a bit less special. - HeikkiFrom c0305f950e45ed905cc8145879d0b48390c387fa Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 9 Sep 2022 21:25:26 +0300 Subject: [PATCH 1/1] Use normal install program to install server headers. Commit a7032690f9 replaced $(INSTALL) with plain "cp" for installing the server header files. It sped up "make install" significantly, because the old logic called $(INSTALL) separately for every header file, whereas plain "cp" could copy all the files in one command. However, we have long since made it a requirement that $(INSTALL) can also install multiple files in one command, see commit f1c5247563. Switch back to $(INSTALL). 
Discussion: https://www.postgresql.org/message-id/200503252305.j2PN52m23610%40candle.pha.pa.us Discussion: https://www.postgresql.org/message-id/2415283.1641852217%40sss.pgh.pa.us --- src/include/Makefile | 13 +++-- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/src/include/Makefile b/src/include/Makefile index 0b4cab9bb1..1e50400617 100644 --- a/src/include/Makefile +++ b/src/include/Makefile @@ -48,22 +48,15 @@ install: all installdirs $(INSTALL_DATA) utils/errcodes.h '$(DESTDIR)$(includedir_server)/utils' $(INSTALL_DATA) utils/fmgroids.h '$(DESTDIR)$(includedir_server)/utils' $(INSTALL_DATA) utils/fmgrprotos.h '$(DESTDIR)$(includedir_server)/utils' -# We don't use INSTALL_DATA for performance reasons --- there are a lot of files -# (in fact, we have to take some pains to avoid overlength shell commands here) - cp $(srcdir)/*.h '$(DESTDIR)$(includedir_server)'/ + $(INSTALL_DATA) $(srcdir)/*.h '$(DESTDIR)$(includedir_server)' for dir in $(SUBDIRS); do \ - cp $(srcdir)/$$dir/*.h '$(DESTDIR)$(includedir_server)'/$$dir/ || exit; \ + $(INSTALL_DATA) $(srcdir)/$$dir/*.h '$(DESTDIR)$(includedir_server)'/$$dir || exit; \ done ifeq ($(vpath_build),yes) for file in catalog/schemapg.h catalog/system_fk_info.h catalog/pg_*_d.h parser/gram.h storage/lwlocknames.h utils/probes.h; do \ - cp $$file '$(DESTDIR)$(includedir_server)'/$$file || exit; \ + $(INSTALL_DATA) $$file '$(DESTDIR)$(includedir_server)'/$$file || exit; \ done endif - cd '$(DESTDIR)$(includedir_server)' && chmod $(INSTALL_DATA_MODE) *.h - for dir in $(SUBDIRS); do \ - cd '$(DESTDIR)$(includedir_server)'/$$dir || exit; \ - chmod $(INSTALL_DATA_MODE) *.h || exit; \ - done installdirs: $(MKDIR_P) '$(DESTDIR)$(includedir)/libpq' '$(DESTDIR)$(includedir_internal)/libpq' -- 2.30.2
Re: [BUG] Storage declaration in ECPG
Andrey Sokolov writes: > [ v2-0001-Fix-storage-declaration-in-ECPG.patch ] Pushed. I didn't think a whole new test case was appropriate, either from the patch-footprint or test-runtime standpoint, so I just added a couple of declarations to preproc/variable.pgc. regards, tom lane
Re: Expand palloc/pg_malloc API
Peter Eisentraut writes: > I have updated this patch set to rename the _obj() functions to > _object(), and I have dropped the _ptrtype() variants. > I have also split the patch to put the new API and the example uses into > separate patches. This patch set seems fine to me, so I've marked it Ready for Committer. I think serious consideration should be given to back-patching the 0001 part (that is, addition of the macros). Otherwise we'll have to remember not to use these macros in code intended for back-patch, and that'll be mighty annoying once we are used to them. regards, tom lane
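For anyone catching up on the thread, the kind of usage the new macros enable looks like this (a sketch; MyStruct is a made-up type, and the macro names are those used in the patch set under review):

```c
/* old style: explicit sizeof and cast */
MyStruct   *one = (MyStruct *) palloc(sizeof(MyStruct));
MyStruct   *many = (MyStruct *) palloc0(n * sizeof(MyStruct));

/* new style: the size and result type are inferred from the type name */
MyStruct   *one2 = palloc_object(MyStruct);
MyStruct   *many2 = palloc0_array(MyStruct, n);

many2 = repalloc_array(many2, MyStruct, 2 * n);
```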
Re: Expand palloc/pg_malloc API
Robert Haas writes: > On Fri, Aug 12, 2022 at 3:31 AM Peter Eisentraut > wrote: >> (Personally, I have always been a bit suspicious about using the name >> palloc() without memory context semantics in frontend code, but I guess >> this is wide-spread now.) > I think it would be a good thing to add memory context support to the > frontend. We could just put everything in a single context for > starters, and then frontend utilities that wanted to create other > contexts could do so. Perhaps, but I think we should have at least one immediate use-case for multiple contexts in frontend. Otherwise it's just useless extra code. The whole point of memory contexts in the backend is that we have well-defined lifespans for certain types of allocations (executor state, function results, etc); but it's not very clear to me that the same concept will be helpful in any of our frontend programs. > There are difficulties, though. For instance, memory contexts are > nodes, and have a NodeTag. And I'm pretty sure we don't want frontend > code to know about all the backend node types. My suspicion is that > memory context types really shouldn't be node types, but right now, > they are. Whether that's the correct view or not, this kind of problem > means it's not a simple lift-and-shift to move the memory context code > into src/common. Someone would need to spend some time thinking about > how to engineer it. I don't really think that's much of an issue. We could replace the nodetag fields with some sort of magic number and have just as much wrong-pointer safety as in the backend. What I do take issue with is moving the code into src/common. I think we'd be better off just writing a distinct implementation for frontend. For one thing, it's not apparent to me that aset.c is a good allocator for frontend (and the other two surely are not). This is all pretty off-topic for Peter's patch, though. regards, tom lane
Re: Summary function for pg_buffercache
Hi hackers, > > I suggest we focus on saving the memory first and then think about the > > performance, if necessary. > > +1 I made a mistake in v3 that cfbot complained about. It should have been: ``` if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag))) ``` Here is the corrected patch. -- Best regards, Aleksander Alekseev v4-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: configure --with-uuid=bsd fails on NetBSD
Hi, On 2022-09-09 12:48:38 -0400, Tom Lane wrote: > Pushed with those changes and doc updates. I did not push the > variant expected-file. I think the entire point here is that > we are *not* deeming the new NetBSD implementation acceptable, > so allowing it to pass regression tests is the wrong thing. What do we gain from the regression test failing exactly this way, given that we know it's a problem? It just makes it harder to run tests. How about we add it as a variant file, but via the resultmap mechanism? That way we wouldn't silently accept the same bug on other platforms, but can still run the test without needing to manually filter out bogus netbsd results? Greetings, Andres Freund
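For reference, such a resultmap entry would look roughly like uuid_ossp:out:.*-.*-netbsd.*=uuid_ossp_1.out (assuming the variant expected-file were kept as uuid_ossp_1.out; the platform pattern is illustrative), so only NetBSD builds would compare against the variant output.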
Re: configure --with-uuid=bsd fails on NetBSD
Andres Freund writes: > On 2022-09-09 12:48:38 -0400, Tom Lane wrote: >> Pushed with those changes and doc updates. I did not push the >> variant expected-file. I think the entire point here is that >> we are *not* deeming the new NetBSD implementation acceptable, >> so allowing it to pass regression tests is the wrong thing. > What do we gain from the regression test failing exactly this way, given that > we know it's a problem? It tells people not to use --with-uuid=bsd on those NetBSD versions. They can either do without uuid-ossp, or install ossp or e2fs. ("Do without" is not much of a hardship, now that we have gen_random_uuid() in core.) IMV a substantial part of the point of the regression tests is to let end users and/or packagers verify that they have a non-broken installation. Hiding a problem by making the tests not fail basically breaks that use-case. If we had, say, a known openssl security bug that was exposed by our test cases, would you advocate dumbing down the tests to not expose the bug? > It just makes it harder to run tests. Harder for who? AFAICT there is nobody but me routinely running full tests on NetBSD, else we'd have found this problem much earlier. I've got my animals configured not to use --with-uuid (not much of a lift considering that's the buildfarm's default). End of problem. regards, tom lane
Re: Introduce wait_for_subscription_sync for TAP tests
On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: > > Amit Kapila writes: > > Pushed. > > Recently a number of buildfarm animals have failed at the same > place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > > # Failed test '2x3000 rows in t' > # at t/100_bugs.pl line 149. > # got: '9000' > # expected: '6000' > # Looks like you failed 1 test of 7. > [09:30:56] t/100_bugs.pl .. > > This was the last commit to touch that test script. I'm thinking > maybe it wasn't adjusted quite correctly? On the other hand, since > I can't find any similar failures before the last 48 hours, maybe > there is some other more-recent commit to blame. Anyway, something > is wrong there. It seems that this commit is innocent as it changed only how to wait. Rather, looking at the logs, the tablesync worker errored out at an interesting point: 2022-09-09 09:30:19.630 EDT [631b3feb.840:13] pg_16400_sync_16392_7141371862484106124 ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8 2022-09-09 09:30:19.630 EDT [631b3feb.840:14] pg_16400_sync_16392_7141371862484106124 STATEMENT: START_REPLICATION SLOT "pg_16400_sync_16392_7141371862484106124" LOGICAL 0/0 (proto_version '3', origin 'any', publication_names '"testpub"') ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8 2022-09-09 09:30:19.631 EDT [631b3feb.26e8:2] ERROR: error while shutting down streaming COPY: ERROR: could not find record while sending logically-decoded data: missing contrecord at 0/1D4FFF8 It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef is relevant. Regards, -- Masahiko Sawada
Re: Avoid overhead with fprintf related functions
On Fri, Sep 9, 2022 at 13:20, Nathan Bossart <nathandboss...@gmail.com> wrote: > On Fri, Sep 09, 2022 at 10:45:37AM -0300, Ranier Vilela wrote: > > Based on work in [1]. > > According to https://cplusplus.com/reference/cstdio/fprintf/ > > The use of fprintf is related to the need to generate a string based on a > > format, which should be different from "%s". > > Since fprintf has overhead when parsing the "format" parameter, plus all > > the trouble of checking the va_arg parameters. > > I think this is one of the low fruits available and easy to reap. > > By replacing fprintf with its equivalents, fputs and fputc, > > we avoid overhead and increase security [2] and [3]. > > > > The downside is a huge big churm, which unfortunately will occur. > > But, IHMO, I think the advantages are worth it. > > Note that behavior remains the same, since fputs and fputc do not change > > the expected behavior of fprintf. > > > > A small performance gain is expected, mainly for the client, since there > > are several occurrences in some critical places, such as > > (usr/src/fe_utils/print.c). > > I agree with David [0]. But if you can demonstrate a performance gain, > perhaps it's worth considering a subset of these changes in hot paths. > Simple benchmark with David's sort example. 1. make data create table t (a bigint not null, b bigint not null, c bigint not null, d bigint not null, e bigint not null, f bigint not null); insert into t select x,x,x,x,x,x from generate_Series(1,140247142) x; -- 10GB! vacuum freeze t; 2. client run \timing on \pset pager off select * from t limit 100; head: Time: 418,210 ms Time: 419,588 ms Time: 424,713 ms fprintf patch: Time: 416,919 ms Time: 416,246 ms Time: 416,237 ms regards, Ranier Vilela
Re: Introduce wait_for_subscription_sync for TAP tests
Masahiko Sawada writes: > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: >> Recently a number of buildfarm animals have failed at the same >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: >> >> # Failed test '2x3000 rows in t' >> # at t/100_bugs.pl line 149. >> # got: '9000' >> # expected: '6000' >> # Looks like you failed 1 test of 7. >> [09:30:56] t/100_bugs.pl .. >> >> This was the last commit to touch that test script. I'm thinking >> maybe it wasn't adjusted quite correctly? On the other hand, since >> I can't find any similar failures before the last 48 hours, maybe >> there is some other more-recent commit to blame. Anyway, something >> is wrong there. > It seems that this commit is innocent as it changed only how to wait. Yeah. I was wondering if it caused us to fail to wait somewhere, but I concur that's not all that likely. > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > is relevant. Noting that the errors have only appeared in the past couple of days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f (Fix recovery_prefetch with low maintenance_io_concurrency). regards, tom lane
Re: Avoid overhead with fprintf related functions
Em sex., 9 de set. de 2022 às 10:45, Ranier Vilela escreveu: > Based on work in [1]. > According to https://cplusplus.com/reference/cstdio/fprintf/ > The use of fprintf is related to the need to generate a string based on a > format, which should be different from "%s". > Since fprintf has overhead when parsing the "format" parameter, plus all > the trouble of checking the va_arg parameters. > I think this is one of the low fruits available and easy to reap. > By replacing fprintf with its equivalents, fputs and fputc, > we avoid overhead and increase security [2] and [3]. > > The downside is a huge big churm, which unfortunately will occur. > But, IHMO, I think the advantages are worth it. > Note that behavior remains the same, since fputs and fputc do not change > the expected behavior of fprintf. > > A small performance gain is expected, mainly for the client, since there > are several occurrences in some critical places, such as > (usr/src/fe_utils/print.c). > > Patch attached. > This pass check-world. > Rechecked for the hundredth time. One typo. regards, Ranier Vilela diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c index 32d5444831..4672d8b2a4 100644 --- a/contrib/oid2name/oid2name.c +++ b/contrib/oid2name/oid2name.c @@ -422,7 +422,7 @@ sql_exec(PGconn *conn, const char *todo, bool quiet) fprintf(stdout, "%*s", length[j] + 2, PQfname(res, j)); l += length[j] + 2; } - fprintf(stdout, "\n"); + fputc('\n', stdout); pad = (char *) pg_malloc(l + 1); memset(pad, '-', l); pad[l] = '\0'; @@ -435,7 +435,7 @@ sql_exec(PGconn *conn, const char *todo, bool quiet) { for (j = 0; j < nfields; j++) fprintf(stdout, "%*s", length[j] + 2, PQgetvalue(res, i, j)); - fprintf(stdout, "\n"); + fputc('\n', stdout); } /* cleanup */ diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c index 58d32ba946..ed1ab65940 100644 --- a/contrib/pg_trgm/trgm_regexp.c +++ b/contrib/pg_trgm/trgm_regexp.c @@ -2201,7 +2201,7 @@ printSourceNFA(regex_t *regex, TrgmColorInfo *colors, int ncolors) /* dot -Tpng -o /tmp/source.png < /tmp/source.gv */ FILE *fp = fopen("/tmp/source.gv", "w"); - fprintf(fp, "%s", buf.data); + fputs(buf.data, fp); fclose(fp); } @@ -2263,7 +2263,7 @@ printTrgmNFA(TrgmNFA *trgmNFA) /* dot -Tpng -o /tmp/transformed.png < /tmp/transformed.gv */ FILE *fp = fopen("/tmp/transformed.gv", "w"); - fprintf(fp, "%s", buf.data); + fputs(buf.data, fp); fclose(fp); } @@ -2354,7 +2354,7 @@ printTrgmPackedGraph(TrgmPackedGraph *packedGraph, TRGM *trigrams) /* dot -Tpng -o /tmp/packed.png < /tmp/packed.gv */ FILE *fp = fopen("/tmp/packed.gv", "w"); - fprintf(fp, "%s", buf.data); + fputs(buf.data, fp); fclose(fp); } diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c index 264b879bd3..e51ce8df57 100644 --- a/contrib/vacuumlo/vacuumlo.c +++ b/contrib/vacuumlo/vacuumlo.c @@ -133,7 +133,7 @@ vacuumlo(const char *database, const struct _param *param) { fprintf(stdout, "Connected to database \"%s\"\n", database); if (param->dry_run) - fprintf(stdout, "Test run: no large objects will be removed!\n"); + fputs("Test run: no large objects will be removed!\n", stdout); } res = PQexec(conn, ALWAYS_SECURE_SEARCH_PATH_SQL); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7a710e6490..343555e095 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -8641,7 +8641,7 @@ do_pg_backup_stop(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p) * Transfer remaining lines including label and start 
timeline to * history file. */ - fprintf(fp, "%s", remaining); + fputs(remaining, fp); fprintf(fp, "STOP TIME: %s\n", strfbuf); fprintf(fp, "STOP TIMELINE: %u\n", stoptli); if (fflush(fp) || ferror(fp) || FreeFile(fp)) diff --git a/src/backend/optimizer/geqo/geqo_misc.c b/src/backend/optimizer/geqo/geqo_misc.c index 890ac363e9..0d024f227f 100644 --- a/src/backend/optimizer/geqo/geqo_misc.c +++ b/src/backend/optimizer/geqo/geqo_misc.c @@ -114,17 +114,17 @@ print_edge_table(FILE *fp, Edge *edge_table, int num_gene) int i, j; - fprintf(fp, "\nEDGE TABLE\n"); + fputs("\nEDGE TABLE\n", fp); for (i = 1; i <= num_gene; i++) { fprintf(fp, "%d :", i); for (j = 0; j < edge_table[i].unused_edges; j++) fprintf(fp, " %d", edge_table[i].edge_list[j]); - fprintf(fp, "\n"); + fputc('\n', fp); } - fprintf(fp, "\n"); + fputc('\n', fp); fflush(fp); } diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index e75611fdd5..36808438c8 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -5601,10 +5601,10 @@ CreateOptsFile(int argc, char *argv[], char *fullprogname) return false; } - fprintf(fp, "%s", fullprogname); + fputs(fullprogname, fp); for (i = 1; i < argc; i++) fprintf(fp
Re: Avoid overhead with fprintf related functions
Ranier Vilela writes: > Em sex., 9 de set. de 2022 às 13:20, Nathan Bossart < > nathandboss...@gmail.com> escreveu: >> I agree with David [0]. But if you can demonstrate a performance gain, >> perhaps it's worth considering a subset of these changes in hot paths. > head: > Time: 418,210 ms > Time: 419,588 ms > Time: 424,713 ms > fprintf patch: > Time: 416,919 ms > Time: 416,246 ms > Time: 416,237 ms That is most certainly not enough gain to justify a large amount of code churn. In fact, given that this is probably pretty platform-dependent and you've checked only one platform, I don't think I'd call this a sufficient case for even a one-line change. regards, tom lane
Re: Introduce wait_for_subscription_sync for TAP tests
On Sat, Sep 10, 2022 at 6:45 AM Tom Lane wrote: > > Masahiko Sawada writes: > > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: > >> Recently a number of buildfarm animals have failed at the same > >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > >> > >> # Failed test '2x3000 rows in t' > >> # at t/100_bugs.pl line 149. > >> # got: '9000' > >> # expected: '6000' > >> # Looks like you failed 1 test of 7. > >> [09:30:56] t/100_bugs.pl .. > >> > >> This was the last commit to touch that test script. I'm thinking > >> maybe it wasn't adjusted quite correctly? On the other hand, since > >> I can't find any similar failures before the last 48 hours, maybe > >> there is some other more-recent commit to blame. Anyway, something > >> is wrong there. > > > It seems that this commit is innocent as it changed only how to wait. > > Yeah. I was wondering if it caused us to fail to wait somewhere, > but I concur that's not all that likely. > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > is relevant. > > Noting that the errors have only appeared in the past couple of > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > (Fix recovery_prefetch with low maintenance_io_concurrency). Probably I found the cause of this failure[1]. The commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef didn't fix the problem properly. Regards, [1] https://www.postgresql.org/message-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj%3DsLUOn4Gd2GjUAKG-fw%40mail.gmail.com -- Masahiko Sawada
Re: Avoid overhead with fprintf related functions
On Fri, Sep 9, 2022 at 18:53, Tom Lane wrote: > Ranier Vilela writes: > > On Fri, Sep 9, 2022 at 13:20, Nathan Bossart < > > nathandboss...@gmail.com> wrote: > >> I agree with David [0]. But if you can demonstrate a performance gain, > >> perhaps it's worth considering a subset of these changes in hot paths. > > > head: > > Time: 418,210 ms > > Time: 419,588 ms > > Time: 424,713 ms > > > fprintf patch: > > Time: 416,919 ms > > Time: 416,246 ms > > Time: 416,237 ms > > That is most certainly not enough gain to justify a large amount > of code churn. In fact, given that this is probably pretty > platform-dependent and you've checked only one platform, I don't > think I'd call this a sufficient case for even a one-line change. > Of course, I based these changes not on performance gain, but on correct style and increased security. But outvoted is outvoted; case closed. Regards, Ranier Vilela
Re: Introduce wait_for_subscription_sync for TAP tests
On Sat, Sep 10, 2022 at 9:45 AM Tom Lane wrote: > Masahiko Sawada writes: > > On Fri, Sep 9, 2022 at 11:31 PM Tom Lane wrote: > >> Recently a number of buildfarm animals have failed at the same > >> place in src/test/subscription/t/100_bugs.pl [1][2][3][4]: > >> > >> # Failed test '2x3000 rows in t' > >> # at t/100_bugs.pl line 149. > >> # got: '9000' > >> # expected: '6000' > >> # Looks like you failed 1 test of 7. > >> [09:30:56] t/100_bugs.pl .. > >> > >> This was the last commit to touch that test script. I'm thinking > >> maybe it wasn't adjusted quite correctly? On the other hand, since > >> I can't find any similar failures before the last 48 hours, maybe > >> there is some other more-recent commit to blame. Anyway, something > >> is wrong there. > > > It seems that this commit is innocent as it changed only how to wait. > > Yeah. I was wondering if it caused us to fail to wait somewhere, > but I concur that's not all that likely. > > > It's likely that the commit f6c5edb8abcac04eb3eac6da356e59d399b2bcef > > is relevant. > > Noting that the errors have only appeared in the past couple of > days, I'm now suspicious of adb466150b44d1eaf43a2d22f58ff4c545a0ed3f > (Fix recovery_prefetch with low maintenance_io_concurrency). Yeah, I also just spotted the coincidence of those failures while monitoring the build farm. I'll look into this later today. My initial suspicion is that there was pre-existing code here that was (incorrectly?) relying on the lack of error reporting in that case. But maybe I misunderstood and it was incorrect to report the error for some reason that was not robustly covered with tests.
Re: configure --with-uuid=bsd fails on NetBSD
Hi, On 2022-09-09 17:31:40 -0400, Tom Lane wrote: > Harder for who? AFAICT there is nobody but me routinely running > full tests on NetBSD, else we'd have found this problem much earlier. Bilal's report came out of automating testing on netbsd (and openbsd) as well, as part of the meson stuff. I also occasionally run the tests in a VM. But as you say: > I've got my animals configured not to use --with-uuid (not much of > a lift considering that's the buildfarm's default). End of problem. Fair enough. Greetings, Andres Freund
Re: Switching XLog source from archive to streaming when primary available
On Fri, Sep 09, 2022 at 11:07:00PM +0530, Bharath Rupireddy wrote: > On Fri, Sep 9, 2022 at 10:29 PM Nathan Bossart > wrote: >> IMO the timeout approach would be more intuitive for users. When it comes >> to archive recovery, "WAL segment" isn't a standard unit of measure. WAL >> segment size can differ between clusters, and WAL files can have different >> amounts of data or take different amounts of time to replay. > > How about the amount of WAL bytes fetched from the archive after which > a standby attempts to connect to primary or enter streaming mode? Of > late, we've changed some GUCs to represent bytes instead of WAL > files/segments, see [1]. Well, for wal_keep_size, using bytes makes sense. Given you know how much disk space you have, you can set this parameter accordingly to avoid retaining too much of it for standby servers. For your proposed parameter, it's not so simple. The same setting could have wildly different timing behavior depending on the server. I still think that a timeout is the most intuitive. >> So I think it >> would be difficult for the end user to decide on a value. However, even >> the timeout approach has this sort of problem. If your parameter is set to >> 1 minute, but the current archive takes 5 minutes to recover, you won't >> really be testing streaming replication once a minute. That would likely >> need to be documented. > > If we have configurable WAL bytes instead of timeout for standby WAL > source switch from archive to primary, we don't have the above problem > right? If you are going to stop replaying in the middle of a WAL archive, then maybe. But I don't think I'd recommend that. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On Thu, Sep 8, 2022 at 5:46 PM Tom Lane wrote: > Jacob Champion writes: > > I think you're going to have to address backwards compatibility > > concerns. Today, I can create a role named "/a", and I can put that > > into the HBA without quoting it. I'd be unamused if, after an upgrade, > > my rule suddenly matched any role name containing an 'a'. > > Meh ... that concern seems overblown to me. I guess it's possible > that somebody has an HBA entry that looks like that, but it doesn't > seem very plausible. Note that we made this exact same change in > pg_ident.conf years ago, and AFAIR we got zero complaints. What percentage of users actually use pg_ident maps? My assumption would be that a change to pg_hba would affect many more people, but then I don't have any proof that there are users with role names that look like that to begin with. I won't pound the table with it. > > Speaking of partial matches, should this feature allow them? Maybe > > rules should have to match the entire username instead, and sidestep > > the inevitable "I forgot to anchor my regex" problems? > > I think the pg_ident.conf precedent is binding on us here. If we > make this one work differently, nobody's going to thank us for it, > they're just going to wonder "did the left hand not know what the > right hand already did?" Hmm... yeah, I suppose. From the other direction, it'd be bad to train users that unanchored regexes are safe in pg_hba only to take those guardrails off in pg_ident. I will tuck that away as a potential behavior change, for a different thread. Thanks, --Jacob
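To make the anchoring concern concrete, here is a small stand-alone illustration (not PostgreSQL code; it only uses POSIX regex) showing that an unanchored pattern "a" matches any role name containing an 'a', while "^a$" matches only the role named exactly "a":

```
#include <regex.h>
#include <stdio.h>

int
main(void)
{
	const char *roles[] = {"a", "alice", "bob"};
	regex_t		unanchored;
	regex_t		anchored;

	regcomp(&unanchored, "a", REG_EXTENDED | REG_NOSUB);
	regcomp(&anchored, "^a$", REG_EXTENDED | REG_NOSUB);

	for (int i = 0; i < 3; i++)
		printf("%-6s unanchored: %-9s anchored: %s\n", roles[i],
			   regexec(&unanchored, roles[i], 0, NULL, 0) == 0 ? "match" : "no match",
			   regexec(&anchored, roles[i], 0, NULL, 0) == 0 ? "match" : "no match");

	regfree(&unanchored);
	regfree(&anchored);
	return 0;
}
```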
Re: predefined role(s) for VACUUM and ANALYZE
On Wed, Sep 07, 2022 at 04:15:23PM -0700, Mark Dilger wrote: > Ok, now I'm a bit lost. If I want to use Nathan's feature to create a role > to vacuum and analyze my database on a regular basis, how does per-relation > granularity help me? If somebody creates a new table and doesn't grant those > privileges to the role, doesn't that break the usage case? To me, > per-relation granularity sounds useful, but orthogonal, to this feature. I think there is room for both per-relation privileges and new predefined roles. My latest patch set [0] introduces both. [0] https://postgr.es/m/20220908055035.GA2100193%40nathanxps13 -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Avoid overhead with fprintf related functions
On Fri, Sep 09, 2022 at 05:53:54PM -0400, Tom Lane wrote: > Ranier Vilela writes: >> Em sex., 9 de set. de 2022 às 13:20, Nathan Bossart < >> nathandboss...@gmail.com> escreveu: >>> I agree with David [0]. But if you can demonstrate a performance gain, >>> perhaps it's worth considering a subset of these changes in hot paths. > >> head: >> Time: 418,210 ms >> Time: 419,588 ms >> Time: 424,713 ms > >> fprintf patch: >> Time: 416,919 ms >> Time: 416,246 ms >> Time: 416,237 ms > > That is most certainly not enough gain to justify a large amount > of code churn. In fact, given that this is probably pretty > platform-dependent and you've checked only one platform, I don't > think I'd call this a sufficient case for even a one-line change. Agreed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
[Commitfest 2022-09] First week is over
Just a reminder that the first week of the September 2022 commitfest is over. As of now, there are 295 patches in total. Out of these 295 patches, 29 require committer attention, and 188 need review. I think we need more reviewers to bring that number down. I will keep sending reminder emails and updating the status of patch entries accordingly. Thanks to everyone who is participating in the commitfest. -- Ibrar Ahmed. Senior Software Engineer, PostgreSQL Consultant.
Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf
On 8/19/22 01:12, Drouvot, Bertrand wrote: > + wstr = palloc((strlen(tok->string + 1) + 1) * sizeof(pg_wchar)); > > + wlen = pg_mb2wchar_with_len(tok->string + 1, > > + wstr, strlen(tok->string + 1)); The (tok->string + 1) construction comes up often enough that I think it should be put in a `regex` variable or similar. That would help my eyes with the (strlen(tok->string + 1) + 1) construction, especially. I noticed that for pg_ident, we precompile the regexes per-line and reuse those in child processes. Whereas here we're compiling, using, and then discarding the regex for each check. I think the example set by the pg_ident code is probably the one to follow, unless you have a good reason not to. > +# Testing with regular expression for username > > +reset_pg_hba($node, '/^.*md.*$', 'password'); > > +test_role($node, 'md5_role', 'password from pgpass and regular expression > for username', 0); > + IMO the coverage for this patch needs to be filled out. Negative test cases are more important than positive ones for security-related code. Other than that, and Tom's note on potentially expanding this to other areas, I think this is a pretty straightforward patch. Thanks, --Jacob
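As a sketch of the first point (illustrative only; the declarations and the leading-'/' convention are taken from the quoted patch lines):

```
	const char *regex = tok->string + 1;	/* skip the leading '/' */
	pg_wchar   *wstr;
	int			wlen;

	wstr = palloc((strlen(regex) + 1) * sizeof(pg_wchar));
	wlen = pg_mb2wchar_with_len(regex, wstr, strlen(regex));
```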
Re: [RFC] building postgres with meson - v12
Hi, On 2022-08-31 11:11:54 -0700, Andres Freund wrote: > > If the above are addressed, I think this will be just about at the > > point where the above patches can be committed. > > Woo! There was a lot less progress over the last ~week than I had hoped. The reason is that I was trying to figure out the reason for the occasional failures of ecpg tests getting compiled when building on windows in CI, with msbuild. I went into many layers of rabbitholes while investigating. Wasting an absurd amount of time. The problem: Occasionally ecpg test files would fail to compile, exiting with -1073741819: C:\BuildTools\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5): error MSB8066: Custom build for 'C:\cirrus\build\meson-private\custom_target.rule' exited with code -1073741819. [c:\cirrus\build\src\interfaces\ecpg\test\sql\3701597@@twophase.c@cus.vcxproj] -1073741819 is 0xc005, which in turn is STATUS_ACCESS_VIOLATION, i.e. a segfault. This happens in roughly 1/3 of the builds, but with "streaks" of not happening and more frequently happening. However, despite our CI images having a JIT debugger configured (~coredump handler), no crash report was triggered. The problem never occurs in my windows VM. At first I thought that might be because it's an assertion failure or such, which only causes a dump when a bunch of magic is done (see main.c). But despite adding all the necessary magic to ecpg.exe, no dump. Unfortunately, adding debug output reduces the frequency of the issue substantially. Eventually I figured out that it's not actually ecpg.exe that is crashing. It is meson's python wrapper around built binaries as part of the build (for setting PATH, working directory, without running into cmd.exe issues). A modified meson wrapper showed that ecpg.exe completes successfully. The only thing the meson wrapper does after running the command is to call sys.exit(returncode), and I had printed out the returncode, which is 0. I looked through a lot of the python code, to see why no crashdump and no details are forthcoming. There weren't any relevant SetErrorMode(SEM_NOGPFAULTERRORBOX) calls. I tried to set PYTHONFAULTHANDLER, but still no stack trace. Next I suspected that cmd.exe might be crashing and causing the problem. Modified meson to add 'echo %ERRORLEVEL%' to the msbuild custombuild. Which indeed shows the STATUS_ACCESS_VIOLATION returncode after running python. So it's not cmd.exe. The problem even persisted when replacing meson's sys.exit() with os._exit(), which indeed just calls _exit(). I tried to reproduce the problem using a python with debugging enabled. The problem doesn't occur despite quite a few runs. I found scattered other reports of this problem happening on windows. Went down a few more rabbitholes. Too boring to repeat here. At this point I finally figured out that the reason the crash reports don't happen is that everythin started by cirrus-ci on windows has an errormode of SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX | SEM_NOOPENFILEERRORBOX. A good bit later I figured out that while cirrus-ci isn't intentionally setting that, golang does so *unconditionally* on windows: https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/signal_windows.go#L14 https://github.com/golang/go/blob/54182ff54a687272dd7632c3a963e036ce03cb7c/src/runtime/os_windows.go#L553 Argh. I should have checked what the error mode is earlier, but this is just very sneaky. 
So I modified meson to change the errormode and tried to reproduce the issue again, to finally get a stackdump. And tried again. And again. Without a single relevant failure (I saw tests fail in ways that are discussed on the list, but that's irrelevant here). I've run this through enough attempts by now that I'm quite confident that the problem does not occur when the errormode does not include SEM_NOOPENFILEERRORBOX. I'll want a few more runs to be certain, but... Given that the problem appears to happen after _exit() is called, and only when SEM_NOOPENFILEERRORBOX is not set, it seems likely to be an OS / C runtime bug. Presumably it's related to something that python does first, but I don't see how anything could justify crashing only if SEM_NOOPENFILEERRORBOX is set (rather than the opposite). I have no idea how to debug this further, given that the problem is quite rare (can't attach a debugger and wait), only happens when crashdumps are prevented from happening (so no idea where it crashes) and is made less common by debug printfs. So for now the best way forward I can see is to change the error mode for CI runs. Which is likely a good idea anyway, so we can see crashdumps for binaries other than postgres.exe (which does SetErrorMode() internally). I managed to do so by setting CIRRUS_SHELL to a python wrapper around cmd.exe that does SetErrorMode(). I'm sure there's easier ways, but I couldn't figure out any. I'd like to reclaim my time. But I'm afraid nobody will be
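For context, here is a rough sketch of what such a shell wrapper has to accomplish (the real wrapper mentioned above is a Python script installed via CIRRUS_SHELL; this C version is only an illustration of the SetErrorMode() idea, not the actual code):

```
#include <windows.h>
#include <process.h>

int
main(int argc, char **argv)
{
	(void) argc;

	/*
	 * Restore the default error mode (dropping SEM_NOGPFAULTERRORBOX and
	 * SEM_NOOPENFILEERRORBOX inherited from the golang parent) so that
	 * crashing child processes can trigger crash reports again, then hand
	 * the command line over to cmd.exe.
	 */
	SetErrorMode(0);
	return (int) _spawnvp(_P_WAIT, "cmd.exe", (const char *const *) argv);
}
```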
Re: Summary function for pg_buffercache
Hi Aleksander and Nathan, Thanks for your comments. Aleksander Alekseev , 9 Eyl 2022 Cum, 17:36 tarihinde şunu yazdı: > However I'm afraid you can't examine BufferDesc's without taking > locks. This is explicitly stated in buf_internals.h: > > """ > Buffer header lock (BM_LOCKED flag) must be held to EXAMINE or change > TAG, state or wait_backend_pgprocno fields. > """ > I wasn't aware of this explanation. Thanks for pointing it out. When somebody modifies relNumber concurrently (e.g. calls > ClearBufferTag()) this will cause an undefined behaviour. > I thought that it wouldn't really be a problem even if relNumber is modified concurrently, since the function does not actually rely on the actual values. I'm not sure about what undefined behaviour could harm this badly. It seemed to me that it would read an invalid relNumber in the worst case scenario. But I'm not actually familiar with buffer related parts of the code, so I might be wrong. And I'm okay with taking header locks if necessary. In the attached patch, I added buffer header locks just before examining tag as follows: + buf_state = LockBufHdr(bufHdr); > + > + /* Invalid RelFileNumber means the buffer is unused */ > + if (RelFileNumberIsValid(BufTagGetRelNumber(&bufHdr->tag))) > + { > ... > + } > ... > + UnlockBufHdr(bufHdr, buf_state); > > > I suggest we focus on saving the memory first and then think about the > > > performance, if necessary. > > > > +1 > I again did the same quick benchmarking, here are the numbers with locks. postgres=# show shared_buffers; shared_buffers 16GB (1 row) postgres=# SELECT relfilenode <> 0 AS is_valid, isdirty, count(*) FROM pg_buffercache GROUP BY relfilenode <> 0, isdirty; is_valid | isdirty | count --+-+- t| f | 256 | | 2096876 t| t | 20 (3 rows) Time: 1024.456 ms (00:01.024) postgres=# select * from pg_buffercache_summary(); used_buffers | unused_buffers | dirty_buffers | pinned_buffers | avg_usagecount --++---++ 282 |2096870 |20 | 0 | 3.4574468 (1 row) Time: 33.074 ms Yes, locks slowed pg_buffercache_summary down. But there is still quite a bit of performance improvement, plus memory saving as you mentioned. > Here is the corrected patch. > Also thanks for corrections. Best, Melih v5-0001-Added-pg_buffercache_summary-function.patch Description: Binary data
Re: cataloguing NOT NULL constraints
Hi, w.r.t. the while loop in findNotNullConstraintAttnum(): + if (multiple == NULL) + break; I think `pfree(arr)` should be called before breaking. + if (constraint->cooked_expr != NULL) + return tryExtractNotNullFromNode(stringToNode(constraint->cooked_expr), rel); + else + return tryExtractNotNullFromNode(constraint->raw_expr, rel); nit: the `else` keyword is not needed. + if (isnull) + elog(ERROR, "null conbin for constraint %u", conForm->oid); It would be better to expand `conbin` so that the user can better understand the error. Cheers >
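To make the two suggestions concrete, a sketch of how the quoted fragments might end up (function and variable names come from the quoted patch; everything else is an assumption):

```
	/* release the scratch array before leaving the loop early */
	if (multiple == NULL)
	{
		pfree(arr);
		break;
	}

	/* the "else" is unnecessary after an unconditional return */
	if (constraint->cooked_expr != NULL)
		return tryExtractNotNullFromNode(stringToNode(constraint->cooked_expr), rel);

	return tryExtractNotNullFromNode(constraint->raw_expr, rel);
```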
Re: why can't a table be part of the same publication as its schema
On Fri, Sep 9, 2022 at 8:48 PM Robert Haas wrote: > > On Fri, Sep 9, 2022 at 10:29 AM houzj.f...@fujitsu.com > wrote: > > IIRC, the feature currently works almost the same as you described. It > > doesn't > > create entry for tables that are published via its schema level, it only > > record > > the published schema and check which tables are part of it. > > Oh, well if that's the case, that is great news. > Yes, the feature works as you and Hou-San have mentioned. > But then I don't > understand Amit's comment from before: > > > Yes, because otherwise, there was confusion while dropping the objects > > from publication. Consider in the above case, if we would have allowed > > it and then the user performs ALTER PUBLICATION p1 DROP ALL TABLES IN > > SCHEMA s1, then (a) shall we remove both schema s1 and a table that is > > separately added (s1.t1) from that schema, or (b) just remove schema > > s1? > > I believe that (b) is the correct behavior, so I assumed that this > issue must be some difficulty in implementing it, like a funny catalog > representation. > No, it was because of syntax. IIRC, during development, Greg Nancarrow raised a point [1] that a user can expect the individually added tables for a schema which is also part of the publication to also get dropped when she specifies DROP ALL TABLES IN SCHEMA. IIRC, originally, the patch had a behavior (b) but then changed due to discussion around this point. But now that it seems you and others don't feel that was right, we can change back to (b) as I think that shouldn't be difficult to achieve. > Things might be clearer if we'd made the syntax "ALTER PUBLICATION p1 > { ADD | DROP } { TABLE | SCHEMA } name". I don't understand why we > used this ALL TABLES IN SCHEMA language. > It was exactly due to the reason Mark had mentioned in the email [2]. [1] - https://www.postgresql.org/message-id/CAJcOf-fTRZ3HiA5xU0-O-PT390A7wuUUkjP8uX3aQJLBsJNVmw%40mail.gmail.com [2] - https://www.postgresql.org/message-id/596EA671-66DF-4285-8560-0270DC062353%40enterprisedb.com -- With Regards, Amit Kapila.
pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
The sequence of events leading up to this: 0) Yesterday I upgraded an internal VM to pg15b4 using PGDG RPMs; It's the same VM that hit the prefetch_recovery bug which was fixed by adb466150. I don't think that should've left it in a weird state (since recovery was sucessful when prefetch was disabled, and the cluser worked fine until now). 1) This evening, I started running sqlsmith connected to the postgres DB that has some logging tables in it; 2) There was a lot of swapping, and the backend was finally killed due to OOM when sqlsmith tried to run a terrible query involving database_to_xml(); 3) lots of backends started crashing with SIGABRT; 4) I was simultaneously compiling pg14b4 to run with with -DRELCACHE_FORCE_RELEASE and installing it into /usr/local. I don't *think* running libraries would've been overwritten, and that shouldn't have affected the running instance anyway... I got a lot (100+) of SIGABRT. I suspect all the backtraces are the same. My hypothesis is that the OOM crash caused bad xmax to be written (or rather, recovery didn't cause it to be written correctly?). We may have hit a race condition due to heavy swapping. (gdb) bt #0 0x7fb8a22f31f7 in raise () from /lib64/libc.so.6 #1 0x7fb8a22f48e8 in abort () from /lib64/libc.so.6 #2 0x0098f9be in ExceptionalCondition (conditionName=conditionName@entry=0x9fada4 "TransactionIdIsValid(xmax)", errorType=errorType@entry=0x9ed217 "FailedAssertion", fileName=fileName@entry=0x9fad90 "heapam_visibility.c", lineNumber=lineNumber@entry=1353) at assert.c:69 #3 0x004fd4d7 in HeapTupleSatisfiesVacuumHorizon (htup=htup@entry=0x7ffc225a87e0, buffer=buffer@entry=5100, dead_after=dead_after@entry=0x7ffc225a87d0) at heapam_visibility.c:1353 #4 0x00501702 in heap_prune_satisfies_vacuum (buffer=5100, tup=0x7ffc225a87e0, prstate=0x7ffc225a8a50) at pruneheap.c:504 #5 heap_page_prune (relation=relation@entry=0x7fb8a50c3438, buffer=buffer@entry=5100, vistest=vistest@entry=0xec7890 , old_snap_xmin=, old_snap_ts=, nnewlpdead=nnewlpdead@entry=0x7ffc225a964c, off_loc=off_loc@entry=0x0) at pruneheap.c:351 #6 0x00502326 in heap_page_prune_opt (relation=0x7fb8a50c3438, buffer=buffer@entry=5100) at pruneheap.c:209 #7 0x004f3ae3 in heapgetpage (sscan=sscan@entry=0x199b1d0, page=page@entry=2892) at heapam.c:415 #8 0x004f44c2 in heapgettup_pagemode (scan=scan@entry=0x199b1d0, dir=, nkeys=0, key=0x0) at heapam.c:1120 #9 0x004f5abe in heap_getnextslot (sscan=0x199b1d0, direction=, slot=0x1967be8) at heapam.c:1352 #10 0x006de16b in table_scan_getnextslot (slot=0x1967be8, direction=ForwardScanDirection, sscan=) at ../../../src/include/access/tableam.h:1046 #11 SeqNext (node=node@entry=0x1967a38) at nodeSeqscan.c:80 #12 0x006b109a in ExecScanFetch (recheckMtd=0x6de0f0 , accessMtd=0x6de100 , node=0x1967a38) at execScan.c:133 #13 ExecScan (node=0x1967a38, accessMtd=0x6de100 , recheckMtd=0x6de0f0 ) at execScan.c:199 #14 0x006add88 in ExecProcNodeInstr (node=0x1967a38) at execProcnode.c:479 #15 0x006a6182 in ExecProcNode (node=0x1967a38) at ../../../src/include/executor/executor.h:259 #16 ExecutePlan (execute_once=, dest=0x1988448, direction=, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=, planstate=0x1967a38, estate=0x19677a0) at execMain.c:1636 #17 standard_ExecutorRun (queryDesc=0x1996e80, direction=, count=0, execute_once=) at execMain.c:363 #18 0x7fb8960913bd in pgss_ExecutorRun (queryDesc=0x1996e80, direction=ForwardScanDirection, count=0, execute_once=) at pg_stat_statements.c:1010 #19 0x7fb895c6f781 in explain_ExecutorRun 
(queryDesc=0x1996e80, direction=ForwardScanDirection, count=0, execute_once=) at auto_explain.c:320 #20 0x0084976e in PortalRunSelect (portal=portal@entry=0x18fed30, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x1988448) at pquery.c:924 #21 0x0084af4f in PortalRun (portal=0x18fed30, count=9223372036854775807, isTopLevel=, run_once=, dest=0x1988448, altdest=0x1988448, qc=0x7ffc225a9ce0) at pquery.c:768 #22 0x0084679b in exec_simple_query (query_string=0x186d8a0 "SELECT alarm_id, alarm_disregard FROM alarms WHERE alarm_ack_time IS NULL AND alarm_clear_time IS NULL AND alarm_office = 'ETHERNET'") at postgres.c:1250 #23 0x0084848a in PostgresMain (dbname=, username=) at postgres.c:4581 #24 0x00495afe in BackendRun (port=, port=) at postmaster.c:4504 #25 BackendStartup (port=0x1894250) at postmaster.c:4232 #26 ServerLoop () at postmaster.c:1806 #27 0x007b0c60 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1868280) at postmaster.c:1478 #28 0x004976a6 in main (argc=3, argv=0x1868280) at main.c:202 < 2022-09-09 19:44:03.329 CDT >LOG: server process (PID 8949) was terminated by signal 6: Aborted < 2022-09-09 19:44:03.329 CDT >DETA
Re: Assertion failure in WaitForWALToBecomeAvailable state machine
On Mon, Aug 15, 2022 at 11:30 AM Bharath Rupireddy wrote: > > On Thu, Aug 11, 2022 at 10:06 PM Bharath Rupireddy > wrote: > > > > Today I encountered the assertion failure [2] twice while working on > > another patch [1]. The pattern seems to be that the walreceiver got > > killed or crashed and set its state to WALRCV_STOPPING or > > WALRCV_STOPPED by the time the WAL state machine moves to archive and > > hence the following XLogShutdownWalRcv() code will not get hit: > > > > /* > > * Before we leave XLOG_FROM_STREAM state, make sure > > that > > * walreceiver is not active, so that it won't overwrite > > * WAL that we restore from archive. > > */ > > if (WalRcvStreaming()) > > ShutdownWalRcv(); > > > > I agree with Kyotaro-san to reset InstallXLogFileSegmentActive before > > entering into the archive mode. Hence I tweaked the code introduced by > > the following commit a bit, the result v1 patch is attached herewith. > > Please review it. > > I added it to the current commitfest to not lose track of it: > https://commitfest.postgresql.org/39/3814/. Today I spent some more time on this issue and modified the v1 patch posted upthread a bit: it now resets InstallXLogFileSegmentActive only when the WAL source has switched to archive, not every time in archive mode. I'm attaching the v2 patch herewith; please review it further. Just for the record - there's another report of the assertion failure at [1]; many thanks to Kyotaro-san for providing consistently reproducible steps. [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v2-0001-Avoid-race-condition-in-resetting-XLogCtl-Install.patch Description: Binary data
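For readers following along, clearing the flag looks roughly like the fragment below, which mirrors how xlog.c already clears it when the walreceiver shuts down; where exactly the attached v2 patch places this inside the WaitForWALToBecomeAvailable() state machine is not reproduced here:

```
	/* sketch only; not the attached patch itself */
	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
	XLogCtl->InstallXLogFileSegmentActive = false;
	LWLockRelease(ControlFileLock);
```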
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
On Fri, Sep 09, 2022 at 09:06:37PM -0500, Justin Pryzby wrote: > #0 0x7fb8a22f31f7 in raise () from /lib64/libc.so.6 > #1 0x7fb8a22f48e8 in abort () from /lib64/libc.so.6 > #2 0x0098f9be in ExceptionalCondition > (conditionName=conditionName@entry=0x9fada4 "TransactionIdIsValid(xmax)", > errorType=errorType@entry=0x9ed217 "FailedAssertion", > fileName=fileName@entry=0x9fad90 "heapam_visibility.c", > lineNumber=lineNumber@entry=1353) at assert.c:69 > #3 0x004fd4d7 in HeapTupleSatisfiesVacuumHorizon > (htup=htup@entry=0x7ffc225a87e0, buffer=buffer@entry=5100, > dead_after=dead_after@entry=0x7ffc225a87d0) at heapam_visibility.c:1353 > #4 0x00501702 in heap_prune_satisfies_vacuum (buffer=5100, > tup=0x7ffc225a87e0, prstate=0x7ffc225a8a50) at pruneheap.c:504 > #5 heap_page_prune (relation=relation@entry=0x7fb8a50c3438, > buffer=buffer@entry=5100, vistest=vistest@entry=0xec7890 , > old_snap_xmin=, old_snap_ts=, > nnewlpdead=nnewlpdead@entry=0x7ffc225a964c, off_loc=off_loc@entry=0x0) at > pruneheap.c:351 > #6 0x00502326 in heap_page_prune_opt (relation=0x7fb8a50c3438, > buffer=buffer@entry=5100) at pruneheap.c:209 > #7 0x004f3ae3 in heapgetpage (sscan=sscan@entry=0x199b1d0, > page=page@entry=2892) at heapam.c:415 > #8 0x004f44c2 in heapgettup_pagemode (scan=scan@entry=0x199b1d0, > dir=, nkeys=0, key=0x0) at heapam.c:1120 > #9 0x004f5abe in heap_getnextslot (sscan=0x199b1d0, > direction=, slot=0x1967be8) at heapam.c:1352 > #10 0x006de16b in table_scan_getnextslot (slot=0x1967be8, > direction=ForwardScanDirection, sscan=) at > ../../../src/include/access/tableam.h:1046 > #11 SeqNext (node=node@entry=0x1967a38) at nodeSeqscan.c:80 > #12 0x006b109a in ExecScanFetch (recheckMtd=0x6de0f0 , > accessMtd=0x6de100 , node=0x1967a38) at execScan.c:133 > #13 ExecScan (node=0x1967a38, accessMtd=0x6de100 , > recheckMtd=0x6de0f0 ) at execScan.c:199 > #14 0x006add88 in ExecProcNodeInstr (node=0x1967a38) at > execProcnode.c:479 > #15 0x006a6182 in ExecProcNode (node=0x1967a38) at > ../../../src/include/executor/executor.h:259 > #16 ExecutePlan (execute_once=, dest=0x1988448, > direction=, numberTuples=0, sendTuples=true, > operation=CMD_SELECT, use_parallel_mode=, planstate=0x1967a38, > estate=0x19677a0) > at execMain.c:1636 > #17 standard_ExecutorRun (queryDesc=0x1996e80, direction=, > count=0, execute_once=) at execMain.c:363 > #18 0x7fb8960913bd in pgss_ExecutorRun (queryDesc=0x1996e80, > direction=ForwardScanDirection, count=0, execute_once=) at > pg_stat_statements.c:1010 > #19 0x7fb895c6f781 in explain_ExecutorRun (queryDesc=0x1996e80, > direction=ForwardScanDirection, count=0, execute_once=) at > auto_explain.c:320 > #20 0x0084976e in PortalRunSelect (portal=portal@entry=0x18fed30, > forward=forward@entry=true, count=0, count@entry=9223372036854775807, > dest=dest@entry=0x1988448) at pquery.c:924 > #21 0x0084af4f in PortalRun (portal=0x18fed30, > count=9223372036854775807, isTopLevel=, run_once= out>, dest=0x1988448, altdest=0x1988448, qc=0x7ffc225a9ce0) at pquery.c:768 > #22 0x0084679b in exec_simple_query (query_string=0x186d8a0 "SELECT > alarm_id, alarm_disregard FROM alarms WHERE alarm_ack_time IS NULL AND > alarm_clear_time IS NULL AND alarm_office = 'ETHERNET'") at postgres.c:1250 > #23 0x0084848a in PostgresMain (dbname=, > username=) at postgres.c:4581 > #24 0x00495afe in BackendRun (port=, port= out>) at postmaster.c:4504 > #25 BackendStartup (port=0x1894250) at postmaster.c:4232 > #26 ServerLoop () at postmaster.c:1806 > #27 0x007b0c60 in PostmasterMain (argc=argc@entry=3, > 
argv=argv@entry=0x1868280) at postmaster.c:1478 > #28 0x004976a6 in main (argc=3, argv=0x1868280) at main.c:202 > > I saved a copy of the data dir, but I'm going to have to bring the DB back > online soon. I don't know if I can get a page image easily since GDB is a bit > busted (won't show source code) and I can't remember how I fixed that last > time... Actually gdb seems to be fine, except it doesn't work well on the corefile. Breakpoint 4, HeapTupleSatisfiesVacuumHorizon (htup=htup@entry=0x7ffcbdc89970, buffer=buffer@entry=690, dead_after=dead_after@entry=0x7ffcbdc898fc) at heapam_visibility.c:1196 1196{ (gdb) n 1197HeapTupleHeader tuple = htup->t_data; (gdb) p *tuple $21 = {t_choice = {t_heap = {t_xmin = 3779887563, t_xmax = 133553150, t_field3 = {t_cid = 103, t_xvac = 103}}, t_datum = {datum_len_ = -515079733, datum_typmod = 133553150, datum_typeid = 103}}, t_ctid = {ip_blkid = { bi_hi = 0, bi_lo = 30348}, ip_posid = 11}, t_infomask2 = 8225, t_infomask = 4423, t_hoff = 32 ' ', t_bits = 0x2aaab37ebe0f "\303\377\367\377\001"} (gdb) n 1350Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)); (gdb) p xmax $20 = 0 (gdb) n 1353Assert(TransactionIdIsValid(xmax
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
Hi, That’s interesting, dig into it for a while but not too much progress. Maybe we could add some logs to print MultiXactMembers’ xid and status if xid is 0. Inside MultiXactIdGetUpdateXid() ``` nmembers = GetMultiXactIdMembers(xmax, &members, false, false); if (nmembers > 0) { int i; for (i = 0; i < nmembers; i++) { /* Ignore lockers */ if (!ISUPDATE_from_mxstatus(members[i].status)) continue; /* there can be at most one updater */ Assert(update_xact == InvalidTransactionId); update_xact = members[i].xid; // log here if xid is invalid #ifndef USE_ASSERT_CHECKING /* * in an assert-enabled build, walk the whole array to ensure * there's no other updater. */ break; #endif } pfree(members); } // and here if didn’t update update_xact at all (it shouldn’t happen as designed) return update_xact; ``` That will help a little if we can reproduce it. And could we see multixact reply in logs if db does recover? Regards, Zhang Mingli
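If someone wants to try the suggested logging, it might look roughly like this at the spot marked "log here if xid is invalid" (the elog level and wording are assumptions, not something from the thread):

```
	if (!TransactionIdIsValid(members[i].xid))
		elog(LOG, "MultiXactId %u: member %d has invalid xid %u (status %d)",
			 xmax, i, members[i].xid, (int) members[i].status);
```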
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
On Sat, Sep 10, 2022 at 12:07:30PM +0800, Zhang Mingli wrote: > That’s interesting, dig into it for a while but not too much progress. > > Maybe we could add some logs to print MultiXactMembers’ xid and status if xid > is 0. > > Inside MultiXactIdGetUpdateXid() > > ``` > nmembers = GetMultiXactIdMembers(xmax, &members, false, false); > > if (nmembers > 0) > { > int i; > > for (i = 0; i < nmembers; i++) > { > /* Ignore lockers */ > if (!ISUPDATE_from_mxstatus(members[i].status)) > continue; > > /* there can be at most one updater */ > Assert(update_xact == InvalidTransactionId); > update_xact = members[i].xid; > > // log here if xid is invalid > #ifndef USE_ASSERT_CHECKING > > /* > * in an assert-enabled build, walk the whole array to ensure > * there's no other updater. > */ > break; > #endif > } > > pfree(members); > } > // and here if didn’t update update_xact at all (it shouldn’t happen as > designed) Yeah. I added assertions for the above case inside the loop, and for this one, and this fails right before "return". TRAP: FailedAssertion("update_xact != InvalidTransactionId", File: "src/backend/access/heap/heapam.c", Line: 6939, PID: 4743) It looks like nmembers==2, both of which are lockers and being ignored. > And could we see multixact reply in logs if db does recover? Do you mean waldump or ?? BTW, after a number of sigabrt's, I started seeing these during recovery: < 2022-09-09 19:44:04.180 CDT >LOG: unexpected pageaddr 1214/AF0FE000 in log segment 0001121400B4, offset 1040384 < 2022-09-09 23:20:50.830 CDT >LOG: unexpected pageaddr 1214/CF65C000 in log segment 0001121400D8, offset 6668288 -- Justin
Re: pg15b4: FailedAssertion("TransactionIdIsValid(xmax)
The OOM was at: < 2022-09-09 19:34:24.043 CDT >LOG: server process (PID 14841) was terminated by signal 9: Killed The first SIGABRT was at: < 2022-09-09 19:37:31.650 CDT >LOG: server process (PID 7363) was terminated by signal 6: Aborted And I've just found a bunch of "interesting" logs between the two: < 2022-09-09 19:36:48.505 CDT telsasoft >ERROR: MultiXactId 133553150 has not been created yet -- apparent wraparound < 2022-09-09 19:36:48.505 CDT telsasoft >STATEMENT: SELECT alarm_id, alarm_disregard FROM alarms WHERE alarm_ack_time IS NULL AND alarm_clear_time IS NULL AND alarm_office = 'ETHERNET' < 2022-09-09 19:36:48.788 CDT telsasoft >ERROR: could not access status of transaction 3779944583 < 2022-09-09 19:36:48.788 CDT telsasoft >DETAIL: Could not read from file "pg_subtrans/E14D" at offset 98304: read too few bytes. ... < 2022-09-09 19:37:08.550 CDT telsasoft >ERROR: MultiXactId 133553156 has not been created yet -- apparent wraparound ... < 2022-09-09 19:37:13.792 CDT telsasoft >ERROR: could not access status of transaction 3779946306 < 2022-09-09 19:37:13.792 CDT telsasoft >DETAIL: Could not read from file "pg_subtrans/E14D" at offset 98304: read too few bytes. ... < 2022-09-09 19:37:19.682 CDT telsasoft >ERROR: could not access status of transaction 3779946306 < 2022-09-09 19:37:19.682 CDT telsasoft >DETAIL: Could not read from file "pg_subtrans/E14D" at offset 98304: read too few bytes. < 2022-09-09 19:37:19.682 CDT telsasoft >CONTEXT: while locking tuple (11755,5) in relation "alarms_null" ... < 2022-09-09 19:37:25.835 CDT telsasoft >ERROR: MultiXactId 133553154 has not been created yet -- apparent wraparound BTW, if I load the datadir backup to crash it, I see: t_infomask = 4423, which is: ; 0x1000+0x0100+0x0040+0x0004+0x0002+0x0001 4423 src/include/access/htup_details.h-#define HEAP_HASNULL 0x0001 /* has null attribute(s) */ src/include/access/htup_details.h-#define HEAP_HASVARWIDTH 0x0002 /* has variable-width attribute(s) */ src/include/access/htup_details.h-#define HEAP_HASEXTERNAL 0x0004 /* has external stored attribute(s) */ src/include/access/htup_details.h-#define HEAP_XMAX_EXCL_LOCK 0x0040 /* xmax is exclusive locker */ src/include/access/htup_details.h-#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */ src/include/access/htup_details.h:#define HEAP_XMAX_IS_MULTI 0x1000 /* t_xmax is a MultiXactId */ I was I could say what autovacuum had been doing during that period, but unfortunately I have "log_autovacuum_min_duration = 9s"...
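For anyone who wants to repeat the t_infomask decode above, a tiny stand-alone program does it (the flag values are copied from the htup_details.h lines quoted in that message):

```
#include <stdio.h>

#define HEAP_HASNULL			0x0001
#define HEAP_HASVARWIDTH		0x0002
#define HEAP_HASEXTERNAL		0x0004
#define HEAP_XMAX_EXCL_LOCK		0x0040
#define HEAP_XMIN_COMMITTED		0x0100
#define HEAP_XMAX_IS_MULTI		0x1000

int
main(void)
{
	unsigned int infomask = 4423;	/* 0x1147, as seen in the page image */

	printf("0x%04x:%s%s%s%s%s%s\n", infomask,
		   (infomask & HEAP_HASNULL) ? " HASNULL" : "",
		   (infomask & HEAP_HASVARWIDTH) ? " HASVARWIDTH" : "",
		   (infomask & HEAP_HASEXTERNAL) ? " HASEXTERNAL" : "",
		   (infomask & HEAP_XMAX_EXCL_LOCK) ? " XMAX_EXCL_LOCK" : "",
		   (infomask & HEAP_XMIN_COMMITTED) ? " XMIN_COMMITTED" : "",
		   (infomask & HEAP_XMAX_IS_MULTI) ? " XMAX_IS_MULTI" : "");
	return 0;
}
```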