Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2022-01-28 Thread Bharath Rupireddy
On Thu, Jan 27, 2022 at 3:23 PM Sergey Dudoladov
 wrote:
> > Here's v2, rebased onto the latest master.
>
> I've reviewed this patch. The patch builds against the master (commit
> e9d4001ec592bcc9a3332547cb1b0211e8794f38) and passes all the tests.
> The patch does what it intends to do, namely store the kind of the
> last checkpoint in the control file and display it in the output of
> the pg_control_checkpoint() function and pg_controldata utility.
> I did not test it with restartpoints though. Speaking of the torn
> writes, the size of the control file with this patch applied does not
> exceed 8Kb.

Thanks for the review.

> A few code comments:
>
> + char ckpt_kind[2 * MAXPGPATH];
>
> I don't completely understand why 2 * MAXPGPATH is used here for the
> buffer size.
> [1] defines MAXPGPATH to be standard size of a pathname buffer. How
> does it relate to ckpt_kind ?

I was using it loosely. Changed in the v3 patch.

> + ControlFile->checkPointKind = 0;
>
> It is worth a comment that 0 is unknown, as for instance in [2]

We don't even need ControlFile->checkPointKind = 0; because
InitControlFile already does memset(ControlFile, 0, sizeof(ControlFileData));
hence I removed this.

> + (flags == 0) ? "unknown" : "",
>
> That reads as if this patch would introduce a new "unknown" checkpoint state.
> Why have it here at all if after for example initdb the kind is "shutdown" ?

Yeah, even LogCheckpointStart doesn't have anything "unknown" so removed it.

> The space at the strings' end (as in "wait " or "immediate ")
> introduces extra whitespace in the output of pg_control_checkpoint().
> A similar check at [3] places whitespace differently; that arrangement
> of whitespace should remove the issue.

Changed.

> >   Datum   values[18];
> >   boolnulls[18];
>
> You forgot to expand these arrays.

Not sure what you meant here. The size of the array is already 19 in v2.

> This breaks checkpoint file format. Need to bump PG_CONTROL_VERSION,
> and pg_upgrade need to treat the change.

I added a note in the commit message to bump cat version so that the
committer will take care of it.

> -  proallargtypes => 
> '{pg_lsn,pg_lsn,text,int4,int4,bool,text,oid,xid,xid,xid,oid,xid,xid,oid,xid,xid,timestamptz}',
> +  proallargtypes => 
> '{pg_lsn,pg_lsn,text,int4,int4,bool,text,oid,xid,xid,xid,oid,xid,xid,oid,xid,xid,timestamptz,text}',
>
> I think the additional column should be text[] instead of text, but
> not sure.

We are preparing a single string of all the checkpoint kinds and
outputting it as a text column, so we don't need text[].

Regards,
Bharath Rupireddy.


v3-0001-add-last-checkpoint-kind-to-pg_control-file.patch
Description: Binary data


Re: Printing backtrace of postgres processes

2022-01-28 Thread Bharath Rupireddy
On Thu, Jan 27, 2022 at 10:45 AM vignesh C  wrote:
>
> On Wed, Jan 26, 2022 at 11:07 AM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Jan 25, 2022 at 12:00 PM vignesh C  wrote:
> > > Thanks for the comments, attached v17 patch has the fix for the same.
> >
> > Have a minor comment on v17:
> >
> > can we modify the elog(LOG, ...) call to the new-style ereport(LOG, ...)?
> > + elog(LOG_SERVER_ONLY, "current backtrace:%s", errtrace.data);
> >
> > /*--
> >  * New-style error reporting API: to be used in this way:
> >  *  ereport(ERROR,
> >  *  errcode(ERRCODE_UNDEFINED_CURSOR),
> >  *  errmsg("portal \"%s\" not found", stmt->portalname),
> >  *  ... other errxxx() fields as needed ...);
> >  *
>
> Attached v18 patch has the changes for the same.

Thanks. The v18 patch LGTM. I'm not sure if the CF bot failure is
related to the patch -  https://cirrus-ci.com/task/5633364051886080

Regards,
Bharath Rupireddy.




Re: Column Filtering in Logical Replication

2022-01-28 Thread Peter Smith
Here are some review comments for the v17-0001 patch.

~~~

1. Commit message

If no column list is specified, all the columns are replicated, as
previously

Missing period (.) at the end of that sentence.

~~~

2. doc/src/sgml/catalogs.sgml

+  
+   This is an array of values that indicates which table columns are
+   part of the publication.  For example a value of 1 3
+   would mean that the first and the third table columns are published.
+   A null value indicates that all attributes are published.
+  

Missing comma:
"For example" --> "For example,"

Terms:
The text seems to jump between "columns" and "attributes". Perhaps,
for consistency, that last sentence should say: "A null value
indicates that all columns are published."

~~~

3. doc/src/sgml/protocol.sgml

 
-Next, the following message part appears for each column
(except generated columns):
+Next, the following message part appears for each column (except
+generated columns and other columns that don't appear in the column
+filter list, for tables that have one):
 

Perhaps that can be expressed more simply, like:

Next, the following message part appears for each column (except
generated columns and other columns not present in the optional column
filter list):

~~~

4. doc/src/sgml/ref/alter_publication.sgml

+ALTER PUBLICATION name
ALTER TABLE publication_object SET COLUMNS { (
name [, ...] ) | ALL }

The syntax chart looks strange because there is already a "TABLE" and
a column_name list within the "publication_object" definition, so do
ALTER TABLE and publication_object co-exist?
The current documentation suggests that nonsense like the below is valid:
ALTER PUBLICATION mypublication ALTER TABLE TABLE t1 (a,b,c) SET
COLUMNS (a,b,c);

--

But more fundamentally, I don't see why any new syntax is even needed at all.

Instead of:
ALTER PUBLICATION mypublication ALTER TABLE users SET COLUMNS
(user_id, firstname, lastname);
Why not just:
ALTER PUBLICATION mypublication ALTER TABLE users (user_id, firstname,
lastname);

Then, if the altered table defines a *different* column list then it
would be functionally equivalent to whatever your SET COLUMNS is doing
now. AFAIK this is how the Row-Filter [1] works, so that altering an
existing table to have a different Row-Filter just overwrites that
table's filter. IMO the Col-Filter behaviour should work the same as
that - "SET COLUMNS" is redundant.

~~~

5. doc/src/sgml/ref/alter_publication.sgml

-TABLE [ ONLY ] table_name [ * ] [, ... ]
+TABLE [ ONLY ] table_name [ * ] [ ( column_name, [, ... ] ) ] [, ... ]

That extra comma after the "column_name" seems wrong because there is
one already in "[, ... ]".

~~~

6. doc/src/sgml/ref/create_publication.sgml

-TABLE [ ONLY ] table_name [ * ] [, ... ]
+TABLE [ ONLY ] table_name [ * ] [ ( column_name, [, ... ] ) ] [, ... ]

(Same as comment #5).
That extra comma after the "column_name" seems wrong because there is
one already in "[, ... ]".

~~~

7. doc/src/sgml/ref/create_publication.sgml

+ 
+  When a column list is specified, only the listed columns are replicated;
+  any other columns are ignored for the purpose of replication through
+  this publication.  If no column list is specified, all columns of the
+  table are replicated through this publication, including any columns
+  added later.  If a column list is specified, it must include the replica
+  identity columns.
+ 

Suggest to re-word this a bit simpler:

e.g.
- "listed columns" --> "named columns"
- I don't think it is necessary to say the unlisted columns are ignored.
- I don't think it is necessary to say "through this publication"

AFTER
When a column list is specified, only the named columns are replicated.
If no column list is specified, all columns of the table are replicated,
including any columns added later. If a column list is specified, it must
include the replica identity columns.

~~~

8. doc/src/sgml/ref/create_publication.sgml

Consider adding another example showing a CREATE PUBLICATION which has
a column list.

~~~

9. src/backend/catalog/pg_publication.c - check_publication_add_relation

 /*
- * Check if relation can be in given publication and throws appropriate
- * error if not.
+ * Check if relation can be in given publication and that the column
+ * filter is sensible, and throws appropriate error if not.
+ *
+ * targetcols is the bitmapset of attribute numbers given in the column list,
+ * or NULL if it was not specified.
  */

Typo: "targetcols" --> "columns" ??

~~~

10. src/backend/catalog/pg_publication.c - check_publication_add_relation

+
+ /* Make sure the column list checks out */
+ if (columns != NULL)
+ {

Perhaps "checks out" could be worded better.

~~~

11. src/backend/catalog/pg_publication.c - check_publication_add_relation

+ /* Make sure the column list checks out */
+ if (columns != NULL)
+ {
+ /*
+ * Even if the user listed all columns in 

Re: RFC: Logging plan of the running query

2022-01-28 Thread torikoshia

On 2022-01-27 20:18, Fujii Masao wrote:

Here are another review comments:


Thanks for reviewing!


+LOG:  plan of the query running on backend with PID 17793 is:

This seems not the same as what actually logged.


Modified.


+   ereport(WARNING,
+   (errmsg("PID %d is not a PostgreSQL server 
process", pid)));

Like commit 7fa945b857 changed, this warning message should be "PID %d
is not a PostgreSQL backend process"?


Modified.

+	if (SendProcSignal(pid, PROCSIG_LOG_CURRENT_PLAN, InvalidBackendId) < 
0)


proc->backendId should be specified instead of InvalidBackendId, to
speed up the processing in SendProcSignal()?


Agreed. Modified.

+	PROCSIG_LOG_CURRENT_PLAN, /* ask backend to log plan of the current 
query */

+volatile sig_atomic_t LogCurrentPlanPending = false;
+extern void HandleLogCurrentPlanInterrupt(void);
+extern void ProcessLogCurrentPlanInterrupt(void);

Isn't it better to use the names that are more consistent with the
function name, i.e., pg_log_query_plan? For example,
PROCSIG_LOG_QUERY_PLAN instead of PROCSIG_LOG_CURRENT_PLAN?


Agreed.
I removed 'current' from the variable and function names and used 
'query' instead.



+   ereport(LOG_SERVER_ONLY,
+   errmsg("backend with PID %d is not running a 
query",
+   MyProcPid));

errhidestmt(true) and errhidecontext(true) need to be added, don't
they? Otherwise, for example, if pg_log_query_plan() is executed after
debug_query_string is set but before ActiveQueryDesc is set, STATEMENT
message would be output even though the message saying "not running a
query" is output. Which seems confusing.


Agreed. Added errhidestmt(true) and errhidecontext(true).


+   hash_seq_init(&status, GetLockMethodLocalHash());
+   while ((locallock = (LOCALLOCK *) hash_seq_search(&status)) != NULL)
+   {
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_PAGE)

Why did you use the search for local lock hash instead of
IsPageLockHeld flag variable, to check whether a page lock is held or
not? Because there is the corner case where the interrupt is processed
after the local lock is registered into the hash but before
IsPageLockHeld is enabled?


As far as I read CheckAndSetLockHeld(), IsPageLockHeld can be used only 
when USE_ASSERT_CHECKING is enabled.
Since removing USE_ASSERT_CHECKING from CheckAndSetLockHeld() would 
impact performance on every grant/removal of a local lock, I used the 
search of the local lock hash.



There is the case where the request to log a query plan is skipped
even while the target backend is running a query. If this happens,
users can just retry pg_log_query_plan(). These things should be
documented?


Agreed.
Added the following:

  +Note that there is the case where the request to log a query
  +plan is skipped even while the target backend is running a
  +query due to lock conflict avoidance.
  +If this happens, users can just retry pg_log_query_plan().



+   ereport(LOG_SERVER_ONLY,
+   errmsg("backend with PID %d is holding a page lock. 
Try again",
+   MyProcPid));

It seems more proper to output this message in DETAIL or HINT,
instead. So how about something like the following messages?

LOG: could not log the query plan
DETAIL: query plan cannot be logged while page level lock is being held
HINT: Try pg_log_query_plan() after a few 


Agreed.
I felt the HINT message 'after a few ...' was difficult to phrase, and 
wrote it as below.


| HINT: Retrying pg_log_query_plan() might succeed since the duration 
of page-level locks is usually short


What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

From 5fcb62779efb0a389fc9448457b7e3393a2f6d3f Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi
Date: Fri, 28 Jan 2022 16:48:50 +0900
Subject: [PATCH v18] Add function to log the plan of the query currently
 running on the backend with specified process ID along with the untruncated
 query string.

Currently, we have to wait for the query execution to finish
to check its plan. This is not so convenient when
investigating long-running queries in production environments
where we cannot use debuggers.
To improve this situation, this patch adds
pg_log_query_plan() function that requests to log the
plan of the specified backend process.

By default, only superusers are allowed to request to log the
plans, because allowing any user to issue this request at an
unbounded rate could cause lots of log messages, which can
lead to denial of service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its plan at LOG_SERVER_ONLY level, so
that these plans will appear in the server log but not be sent
to the client.

Reviewed-by: Bharath Rupireddy, Fujii Masao, Dilip Kumar, Masahiro Ikeda, Ekaterina Sokolova, Justin Pryzby
---
 d

pg_log_backend_memory_contexts - remove unnecessary test case

2022-01-28 Thread Bharath Rupireddy
Hi,

While reviewing another patch [1], I found an unnecessary test case of
pg_log_backend_memory_contexts, that is, executing the function after
granting the permission to the role and checking the permissions with
has_function_privilege. IMO, it's unnecessary as we have test
cases covering the function execution just above it. With the
removal of these test cases, the regression test timings can be
improved (by the tiniest of tiny amounts of time, though).
Attaching a small patch.

Thoughts?

[1] 
https://www.postgresql.org/message-id/04596d19-1bb6-4865-391d-4e63c9b3317f%40oss.nttdata.com

Regards,
Bharath Rupireddy.


v1-0001-remove-unnecessary-test-case.patch
Description: Binary data


Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2022-01-28 Thread Julien Rouhaud
Hi,

On Fri, Jan 28, 2022 at 01:49:19PM +0530, Bharath Rupireddy wrote:
> On Thu, Jan 27, 2022 at 3:23 PM Sergey Dudoladov
>  wrote:
> > > Here's v2, rebased onto the latest master.
> >
> > I've reviewed this patch. The patch builds against the master (commit
> > e9d4001ec592bcc9a3332547cb1b0211e8794f38) and passes all the tests.
> > The patch does what it intends to do, namely store the kind of the
> > last checkpoint in the control file and display it in the output of
> > the pg_control_checkpoint() function and pg_controldata utility.
> > I did not test it with restartpoints though. Speaking of the torn
> > writes, the size of the control file with this patch applied does not
> > exceed 8Kb.
> 
> Thanks for the review.
> 
> > A few code comments:
> >
> > + char ckpt_kind[2 * MAXPGPATH];
> >
> > I don't completely understand why 2 * MAXPGPATH is used here for the
> > buffer size.
> > [1] defines MAXPGPATH to be standard size of a pathname buffer. How
> > does it relate to ckpt_kind ?
> 
> I was using it loosely. Changed in the v3 patch.
> 
> > + ControlFile->checkPointKind = 0;
> >
> > It is worth a comment that 0 is unknown, as for instance in [2]
> 
> We don't even need ControlFile->checkPointKind = 0; because
> InitControlFile already does memset(ControlFile, 0, sizeof(ControlFileData));
> hence I removed this.
> 
> > + (flags == 0) ? "unknown" : "",
> >
> > That reads as if this patch would introduce a new "unknown" checkpoint 
> > state.
> > Why have it here at all if after for example initdb the kind is "shutdown" ?
> 
> Yeah, even LogCheckpointStart doesn't have anything "unknown" so removed it.
> 
> > The space at the strings' end (as in "wait " or "immediate ")
> > introduces extra whitespace in the output of pg_control_checkpoint().
> > A similar check at [3] places whitespace differently; that arrangement
> > of whitespace should remove the issue.
> 
> Changed.
> 
> > >   Datum   values[18];
> > >   boolnulls[18];
> >
> > You forgot to expand these arrays.
> 
> Not sure what you meant here. The size of the array is already 19 in v2.
> 
> > This breaks checkpoint file format. Need to bump PG_CONTROL_VERSION,
> > and pg_upgrade need to treat the change.
> 
> I added a note in the commit message to bump cat version so that the
> committer will take care of it.

PG_CONTROL_VERSION is different from catversion.  You should update it in this
patch.

But Horiguchi-san was also mentioning that pg_upgrade/controldata.c needs some
modifications if you change the format (thus the requirement to bump
PG_CONTROL_VERSION).

Why are you defining CHECKPOINT_KIND_TEXT_LENGTH twice?  You
should just define it in some sensible header used by both files, or better
have a new function to take care of that rather than having the code
duplicated.

Also, you still didn't fix the possible flag upgrade issue.




Re: refactoring basebackup.c

2022-01-28 Thread Dipesh Pandit
Hi,

> I made a pass over these patches today and made a bunch of minor
> corrections. New version attached. The two biggest things I changed
> are (1) s/gzip_extractor/gzip_compressor/, because I feel like you
> extract an archive like a tarfile, but that is not what is happening
> here, this is not an archive and (2) I took a few bits of out of the
> test case that didn't seem to be necessary. There wasn't any reason
> that I could see why testing for PG_VERSION needed to be skipped when
> the compression method is 'none', so my first thought was to just take
> out the 'if' statement around that, but then after more thought that
> test and the one for pg_verifybackup are certainly going to fail if
> those files are not present, so why have an extra test? It might make
> sense if we were only conditionally able to run pg_verifybackup and
> wanted to have some test coverage even when we can't, but that's not
> the case here, so I see no point.

Thanks. This makes sense.

+#ifdef HAVE_LIBZ
+   /*
+* If the user has requested a server compressed archive along with
archive
+* extraction at client then we need to decompress it.
+*/
+   if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
+   compressloc == COMPRESS_LOCATION_SERVER)
+   streamer = bbstreamer_gzip_decompressor_new(streamer);
+#endif

I think the HAVE_LIBZ check is not required in pg_basebackup.c
while creating a new gzip writer/decompressor. This check is already
in place in bbstreamer_gzip_writer_new() and
bbstreamer_gzip_decompressor_new(), and those throw an error in case
the build does not have the required library support. I have removed
this check from pg_basebackup.c and updated a delta patch. The patch
can be applied on top of the v5 patch.

Thanks,
Dipesh
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 46ab60d..1f81bbf 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1199,7 +1199,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 			compressloc != COMPRESS_LOCATION_CLIENT)
 			streamer = bbstreamer_plain_writer_new(archive_filename,
    archive_file);
-#ifdef HAVE_LIBZ
 		else if (compressmethod == COMPRESSION_GZIP)
 		{
 			strlcat(archive_filename, ".gz", sizeof(archive_filename));
@@ -1207,7 +1206,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
   archive_file,
   compresslevel);
 		}
-#endif
 		else
 		{
 			Assert(false);		/* not reachable */
@@ -1256,7 +1254,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	else if (expect_unterminated_tarfile)
 		streamer = bbstreamer_tar_terminator_new(streamer);
 
-#ifdef HAVE_LIBZ
 	/*
 	 * If the user has requested a server compressed archive along with archive
 	 * extraction at client then we need to decompress it.
@@ -1264,7 +1261,6 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
 			compressloc == COMPRESS_LOCATION_SERVER)
 		streamer = bbstreamer_gzip_decompressor_new(streamer);
-#endif
 
 	/* Return the results. */
 	*manifest_inject_streamer_p = manifest_inject_streamer;


Re: Add header support to text format and matching feature

2022-01-28 Thread Peter Eisentraut

On 31.12.21 18:36, Rémi Lapeyre wrote:

Here’s an updated version of the patch that takes into account the changes in 
d1029bb5a2. The actual code is the same as v10 which was already marked as 
ready for committer.


I have committed the 0001 patch.  I will work on the 0002 patch next.

I notice in the 0002 patch that there is no test case for the error 
"wrong header for column \"%s\": got \"%s\"", which I think is really 
the core functionality of this patch.  So please add that.


I wonder whether the header matching should be a separate option from 
the HEADER option.  The option parsing in this patch is quite 
complicated and could be simpler if there were two separate options.  It 
appears this has been mentioned in the thread but not fully discussed.





Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-01-28 Thread Kyotaro Horiguchi
At Thu, 27 Jan 2022 10:47:20 -0800, Nathan Bossart  
wrote in 
> On Thu, Jan 27, 2022 at 02:06:40PM +0900, Michael Paquier wrote:
> > So, I have been checking this idea in details, and spotted what looks
> > like one issue in CreateRestartPoint(), as of:
> > /*
> >  * Update pg_control, using current time.  Check that it still shows
> >  * DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do 
> > nothing;
> >  * this is a quick hack to make sure nothing really bad happens if 
> > somehow
> >  * we get here after the end-of-recovery checkpoint.
> >  */
> > LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> > if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
> > ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
> > 
> > This change increases the window making this code path reachable if an
> > end-of-recovery checkpoint is triggered but not finished at the end of
> > recovery (possible of course at the end of crash recovery, but
> > DB_IN_ARCHIVE_RECOVERY is also possible when !IsUnderPostmaster with a
> > promotion request), before updating ControlFile->checkPointCopy at the
> > end of the checkpoint because the state could still be
> > DB_IN_ARCHIVE_RECOVERY.  The window is wider the longer the
> > end-of-recovery checkpoint.  And this would be the case of an instance
> > restarted, when a restart point is created.
> 
> I wonder if this is actually a problem in practice.  IIUC all of the values
> updated in this block should be reset at the end of the end-of-recovery
> checkpoint.  Is the intent of the quick hack to prevent those updates after
> an end-of-recovery checkpoint completes, or is it trying to block them
> after one begins?  It looks like the control file was being updated to
> DB_SHUTDOWNING at the beginning of end-of-recovery checkpoints when that
> change was first introduced (2de48a8), so I agree that we'd better be
> careful with this change.

Putting aside the readiness of the patch, I think what the patch
intends is to avoid strange state transitions happening during the
end-of-recovery checkpoint.

So, I'm confused...

The end-of-recovery checkpoint is requested with CHECKPOINT_WAIT, which
seems to me to mean the state is always DB_IN_ARCHIVE_RECOVERY while
the checkpoint is running?  If that is correct, and the server is killed
during the end-of-recovery checkpoint, the state stays
DB_IN_ARCHIVE_RECOVERY instead of DB_SHUTDOWNING or DB_SHUTDOWNED.
AFAICS there's no difference between the first two at the next startup.
I don't think the DB_SHUTDOWNED case is worth considering here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Suppressing useless wakeups in walreceiver

2022-01-28 Thread Thomas Munro
On Fri, Jan 28, 2022 at 8:26 PM Kyotaro Horiguchi
 wrote:
> At Thu, 27 Jan 2022 23:50:04 +1300, Thomas Munro  
> wrote in
> > While working on WaitEventSet-ifying various codepaths, I found it
> > strange that walreceiver wakes up 10 times per second while idle.
> > Here's a draft patch to compute the correct sleep time.
>
I agree with the objective.  However, I feel the patch makes the code
somewhat less readable, maybe because WalRcvComputeNextWakeup hides the
timeout details.  Of course, others might think differently.

Thanks for looking!

The reason why I put the timeout computation into a function is
because there are about 3 places you need to do it: at the beginning,
when rescheduling for next time, and when the configuration file
changes and you might want to recompute them.  The logic to decode the
GUCs and compute the times would be duplicated.  I have added a
comment about that, and tried to make the code clearer.

Do you have another idea?

> -  ping_sent = false;
> -  XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1,
> - startpointTLI);
> +  WalRcvComputeNextWakeup(&state,
> +  WALRCV_WAKEUP_TIMEOUT,
> +  last_recv_timestamp);
> +  WalRcvComputeNextWakeup(&state,
> +  WALRCV_WAKEUP_PING,
> +  last_recv_timestamp);
>
> The calculated target times are not used within the closest loop and
> the loop is quite busy. WALRCV_WAKEUP_PING is the replacement of
> ping_sent, but I think the computation of both two target times would
> be better done after the loop only when the "if (len > 0)" block was
> passed.

Those two wakeup times should only be adjusted when data is received.
The calls should be exactly where last_recv_timestamp is set in
master, no?

> -  XLogWalRcvSendReply(false, false);
> +  XLogWalRcvSendReply(&state, GetCurrentTimestamp(),
> +false, false);
>
> The GetCurrentTimestamp() is the same as last_recv_timestamp when the
> recent recv() had any bytes received. So we can avoid the call to
> GetCurrentTimestamp in that case. If we do the change above, the same
> flag notifies the necessity of a separate GetCurrentTimestamp().

Yeah, I agree that we should remove more GetCurrentTimestamp() calls.
Here's another idea: let's remove last_recv_timestamp, and just use
'now', being careful to update it after sleeping and in the receive
loop (in case it gets busy for a long time), so that it's always fresh
enough.

> I understand the reason for startpointTLI being stored in WalRcvInfo
> but don't understand about primary_has_standby_xmin. It is just moved
> from a static variable of XLogWalRcvSendHSFeedback to a struct
> member that is modified and read only by the same function.

Yeah, this was unnecessary refactoring.  I removed that change (we
could look into moving more state into the new state object instead of
using static locals and globals, but that can be for another thread, I
got carried away...)

> The enum item WALRCV_WAKEUP_TIMEOUT looks somewhat uninformative. How
> about naming it WALRCV_WAKEUP_TERMINATE (or WALRCV_WAKEUP_TO_DIE)?

Ok, let's try TERMINATE.

> WALRCV_WAKEUP_TIMEOUT and WALRCV_WAKEUP_PING don't fire when
> wal_receiver_timeout is zero.  In that case we should not set those
> timeouts at all to avoid spurious wakeups?

Oh, yeah, I forgot to handle wal_receiver_timeout = 0.  Fixed.
From f7872d44d955e8b40f040487968ad366c16fb54f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 27 Jan 2022 21:43:17 +1300
Subject: [PATCH v2] Suppress useless wakeups in walreceiver.

Instead of waking up 10 times per second to check for various timeout
conditions, keep track of when we next have periodic work to do.

Reviewed-by: Kyotaro Horiguchi 
Discussion: https://postgr.es/m/CA%2BhUKGJGhX4r2LPUE3Oy9BX71Eum6PBcS8L3sJpScR9oKaTVaA%40mail.gmail.com
---
 src/backend/replication/walreceiver.c | 277 --
 1 file changed, 169 insertions(+), 108 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..d90f5aefef 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -94,8 +94,6 @@ bool		hot_standby_feedback;
 static WalReceiverConn *wrconn = NULL;
 WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 
-#define NAPTIME_PER_CYCLE 100	/* max sleep time between cycles (100ms) */
-
 /*
  * These variables are used similarly to openLogFile/SegNo,
  * but for walreceiver to write the XLOG. recvFileTLI is the TimeLineID
@@ -115,6 +113,27 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }			LogstreamResult;
 
+/*
+ * Reasons to wake up and perform periodic tasks.
+ */
+typedef enum WalRcvWakeupReason
+{
+	WALRCV_WAKEUP_TERMINATE,
+	WALRCV_WAKEUP_PING,
+	WALRCV_WAKEUP_REPLY,
+	WALRCV_WAKEUP_HSFEEDBACK,
+	NUM_WALRCV_WAKEUPS
+} WalRcvWakeu

Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2022-01-28 Thread Denis Laxalde

Hi,

Julien Rouhaud wrote:

On Wed, 27 Jan 2021 11:25:11 +0100
Denis Laxalde  wrote:


Andres Freund wrote:

b) when in binary upgrade mode / -b, error out on all wal writes in
sessions that don't explicitly set a session-level GUC to allow
writes.



It should be enough to add an additional test in XLogInsertAllowed() with some 
new
variable that is set when starting in binary upgrade mode, and a new function
to disable it that will be emitted by pg_dump / pg_dumpall in binary upgrade
mode.


I tried that simple change first:

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c

index dfe2a0bcce..8feab0cb96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8498,6 +8498,9 @@ HotStandbyActiveInReplay(void)
 bool
 XLogInsertAllowed(void)
 {
+   if (IsBinaryUpgrade)
+   return false;
+
/*
 * If value is "unconditionally true" or "unconditionally 
false", just
 * return it.  This provides the normal fast path once recovery 
is known



But then, pg_upgrade's tests (make -C src/bin/pg_upgrade/ check) fail at 
vacuumdb but not during pg_dumpall:


$ cat src/bin/pg_upgrade/pg_upgrade_utility.log
-
  pg_upgrade run on Fri Jan 28 10:37:36 2022
-

command: 
"/home/denis/src/pgsql/build/tmp_install/home/denis/.local/pgsql/bin/pg_dumpall" 
--host /home/denis/src/pgsql/build/src/bin/pg_upgrade --port 51696 
--username denis --globals-only --quote-all-identifiers --binary-upgrade 
 -f pg_upgrade_dump_globals.sql >> "pg_upgrade_utility.log" 2>&1



command: 
"/home/denis/src/pgsql/build/tmp_install/home/denis/.local/pgsql/bin/vacuumdb" 
--host /home/denis/src/pgsql/build/src/bin/pg_upgrade --port 51696 
--username denis --all --analyze  >> "pg_upgrade_utility.log" 2>&1

vacuumdb: vacuuming database "postgres"
vacuumdb: error: processing of database "postgres" failed: PANIC: 
cannot make new WAL entries during recovery



In contrast with pg_dump/pg_dumpall, vacuumdb has no --binary-upgrade 
flag, so it does not seem possible to use a special GUC setting to allow 
WAL writes in that vacuumdb session at the moment.
Should we add --binary-upgrade to vacuumdb as well? Or am I going in the 
wrong direction?



Thanks,
Denis




Re: row filtering for logical replication

2022-01-28 Thread Greg Nancarrow
On Fri, Jan 28, 2022 at 2:26 PM houzj.f...@fujitsu.com
 wrote:
>
> Attach the V72 patch set which did the above changes.
>

Thanks for updating the patch set.
One thing I noticed, in the patch commit comment it says:

Psql commands \dRp+ and \d will display any row filters.

However, "\d" by itself doesn't show any row filter information, so I
think it should say:

Psql commands "\dRp+" and "\d " will display any row filters.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: TAP test to cover "EndOfLogTLI != replayTLI" case

2022-01-28 Thread Amul Sul
On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier  wrote:
>
> On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:
> > I found the cause of the test failing on Windows -- it is due to the
> > custom archive command setting, which wasn't setting the correct Windows
> > archive directory path.
>
> After reading this patch and this thread, I have noticed that you are
> testing the same thing as Heikki here, patch 0001:
> https://www.postgresql.org/message-id/52bc9ccd-8591-431b-0086-15d9acf25...@iki.fi
>
> The patch sent on the other thread has a better description and shape,
> so perhaps we'd better drop what is posted here in favor of the other
> version.  Thoughts?

Yes, I do agree with you. Thanks for the comparison.

Regards,
Amul




BeginCopyTo - remove switching to old memory context in between COPY TO command processing

2022-01-28 Thread Bharath Rupireddy
Hi,

While reviewing the patch at [1], it was found that the memory
context switch back to oldcontext from cstate->copycontext in the middle
of BeginCopyTo is not correct, because the intention of copycontext is
to be used throughout the copy command processing. It looks like a thinko
from commit c532d1 [2]. Attaching a small patch to remove this.

Thoughts?

[1] https://www.postgresql.org/message-id/539760.1642610324%40sss.pgh.pa.us
[2]
commit c532d15dddff14b01fe9ef1d465013cb8ef186df
Author: Heikki Linnakangas 
Date:   Mon Nov 23 10:50:50 2020 +0200

Split copy.c into four files.

Regards,
Bharath Rupireddy.


v1-0001-remove-switching-to-old-memory-context-in-between.patch
Description: Binary data


Re: refactoring basebackup.c

2022-01-28 Thread tushar

On 1/27/22 11:12 PM, Robert Haas wrote:

Well what's weird here is that you are using both --gzip and also
--compress. Those both control the same behavior, so it's a surprising
idea to specify both. But I guess if someone does, we should make the
second one fully override the first one. Here's a patch to try to do
that.

right, the current behavior was  -

[edb@centos7tushar bin]$ ./pg_basebackup  -t server:/tmp/y101 --gzip -Z 
none  -Xnone

pg_basebackup: error: cannot use compression level with method none
Try "pg_basebackup --help" for more information.

and even this did not match the PG v14 behavior,
e.g.
 ./pg_basebackup -Ft -z -Z none  -D /tmp/test1  (working in PG v14 but 
throwing the above error on PG HEAD)


so somewhere we were breaking backward compatibility.

now with your patch -this seems working fine

[edb@centos7tushar bin]$ ./pg_basebackup  -t server:/tmp/y101 --gzip -Z 
none  -Xnone
NOTICE:  WAL archiving is not enabled; you must ensure that all required 
WAL segments are copied through other means to complete the backup

[edb@centos7tushar bin]$ ls /tmp/y101
backup_manifest  base.tar

OR

[edb@centos7tushar bin]$  ./pg_basebackup  -t server:/tmp/y0p -Z none  
-Xfetch -z

[edb@centos7tushar bin]$ ls /tmp/y0p
backup_manifest  base.tar.gz

but what about server-gzip:0? should it allow compressing the directory?

[edb@centos7tushar bin]$  ./pg_basebackup  -t server:/tmp/1 
--compress=server-gzip:0  -Xfetch

[edb@centos7tushar bin]$ ls /tmp/1
backup_manifest  base.tar.gz

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Server-side base backup: why superuser, not pg_write_server_files?

2022-01-28 Thread Dagfinn Ilmari Mannsåker
Hi Hackers,

I just noticed that the new server-side base backup feature requires
superuser privileges (which is only documented in the pg_basebackup
manual, not in the streaming replication protocol specification).

Isn't this the kind of thing the pg_write_server_files role was created
for, so that it can be delegated to a non-superuser?

- ilmari




Re: BeginCopyTo - remove switching to old memory context in between COPY TO command processing

2022-01-28 Thread Japin Li


On Fri, 28 Jan 2022 at 18:11, Bharath Rupireddy 
 wrote:
> Hi,
>
> While reviewing the patch at [1], it was found that the memory
> context switch back to oldcontext from cstate->copycontext in the middle
> of BeginCopyTo is not correct, because the intention of copycontext is
> to be used throughout the copy command processing. It looks like a thinko
> from commit c532d1 [2]. Attaching a small patch to remove this.
>
> Thoughts?
>

Thanks for the patch!  Tested and passed regression tests.  LGTM.

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: BeginCopyTo - remove switching to old memory context in between COPY TO command processing

2022-01-28 Thread Michael Paquier
On Fri, Jan 28, 2022 at 03:41:11PM +0530, Bharath Rupireddy wrote:
> While reviewing the patch at [1], it was found that the memory
> context switch back to oldcontext from cstate->copycontext in the middle
> of BeginCopyTo is not correct, because the intention of copycontext is
> to be used throughout the copy command processing. It looks like a thinko
> from commit c532d1 [2]. Attaching a small patch to remove this.
> 
> Thoughts?

I think that you are right.  Before c532d15d, BeginCopy() was
internally doing one switch with copycontext and the current memory
context.  This commit has just moved the internals of BeginCopy()
into BeginCopyTo(), so this looks like a copy-paste error to me.  I'll
go fix it, thanks for the report!
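
For readers following the thread, here is a simplified, non-runnable sketch of
the flow in question (function and field names follow the discussion above;
everything else is elided):

```c
CopyToState
BeginCopyTo(...)
{
    /* all COPY-lifetime state is meant to live in copycontext */
    cstate->copycontext = AllocSetContextCreate(CurrentMemoryContext,
                                                "COPY",
                                                ALLOCSET_DEFAULT_SIZES);
    oldcontext = MemoryContextSwitchTo(cstate->copycontext);

    /* ... initial setup of cstate ... */

    MemoryContextSwitchTo(oldcontext);  /* <-- the stray switch-back: any
                                         * allocations made below would land
                                         * in the caller's context instead
                                         * of copycontext */

    /* ... further setup that should also run in copycontext ... */

    MemoryContextSwitchTo(oldcontext);  /* the intended, final switch-back */
    return cstate;
}
```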
-
Michael


signature.asc
Description: PGP signature


Re: Logical replication timeout problem

2022-01-28 Thread Fabrice Chapuis
Thanks for your new fix Wang.

TimestampTz ping_time = TimestampTzPlusMilliseconds(sendTime,
wal_sender_timeout / 2);

shouldn't we use receiver_timeout in place of wal_sender_timeout, because the
problem comes from the consumer?

On Wed, Jan 26, 2022 at 4:37 AM wangw.f...@fujitsu.com <
wangw.f...@fujitsu.com> wrote:

> On Thu, Jan 22, 2022 at 7:12 PM Amit Kapila 
> wrote:
> > Now, one idea to solve this problem could be that whenever we skip
> > sending any change we do try to update the plugin progress via
> > OutputPluginUpdateProgress(for walsender, it will invoke
> > WalSndUpdateProgress), and there it tries to process replies and send
> > keep_alive if necessary as we do when we send some data via
> > OutputPluginWrite(for walsender, it will invoke WalSndWriteData). I
> > don't know whether it is a good idea to invoke such a mechanism for
> > every change we skip to send or we should do it after we skip sending
> > some threshold of continuous changes. I think later would be
> > preferred. Also, we might want to introduce a new parameter
> > send_keep_alive to this API so that there is flexibility to invoke
> > this mechanism as we don't need to invoke it while we are actually
> > sending data and before that, we just update the progress via this
> > API.
>
> I tried out the patch according to your advice.
> I found if I invoke ProcessRepliesIfAny and WalSndKeepaliveIfNecessary in
> function OutputPluginUpdateProgress, the running time of the newly added
> function OutputPluginUpdateProgress invoked in pgoutput_change brings
> notable
> overhead:
> --11.34%--pgoutput_change
>   |
>   |--8.94%--OutputPluginUpdateProgress
>   |  |
>   |   --8.70%--WalSndUpdateProgress
>   | |
>   | |--7.44%--ProcessRepliesIfAny
>
> So I tried another way of sending keepalive message to the standby machine
> based on the timeout without asking for a reply(see attachment), the
> running
> time of the newly added function OutputPluginUpdateProgress invoked in
> pgoutput_change also brings slight overhead:
> --3.63%--pgoutput_change
>   |
>   |--1.40%--get_rel_sync_entry
>   |  |
>   |   --1.14%--hash_search
>   |
>--1.08%--OutputPluginUpdateProgress
>  |
>   --0.85%--WalSndUpdateProgress
>
> Based on above, I think the second idea that sending some threshold of
> continuous changes might be better, I will do some research about this
> approach.
>
> Regards,
> Wang wei
>


Re: UNIQUE null treatment option

2022-01-28 Thread Pavel Borisov
>
> Makes sense.  Here is an updated patch with this change.
>
> I didn't end up renaming anynullkeys.  I came up with names like
> "anyalwaysdistinctkeys", but in the end that felt too abstract, and
> moreover, it would require rewriting a bunch of code comments that refer
> to null values in this context.  Since as you wrote, anynullkeys is just
> a local concern between two functions, this slight inaccuracy is perhaps
> better than some highly general but unclear terminology.

Agree with that. With the comment it is clear how it works.

I've looked at patch v3. It seems good enough to me. CFbot tests have
also come back green.
I suggest it is RFC now.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Parameter for planner estimate of recursive queries

2022-01-28 Thread Hamid Akhtar
On Tue, 25 Jan 2022 at 14:44, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 31.12.21 15:10, Simon Riggs wrote:
> >> The factor 10 is a reasonably safe assumption and helps avoid worst
> >> case behavior in bigger graph queries. However, the factor 10 is way
> >> too large for many types of graph query, such as where the path
> >> through the data is tight, and/or the query is written to prune bushy
> >> graphs, e.g. shortest path queries. The factor 10 should not be
> >> hardcoded in the planner, but should be settable, just as
> >> cursor_tuple_fraction is.
> > If you think this should be derived without parameters, then we would
> > want a function that starts at 1 for 1 input row and gets much larger
> > for larger input. The thinking here is that Graph OLTP is often a
> > shortest path between two nodes, whereas Graph Analytics and so the
> > worktable will get much bigger.
>
> On the one hand, this smells like a planner hint.  But on the other
> hand, it doesn't look like we will come up with proper graph-aware
> selectivity estimation system any time soon, so just having all graph
> OLTP queries suck until then because the planner hint is hardcoded
> doesn't seem like a better solution.  So I think this setting can be ok.
>   I think the way you have characterized it makes sense, too: for graph
> OLAP, you want a larger value, for graph OLTP, you want a smaller value.
>

Do you think there is a case to replace the 10x multiplier with
"recursive_worktable_estimate" for total_rows calculation in the
cost_recursive_union function too?


Re: pg_upgrade should truncate/remove its logs before running

2022-01-28 Thread Michael Paquier
On Wed, Jan 26, 2022 at 11:00:28AM +0900, Michael Paquier wrote:
> Bleh.  This would point to the old data directory, so this needs to be
> "$self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*.log"
> to point to the upgraded cluster.

Please note that I have sent a patch to merge this change in the
buildfarm code.  Comments are welcome.
--
Michael


signature.asc
Description: PGP signature


Re: refactoring basebackup.c

2022-01-28 Thread Robert Haas
On Fri, Jan 28, 2022 at 3:54 AM Dipesh Pandit  wrote:
> Thanks. This makes sense.
>
> +#ifdef HAVE_LIBZ
> +   /*
> +* If the user has requested a server compressed archive along with 
> archive
> +* extraction at client then we need to decompress it.
> +*/
> +   if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
> +   compressloc == COMPRESS_LOCATION_SERVER)
> +   streamer = bbstreamer_gzip_decompressor_new(streamer);
> +#endif
>
> I think it is not required to have HAVE_LIBZ check in pg_basebackup.c
> while creating a new gzip writer/decompressor. This check is already
> in place in bbstreamer_gzip_writer_new() and 
> bbstreamer_gzip_decompressor_new()
> and it throws an error in case the build does not have required library
> support. I have removed this check from pg_basebackup.c and updated
> a delta patch. The patch can be applied on v5 patch.

Right, makes sense. Committed with that change, plus I realized the
skip count in the test case file was wrong after the changes I made
yesterday, so I fixed that as well.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Server-side base backup: why superuser, not pg_write_server_files?

2022-01-28 Thread Robert Haas
On Fri, Jan 28, 2022 at 5:58 AM Dagfinn Ilmari Mannsåker
 wrote:
> I just noticed that the new server-side base backup feature requires
> superuser privileges (which is only documented in the pg_basebackup
> manual, not in the streaming replication protocol specification).
>
> Isn't this the kind of thing the pg_write_server_files role was created
> for, so that it can be delegated to a non-superuser?

That's a good idea. I didn't think of that. Would you like to propose a patch?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: row filtering for logical replication

2022-01-28 Thread Alvaro Herrera
I just pushed a change to tab-complete because of a comment in the
column-list patch series.  I checked and your v72-0002 does not
conflict, but it doesn't fully work either; AFAICT you'll have to change
it so that the WHERE clause appears in the COMPLETE_WITH(",") line I
just added.  As far as I tested it, with that change the completion
works fine.


Unrelated to these two patches:

Frankly I would prefer that these completions offer a ";" in addition to
the "," and "WHERE".  But we have no precedent for doing that (offering
to end the command) anywhere in the completion rules, so I think it
would be a larger change that would merit more discussion.

And while we're talking of larger changes, I would love it if other
commands such as DROP TABLE offered a "," completion after a table name,
so that a command can be tab-completed to drop multiple tables.  (Same
with other commands that process multiple comma-separated objects, of
course.)

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X  - http://www.thelinuxreview.com/TUX/)




Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2022-01-28 Thread Julien Rouhaud
Hi,

On Fri, Jan 28, 2022 at 11:02:46AM +0100, Denis Laxalde wrote:
> 
> I tried that simple change first:
> 
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index dfe2a0bcce..8feab0cb96 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8498,6 +8498,9 @@ HotStandbyActiveInReplay(void)
>  bool
>  XLogInsertAllowed(void)
>  {
> +   if (IsBinaryUpgrade)
> +   return false;
> +
> 
> 
> But then, pg_upgrade's tests (make -C src/bin/pg_upgrade/ check) fail at
> vaccumdb but not during pg_dumpall:
> 
> [...]
> 
> command: 
> "/home/denis/src/pgsql/build/tmp_install/home/denis/.local/pgsql/bin/vacuumdb"
> --host /home/denis/src/pgsql/build/src/bin/pg_upgrade --port 51696
> --username denis --all --analyze  >> "pg_upgrade_utility.log" 2>&1
> vacuumdb: vacuuming database "postgres"
> vacuumdb: error: processing of database "postgres" failed: PANIC: cannot
> make new WAL entries during recovery
> 
> In contrast with pg_dump/pg_dumpall, vacuumdb has no --binary-upgrade flag,
> so it does not seem possible to use a special GUC setting to allow WAL
> writes in that vacuumdb session at the moment.
> Should we add --binary-upgrade to vacuumdb as well? Or am I going in the
> wrong direction?

I think having a new option for vacuumdb is the right move.

It seems unlikely that any cron or similar on the host will try to run some
concurrent vacuumdb, but we still have to enforce that only the one executed by
pg_upgrade can succeed.

I guess it could be an undocumented option, similar to postgres' -b, which
would only be allowed iff --all and --freeze are also passed, to be extra safe.

While at it:

> vacuumdb: error: processing of database "postgres" failed: PANIC: cannot
> make new WAL entries during recovery

Should we tweak that message when IsBinaryUpgrade is true?




Re: Unlogged relations and WAL-logging

2022-01-28 Thread Robert Haas
On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas  wrote:
> Unlogged relations are not WAL-logged, but creating the init-fork is.
> There are a few things around that seem sloppy:
>
> 1. In index_build(), we do this:
>
> >*/
> >   if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED 
> > &&
> >   !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
> >   {
> >   smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, 
> > false);
> >   indexRelation->rd_indam->ambuildempty(indexRelation);
> >   }
>
> Shouldn't we call log_smgrcreate() here? Creating the init fork is
> otherwise not WAL-logged at all.

Yes, that's a bug.

> 2. Some implementations of ambuildempty() use the buffer cache (hash,
> gist, gin, brin), while others bypass it and call smgrimmedsync()
> instead (btree, spgist, bloom). I don't see any particular reason for
> those decisions, it seems to be based purely on which example the author
> happened to copy-paste.

I thought that this inconsistency was odd when I was developing the
unlogged feature, but I tried to keep each routine's ambuildempty()
consistent with whatever ambuild() was doing. I don't mind if you want
to change it, though.

> 3. Those ambuildempty implementations that bypass the buffer cache use
> smgrwrite() to write the pages. That doesn't make any difference in
> practice, but in principle it's wrong: You are supposed to use
> smgrextend() when extending a relation.

That's a mistake on my part.

> 4. Also, the smgrwrite() calls are performed before WAL-logging the
> pages, so the page that's written to disk has 0/0 as the LSN, not the
> LSN of the WAL record. That's harmless too, but seems a bit sloppy.

That is also a mistake on my part.

> 5. In heapam_relation_set_new_filenode(), we do this:
>
> >
> >   /*
> >* If required, set up an init fork for an unlogged table so that it 
> > can
> >* be correctly reinitialized on restart.  An immediate sync is 
> > required
> >* even if the page has been logged, because the write did not go 
> > through
> >* shared_buffers and therefore a concurrent checkpoint may have 
> > moved the
> >* redo pointer past our xlog record.  Recovery may as well remove it
> >* while replaying, for example, XLOG_DBASE_CREATE or 
> > XLOG_TBLSPC_CREATE
> >* record. Therefore, logging is necessary even if wal_level=minimal.
> >*/
> >   if (persistence == RELPERSISTENCE_UNLOGGED)
> >   {
> >   Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
> >  rel->rd_rel->relkind == RELKIND_MATVIEW ||
> >  rel->rd_rel->relkind == RELKIND_TOASTVALUE);
> >   smgrcreate(srel, INIT_FORKNUM, false);
> >   log_smgrcreate(newrnode, INIT_FORKNUM);
> >   smgrimmedsync(srel, INIT_FORKNUM);
> >   }
>
> The comment doesn't make much sense, we haven't written nor WAL-logged
> any page here, with nor without the buffer cache. It made more sense
> before commit fa0f466d53.

Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2022-01-28 Thread Denis Laxalde

Julien Rouhaud a écrit :

I think having a new option for vacuumdb is the right move.

It seems unlikely that any cron or similar on the host will try to run some
concurrent vacuumdb, but we still have to enforce that only the one executed by
pg_upgrade can succeed.

I guess it could be an undocumented option, similar to postgres' -b, which
would only be allowed iff --all and --freeze is also passed to be extra safe.


The help text in pg_dump's man page states:

   --binary-upgrade
   This option is for use by in-place upgrade
   utilities. Its use for other purposes is not
   recommended or supported. The behavior of
   the option may change in future releases
   without notice.

Is it enough? Or do we actually want to hide it for vacuumdb?


While at it:


vacuumdb: error: processing of database "postgres" failed: PANIC: cannot
make new WAL entries during recovery


Should we tweak that message when IsBinaryUpgrade is true?


Yes, indeed, I had in mind to simply make the message more generic as: 
"cannot insert new WAL entries".





Re: Parameter for planner estimate of recursive queries

2022-01-28 Thread Robert Haas
On Tue, Jan 25, 2022 at 4:44 AM Peter Eisentraut
 wrote:
> On the one hand, this smells like a planner hint.  But on the other
> hand, it doesn't look like we will come up with proper graph-aware
> selectivity estimation system any time soon, so just having all graph
> OLTP queries suck until then because the planner hint is hardcoded
> doesn't seem like a better solution.  So I think this setting can be ok.

I agree. It's a bit lame, but seems pretty harmless, and I can't see
us realistically doing a lot better with any reasonable amount of
work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2022-01-28 Thread Daniel Gustafsson
>>> Can we propose a patch to document them? Don't want to get bitten by this
>>> suddenly changing...
>> 
>> I can certainly propose something on their mailinglist, but I unfortunately
>> wouldn't get my hopes up too high as NSS and documentation aren't exactly 
>> best
>> friends (the in-tree docs doesn't cover the API and Mozilla recently removed
>> most of the online docs in their neverending developer site reorg).
> 
> Kinda makes me question the wisdom of starting to depend on NSS. When openssl
> docs are vastly outshining a library's, that library really should start to
> ask itself some hard questions.

Sadly, there is that.  While this is not a new problem, Mozilla has been making
some very weird decisions around NSS governance as of late.  Another data point
is the below thread from libcurl:

https://curl.se/mail/lib-2022-01/0120.html

--
Daniel Gustafsson   https://vmware.com/





Remove extra includes of "access/xloginsert.h" when "access/xlog.h" is included

2022-01-28 Thread Bharath Rupireddy
Hi,

It seems there are some instances where xloginsert.h is included
right after xlog.h, but xlog.h already includes xloginsert.h.
Unless I'm missing something badly, we can safely remove the inclusion of
xloginsert.h after xlog.h. Attaching a patch to remove the
extra xloginsert.h includes.

Thoughts?

Regards,
Bharath Rupireddy.


v1-0001-remove-extra-includes-of-xloginsert.h-when-xlog.h.patch
Description: Binary data


Re: Support escape sequence for cluster_name in postgres_fdw.application_name

2022-01-28 Thread Robert Haas
On Thu, Jan 27, 2022 at 3:10 AM Kyotaro Horiguchi
 wrote:
> Is the reason for 'C' in upper-case to avoid possible conflict with
> 'c' of log_line_prefix?  I'm not sure that preventive measure is worth
> doing.  Looking the escape-sequence spec alone, it seems to me rather
> strange that an upper-case letter is used in spite of its lower-case
> is not used yet.

It's good to be consistent, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Disable bgworkers during servers start in pg_upgrade*

2022-01-28 Thread Julien Rouhaud
On Fri, Jan 28, 2022 at 03:06:57PM +0100, Denis Laxalde wrote:
> Julien Rouhaud a écrit :
> > I think having a new option for vacuumdb is the right move.
> > 
> > It seems unlikely that any cron or similar on the host will try to run some
> > concurrent vacuumdb, but we still have to enforce that only the one 
> > executed by
> > pg_upgrade can succeed.
> > 
> > I guess it could be an undocumented option, similar to postgres' -b, which
> > would only be allowed iff --all and --freeze is also passed to be extra 
> > safe.
> 
> The help text in pg_dump's man page states:
> 
>--binary-upgrade
>This option is for use by in-place upgrade
>utilities. Its use for other purposes is not
>recommended or supported. The behavior of
>the option may change in future releases
>without notice.
> 
> Is it enough? Or do we actually want to hide it for vacuumdb?

I think it should be hidden, with a comment about it like postmaster.c getopt
call:

case 'b':
/* Undocumented flag used for binary upgrades */

> > > vacuumdb: error: processing of database "postgres" failed: PANIC: cannot
> > > make new WAL entries during recovery
> > 
> > Should we tweak that message when IsBinaryUpgrade is true?
> 
> Yes, indeed, I had in mind to simply make the message more generic as:
> "cannot insert new WAL entries".

-1, it's good to have a clear reason why the error happened, especially since
it's supposed to be "should not happen" situation.




Re: Support for NSS as a libpq TLS backend

2022-01-28 Thread Robert Haas
On Fri, Jan 28, 2022 at 9:08 AM Daniel Gustafsson  wrote:
> > Kinda makes me question the wisdom of starting to depend on NSS. When 
> > openssl
> > docs are vastly outshining a library's, that library really should start to
> > ask itself some hard questions.

Yeah, OpenSSL is very poor, so being worse is not good.

> Sadly, there is that.  While this is not a new problem, Mozilla has been 
> making
> some very weird decisions around NSS governance as of late.  Another data 
> point
> is the below thread from libcurl:
>
> https://curl.se/mail/lib-2022-01/0120.html

I would really, really like to have an alternative to OpenSSL for PG.
I don't know if this is the right thing, though. If other people are
dropping support for it, that's a pretty bad sign IMHO. Later in the
thread it says OpenLDAP have dropped support for it already as well.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-28 Thread Aleksander Alekseev
Hi Pavel,

> Please feel free to discuss readme and your opinions on the current patch and 
> proposed changes [1].

Just a quick question about this design choice:

> On-disk tuple format remains unchanged. 32-bit t_xmin and t_xmax store the
> lower parts of 64-bit XMIN and XMAX values. Each heap page has additional
> 64-bit pd_xid_base and pd_multi_base which are common for all tuples on a 
> page.
> They are placed into a pd_special area - 16 bytes in the end of a heap page.
> Actual XMIN/XMAX for a tuple are calculated upon reading a tuple from a page
> as follows:
>
> XMIN = t_xmin + pd_xid_base.
> XMAX = t_xmax + pd_xid_base/pd_multi_base.

Did you consider using 4 bytes for pd_xid_base and another 4 bytes for
(pd_xid_base/pd_multi_base)? This would allow calculating XMIN/XMAX
as:

XMIN = (t_min_extra_bits << 32) | t_xmin
XMAX = (t_max_extra_bits << 32) | t_xmax

... and save 8 extra bytes in the pd_special area. Or maybe I'm
missing some context here?

-- 
Best regards,
Aleksander Alekseev




Re: Make relfile tombstone files conditional on WAL level

2022-01-28 Thread Dilip Kumar
On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar  wrote:
>
> On Thu, Jan 6, 2022 at 7:22 PM Robert Haas  wrote:
>>
>> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro  wrote:
>> > Another problem is that relfilenodes are normally allocated with
>> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
>> > a new allocator, and they won't be able to match the OID in general
>> > (while we have 32 bit OIDs at least).
>>
>> Personally I'm not sad about that. Values that are the same in simple
>> cases but diverge in more complex cases are kind of a trap for the
>> unwary. There's no real reason to have them ever match. Yeah, in
>> theory, it makes it easier to tell which file matches which relation,
>> but in practice, you always have to double-check in case the table has
>> ever been rewritten. It doesn't seem worth continuing to contort the
>> code for a property we can't guarantee anyway.
>
>
> Make sense, I have started working on this idea, I will try to post the first 
> version by early next week.

Here is the first working patch; with it we no longer need to
maintain the tombstone file until the next checkpoint.  This is still
a WIP patch, but with it I can see that the ALTER DATABASE
SET TABLESPACE WAL-logging problem, which Robert reported a
couple of mails above in this thread, is solved.

General idea of the patch:
- Change the RelFileNode.relNode to be 64bit wide, out of which 8 bits
for fork number and 56 bits for the relNode as shown below. [1]
- GetNewRelFileNode() will just generate a new unique relfilenode and
check the file existence and if it already exists then throw an error,
so no loop.  We also need to add the logic for preserving the
nextRelNode across restart and also WAL logging it but that is similar
to the preserving nextOid.
- mdunlinkfork will directly forget the relfilenode, so we get rid of
all the deferred unlinking code.
- Now, we don't need any post checkpoint unlinking activity.

[1]
/*
* RelNodeId:
*
* This is a storage type for the relation node number. The reasoning behind
* using this is the same as for BlockId, so refer to the comment atop BlockId.
*/
typedef struct RelNodeId
{
  uint32 rn_hi;
  uint32 rn_lo;
} RelNodeId;
typedef struct RelFileNode
{
   Oid spcNode; /* tablespace */
   Oid dbNode; /* database */
   RelNodeId relNode; /* relation */
} RelFileNode;

TODO:

There are a couple of TODOs and FIXMEs which I am planning to address
by next week.  I am also planning to test the case where the relfilenode
consumes more than 32 bits; for that we could set FirstNormalRelfileNode
to a higher value for testing purposes.  I also plan to improve the
comments.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 4a6502c7950969262c6982388865bbc23e531cde Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 28 Jan 2022 18:32:35 +0530
Subject: [PATCH v1] Don't wait for next checkpoint to remove unwanted
 relfilenode

Currently, relfilenode is 32 bits wide, so if we remove the relfilenode
immediately after it is no longer needed then there is a risk of reusing
the same relfilenode in the same checkpoint.  To avoid that, we currently
delay cleaning up the relfilenode until the next checkpoint.  With this
patch we use 56 bits for the relfilenode.  Ideally we could make it a full
64 bits wide, but that would increase the size of BufferTag; to keep that
size the same we make RelFileNode.relNode 64 bits wide, of which 8 bits
are used for storing the fork number and the remaining 56 bits for the
relfilenode.
---
 contrib/pg_buffercache/pg_buffercache_pages.c   |   4 +-
 contrib/pg_prewarm/autoprewarm.c|   4 +-
 src/backend/access/common/syncscan.c|   3 +-
 src/backend/access/gin/ginxlog.c|   5 +-
 src/backend/access/rmgrdesc/gistdesc.c  |   4 +-
 src/backend/access/rmgrdesc/heapdesc.c  |   4 +-
 src/backend/access/rmgrdesc/nbtdesc.c   |   4 +-
 src/backend/access/rmgrdesc/seqdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c  |  15 +++-
 src/backend/access/transam/varsup.c |  42 +-
 src/backend/access/transam/xlog.c   |  57 ++---
 src/backend/access/transam/xloginsert.c |  12 +++
 src/backend/access/transam/xlogutils.c  |   9 ++-
 src/backend/catalog/catalog.c   |  61 +++---
 src/backend/catalog/heap.c  |   6 +-
 src/backend/catalog/index.c |   4 +-
 src/backend/catalog/storage.c   |   3 +-
 src/backend/commands/tablecmds.c|  18 +++--
 src/backend/replication/logical/decode.c|   1 +
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/storage/buffer/bufmgr.c |  61 +++---
 src/backend/storage/buffer/localbuf.c   |   8 +-
 src/backend/storage/freespace/fsmpage.c |   4 +-
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/storage/smgr/

Re: warn if GUC set to an invalid shared library

2022-01-28 Thread Robert Haas
On Tue, Dec 28, 2021 at 12:45 PM Justin Pryzby  wrote:
> 0002 adds context when failing to start.
>
> 2021-12-27 17:01:12.996 CST postmaster[1403] WARNING:  could not load 
> library: $libdir/plugins/asdf: cannot open shared object file: No such file 
> or directory
> 2021-12-27 17:01:14.938 CST postmaster[1403] FATAL:  could not access 
> file "asdf": No such file or directory
> 2021-12-27 17:01:14.938 CST postmaster[1403] CONTEXT:  guc 
> "shared_preload_libraries"
> 2021-12-27 17:01:14.939 CST postmaster[1403] LOG:  database system is 
> shut down

-1 from me on using "guc" in any user-facing error message. And even
guc -> setting isn't a big improvement. If we're going to structure
the reporting this way there, we should try to use a meaningful phrase
there, probably beginning with the word "while"; see "git grep
errcontext.*while" for interesting precedents.

That said, that series of messages seems a bit suspect to me, because
the WARNING seems to be stating the same problem as the subsequent
FATAL and CONTEXT lines. Ideally we'd tighten that somehow.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-28 Thread Pavel Borisov
>
> Did you consider using 4 bytes for pd_xid_base and another 4 bytes for
> (pd_xid_base/pd_multi_base)? This would allow calculating XMIN/XMAX
> as:
>
> XMIN = (t_min_extra_bits << 32) | t_xmin
> XMAX = (t_max_extra_bits << 32) | t_xmax
>
> ... and save 8 extra bytes in the pd_special area. Or maybe I'm
> missing some context here?
>
Hi, Alexander!

In the current design it is not possible, as pd_xid_base is roughly just the
minimum 64-bit xid of all tuples that may fit on this page, so we make no
assumption that it should be a multiple of 2^32.

If we made pd_xid_base a multiple of 2^32, then once the current XID crosses
a 2^32 boundary, pages containing tuples with XMIN/XMAX from before that
point would no longer be suitable for tuple inserts. In effect we would then
have "sets" of pages for each 2^32 "epoch" whose freed space can no longer
be used.

I think it's too big a loss for gain of just 8 bytes per page.
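For illustration, the two layouts discussed above can be sketched as follows. This is a simplified model, not the actual patch code: in the current design the full xid is the page's 64-bit base plus a 32-bit per-tuple value, while the rejected alternative would force the base to a multiple of 2^32:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: full 64-bit xid reconstructed from a per-page
 * base plus a 32-bit per-tuple value.  In the current design the base
 * is an arbitrary 64-bit value (roughly the page's minimum xid), so a
 * page can keep accepting tuples as the xid counter advances. */
static inline uint64_t
xid_from_base(uint64_t pd_xid_base, uint32_t t_xmin)
{
	return pd_xid_base + t_xmin;
}

/* The rejected alternative: base constrained to a multiple of 2^32,
 * full xid = (hi << 32) | t_xmin.  Once the global counter crosses the
 * next 2^32 boundary, tuples with the old hi part can no longer share
 * a page with new ones, stranding the page's free space. */
static inline uint64_t
xid_from_epoch(uint32_t hi, uint32_t t_xmin)
{
	return ((uint64_t) hi << 32) | t_xmin;
}
```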

Thank you for your dive into this matter!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: A test for replay of regression tests

2022-01-28 Thread Andrew Dunstan


On 1/27/22 18:24, Thomas Munro wrote:
> On Fri, Jan 28, 2022 at 12:03 PM Andres Freund  wrote:
>> Revert "graceful shutdown" changes for Windows, in back branches only.
> FTR I'm actively working on a fix for that one for master now (see
> that other thread where the POC survived Alexander's torture testing).



OK, good. A further data point on that: I am not seeing a recovery test
hang or commit_ts test failure on real W10 machines, including jacana. I
am only getting them on WS2019 VMs e.g. drongo/fairywren.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2022-01-28 Thread Bharath Rupireddy
On Fri, Jan 28, 2022 at 2:20 PM Julien Rouhaud  wrote:
> PG_CONTROL_VERSION is different from catversion.  You should update it in this
> patch.

My bad. Updated it.

> But Horiguchi-san was also mentioning that pg_upgrade/controldata.c needs some
> modifications if you change the format (thus the requirement to bump
> PG_CONTROL_VERSION).

> Also, you still didn't fix the possible flag upgrade issue.

I don't think we need to change pg_upgrade's ControlData controldata
structure, as the information may not be needed there, and the while
loop there specifically parses/searches for the required
pg_controldata output texts. Am I missing something here?

> Why are you defining CHECKPOINT_KIND_TEXT_LENGTH twice?  You
> should just define it in some sensible header used by both files, or better
> have a new function to take care of that rather than having the code
> duplicated.

Yeah, added the macro in pg_control.h. I also wanted to have a common
function to get checkpoint kind text and place it in
controldata_utils.c, but it doesn't have xlog.h included, so no
checkpoint flags there, hence I refrained from the common function
idea.

I think we don't need to print the checkpoint kind in pg_resetwal.c's
PrintControlValues because pg_resetwal changes the checkpoint, and
PrintControlValues prints only the fields that will not be
reset/changed by pg_resetwal. Am I missing something here?

Attaching v4.

Not related to this patch: looking at the way the fields (like
"Latest checkpoint's TimeLineID:", "Latest checkpoint's NextOID:",
etc.) of the pg_controldata output are used in pg_resetwal.c,
pg_controldata.c, and pg_upgrade/controldata.c, I'm thinking that having
those fields as macros in pg_control.h, e.g.
#define PG_CONTROL_LATEST_CHECKPOINT_TLI "Latest checkpoint's TimeLineID:"
#define PG_CONTROL_LATEST_CHECKPOINT_NEXTOID "Latest checkpoint's NextOID:"
and so on for all the pg_controldata fields, would be good for
code manageability and to avoid missing any field text changes.

If okay, I will discuss this in a separate thread.
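As a rough sketch of how such shared macros could be consumed by the tools that parse pg_controldata output (the parse helper below is invented for illustration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Shared macro for a pg_controldata field label, so pg_resetwal.c,
 * pg_controldata.c and pg_upgrade/controldata.c all match the same
 * text.  The macro name follows the message above; match_field() is
 * an invented helper, not actual PostgreSQL code. */
#define PG_CONTROL_LATEST_CHECKPOINT_TLI "Latest checkpoint's TimeLineID:"

/* Return the trimmed value part if the line carries the given field,
 * or NULL if the label does not match. */
static const char *
match_field(const char *line, const char *label)
{
	size_t		len = strlen(label);

	if (strncmp(line, label, len) != 0)
		return NULL;
	line += len;
	while (*line == ' ')
		line++;
	return line;
}
```

With a single definition of each label, a wording change in the output text cannot silently break one of the consumers.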

Regards,
Bharath Rupireddy.
From e118edc5852434847dac36c566e7feca6f1f22ca Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 28 Jan 2022 14:30:18 +
Subject: [PATCH v4] add last checkpoint kind to pg_control file

Bump catalog version
---
 doc/src/sgml/func.sgml  |  5 +
 src/backend/access/transam/xlog.c   |  2 ++
 src/backend/utils/misc/pg_controldata.c | 26 ++---
 src/bin/pg_controldata/pg_controldata.c | 17 
 src/include/catalog/pg_control.h| 10 +++---
 src/include/catalog/pg_proc.dat |  6 +++---
 6 files changed, 57 insertions(+), 9 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0ee6974f1c..4357ca838e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25093,6 +25093,11 @@ SELECT collation for ('foo' COLLATE "de_DE");
timestamp with time zone
   
 
+  
+   checkpoint_kind
+   text
+  
+
  
 

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..72255f6b8a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9375,6 +9375,7 @@ CreateCheckPoint(int flags)
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
 		ControlFile->state = DB_SHUTDOWNED;
+	ControlFile->checkPointKind = flags;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	/* crash recovery should always recover to the end of WAL */
@@ -9761,6 +9762,7 @@ CreateRestartPoint(int flags)
 	if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
 		ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
 	{
+		ControlFile->checkPointKind = flags;
 		ControlFile->checkPoint = lastCheckPointRecPtr;
 		ControlFile->checkPointCopy = lastCheckPoint;
 
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 781f8b8758..9d48a99378 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -79,20 +79,22 @@ pg_control_system(PG_FUNCTION_ARGS)
 Datum
 pg_control_checkpoint(PG_FUNCTION_ARGS)
 {
-	Datum		values[18];
-	bool		nulls[18];
+	Datum		values[19];
+	bool		nulls[19];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
 	XLogSegNo	segno;
 	char		xlogfilename[MAXFNAMELEN];
 	bool		crc_ok;
+	uint16		flags;
+	char 		ckpt_kind[CHECKPOINT_KIND_TEXT_LENGTH];
 
 	/*
 	 * Construct a tuple descriptor for the result row.  This must match this
 	 * function's pg_proc entry!
 	 */
-	tupdesc = CreateTemplateTupleDesc(18);
+	tupdesc = CreateTemplateTupleDesc(19);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "checkpoint_lsn",
 	   PG_LSNOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "redo_lsn",
@@ -129,6 +131,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 	   XIDOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (Att

Re: Support for NSS as a libpq TLS backend

2022-01-28 Thread Daniel Gustafsson
> On 28 Jan 2022, at 15:30, Robert Haas  wrote:
> 
> On Fri, Jan 28, 2022 at 9:08 AM Daniel Gustafsson  wrote:
>>> Kinda makes me question the wisdom of starting to depend on NSS. When 
>>> openssl
>>> docs are vastly outshining a library's, that library really should start to
>>> ask itself some hard questions.
> 
> Yeah, OpenSSL is very poor, so being worse is not good.

Some background on this for anyone interested: Mozilla removed the
documentation from the MDN website and the attempt at resurrecting it in the
tree (where it should've been all along) isn't making much progress.
Some more can be found in this post on the NSS mailinglist:

https://groups.google.com/a/mozilla.org/g/dev-tech-crypto/c/p0MO7030K4A/m/Mx5St_2sAwAJ

--
Daniel Gustafsson   https://vmware.com/





Re: Server-side base backup: why superuser, not pg_write_server_files?

2022-01-28 Thread Dagfinn Ilmari Mannsåker
Robert Haas  writes:

> On Fri, Jan 28, 2022 at 5:58 AM Dagfinn Ilmari Mannsåker
>  wrote:
>> I just noticed that the new server-side base backup feature requires
>> superuser privileges (which is only documented in the pg_basebackup
>> manual, not in the streaming replication protocol specification).
>>
>> Isn't this the kind of thing the pg_write_server_files role was created
>> for, so that it can be delegated to a non-superuser?
>
> That's a good idea. I didn't think of that. Would you like to propose a patch?

Sure, I'll try and whip something up over the weekend.

- ilmari




Re: Remove extra includes of "access/xloginsert.h" when "access/xlog.h" is included

2022-01-28 Thread Alvaro Herrera
On 2022-Jan-28, Bharath Rupireddy wrote:

> Hi,
> 
> It seems like there are some instances where xloginsert.h is included
> right after xlog.h but xlog.h has already included xloginsert.h.
> Unless I'm missing something badly, we can safely remove including
> xloginsert.h after xlog.h. Attempting to post a patch to remove the
> extra xloginsert.h includes.

Why isn't it better to remove the line that includes xloginsert.h in
xlog.h instead?  When xloginsert.h was introduced (commit 2076db2aea76),
XLogRecData was put there so xloginsert.h was necessary for xlog.h; but
now we have a forward declaration (per commit 2c03216d8311) so it
doesn't seem needed anymore.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: Support tab completion for upper character inputs in psql

2022-01-28 Thread Tom Lane
"tanghy.f...@fujitsu.com"  writes:
> I did some tests on it and here are something cases I feel we need to confirm 
> whether they are suitable.

> 1) postgres=# create table atest(id int, "iD" int, "ID" int);
> 2) CREATE TABLE
> 3) postgres=# alter table atest rename i[TAB]
> 4) id  "iD"
> 5) postgres=# alter table atest rename I[TAB]
> 6) id  "iD"

> The tab completion for 5) ignored "ID", is that correct?

Perhaps I misunderstood your original complaint, but what I thought
you were unhappy about was that unquoted ID is a legal spelling of
"id" and so I ought to be willing to complete that.  These
examples with case variants of the same word are of some interest,
but people aren't really going to create tables with these sorts of
names, so we shouldn't let them drive the design IMO.

Anyway, the existing behavior for these examples is

alter table atest rename i --- completes immediately to id
alter table atest rename I --- offers nothing

It's certainly arguable that the first case is right as-is and we
shouldn't change it.  I think that could be handled by tweaking my
patch so that it wouldn't offer completions that start with a quote
unless the input word does.  That would also cause I to complete
immediately to id, which is arguably fine.

> I think what we are trying to do is to ease the burden of typing double quote 
> for user.

I'm not thinking about it that way at all.  To me, the goal is to make
tab completion do something sensible when presented with legal variant
spellings of a word.  The two cases where it currently fails to do
that are (1) unquoted input that needs to be downcased, and (2) input
that is quoted when it doesn't strictly need to be.

To the extent that we can supply a required quote that the user
failed to type, that's fine, but it's not a primary goal of the patch.
Examples like these make me question whether it's even something we
want; it's resulting in extraneous matches that people might find more
annoying than helpful.  Now I *think* that these aren't realistic
cases and that in real cases adding quotes will be helpful more often
than not, but it's debatable.

> On the other hand, I'm not so comfortable with the output of "iD" in line 13.
> If user doesn't type double quote, why we add double quote to the output?

That's certainly a valid argument.

> Could we make the output of 13) like below?
> 12) postgres=# alter table atest rename i[TAB]
> ??) id  iD

That doesn't seem sensible at all.

regards, tom lane




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2022-01-28 Thread Julien Rouhaud
Hi,

On Fri, Jan 28, 2022 at 08:21:52PM +0530, Bharath Rupireddy wrote:
> 
> I don't think we need to change pg_upgrade's ControlData controldata
> structure, as the information may not be needed there, and the while
> loop there specifically parses/searches for the required
> pg_controldata output texts. Am I missing something here?

Right, I was remembering that there was a check that all expected fields were
found but after double checking I was clearly wrong, sorry about that.
> 
> > Also, you still didn't fix the possible flag upgrade issue.

Unless I'm missing something, that's an issue that you still haven't addressed
or explained why it's not a problem?

> 
> > Why are you defining CHECKPOINT_KIND_TEXT_LENGTH twice?  You
> > should just define it in some sensible header used by both files, or better
> > have a new function to take care of that rather than having the code
> > duplicated.
> 
> Yeah, added the macro in pg_control.h. I also wanted to have a common
> function to get checkpoint kind text and place it in
> controldata_utils.c, but it doesn't have xlog.h included, so no
> checkpoint flags there, hence I refrained from the common function
> idea.

That's a bit annoying, I'm not sure what's best to do here.




Re: Multiple Query IDs for a rewritten parse tree

2022-01-28 Thread Dmitry Dolgov
> On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
> On 1/9/22 5:49 AM, Tom Lane wrote:
> > The idea I'd been vaguely thinking about is to allow attaching a list
> > of query-hash nodes to a Query, where each node would contain a "tag"
> > identifying the specific hash calculation method, and also the value
> > of the query's hash calculated according to that method.  We could
> > probably get away with saying that all such hash values must be uint64.
> > The main difference from your function-OID idea, I think, is that
> > I'm envisioning the tags as being small integers with well-known
> > values, similarly to the way we manage stakind values in pg_statistic.
> > In this way, an extension that wants a hash that the core knows how
> > to calculate doesn't need its own copy of the code, and similarly
> > one extension could publish a calculation method for use by other
> > extensions.
>
> To move forward, I have made a patch that implements this idea (see
> attachment). It is a POC, but passes all regression tests.

Thanks. Couple of comments off the top of my head:

> Registration of a queryId generator is implemented by analogy with the
> extensible methods machinery.

Why not something more like what was suggested above, with stakind-style
slots in some data structure? All of those generators have to be iterated
anyway, so I'm not sure a hash table makes sense.

> Also, I switched queryId to int64 type and renamed to
> 'label'.

A name with "id" in it would be better, I believe. "Label" could be thought
of as "the query belongs to a certain category", while the purpose is
identification.

> 2. We need a custom queryId, that is based on a generated queryId (according
> to the logic of pg_stat_statements).

Could you clarify?

> 4. We should reserve position of default in-core generator

From the discussion above I was under the impression that the core
generator should be distinguished by a predefined kind.

> 5. We should add an EXPLAIN hook, to allow an extension to print this custom
> queryId.

Why? It would make sense if the custom generation code were generating
some complex structure, but the queryId itself is still a hash.
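For what it's worth, a minimal sketch of the (tag, hash) idea quoted at the top of this message might look like this. None of these names exist in PostgreSQL; they are purely illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shape for the idea quoted above: each Query carries a
 * list of (tag, hash) pairs, where the tag is a small well-known
 * integer identifying the hash calculation method (similar in spirit
 * to stakind slots in pg_statistic).  All names are invented. */
#define QUERYID_CORE	1		/* in-core jumble, e.g. pg_stat_statements */
#define QUERYID_CUSTOM	2		/* an extension-defined method */

typedef struct QueryIdEntry
{
	int32_t		tag;			/* which calculation method */
	uint64_t	hash;			/* the value computed by that method */
} QueryIdEntry;

/* Look up the hash for a given tag; returns false if absent. */
static bool
lookup_queryid(const QueryIdEntry *entries, int n, int32_t tag,
			   uint64_t *hash)
{
	for (int i = 0; i < n; i++)
	{
		if (entries[i].tag == tag)
		{
			*hash = entries[i].hash;
			return true;
		}
	}
	return false;
}
```

With well-known tag values, an extension that wants a hash the core already knows how to calculate needs no copy of the jumbling code, and one extension can publish a method for others to consume.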




Re: Server-side base backup: why superuser, not pg_write_server_files?

2022-01-28 Thread Dagfinn Ilmari Mannsåker
Dagfinn Ilmari Mannsåker  writes:

> Robert Haas  writes:
>
>> On Fri, Jan 28, 2022 at 5:58 AM Dagfinn Ilmari Mannsåker
>>  wrote:
>>> I just noticed that the new server-side base backup feature requires
>>> superuser privileges (which is only documented in the pg_basebackup
>>> manual, not in the streaming replication protocol specification).
>>>
>>> Isn't this the kind of thing the pg_write_server_files role was created
>>> for, so that it can be delegated to a non-superuser?
>>
>> That's a good idea. I didn't think of that. Would you like to propose a 
>> patch?
>
> Sure, I'll try and whip something up over the weekend.

Or now. Patch attached.

- ilmari

From 2b5f078905fd463fc33d8ef259e93972ea17cd34 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Fri, 28 Jan 2022 15:54:07 +
Subject: [PATCH] Allow BASE_BACKUP TARGET 'server' to pg_write_server_files
 members

---
 doc/src/sgml/protocol.sgml  | 5 +
 doc/src/sgml/ref/pg_basebackup.sgml | 3 ++-
 src/backend/replication/basebackup_server.c | 6 --
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 68908dcb7b..24e93f9b28 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2647,6 +2647,11 @@
   blackhole, the backup data is not sent
   anywhere; it is simply discarded.
  
+
+ 
+  The server target requires superuser privilege or
+  being granted the pg_write_server_files role.
+ 
 

 
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index a5e03d2c66..d6b3cb18e3 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -237,7 +237,8 @@
 server:/some/path, the backup will be stored on
 the machine where the server is running in the
 /some/path directory. Storing a backup on the
-server requires superuser privileges. If the target is set to
+server requires superuser privileges or being granted the
+pg_write_server_files role. If the target is set to
 blackhole, the contents are discarded and not
 stored anywhere. This should only be used for testing purposes, as you
 will not end up with an actual backup.
diff --git a/src/backend/replication/basebackup_server.c b/src/backend/replication/basebackup_server.c
index ce1b7b4797..18b0e11d90 100644
--- a/src/backend/replication/basebackup_server.c
+++ b/src/backend/replication/basebackup_server.c
@@ -10,10 +10,12 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_authid.h"
 #include "miscadmin.h"
 #include "replication/basebackup.h"
 #include "replication/basebackup_sink.h"
 #include "storage/fd.h"
+#include "utils/acl.h"
 #include "utils/timestamp.h"
 #include "utils/wait_event.h"
 
@@ -65,10 +67,10 @@ bbsink_server_new(bbsink *next, char *pathname)
 	sink->base.bbs_next = next;
 
 	/* Replication permission is not sufficient in this case. */
-	if (!superuser())
+	if (!is_member_of_role(GetUserId(), ROLE_PG_WRITE_SERVER_FILES))
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("must be superuser to create server backup")));
+ errmsg("must be superuser or a member of the pg_write_server_files role to create server backup")));
 
 	/*
 	 * It's not a good idea to store your backups in the same directory that
-- 
2.30.2



Re: Server-side base backup: why superuser, not pg_write_server_files?

2022-01-28 Thread Robert Haas
On Fri, Jan 28, 2022 at 12:16 PM Dagfinn Ilmari Mannsåker
 wrote:
> Or now. Patch attached.

LGTM. Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Server-side base backup: why superuser, not pg_write_server_files?

2022-01-28 Thread Dagfinn Ilmari Mannsåker


On Fri, 28 Jan 2022, at 17:33, Robert Haas wrote:
> LGTM. Committed.

Thanks!

- ilmari




Re: refactoring basebackup.c

2022-01-28 Thread Jeevan Ladhe
Hi Robert,

I have attached the latest rebased version of the LZ4 server-side
compression patch on the recent commits. This patch also introduces
the compression level and adds a tap test.

Also, while adding the lz4 case in pg_verifybackup/t/008_untar.pl, I
found an unused variable {have_zlib}. I have attached a cleanup patch
for that as well.

Please review and let me know your thoughts.

Regards,
Jeevan Ladhe


0001-gzip-tap-test-remove-extra-variable.patch
Description: Binary data


v10-0002-Add-a-LZ4-compression-method-for-server-side-compres.patch
Description: Binary data


Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2022-01-28 Thread Andres Freund
Hi,

On 2022-01-28 21:56:36 +0800, Julien Rouhaud wrote:
> I think having a new option for vacuumdb is the right move.

Can't we pass the option via the connection string, e.g. something like
PGOPTIONS='-c binary_upgrade_mode=true'? That seems to scale better than
adding it gradually to multiple tools.

Greetings,

Andres Freund




Re: archive modules

2022-01-28 Thread Robert Haas
On Thu, Jan 13, 2022 at 2:38 PM Bossart, Nathan  wrote:
> Here is another rebase for cfbot.

I've committed 0001 now. I don't see anything particularly wrong with
the rest of this either, but here are a few comments:

- I wonder whether it might be better to promote the basic archiving
module to contrib (as compared with src/test/modules) and try to
harden it to the extent that such hardening is required. I think a lot
of people would get good use out of that. It might not be a completely
baked solution, but a solution doesn't have to be completely baked to
be a massive improvement over the stupidity endorsed by our current
documentation.

- I wonder whether it's a good idea to silently succeed if the file
exists and has the same contents as the file we're trying to archive.
ISTR that being necessary behavior for robustness, because what if we
archive the file and then die before recording the fact that we
archived it?

- If we load a new archive library, should we give the old one a
callback so it can shut down? And should the archiver considering
exiting since we can't unload? It isn't necessary but it might be
nicer.

- I believe we decided some time back to invoke function pointers
(*like)(this) rather than like(this) for clarity. It was judged that
something->like(this) was fine because that can only be a function
pointer, so no need to write (*(something->like))(this), but
like(this) could make you think that "like" is a plain function rather
than a function pointer.
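A minimal illustration of that convention, with invented names:

```c
#include <assert.h>

/* Illustration of the call-style convention described above; the
 * struct and function names are invented for this example. */
typedef struct ArchiveCallbacks
{
	int			(*archive_file) (int n);
} ArchiveCallbacks;

static int
double_it(int n)
{
	return 2 * n;
}

static int
invoke(ArchiveCallbacks *cb, int (*fn) (int), int n)
{
	int			a = cb->archive_file(n);	/* fine: "->" makes it clearly a
											 * function pointer */
	int			b = (*fn) (n);	/* preferred over fn(n): makes the
								 * pointer-ness of "fn" explicit */

	return a + b;
}
```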

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: refactoring basebackup.c

2022-01-28 Thread Robert Haas
On Fri, Jan 28, 2022 at 12:48 PM Jeevan Ladhe
 wrote:
> I have attached the latest rebased version of the LZ4 server-side compression
> patch on the recent commits. This patch also introduces the compression level
> and adds a tap test.

In view of this morning's commit of
d45099425eb19e420433c9d81d354fe585f4dbd6 I think the threshold for
committing this patch has gone up. We need to make it support
decompression with LZ4 on the client side, as we now have for gzip.

Other comments:

- Even if we were going to support LZ4 only on the server side, surely
it's not right to refuse --compress lz4 and --compress client-lz4 at
the parsing stage. I don't even think the message you added to main()
is reachable.

- In the new test case you set decompress_flags but according to the
documentation I have here, -m is for multiple files (and so should not
be needed here) and -d is for decompression (which is what we want
here). So I'm confused why this is like this.

Other than that this seems like it's in pretty good shape.

> Also, while adding the lz4 case in the pg_verifybackup/t/008_untar.pl, I found
> an unused variable {have_zlib}. I have attached a cleanup patch for that as 
> well.

This part seems clearly correct, so I have committed it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: archive modules

2022-01-28 Thread Nathan Bossart
On Fri, Jan 28, 2022 at 02:06:50PM -0500, Robert Haas wrote:
> I've committed 0001 now. I don't see anything particularly wrong with
> the rest of this either, but here are a few comments:

Thanks!

> - I wonder whether it might be better to promote the basic archiving
> module to contrib (as compared with src/test/modules) and try to
> harden it to the extent that such hardening is required. I think a lot
> of people would get good use out of that. It might not be a completely
> baked solution, but a solution doesn't have to be completely baked to
> be a massive improvement over the stupidity endorsed by our current
> documentation.

This has been suggested a few times in this thread, so I'll go ahead and
move it to contrib.  I am clearly outnumbered! :)

I discussed the two main deficiencies I'm aware of with basic_archive
earlier [0].  The first one is the issue with "inconvenient" server crashes
(mentioned below).  The second is that there is no handling for multiple
servers writing to the same location since the temporary file is always
named "archtemp."  I thought about a few ways to pick a unique file name
(or at least one that is _probably_ unique), but that began adding a lot of
complexity for something I intended as a test module.  I can spend some
more time on this if you think it's worth fixing for a contrib module.

> - I wonder whether it's a good idea to silently succeed if the file
> exists and has the same contents as the file we're trying to archive.
> ISTR that being necessary behavior for robustness, because what if we
> archive the file and then die before recording the fact that we
> archived it?

Yes.  The only reason I didn't proceed with this earlier is because the
logic became a sizable chunk of the module.  I will add this in the next
revision.

> - If we load a new archive library, should we give the old one a
> callback so it can shut down? And should the archiver considering
> exiting since we can't unload? It isn't necessary but it might be
> nicer.

Good idea.  I'll look into this.

> - I believe we decided some time back to invoke function pointers
> (*like)(this) rather than like(this) for clarity. It was judged that
> something->like(this) was fine because that can only be a function
> pointer, so no need to write (*(something->like))(this), but
> like(this) could make you think that "like" is a plain function rather
> than a function pointer.

Will fix.

[0] https://postgr.es/m/A30D8D33-8944-4898-BCA8-C77C18258247%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/




Re: archive modules

2022-01-28 Thread Robert Haas
On Fri, Jan 28, 2022 at 3:01 PM Nathan Bossart  wrote:
> I discussed the two main deficiencies I'm aware of with basic_archive
> earlier [0].  The first one is the issue with "inconvenient" server crashes
> (mentioned below).

Seems easy enough to rectify, if it's just a matter of silently-succeed-if-same.

> The second is that there is no handling for multiple
> servers writing to the same location since the temporary file is always
> named "archtemp."  I thought about a few ways to pick a unique file name
> (or at least one that is _probably_ unique), but that began adding a lot of
> complexity for something I intended as a test module.  I can spend some
> more time on this if you think it's worth fixing for a contrib module.

Well, my first thought was to wonder whether we even care about that
scenario, but I guess we probably do, at least a little bit.

How about:

1. Name temporary files like
archive_temp.${FINAL_NAME}.${PID}.${SOME_RANDOM_NUMBER}. Create them
with O_EXCL. If it fails, die.

2. Try not to leave files like this behind, perhaps installing an
on_proc_exit callback or similar, but accept that crashes and unlink()
failures will make it inevitable in some cases.

3. Document that crashes or other strange failure cases may leave
archive_temp.* files behind in the archive directory, and recommend
that users remove them before restarting the database after a crash
(or, with caution, removing them while the database is running if the
user is sure that the files are old and unrelated to any archiving
still in progress).
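Step 1 might look roughly like this; the naming pattern follows the proposal above, everything else is illustrative:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of step 1 above: derive a temporary name from the final
 * name, the PID, and a random number, and create it with O_EXCL so a
 * collision fails with EEXIST instead of clobbering another server's
 * in-progress file.  This is a sketch, not the basic_archive code. */
static int
create_archive_temp(const char *dir, const char *final_name,
					char *path, size_t pathlen)
{
	snprintf(path, pathlen, "%s/archive_temp.%s.%d.%ld",
			 dir, final_name, (int) getpid(), random());

	/* O_EXCL: fail rather than reuse an existing file */
	return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

The PID plus random suffix makes collisions between concurrent archivers unlikely, and O_EXCL turns any remaining collision into a hard error, which matches the "if it fails, die" behavior proposed above.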

I'm not arguing that this is exactly the right idea. But I am arguing
that it shouldn't take a ton of engineering to come up with something
reasonable here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: support for MERGE

2022-01-28 Thread Andres Freund
Hi,

On 2022-01-28 17:27:37 -0300, Alvaro Herrera wrote:
> MERGE, v10.  I am much more comfortable with this version; I have
> removed a bunch of temporary hacks and cleaned up the interactions with
> table AM and executor, which is something that had been bothering me for
> a while.  The complete set of changes can be seen in github,
> https://github.com/alvherre/postgres/commits/merge-15

Any chance you could split this into something more reviewable? The overall
diff stats are: 102 files changed, 8589 insertions(+), 234 deletions(-); that's
pretty hard to really review. Incremental commits don't really help with that.


> I am not aware of anything of significance in terms of remaining work
> for this project.

One thing from skimming: There's not enough documentation about the approach
imo - it's a complicated enough feature to deserve more than the one paragraph
in src/backend/executor/README.


I'm still [1],[2] uncomfortable with execMerge.c kinda-but-not-really being an
executor node.


> The one thing I'm a bit bothered about is the fact that we expose a lot of
> executor functions previously static.  I am now wondering if it would be
> better to move the MERGE executor support functions into nodeModifyTable.c,
> which I think would mean we would not have to expose those function
> prototypes.

I'm worried about the complexity of nodeModifyTable.c as is. Moving more code
in there does not seem like the right call to me - we should do the opposite
if anything.


A few inline comments below. No real review, just stuff noticed while skimming
to see where this is at.


Greetings,

Andres


[1] 
https://www.postgresql.org/message-id/2018040523.gar3j26gsk32g...@alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/20180403021800.b5nsgiclzanob...@alap3.anarazel.de


> diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
> index 1a9c1ac290..280ac40e63 100644
> --- a/src/backend/commands/trigger.c
> +++ b/src/backend/commands/trigger.c

This stuff seems like one candidate for splitting out.


> + /*
> +  * We maintain separate transition tables for UPDATE/INSERT/DELETE since
> +  * MERGE can run all three actions in a single statement. Note that 
> UPDATE
> +  * needs both old and new transition tables whereas INSERT needs only 
> new,
> +  * and DELETE needs only old.
> +  */
> +
> + /* "old" transition table for UPDATE, if any */
> + Tuplestorestate *old_upd_tuplestore;
> + /* "new" transition table for UPDATE, if any */
> + Tuplestorestate *new_upd_tuplestore;
> + /* "old" transition table for DELETE, if any */
> + Tuplestorestate *old_del_tuplestore;
> + /* "new" transition table for INSERT, if any */
> + Tuplestorestate *new_ins_tuplestore;
> +
>   TupleTableSlot *storeslot;  /* for converting to tuplestore's 
> format */
>  };

Do we need to do something slightly clever about the memory limits for the
tuplestore? Each of them having a separate work_mem makes nodeModifyTable.c
one memory hungry node (the worst of all maybe?).




> +
> +/*
> + * Perform MERGE.
> + */
> +TupleTableSlot *
> +ExecMerge(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo,
> +   EState *estate, TupleTableSlot *slot)
> +{
> + ExprContext *econtext = mtstate->ps.ps_ExprContext;
> + ItemPointer tupleid;
> + ItemPointerData tuple_ctid;
> + boolmatched = false;
> + Datum   datum;
> + boolisNull;
> +
> + /*
> +  * Reset per-tuple memory context to free any expression evaluation
> +  * storage allocated in the previous cycle.
> +  */
> + ResetExprContext(econtext);

Hasn't this already happened in ExecModifyTable()?


> + /*
> +  * We run a JOIN between the target relation and the source relation to
> +  * find a set of candidate source rows that have a matching row in the
> +  * target table and a set of candidate source rows that do not have a
> +  * matching row in the target table. If the join returns a tuple with 
> the
> +  * target relation's row-ID set, that implies that the join found a
> +  * matching row for the given source tuple. This case triggers the WHEN
> +  * MATCHED clause of the MERGE. Whereas a NULL in the target relation's
> +  * row-ID column indicates a NOT MATCHED case.
> +  */
> + datum = ExecGetJunkAttribute(slot,
> +  
> resultRelInfo->ri_RowIdAttNo,
> +  &isNull);

Could this be put somewhere common with the equivalent call in
ExecModifyTable()?


> +  * A concurrent update can:
> +  *
> +  * 1. modify the target tuple so that it no longer satisfies the
> +  *additional quals attached to the current WHEN MATCHED action
> +  *
> +  *In this case, we are still dealing with a WHEN MATCHED case.
> +  *In this case, we rechec

Re: support for MERGE

2022-01-28 Thread Zhihong Yu
On Fri, Jan 28, 2022 at 2:19 PM Andres Freund  wrote:

> Hi,
>
> On 2022-01-28 17:27:37 -0300, Alvaro Herrera wrote:
> > MERGE, v10.  I am much more comfortable with this version; I have
> > removed a bunch of temporary hacks and cleaned up the interactions with
> > table AM and executor, which is something that had been bothering me for
> > a while.  The complete set of changes can be seen in github,
> > https://github.com/alvherre/postgres/commits/merge-15
>
> Any chance you could split this into something more reviewable? The overall
> diff stats are: 102 files changed, 8589 insertions(+), 234 deletions(-)
> thats
> pretty hard to really review. Incremental commits don't realy help with
> that.
>
>
> > I am not aware of anything of significance in terms of remaining work
> > for this project.
>
> One thing from skimming: There's not enough documentation about the
> approach
> imo - it's a complicated enough feature to deserve more than the one
> paragraph
> in src/backend/executor/README.
>
>
> I'm still [1],[2] uncomfortable with execMerge.c kinda-but-not-really
> being an
> executor node.
>
>
> > The one thing I'm a bit bothered about is the fact that we expose a lot
> of
> > executor functions previously static.  I am now wondering if it would be
> > better to move the MERGE executor support functions into
> nodeModifyTable.c,
> > which I think would mean we would not have to expose those function
> > prototypes.
>
> I'm worried about the complexity of nodeModifyTuple.c as is. Moving more
> code
>

Hi,
I think you meant nodeModifyTable.c.
And I agree with your comment (on not making nodeModifyTable.c bigger).

Cheers


> in there does not seem like the right call to me - we should do the
> opposite
> if anything.
>
>
> A few inline comments below. No real review, just stuff noticed while
> skimming
> to see where this is at.
>
>
> Greetings,
>
> Andres
>
>
> [1]
> https://www.postgresql.org/message-id/2018040523.gar3j26gsk32g...@alap3.anarazel.de
> [2]
> https://www.postgresql.org/message-id/20180403021800.b5nsgiclzanob...@alap3.anarazel.de
>
>
>


Re: Add 64-bit XIDs into PostgreSQL 15

2022-01-28 Thread Andres Freund
Hi,

On 2022-01-24 16:38:54 +0400, Pavel Borisov wrote:
> +64-bit Transaction ID's (XID)
> +=
> +
> +A limited number (N = 2^32) of XID's required to do vacuum freeze to prevent
> +wraparound every N/2 transactions. This causes performance degradation due
> +to the need to exclusively lock tables while being vacuumed. In each
> +wraparound cycle, SLRU buffers are also being cut.

What exclusive lock?


> +"Double XMAX" page format
> +-
> +
> +At first read of a heap page after pg_upgrade from 32-bit XID PostgreSQL
> +version pd_special area with a size of 16 bytes should be added to a page.
> +Though a page may not have space for this. Then it can be converted to a
> +temporary format called "double XMAX".
>
> +All tuples after pg-upgrade would necessarily have xmin = 
> FrozenTransactionId.

Why would a tuple after pg-upgrade necessarily have xmin =
FrozenTransactionId? A pg_upgrade doesn't scan the tables, so the pg_upgrade
itself doesn't do anything to xmins.

I guess you mean that the xmin cannot be needed anymore, because no older
transaction can be running?


> +In-memory tuple format
> +--
> +
> +In-memory tuple representation consists of two parts:
> +- HeapTupleHeader from disk page (contains all heap tuple contents, not only
> +header)
> +- HeapTuple with additional in-memory fields
> +
> +HeapTuple for each tuple in memory stores t_xid_base/t_multi_base - a copies 
> of
> +page's pd_xid_base/pd_multi_base. With tuple's 32-bit t_xmin and t_xmax from
> +HeapTupleHeader they are used to calculate actual 64-bit XMIN and XMAX:
> +
> +XMIN = t_xmin + t_xid_base.  (3)
> +XMAX = t_xmax + t_xid_base/t_multi_base. (4)

What identifies a HeapTuple as having this additional data?
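For readers skimming the thread, the arithmetic in formulas (3) and (4) above can be sketched as follows; the field and function names here are illustrative, not the patch's actual API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the base+offset scheme quoted above: the page
 * header carries a 64-bit pd_xid_base, and each tuple's 32-bit t_xmin /
 * t_xmax is interpreted as an offset from that base. */
static uint64_t
xid64_from_base(uint64_t pd_xid_base, uint32_t t_xid)
{
    /* XMIN = t_xmin + t_xid_base  -- formula (3) */
    return pd_xid_base + (uint64_t) t_xid;
}
```

The downside quoted below applies here: if t_xid points into a page in shared buffers, it should be fetched once into a local variable before use, since another backend may update it concurrently.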


> +The downside of this is that we can not use tuple's XMIN and XMAX right away.
> +We often need to re-read t_xmin and t_xmax - which could actually be pointers
> +into a page in shared buffers and therefore they could be updated by any 
> other
> +backend.

Ugh, that's not great.


> +Upgrade from 32-bit XID versions
> +
> +
> +pg_upgrade doesn't change pages format itself. It is done lazily after.
> +
> +1. At first heap page read, tuples on a page are repacked to free 16 bytes
> +at the end of a page, possibly freeing space from dead tuples.

That will cause a *massive* torrent of writes after an upgrade. Isn't this
practically making pg_upgrade useless?  Imagine a huge cluster where most of
the pages are all-frozen, upgraded using link mode.


What happens if the first access happens on a replica?


What is the approach for dealing with multixact files? They have xids
embedded?  And currently the SLRUs will break if you just let the offsets SLRU
grow without bounds.



> +void
> +convert_page(Relation rel, Page page, Buffer buf, BlockNumber blkno)
> +{
> + PageHeader  hdr = (PageHeader) page;
> + GenericXLogState *state = NULL;
> + Pagetmp_page = page;
> + uint16  checksum;
> +
> + if (!rel)
> + return;
> +
> + /* Verify checksum */
> + if (hdr->pd_checksum)
> + {
> + checksum = pg_checksum_page((char *) page, blkno);
> + if (checksum != hdr->pd_checksum)
> + ereport(ERROR,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> +  errmsg("page verification failed, 
> calculated checksum %u but expected %u",
> + checksum, 
> hdr->pd_checksum)));
> + }
> +
> + /* Start xlog record */
> + if (!XactReadOnly && XLogIsNeeded() && RelationNeedsWAL(rel))
> + {
> + state = GenericXLogStart(rel);
> + tmp_page = GenericXLogRegisterBuffer(state, buf, 
> GENERIC_XLOG_FULL_IMAGE);
> + }
> +
> + PageSetPageSizeAndVersion((hdr), PageGetPageSize(hdr),
> +   
> PG_PAGE_LAYOUT_VERSION);
> +
> + if (was_32bit_xid(hdr))
> + {
> + switch (rel->rd_rel->relkind)
> + {
> + case 'r':
> + case 'p':
> + case 't':
> + case 'm':
> + convert_heap(rel, tmp_page, buf, blkno);
> + break;
> + case 'i':
> + /* no need to convert index */
> + case 'S':
> + /* no real need to convert sequences */
> + break;
> + default:
> + elog(ERROR,
> +  "Conversion for relkind '%c' is not 
> implemented",
> +  rel->rd_rel->relkind);
> + }
> + }
> +
> + /*
> +  * Mark buffer dirty unless this is

Re: support for MERGE

2022-01-28 Thread Justin Pryzby
On Fri, Jan 28, 2022 at 05:27:37PM -0300, Alvaro Herrera wrote:
> The one thing I'm a bit bothered about is the fact
> that we expose a lot of executor functions previously static.  I am now
> wondering if it would be better to move the MERGE executor support
> functions into nodeModifyTable.c, which I think would mean we would not
> have to expose those function prototypes.

It's probably a good idea.  

If you wanted to avoid bloating nodeModifyTable.c, maybe you could
#include "execMerge.c"

From the commit message:

> MERGE does not yet support inheritance,

It does support it now, right ?

From merge.sgml:

"If you specify an update action...":
=> should say "If an update action is specified, ..."

s/an delete/a delete/

".. the WHEN clause is executed"
=> should say "the WHEN clause's action is executed" ?

" If a later WHEN clause of that kind is specified"
=> + COMMA

> --- a/doc/src/sgml/ref/allfiles.sgml
> +++ b/doc/src/sgml/ref/allfiles.sgml
> @@ -159,6 +159,7 @@ Complete list of usable sgml source files in this 
> directory.
>  
>  
>  
> +
>  
>  
>  

Looks like this is intended to be in alpha order.

> +  insert, update, or delete rows of a table based upon source 
> data

based on ?

> --- a/src/backend/executor/README
> +++ b/src/backend/executor/README
> @@ -41,6 +41,19 @@ be used for other table types.)  For DELETE, the plan tree 
> need only deliver
>  junk row-identity column(s), and the ModifyTable node visits each of those
>  rows and marks the row deleted.
>  
> +MERGE runs one generic plan that returns candidate change rows. Each row
> +consists of the output of the data-source table or query, plus CTID and
> +(if the target table is partitioned) TABLEOID junk columns.  If the target

s/partitioned/has child tables/ ?

>   case CMD_INSERT:
>   case CMD_DELETE:
>   case CMD_UPDATE:
> + case CMD_MERGE:

Is it intended to stay in alpha order (?)

> + case WCO_RLS_MERGE_UPDATE_CHECK:
> + case WCO_RLS_MERGE_DELETE_CHECK:
> + if (wco->polname != NULL)
> + ereport(ERROR,
> + 
> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> +  errmsg("target 
> row violates row-level security policy \"%s\" (USING expression) for table 
> \"%s\"",
> + 
> wco->polname, wco->relname)));

The parens around errcode are optional and IMO should be avoided for new code.

> +  * This duplicates much of the logic in ExecInitMerge(), so something
> +  * changes there, look here too.

so *if* ?

>   case T_InsertStmt:
>   case T_DeleteStmt:
>   case T_UpdateStmt:
> + case T_MergeStmt:
>   lev = LOGSTMT_MOD;
>   break;

alphabetize (?)

> + /* selcondition */
> + "c.relkind IN (" CppAsString2(RELKIND_RELATION) ", "
> + CppAsString2(RELKIND_PARTITIONED_TABLE) ") AND "
> + "c.relhasrules = false AND "
> + "(c.relhassubclass = false OR "
> + " c.relkind = " CppAsString2(RELKIND_PARTITIONED_TABLE) ")",

relhassubclass=false is wrong now ?

> +-- prepare
> +RESET SESSION AUTHORIZATION;
> +DROP TABLE target, target2;
> +DROP TABLE source, source2;
> +DROP FUNCTION merge_trigfunc();
> +DROP USER merge_privs;
> +DROP USER merge_no_privs;

Why does it say "prepare" ?
I think it means to say "Clean up"

WRITE_READ_PARSE_PLAN_TREES exposes errors in make check:
+ERROR:  targetColnos does not match subplan target list

Have you looked at code coverage ?  I have an experimental patch to add that to
cirrus, and ran it with this patch; visible here:
https://cirrus-ci.com/task/6362512059793408

-- 
Justin




Re: pg_upgrade should truncate/remove its logs before running

2022-01-28 Thread Andrew Dunstan


On 1/28/22 08:42, Michael Paquier wrote:
> On Wed, Jan 26, 2022 at 11:00:28AM +0900, Michael Paquier wrote:
>> Bleh.  This would point to the old data directory, so this needs to be
>> "$self->{pgsql}/src/bin/pg_upgrade/tmp_check/data/pg_upgrade_output.d/log/*.log"
>> to point to the upgraded cluster.
> Please note that I have sent a patch to merge this change in the
> buildfarm code.  Comments are welcome.



I have committed this. But it will take time for every buildfarm owner
to upgrade. I will try to make a new release ASAP.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: support for MERGE

2022-01-28 Thread Erik Rijkers

Op 28-01-2022 om 21:27 schreef Alvaro Herrera:

MERGE, v10.  I am much more comfortable with this version; I have
removed a bunch of temporary hacks and cleaned up the interactions with
table AM and executor, which is something that had been bothering me for
a while.  The complete set of changes can be seen in github,
https://github.com/alvherre/postgres/commits/merge-15



[v10-0001-MERGE-SQL-Command-following-SQL-2016.patch]


The patch doesn't apply smoothly:

patching file src/backend/tcop/pquery.c
patching file src/backend/tcop/utility.c
patching file src/backend/utils/adt/ruleutils.c
patching file src/bin/psql/tab-complete.c
Hunk #3 FAILED at 1714.
Hunk #4 succeeded at 3489 (offset 6 lines).
Hunk #5 succeeded at 3508 (offset 6 lines).
Hunk #6 succeeded at 3776 (offset 6 lines).
Hunk #7 succeeded at 3855 (offset 6 lines).
1 out of 7 hunks FAILED -- saving rejects to file 
src/bin/psql/tab-complete.c.rej

patching file src/include/commands/trigger.h
patching file src/include/executor/execMerge.h
patching file src/include/executor/instrument.h
patching file src/include/executor/nodeModifyTable.h



tab-complete.c.rej attached


Erik

--- src/bin/psql/tab-complete.c
+++ src/bin/psql/tab-complete.c
@@ -1714,6 +1736,12 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER PUBLICATION  ADD */
 	else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD"))
 		COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");
+	else if (Matches("ALTER", "PUBLICATION", MatchAny, "ADD|SET", "TABLE") ||
+			 (HeadMatches("ALTER", "PUBLICATION", MatchAny, "ADD|SET", "TABLE") &&
+			  ends_with(prev_wd, ',')))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables, NULL);
+	else if (HeadMatches("ALTER", "PUBLICATION", MatchAny, "ADD|SET", "TABLE"))
+		COMPLETE_WITH(",");
 	/* ALTER PUBLICATION  DROP */
 	else if (Matches("ALTER", "PUBLICATION", MatchAny, "DROP"))
 		COMPLETE_WITH("ALL TABLES IN SCHEMA", "TABLE");


Re: warn if GUC set to an invalid shared library

2022-01-28 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

Hello

I tested the patches on the master branch on Ubuntu 18.04 and the regression 
tests turned out fine. I did a manual test following the query examples in this 
email thread, and I do see the warnings when I attempted these queries:

postgres=# SET local_preload_libraries=xyz.so;
2022-01-28 15:11:00.592 PST [13622] WARNING:  could not access file "xyz.so"
WARNING:  could not access file "xyz.so"
SET
postgres=# ALTER SYSTEM SET shared_preload_libraries=abc.so;
2022-01-28 15:11:07.729 PST [13622] WARNING:  could not access file 
"$libdir/plugins/abc.so"
WARNING:  could not access file "$libdir/plugins/abc.so"
ALTER SYSTEM

This is fine as this is what these patches are aiming to provide. However, when 
I try to restart the server, it fails to start because abc.so and xyz.so do not 
exist. Setting the parameters "local_preload_libraries" and 
"shared_preload_libraries" to something else in postgresql.conf does not seem to 
take effect either.
It still complains that shared_preload_libraries abc.so does not exist even though 
I have already set shared_preload_libraries to something else in postgresql.conf. 
This seems a little strange to me.

thank you
Cary

Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2022-01-28 Thread Andres Freund
Hi,

On 2021-12-08 03:04:23 +0100, Tomas Vondra wrote:
> On 12/8/21 02:54, Bharath Rupireddy wrote:
> >> I'm not sure about adding it to control data, though. That doesn't seem
> >> like a very good match for something that's mostly for monitoring.
> > 
> > Having it in the control data file (along with the existing checkpoint
> > information) will be helpful to know what was the last checkpoint
> > information and we can use the existing pg_control_checkpoint function
> > or the tool to emit that info. I plan to add an int16 flag as
> > suggested by Justin in this thread and come up with a patch.
> > 
> 
> OK, although I'm not sure it's all that useful (if we have that in some
> sort of system view).

I don't think we should add stuff like this to the control file. We want to
keep ControlFileData within a disk sector size / 512 bytes (so that we don't
need to care about torn writes etc). Adding information that we don't really
need will bite us at some point, because at that point there will be reliance
on the added data.

Nor have I read a convincing reason for needing the data to be readily
accessible when the server is shut down?

Greetings,

Andres Freund




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2022-01-28 Thread Andres Freund
Hi,

On 2021-12-07 20:06:22 +0530, Bharath Rupireddy wrote:
> One concern is that we don't want to increase the size of pg_controldata by
> more than the typical block size (of 8K) to avoid any torn-writes.

The limit is 512 bytes (a disk sector), not 8K. There are plenty devices with
4K sectors as well, but I'm not aware of any with 8K.
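The sector-size constraint can be enforced at compile time. Here is a hedged sketch with a stand-in struct; DemoControlData is illustrative, not the real ControlFileData, and 512 stands in for the safe sector size:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for a control-file struct.  The point is the
 * compile-time guard: keeping the struct within one 512-byte sector
 * means a torn write cannot leave a partially updated copy on disk. */
typedef struct DemoControlData
{
    uint64_t system_identifier;
    uint32_t pg_control_version;
    uint32_t catalog_version_no;
    uint64_t checkPoint;
    uint8_t  checkPointKind;    /* the field debated in this thread */
} DemoControlData;

_Static_assert(sizeof(DemoControlData) <= 512,
               "control data must fit in a single 512-byte disk sector");
```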

Greetings,

Andres Freund




Re: archive modules

2022-01-28 Thread Nathan Bossart
On Fri, Jan 28, 2022 at 03:20:41PM -0500, Robert Haas wrote:
> On Fri, Jan 28, 2022 at 3:01 PM Nathan Bossart  
> wrote:
>> I discussed the two main deficiencies I'm aware of with basic_archive
>> earlier [0].  The first one is the issue with "incovenient" server crashes
>> (mentioned below).
> 
> Seems easy enough to rectify, if it's just a matter of 
> silently-succeed-if-same.

Yes.

>> The second is that there is no handling for multiple
>> servers writing to the same location since the temporary file is always
>> named "archtemp."  I thought about a few ways to pick a unique file name
>> (or at least one that is _probably_ unique), but that began adding a lot of
>> complexity for something I intended as a test module.  I can spend some
>> more time on this if you think it's worth fixing for a contrib module.
> 
> Well, my first thought was to wonder whether we even care about that
> scenario, but I guess we probably do, at least a little bit.
> 
> How about:
> 
> 1. Name temporary files like
> archive_temp.${FINAL_NAME}.${PID}.${SOME_RANDOM_NUMBER}. Create them
> with O_EXCL. If it fails, die.
> 
> 2. Try not to leave files like this behind, perhaps installing an
> on_proc_exit callback or similar, but accept that crashes and unlink()
> failures will make it inevitable in some cases.
> 
> 3. Document that crashes or other strange failure cases may leave
> archive_temp.* files behind in the archive directory, and recommend
> that users remove them before restarting the database after a crash
> (or, with caution, removing them while the database is running if the
> user is sure that the files are old and unrelated to any archiving
> still in progress).
> 
> I'm not arguing that this is exactly the right idea. But I am arguing
> that it shouldn't take a ton of engineering to come up with something
> reasonable here.

This is roughly what I had in mind.  I'll give it a try.
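A minimal sketch of point 1 above, with illustrative names (the real module's naming, randomness source, and error handling would differ):

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Derive a per-process temporary name and create it with O_EXCL so two
 * archivers writing to the same directory cannot clobber each other.
 * Returns the open fd, or -1 if the name already exists ("if it fails,
 * die" in the proposal above). */
static int
open_archive_temp(const char *dir, const char *final_name,
                  char *out, size_t outlen)
{
    snprintf(out, outlen, "%s/archive_temp.%s.%d.%ld",
             dir, final_name, (int) getpid(), random());
    return open(out, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```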

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add last commit LSN to pg_last_committed_xact()

2022-01-28 Thread James Coleman
On Thu, Jan 20, 2022 at 8:15 AM James Coleman  wrote:
>
> On Wed, Jan 19, 2022 at 10:12 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2022-01-19 21:23:12 -0500, James Coleman wrote:
> > >  { oid => '3537', descr => 'get identification of SQL object',
> > > diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
> > > index a5f9e9..2a026b0844 100644
> > > --- a/src/include/storage/proc.h
> > > +++ b/src/include/storage/proc.h
> > > @@ -258,6 +258,11 @@ struct PGPROC
> > >   PGPROC *lockGroupLeader;/* lock group leader, if I'm a 
> > > member */
> > >   dlist_head  lockGroupMembers;   /* list of members, if I'm 
> > > a leader */
> > >   dlist_node  lockGroupLink;  /* my member link, if I'm a member 
> > > */
> > > +
> > > + /*
> > > +  * Last transaction metadata.
> > > +  */
> > > + XLogRecPtr  lastCommitLSN;  /* cache of last committed 
> > > LSN */
> > >  };
> >
> > Might be worth forcing this to be on a separate cacheline than stuff more
> > hotly accessed by other backends, like the lock group stuff.
>
> What's the best way to do that? I'm poking around and don't see any
> obvious cases of doing that in a struct definition. I could add a
> char* of size PG_CACHE_LINE_SIZE, but that seems unnecessarily
> wasteful, and the other ALIGN macros seem mostly used in situations
> where we're allocating memory. Is it possible in C to get the size of
> the struct so far to be able to subtract from PG_CACHE_LINE_SIZE?
> Maybe there's some other approach I'm missing...

Looking at this again, I see two ways to do this so far:

First would be to have a container struct and two structs inside --
something like one struct for local process access and one for shared
process access. But that seems like it'd likely end up pretty messy in
terms of how much it'd affect other parts of the code, so I'm hesitant
to go down that path.

Alternatively I see pg_attribute_aligned, but that's not defined
(AFAICT) on clang, for example, so I'm not sure that'd be acceptable?

It doesn't seem to me that there's anything like CACHELINEALIGN that
would work in this context (in a struct definition) since that appears
to be designed to work with allocated memory.

Is there an approach I'm missing? Or does one of these seem reasonable?

Thanks,
James Coleman




Re: warn if GUC set to an invalid shared library

2022-01-28 Thread Justin Pryzby
Thanks for looking.

On Fri, Jan 28, 2022 at 11:36:20PM +, Cary Huang wrote:
> This is fine as this is what these patches are aiming to provide. However, 
> when I try to restart the server, it fails to start because abc.so and xyz.so 
> do not exist. Setting the parameters "local_preload_libraries" and 
> "shared_preload_libraries" to something else in postgresql.conf does not seem 
> to take effect either.
> It still complains shared_preload_libraries abc.so does not exist even though 
> I have already set shared_preload_libraries to something else in 
> postgresql.conf. This seems a little strange to me 

Could you show exactly what you did and the output ?

The patches don't entirely prevent someone from putting the server config into
a bad state.  They only aim to tell them if they've done that, so they can fix
it, rather than letting someone (else) find the error at some later (probably
inconvenient) time.

ALTER SYSTEM adds config into postgresql.auto.conf.  If you stop the server
after adding bad config there (after getting a warning), the server won't
start.  Once the server is off, you have to remove it manually.

The goal of the patch is to 1) warn someone that they've put a bad config in
place, so they don't leave it there; and, 2) if the server fails to start for
such a reason, provide a CONTEXT line to help them resolve it quickly.

Maybe you know all that and I didn't understand what you're saying.

-- 
Justin




Re: row filtering for logical replication

2022-01-28 Thread Andres Freund
Hi,

Are there any recent performance evaluations of the overhead of row filters? I
think it'd be good to get some numbers comparing:

1) $workload with master
2) $workload with patch, but no row filters
3) $workload with patch, row filter matching everything
4) $workload with patch, row filter matching few rows

For workload I think it'd be worth testing:
a) bulk COPY/INSERT into one table
b) Many transactions doing small modifications to one table
c) Many transactions targeting many different tables
d) Interspersed DDL + small changes to a table


> +/*
> + * Initialize for row filter expression execution.
> + */
> +static ExprState *
> +pgoutput_row_filter_init_expr(Node *rfnode)
> +{
> + ExprState  *exprstate;
> + Expr   *expr;
> +
> + /*
> +  * This is the same code as ExecPrepareExpr() but that is not used 
> because
> +  * we want to cache the expression. There should probably be another
> +  * function in the executor to handle the execution outside a normal 
> Plan
> +  * tree context.
> +  */
> + expr = expression_planner((Expr *) rfnode);
> + exprstate = ExecInitExpr(expr, NULL);
> +
> + return exprstate;
> +}

In what memory context does this run? Are we taking care to deal with leaks?
I'm pretty sure the planner relies on cleanup via memory contexts.


> + memset(entry->exprstate, 0, sizeof(entry->exprstate));
> +
> + schemaId = get_rel_namespace(entry->publish_as_relid);
> + schemaPubids = GetSchemaPublications(schemaId);

Isn't this stuff that we've already queried before? If we re-fetch a lot of
information it's not clear to me that it's actually a good idea to defer
building the row filter.


> + am_partition = get_rel_relispartition(entry->publish_as_relid);

All this stuff likely can cause some memory "leakage" if you run it in a
long-lived memory context.


> + /*
> +  * Find if there are any row filters for this relation. If there are,
> +  * then prepare the necessary ExprState and cache it in
> +  * entry->exprstate. To build an expression state, we need to ensure
> +  * the following:
> +  *
> +  * All publication-table mappings must be checked.
> +  *
> +  * If the relation is a partition and pubviaroot is true, use the row
> +  * filter of the topmost partitioned table instead of the row filter of
> +  * its own partition.
> +  *
> +  * Multiple publications might have multiple row filters for this
> +  * relation. Since row filter usage depends on the DML operation, there
> +  * are multiple lists (one for each operation) to which row filters
> +  * will be appended.
> +  *
> +  * FOR ALL TABLES implies "don't use row filter expression" so it takes
> +  * precedence.
> +  *
> +  * ALL TABLES IN SCHEMA implies "don't use row filter expression" if
> +  * the schema is the same as the table schema.
> +  */
> + foreach(lc, data->publications)
> + {
> + Publication *pub = lfirst(lc);
> + HeapTuple   rftuple = NULL;
> + Datum   rfdatum = 0;
> + boolpub_no_filter = false;
> +
> + if (pub->alltables)
> + {
> + /*
> +  * If the publication is FOR ALL TABLES then it is 
> treated the
> +  * same as if this table has no row filters (even if 
> for other
> +  * publications it does).
> +  */
> + pub_no_filter = true;
> + }
> + else if (list_member_oid(schemaPubids, pub->oid))
> + {
> + /*
> +  * If the publication is FOR ALL TABLES IN SCHEMA and 
> it overlaps
> +  * with the current relation in the same schema then 
> this is also
> +  * treated same as if this table has no row filters 
> (even if for
> +  * other publications it does).
> +  */
> + pub_no_filter = true;

Isn't this basically O(schemas * publications)?




> + if (has_filter)
> + {
> + /* Create or reset the memory context for row filters */
> + if (entry->cache_expr_cxt == NULL)
> + entry->cache_expr_cxt = 
> AllocSetContextCreate(CacheMemoryContext,
> + 
>   "Row filter expressions",
> + 
>   ALLOCSET_DEFAULT_SIZES);
> + else
> + MemoryContextReset(entry->cache_expr_cxt);

I see this started before this patch, but I don't think it's a great idea that
pgoutput does a bunch of stuff in CacheMemoryContext. That makes it
unnecessarily hard to debug leaks.

Seems like all this should liv

Re: Add last commit LSN to pg_last_committed_xact()

2022-01-28 Thread Andres Freund
Hi,

On 2022-01-28 18:43:57 -0500, James Coleman wrote:
> Alternatively I see pg_attribute_aligned, but that's not defined
> (AFAICT) on clang, for example, so I'm not sure that'd be acceptable?

clang should have it (it defines __GNUC__). The problem would be msvc, I
think. Not sure if there's a way to get to a common way of defining it between
gcc-like compilers and msvc (the rest is niche enough that we don't need to
care about the efficiency I think).


> Is there an approach I'm missing? Or does one of these seem reasonable?

I'd probably just slap a char *pad[PG_CACHELINE_SIZE] in there if the above
can't be made to work.

Greetings,

Andres Freund




Re: Add last commit LSN to pg_last_committed_xact()

2022-01-28 Thread Andres Freund
On 2022-01-28 16:36:32 -0800, Andres Freund wrote:
> On 2022-01-28 18:43:57 -0500, James Coleman wrote:
> > Alternatively I see pg_attribute_aligned, but that's not defined
> > (AFAICT) on clang, for example, so I'm not sure that'd be acceptable?
> 
> clang should have it (it defines __GNUC__). The problem would be msvc, I
> think. Not sure if there's a way to get to a common way of defining it between
> gcc-like compilers and msvc (the rest is niche enough that we don't need to
> care about the efficiency I think).

Seems like it's doable:

https://godbolt.org/z/3c5573bTW
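For the archives, the portable spelling looks roughly like this; 64 stands in for PG_CACHE_LINE_SIZE and the macro and struct names are made up:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a compiler-portable cache-line alignment attribute, per the
 * discussion above: gcc/clang spell it __attribute__((aligned)), MSVC
 * spells it __declspec(align). */
#if defined(_MSC_VER)
#define DEMO_CACHELINE_ALIGNED __declspec(align(64))
#else
#define DEMO_CACHELINE_ALIGNED __attribute__((aligned(64)))
#endif

/* Stand-in for the PGPROC field being discussed; aligning the struct
 * keeps lastCommitLSN off cache lines hotly accessed by other backends. */
typedef struct DEMO_CACHELINE_ALIGNED DemoLastCommit
{
    uint64_t lastCommitLSN;
} DemoLastCommit;
```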




Re: pg_upgrade should truncate/remove its logs before running

2022-01-28 Thread Michael Paquier
On Fri, Jan 28, 2022 at 06:27:29PM -0500, Andrew Dunstan wrote:
> I have committed this. But it will take time for every buildfarm owner
> to upgrade.

Thanks for that.

> I will try to make a new release ASAP.

And thanks for that, as well.
--
Michael


signature.asc
Description: PGP signature


Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2022-01-28 Thread Michael Paquier
On Fri, Jan 28, 2022 at 06:32:27PM +0900, Kyotaro Horiguchi wrote:
> End-of-recovery checkpoint is requested as CHECKPOINT_WAIT, which
> seems to me to mean the state is always DB_IN_ARCHIVE_RECOVERY while
> the checkpoint is running?

With the patch, yes, we would keep the control file under
DB_IN_{ARCHIVE,CRASH}_RECOVERY during the whole period of the
end-of-recovery checkpoint.  On HEAD, we would have DB_SHUTDOWNING
until the end-of-recovery checkpoint completes, after which we switch
to DB_SHUTDOWNED for a short time before DB_IN_PRODUCTION.

> If correct, if the server is killed during the
> end-of-recovery checkpoint, the state stays DB_IN_ARCHIVE_RECOVERY
> instead of DB_SHUTDOWNING or DB_SHUTDOWNED.  AFAICS there's no
> difference between the first two at next startup.

One difference is the hint given to the user at the follow-up startup.
Crash and archive recovery states can easily freak people out, as they
mention the risk of corruption.  Using DB_SHUTDOWNING reduces this
window.

> I don't think the DB_SHUTDOWNED case is worth considering here.

Well, this patch also means there is a small window where we have
DB_IN_ARCHIVE_RECOVERY in the control file with the new timeline and
minRecoveryPoint not set, rather than DB_SHUTDOWNED.
--
Michael


signature.asc
Description: PGP signature


Re: make MaxBackends available in _PG_init

2022-01-28 Thread Michael Paquier
On Thu, Jan 27, 2022 at 10:18:15AM -0800, Nathan Bossart wrote:
> Alright.  I think the comment adjustments still apply, so I split those out
> to a new patch.

Looks fine after a second look, so applied.

As of the issues of this thread, we really have two things to think
about:
1) How do we want to control the access of some parameters in a
context or another?  One idea would be more control through GUCs, say
with a set of context-related flags that prevent the read of some
variables until they are set.  We could encourage the use of
GetConfigOption() for that.  For MaxBackends, we could add a read-only
GUC for this purpose.  That's what Andres hinted at upthread, I
guess.
2) How do we deal with unwanted access of shared parameters?  This one
is not really controllable, is it?  And we are talking about much more
than MaxBackends.  This could perhaps be addressed with more
documentation in the headers for the concerned variables, as a first
step.
--
Michael


signature.asc
Description: PGP signature


Re: Printing backtrace of postgres processes

2022-01-28 Thread vignesh C
On Fri, Jan 28, 2022 at 1:54 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 27, 2022 at 10:45 AM vignesh C  wrote:
> >
> > On Wed, Jan 26, 2022 at 11:07 AM Bharath Rupireddy
> >  wrote:
> > >
> > > On Tue, Jan 25, 2022 at 12:00 PM vignesh C  wrote:
> > > > Thanks for the comments, attached v17 patch has the fix for the same.
> > >
> > > Have a minor comment on v17:
> > >
> > > can we modify the elog(LOG, to new style ereport(LOG, ?
> > > + elog(LOG_SERVER_ONLY, "current backtrace:%s", errtrace.data);
> > >
> > > /*--
> > >  * New-style error reporting API: to be used in this way:
> > >  *  ereport(ERROR,
> > >  *  errcode(ERRCODE_UNDEFINED_CURSOR),
> > >  *  errmsg("portal \"%s\" not found", stmt->portalname),
> > >  *  ... other errxxx() fields as needed ...);
> > >  *
> >
> > Attached v18 patch has the changes for the same.
>
> Thanks. The v18 patch LGTM. I'm not sure if the CF bot failure is
> related to the patch -  https://cirrus-ci.com/task/5633364051886080

I felt this test failure is not related to this change. Let's wait and
see the results of the next run. Also I noticed that this test seems
to have failed many times in the buildfarm too recently:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=60&branch=HEAD&member=&stage=recoveryCheck&filter=Submit

Regards,
Vignesh
From 3599e6940e16eb164737a8428af4b8ba5ef6bf59 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Tue, 25 Jan 2022 08:21:22 +0530
Subject: [PATCH v18] Add function to log the backtrace of the specified
 postgres process.

This commit adds a pg_log_backtrace() function that requests logging of
the backtrace of the specified backend or auxiliary process, except the
logger and statistics collector.

Only superusers are allowed to request logging of the backtrace,
because allowing any user to issue this request at an unbounded rate
would produce lots of log messages, which could lead to denial of service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its backtrace at LOG_SERVER_ONLY level,
so that the backtrace will appear in the server log but not
be sent to the client.

Bump catalog version.

Authors: Vignesh C, Bharath Rupireddy
---
 doc/src/sgml/func.sgml| 62 +++
 src/backend/catalog/system_functions.sql  |  2 +
 src/backend/postmaster/autovacuum.c   |  4 ++
 src/backend/postmaster/checkpointer.c |  4 ++
 src/backend/postmaster/interrupt.c|  4 ++
 src/backend/postmaster/pgarch.c   | 10 +++-
 src/backend/postmaster/startup.c  |  4 ++
 src/backend/postmaster/walwriter.c|  4 ++
 src/backend/storage/ipc/procarray.c   | 56 +
 src/backend/storage/ipc/procsignal.c  | 42 +
 src/backend/storage/ipc/signalfuncs.c | 73 ---
 src/backend/tcop/postgres.c   |  4 ++
 src/backend/utils/adt/mcxtfuncs.c | 40 +++--
 src/backend/utils/error/elog.c| 19 --
 src/backend/utils/init/globals.c  |  1 +
 src/include/catalog/pg_proc.dat   |  5 ++
 src/include/miscadmin.h   |  1 +
 src/include/storage/procarray.h   |  2 +
 src/include/storage/procsignal.h  |  3 +-
 src/include/utils/elog.h  |  2 +
 src/test/regress/expected/backtrace.out   | 49 +++
 src/test/regress/expected/backtrace_1.out | 55 +
 src/test/regress/parallel_schedule|  2 +-
 src/test/regress/sql/backtrace.sql| 33 ++
 24 files changed, 416 insertions(+), 65 deletions(-)
 create mode 100644 src/test/regress/expected/backtrace.out
 create mode 100644 src/test/regress/expected/backtrace_1.out
 create mode 100644 src/test/regress/sql/backtrace.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0ee6974f1c..855ccc8902 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25318,6 +25318,30 @@ SELECT collation for ('foo' COLLATE "de_DE");

   
 
+  
+   
+
+ pg_log_backtrace
+
+pg_log_backtrace ( pid integer )
+boolean
+   
+   
+Requests to log the backtrace of the backend with the
+specified process ID.  This function can send the request to
+backends and auxiliary processes except logger and statistics
+collector.  These backtraces will be logged at LOG
+message level. They will appear in the server log based on the log
+configuration set (See  for
+more information), but will not be sent to the client regardless of
+. A backtrace identifies
+which function a process is currently executing and it may be useful
+for developers to diagnose stuck processes and other problems. This
+function is supported only if PostgreSQL was built with the ability to
+capture backtraces, otherwise it will emit a warning.
+   
+  
+
  

Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2022-01-28 Thread Julien Rouhaud
Hi,

On Fri, Jan 28, 2022 at 10:20:07AM -0800, Andres Freund wrote:
> 
> On 2022-01-28 21:56:36 +0800, Julien Rouhaud wrote:
> > I think having a new option for vacuumdb is the right move.
> 
> Can't we pass the option via the connection string, e.g. something
> PGOPTIONS='-c binary_upgrade_mode=true'? That seems to scale better than to
> add it gradually to multiple tools.

Ah right that's a better idea.




Re: Bugs in pgoutput.c

2022-01-28 Thread Amit Kapila
On Thu, Jan 6, 2022 at 3:42 AM Tom Lane  wrote:
>
> Commit 6ce16088b caused me to look at pgoutput.c's handling of
> cache invalidations, and I was pretty appalled by what I found.
>
> * rel_sync_cache_relation_cb does the wrong thing when called for
> a cache flush (i.e., relid == 0).  Instead of invalidating all
> RelationSyncCache entries as it should, it will do nothing.
>
> * When rel_sync_cache_relation_cb does invalidate an entry,
> it immediately zaps the entry->map structure, even though that
> might still be in use (as per the adjacent comment that carefully
> explains why this isn't safe).  I'm not sure if this could lead
> to a dangling-pointer core dump, but it sure seems like it could
> lead to failing to translate tuples that are about to be sent.
>
> * Similarly, rel_sync_cache_publication_cb is way too eager to
> reset the pubactions flags, which would likely lead to failing
> to transmit changes that we should transmit.
>

Are you planning to proceed with this patch? AFAICS, this is good to
go. Yesterday, while debugging/analyzing a cfbot failure [1] with one
of my colleagues for the row filter patch [2], we saw a problem caused
by exactly the second reason you outlined here. After applying your
patch and adapting the row filter patch atop it, the problem is fixed.

[1] - https://cirrus-ci.com/task/5450648090050560?logs=test_world#L3975
[2] - https://commitfest.postgresql.org/36/2906/

-- 
With Regards,
Amit Kapila.




Re: Why is INSERT-driven autovacuuming based on pg_class.reltuples?

2022-01-28 Thread Peter Geoghegan
On Thu, Jan 27, 2022 at 11:22 PM Laurenz Albe  wrote:
> What would you suggest instead?  pg_stat_all_tables.n_live_tup?

I'm not sure, except that I assume that it'll have to come from the
statistics collector, not from pg_class. I think that this bug stemmed
from the fact that vac_update_relstats() is used by both VACUUM and
ANALYZE to store information in pg_class, while both *also* use
pgstat_report_vacuum()/pgstat_report_analyze() to store closely
related information at the same point.  There is rather a lot of
implicit information here. I recall a few other bugs that also seemed
related to this messiness with statistics and pg_class.


--
Peter Geoghegan




Re: refactoring basebackup.c

2022-01-28 Thread Jeevan Ladhe
On Sat, Jan 29, 2022 at 1:20 AM Robert Haas  wrote:

> On Fri, Jan 28, 2022 at 12:48 PM Jeevan Ladhe
>  wrote:
> > I have attached the latest rebased version of the LZ4 server-side
> compression
> > patch on the recent commits. This patch also introduces the compression
> level
> > and adds a tap test.
>
> In view of this morning's commit of
> d45099425eb19e420433c9d81d354fe585f4dbd6 I think the threshold for
> committing this patch has gone up. We need to make it support
> decompression with LZ4 on the client side, as we now have for gzip.
>

Fair enough. Makes sense.


> - In the new test case you set decompress_flags but according to the
> documentation I have here, -m is for multiple files (and so should not
> be needed here) and -d is for decompression (which is what we want
> here). So I'm confused why this is like this.
>
>
'-d' is the default when the file has a .lz4 extension, which is true in
our case, hence I eliminated it. As for introducing '-m': without any
option, or even with an explicit '-d', the lz4 command was oddly writing
the decompressed tar to the console. That's when I saw these two passages
in my lz4 man page and tried adding '-m', which worked:

"It is considered bad practice to rely on implicit output in scripts
 because the script's environment may change. Always use explicit
 output in scripts. -c ensures that output will be stdout. Conversely,
 providing a destination name, or using -m ensures that the output will
 be either the specified name, or filename.lz4 respectively."

and

"Similarly, lz4 -m -d can decompress multiple *.lz4 files."


> This part seems clearly correct, so I have committed it.


Thanks for pushing it.

Regards,
Jeevan Ladhe


Re: make MaxBackends available in _PG_init

2022-01-28 Thread Nathan Bossart
On Sat, Jan 29, 2022 at 11:19:12AM +0900, Michael Paquier wrote:
> On Thu, Jan 27, 2022 at 10:18:15AM -0800, Nathan Bossart wrote:
>> Alright.  I think the comment adjustments still apply, so I split those out
>> to a new patch.
> 
> Looks fine after a second look, so applied.

Thanks!

> As of the issues of this thread, we really have two things to think
> about:
> 1) How do we want to control the access of some parameters in a
> context or another?  One idea would be more control through GUCs, say
> with a set of context-related flags that prevent the read of some
> variables until they are set.  We could encourage the use of
> GetConfigOption() for that.  For MaxBackends, we could add a read-only
> GUC for this purpose.  That's what Andres hinted at upthread, I
> guess.
> 2) How do we deal with unwanted access of shared parameters?  This one
> is not really controllable, is it?  And we are talking about much more
> than MaxBackends.  This could perhaps be addressed with more
> documentation in the headers for the concerned variables, as a first
> step.

Hm.  Perhaps we should understand the full scope of the problem first.
What else besides MaxBackends and the shared memory GUCs won't be properly
initialized when the shared_preload_libraries' _PG_init() functions are
called?  MaxBackends seems to be the only one that folks have experienced
problems with, which is why I initially zeroed in on it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Schema variables - new implementation for Postgres 15

2022-01-28 Thread Julien Rouhaud
Hi,

On Fri, Jan 28, 2022 at 07:51:08AM +0100, Pavel Stehule wrote:
> st 26. 1. 2022 v 8:23 odesílatel Julien Rouhaud  napsal:
> 
> > +  The ON TRANSACTION END RESET
> > +  clause causes the session variable to be reset to its default value
> > when
> > +  the transaction is committed or rolled back.
> >
> > As far as I can see this clauses doesn't play well with IMMUTABLE
> > VARIABLE, as
> > you can reassign a value once the transaction ends.  Same for DISCARD [
> > ALL |
> > VARIABLES ], or LET var = NULL (or DEFAULT if no default value).  Is that
> > intended?
> >
> 
> I think it is expected. The lifetime of the assigned (immutable) value is
> limited to the transaction (when ON TRANSACTION END RESET is used).
> DISCARD is used to reset the session, and after it you can write the value
> for the first time.
> 
> I enhanced the documentation of the IMMUTABLE clause

I think it's still somewhat unclear:

-  done, no other change will be allowed in the session lifetime.
+  done, no other change will be allowed in the session variable content's
+  lifetime. The lifetime of content of session variable can be
+  controlled by ON TRANSACTION END RESET clause.
+ 

The "session variable content lifetime" is quite peculiar, as the ON
TRANSACTION END RESET is adding transactional behavior to something that's not
supposed to be transactional, so more documentation about it seems appropriate.

Also DISCARD can be used any time so that's a totally different aspect of the
immutable variable content lifetime that's not described here.

NULL handling also seems inconsistent.  An explicit default NULL value makes it
truly immutable, but manually assigning NULL is a different codepath that has a
different user behavior:

# create immutable variable var_immu int default null;
CREATE VARIABLE

# let var_immu = 1;
ERROR:  22005: session variable "var_immu" is declared IMMUTABLE

# create immutable variable var_immu2 int ;
CREATE VARIABLE

# let var_immu2 = null;
LET

# let var_immu2 = null;
LET

# let var_immu2 = 1;
LET

For var_immu2 I think that the last 2 queries should have errored out.

> > In revoke.sgml:
> > + REVOKE [ GRANT OPTION FOR ]
> > + { { READ | WRITE } [, ...] | ALL [ PRIVILEGES ] }
> > + ON VARIABLE variable_name [, ...]
> > + FROM { [ GROUP ]  > class="parameter">role_name | PUBLIC } [, ...]
> > + [ CASCADE | RESTRICT ]
> >
> > there's no extra documentation for that, and therefore no clarification on
> > variable_name.
> >
> 
> This is the same as function_name, domain_name, ...

Ah right.

> > pg_variable.c:
> > Do we really need both session_variable_get_name() and
> > get_session_variable_name()?
> >
> 
> They are different: the first returns a possibly qualified name, the second
> returns only the name. Currently it is used just for error messages in
> transformAssignmentIndirection, and I think it is good for consistency
> with other usage of this routine (transformAssignmentIndirection).

I agree that consistency with other usage is a good thing, but both functions
have very similar and confusing names.  Usually when you need the qualified
name the calling code just takes care of doing so.  Wouldn't it be better to
add say get_session_variable_namespace() and construct the target string in the
calling code?

Also, I didn't dig a lot but I didn't see other usage with optionally qualified
name there?  I'm not sure how it would make sense anyway since LET semantics
are different and the current call for session variables emits incorrect
messages:

# create table tt(id integer);
CREATE TABLE

# create variable vv tt;
CREATE VARIABLE

# let vv.meh = 1;
ERROR:  42703: cannot assign to field "meh" of column "meh" because there is no 
such column in data type tt
LINE 1: let vv.meh = 1;




Re: make MaxBackends available in _PG_init

2022-01-28 Thread Michael Paquier
On Fri, Jan 28, 2022 at 08:33:42PM -0800, Nathan Bossart wrote:
> Hm.  Perhaps we should understand the full scope of the problem first.
> What else besides MaxBackends and the shared memory GUCs won't be properly
> initialized when the shared_preload_libraries' _PG_init() functions are
> called?  MaxBackends seems to be the only one that folks have experienced
> problems with, which is why I initially zeroed in on it.

Yeah, it would be good to know the scope before defining the limits
of what could be done.  Another thing may be the interactions with
session_preload_libraries and local_preload_libraries.
--
Michael




Re: GUC flags

2022-01-28 Thread Michael Paquier
On Thu, Jan 27, 2022 at 10:36:21PM -0600, Justin Pryzby wrote:
> Maybe you misunderstood - I'm not reading the file specified by
> current_setting('config_file').  Rather, I'm reading
> tmp_check/data/postgresql.conf, which is copied from the sample conf.
> Do you see an issue with that ?

Yes, as of:
mv $PGDATA/postgresql.conf $PGDATA/popo.conf
pg_ctl start -D $PGDATA -o '-c config_file=popo.conf'
make installcheck

I have not checked, but I am pretty sure that a couple of
distributions out there pass down a custom path for the
configuration file while removing postgresql.conf from the data
directory, to avoid the confusion that could arise if one finds
parameters that are defined but not loaded.  Your patch fails on that.

> The regression tests are only intended to be run from a postgres source dir,
> and if someone runs them from somewhere else and they "fail", I think that's
> because they violated that assumption, not because of a problem with the test.

The tests work on HEAD, and I'd rather not break something that has
worked this way for years.  Two other aspects that we may want to
worry about are include_dir and include, if we were to add tests for
those.  This last part is not really a strong requirement IMO, though.

> I wondered if it should chomp off anything added by pg_regress --temp-regress.
> However that's either going to be a valid guc (or else it would fail some
> other test).  Or an extension's guc (which this isn't testing), which has a
> dot, and which this regex doesn't match, so doesn't cause false positives.

I am not sure about those; I am reserved about anything that involves
the format of postgresql.conf or other configuration details, but we
could tackle that afterwards, if necessary.

For now, I have done a review of the patch, tweaking the docs, the
code and the test to take care of all the inconsistencies I could
find.  This looks like a good first cut to be able to remove check_guc
(the attached removes it, but I think that we'd better treat that
independently of the actual feature proposed, for clarity).
--
Michael
From a87afcfb5757367aeef795505e9361d5a03facc7 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Sat, 29 Jan 2022 15:32:10 +0900
Subject: [PATCH v2] Expose GUC flags in SQL function, replacing check_guc

Note-to-self: CATVERSION bump!
---
 src/include/catalog/pg_proc.dat   |  6 +++
 src/backend/utils/misc/check_guc  | 78 ---
 src/backend/utils/misc/guc.c  | 39 
 src/test/regress/expected/guc.out | 71 
 src/test/regress/sql/guc.sql  | 41 
 doc/src/sgml/func.sgml| 39 
 6 files changed, 196 insertions(+), 78 deletions(-)
 delete mode 100755 src/backend/utils/misc/check_guc

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0859dc81ca..3f927acb01 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6096,6 +6096,12 @@
   proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
   proargnames => '{name,setting,unit,category,short_desc,extra_desc,context,vartype,source,min_val,max_val,enumvals,boot_val,reset_val,sourcefile,sourceline,pending_restart}',
   prosrc => 'show_all_settings' },
+
+{ oid => '8921', descr => 'return flags for specified GUC',
+  proname => 'pg_settings_get_flags', provolatile => 's',
+  prorettype => '_text', proargtypes => 'text',
+  prosrc => 'pg_settings_get_flags' },
+
 { oid => '3329', descr => 'show config file settings',
   proname => 'pg_show_all_file_settings', prorows => '1000', proretset => 't',
   provolatile => 'v', prorettype => 'record', proargtypes => '',
diff --git a/src/backend/utils/misc/check_guc b/src/backend/utils/misc/check_guc
deleted file mode 100755
index b171ef0e4f..00
--- a/src/backend/utils/misc/check_guc
+++ /dev/null
@@ -1,78 +0,0 @@
-#!/bin/sh
-
-## currently, this script makes a lot of assumptions:
-## in postgresql.conf.sample:
-##   1) the valid config settings may be preceded by a '#', but NOT '# '
-##  (we use this to skip comments)
-##   2) the valid config settings will be followed immediately by  ' ='
-##  (at least one space preceding the '=')
-## in guc.c:
-##   3) the options have PGC_ on the same line as the option
-##   4) the options have '{' on the same line as the option
-
-##  Problems
-## 1) Don't know what to do with TRANSACTION ISOLATION LEVEL
-
-## if an option is valid but shows up in only one file (guc.c but not
-## postgresql.conf.sample), it should be listed here so that it
-## can be ignored
-INTENTIONALLY_NOT_INCLUDED="debug_deadlocks in_hot_standby \
-is_superuser lc_collate lc_ctype lc_messages lc_monetary lc_numeric lc_time \
-pre_auth_delay role seed server_encoding server_version server_version_num \
-session_authorization trace_lock_oidmin trace_lock_table trace_locks trace_lwlocks \
-trace_notify trace_userlocks transaction_isolation transaction

Re: Multiple Query IDs for a rewritten parse tree

2022-01-28 Thread Julien Rouhaud
Hi,

On Fri, Jan 28, 2022 at 05:51:56PM +0100, Dmitry Dolgov wrote:
> > On Fri, Jan 21, 2022 at 11:33:22AM +0500, Andrey V. Lepikhov wrote:
> > On 1/9/22 5:49 AM, Tom Lane wrote:
> > > The idea I'd been vaguely thinking about is to allow attaching a list
> > > of query-hash nodes to a Query, where each node would contain a "tag"
> > > identifying the specific hash calculation method, and also the value
> > > of the query's hash calculated according to that method.  We could
> > > probably get away with saying that all such hash values must be uint64.
> > > The main difference from your function-OID idea, I think, is that
> > > I'm envisioning the tags as being small integers with well-known
> > > values, similarly to the way we manage stakind values in pg_statistic.
> > > In this way, an extension that wants a hash that the core knows how
> > > to calculate doesn't need its own copy of the code, and similarly
> > > one extension could publish a calculation method for use by other
> > > extensions.
> >
> > To move forward, I have made a patch that implements this idea (see
> > attachment). It is a POC, but passes all regression tests.
> [...]
> > 4. We should reserve position of default in-core generator
> 
> From the discussion above I was under the impression that the core
> generator should be distinguished by a predefined kind.

I don't really like this approach.  IIUC, this patch as-is would break
pg_stat_statements extensibility.  If kind == 0 is reserved for the in-core
queryid, then you can't use pg_stat_statements with a different queryid
generator anymore.  Unless you meant that kind == 0 is reserved for the
monitoring / pg_stat_statements purpose, and a custom extension should
register that specific kind too if that's what it is supposed to implement?

The patch also reserves kind == -1 for pg_stat_statements' internal
purposes, and I'm not really sure why that's needed.  Are additional
extensions that want to agree with pg_stat_statements on what the
monitoring queryid is supposed to do that too?

I'm also unsure of how extensions are supposed to cooperate in general, as
I feel that queryids should be implemented for some "intent" (like monitoring,
planning optimization...).  That being said, I'm not sure if e.g. AQO's
heuristics are too specific to its needs or if they could be shared with
other extensions that might be dealing with similar concerns.