AW: Assorted improvements in pg_dump

2021-12-07 Thread Hans Buschmann
Hello Tom,

from your mail from 25.10.2021:

>0005 implements your suggestion of accounting for TOAST data while
>scheduling parallel dumps.  I realized while looking at that that
>there's a pre-existing bug, which this'd exacerbate: on machines
>with 32-bit off_t, dataLength can overflow.  Admittedly such machines
>are just about extinct in the wild, but we do still trouble to support
>the case.  So 0005 also includes code to check for overflow and clamp
>the result to INT_MAX blocks.

>Maybe we should back-patch 0005.  OTOH, how likely is it that anyone
>is wrangling tables exceeding 16TB on a machine with 32-bit off_t?
>Or that poor parallel dump scheduling would be a real problem in
>such a case?

I noticed that you patched master with all the improvements in pg_dump.

Did you change your mind about backpatching patch 0005 to fix the toast size 
matter?

It would be rather helpful for handling existing user data in active branches.


On the matter of 32-bit versions, I think they are only used on much smaller
instances.

BTW, the 32-bit build of Postgres on Windows no longer works with more modern
tool sets (tested with VS2019 and VS2022), although it is not explicitly excluded
in the docs. But no one has complained yet (for a long time now...).

Thanks

Hans Buschmann



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Tue, Dec 7, 2021 at 12:58 AM Bossart, Nathan  wrote:
>
> On 12/6/21, 4:34 AM, "Bharath Rupireddy" 
>  wrote:
> > While the database is performing end-of-recovery checkpoint, the
> > control file gets updated with db state as "shutting down" in
> > CreateCheckPoint (see the code snippet at [1]) and at the end it sets
> > it back to "shut down" for a brief moment and then finally to "in
> > production". If the end-of-recovery checkpoint takes a lot of time or
> > the db goes down during the end-of-recovery checkpoint for whatever
> > reasons, the control file ends up having the wrong db state.
> >
> > Should we add a new db state something like
> > DB_IN_END_OF_RECOVERY_CHECKPOINT/"in end-of-recovery checkpoint" or
> > something else to represent the correct state?
>
> This seems like a reasonable change to me.  From a quick glance, it
> looks like it should be a simple fix that wouldn't add too much
> divergence between the shutdown and end-of-recovery checkpoint code
> paths.

Here's a patch that I've come up with. Please see if this looks okay
and let me know if we want to take it forward so that I can add a CF
entry.

The new status one would see is as follows:
bharath@bharathubuntu3:~/postgres/inst/bin$ ./pg_controldata -D data
pg_control version number:1500
Catalog version number:   202111301
Database system identifier:   7038867865889221935
Database cluster state:   in end-of-recovery checkpoint
pg_control last modified: Tue Dec  7 08:06:18 2021
Latest checkpoint location:   0/14D24A0
Latest checkpoint's REDO location:0/14D24A0
Latest checkpoint's REDO WAL file:00010001

Regards,
Bharath Rupireddy.
From 8cc1dfec87a53d3b423712380c934243e93009ac Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 7 Dec 2021 07:53:53 +
Subject: [PATCH v1] Add DB_IN_END_OF_RECOVERY_CHECKPOINT state for control
 file

While the database is performing end-of-recovery checkpoint, the
control file gets updated with db state as "shutting down" in
CreateCheckPoint and at the end it sets it back to "shut down" for
a brief moment and then finally to "in production". If the
end-of-recovery checkpoint takes a lot of time or the db goes down
during the end-of-recovery checkpoint for whatever reasons, the
control file ends up having the wrong db state.

This patch adds a new db state to control file, called
DB_IN_END_OF_RECOVERY_CHECKPOINT or "in end-of-recovery checkpoint"
to represent the correct state of the database server.
---
 src/backend/access/transam/xlog.c   | 13 +++--
 src/bin/pg_controldata/pg_controldata.c |  2 ++
 src/include/catalog/pg_control.h|  3 ++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..a31a8dda35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6747,6 +6747,12 @@ StartupXLOG(void)
 			 " and you might need to choose an earlier recovery target.")));
 			break;
 
+		case DB_IN_END_OF_RECOVERY_CHECKPOINT:
+			ereport(LOG,
+	(errmsg("database system was interrupted while in end-of-recovery checkpoint at %s",
+			str_time(ControlFile->time))));
+			break;
+
 		case DB_IN_PRODUCTION:
 			ereport(LOG,
 	(errmsg("database system was interrupted; last known up at %s",
@@ -9140,7 +9146,10 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		if (flags & CHECKPOINT_END_OF_RECOVERY)
+			ControlFile->state = DB_IN_END_OF_RECOVERY_CHECKPOINT;
+		else
+			ControlFile->state = DB_SHUTDOWNING;
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
 	}
@@ -9406,7 +9415,7 @@ CreateCheckPoint(int flags)
 	 * Update the control file.
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-	if (shutdown)
+	if (flags & CHECKPOINT_IS_SHUTDOWN)
 		ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f911f98d94..e47b90143e 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -63,6 +63,8 @@ dbState(DBState state)
 			return _("in crash recovery");
 		case DB_IN_ARCHIVE_RECOVERY:
 			return _("in archive recovery");
+		case DB_IN_END_OF_RECOVERY_CHECKPOINT:
+			return _("in end-of-recovery checkpoint");
 		case DB_IN_PRODUCTION:
 			return _("in production");
 	}
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 749bce0cc6..7ea0fdb273 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1500
 
 /* Nonce key length, see below */
 #define MOCK_

Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Tue, Dec 7, 2021 at 1:02 AM Bossart, Nathan  wrote:
>
> On 12/6/21, 4:54 AM, "Bharath Rupireddy" 
>  wrote:
> > The function PreallocXlogFiles doesn't get called during
> > end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
> > becomes operational after the end-of-recovery checkpoint and may need
> > WAL files. However, I'm not sure how beneficial it is going to be if
> > the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
> > extra WAL file).
>
> There is another thread for adding more effective WAL pre-allocation
> [0] that you might be interested in.
> [0] 
> https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbndh5%40alap3.anarazel.de

I haven't had a chance to go through the entire thread, but I have a
quick question: why can't the walwriter pre-allocate some of the WAL
segments instead of a new background process? Of course, it might
delay the main job of the walwriter, i.e. flushing and syncing the
WAL files, but having the checkpointer do the pre-allocation gives it
yet another extra task. As for the amount of walwriter work vs.
checkpointer work, I'm not sure which one does more than the other.

Another idea could be to let either the walwriter or the checkpointer
pre-allocate the WAL files, whichever is free at the moment a WAL
segment pre-allocation request comes in. We could go further and let
the user choose which process, i.e. checkpointer or walwriter, does
the pre-allocation, with a GUC maybe?

I will also put the same thoughts in the "Pre-allocating WAL files"
thread so that we can continue the discussion there.

[1] 
https://www.postgresql.org/message-id/20201225200953.jjkrytlrzojbndh5%40alap3.anarazel.de

Regards,
Bharath Rupireddy.




Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Tue, Dec 7, 2021 at 12:09 PM Noah Misch  wrote:
>
> On Mon, Dec 06, 2021 at 06:21:40PM +0530, Bharath Rupireddy wrote:
> > The function PreallocXlogFiles doesn't get called during
> > end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server
> > becomes operational after the end-of-recovery checkpoint and may need
> > WAL files.
>
> PreallocXlogFiles() is never a necessity; it's just an attempted optimization.
> I expect preallocation at end-of-recovery would do more harm than good,
> because the system would accept no writes at all while waiting for it.

Yeah. At times, the end-of-recovery checkpoint itself will take a good
amount of time, and adding WAL pre-allocation time on top of that
doesn't make sense.

> > However, I'm not sure how beneficial it is going to be if
> > the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1
> > extra WAL file).
>
> Yeah, PreallocXlogFiles() feels like a relict from the same era that gave us
> checkpoint_segments=3.  It was more useful before commit 63653f7 (2002).

I see.

Regards,
Bharath Rupireddy.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-12-07 Thread Dilip Kumar
On Mon, Dec 6, 2021 at 7:53 PM Ashutosh Sharma  wrote:
>
> Thank you, Dilip for the quick response. I am okay with the changes done in 
> the v7 patch.
>
> One last point - If we try to clone a huge database, as expected CREATE 
> DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which 
> seems to be affecting the performance slightly.

Yeah, that is a valid point, because instead of just one WAL record for
createdb we will generate WAL for each page in the database, so I
agree that if max_wal_size is not enough for that WAL then we might
have to pay the cost of multiple checkpoints.  However, compare it with
the current mechanism: today it is a forced checkpoint and there is no
way to avoid it, whereas with the new approach the user can set a large
enough max_wal_size and avoid it.  So in other words, the checkpoint is
now driven by the amount of resources, which is true for any other
operation, e.g. ALTER TABLE SET TABLESPACE, so it is more in sync with
the rest of the system; without the patch it was a special-purpose
forced checkpoint just for createdb.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Tue, Dec 7, 2021 at 12:35 AM SATYANARAYANA NARLAPURAM
 wrote:
>
> If the segment size is 16MB it shouldn't take much time but higher segment 
> values this can be a problem. But again, the current segment has to be filled 
> 75% to precreate new one. I am not sure how much we gain. Do you have some 
> numbers with different segment sizes?

I don't have the numbers. However, I agree that the pre-allocation can
make end-of-recovery checkpoint times longer. At times, the
end-of-recovery checkpoint itself will take a good amount of time, and
adding WAL pre-allocation time on top of that doesn't make sense.

Regards,
Bharath Rupireddy.




Re: Pre-allocating WAL files

2021-12-07 Thread Bharath Rupireddy
On Thu, Nov 11, 2021 at 12:29 AM Bossart, Nathan  wrote:
>
> On 10/8/21, 1:55 PM, "Bossart, Nathan"  wrote:
> > Here is a first attempt at adding the pre-allocation logic to the
> > checkpointer.  I went ahead and just used CheckpointWriteDelay() for
> > pre-allocating during checkpoints.  I've done a few pgbench runs, and
> > it seems to work pretty well.  Initialization is around 15% faster,
> > and I'm seeing about a 5% increase in TPS with a simple-update
> > workload with wal_recycle turned off.  Of course, these improvements
> > go away once segments can be recycled.
>
> Here is a rebased version of this patch set.  I'm getting the sense
> that there isn't a whole lot of interest in this feature, so I'll
> likely withdraw it if it goes too much longer without traction.

As I mentioned in the other thread at [1], let's continue the discussion here.

Why can't the walwriter pre-allocate some of the WAL segments instead
of a new background process? Of course, it might delay the main job of
the walwriter, i.e. flushing and syncing the WAL files, but having the
checkpointer do the pre-allocation gives it yet another extra task. As
for the amount of walwriter work vs. checkpointer work, I'm not sure
which one does more than the other.

Another idea could be to let either the walwriter or the checkpointer
pre-allocate the WAL files, whichever is free at the moment a WAL
segment pre-allocation request comes in. We could go further and let
the user choose which process, i.e. checkpointer or walwriter, does
the pre-allocation, with a GUC maybe?

[1] - 
https://www.postgresql.org/message-id/CALj2ACVqYJX9JugooRC1chb2sHqv-C9mYEBE1kxwn%2BTn9vY42A%40mail.gmail.com

Regards,
Bharath Rupireddy.




RE: Allow escape in application_name

2021-12-07 Thread kuroda.hay...@fujitsu.com
Dear Fujii-san,

Thank you for reviewing! I'll post the latest version.

> Currently StringInfo variable is defined in connect_pg_server() and passed to
> parse_pgfdw_appname(). But there is no strong reason why the variable needs to
> be defined connect_pg_server(). Right? If so how about refactoring
> parse_pgfdw_appname() so that it defines StringInfoData variable and returns 
> the
> processed character string? You can see such coding at, for example,
> escape_param_str() in dblink.c.
> If this refactoring is done, probably we can get rid of "#include 
> "lib/stringinfo.h""
> line from connection.c because connect_pg_server() no longer needs to use
> StringInfo.

I refactored that. Thanks!

> + int i;
> 
> + for (i = n - 1; i >= 0; i--)
> 
> I'm tempted to simplify these into "for (int i = n - 1; i >= 0; i--)".
> 
> + /*
> +  * Search application_name and replace it if found.
> +  *
> +  * We search paramters from the end because the later
> +  * one have higher priority.  See also the above comment.
> +  */
> 
> How about updating these comments into the following?
> 
> ---
> Search the parameter arrays to find application_name setting,
> and replace escape sequences in it with status information if found.
> The arrays are searched backwards because the last value
> is used if application_name is repeatedly set.

Prefer yours. Fixed.

> + if (strcmp(buf->data, "") != 0)
> 
> Is this condition the same as "data[0] == '\0'"?

Maybe "data[0] != '\0'", but fixed.

> + * parse postgres_fdw.application_name and set escaped string.
> + * This function is almost same as log_line_prefix(), but
> + * accepted escape sequence is different and this cannot understand
> + * padding integer.
> + *
> + * Note that argument buf must be initialized.
> 
> How about updating these comments into the following?
> I thought that using the term "parse" was a bit confusing because what the
> function actually does is to replace escape seequences in application_name.
> Probably it's also better to rename the function, e.g., to 
> process_pgfdw_appname .
> 
> -
> Replace escape sequences beginning with % character in the given
> application_name with status information, and return it.
> -

Thank you for the suggestions. I was confused about the word "parse."
Fixed and renamed.

> +  Same as , this is a
> +  printf-style string. Accepted escapes are
> +  bit different from ,
> +  but padding can be used like as it.
> 
> This description needs to be updated.

I missed that. Fixed.
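
For reference, with the patch the setting is used along these lines (the
escape letters here are purely illustrative; the accepted set is whatever
the patch finally defines and was still under discussion at this point):

    SET postgres_fdw.application_name TO 'myapp_%u_%d';

The idea is that the escapes are expanded to status information (for
example, the local user and database names) before the value is sent to
the remote server.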

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Fix a bug in DecodeAbort() and improve input data check on subscriber.

2021-12-07 Thread vignesh C
On Tue, Dec 7, 2021 at 6:07 AM Masahiko Sawada  wrote:
>
> Hi all,
>
> While updating the patch I recently posted[1] to make pg_waldump
> report replication origin ID, LSN, and timestamp, I found a bug that
> replication origin timestamp is not set in ROLLBACK PREPARED case.
> Commit 8bdb1332eb5 (CC'ed Amit) added an argument to
> ReorderBufferFinishPrepared() but in ROLLBACK PREPARED case, the
> caller specified it at the wrong position:
>
> @@ -730,6 +730,7 @@ DecodeCommit(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
> if (two_phase)
> {
> ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
> buf->endptr,
> +
> SnapBuildInitialConsistentPoint(ctx->snapshot_builder),
> commit_time, origin_id, origin_lsn,
> parsed->twophase_gid, true);
> }
> @@ -868,6 +869,7 @@ DecodeAbort(LogicalDecodingContext *ctx,
> XLogRecordBuffer *buf,
> {
> ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
> buf->endptr,
> abort_time, origin_id, origin_lsn,
> +   InvalidXLogRecPtr,
> parsed->twophase_gid, false);
> }
>
> This affects the value of rollback_data.rollback_time on the
> subscriber, resulting in setting the wrong timestamp to both the
> replication origin timestamp and the error callback argument on the
> subscriber. I've attached the patch to fix it.
>
> Besides, I think we can improve checking input data on subscribers.
> This bug was not detected by compilers but it could have been detected
> if we checked the input data. Looking at logicalrep_read_xxx functions
> in proto.c, there are some inconsistencies; we check the value of
> prepare_data->xid in logicalrep_read_prepare_common() but we don't in
> both logicalrep_read_commit_prepared() and
> logicalrep_read_rollback_prepared(), and we don't check anything in
> stream_start/stream_stop. Also, IIUC that since timestamps are always
> set in prepare/commit prepared/rollback prepared cases we can check
> them too. I've attached a PoC patch that introduces macros for
> checking input data and adds some new checks. Since it could be
> frequently called, I used unlikely() but probably we can also consider
> replacing elog(ERROR) with assertions.

The first patch is required as suggested and fixes the problem.
A few comments on the second patch:
1) Should we validate prepare_time and xid in
logicalrep_read_begin_prepare? Is this intentionally not checked?
2) Similarly, should we validate committime in logicalrep_read_stream_commit?

Regards,
Vignesh




Re: Non-superuser subscription owners

2021-12-07 Thread Ronan Dunklau
Le lundi 6 décembre 2021, 16:56:56 CET Mark Dilger a écrit :
> > On Dec 6, 2021, at 2:19 AM, Amit Kapila  wrote:
> >>> If we want to maintain the property that subscriptions can only be
> >>> owned by superuser
> 
> We don't want to maintain such a property, or at least, that's not what I
> want.  I don't think that's what Jeff wants, either.

That's not what I want either: the ability to run and refresh subscriptions as
a non-superuser is a desirable feature.

The REFRESH part was possible before PG 14, when it was allowed to run REFRESH 
in a function, which could be made to run as security definer. 


> As I perceive the roadmap:
> 
> 1) Fix the current bug wherein subscription changes are applied with
> superuser force after the subscription owner has superuser privileges
> revoked. 2) Allow the transfer of subscriptions to non-superuser owners.
> 3) Allow the creation of subscriptions by non-superusers who are members of
> some as yet to be created predefined role, say "pg_create_subscriptions"

This roadmap seems sensible.

-- 
Ronan Dunklau






RE: Failed transaction statistics to measure the logical replication progress

2021-12-07 Thread osumi.takami...@fujitsu.com
On Monday, December 6, 2021 11:27 PM vignesh C  wrote:
> Thanks for the updated patch, few comments:
Thank you for your review!

> 1) We can keep the documentation similar to mention the count includes both
> table sync worker / main apply worker in case of commit_count/error_count
> and abort_count to keep it consistent.
> +   commit_count bigint
> +  
> +  
> +   Number of transactions successfully applied in this subscription.
> +   COMMIT and COMMIT PREPARED increments this counter.
> +  
> + 
> +
> + 
> +  
> +   error_count bigint
> +  
> +  
> +   Number of transactions that failed to be applied by the table
> +   sync worker or main apply worker in this subscription.
> +   
> + 
> +
> + 
> +  
> +   abort_count bigint
> +  
> +  
> +   Number of transactions aborted in this subscription.
> +   ROLLBACK PREPARED increments this counter.
> +  
> + 
Yeah, you are right. Fixed.
Note that abort_count is not used by the table sync worker.

 
> 2) Can this be changed:
> +   /*
> +* If this is a new error reported by table sync worker,
> consolidate this
> +* error count into the entry of apply worker.
> +*/
> +   if (OidIsValid(msg->m_subrelid))
> +   {
> +   /* Gain the apply worker stats */
> +   subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
> InvalidOid, true);
> +   subwentry->error_count++;
> +   }
> +   else
> +   subwentry->error_count++;   /* increment the apply
> worker's counter. */
> To:
> +   /*
> +* If this is a new error reported by table sync worker,
> consolidate this
> +* error count into the entry of apply worker.
> +*/
> +   if (OidIsValid(msg->m_subrelid))
> +   /* Gain the apply worker stats */
> +   subwentry = pgstat_get_subworker_entry(dbentry,
> + msg->m_subid,
> +
> InvalidOid, true);
> +
> +   subwentry->error_count++;   /* increment the apply
> worker's counter. */
Your suggestion looks better.
Also, I fixed some comments in this part so that we don't need a
separate comment at the bottom for the increment of the apply worker's
counter.

 
> 3) Since both 026_worker_stats and 027_worker_xact_stats.pl are testing
> pg_stat_subscription_workers, can we move the tests to 026_worker_stats.pl.
> If possible the error_count validation can be combined with the existing 
> tests.
> diff --git a/src/test/subscription/t/027_worker_xact_stats.pl
> b/src/test/subscription/t/027_worker_xact_stats.pl
> new file mode 100644
> index 000..31dbea1
> --- /dev/null
> +++ b/src/test/subscription/t/027_worker_xact_stats.pl
> @@ -0,0 +1,162 @@
> +
> +# Copyright (c) 2021, PostgreSQL Global Development Group
> +
> +# Tests for subscription worker statistics during apply.
> +use strict;
> +use warnings;
> +use PostgreSQL::Test::Cluster;
> +use PostgreSQL::Test::Utils;
> +use Test::More tests => 1;
> +
> +# Create publisher node
Right. I've integrated my tests into 026_worker_stats.pl.
I think the error_count validations are combined as you suggested.
Another change I made is to introduce one function
to improve the readability of the stats tests.

Also, 026_worker_stats.pl didn't look aligned according to
pgperltidy. This is not a serious issue at all, yet when I ran
pgperltidy, the existing code that required adjustments ended up
in my patch. Therefore, I split those changes out into a separate patch.

Best Regards,
Takamichi Osumi



v16-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch
Description:  v16-0001-Fix-alignments-of-TAP-tests-for-pg_stat_subscrip.patch


v16-0002-Extend-pg_stat_subscription_workers-to-include-g.patch
Description:  v16-0002-Extend-pg_stat_subscription_workers-to-include-g.patch


Reply: Re: Is it worth pushing conditions to sublink/subplan?

2021-12-07 Thread 曾文旌(义从)
Hi Hackers

For my previous proposal, I developed a prototype that passes regression testing.
It works similarly to qual pushdown into subqueries. We know that sublinks are
expanded at the beginning of planning each query level. At that stage, the
query's conditions and equivalence classes have not been processed yet, but
after generate_base_implied_equalities they have been, which is why quals can
be pushed down into subqueries but not into sublinks.

My POC implementation delays the sublink expansion in the SELECT clause
(targetList) and WHERE clause. Specifically, it is delayed until after
generate_base_implied_equalities. Thus, the equivalence conditions already
established in the upper-level query can easily be obtained during the
sublink expansion process (make_subplan).

For example, if the upper-level query has a.id = 10 and the sublink query has
a.id = b.id, then we derive b.id = 10 and push it down into the sublink query.
If b is a partitioned table and is partitioned by id, then a large number of
unrelated partitions are pruned away. This saves a significant amount of
planner and SQL execution time, especially if the partitioned table has a
large number of partitions, which is exactly the case I care about.
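
A minimal SQL sketch of the kind of query I mean (table and column names are
purely illustrative):

    -- up-level condition: a.id = 10; correlated sublink condition: b.id = a.id
    -- with the proposed change, the derived condition b.id = 10 can be pushed
    -- into the sublink, so partitions of b that cannot match are pruned
    SELECT a.id,
           (SELECT b.val FROM b WHERE b.id = a.id) AS val
    FROM   a
    WHERE  a.id = 10;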

Currently, there are two SQL failures in the regression tests because the
expansion order of sublinks has changed; this does not affect the execution
results of the SQL.

Thanks

Wenjing




 -- Original Message --
From: shawn wang 
Sent: Wed Sep 1 10:54:50 2021
To: 曾文旌(义从) 
Cc: PostgreSQL Hackers , wjzeng 

Subject: Re: Is it worth pushing conditions to sublink/subplan?

I tested it the way you said and increased the number of sub-tables.
I created a hash-partitioned table with 1000 sub-tables.
Testing with your first SQL statement, the optimizer prunes the unnecessary
sub-tables well.
You can see the plan:

postgres=# explain analyze
postgres-# select temp.p1 from
postgres-# (
postgres(# select p1,p2 from test1.test1hashtable x where x.p1 = '1'
postgres(# union all
postgres(# (values('1','1'))
postgres(# ) temp,
postgres-# test1.test1hashtable y
postgres-# where y.p2 = temp.p2 and y.p1 = '1' and y.p1='1';
 QUERY PLAN 
-
 Nested Loop (cost=0.00..25.55 rows=1 width=32) (actual time=0.004..0.004 
rows=0 loops=1)
 Join Filter: (x.p2 = y.p2)
 -> Seq Scan on test1hashtable826 y (cost=0.00..12.75 rows=1 width=32) (actual 
time=0.002..0.002 rows=0 loops=1)
 Filter: (p1 = '1'::text)
 -> Append (cost=0.00..12.78 rows=2 width=64) (never executed)
 -> Seq Scan on test1hashtable826 x (cost=0.00..12.75 rows=1 width=64) (never 
executed)
 Filter: (p1 = '1'::text)
 -> Result (cost=0.00..0.01 rows=1 width=64) (never executed)
 Planning Time: 0.158 ms
 Execution Time: 0.022 ms
(10 rows)

But when the second one runs, the planning time reaches 13.942ms.
The plan:

postgres=# explain analyze
postgres-# select
postgres-# y.p1,
postgres-# (Select x.p2 from test1.test1hashtable x where y.p1 =x.p1 and 
y.p2=x.p2) as b
postgres-# from test1.test1hashtable y where p1 = '1' and p2 = '1';
 QUERY PLAN 
--
 Seq Scan on test1hashtable826 y (cost=0.00..13318.30 rows=1 width=64) (actual 
time=0.004..0.047 rows=0 loops=1)
 Filter: ((p1 = '1'::text) AND (p2 = '1'::text))
 SubPlan 1
 -> Append (cost=0.00..13305.00 rows=1000 width=32) (never executed)
 -> Seq Scan on test1hashtable1 x_1 (cost=0.00..13.30 rows=1 width=32) (never 
executed)
 Filter: ((y.p1 = p1) AND (y.p2 = p2))
 -> Seq Scan on test1hashtable1000 x_1000 (cost=0.00..13.30 rows=1 width=32) 
(never executed)
 Filter: ((y.p1 = p1) AND (y.p2 = p2))
 Planning Time: 13.942 ms
 Execution Time: 4.899 ms
(2006 rows)

This is a very worthwhile thing to do. Relatively large business systems often
use a large number of partitioned tables with high concurrency. If the planning
time is too long, this will greatly affect the business.

regards,
Shawn.
Wenjing wrote on Tue, Aug 17, 2021 at 10:31 AM:




On Aug 16, 2021, at 17:15, Wenjing wrote:
Hi Hackers,

Recently, an issue has been bothering me. It is about conditional push-down in
SQL. I use cases from the regression tests as an example.
I found that a condition (B = 1) can be pushed down into a subquery;
however, it cannot be pushed down into a sublink/subplan.
If a sublink/subplan clause contains a partitioned table, it can be useful to get
the conditions for pruning.

So, is it worth pushing conditions down to sublinks/subplans?
Does anybody have any ideas?


regards,
Wenjing


example:
create table p (a int, b int, c int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table q (a int, b int, c int) partition by list (a);
create table q1 partition of q for values in (1) partition by list (b);
create table q11 partition of q1 for values in (

RE: Allow escape in application_name

2021-12-07 Thread kuroda.hay...@fujitsu.com
Dear Horiguchi-san

Thank you for giving comments! I attached new patches.

> > +   if (appname == NULL)
> > +   appname = "[unknown]";
> > +   if (username == NULL || *username
> == '\0')
> > +   username = "[unknown]";
> > +   if (dbname == NULL || *dbname ==
> '\0')
> > +   dbname =  "[unknown]";
> >
> > I'm still not sure why these are necessary. Could you clarify that?
> 
> It probably comes from my request, just for safety for uncertain
> future.  They are actually not needed under the current usage of the
> function, so *I* would not object to removing them if you are strongly
> feel them out of place.  In that case, since NULL's leads to SEGV
> outright, we would even not need assertions instead.

Yeah, I followed your suggestion. But we decided to keep the code clean,
hence I removed the if-statements.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v23_0001_update_descriptions.patch
Description: v23_0001_update_descriptions.patch


v23_0002_allow_escapes.patch
Description: v23_0002_allow_escapes.patch


Re: Alter all tables in schema owner fix

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 8:21 AM vignesh C  wrote:
>
> Thanks for your comments, I have made the changes. Additionally I have
> renamed IsSchemaPublication to is_schema_publication for keeping the
> naming similar around the code. The attached v3 patch has the changes
> for the same.
>

Thanks, the patch looks mostly good to me. I have slightly modified it
to incorporate one of Michael's suggestions, ran pgindent, and
modified the commit message.

I am planning to push the attached tomorrow unless there are further
comments. Michael, do let me know if you have any questions or
objections about this?

-- 
With Regards,
Amit Kapila.


v4-0001-Fix-changing-the-ownership-of-ALL-TABLES-IN-SCHEM.patch
Description: Binary data


Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Dilip Kumar
On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby  wrote:

Thanks for the review. I will work on these comments.

> > +  
> > +   subxact_count xid
> > +  
> > +  
> > +   The current backend's active subtransactions count.
>
> subtransaction (no s)
>
> > +   Set to true if current backend's subtransaction cache is overflowed.
>
> Say "has overflowed"
>
> > + if (local_beentry->subxact_count > 0)
> > + {
> > + values[30] = local_beentry->subxact_count;
> > + values[31] = local_beentry->subxact_overflowed;
> > + }
> > + else
> > + {
> > + nulls[30] = true;
> > + nulls[31] = true;
> > + }
>
> Why is the subxact count set to NULL instead of zero ?

> You added this to pg_stat_activity, which already has a lot of fields.
> We talked a few months ago about not adding more fields that weren't commonly
> used.
> https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
>
> Since I think this field is usually not interesting to most users of
> pg_stat_activity, maybe this should instead be implemented as a function like
> pg_backend_get_subxact_status(pid).
>
> People who want to could use it like:
> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

Yeah, this is a valid point, I will change this accordingly.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Dilip Kumar
On Tue, Dec 7, 2021 at 10:29 AM Nikolay Samokhvalov
 wrote:
>
> On Mon, Dec 6, 2021 at 8:16 PM Dilip Kumar  wrote:
>>
>> If the subtransaction cache is overflowed in some of the transactions
>> then it will affect all the concurrent queries as they need to access
>> the SLRU for checking the visibility of each tuple.  But currently
>> there is no way to identify whether in any backend subtransaction is
>> overflowed or what is the current active subtransaction count.
>
>
> I think it's a good idea – had the same need when recently researching 
> various issues with subtransactions [1], needed to patch Postgres in 
> benchmarking environments. To be fair, there is a way to understand that the 
> overflowed state is reached for PG 13+ – on standbys, observe reads in 
> Subtrans in pg_stat_slru. But of course, it's an indirect way.

Yeah right.
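
(For reference, that indirect check on a PG 13+ standby would be roughly:

    SELECT name, blks_hit, blks_read
    FROM pg_stat_slru
    WHERE name = 'Subtrans';

where a growing blks_read suggests that backends are falling back to the
SLRU, i.e. some transaction has likely overflowed its subtransaction cache.)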

> I see that the patch adds two new columns to pg_stat_activity: subxact_count 
> and subxact_overflowed. This should be helpful to have. Additionally, 
> exposing the lastOverflowedXid value would be also good for troubleshooting 
> of subtransaction edge and corner cases – a bug recently fixed in all current 
> versions [2] was really tricky to troubleshoot in production because this 
> value is not visible to DBAs.

Yeah, we can show this too, although we need to take ProcArrayLock in
shared mode to read it; but anyway that will be done on user request,
so it should not be an issue IMHO.

I will post the updated patch soon along with comments given by
Zhihong Yu and Justin.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Non-superuser subscription owners

2021-12-07 Thread Amit Kapila
On Mon, Dec 6, 2021 at 9:26 PM Mark Dilger  wrote:
>
> > On Dec 6, 2021, at 2:19 AM, Amit Kapila  wrote:
> >
> >>> If we want to maintain the property that subscriptions can only be
> >>> owned by superuser
>
> We don't want to maintain such a property, or at least, that's not what I 
> want.  I don't think that's what Jeff wants, either.
>
> To clarify, I'm not entirely sure how to interpret the verb "maintain" in 
> your question, since before the patch the property does not exist, and after 
> the patch, it continues to not exist.  We could *add* such a property, of 
> course, though this patch does not attempt any such thing.
>

Okay, let me try to explain again. Following is the text from docs
[1]: " (a) To create a subscription, the user must be a superuser. (b)
The subscription apply process will run in the local database with the
privileges of a superuser. (c) Privileges are only checked once at the
start of a replication connection. They are not re-checked as each
change record is read from the publisher, nor are they re-checked for
each change when applied."

My understanding is that we want to improve what is written as (c)
which I think is the same as what you mentioned later as "Fix the
current bug wherein subscription changes are applied with superuser
force after the subscription owner has superuser privileges revoked."
Am I correct so far? If so, I think what I am suggesting should fix
this with the assumption that we still want to follow (b) at least for
the first patch. One possibility is that our understanding of the
first problem is the same, but you want to allow the apply worker to
keep running even when superuser privileges are revoked, provided the
user it is running as has appropriate privileges on the objects being
accessed by the apply worker.

We will talk about other points of the roadmap you mentioned once our
understanding for the first one matches.

[1] - https://www.postgresql.org/docs/devel/logical-replication-security.html

-- 
With Regards,
Amit Kapila.




Re: Alter all tables in schema owner fix

2021-12-07 Thread Greg Nancarrow
On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila  wrote:
>
> Thanks, the patch looks mostly good to me. I have slightly modified it
> to incorporate one of Michael's suggestions, ran pgindent, and
> modified the commit message.
>

LGTM, except in the patch commit message I'd change "Create
Publication" to "CREATE PUBLICATION".


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Masahiko Sawada
On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  wrote:
>
> On Fri, Dec 3, 2021 at 12:12 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 3, 2021 at 11:53 AM Amit Kapila  wrote:
> > >
> > > >  But this syntax gives you flexibility, so we can also
> > > > start with a simple implementation.
> > > >
> > >
> > > Yeah, I also think so. BTW, what do you think of providing extra
> > > flexibility of giving other options like 'operation', 'rel' along with
> > > xid? I think such options could be useful for large transactions that
> > > operate on multiple tables as it is quite possible that only a
> > > particular operation from the entire transaction is the cause of
> > > failure. Now, on one side, we can argue that skipping the entire
> > > transaction is better from the consistency point of view but I think
> > > it is already possible that we just skip a particular update/delete
> > > (if the corresponding tuple doesn't exist on the subscriber). For the
> > > sake of simplicity, we can just allow providing xid at this stage and
> > > then extend it later as required but I am not very sure of that point.
> >
> > +1
> >
> > Skipping a whole transaction by specifying xid would be a good start.
> >
>
> Okay, that sounds reasonable, so let's do that for now.

I'll submit the patch tomorrow.

While updating the patch, I realized that skipping a transaction that
is prepared on the publisher will be a bit tricky;

First of all, since skip-xid is in the pg_subscription catalog, we need
to do a catalog update in a transaction and commit it to clear the
setting. I think we need to set the origin LSN and timestamp of the
transaction being skipped on the transaction that does the catalog
update. That is, while skipping the (non-prepared) transaction, we skip
all data-modification changes coming from the publisher, do the catalog
update, and commit the transaction. If we did the catalog update in the
next transaction after skipping the whole transaction, skip_xid could be
left behind in case of a server crash between them. Also, we cannot set
the origin LSN and timestamp on an empty transaction.

In prepared transaction cases, I think that when handling a prepare
message, we need to commit the transaction to update the catalog,
instead of preparing it. And at commit prepared and rollback prepared
time, we skip it since the prepared transaction does not exist on the
subscriber. Currently, handling rollback prepared already behaves this
way; it first checks whether we have prepared the transaction and skips
it if we haven't. So I think we need to do the same for the commit
prepared case. With that, this requires protocol changes so that the
subscriber can get the prepare LSN and prepare time when handling
commit prepared.

So I’m writing a separate patch to add the prepare LSN and timestamp to
the commit_prepared message, which will be a building block for skipping
prepared transactions. Actually, I think it’s beneficial even today; we
can skip preparing the transaction if it’s an empty transaction.
Although the comment says it’s not a common case, I think it could
happen quite often in some cases:

* XXX, We can optimize such that at commit prepared time, we first check
* whether we have prepared the transaction or not but that doesn't seem
* worthwhile because such cases shouldn't be common.
*/

For example, if the publisher has multiple subscriptions and there are
many prepared transactions that modify the particular table subscribed
by one publisher, many empty transactions are replicated to other
subscribers.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Data is copied twice when specifying both child and parent table in publication

2021-12-07 Thread vignesh C
On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com
 wrote:
>
> On Thursday, December 2, 2021 4:54 PM houzj.f...@fujitsu.com 
>  wrote:
> > On Thursday, December 2, 2021 12:50 PM Amit Kapila
> >  wrote:
> > > On Thu, Dec 2, 2021 at 9:41 AM Greg Nancarrow 
> > > wrote:
> > > >
> > > > On Thu, Dec 2, 2021 at 1:48 PM Greg Nancarrow 
> > > wrote:
> > > > >
> > > > > If you updated my original description to say "(instead of just
> > > > > the individual partitions)", it would imply the same I think.
> > > > > But I don't mind if you want to explicitly state both cases to make 
> > > > > it clear.
> > > > >
> > > >
> > > > For example, something like:
> > > >
> > > > For publications of partitioned tables with
> > > > publish_via_partition_root set to true, only the partitioned table
> > > > (and not its partitions) is included in the view, whereas if
> > > > publish_via_partition_root is set to false, only the individual 
> > > > partitions are
> > included in the view.
> > > >
> > >
> > > Yeah, that sounds good to me.
> >
> > It looks good to me as well.
> > Attach the patches for (HEAD~13) which merge the suggested doc change. I
> > prepared the code patch and test patch separately to make it easier for
> > committer to confirm.
>
> It seems we might not need to backpatch the doc change, so
> attach another version which remove the doc changes from backpatch patches.

Thanks for the patches. They apply and the tests pass on HEAD and in
the back branches. One minor suggestion:
1) Shall we change:
+  
+   For publications of partitioned tables with
+   publish_via_partition_root set to
+   true, only the partitioned table (and not its partitions)
+   is included in the view, whereas if
+   publish_via_partition_root is set to
+   false, only the individual partitions are included in the
+   view.
+  
To:
+  
+   For publications of partitioned tables with
+   publish_via_partition_root set to
+   true, only the partitioned table (and not its partitions)
+   is included in the view, whereas if
+   publish_via_partition_root is set to
+   false, only the individual partitions (and not the
+   partitioned table) are included in the
+   view.
+  

2) Any particular reason why the code and tests are backbranched but
not the document changes?

Regards,
Vignesh




Re: pg_get_publication_tables() output duplicate relid

2021-12-07 Thread Amit Langote
On Tue, Dec 7, 2021 at 3:20 PM Amit Kapila  wrote:
> On Tue, Dec 7, 2021 at 10:52 AM Amit Langote  wrote:
> >  So if a partition is
> > explicitly present in a publication with a filter defined on it, I
> > suppose following are the possible scenarios:
> >
> > * publish_via_partition_root=true
> >
> > 1. If the root table is not present in the publication (perhaps
> > unlikely), use the partition's filter because there's nothing else to
> > consider.
> >
> > 2. If the root table is present in the publication (very likely), use
> > the root's filter, ignoring/overriding the partition's own if any.
> >
> > * publish_via_partition_root=false
> >
> > 1. If the root table is not present in the publication, use the
> > partition's filter because there's nothing else to consider.
> >
> > 2. If the root table is present in the publication, use the
> > partition's own filter if present, else root's.
>
> I have not tested/checked each of these scenarios individually but I
> think it is something like, if publish_via_partition_root is false
> then we will always try to use partitions row filter and if it is not
> there then we don't use any filter. Similarly, if
> publish_via_partition_root is true, then we always try to use a
> partitioned table row filter and if it is not there (either because
> the partitioned table is not part of the publication or because there
> is no filter defined for it) then we don't use any filter.

Okay, thanks for the explanation.


--
Amit Langote
EDB: http://www.enterprisedb.com




Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Bharath Rupireddy
On Wed, Nov 17, 2021 at 8:01 AM Michael Paquier  wrote:
>
> On Tue, Nov 16, 2021 at 01:26:49PM -0300, Alvaro Herrera wrote:
> > My opinion is that adding these things willy-nilly is not a solution to
> > any actual problem.  Adding a few additional log lines that are
> > low-volume at DEBUG1 might be useful, but below that (DEBUG2 etc) it's
> > not good for anything other than specific development, IMO.  At least
> > this particular one for streaming replication I think we should not
> > include.
>
> Looking at v2, I think that this leaves the additions of the DEBUG1
> entries in SendBaseBackup() and WalRcvWaitForStartPosition(), then.
> The one in pgarch.c does not provide any additional information as the
> segment to-be-archived should be part of the command.

Thank you all for the inputs. Here's the patch that I've come up with.

Thinking about it further, having at least the messages at [1] at LOG
level would be helpful for knowing what's happening with the system
while it is in recovery. Although these messages at LOG level might seem
to fill up the server logs, a good log consumption and rotation
mechanism (I'm sure every major Postgres vendor has one) should be
sufficient to allay that concern.

These LOG messages would help us know how much time a restore command
takes to fetch a WAL file, which WAL file the server is currently
recovering, and where it is recovering it from. Customers often ask
questions like: when will my server come up? How much time will the
recovery of my server take?

As a developer or admin, one can monitor these logs and do a bunch of things:
1) see how many WAL files are left to be recovered by looking at the WAL
files in the archive location, the pg_wal directory, or on the primary;
2) provide an approximate estimate of when the server will come up or
how much longer the recovery will take: from the previous LOG messages
one can work out the number of WAL files the server recovered per
minute, and combine that with the number of left-over WAL files
calculated in (1).

[1]
ereport(LOG,
(errmsg("waiting for WAL segment \"%s\" from archive",
xlogfname)));

ereport(LOG,
(errmsg("restored WAL segment \"%s\" from archive",
xlogfname)));

ereport(LOG,
(errmsg("recovering WAL segment \"%s\" from source \"%s\"",
xlogfname, srcname)));

Regards,
Bharath Rupireddy.


v3-0001-add-log-messages-for-recovery.patch
Description: Binary data


Re: row filtering for logical replication

2021-12-07 Thread Ashutosh Bapat
On Tue, Dec 7, 2021 at 12:18 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Friday, December 3, 2021 10:09 AM Peter Smith  
> wrote:
> >
> > On Thu, Dec 2, 2021 at 2:32 PM tanghy.f...@fujitsu.com
> >  wrote:
> > >
> > > On Thursday, December 2, 2021 5:21 AM Peter Smith
> >  wrote:
> > > >
> > > > PSA the v44* set of patches.
> > > >
> > >
> > > Thanks for the new patch. Few comments:
> > >
> > > 1. This is an example in publication doc, but in fact it's not allowed. 
> > > Should we
> > > change this example?
> > >
> > > +CREATE PUBLICATION active_departments FOR TABLE departments WHERE
> > (active IS TRUE);
> > >
> > > postgres=# CREATE PUBLICATION active_departments FOR TABLE departments
> > WHERE (active IS TRUE);
> > > ERROR:  invalid publication WHERE expression for relation "departments"
> > > HINT:  only simple expressions using columns, constants and immutable 
> > > system
> > functions are allowed
> > >
> >
> > Thanks for finding this. Actually, the documentation looks correct to
> > me. The problem was the validation walker of patch 0002 was being
> > overly restrictive. It needed to also allow a BooleanTest node.
> >
> > Now it works (locally) for me. For example.
> >
> > test_pub=# create table departments(depno int primary key, active boolean);
> > CREATE TABLE
> > test_pub=# create publication pdept for table departments where
> > (active is true) with (publish="insert");
> > CREATE PUBLICATION
> > test_pub=# create publication pdept2 for table departments where
> > (active is false) with (publish="insert");
> > CREATE PUBLICATION
> >
> > This fix will be available in v45*.
> >
>
> Thanks for looking into it.
>
> I have another problem with your patch. The document says:
>
> ... If the subscription has several publications in
> +   which the same table has been published with different filters, those
> +   expressions get OR'ed together so that rows satisfying any of the 
> expressions
> +   will be replicated. Notice this means if one of the publications has no 
> filter
> +   at all then all other filters become redundant.
>
> Then, what if one of the publications is specified as 'FOR ALL TABLES' or 'FOR
> ALL TABLES IN SCHEMA'.
>
> For example:
> create table tbl (a int primary key);"
> create publication p1 for table tbl where (a > 10);
> create publication p2 for all tables;
> create subscription sub connection 'dbname=postgres port=5432' publication 
> p1, p2;

Thanks for the example. I was wondering about this case myself.

>
> I think for "FOR ALL TABLE" publication(p2 in my case), table tbl should be
> treated as no filter, and table tbl should have no filter in subscription 
> sub. Thoughts?
>
> But for now, the filter(a > 10) works both when copying initial data and 
> later changes.
>
> To fix it, I think we can check if the table is published in a 'FOR ALL 
> TABLES'
> publication or published as part of schema in function 
> pgoutput_row_filter_init
> (which was introduced in v44-0003 patch), also we need to make some changes in
> tablesync.c.

In order to check "FOR ALL TABLES", we might need to fetch publication
metadata. Instead of that, can we add a "TRUE" filter on all the tables
which are part of a FOR ALL TABLES publication?

-- 
Best Wishes,
Ashutosh Bapat




Re: should we document an example to set multiple libraries in shared_preload_libraries?

2021-12-07 Thread Bharath Rupireddy
On Mon, Dec 6, 2021 at 9:20 PM Bharath Rupireddy
 wrote:
>
> On Fri, Dec 3, 2021 at 11:25 PM Bossart, Nathan  wrote:
> >
> > On 12/3/21, 6:21 AM, "Bharath Rupireddy" 
> >  wrote:
> > > +1 to add here in the "Parameter Names and Values section", but do we
> > > want to backlink every string parameter to this section? I think it
> > > needs more effort. IMO, we can just backlink for
> > > shared_preload_libraries alone. Thoughts?
> >
> > IMO this is most important for GUC_LIST_QUOTE parameters, of which
> > there are only a handful.  I don't think adding a link to every string
> > parameter is necessary.
>
> Agree.
>
> Should we specify something like below in the "Parameter Names and
> Values" section's "String:" para? Do we use generic terminology like
> 'name' and val1, val2, val3 and so on?
>
> ALTER SYSTEM SET name = val1,val2,val3;
> ALTER SYSTEM SET name = 'val1', 'val2', 'val3';
> ALTER SYSTEM SET name = '"val 1"', '"val,2"', 'val3';
>
> Another thing I observed is the difference between how the
> postgresql.conf file and ALTER SYSTEM SET command is parsed for
> GUC_LIST_QUOTE values.

Since there's a difference in the way the parameters are set in the
postgresql.conf file and with the ALTER SYSTEM SET command (as pointed
out by Tom in this thread [1]), I'm now confused. If we were to add
examples to the "Parameter Names and Values" section, which examples
should we add, the postgresql.conf ones or the ALTER SYSTEM SET command
ones?
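
For concreteness, the two forms being compared would look something like this
(using shared_preload_libraries with illustrative values):

    # postgresql.conf: the whole list is one quoted, comma-separated string
    shared_preload_libraries = 'pg_stat_statements, auto_explain'

    -- ALTER SYSTEM: the list elements are given as separate SQL literals
    ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements', 'auto_explain';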

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/3108541.1638806477%40sss.pgh.pa.us

Regards,
Bharath Rupireddy.




Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
Hi,

Currently one can know the kind of the ongoing/last checkpoint (shutdown,
end-of-recovery, immediate, force, etc.) only via the server logs, and
only when the log_checkpoints GUC is on. At times, users or service-layer
components would want to know the kind of checkpoint (along with other
checkpoint-related info) to take some action, and it is a bit difficult
to search through the server logs for that. The checkpoint info can be
obtained from the control file (either via pg_control_checkpoint() or via
the pg_controldata tool), whereas the checkpoint kind isn't available
there.

How about we add an extra string field to the control file alongside
the other checkpoint info it already has? This way, the existing
pg_control_checkpoint() function or pg_controldata tool can easily be
enhanced to show the checkpoint kind as well. One concern is that we
don't want to increase the size of pg_control beyond the typical block
size (of 8K) to avoid any torn writes. With this change, we would add
at most the strings specified at [1]. Adding it to the control file has
the advantage of preserving the last checkpoint kind, which might be
useful.
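
To illustrate, the rest of the checkpoint info is already visible today, e.g.:

    SELECT checkpoint_lsn, redo_lsn, checkpoint_time
    FROM pg_control_checkpoint();

and the proposal is simply to add one more field (the checkpoint kind string)
to the control file so that it can be surfaced here and in pg_controldata;
such a column does not exist in any current release.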

Thoughts?

[1] for checkpoint: "checkpoint shutdown end-of-recovery immediate
force wait wal time flush-all"
for restartpoint: "restartpoint shutdown end-of-recovery immediate
force wait wal time flush-all"

Regards,
Bharath Rupireddy.




Re: logical decoding and replication of sequences

2021-12-07 Thread Peter Eisentraut

I have checked the 0001 and 0003 patches.  (I think we have dismissed
the approach in 0002 for now.  And let's talk about 0004 later.)

I have attached a few fixup patches, described further below.

# 0001

The argument "create" for fill_seq_with_data() is always true (and
patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
should be documented somewhere.

About the comment you added in nextval_internal(): It's a good
explanation, so I would leave it in.  I also agree with the
conclusion of the comment that the current solution is reasonable.  We
probably don't need the same comment again in fill_seq_with_data() and
again in do_setval().  Note that both of the latter functions already
point to nextval_internal() for other comments, so the same can be
relied upon here as well.

The order of the new fields sequence_cb and stream_sequence_cb is a
bit inconsistent compared to truncate_cb and stream_truncate_cb.
Maybe take another look to make the order more uniform.

Some documentation needs to be added to the Logical Decoding chapter.
I have attached a patch that I think catches all the places that need
to be updated, with some details left for you to fill in. ;-) The
ordering of the some of the documentation chunks reflects the order in
which the callbacks appear in the header files, which might not be
optimal; see above.

In reorderbuffer.c, you left a comment about how to access a sequence
tuple.  There is an easier way, using Form_pg_sequence_data, which is
how sequence.c also does it.  See attached patch.

# 0003

The tests added in 0003 don't pass for me.  It appears that you might
have forgotten to update the expected files after you added some tests
or something.  The cfbot shows the same.  See attached patch for a
correction, but do check what your intent was.

As mentioned before, we probably don't need the skip-sequences option
in test_decoding.

From 2dd9a0e76b9496016e7abebebd0006ddee72d3c1 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Dec 2021 14:19:42 +0100
Subject: [PATCH 1/4] Whitespace fixes

---
 contrib/test_decoding/expected/sequence.out | 6 +++---
 contrib/test_decoding/sql/sequence.sql  | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/contrib/test_decoding/expected/sequence.out 
b/contrib/test_decoding/expected/sequence.out
index 17c88990b1..ca55bcebc3 100644
--- a/contrib/test_decoding/expected/sequence.out
+++ b/contrib/test_decoding/expected/sequence.out
@@ -177,7 +177,7 @@ SELECT data FROM 
pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 --
 (0 rows)
 
--- rollback on table creation with serial column 
+-- rollback on table creation with serial column
 BEGIN;
 CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (100);
@@ -202,7 +202,7 @@ SELECT data FROM 
pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'inc
 --
 (0 rows)
 
--- rollback on table with serial column 
+-- rollback on table with serial column
 CREATE TABLE test_table (a SERIAL, b INT);
 BEGIN;
 INSERT INTO test_table (b) VALUES (100);
@@ -222,7 +222,7 @@ INSERT INTO test_table (b) VALUES (700);
 INSERT INTO test_table (b) VALUES (800);
 INSERT INTO test_table (b) VALUES (900);
 ROLLBACK;
--- check table and sequence values after rollback 
+-- check table and sequence values after rollback
 SELECT * from test_table_a_seq;
  last_value | log_cnt | is_called 
 +-+---
diff --git a/contrib/test_decoding/sql/sequence.sql 
b/contrib/test_decoding/sql/sequence.sql
index d8a34738f3..21c4b79222 100644
--- a/contrib/test_decoding/sql/sequence.sql
+++ b/contrib/test_decoding/sql/sequence.sql
@@ -48,7 +48,7 @@ CREATE TABLE test_table (a INT);
 ROLLBACK;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 
'include-xids', '0', 'skip-empty-xacts', '1', 'skip-sequences', '0');
 
--- rollback on table creation with serial column 
+-- rollback on table creation with serial column
 BEGIN;
 CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (100);
@@ -65,7 +65,7 @@ CREATE TABLE test_table (a SERIAL, b INT);
 ROLLBACK;
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 
'include-xids', '0', 'skip-empty-xacts', '1', 'skip-sequences', '0');
 
--- rollback on table with serial column 
+-- rollback on table with serial column
 CREATE TABLE test_table (a SERIAL, b INT);
 
 BEGIN;
@@ -82,7 +82,7 @@ CREATE TABLE test_table (a SERIAL, b INT);
 INSERT INTO test_table (b) VALUES (900);
 ROLLBACK;
 
--- check table and sequence values after rollback 
+-- check table and sequence values after rollback
 SELECT * from test_table_a_seq;
 SELECT nextval('test_table_a_seq');
 
-- 
2.34.1

From c5e9eb2792bd276e493aed582627e93b4c681374 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 7 Dec 2021 15:29:34 +0100
Subject: [PATCH 2/4] Make tests pass

---
 contrib/test_decoding/expected/sequence.out | 18 --
 1 f

Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Peter Eisentraut

On 03.12.21 03:53, Amit Kapila wrote:

I don't know how difficult it would be, but allowing multiple xids might
be desirable.


Are there many cases where there could be multiple xid failures that
the user can skip? The apply worker always keeps looping on the same
error, so the user wouldn't know of a second xid failure (if any) till
the first one is resolved.


Yeah, nevermind, doesn't make sense.


Yeah, I also think so. BTW, what do you think of providing extra
flexibility of giving other options like 'operation', 'rel' along with
xid? I think such options could be useful for large transactions that
operate on multiple tables as it is quite possible that only a
particular operation from the entire transaction is the cause of
failure. Now, on one side, we can argue that skipping the entire
transaction is better from the consistency point of view but I think
it is already possible that we just skip a particular update/delete
(if the corresponding tuple doesn't exist on the subscriber). For the
sake of simplicity, we can just allow providing xid at this stage and
then extend it later as required but I am not very sure of that point.


Skipping transactions partially sounds dangerous, especially when 
exposed as an option to users.  Needs more careful thought.





Re: Non-superuser subscription owners

2021-12-07 Thread Mark Dilger



> On Dec 7, 2021, at 2:29 AM, Amit Kapila  wrote:
> 
> Okay, let me try to explain again. Following is the text from docs
> [1]: " (a) To create a subscription, the user must be a superuser. (b)
> The subscription apply process will run in the local database with the
> privileges of a superuser. (c) Privileges are only checked once at the
> start of a replication connection. They are not re-checked as each
> change record is read from the publisher, nor are they re-checked for
> each change when applied.
> 
> My understanding is that we want to improve what is written as (c)
> which I think is the same as what you mentioned later as "Fix the
> current bug wherein subscription changes are applied with superuser
> force after the subscription owner has superuser privileges revoked.".
> Am I correct till here? If so, I think what I am suggesting should fix
> this with the assumption that we still want to follow (b) at least for
> the first patch.

Ok, that's a point of disagreement.  I was trying to fix both (b) and (c) in 
the first patch.

> One possibility is that our understanding of the
> first problem is the same but you want to allow apply worker running
> even when superuser privileges are revoked provided the user with
> which it is running has appropriate privileges on the objects being
> accessed by apply worker.

Correct, that's what I'm trying to make safe.

> We will talk about other points of the roadmap you mentioned once our
> understanding for the first one matches.

I am happy to have an off-list phone call with you, if you like.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Transparent column encryption

2021-12-07 Thread Peter Eisentraut

On 06.12.21 19:28, Robert Haas wrote:

Speaking of parameter descriptions, the trickiest part of this whole
thing appears to be how to get transparently encrypted data into the
database (as opposed to reading it out).  It is required to use
protocol-level prepared statements (i.e., extended query) for this.

Why? If the client knows the CEK, can't the client choose to send
unprepared insert or update statements with pre-encrypted blobs? That
might be a bad idea from a security perspective because the encrypted
blob might then got logged, but we sometimes log parameters, too.


The client can send something like

PQexec(conn, "INSERT INTO tbl VALUES ('ENCBLOB', 'ENCBLOB')");

and it will work.  (See the included test suite where 'ENCBLOB' is 
actually computed by pgcrypto.)  But that is not transparent encryption. 
 The client wants to send "INSERT INTO tbl VALUES ('val1', 'val2')" and 
have libpq take care of encrypting 'val1' and 'val2' before hitting the 
wire.  For that you need to use the prepared statement API so that the 
values are available separately from the statement.  And furthermore the 
client needs to know what columns the insert statement is writing to, 
so that it can get the CEK for that column.  That's what it needs the 
parameter description for.
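
To make that concrete, here is a minimal sketch of the client-side shape
being described (my own example; the table and values are made up, and
this is plain PQexecParams, not the proposed encryption-aware code path):

#include <stdio.h>
#include <libpq-fe.h>

/* The statement text and the values travel separately through the
 * extended query protocol; that separation is what would let libpq
 * encrypt the parameter values before they hit the wire. */
static void
insert_row(PGconn *conn, const char *val1, const char *val2)
{
    const char *params[2] = {val1, val2};
    PGresult   *res;

    res = PQexecParams(conn,
                       "INSERT INTO tbl VALUES ($1, $2)",
                       2,      /* number of parameters */
                       NULL,   /* let the server infer parameter types */
                       params,
                       NULL,   /* text-format params need no lengths */
                       NULL,   /* all parameters in text format */
                       0);     /* request text-format results */

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "INSERT failed: %s", PQerrorMessage(conn));
    PQclear(res);
}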


As alluded to, workarounds exist or might be made available to do part 
of that work yourself, but that shouldn't be the normal way of using it.





pg_dump/restore --no-tableam

2021-12-07 Thread Justin Pryzby
I first suggested this a couple years ago.
Is it desirable to implement in pg_dump and pg_restore ?
It'd be just like --tablespace.

On Tue, Jan 28, 2020 at 07:33:17AM -0600, Justin Pryzby wrote:
> I made these casual comments.  If there's any agreement on their merit, it'd 
> be
> nice to implement at least the first for v13.
> 
> In <20190818193533.gl11...@telsasoft.com>, I wrote: 
> >  . What do you think about pg_restore --no-tableam; similar to 
> >--no-tablespaces, it would allow restoring a table to a different AM:
> >PGOPTIONS='-c default_table_access_method=zedstore' pg_restore 
> > --no-tableam ./pg_dump.dat -d postgres
> >Otherwise, the dump says "SET default_table_access_method=heap", which
> >overrides any value from PGOPTIONS and precludes restoring to new AM.
> 
> That appears to be a trivial variation on no-tablespace:
> 
> /* do nothing in --no-tablespaces mode */
> if (ropt->noTablespace)
> return;
...
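
For what it's worth, a hypothetical sketch of the restore-side check,
mirroring the tablespace code quoted above (the function and field names
for --no-tableam/noTableAm are invented here, and I'm assuming the
existing ahprintf/fmtId helpers apply the same way they do for
tablespaces):

/* sketch only -- not existing code */
static void
_selectTableAccessMethod(ArchiveHandle *AH, const char *tableam)
{
    RestoreOptions *ropt = AH->public.ropt;

    /* do nothing in --no-tableam mode */
    if (ropt->noTableAm)
        return;

    /* otherwise emit the SET, as the dump output does today */
    ahprintf(AH, "SET default_table_access_method = %s;\n\n",
             fmtId(tableam));
}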




Re: Transparent column encryption

2021-12-07 Thread Peter Eisentraut

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)


I'm aware of this and plan to make use of SIV.  The current 
implementation is just an example.



Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.


I don't know how that is supposed to work.  When the value is encrypted 
for insertion, the client may know things like table name or column 
name, so it can tie it to those.  But it doesn't know what row it will 
go in, so you can't prevent the value from being copied into another 
row.  You would need some permanent logical row ID for this, I think. 
For this scenario, the deterministic encryption mode is perhaps not the 
right one.



This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.


What kinds of attacks are you hoping to prevent (and not prevent)?


The point is to prevent admins from getting at plaintext data.  The 
scenario you show is an interesting one but I think it's not meant to be 
addressed by this.  If admins can alter the database to their advantage, 
they could perhaps increase their account balance, create discount 
codes, etc. also.


If this is a problem, then perhaps a better approach would be to store 
parts of the data in a separate database with separate admins.





Re: Is there any documentation on how to correctly create extensions in HA(primary-standby) setup?

2021-12-07 Thread Peter Eisentraut

On 03.12.21 15:28, Bharath Rupireddy wrote:

I'm thinking of adding the above steps into the "Additional Supplied
Modules" section documentation. Any thoughts please?

[1] - https://www.postgresql.org/docs/devel/contrib.html


The chapter about extensions is probably better: 
https://www.postgresql.org/docs/devel/extend-extensions.html





Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Peter Eisentraut

On 02.12.21 22:22, Tom Lane wrote:

My first thought about fixing point 1 was to put "char" into some
other typcategory, but that turns out to break some of psql's
catalog queries, with results like:

regression=# \dp
ERROR:  operator is not unique: unknown || "char"
LINE 16:E' (' || polcmd || E'):'
   ^
HINT:  Could not choose a best candidate operator. You might need to add 
explicit type casts.


Could we add explicit casts (like polcmd::text) here?  Or would it break 
too much?






Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Tom Lane
Peter Eisentraut  writes:
> On 02.12.21 22:22, Tom Lane wrote:
>> My first thought about fixing point 1 was to put "char" into some
>> other typcategory, but that turns out to break some of psql's
>> catalog queries, with results like:
>> 
>> regression=# \dp
>> ERROR:  operator is not unique: unknown || "char"
>> LINE 16:E' (' || polcmd || E'):'
>> ^
>> HINT:  Could not choose a best candidate operator. You might need to add 
>> explicit type casts.

> Could we add explicit casts (like polcmd::text) here?  Or would it break 
> too much?

I assumed it'd break too much to consider doing that.  But I suppose
that since a typcategory change would be initdb-forcing anyway, maybe
it's not out of the question.  I'll investigate and see exactly how
many places would need an explicit cast.

regards, tom lane




Re: types reliant on encodings [was Re: Dubious usage of TYPCATEGORY_STRING]

2021-12-07 Thread Peter Eisentraut

On 03.12.21 19:42, Chapman Flack wrote:

Is there any way to find out, from the catalogs or in any automatable way,
which types are implemented with a dependence on the database encoding
(or on some encoding)?


What is this needed for?  C code can internally do whatever it wants, 
and the database encoding is effectively a constant, so there is no need 
for server-side code to be very much concerned about whether types do this.


Also, "types" is perhaps the wrong subject here.  Types only contain 
input and output functions and a few more bits.  Additional functions 
operating on the type could look at the server encoding without the type 
and its core functions knowing about it.






Re: Triage for unimplemented geometric operators

2021-12-07 Thread Anton Voloshin

On 07/12/2021 06:18, Tom Lane wrote:

So my inclination for HEAD is to implement poly_distance and nuke
the others.  I'm a bit less sure about the back branches, but maybe
just de-document all of them there.


I agree, that seems to be a reasonable compromise. Removing 
20-plus-year-old unused and slightly misleading code should probably 
outweigh the natural inclination to implement all of the functions 
promised in the catalog. Enhancing software by deleting code is not an 
everyday opportunity and IMHO should be taken, even when it requires an 
extra catversion bump.


I am split on whether it is worth implementing poly_distance in back 
branches. Maybe there is a benefit in doing so: it should not cause any 
real incompatibilities and would give somewhat better compatibility with 
v15+. We could even get away with not updating the v10..v12 docs on 
poly_distance because it's not mentioned there anyway.


I agree on de-documenting all of the unimplemented functions in v13 and 
v14. Branches before v13 should not require any changes (including 
documentation) because the detailed table of which operators support 
which geometric primitives only appeared in v13, and I could not find in 
e.g. v12's documentation any references to the cases you listed 
previously:

> dist_lb:   <->(line,box)
> dist_bl:   <->(box,line)
> close_sl:  lseg ## line
> close_lb:  line ## box
> poly_distance: polygon <-> polygon
> path_center:   @@ path

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru




Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Robert Haas
On Mon, Dec 6, 2021 at 9:42 PM Peter Geoghegan  wrote:
> On Mon, Dec 6, 2021 at 6:11 PM Robert Haas  wrote:
> > This doesn't seem convincing. Launching autovacuum too soon surely has
> > costs that someone might not want to pay. Clearly in the degenerate
> > case where we always autovacuum every table every time an autovacuum
> > worker is launched, we have gone insane.
>
> Unfortunately we already sometimes behave insanely in exactly the way
> that you describe:
>
> https://postgr.es/m/CAH2-Wz=sJm3tm+FpXbyBhEhX5tbz1trQrhG6eOhYk4-+5uL=w...@mail.gmail.com

In the same ballpark is http://rhaas.blogspot.com/2020/02/useless-vacuuming.html

> It's hard to define a break-even point for launching an autovacuum
> worker. I think it would be more productive to come up with a design
> that at least doesn't go completely off the rails in various specific
> ways.

I think that's a good observation. I think the current autovacuum
algorithm works pretty well when things are normal. But in extreme
scenarios it does not degrade gracefully. The
vacuuming-over-and-over-without-doing-anything phenomenon is an
example of that. Another example is the user who creates 10,000
databases and we're happy to divide the 60s autovacuum_naptime by
10,000 and try to launch a worker every 6 ms. A third example is
vacuuming the tables from pg_class in physical order on disk, so that
a table that is 1 XID past the wraparound limit can result in a long
delay vacuuming a table that is bloating quickly, or conversely a
table that is bloating very slowly but has just crawled over the
threshold for a regular vacuum gets processed before one that is
threatening an imminent wraparound shutdown. I think these are all
pathological cases that a well-informed human can easily recognize and
handle in an intelligent manner, and it doesn't seem crazy to program
those responses into the computer in some way.

> I also think that our problem is not so much that we don't have
> accurate statistics about dead tuples (though we surely don't have
> much accuracy). The main problem seems to be that there are various
> specific, important ways in which the distribution of dead tuples may
> matter (leading to various harmful biases). And so it seems reasonable
> to fudge how we interpret dead tuples with the intention of capturing
> some of that, as a medium term solution. Something that considers the
> concentration of dead tuples in heap pages seems promising.

I am less convinced about this part. It sort of sounds like you're
saying - it doesn't really matter whether the numbers we gather are
accurate, just that they produce the correct results. If the
information we currently gather wouldn't produce the right results
even if it were fully accurate, that to me suggests that we're
gathering the wrong information, and we should gather something else.
For example, we could attack the useless-vacuuming problem by having
each vacuum figure out - and store - the oldest XID that could
potentially be worth using as a cutoff for vacuuming that table, and
refusing to autovacuum that table again until we'd be able to use a
cutoff >= that value. I suppose this would be the oldest of (a) any
XID that exists in the table on a tuple that we judged recently dead,
(b) any XID that is currently-running, and (c) the next XID.
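
A very rough sketch of that idea (the function is invented; I'm only
assuming the existing ReadNextTransactionId, GetOldestActiveTransactionId,
and TransactionIdPrecedes primitives):

#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

/* Sketch: the oldest cutoff that could make a future vacuum worthwhile
 * is the oldest of (a) the oldest recently-dead XID seen in the table,
 * (b) the oldest currently-running XID, and (c) the next XID. */
static TransactionId
worthwhile_cutoff(TransactionId oldest_recently_dead)
{
    TransactionId cutoff = ReadNextTransactionId();                 /* (c) */
    TransactionId oldest_running = GetOldestActiveTransactionId();  /* (b) */

    if (TransactionIdIsValid(oldest_running) &&
        TransactionIdPrecedes(oldest_running, cutoff))
        cutoff = oldest_running;

    if (TransactionIdIsValid(oldest_recently_dead) &&               /* (a) */
        TransactionIdPrecedes(oldest_recently_dead, cutoff))
        cutoff = oldest_recently_dead;

    /* skip autovacuum for this table until OldestXmin >= cutoff */
    return cutoff;
}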

I also accept that knowing the concentration of dead tuples on pages
that contain at least 1 dead tuple could be interesting. I've felt for
a while that it's a mistake to know how many dead tuples there are but
not how many pages contain them, because that affects both the amount
of I/O required to vacuum and also how much need we have to set VM
bits. I'm not sure I would have approached gathering that information
in the way that you're proposing here, but I'm not deeply against it,
either. I do think that we should try to keep it as a principle that
whatever we do gather, we should try our best to make accurate. If
that doesn't work well, then gather different stuff instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Robert Haas
On Thu, Dec 2, 2021 at 4:22 PM Tom Lane  wrote:
> An example of the reasons not to treat these types as being
> general-purpose strings can be seen at [1], where the "char"
> type has acquired some never-intended cast behaviors.  Taking
> that to an extreme, we currently accept
>
> regression=# select '(1,2)'::point::"char";
>  char
> --
>  (
> (1 row)

What's wrong with that?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: AW: Assorted improvements in pg_dump

2021-12-07 Thread Tom Lane
Hans Buschmann  writes:
> I noticed that you patched master with all the improvements in pg_dump.
> Did you change your mind about backpatching patch 0005 to fix the toast size 
> matter?

I looked briefly at that and found that the patch would have to be
largely rewritten, because getTables() looks quite different in the
older branches.  I'm not really sufficiently excited about the point
to do that rewriting and re-testing.  I think that cases where the
old logic gets the scheduling badly wrong are probably rare.

regards, tom lane




Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Robert Haas
On Mon, Dec 6, 2021 at 4:05 PM Tom Lane  wrote:
> Well, that was what I thought when I wrote bf7ca1587, but it leads
> to logical contradictions.  Consider
>
> create table t (a int, b int);
>
> create function f(t) returns ... ;
>
> select f(t) from t;
>
> select f(t) from t(x,y);
>
> If we adopt the "rename for all purposes" interpretation, then
> the second SELECT must fail, because what f() is being passed is
> no longer of type t.  If you ask me, that'll be a bigger problem
> for users than the change I'm proposing (quite independently of
> how hard it might be to implement).  It certainly will break
> a behavior that goes back much further than bf7ca1587.

For me, the second SELECT does fail:

rhaas=# select f(t) from t(x,y);
ERROR:  column "x" does not exist
LINE 1: select f(t) from t(x,y);
   ^

If it didn't fail, what would the behavior be? I suppose you could
make an argument for trying to match up the columns by position, but
if so this ought to work:

rhaas=# create table u(a int, b int);
CREATE TABLE
rhaas=# select f(u) from u;
ERROR:  function f(u) does not exist
rhaas=# select f(u::t) from u;
ERROR:  cannot cast type u to t

Matching columns by name can't work because the names don't match.
Matching columns by position does not work. So if that example
succeeds, the only real explanation is that we magically know that
we've still got a value of type t despite the user's best attempt to
decree otherwise. I know PostgreSQL sometimes ... does things like
that. I have no idea why anyone would consider it a desirable
behavior, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MSVC SSL test failure

2021-12-07 Thread Tom Lane
Alexander Lakhin  writes:
> It seems that the test failure rate may depend on the specs/environment.

No surprise there, since the issue is almost surely timing-dependent.

> shutdown(MyProcPort->sock, SD_SEND) apparently fixes the issue, I've got
> 83 successful runs, but then iteration 84 unfortunately failed:
> t/001_ssltests.pl .. 106/110
> #   Failed test 'intermediate client certificate is missing: matches'
> #   at t/001_ssltests.pl line 608.
> #   'psql: error: connection to server at "127.0.0.1",
> port 63187 failed: could not receive data from server: Software caused
> connection abort (0x2745/10053)
> # SSL SYSCALL error: Software caused connection abort (0x2745/10053)
> # could not send startup packet: No error (0x/0)'
> # doesn't match '(?^:SSL error: tlsv1 alert unknown ca)'
> # Looks like you failed 1 test of 110.

Hmm.  I wonder whether using SD_BOTH behaves any differently.

regards, tom lane




Re: pg_dump versus ancient server versions

2021-12-07 Thread Robert Haas
On Mon, Dec 6, 2021 at 4:19 PM Tom Lane  wrote:
> Right.  The question that's on the table is how much is the right
> amount of maintenance.  I think that back-patching user-visible bug
> fixes, for example, is taking things too far.  What we want is to
> be able to replicate the behavior of the branch's last released
> version, using whatever build tools we are currently using.  So
> back-patching something like that is counterproductive, because
> now the behavior is not what was released.
>
> A minimal amount of maintenance would be "only back-patch fixes
> for issues that cause failure-to-build".  The next step up is "fix
> issues that cause failure-to-pass-regression-tests", and then above
> that is "fix developer-facing annoyances, such as compiler warnings
> or unwanted test output, as long as you aren't changing user-facing
> behavior".  I now think that it'd be reasonable to include this
> last group, although I'm pretty sure Peter didn't have that in mind
> in his policy sketch.

Yep, that seems reasonable to me.

I guess the point about user-visible bug fixes is that, as soon as we
start doing that, we don't really want it to be hit-or-miss. We could
make a decision to back-patch all bug fixes or those of a certain
severity or whatever we like back to older branches, and then those
branches would be supported or semi-supported depending on what rule
we adopted, and we could even continue to do releases for them if we
so chose. However, it wouldn't be a great idea to back-patch a
completely arbitrary subset of our fixes into those branches, because
then it sort of gets confusing to understand what the status of that
branch is. I don't know that I'm terribly bothered by the idea that
the behavior of the branch might deviate from the last official
release, because most bug fixes are pretty minor and wouldn't really
affect testing much, but it would be a little annoying to explain to
users that those branches contain an arbitrary subset of newer fixes,
and a little hard for us to understand what is and is not there.

That being said, suppose that a new compiler version comes out and on
that new compiler version, 'make check' crashes on the older branch
due to a missing WhateverGetDatum() call that we rectified in a later,
back-patched commit. I would consider it reasonable to back-patch that
particular bug fix into an unsupported branch to make it testable,
just like we would do for a failure-to-build issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> On Thu, Dec 2, 2021 at 4:22 PM Tom Lane  wrote:
>> An example of the reasons not to treat these types as being
>> general-purpose strings can be seen at [1], where the "char"
>> type has acquired some never-intended cast behaviors.  Taking
>> that to an extreme, we currently accept
>> 
>> regression=# select '(1,2)'::point::"char";
>> char
>> --
>> (
>> (1 row)

> What's wrong with that?

Well, we don't allow things like

regression=# select '(1,2)'::point::float8;
ERROR:  cannot cast type point to double precision
LINE 1: select '(1,2)'::point::float8;
 ^

It's not very clear to me why "char" should get a pass on that.
We allow such cases when the target is text/varchar/etc, but
the assumption is that the textual representation is sufficient
for your purposes.  It's hard to claim that just the first
byte is a useful textual representation.

Worse, PG is actually treating this as an assignment-level cast,
so we accept this:

regression=# create table t(f1 "char");
CREATE TABLE
regression=# insert into t values ('(1,2)'::point);
INSERT 0 1
regression=# table t;
 f1 

 (
(1 row)

I definitely don't think that should have worked without
any complaint.

regards, tom lane




Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Bossart, Nathan
On 12/6/21, 8:19 PM, "Dilip Kumar"  wrote:
> If the subtransaction cache has overflowed in some of the transactions
> then it will affect all the concurrent queries, as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether the subtransaction cache has
> overflowed in any backend, or what the current active subtransaction
> count is.  The attached patch adds the subtransaction count and
> subtransaction overflow status to pg_stat_activity.  I have implemented
> this because of the recent complaint about the same [1].

I'd like to give a general +1 to this effort.  Thanks for doing this!
I've actually created a function to provide this information in the
past, so I will help review.

Nathan



Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> On Mon, Dec 6, 2021 at 4:05 PM Tom Lane  wrote:
>> select f(t) from t(x,y);
>> 
>> If we adopt the "rename for all purposes" interpretation, then
>> the second SELECT must fail, because what f() is being passed is
>> no longer of type t.

> For me, the second SELECT does fail:

> rhaas=# select f(t) from t(x,y);
> ERROR:  column "x" does not exist

Ah, sorry, I fat-fingered the alias syntax.  Here's a tested example:

regression=# create table t (a int, b int);
CREATE TABLE
regression=# insert into t values(11,12);
INSERT 0 1
regression=# create function f(t) returns int as 'select $1.a' language sql;
CREATE FUNCTION
regression=# select f(t) from t as t(x,y);
 f  

 11
(1 row)

If we consider that the alias renames the columns "for all purposes",
how is it okay for f() to select the "a" column?

Another way to phrase the issue is that the column names seen
by f() are currently different from those seen by row_to_json():

regression=# select row_to_json(t) from t as t(x,y);
   row_to_json   
-
 {"x":11,"y":12}
(1 row)

and that seems hard to justify.

regards, tom lane




Re: Pre-allocating WAL files

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 12:29 AM, "Bharath Rupireddy" 
 wrote:
> Why can't the walwriter pre-allocate some of the WAL segments instead
> of a new background process? Of course, it might delay the main
> functionality of the walwriter, i.e. flushing and syncing the WAL
> files, but having the checkpointer do the pre-allocation gives it yet
> another extra task. As for the amount of walwriter work vs. checkpointer
> work, I'm not sure which one does more.

The argument against adding it to the WAL writer is that we want it to
run with low latency to flush asynchronous commits.  If we added WAL
pre-allocation to the WAL writer, there could periodically be large
delays.

> Another idea could be to let whichever of the walwriter or checkpointer
> seems free at the moment the WAL segment pre-allocation request comes
> do the pre-allocation. We could go further and let the user choose
> which process (checkpointer or walwriter) does the pre-allocation,
> maybe with a GUC?

My latest patch set [0] adds WAL pre-allocation to the checkpointer.
In that patch set, WAL pre-allocation is done both outside of
checkpoints and during checkpoints (via
CheckPointWriteDelay()).  

Nathan

[0] 
https://www.postgresql.org/message-id/CB15BEBD-98FC-4E72-86AE-513D34014176%40amazon.com



Re: pg_dump versus ancient server versions

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> I guess the point about user-visible bug fixes is that, as soon as we
> start doing that, we don't really want it to be hit-or-miss. We could
> make a decision to back-patch all bug fixes or those of a certain
> severity or whatever we like back to older branches, and then those
> branches would be supported or semi-supported depending on what rule
> we adopted, and we could even continue to do releases for them if we
> so chose. However, it wouldn't be a great idea to back-patch a
> completely arbitrary subset of our fixes into those branches, because
> then it sort of gets confusing to understand what the status of that
> branch is.

Yup, and also confusing to understand whether a given new fix should
be back-patched into the out-of-support-but-keep-buildable branches.
I want to settle on a reasonably well-defined policy for that.

I'm basically suggesting that the policy should be "back-patch the
minimal fix needed so that you can still get a clean build and clean
check-world run, using thus-and-such configure options".  (The point
of the configure options limitation being to exclude moving-target
external dependencies, such as Python.)  I think that Peter's
original suggestion could be read the same way except for the
adjective "clean".  He also said that only core regression needs
to pass not check-world; but if we're trying to test things like
pg_dump compatibility, I think we want the wider scope of what to
keep working.

regards, tom lane




Re: Alter all tables in schema owner fix

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 2:47 AM, "Greg Nancarrow"  wrote:
> On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila  wrote:
>>
>> Thanks, the patch looks mostly good to me. I have slightly modified it
>> to incorporate one of Michael's suggestions, ran pgindent, and
>> modified the commit message.
>>
>
> LGTM, except in the patch commit message I'd change "Create
> Publication" to "CREATE PUBLICATION".

LGTM, too.

Nathan



Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 12:30 PM Tom Lane  wrote:
> If we consider that the alias renames the columns "for all purposes",
> how is it okay for f() to select the "a" column?

I'd say it isn't.

> Another way to phrase the issue is that the column names seen
> by f() are currently different from those seen by row_to_json():
>
> regression=# select row_to_json(t) from t as t(x,y);
>row_to_json
> -
>  {"x":11,"y":12}
> (1 row)
>
> and that seems hard to justify.

Yeah, I agree. The problem I have here is that, with your proposed
fix, it still won't be very consistent. True, row_to_json() and f()
will both see the unaliased column names ... but a straight select *
from t as t(x,y) will show the aliased names. That's unarguably better
than corrupting your data, but it seems "astonishing" in the POLA
sense.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MSVC SSL test failure

2021-12-07 Thread Alexander Lakhin
Hello Tom,
07.12.2021 19:25, Tom Lane wrote:
> Hmm. I wonder whether using SD_BOTH behaves any differently. 
With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on
iterations 1, 2, 3, 16 (just as without shutdown() at all).
So shutdown with the SD_SEND flag definitely behaves much better (I've
seen over 200 successful iterations).
 
Best regards,
Alexander




Re: Transparent column encryption

2021-12-07 Thread Jacob Champion
On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:
> On 06.12.21 21:44, Jacob Champion wrote:
> > I think reusing a zero IV will potentially leak more information than
> > just equality, depending on the cipher in use. You may be interested in
> > synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
> > like they would match this use case exactly. (But I'm not a
> > cryptographer.)
> 
> I'm aware of this and plan to make use of SIV.  The current 
> implementation is just an example.

Sounds good.

> > Have you given any thought to AEAD? As a client I'd like to be able to
> > tie an encrypted value to other column (or external) data. For example,
> > AEAD could be used to prevent a DBA from copying the (encrypted) value
> > of my credit card column into their account's row to use it.
> 
> I don't know how that is supposed to work.  When the value is encrypted 
> for insertion, the client may know things like table name or column 
> name, so it can tie it to those.  But it doesn't know what row it will 
> go in, so you can't prevent the value from being copied into another 
> row.  You would need some permanent logical row ID for this, I think.

Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.
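
To illustrate the mechanism, here is a sketch of my own using OpenSSL's
AES-GCM as the AEAD primitive (the patch currently uses AES-CBC, so this
is not its code path; error handling and key/nonce management omitted):

#include <string.h>
#include <openssl/evp.h>

/* The user's email address goes in as additional authenticated data
 * (AAD), so the same ciphertext re-attached to a different account
 * fails to authenticate on decryption.  key is 32 bytes, iv is 12,
 * tag is 16. */
static int
encrypt_card(const unsigned char *key, const unsigned char *iv,
             const unsigned char *card, int card_len,
             const char *email,
             unsigned char *ct, unsigned char *tag)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ct_len;

    EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
    /* feed the AAD only (output buffer is NULL) */
    EVP_EncryptUpdate(ctx, NULL, &len,
                      (const unsigned char *) email, (int) strlen(email));
    EVP_EncryptUpdate(ctx, ct, &len, card, card_len);
    ct_len = len;
    EVP_EncryptFinal_ex(ctx, ct + len, &len);
    ct_len += len;
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
    EVP_CIPHER_CTX_free(ctx);
    return ct_len;
}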

> > > This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
> > > on the direction.
> > 
> > What kinds of attacks are you hoping to prevent (and not prevent)?
> 
> The point is to prevent admins from getting at plaintext data.  The 
> scenario you show is an interesting one but I think it's not meant to be 
> addressed by this.  If admins can alter the database to their advantage, 
> they could perhaps increase their account balance, create discount 
> codes, etc. also.

Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.

--Jacob


Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 12:19 PM Tom Lane  wrote:
> > What's wrong with that?
>
> Well, we don't allow things like
>
> regression=# select '(1,2)'::point::float8;
> ERROR:  cannot cast type point to double precision
> LINE 1: select '(1,2)'::point::float8;
>  ^
>
> It's not very clear to me why "char" should get a pass on that.
> We allow such cases when the target is text/varchar/etc, but
> the assumption is that the textual representation is sufficient
> for your purposes.  It's hard to claim that just the first
> byte is a useful textual representation.

Fair enough, I guess. I am pretty skeptical of the merits of refusing
an explicit cast. If I ran the zoo, I would probably choose to allow
all such casts and make them coerce via IO when no other pathway is
available. But I get that's not our policy.

> Worse, PG is actually treating this as an assignment-level cast,
> so we accept this:
>
> regression=# create table t(f1 "char");
> CREATE TABLE
> regression=# insert into t values ('(1,2)'::point);
> INSERT 0 1
> regression=# table t;
>  f1
> 
>  (
> (1 row)
>
> I definitely don't think that should have worked without
> any complaint.

Yes, that one's a bridge too far, even for me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ExecTypeSetColNames is fundamentally broken

2021-12-07 Thread Tom Lane
Robert Haas  writes:
> On Tue, Dec 7, 2021 at 12:30 PM Tom Lane  wrote:
>> If we consider that the alias renames the columns "for all purposes",
>> how is it okay for f() to select the "a" column?

> I'd say it isn't.

In a green field I'd probably agree with you, but IMO that will
break far too much existing SQL code.

It'd cause problems for us too, not only end-users.  As an example,
ruleutils.c would have to avoid attaching new column aliases to
tables that are referenced as whole-row Vars.  I'm not very sure
that that's even possible without creating insurmountable ambiguity
issues.  There are also fun issues around what happens to a stored
query after a table column rename.  Right now the query acts as
though it uses the old name as a column alias, and that introduces
no semantic problem; but that behavior would no longer be
acceptable.

So the alternatives I see are to revert what bf7ca1587 tried to do
here, or to try to make it work that way across-the-board, which
implies (a) a very much larger amount of work, and (b) breaking
important behaviors that are decades older than that commit.
It's not even entirely clear that we could get to complete
consistency if we went down that path.

regards, tom lane




Re: MSVC SSL test failure

2021-12-07 Thread Tom Lane
Alexander Lakhin  writes:
> 07.12.2021 19:25, Tom Lane wrote:
>> Hmm. I wonder whether using SD_BOTH behaves any differently. 

> With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on
> iterations 1, 2, 3, 16 (just as without shutdown() at all).
> So shutdown with the SD_SEND flag definitely behaves much better (I've
> seen over 200 successful iterations).

Fun.  Well, I'll put in shutdown with SD_SEND for the moment,
and we'll have to see whether we can improve further than that.
It does sound like we may be running into OpenSSL bugs/oddities,
not only kernel issues, so it may be impossible to do better
on our side.

regards, tom lane




Re: pg_dump versus ancient server versions

2021-12-07 Thread Mark Dilger



> On Dec 7, 2021, at 8:33 AM, Robert Haas  wrote:
> 
> However, it wouldn't be a great idea to back-patch a
> completely arbitrary subset of our fixes into those branches, because
> then it sort of gets confusing to understand what the status of that
> branch is. I don't know that I'm terribly bothered by the idea that
> the behavior of the branch might deviate from the last official
> release, because most bug fixes are pretty minor and wouldn't really
> affect testing much, but it would be a little annoying to explain to
> users that those branches contain an arbitrary subset of newer fixes,
> and a little hard for us to understand what is and is not there.

Wouldn't you be able to see what changed by comparing the last released tag for 
version X.Y against the RELX_Y_STABLE branch?  Something like `git diff 
REL8_4_22 origin/REL8_4_STABLE > buildability.patch`?

Having such a patch should make reproducing old corruption bugs easier, as you 
could apply the buildability.patch to the last branch that contained the bug.  
If anybody did that work, would we want it committed somewhere?  
REL8_4_19_BUILDABLE or such?  For patches that apply trivially, that might not 
be worth keeping, but if the merge is difficult, maybe sharing with the 
community would make sense.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Post-CVE Wishlist

2021-12-07 Thread Jacob Champion
On Tue, 2021-11-23 at 18:27 +, Jacob Champion wrote:
> Now that the MITM CVEs are published [1], I wanted to share my wishlist
> of things that would have made those attacks difficult/impossible to
> pull off.

Now that we're post-commitfest, here's my summary of the responses so
far:

> = Client-Side Auth Selection =

There is interest in letting libpq reject certain auth methods coming
back from the server, perhaps using a simple connection option, and
there are some prior conversations on the list to look into.
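
For instance (purely illustrative -- "require_auth" is an invented
connection option name, not something libpq provides today):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* hypothetical option pinning the acceptable auth method client-side */
    PGconn *conn = PQconnectdb("host=db.example.com dbname=app user=app "
                               "sslmode=verify-full "
                               "require_auth=scram-sha-256");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    PQfinish(conn);
    return 0;
}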

> = Implicit TLS =

Reactions to implicit TLS were mixed, from "we should not do this" to
"it might be nice to have the option, from a technical standpoint".
Both a separate-port model and a shared-port model were tentatively
proposed. The general consensus seems to be that the StartTLS-style
flow is currently sufficient from a security standpoint.

I didn't see any responses that were outright in favor, so I think my
remaining question is: are there any committers who think a prototype
would be worth the time for a motivated implementer?

Thanks for the discussion!

--Jacob


Re: pg_dump versus ancient server versions

2021-12-07 Thread Tom Lane
Mark Dilger  writes:
> Wouldn't you be able to see what changed by comparing the last released tag 
> for version X.Y against the RELX_Y_STABLE branch?  Something like `git diff 
> REL8_4_22 origin/REL8_4_STABLE > buildability.patch`?

> Having such a patch should make reproducing old corruption bugs easier, as 
> you could apply the buildability.patch to the last branch that contained the 
> bug.  If anybody did that work, would we want it committed somewhere?  
> REL8_4_19_BUILDABLE or such?  For patches that apply trivially, that might 
> not be worth keeping, but if the merge is difficult, maybe sharing with the 
> community would make sense.

I'm not entirely following ... are you suggesting that each released minor
version needs to be kept buildable separately?  That seems like a huge
amount of extra committer effort with not much added value.  If someone
comes to me and wants to investigate a bug in a branch that's already
out-of-support, and they then say they're not running the last minor
release, I'm going to tell them to come back after updating.

It is (I suspect) true that diffing the last release against branch
tip would often yield a patch that could be used to make an older
minor release buildable again.  But when that patch doesn't work
trivially, I for one am not interested in making it work.  And
especially not interested in doing so "on spec", with no certainty
that anyone would ever need it.

regards, tom lane




Re: pg_dump versus ancient server versions

2021-12-07 Thread Mark Dilger



> On Dec 7, 2021, at 10:52 AM, Tom Lane  wrote:
> 
> I'm not entirely following ... are you suggesting that each released minor
> version needs to be kept buildable separately?

No.  I'm just wondering if we want to share the product of such efforts if 
anybody (me, for instance) volunteers to do it for some subset of minor 
releases.  For my heap corruption checking work, I might want to be able to 
build a small number of old minor releases that I know had corruption bugs.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Peter Geoghegan
On Tue, Dec 7, 2021 at 7:58 AM Robert Haas  wrote:
> I think that's a good observation. I think the current autovacuum
> algorithm works pretty well when things are normal. But in extreme
> scenarios it does not degrade gracefully.

+1 to all of the specific examples you go on to describe.

> > I also think that our problem is not so much that we don't have
> > accurate statistics about dead tuples (though we surely don't have
> > much accuracy). The main problem seems to be that there are various
> > specific, important ways in which the distribution of dead tuples may
> > matter (leading to various harmful biases). And so it seems reasonable
> > to fudge how we interpret dead tuples with the intention of capturing
> > some of that, as a medium term solution. Something that considers the
> > concentration of dead tuples in heap pages seems promising.
>
> I am less convinced about this part. It sort of sounds like you're
> saying - it doesn't really matter whether the numbers we gather are
> accurate, just that they produce the correct results.

That's not quite it. More like: I think that it would be reasonable to
define dead tuples abstractly, in a way that's more useful but
nevertheless cannot diverge too much from the current definition. We
should try to capture "directionality", with clear error bounds. This
won't represent the literal truth, but it will be true in spirit (or
much closer).

For example, why should we count dead heap-only tuples from earlier in
a HOT chain, even when we see no evidence that opportunistic HOT
pruning can't keep up on that page? Since we actually care about the
direction of things, not just the present state of things, we'd be
justified in completely ignoring those dead tuples. Similarly, it
might well make sense to give more weight to concentrations of LP_DEAD
items on a page -- that is a signal that things are not going well *at
the level of the page*. Not so much when you have a few LP_DEAD stubs,
but certainly when you have dozens of them on one page, or even
hundreds. And so ISTM that the conditions of the page should influence
how we interpret/count that page's dead tuples, in both directions
(interpret the page as having more dead tuples, or fewer).

We all know that there isn't a sensible answer to the question "if the
abstract cost units used in the optimizer say that one sequential page
access is 4x cheaper than one random page access, then what's the
difference between accessing 10 random pages sequentially in close
succession, versus accessing the same 10 pages randomly?". But at the
same time, we can say for sure that random is more expensive to *some*
degree, but certainly never by multiple orders of magnitude. The model
is imperfect, to be sure, but it is better than many alternative
models that are also technically possible. We need *some* cost model
for a cost-based optimizer, and it is better to be approximately
correct than exactly wrong. Again, the truly important thing is to not
be totally wrong about any one thing.

Another way of putting it is that I am willing to accept a more biased
definition of dead tuples, if that lowers the variance:

https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

We have some certainty about what is possible in a world in which we
use the visibility map directly, and so our final estimate should
never be wildly unreasonable -- the abstract idea of dead tuples can
still be anchored to the physical reality.

As with the optimizer and its cost units, we don't have that many
degrees of freedom when autovacuum.c launches a worker, and I don't
see that changing -- we must ultimately decide to launch or not launch
an autovacuum worker (for any given table) each time the autovacuum
launcher wakes up. So we're practically forced to come up with a model
that has some abstract idea of one kind of bloat/dead tuple. I would
say that we already have one, in fact -- it's just not a very good one
because it doesn't take account of obviously-relevant factors like
HOT. It could quite easily be less wrong.

> If the
> information we currently gather wouldn't produce the right results
> even if it were fully accurate, that to me suggests that we're
> gathering the wrong information, and we should gather something else.

I think that counting dead tuples is useful, but not quite sufficient
on its own. At the same time, we still need something that works like
a threshold -- because that's just how the autovacuum.c scheduling
works. It's a go/no-go decision, regardless of how the decision is
made.

> For example, we could attack the useless-vacuuming problem by having
> each vacuum figure out - and store - the oldest XID that could
> potentially be worth using as a cutoff for vacuuming that table, and
> refusing to autovacuum that table again until we'd be able to use a
> cutoff >= that value. I suppose this would be the oldest of (a) any
> XID that exists in the table on a tuple that we judged recently dead,
> (b) any XID that is c

Re: pg_dump versus ancient server versions

2021-12-07 Thread Andrew Dunstan


On 12/7/21 13:59, Mark Dilger wrote:
>> On Dec 7, 2021, at 10:52 AM, Tom Lane  wrote:
>>
>> I'm not entirely following ... are you suggesting that each released minor
>> version needs to be kept buildable separately?
> No.  I'm just wondering if we want to share the product of such efforts if 
> anybody (me, for instance) volunteers to do it for some subset of minor 
> releases.  For my heap corruption checking work, I might want to be able to 
> build a small number of old minor releases that I know had corruption bugs.
>

I doubt there's going to be a whole lot of changes. You should just be
able to cherry-pick them in most cases I suspect.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: MSVC SSL test failure

2021-12-07 Thread Andrew Dunstan


On 12/7/21 13:22, Tom Lane wrote:
> Alexander Lakhin  writes:
>> 07.12.2021 19:25, Tom Lane wrote:
>>> Hmm. I wonder whether using SD_BOTH behaves any differently. 
>> With shutdown(MyProcPort->sock, SD_BOTH) the test failed for me on
>> iterations 1, 2, 3, 16 (just as without shutdown() at all).
>> So shutdown with the SD_SEND flag definitely behaves much better (I've
>> seen over 200 successful iterations).
> Fun.  Well, I'll put in shutdown with SD_SEND for the moment,
> and we'll have to see whether we can improve further than that.
> It does sound like we may be running into OpenSSL bugs/oddities,
> not only kernel issues, so it may be impossible to do better
> on our side.


Yeah. My suspicion is that SD_BOTH is what closesocket() does if
shutdown() hasn't been previously called.
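
For reference, the ordering under discussion in sketch form (not the
actual committed change):

#include <winsock2.h>

/* Half-close the sending side first: SD_SEND sends a FIN so the client
 * sees an orderly end-of-stream, and only then tear the socket down. */
static void
close_server_socket(SOCKET sock)
{
    shutdown(sock, SD_SEND);
    closesocket(sock);
}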


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Dubious usage of TYPCATEGORY_STRING

2021-12-07 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> Could we add explicit casts (like polcmd::text) here?  Or would it break 
>> too much?

> I assumed it'd break too much to consider doing that.  But I suppose
> that since a typcategory change would be initdb-forcing anyway, maybe
> it's not out of the question.  I'll investigate and see exactly how
> many places would need an explicit cast.

Um, I definitely gave up too easily there.  The one usage in \dp
seems to be the *only* thing that breaks in describe.c, and pg_dump
doesn't need any changes so far as check-world reveals.  So let's
just move "char" to another category, as attached.

regards, tom lane

diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out
index ec99aaed5d..3bac0534fb 100644
--- a/contrib/citext/expected/citext.out
+++ b/contrib/citext/expected/citext.out
@@ -1089,7 +1089,12 @@ INSERT INTO caster (char)  VALUES ('f'::citext);
 INSERT INTO caster (citext)VALUES ('f'::char);
 INSERT INTO caster (chr)   VALUES ('f'::text);
 INSERT INTO caster (text)  VALUES ('f'::"char");
-INSERT INTO caster (chr)   VALUES ('f'::citext);
+INSERT INTO caster (chr)   VALUES ('f'::citext);  -- requires cast
+ERROR:  column "chr" is of type "char" but expression is of type citext
+LINE 1: INSERT INTO caster (chr)   VALUES ('f'::citext);
+   ^
+HINT:  You will need to rewrite or cast the expression.
+INSERT INTO caster (chr)   VALUES ('f'::citext::text);
 INSERT INTO caster (citext)VALUES ('f'::"char");
 INSERT INTO caster (name)  VALUES ('foo'::text);
 INSERT INTO caster (text)  VALUES ('foo'::name);
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 75fd08b7cc..57fc863f7a 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -1089,7 +1089,12 @@ INSERT INTO caster (char)  VALUES ('f'::citext);
 INSERT INTO caster (citext)VALUES ('f'::char);
 INSERT INTO caster (chr)   VALUES ('f'::text);
 INSERT INTO caster (text)  VALUES ('f'::"char");
-INSERT INTO caster (chr)   VALUES ('f'::citext);
+INSERT INTO caster (chr)   VALUES ('f'::citext);  -- requires cast
+ERROR:  column "chr" is of type "char" but expression is of type citext
+LINE 1: INSERT INTO caster (chr)   VALUES ('f'::citext);
+   ^
+HINT:  You will need to rewrite or cast the expression.
+INSERT INTO caster (chr)   VALUES ('f'::citext::text);
 INSERT INTO caster (citext)VALUES ('f'::"char");
 INSERT INTO caster (name)  VALUES ('foo'::text);
 INSERT INTO caster (text)  VALUES ('foo'::name);
diff --git a/contrib/citext/sql/citext.sql b/contrib/citext/sql/citext.sql
index 10232f5a9f..55fb1d11a6 100644
--- a/contrib/citext/sql/citext.sql
+++ b/contrib/citext/sql/citext.sql
@@ -361,7 +361,8 @@ INSERT INTO caster (citext)VALUES ('f'::char);
 
 INSERT INTO caster (chr)   VALUES ('f'::text);
 INSERT INTO caster (text)  VALUES ('f'::"char");
-INSERT INTO caster (chr)   VALUES ('f'::citext);
+INSERT INTO caster (chr)   VALUES ('f'::citext);  -- requires cast
+INSERT INTO caster (chr)   VALUES ('f'::citext::text);
 INSERT INTO caster (citext)VALUES ('f'::"char");
 
 INSERT INTO caster (name)  VALUES ('foo'::text);
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1d11be73f..216aa4510d 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9305,6 +9305,10 @@ SCRAM-SHA-256$:&l
   X
   unknown type
  
+ 
+  Z
+  Internal-use types
+ 
 

   
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ea721d963a..72d8547628 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1142,7 +1142,7 @@ permissionsList(const char *pattern)
 		  ",\n  pg_catalog.array_to_string(ARRAY(\n"
 		  "SELECT polname\n"
 		  "|| CASE WHEN polcmd != '*' THEN\n"
-		  "   E' (' || polcmd || E'):'\n"
+		  "   E' (' || polcmd::pg_catalog.text || E'):'\n"
 		  "   ELSE E':'\n"
 		  "   END\n"
 		  "|| CASE WHEN polqual IS NOT NULL THEN\n"
@@ -1176,7 +1176,7 @@ permissionsList(const char *pattern)
 		  "   E' (RESTRICTIVE)'\n"
 		  "   ELSE '' END\n"
 		  "|| CASE WHEN polcmd != '*' THEN\n"
-		  "   E' (' || polcmd || E'):'\n"
+		  "   E' (' || polcmd::pg_catalog.text || E'):'\n"
 		  "   ELSE E':'\n"
 		  "   END\n"
 		  "|| CASE WHEN polqual IS NOT NULL THEN\n"
diff --git a/src/include/catalog/pg_type.dat b/src/include/catalog/pg_type.dat
index 41074c994b..f3d94f3cf5 100644
--- a/src/include/catalog/pg_type.dat
+++ b/src/inc

Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 2:13 PM Peter Geoghegan  wrote:
> For example, why should we count dead heap-only tuples from earlier in
> a HOT chain, even when we see no evidence that opportunistic HOT
> pruning can't keep up on that page? Since we actually care about the
> direction of things, not just the present state of things, we'd be
> justified in completely ignoring those dead tuples. Similarly, it
> might well make sense to give more weight to concentrations of LP_DEAD
> items on a page -- that is a signal that things are not going well *at
> the level of the page*. Not so much when you have a few LP_DEAD stubs,
> but certainly when you have dozens of them on one page, or even
> hundreds. And so ISTM that the conditions of the page should influence
> how we interpret/count that page's dead tuples, in both directions
> (interpret the page as having more dead tuples, or fewer).

Well... I mean, I think we're almost saying the same thing, then, but
I think you're saying it more confusingly. I have no objection to
counting the number of dead HOT chains rather than the number of dead
tules, because that's what affects the index contents, but there's no
need to characterize that as "not the literal truth." There's nothing
fuzzy or untrue about it if we simply say that's what we're doing.

> Right. And as I keep saying, the truly important thing is to not
> *completely* ignore any relevant dimension of cost. I just don't want
> to ever be wildly wrong -- not even once. We can tolerate being
> somewhat less accurate all the time (not that we necessarily have to
> make a trade-off), but we cannot tolerate pathological behavior. Of
> course I include new/theoretical pathological behaviors here (not just
> the ones we know about today).

Sure, but we don't *need* to be less accurate, and I don't think we
even *benefit* from being less accurate. If we do something like count
dead HOT chains instead of dead tuples, let's not call that a
less-accurate count of dead tuples. Let's call it an accurate count of
dead HOT chains.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: enable certain TAP tests for MSVC builds

2021-12-07 Thread Andrew Dunstan


On 12/5/21 18:32, Noah Misch wrote:
> On Sun, Dec 05, 2021 at 06:00:08PM -0500, Andrew Dunstan wrote:
>> On 12/5/21 14:47, Noah Misch wrote:
>>> On Sun, Dec 05, 2021 at 11:57:31AM -0500, Andrew Dunstan wrote:
 Certain TAP tests rely on settings that the Make files provide for them.
 However vcregress.pl doesn't provide those settings. This patch remedies
 that, and I propose to apply it shortly (when we have a fix for the SSL
 tests that I will write about separately) and backpatch it appropriately.
 --- a/src/tools/msvc/vcregress.pl
 +++ b/src/tools/msvc/vcregress.pl
 copy("$Config/autoinc/autoinc.dll",   "src/test/regress");
  copy("$Config/regress/regress.dll",   "src/test/regress");
  copy("$Config/dummy_seclabel/dummy_seclabel.dll", "src/test/regress");
  
 +# Settings used by TAP tests
 +$ENV{with_ssl} = $config->{openssl} ? 'openssl' : 'no';
 +$ENV{with_ldap} = $config->{ldap} ? 'yes' : 'no';
 +$ENV{with_icu} = $config->{icu} ? 'yes' : 'no';
 +$ENV{with_gssapi} = $config->{gss} ? 'yes' : 'no';
>>> That's appropriate.  There are more variables to cover:
>>>
>>> $ git grep -n ^export ':(glob)**/Makefile'
>>> src/bin/pg_basebackup/Makefile:22:export LZ4
>>> src/bin/pg_basebackup/Makefile:23:export TAR
>>> src/bin/pg_basebackup/Makefile:27:export GZIP_PROGRAM=$(GZIP)
>>> src/bin/psql/Makefile:20:export with_readline
>>> src/test/kerberos/Makefile:16:export with_gssapi with_krb_srvnam
>>> src/test/ldap/Makefile:16:export with_ldap
>>> src/test/modules/ssl_passphrase_callback/Makefile:3:export with_ssl
>>> src/test/recovery/Makefile:20:export REGRESS_SHLIB
>>> src/test/ssl/Makefile:18:export with_ssl
>>> src/test/subscription/Makefile:18:export with_icu
>> LZ4/TAR/GZIP_PROGRAM: It's not clear what these should be set to. The TAP
>> tests skip tests that use them if they are not set.
> Could add config.pl entries for those.  Preventing those skips on Windows may
> or may not be worth making config.pl readers think about them.
>
>> with_readline: we don't build with readline on Windows, period. I guess
>> we could just set it to "no".
>> with_krb_srvnam: the default is "postgres", we could just set it to that
>> I guess.
> Works for me.



All done.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Peter Geoghegan
On Tue, Dec 7, 2021 at 12:27 PM Robert Haas  wrote:
> Well... I mean, I think we're almost saying the same thing, then, but
> I think you're saying it more confusingly. I have no objection to
> counting the number of dead HOT chains rather than the number of dead
> tuples, because that's what affects the index contents, but there's no
> need to characterize that as "not the literal truth."

Works for me!

> Sure, but we don't *need* to be less accurate, and I don't think we
> even *benefit* from being less accurate. If we do something like count
> dead HOT chains instead of dead tuples, let's not call that a
> less-accurate count of dead tuples. Let's call it an accurate count of
> dead HOT chains.

Fair enough, but even then we still ultimately have to generate a
final number that represents how close we are to a configurable "do an
autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based
threshold) -- the autovacuum.c side of this (the consumer side)
fundamentally needs the model to reduce everything to a one
dimensional number (even though the reality is that there isn't just
one dimension). This single number (abstract bloat units, abstract
dead tuples, whatever) is a function of things like the count of dead
HOT chains, perhaps the concentration of dead tuples on heap pages,
whatever -- but it's not the same thing as any one of those things we
count.
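
(For reference, a simplified sketch of the consumer side this has to feed
into -- whatever we produce ultimately stands in for n_dead_tuples here;
this uses the GUC names rather than the internal variable names:)

    /* roughly the current trigger logic in autovacuum.c, simplified */
    vacthresh = autovacuum_vacuum_threshold +
                autovacuum_vacuum_scale_factor * reltuples;
    dovacuum = (n_dead_tuples > vacthresh);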

I think that this final number needs to be denominated in abstract
units -- we need to call those abstract units *something*. I don't
care what that name ends up being, as long as it reflects reality.

-- 
Peter Geoghegan




Re: Pre-allocating WAL files

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 9:35 AM, "Bossart, Nathan"  wrote:
> On 12/7/21, 12:29 AM, "Bharath Rupireddy" 
>  wrote:
>> Why can't the walwriter pre-allocate some of the WAL segments instead
>> of a new background process? Of course, it might delay the main
>> functionality of the walwriter i.e. flush and sync the WAL files, but
>> having checkpointer do the pre-allocation makes it do another extra
>> task. Here the amount of walwriter work vs checkpointer work, I'm not
>> sure which one does more work compared to the other.
>
> The argument against adding it to the WAL writer is that we want it to
> run with low latency to flush asynchronous commits.  If we added WAL
> pre-allocation to the WAL writer, there could periodically be large
> delays.

To your point on trying to avoid giving the checkpointer extra tasks
(basically what we are talking about on the other thread [0]), WAL
pre-allocation might not be of much concern because it will generally
be a small, fixed (and configurable) amount of work, and it can be
performed concurrently with the checkpoint.  Plus, WAL pre-allocation
should ordinarily be phased out as WAL segments become eligible for
recycling.  IMO it's not comparable to tasks like
CheckPointSnapBuild() that can delay checkpointing for a long time.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com



Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 12:10 AM, "Bharath Rupireddy" 
 wrote:
> Here's a patch that I've come up with. Please see if this looks okay
> and let me know if we want to take it forward so that I can add a CF
> entry.

Overall, the patch seems reasonable to me.

+   case DB_IN_END_OF_RECOVERY_CHECKPOINT:
+       ereport(LOG,
+               (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s",
+                       str_time(ControlFile->time))));
+       break;

I noticed that some (but not all) of the surrounding messages say
"last known up at" the control file time.  I'm curious why you chose
not to use that phrasing in this case.

Nathan



Re: Transparent column encryption

2021-12-07 Thread Tomas Vondra




On 12/7/21 19:02, Jacob Champion wrote:

On Tue, 2021-12-07 at 16:39 +0100, Peter Eisentraut wrote:

On 06.12.21 21:44, Jacob Champion wrote:

I think reusing a zero IV will potentially leak more information than
just equality, depending on the cipher in use. You may be interested in
synthetic IVs and nonce-misuse resistance (e.g. [1]), since they seem
like they would match this use case exactly. (But I'm not a
cryptographer.)


I'm aware of this and plan to make use of SIV.  The current
implementation is just an example.


Sounds good.


Have you given any thought to AEAD? As a client I'd like to be able to
tie an encrypted value to other column (or external) data. For example,
AEAD could be used to prevent a DBA from copying the (encrypted) value
of my credit card column into their account's row to use it.


I don't know how that is supposed to work.  When the value is encrypted
for insertion, the client may know things like table name or column
name, so it can tie it to those.  But it doesn't know what row it will
go in, so you can't prevent the value from being copied into another
row.  You would need some permanent logical row ID for this, I think.


Sorry, my description was confusing. There's nothing preventing the DBA
from copying the value inside the database, but AEAD can make it so
that the copied value isn't useful to the DBA.

Sample case. Say I have a webapp backed by Postgres, which stores
encrypted credit card numbers. Users authenticate to the webapp which
then uses the client (which has the keys) to talk to the database.
Additionally, I assume that:

- the DBA can't access the client directly (because if they can, then
they can unencrypt the victim's info using the client's keys), and

- the DBA can't authenticate as the user/victim (because if they can,
they can just log in themselves and have the data). The webapp might
for example use federated authn with a separate provider, using an
email address as an identifier.

Now, if the client encrypts a user's credit card number using their
email address as associated data, then it doesn't matter if the DBA
copies that user's encrypted card over to their own account. The DBA
can't log in as the victim, so the client will fail to authenticate the
value because its associated data won't match.


This is not targeting PostgreSQL 15.  But I'd appreciate some feedback
on the direction.


What kinds of attacks are you hoping to prevent (and not prevent)?


The point is to prevent admins from getting at plaintext data.  The
scenario you show is an interesting one but I think it's not meant to be
addressed by this.  If admins can alter the database to their advantage,
they could perhaps increase their account balance, create discount
codes, etc. also.


Sure, but increasing account balances and discount codes don't lead to
getting at plaintext data, right? Whereas stealing someone else's
encrypted value seems like it would be covered under your threat model,
since it lets you trick a real-world client into decrypting it for you.

Other avenues of attack might depend on how you choose to add HMAC to
the current choice of AES-CBC. My understanding of AE ciphers (with or
without associated data) is that you don't have to design that
yourself, which is nice.



IMO it's impossible to solve this attack within TCE, because it requires 
ensuring consistency at the row level, but TCE obviously works at column 
level only.


I believe TCE can do AEAD at the column level, which protects against 
attacks like flipping bits, and similar attacks. It's just a matter of 
how the client encrypts the data.


Extending it to protect the whole row seems tricky, because the client 
may not even know the other columns, and it's not clear to me how it'd 
deal with things like updates of the other columns, hint bits, dropped 
columns, etc.


It's probably possible to get something like this (row-level AEAD) by 
encrypting enriched data, i.e. not just the card number, but {user ID, 
card number} or something like that, and verify that in the webapp. The 
problem of course is that the "user ID" is just another column in the 
table, and there's nothing preventing the DBA from modifying that too.


So I think it's pointless to try extending this to row-level AEAD.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Robert Haas
On Tue, Dec 7, 2021 at 3:44 PM Peter Geoghegan  wrote:
> Fair enough, but even then we still ultimately have to generate a
> final number that represents how close we are to a configurable "do an
> autovacuum" threshold (such as an autovacuum_vacuum_scale_factor-based
> threshold) -- the autovacuum.c side of this (the consumer side)
> fundamentally needs the model to reduce everything to a one
> dimensional number (even though the reality is that there isn't just
> one dimension). This single number (abstract bloat units, abstract
> dead tuples, whatever) is a function of things like the count of dead
> HOT chains, perhaps the concentration of dead tuples on heap pages,
> whatever -- but it's not the same thing as any one of those things we
> count.
>
> I think that this final number needs to be denominated in abstract
> units -- we need to call those abstract units *something*. I don't
> care what that name ends up being, as long as it reflects reality.

If we're only trying to decide whether or not to vacuum a table, we
don't need units: the output is a Boolean. If we're trying to decide
on an order in which to vacuum tables, then we need units. But such
units can't be anything related to dead tuples, because vacuum can be
needed based on XID age, or MXID age, or dead tuples. The units would
have to be something like abstract vacuum-urgency units (if higher is
more urgent) or abstract remaining-headroom-before-catastrophe units
(if lower is more urgent).

Ignoring wraparound considerations for a moment, I think that we might
want to consider having multiple thresholds and vacuuming the table if
any one of them are met. For example, suppose a table qualifies for
vacuum when %-of-not-all-visible-pages > some-threshold, or
alternatively when %-of-index-tuples-thought-to-be-dead >
some-other-threshold.
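
(A hypothetical sketch of that "any one of them" rule -- the fraction
inputs and the two threshold variables are invented names, not existing
GUCs:)

    /* vacuum if any per-metric threshold is exceeded */
    static bool
    relation_needs_vacuum(double notallvisible_frac, double dead_index_frac)
    {
        if (notallvisible_frac > notallvisible_threshold)
            return true;
        if (dead_index_frac > dead_index_tuple_threshold)
            return true;
        return false;
    }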

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Tomas Vondra
On 12/7/21 15:36, Bharath Rupireddy wrote:
> Hi,
> 
> Currently one can know the kind of on-going/last checkpoint (shutdown,
> end-of-recovery, immediate, force etc.) only via server logs that to
> when log_checkpoints GUC is on. At times, the users/service layer
> components would want to know the kind of checkpoint (along with other
> checkpoint related info) to take some actions and it will be a bit
> difficult to search through the server logs. The checkpoint info can
> be obtained from the control file (either by pg_control_checkpoint()
> or by pg_controldata tool) whereas checkpoint kind isn't available
> there.
> 
> How about we add an extra string field to the control file alongside
> the other checkpoint info it already has? This way, the existing
> pg_control_checkpoint() or pg_controldata tool can easily be enhanced
> to show the checkpoint kind as well. One concern is that we don't want
> to increase the size of pg_controldata by more than the typical block
> size (of 8K) to avoid any torn-writes. With this change, we might add
> at max the strings specified at [1]. Adding it to the control file has
> an advantage of preserving the last checkpoint kind which might be
> useful.
> 
> Thoughts?
> 

I agree it might be useful to provide information about the nature of
the checkpoint, and perhaps even PID of the backend that triggered it
(although that may be tricky, if the backend terminates).

I'm not sure about adding it to control data, though. That doesn't seem
like a very good match for something that's mostly for monitoring.

We already have some checkpoint info in pg_stat_bgwriter, but that's
just aggregated data, not very convenient for info about the current
checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
various info about the current checkpoint?

> [1] for checkpoint: "checkpoint shutdown end-of-recovery immediate
> force wait wal time flush-all"
> for restartpoint: "restartpoint shutdown end-of-recovery immediate
> force wait wal time flush-all"
> 

I'd bet squashing all of this into a single string (not really a flag)
will just mean people will have to parse it, etc. Keeping individual
boolean flags (or even separate string fields) would be better, I guess.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Justin Pryzby
On Tue, Dec 07, 2021 at 11:18:37PM +0100, Tomas Vondra wrote:
> On 12/7/21 15:36, Bharath Rupireddy wrote:
> > Currently one can know the kind of on-going/last checkpoint (shutdown,
> > end-of-recovery, immediate, force etc.) only via server logs, that too
> > when log_checkpoints GUC is on.

> > The checkpoint info can be obtained from the control file (either by
> > pg_control_checkpoint() or by pg_controldata tool) whereas checkpoint kind
> > isn't available there.

> > How about we add an extra string field to the control file alongside
> > the other checkpoint info it already has? This way, the existing
> > pg_control_checkpoint() or pg_controldata tool can easily be enhanced
> > to show the checkpoint kind as well. One concern is that we don't want
> > to increase the size of pg_controldata by more than the typical block
> > size (of 8K) to avoid any torn-writes. With this change, we might add
> > at max the strings specified at [1]. Adding it to the control file has
> > an advantage of preserving the last checkpoint kind which might be
> > useful.

> > [1] for checkpoint: "checkpoint shutdown end-of-recovery immediate
> > force wait wal time flush-all"
> > for restartpoint: "restartpoint shutdown end-of-recovery immediate
> > force wait wal time flush-all"
> 
> I'd bet squashing all of this into a single string (not really a flag)
> will just mean people will have to parse it, etc. Keeping individual
> boolean flags (or even separate string fields) would be better, I guess.

Since the size of controldata is a concern, I suggest to add an int16 to
populate with flags, which pg_controldata can parse for display.
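
(A minimal sketch of what that could look like -- the new field name and
the output format are made up, while the CHECKPOINT_* bits are the
existing request flags:)

    /* xlog.c, when the checkpoint starts */
    ControlFile->lastCheckPointKind = (uint16) flags;   /* CHECKPOINT_* bits */
    UpdateControlFile();

    /* pg_controldata, decoding the bits for display */
    printf("Latest checkpoint kind:               %s%s%s%s%s%s%s%s\n",
           (kind & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
           (kind & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
           (kind & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
           (kind & CHECKPOINT_FORCE) ? " force" : "",
           (kind & CHECKPOINT_WAIT) ? " wait" : "",
           (kind & CHECKPOINT_CAUSE_XLOG) ? " wal" : "",
           (kind & CHECKPOINT_CAUSE_TIME) ? " time" : "",
           (kind & CHECKPOINT_FLUSH_ALL) ? " flush-all" : "");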

Note that this other patch intends to add the timestamp and LSN of most recent
recovery to controldata.
https://commitfest.postgresql.org/36/3183/

> We already have some checkpoint info in pg_stat_bgwriter, but that's
> just aggregated data, not very convenient for info about the current
> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> various info about the current checkpoint?

It could go into the pg_stat_checkpointer view, which would be the culmination
of another patch (cc Melanie).
https://commitfest.postgresql.org/36/3272/

-- 
Justin




Re: RecoveryInProgress() has critical side effects

2021-12-07 Thread Robert Haas
On Sat, Dec 4, 2021 at 7:44 PM Michael Paquier  wrote:
> My main worry here is that this changes slightly the definition of
> doPageWrites across stable branches at the end of recovery as there
> could be records generated there.  Note that GetFullPageWriteInfo()
> gets called in XLogInsert(), while Insert->fullPageWrites gets updated
> before CleanupAfterArchiveRecovery().  And it may influence
> the value of doPageWrites in the startup process.

But ... so what? All the code that uses it retries if the value that
was tentatively used turns out to be wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Why doesn't pgstat_report_analyze() focus on not-all-visible-page dead tuple counts, specifically?

2021-12-07 Thread Peter Geoghegan
On Tue, Dec 7, 2021 at 1:59 PM Robert Haas  wrote:
> If we're only trying to decide whether or not to vacuum a table, we
> don't need units: the output is a Boolean.

I was imagining a world in which we preserve the
autovacuum_vacuum_scale_factor design, but interpret it creatively
(but never too creatively) -- an incremental approach seems best to
me. We can even sanity check our abstract bloat unit calculation, in
case the page-level sampling aggregates into a totally wild number of
dead tuples (based in part on the current number of not-all-visible
heap pages) -- so the abstract units are always anchored to the old
idea of dead tuples. Maybe this isn't the best approach, but at least
it addresses compatibility.

*Any* approach based on sampling relatively few random blocks (to look
for signs of bloat) is inherently prone to hugely underestimating the
extent of bloat (which is what we see in TPC-C). I am primarily
concerned about compensating for the inherent limitations that go with
that. To me it seems inappropriate to make statistical inferences
about dead tuples based on a random snapshot of random blocks (usually
only a tiny minority). It is not only possible for the picture to
change utterly -- it is routine, expected, and the whole entire point.

The entire intellectual justification for statistical sampling (that
mostly works for optimizer stats) just doesn't carry over to
autovacuum stats, for many reasons. At the same time, I don't have any
fundamentally better starting point. That's how I arrived at the idea
of probabilistic modeling based on several recent snapshots from
ANALYZE. The statistics are often rubbish, whether or not we like it,
and regardless of how we decide to count things on each page. And so
it's entirely reasonable to not limit the algorithm to concerns about
the state of things -- the actual exposure of the system to harm (from
overlooking harmful bloat) is also relevant.

> If we're trying to decide
> on an order in which to vacuum tables, then we need units. But such
> units can't be anything related to dead tuples, because vacuum can be
> needed based on XID age, or MXID age, or dead tuples. The units would
> have to be something like abstract vacuum-urgency units (if higher is
> more urgent) or abstract remaining-headroom-before-catastrophe units
> (if lower is more urgent).

I like that idea. But I wonder if they should be totally unrelated. If
we're close to the "emergency" XID threshold, and also close to the
"bloat units" threshold, then it seems reasonable to put our finger on
the scales, and do an autovacuum before either threshold is crossed.
I'm not sure how that should work, but I find the idea of interpreting
the "bloat units" creatively/probabilistically appealing.

We're not actually making things up by erring in the direction of
launching an autovacuum worker, because we don't actually know the
number of dead tuples (or whatever) anyway -- we're just recognizing
the very real role of chance and noise. That is, if the "bloat units"
threshold might well not have been crossed due to random chance
(noise, the phase of the moon), why should we defer to random chance?
If we have better information to go on, like the thing with the XID
threshold, why not prefer that? Similarly, if we see that the system
as a whole is not very busy right now, why not consider that, just a
little, if the only downside is that we'll ignore a demonstrably
noise-level signal (from the stats)?

That's the high level intuition behind making "bloat units" a
probability density function, and not just a simple expected value.
Teaching the autovacuum.c scheduler to distinguish signal from noise
could be very valuable, if it enables opportunistic batching of work,
or off-hours work. We don't have to respect noise. The devil is in the
details, of course.

-- 
Peter Geoghegan




Re: Transparent column encryption

2021-12-07 Thread Jacob Champion
On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
> IMO it's impossible to solve this attack within TCE, because it requires 
> ensuring consistency at the row level, but TCE obviously works at column 
> level only.

I was under the impression that clients already had to be modified to
figure out how to encrypt the data? If part of that process ends up
including enforcement of encryption for a specific column set, then the
addition of AEAD data could hypothetically be part of that hand-
waviness.

Unless "transparent" means that the client completely defers to the
server on whether to encrypt or not, and silently goes along with it if
the server tells it not to encrypt? That would only protect against a
_completely_ passive DBA, like someone reading unencrypted backups,
etc. And that still has a lot of value, certainly. But it seems like
this prototype is very close to a system where the client can reliably
secure data even if the server isn't trustworthy, if that's a use case
you're interested in.

> I believe TCE can do AEAD at the column level, which protects against 
> attacks like flipping bits, and similar attacks. It's just a matter of 
> how the client encrypts the data.

Right, I think authenticated encryption ciphers (without AD) will be
important to support in practice. I think users are going to want
*some* protection against active attacks.

> Extending it to protect the whole row seems tricky, because the client 
> may not even know the other columns, and it's not clear to me how it'd 
> deal with things like updates of the other columns, hint bits, dropped 
> columns, etc.

Covering the entire row automatically probably isn't super helpful in
practice. As you mention later:

> It's probably possible to get something like this (row-level AEAD) by 
> encrypting enriched data, i.e. not just the card number, but {user ID, 
> card number} or something like that, and verify that in the webapp. The 
> problem of course is that the "user ID" is just another column in the 
> table, and there's nothing preventing the DBA from modifying that too.

Right. That's why the client has to be able to choose AD according to
the application. In my previous example, the victim's email address can
be copied by the DBA, but they wouldn't be able to authenticate as that
user and couldn't convince the client to use the plaintext on their
behalf.
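
(To make the mechanics concrete, a minimal sketch of AEAD with associated
data using OpenSSL's EVP AES-256-GCM API -- nothing here is from the
patch; key/IV management and error handling are omitted, and the IV is
assumed to be the 12-byte GCM default:)

    #include <openssl/evp.h>

    /* Encrypt "plaintext" and bind it to "ad" (e.g. the user's email).
     * Returns the ciphertext length and writes a 16-byte tag into "tag". */
    static int
    encrypt_with_ad(const unsigned char *key, const unsigned char *iv,
                    const unsigned char *plaintext, int plaintext_len,
                    const unsigned char *ad, int ad_len,
                    unsigned char *ciphertext, unsigned char *tag)
    {
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int         len;
        int         ciphertext_len;

        EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);

        /* associated data: authenticated, but neither encrypted nor stored */
        EVP_EncryptUpdate(ctx, NULL, &len, ad, ad_len);

        EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, plaintext_len);
        ciphertext_len = len;
        EVP_EncryptFinal_ex(ctx, ciphertext + len, &len);
        ciphertext_len += len;

        /* the tag covers both the ciphertext and the associated data */
        EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
        EVP_CIPHER_CTX_free(ctx);
        return ciphertext_len;
    }

On decryption the client supplies the same associated data; if the DBA
copied the ciphertext under a different user's email, the tag check fails
and the client gets an error instead of plaintext.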

--Jacob


Re: Add sub-transaction overflow status in pg_stat_activity

2021-12-07 Thread Imseih (AWS), Sami
I also want to +1 this effort. Exposing subtransaction usage is very 
useful.

It would also be extremely beneficial to add both subtransaction usage and 
overflow counters to pg_stat_database. 

Monitoring tools that capture deltas on pg_stat_database will be able to 
generate historical analysis and usage trends of subtransactions.

On 12/7/21, 5:34 PM, "Bossart, Nathan"  wrote:

On 12/6/21, 8:19 PM, "Dilip Kumar"  wrote:
> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity.  I have implemented this because of the
> recent complaint about the same [1]

I'd like to give a general +1 to this effort.  Thanks for doing this!
I've actually created a function to provide this information in the
past, so I will help review.

Nathan




Re: logical decoding and replication of sequences

2021-12-07 Thread Tomas Vondra
On 12/7/21 15:38, Peter Eisentraut wrote:
> I have checked the 0001 and 0003 patches.  (I think we have dismissed
> the approach in 0002 for now.  And let's talk about 0004 later.)
> 

Right, I think that's correct.

> I have attached a few fixup patches, described further below.
> 
> # 0001
> 
> The argument "create" for fill_seq_with_data() is always true (and
> patch 0002 removes it).  So I'm not sure if it's needed.  If it is, it
> should be documented somewhere.
> 

Good point. I think it could be removed, but IIRC it'll be needed when
adding sequence decoding to the built-in replication, and that patch is
meant to be just an implementation of the API, without changing WAL.

OTOH I don't see it in the last version of that patch (in ResetSequence2
call) so maybe it's not really needed. I've kept it for now, but I'll
investigate.

> About the comment you added in nextval_internal(): It's a good
> explanation, so I would leave it in.  I also agree with the
> conclusion of the comment that the current solution is reasonable.  We
> probably don't need the same comment again in fill_seq_with_data() and
> again in do_setval().  Note that both of the latter functions already
> point to nextval_internal() for other comments, so the same can be
> relied upon here as well.
> 

True. I moved it a bit in nextval_internal() and removed the other
copies. The existing references should be enough.

> The order of the new fields sequence_cb and stream_sequence_cb is a
> bit inconsistent compared to truncate_cb and stream_truncate_cb.
> Maybe take another look to make the order more uniform.
> 

You mean in OutputPluginCallbacks? I'd actually argue it's the truncate
callbacks that are inconsistent - in the regular section truncate_cb is
right before commit_cb, while in the streaming section it's at the end.

Or what order do you think would be better? I'm fine with changing it.

> Some documentation needs to be added to the Logical Decoding chapter.
> I have attached a patch that I think catches all the places that need
> to be updated, with some details left for you to fill in. ;-) The
> ordering of the some of the documentation chunks reflects the order in
> which the callbacks appear in the header files, which might not be
> optimal; see above.
> 

Thanks. I added a bit about the callbacks being optional and what the
parameters mean (only for sequence_cb, as the stream_ callbacks
generally don't copy that bit).

> In reorderbuffer.c, you left a comment about how to access a sequence
> tuple.  There is an easier way, using Form_pg_sequence_data, which is
> how sequence.c also does it.  See attached patch.
> 

Yeah, that looks much nicer.

> # 0003
> 
> The tests added in 0003 don't pass for me.  It appears that you might
> have forgotten to update the expected files after you added some tests
> or something.  The cfbot shows the same.  See attached patch for a
> correction, but do check what your intent was.
> 

Yeah. I suspect I removed the expected results due to the experimental
rework. I haven't noticed that because some of the tests for the
built-in replication are expected to fail.


Attached is an updated version of the first two parts (infrastructure
and test_decoding), with all your fixes merged.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 999514b9f802db9b888543532355e9c36646da30 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 24 Sep 2021 00:41:33 +0200
Subject: [PATCH 1/2] Logical decoding of sequences

This adds the infrastructure for logical decoding of sequences. Support
for built-in logical replication and test_decoding is added separately.

When decoding sequences, we differentiate between sequences created in
a (running) transaction, and sequences created in other (already
committed) transactions. Changes for sequences created in the same
top-level transaction are treated as "transactional" i.e. just like any
other change from that transaction (and discarded in case of a
rollback). Changes for sequences created earlier are treated as not
transactional - they are processed immediately, as if performed outside any
transaction (and thus not rolled back).

This mixed behavior is necessary - sequences are non-transactional (e.g.
ROLLBACK does not undo the sequence increments). But for new sequences,
we need to handle them in a transactional way, because if we ever get
some DDL support, the sequence won't exist until the transaction gets
applied. So we need to ensure the increments don't happen until the
sequence gets created.

To differentiate which sequences are "old" and which were created in a
still-running transaction, we track sequences created in running
transactions in a hash table. Each sequence is identified by its
relfilenode, and at transaction commit we discard all sequences created
in that particular transaction. For each sequence we track the XID of
the (sub)transaction that created it, and we cleanup the sequences for
each 

Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
>
> On 12/7/21, 12:10 AM, "Bharath Rupireddy" 
>  wrote:
> > Here's a patch that I've come up with. Please see if this looks okay
> > and let me know if we want to take it forward so that I can add a CF
> > entry.
>
> Overall, the patch seems reasonable to me.

Thanks for reviewing this.

> +   case DB_IN_END_OF_RECOVERY_CHECKPOINT:
> +       ereport(LOG,
> +               (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s",
> +                       str_time(ControlFile->time))));
> +       break;
>
> I noticed that some (but not all) of the surrounding messages say
> "last known up at" the control file time.  I'm curious why you chose
> not to use that phrasing in this case.

If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was
interrupted while in end-of-recovery checkpoint, so I used the
phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY
cases. I would like to keep it as-is (in the v1 patch) unless anyone
has other thoughts here?
(errmsg("database system was interrupted while in recovery at %s",
(errmsg("database system was interrupted while in recovery at log time %s",

I added a CF entry here - https://commitfest.postgresql.org/36/3442/

Regards,
Bharath Rupireddy.




Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
>
> On 11/18/21, 8:27 PM, "Bharath Rupireddy" 
>  wrote:
> > Here's the v4 patch with the above changes, the output looks like [1].
> > Please review it further.
>
> I agree with Tom.  I would just s/server/backend/ (as per the
> attached) and call it a day.

Thanks. v5 patch looks good to me.

Regards,
Bharath Rupireddy.




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 3:48 AM Tomas Vondra
 wrote:
>
> On 12/7/21 15:36, Bharath Rupireddy wrote:
> > Hi,
> >
> > Currently one can know the kind of on-going/last checkpoint (shutdown,
> > end-of-recovery, immediate, force etc.) only via server logs, that too
> > when log_checkpoints GUC is on. At times, the users/service layer
> > components would want to know the kind of checkpoint (along with other
> > checkpoint related info) to take some actions and it will be a bit
> > difficult to search through the server logs. The checkpoint info can
> > be obtained from the control file (either by pg_control_checkpoint()
> > or by pg_controldata tool) whereas checkpoint kind isn't available
> > there.
> >
> > How about we add an extra string field to the control file alongside
> > the other checkpoint info it already has? This way, the existing
> > pg_control_checkpoint() or pg_controldata tool can easily be enhanced
> > to show the checkpoint kind as well. One concern is that we don't want
> > to increase the size of pg_controldata by more than the typical block
> > size (of 8K) to avoid any torn-writes. With this change, we might add
> > at max the strings specified at [1]. Adding it to the control file has
> > an advantage of preserving the last checkpoint kind which might be
> > useful.
> >
> > Thoughts?
> >
>
> I agree it might be useful to provide information about the nature of
> the checkpoint, and perhaps even PID of the backend that triggered it
> (although that may be tricky, if the backend terminates).

Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the
triggered backend can possibly go there (we can mention in the
documentation that the backend that triggered the checkpoint can get
terminated).

> I'm not sure about adding it to control data, though. That doesn't seem
> like a very good match for something that's mostly for monitoring.

Having it in the control data file (along with the existing checkpoint
information) will be helpful to know what was the last checkpoint
information and we can use the existing pg_control_checkpoint function
or the tool to emit that info. I plan to add an int16 flag as
suggested by Justin in this thread and come up with a patch.

> We already have some checkpoint info in pg_stat_bgwriter, but that's
> just aggregated data, not very convenient for info about the current
> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> various info about the current checkpoint?

+1 to have pg_stat_progress_checkpoint view to know what's going on
with the current checkpoint.

Regards,
Bharath Rupireddy.




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 4:07 AM Justin Pryzby  wrote:
> > I'd bet squashing all of this into a single string (not really a flag)
> > will just mean people will have to parse it, etc. Keeping individual
> > boolean flags (or even separate string fields) would be better, I guess.
>
> Since the size of controldata is a concern, I suggest to add an int16 to
> populate with flags, which pg_controldata can parse for display.

+1. I will come up with a patch soon.

> Note that this other patch intends to add the timestamp and LSN of most recent
> recovery to controldata.
> https://commitfest.postgresql.org/36/3183/

Thanks. I will go through it separately.

> > We already have some checkpoint info in pg_stat_bgwriter, but that's
> > just aggregated data, not very convenient for info about the current
> > checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> > various info about the current checkpoint?
>
> It could go into the pg_stat_checkpointer view, which would be the culmination
> of another patch (cc Melanie).
> https://commitfest.postgresql.org/36/3272/

+1 to have pg_stat_progress_checkpoint view. We have
CheckpointStatsData already. What we need is to make this structure
shared and add a little more info to represent the progress, so that
the other backends can access it. I think we can discuss this in a
separate thread to give it a fresh try rather than proposing this as a
part of another thread. I will spend some time on
pg_stat_progress_checkpoint proposal and try to come up with a
separate thread to discuss.
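
(Purely to sketch the shape this could take -- the struct and field names
below are hypothetical:)

    /* a shared-memory snapshot of the running checkpoint, readable by
     * other backends for a pg_stat_progress_checkpoint view */
    typedef struct CheckpointProgressShared
    {
        int         ckpt_flags;         /* CHECKPOINT_* kind bits */
        pg_time_t   ckpt_start_time;
        int         ckpt_bufs_total;
        int         ckpt_bufs_written;
    } CheckpointProgressShared;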

Regards,
Bharath Rupireddy.




Re: Transparent column encryption

2021-12-07 Thread Tomas Vondra



On 12/8/21 00:26, Jacob Champion wrote:
> On Tue, 2021-12-07 at 22:21 +0100, Tomas Vondra wrote:
>> IMO it's impossible to solve this attack within TCE, because it requires 
>> ensuring consistency at the row level, but TCE obviously works at column 
>> level only.
> 
> I was under the impression that clients already had to be modified to
> figure out how to encrypt the data? If part of that process ends up
> including enforcement of encryption for a specific column set, then the
> addition of AEAD data could hypothetically be part of that hand-
> waviness.
> 

I think "transparency" here means the client just uses the regular
prepared-statement API without having to explicitly encrypt/decrypt any
data. The problem is we can't easily tie this to other columns in the
table, because the client may not even know what values are in those
columns.

Imagine you do this

  UPDATE t SET encrypted_column = $1 WHERE another_column = $2;

but you want to ensure the encrypted value belongs to a particular row
(which may or may not be identified by the another_column value). How
would the client do that? Should it fetch the value or what?

Similarly, what if the client just does

  SELECT encrypted_column FROM t;

How would it verify the values belong to the row, without having all the
data for the row (or just the required columns)?

> Unless "transparent" means that the client completely defers to the
> server on whether to encrypt or not, and silently goes along with it if
> the server tells it not to encrypt?
I think that's probably a valid concern - a "bad DBA" could alter the
table definition to not contain the "ENCRYPTED" bits, and then peek at
the plaintext values.

But it's not clear to me how exactly would the AEAD prevent this?
Wouldn't that be also specified on the server, somehow? In which case
the DBA could just tweak that too, no?

In other words, this issue seems mostly orthogonal to the AEAD, and the
right solution would be to allow the client to define which columns have
to be encrypted (in which case altering the server definition would not
be enough).

> That would only protect against a
> _completely_ passive DBA, like someone reading unencrypted backups,
> etc. And that still has a lot of value, certainly. But it seems like
> this prototype is very close to a system where the client can reliably
> secure data even if the server isn't trustworthy, if that's a use case
> you're interested in.
> 

Right. IMHO the "passive attacker" is a perfectly fine model for use
cases that would be fine with e.g. pgcrypto if there was no risk of
leaking plaintext values to logs, system catalogs, etc.

If we can improve it to provide (at least some) protection against
active attackers, that'd be a nice bonus.

>> I believe TCE can do AEAD at the column level, which protects against 
>> attacks like flipping bits, and similar attacks. It's just a matter of 
>> how the client encrypts the data.
> 
> Right, I think authenticated encryption ciphers (without AD) will be
> important to support in practice. I think users are going to want
> *some* protection against active attacks.
> 
>> Extending it to protect the whole row seems tricky, because the client 
>> may not even know the other columns, and it's not clear to me how it'd 
>> deal with things like updates of the other columns, hint bits, dropped 
>> columns, etc.
> 
> Covering the entire row automatically probably isn't super helpful in
> practice. As you mention later:
> 
>> It's probably possible to get something like this (row-level AEAD) by 
>> encrypting enriched data, i.e. not just the card number, but {user ID, 
>> card number} or something like that, and verify that in the webapp. The 
>> problem of course is that the "user ID" is just another column in the 
>> table, and there's nothing preventing the DBA from modifying that too.
> 
> Right. That's why the client has to be able to choose AD according to
> the application. In my previous example, the victim's email address can
> be copied by the DBA, but they wouldn't be able to authenticate as that
> user and couldn't convince the client to use the plaintext on their
> behalf.
> 

Well, yeah. But I'm not sure how to make that work easily, because the
client may not have the data :-(

I was thinking about using a composite data type combining the data with
the extra bits - that'd not be all that transparent as it'd require the
client to build this manually and then also cross-check it after loading
the data. So the user would be responsible for having all the data.

But doing that automatically/transparently seems hard, because how would
you deal e.g. with SELECT queries reading data through a view or CTE?

How would you declare this, either at the client or server?

Do any other databases have this capability? How do they do it?



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Tomas Vondra



On 12/8/21 02:54, Bharath Rupireddy wrote:
> On Wed, Dec 8, 2021 at 3:48 AM Tomas Vondra
>  wrote:
>>
>> On 12/7/21 15:36, Bharath Rupireddy wrote:
>>> Hi,
>>>
>>> Currently one can know the kind of on-going/last checkpoint (shutdown,
>>> end-of-recovery, immediate, force etc.) only via server logs, that too
>>> when log_checkpoints GUC is on. At times, the users/service layer
>>> components would want to know the kind of checkpoint (along with other
>>> checkpoint related info) to take some actions and it will be a bit
>>> difficult to search through the server logs. The checkpoint info can
>>> be obtained from the control file (either by pg_control_checkpoint()
>>> or by pg_controldata tool) whereas checkpoint kind isn't available
>>> there.
>>>
>>> How about we add an extra string field to the control file alongside
>>> the other checkpoint info it already has? This way, the existing
>>> pg_control_checkpoint() or pg_controldata tool can easily be enhanced
>>> to show the checkpoint kind as well. One concern is that we don't want
>>> to increase the size of pg_controldata by more than the typical block
>>> size (of 8K) to avoid any torn-writes. With this change, we might add
>>> at max the strings specified at [1]. Adding it to the control file has
>>> an advantage of preserving the last checkpoint kind which might be
>>> useful.
>>>
>>> Thoughts?
>>>
>>
>> I agree it might be useful to provide information about the nature of
>> the checkpoint, and perhaps even PID of the backend that triggered it
>> (although that may be tricky, if the backend terminates).
> 
> Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the
> triggered backend can possibly go there (we can mention in the
> documentation that the backend that triggered the checkpoint can get
> terminated).
> 

My concern is someone might run something that requires a checkpoint, so
we start it and put the PID into the catalog. And then the person aborts
the command and starts doing something else. That does not abort the
checkpoint, but the backend now runs something that doesn't require a
checkpoint, which is rather confusing.

>> I'm not sure about adding it to control data, though. That doesn't seem
>> like a very good match for something that's mostly for monitoring.
> 
> Having it in the control data file (along with the existing checkpoint
> information) will be helpful to know what was the last checkpoint
> information and we can use the existing pg_control_checkpoint function
> or the tool to emit that info. I plan to add an int16 flag as
> suggested by Justin in this thread and come up with a patch.
> 

OK, although I'm not sure it's all that useful (if we have that in some
sort of system view).

>> We already have some checkpoint info in pg_stat_bgwriter, but that's
>> just aggregated data, not very convenient for info about the current
>> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
>> various info about the current checkpoint?
> 
> +1 to have pg_stat_progress_checkpoint view to know what's going on
> with the current checkpoint.
> 

Do you plan to add it to this patch, or should it be a separate patch?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 7:34 AM Tomas Vondra
 wrote:
> >> I agree it might be useful to provide information about the nature of
> >> the checkpoint, and perhaps even PID of the backend that triggered it
> >> (although that may be tricky, if the backend terminates).
> >
> > Thanks. I agree to have pg_stat_progress_checkpoint and yes PID of the
> > triggered backend can possibly go there (we can mention in the
> > documentation that the backend that triggered the checkpoint can get
> > terminated).
> >
>
> My concern is someone might run something that requires a checkpoint, so
> we start it and put the PID into the catalog. And then the person aborts
> the command and starts doing something else. But that does not abort the
> checkpoint, but the backend now runs something that doesn't requite
> checkpoint, which is rather confusing.
>
> >> I'm not sure about adding it to control data, though. That doesn't seem
> >> like a very good match for something that's mostly for monitoring.
> >
> > Having it in the control data file (along with the existing checkpoint
> > information) will be helpful to know what was the last checkpoint
> > information and we can use the existing pg_control_checkpoint function
> > or the tool to emit that info. I plan to add an int16 flag as
> > suggested by Justin in this thread and come up with a patch.
> >
>
> OK, although I'm not sure it's all that useful (if we have that in some
> sort of system view).

If the server is down, the control file will help. Since we already
have the other checkpoint info in the control file, it's much more
useful and sensible to have this extra piece of missing information
(checkpoint kind) there.

> >> We already have some checkpoint info in pg_stat_bgwriter, but that's
> >> just aggregated data, not very convenient for info about the current
> >> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
> >> various info about the current checkpoint?
> >
> > +1 to have pg_stat_progress_checkpoint view to know what's going on
> > with the current checkpoint.
> >
>
> Do you plan to add it to this patch, or should it be a separate patch?

No, I will put some more thoughts around it and start a new thread.

Regards,
Bharath Rupireddy.




Re: Is there a way (except from server logs) to know the kind of on-going/last checkpoint?

2021-12-07 Thread Tomas Vondra



On 12/8/21 02:54, Bharath Rupireddy wrote:
> On Wed, Dec 8, 2021 at 4:07 AM Justin Pryzby  wrote:
>>> I'd bet squashing all of this into a single string (not really a flag)
>>> will just mean people will have to parse it, etc. Keeping individual
>>> boolean flags (or even separate string fields) would be better, I guess.
>>
>> Since the size of controldata is a concern, I suggest to add an int16 to
>> populate with flags, which pg_controldata can parse for display.
> 
> +1. I will come up with a patch soon.
> 
>> Note that this other patch intends to add the timestamp and LSN of most 
>> recent
>> recovery to controldata.
>> https://commitfest.postgresql.org/36/3183/
> 
> Thanks. I will go through it separately.
> 
>>> We already have some checkpoint info in pg_stat_bgwriter, but that's
>>> just aggregated data, not very convenient for info about the current
>>> checkpoint. So maybe we should add pg_stat_progress_checkpoint, showing
>>> various info about the current checkpoint?
>>
>> It could go into the pg_stat_checkpointer view, which would be the 
>> culmination
>> of another patch (cc Melanie).
>> https://commitfest.postgresql.org/36/3272/
> 

I don't think the pg_stat_checkpointer would be a good match - that's
going to be an aggregated view of all past checkpoints, not a good
source of info about the currently running one.

> +1 to have pg_stat_progress_checkpoint view. We have
> CheckpointStatsData already. What we need is to make this structure
> shared and add a little more info to represent the progress, so that
> the other backends can access it. I think we can discuss this in a
> separate thread to give it a fresh try rather than proposing this as a
> part of another thread. I will spend some time on
> pg_stat_progress_checkpoint proposal and try to come up with a
> separate thread to discuss.
> 

+1 to discuss it as part of this patch

I'm not sure whether the view should look at CheckpointStatsData, or do
the same thing as the other pg_stat_progress_* views - send the data to
stat collector, and read it from there.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: parallel vacuum comments

2021-12-07 Thread houzj.f...@fujitsu.com
On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada  
wrote:
> I've attached an updated patch. I've removed 0003 patch that added
> regression tests as per discussion. Regarding the terminology like "bulkdel"
> and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> code to vacuumparallel.c. In that file, we can consistently use the terms
> "bulkdel" and "cleanup" instead of "vacuum"
> and "cleanup".
Hi,

Thanks for updating the patch.
I noticed few minor things.

0001
1)

     * Skip processing indexes that are unsafe for workers (these are
-    * processed in do_serial_processing_for_unsafe_indexes() by leader)
+    * processed in parallel_vacuum_process_unsafe_indexes() by leader)

It might be clearer to mention that the indexes to be skipped are unsafe OR not
worthwhile.

2)
+   /* Set index vacuum status and mark as parallel safe or not */
+   for (int i = 0; i < pvc->nindexes; i++)
+   {
...
+       pindstats->parallel_workers_can_process =
+           parallel_vacuum_index_is_parallel_safe(vacrel,
+                                                  vacrel->indrels[i],
+                                                  vacuum);

For the comments above the loop, maybe better to mention that we are marking
whether a worker can process the index (not only whether it is safe/unsafe).

0002
3)

+   /*
+    * Skip indexes that are unsuitable target for parallel index vacuum
+    */
+   if (parallel_vacuum_should_skip_index(indrel))
+       continue;
+

It seems we can use will_parallel_vacuum[] here instead of invoking the function
again.
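
(Roughly, i.e. something like the following, assuming the loop index here
lines up with the will_parallel_vacuum[] array filled earlier:)

    /* reuse the flag computed earlier instead of calling the function again */
    if (!will_parallel_vacuum[i])
        continue;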

Best regards,
Hou zj


Re: 回复:Re: Is it worth pushing conditions to sublink/subplan?

2021-12-07 Thread Tomas Vondra
Hi,

On 12/7/21 10:44, 曾文旌(义从) wrote:
> Hi Hackers
> 
> For my previous proposal, I developed a prototype and passed
> regression testing. It works similarly to subquery's qual pushdown.
> We know that sublink expands at the beginning of each level of
> query. At this stage, The query's conditions and equivalence classes
> are not processed. But after generate_base_implied_equalities the
> conditions are processed,  which is why qual can push down to 
> subquery but sublink not.
> 
> My POC implementation chose to delay the sublink expansion in the
> SELECT clause (targetList) and where clause. Specifically, it is
> delayed after generate_base_implied_equalities. Thus, the equivalent
> conditions already established in the Up level query can be easily
> obtained in the sublink expansion process (make_subplan).
> 
> For example, if the up level query has a.id = 10 and the sublink
> query has a.id = b.id, then we get b.id = 10 and push it down to the
> sublink query. If b is a partitioned table and is partitioned by id,
> then a large number of unrelated subpartitions are pruned out. This
> optimizes a significant amount of Planner and SQL execution time, 
> especially if the partitioned table has a large number of
> subpartitions and is what I want.
> 
> Currently, there were two SQL failures in the regression test,
> because the expansion order of sublink was changed, which did not
> affect the execution result of SQL.
> 
> Look forward to your suggestions on this proposal.
> 

I took a quick look, and while I don't see / can't think of any problems
with delaying it until after generating implied equalities, there seems
to be a number of gaps.


1) Are there any regression tests exercising this modified behavior?
Maybe there are, but if the only changes are due to change in order of
targetlist entries, that doesn't seem like a clear proof.

It'd be good to add a couple tests exercising both the positive and
negative case (i.e. when we can and can't pushdown a qual).


2) apparently, contrib/postgres_fdw does crash like this:

  #3  0x0077b412 in adjust_appendrel_attrs_mutator
  (node=0x13f7ea0, context=0x7fffc3351b30) at appendinfo.c:470
  470  Assert(!IsA(node, SubLink));
  (gdb) p node
  $1 = (Node *) 0x13f7ea0
  (gdb) p *node
  $2 = {type = T_SubLink}

  Backtrace attached.

3) various parts of the patch really need at least some comments, like:

  - try_push_outer_qual_to_sublink_query really needs some docs

  - new stuff at the end of initsplan.c


4) generate_base_implied_equalities

   shouldn't this

if (ec->ec_processed)
;

   really be?

if (ec->ec_processed)
continue;

5) I'm not sure why we need the new ec_processed flag.

6) So we now have lazy_process_sublink callback? Does that mean we
expand sublinks in two places - sometimes lazily, sometimes not?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Core was generated by `postgres: user contrib_regression [local] EXPLAIN '.
Program terminated with signal SIGABRT, Aborted.
#0  0x79a07cfd79e5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.32-10.fc33.x86_64 libgcc-10.3.1-1.fc33.x86_64
(gdb) bt
#0  0x79a07cfd79e5 in raise () from /lib64/libc.so.6
#1  0x79a07cfc08a4 in abort () from /lib64/libc.so.6
#2  0x0094f22a in ExceptionalCondition 
(conditionName=conditionName@entry=0xab44ca "!IsA(node, SubLink)", 
errorType=errorType@entry=0x9a2017 "FailedAssertion", 
fileName=fileName@entry=0xab55cf "appendinfo.c", 
lineNumber=lineNumber@entry=470) at assert.c:69
#3  0x0077b412 in adjust_appendrel_attrs_mutator (node=0x13f7ea0, 
context=0x7fffc3351b30) at appendinfo.c:470
#4  0x0071f8f9 in expression_tree_mutator (node=0x13f7f90, 
mutator=0x77ae20 , context=0x7fffc3351b30) at 
nodeFuncs.c:3240
#5  0x0077b025 in adjust_appendrel_attrs_mutator (node=0x13f7f90, 
context=0x7fffc3351b30) at appendinfo.c:390
#6  0x00720066 in expression_tree_mutator (node=0x13eca78, 
mutator=0x77ae20 , context=0x7fffc3351b30) at 
nodeFuncs.c:3109
#7  0x0077b512 in adjust_appendrel_attrs (root=root@entry=0x13be5b0, 
node=, nappinfos=nappinfos@entry=1, 
appinfos=appinfos@entry=0x7fffc3351bd0) at appendinfo.c:210
#8  0x0073b88c in set_append_rel_size (rte=, rti=4, 
rel=0xf19538, root=0x13be5b0) at allpaths.c:1056
#9  set_rel_size (root=0x13be5b0, rel=0xf19538, rti=4, rte=) at 
allpaths.c:386
#10 0x0073e250 in set_base_rel_sizes (root=) at 
allpaths.c:326
#11 make_one_rel (root=root@entry=0x13be5b0, joinlist=joinlist@entry=0x13edf10) 
at allpaths.c:188
#12 0x00763fde in query_planner (root=root@entry=0x13be5b0, 
qp_callback=qp_callback@entry=0x764eb0 , 
qp_extra=qp_extra@entry=0x7fffc3351d90, lps_callback=) at 
planmain.c:286
#13 0x007694fa in grouping_planner (root=, 
tuple_fraction=) at planner.c:145

Re: parse_subscription_options - suggested improvements

2021-12-07 Thread Michael Paquier
On Tue, Dec 07, 2021 at 08:12:59AM +1100, Peter Smith wrote:
> On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan  wrote:
>> Attached a v14 with the initializations added back.
> 
> LGTM.

All the code paths previously covered still are, so applied this one.
Thanks!
--
Michael




Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan  wrote:
>> I agree with Tom.  I would just s/server/backend/ (as per the
>> attached) and call it a day.
>
> Thanks. v5 patch looks good to me.

I've marked the commitfest entry as ready-for-committer.

Nathan



Re: parse_subscription_options - suggested improvements

2021-12-07 Thread Peter Smith
On Wed, Dec 8, 2021 at 2:51 PM Michael Paquier  wrote:
>
> On Tue, Dec 07, 2021 at 08:12:59AM +1100, Peter Smith wrote:
> > On Tue, Dec 7, 2021 at 6:07 AM Bossart, Nathan  wrote:
> >> Attached a v14 with the initializations added back.
> >
> > LGTM.
>
> All the code paths previously covered still are, so applied this one.
> Thanks!

Thanks for pushing!

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
>> I noticed that some (but not all) of the surrounding messages say
>> "last known up at" the control file time.  I'm curious why you chose
>> not to use that phrasing in this case.
>
> If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was
> interrupted while in end-of-recovery checkpoint, so I used the
> phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY
> cases. I would like to keep it as-is (in the v1 patch) unless anyone
> has other thoughts here?
> (errmsg("database system was interrupted while in recovery at %s",
> (errmsg("database system was interrupted while in recovery at log time %s",

I think that's alright.  The only other small suggestion I have would
be to say "during end-of-recovery checkpoint" instead of "while in
end-of-recovery checkpoint."

Another option we might want to consider is to just skip updating the
state entirely for end-of-recovery checkpoints.  The state would
instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
don't know if it's crucial to have a dedicated control file state for
end-of-recovery checkpoints.
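
For illustration, that alternative might amount to something like the
following in CreateCheckPoint() (untested sketch only; whether the
existing "shutdown" check is the right place to gate this is an
assumption):

    /*
     * Sketch: leave the control file state untouched for the
     * end-of-recovery checkpoint, so it moves straight from
     * DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION afterwards.
     */
    if (shutdown && !(flags & CHECKPOINT_END_OF_RECOVERY))
    {
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        ControlFile->state = DB_SHUTDOWNING;
        UpdateControlFile();
        LWLockRelease(ControlFileLock);
    }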

Nathan



Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Michael Paquier
On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote:
> Upon thinking further, having at least the messages at LOG level [1]
> would be helpful to know what's happening with the system while in
> recovery. Although these messages at LOG level seem to be filling up
> the server logs, having a good log consumption and rotation mechanism
> (I'm sure every major postgres vendor would have one) would be
> sufficient to allay that concern.

+  ereport(LOG,
+  (errmsg("recovering WAL segment \"%s\" from source \"%s\"",
+   xlogfname, srcname)));
Isn't this approach an issue for translations?  It seems to me that
terms like "stream" or "archive" had better be translated properly,
meaning that this needs a fully-constructed sentence.
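
For example, a translation-friendly version could spell out one full
sentence per source (a rough sketch only; the message wording and where
the switch would live are just assumptions):

    switch (source)
    {
        case XLOG_FROM_ARCHIVE:
            ereport(LOG,
                    (errmsg("recovering WAL segment \"%s\" from archive",
                            xlogfname)));
            break;
        case XLOG_FROM_PG_WAL:
            ereport(LOG,
                    (errmsg("recovering WAL segment \"%s\" from pg_wal",
                            xlogfname)));
            break;
        case XLOG_FROM_STREAM:
            ereport(LOG,
                    (errmsg("recovering WAL segment \"%s\" from stream",
                            xlogfname)));
            break;
        default:
            break;
    }

That way each string is a complete sentence that can be translated on
its own.
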
--
Michael




Re: parallel vacuum comments

2021-12-07 Thread Masahiko Sawada
On Wed, Dec 8, 2021 at 12:22 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, December 7, 2021 1:42 PM Masahiko Sawada  
> wrote:
> > I've attached an updated patch. I've removed 0003 patch that added
> > regression tests as per discussion. Regarding the terminology like "bulkdel"
> > and "cleanup" you pointed out, I've done that in 0002 patch while moving the
> > code to vacuumparallel.c. In that file, we can consistently use the terms
> > "bulkdel" and "cleanup" instead of "vacuum"
> > and "cleanup".
> Hi,
>
> Thanks for updating the patch.
> I noticed a few minor things.

Thank you for the comments!

>
> 0001
> 1)
>
>  * Skip processing indexes that are unsafe for workers (these 
> are
> -* processed in do_serial_processing_for_unsafe_indexes() by 
> leader)
> +* processed in parallel_vacuum_process_unsafe_indexes() by 
> leader)
>
> It might be clearer to mention that the indexes to be skipped are unsafe OR not
> worthwhile.

Agreed. Will add the comments.

>
> 2)
> +   /* Set index vacuum status and mark as parallel safe or not */
> +   for (int i = 0; i < pvc->nindexes; i++)
> +   {
> ...
> +   pindstats->parallel_workers_can_process =
> +   parallel_vacuum_index_is_parallel_safe(vacrel,
> + 
>  vacrel->indrels[i],
> + 
>  vacuum);
>
> For the comments above the loop, maybe better to mention we are marking whether
> the worker can process the index (not only whether it is safe/unsafe).

Right. Will fix.

>
> 0002
> 3)
>
> +   /*
> +* Skip indexes that are unsuitable target for parallel index 
> vacuum
> +*/
> +   if (parallel_vacuum_should_skip_index(indrel))
> +   continue;
> +
>
> It seems we can use will_parallel_vacuum[] here instead of invoking the 
> function
> again.

Oops, I missed updating it in 0002 patch. Will fix.
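
I suppose the fix would look something along these lines (just a
sketch; assuming the loop index "i" matches the ordering used for
will_parallel_vacuum[]):

    /*
     * Skip indexes that are unsuitable targets for parallel index
     * vacuum, reusing the flag computed earlier instead of calling
     * parallel_vacuum_should_skip_index() again.
     */
    if (!will_parallel_vacuum[i])
        continue;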

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan  wrote:
>
> On 12/7/21, 5:21 PM, "Bharath Rupireddy" 
>  wrote:
> > On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan  wrote:
> >> I noticed that some (but not all) of the surrounding messages say
> >> "last known up at" the control file time.  I'm curious why you chose
> >> not to use that phrasing in this case.
> >
> > If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was
> > interrupted while in end-of-recovery checkpoint, so I used the
> > phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY
> > cases. I would like to keep it as-is (in the v1 patch) unless anyone
> > has other thoughts here?
> > (errmsg("database system was interrupted while in recovery at %s",
> > (errmsg("database system was interrupted while in recovery at log time %s",
>
> I think that's alright.  The only other small suggestion I have would
> be to say "during end-of-recovery checkpoint" instead of "while in
> end-of-recovery checkpoint."

"while in" is being used by DB_IN_CRASH_RECOVERY and
DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to
deviate from that and use "during".

> Another option we might want to consider is to just skip updating the
> state entirely for end-of-recovery checkpoints.  The state would
> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
> don't know if it's crucial to have a dedicated control file state for
> end-of-recovery checkpoints.

Please note that end-of-recovery can take a while in production
systems (we have observed such things working with our customers) and
anything can happen during that period of time. The end-of-recovery
checkpoint is not something that gets finished momentarily. Therefore,
having a separate db state in the control file is useful.
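
To be concrete, the proposal is just one more value in the DBState enum
of src/include/catalog/pg_control.h, along these lines (a sketch; the
exact position of the new member is up for discussion):

    typedef enum DBState
    {
        DB_STARTUP = 0,
        DB_SHUTDOWNED,
        DB_SHUTDOWNED_IN_RECOVERY,
        DB_SHUTDOWNING,
        DB_IN_CRASH_RECOVERY,
        DB_IN_ARCHIVE_RECOVERY,
        DB_IN_END_OF_RECOVERY_CHECKPOINT,   /* proposed addition */
        DB_IN_PRODUCTION
    } DBState;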

Regards,
Bharath Rupireddy.




Re: [PATCH]Comment improvement in publication.sql

2021-12-07 Thread vignesh C
On Sun, Aug 8, 2021 at 4:26 PM tanghy.f...@fujitsu.com
 wrote:
>
> On Sunday, August 8, 2021 6:34 PM, vignesh C  wrote
> >Thanks for the updated patch, your changes look good to me. You might
> >want to include the commit message in the patch, that will be useful.
>
> Thanks for your kind suggestion.
> Updated patch attached. Added the patch commit message with no new fix.

The patch no longer applies; could you post a rebased patch?

Regards,
Vignesh




Re: Skipping logical replication transactions on subscriber side

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 5:06 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 6, 2021 at 2:17 PM Amit Kapila  wrote:
>
> I'll submit the patch tomorrow.
>
> While updating the patch, I realized that skipping a transaction that
> is prepared on the publisher will be a bit tricky;
>
> First of all, since skip-xid is in pg_subscription catalog, we need to
> do a catalog update in a transaction and commit it to disable it. I
> think we need to set origin-lsn and timestamp of the transaction being
> skipped to the transaction that does the catalog update. That is,
> during skipping the (not prepared) transaction, we skip all
> data-modification changes coming from the publisher, do a catalog
> update, and commit the transaction. If we do the catalog update in the
> next transaction after skipping the whole transaction, skip_xid could
> be left in case of a server crash between them.
>

But if we haven't updated origin_lsn/timestamp before the crash, won't
it request the same transaction again from the publisher? If so, it
will be again able to skip it because skip_xid is still not updated.

> Also, we cannot set
> origin-lsn and timestamp to an empty transaction.
>

But won't we update the catalog for skip_xid in that case?

Do we see any advantage of updating the skip_xid in the same
transaction vs. doing it in a separate transaction? If not then
probably we can choose either of those ways and add some comments to
indicate the possibility of doing it another way.

> In prepared transaction cases, I think that when handling a prepare
> message, we need to commit the transaction to update the catalog,
> instead of preparing it. And at the commit prepared and rollback
> prepared time, we skip it since there is no prepared transaction
> on the subscriber.
>

Can't we think of just allowing prepare in this case and updating the
skip_xid only at commit time? I see that in this case, we would be
doing prepare for a transaction that has no changes but as such cases
won't be common, isn't that acceptable?

> Currently, handling rollback prepared already
> behaves so; it first checks whether we have prepared the transaction
> or not and skips it if we haven’t. So I think we need to do that also for
> commit prepared case. With that, this requires protocol changes so
> that the subscriber can get prepare-lsn and prepare-time when handling
> commit prepared.
>
> So I’m writing a separate patch to add prepare-lsn and timestamp to
> commit_prepared message, which will be a building block for skipping
> prepared transactions. Actually, I think it’s beneficial even today;
> we can skip preparing the transaction if it’s an empty transaction.
> Although the comment says it’s not a common case, I think that it could
> happen quite often in some cases:
>
> * XXX, We can optimize such that at commit prepared time, we first check
> * whether we have prepared the transaction or not but that doesn't seem
> * worthwhile because such cases shouldn't be common.
> */
>
> For example, if the publisher has multiple subscriptions and there are
> many prepared transactions that modify the particular table subscribed
> by one publisher, many empty transactions are replicated to other
> subscribers.
>

This part is not clear to me. Why would one have multiple
subscriptions for the same publication? I thought it is possible when,
say, some publisher doesn't publish any data of a prepared transaction
because the corresponding action is not published or something like
that. I don't deny that someday we may want to optimize this case, but
it might be better if we don't need to do it along with this patch.

-- 
With Regards,
Amit Kapila.




Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display

2021-12-07 Thread Bharath Rupireddy
On Wed, Dec 8, 2021 at 10:05 AM Michael Paquier  wrote:
>
> On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote:
> > Upon thinking further, having at least the messages at LOG level [1]
> > would be helpful to know what's happening with the system while in
> > recovery. Although these messages at LOG level seem to be filling up
> > the server logs, having a good log consumption and rotation mechanism
> > (I'm sure every major postgres vendor would have one) would be
> > sufficient to allay that concern.
>
> +  ereport(LOG,
> +  (errmsg("recovering WAL segment \"%s\" from source \"%s\"",
> +   xlogfname, srcname)));
> Isn't this approach an issue for translations?  It seems to me that
> terms like "stream" or "archive" had better be translated properly,
> meaning that this needs a fully-constructed sentence.

Thanks for taking a look at the patch. How about the attached v4?

I added a CF entry - https://commitfest.postgresql.org/36/3443/

Regards,
Bharath Rupireddy.


v4-0001-add-log-messages-for-recovery.patch
Description: Binary data


Re: row filtering for logical replication

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 6:31 PM Ashutosh Bapat
 wrote:
>
> On Tue, Dec 7, 2021 at 12:18 PM tanghy.f...@fujitsu.com
>  wrote:
> >
> > I have another problem with your patch. The document says:
> >
> > ... If the subscription has several publications in
> > +   which the same table has been published with different filters, those
> > +   expressions get OR'ed together so that rows satisfying any of the 
> > expressions
> > +   will be replicated. Notice this means if one of the publications has no 
> > filter
> > +   at all then all other filters become redundant.
> >
> > Then, what if one of the publications is specified as 'FOR ALL TABLES' or 
> > 'FOR
> > ALL TABLES IN SCHEMA'.
> >
> > For example:
> > create table tbl (a int primary key);
> > create publication p1 for table tbl where (a > 10);
> > create publication p2 for all tables;
> > create subscription sub connection 'dbname=postgres port=5432' publication 
> > p1, p2;
>
> Thanks for the example. I was wondering about this case myself.
>

I think we should handle this case.

> >
> > I think for a "FOR ALL TABLES" publication (p2 in my case), table tbl should be
> > treated as no filter, and table tbl should have no filter in subscription 
> > sub. Thoughts?
> >
> > But for now, the filter(a > 10) works both when copying initial data and 
> > later changes.
> >
> > To fix it, I think we can check if the table is published in a 'FOR ALL 
> > TABLES'
> > publication or published as part of schema in function 
> > pgoutput_row_filter_init
> > (which was introduced in v44-0003 patch), also we need to make some changes 
> > in
> > tablesync.c.
>
> In order to check "FOR ALL TABLES", we might need to fetch publication
> metadata.
>

Do we really need to perform a separate fetch for this? In
get_rel_sync_entry(), we already have this information, so can't we
somehow stash it in the corresponding RelationSyncEntry so that the
same can be used later for row filtering?
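
Something like the following, perhaps (very much a sketch; the
no_row_filter field name is made up for illustration):

    /* in get_rel_sync_entry(), while walking the publications */
    foreach(lc, data->publications)
    {
        Publication *pub = (Publication *) lfirst(lc);

        /*
         * Remember that at least one publication publishes the table
         * without any row filter (e.g. FOR ALL TABLES), so row
         * filtering can be bypassed later.
         */
        if (pub->alltables)
        {
            entry->no_row_filter = true;    /* hypothetical new field */
            break;
        }
    }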

> Instead of that can we add a "TRUE" filter on all the tables
> which are part of FOR ALL TABLES publication?
>

How? We won't have an entry for such tables in pg_publication_rel
where we store row_filter information.

-- 
With Regards,
Amit Kapila.




Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?

2021-12-07 Thread Bossart, Nathan
On 12/7/21, 8:42 PM, "Bharath Rupireddy" 
 wrote:
> On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan  wrote:
>> I think that's alright.  The only other small suggestion I have would
>> be to say "during end-of-recovery checkpoint" instead of "while in
>> end-of-recovery checkpoint."
>
> "while in" is being used by DB_IN_CRASH_RECOVERY and
> DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to
> deviate from that and use "during".

Fair enough.  I don't have a strong opinion about this.

>> Another option we might want to consider is to just skip updating the
>> state entirely for end-of-recovery checkpoints.  The state would
>> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION.  I
>> don't know if it's crucial to have a dedicated control file state for
>> end-of-recovery checkpoints.
>
> Please note that end-of-recovery can take a while in production
> systems (we have observed such things working with our customers) and
> anything can happen during that period of time. The end-of-recovery
> checkpoint is not something that gets finished momentarily. Therefore,
> having a separate db state in the control file is useful.

Is there some useful distinction between the states for users?  ISTM
that users will be waiting either way, and I don't know that an extra
control file state will help all that much.  The main reason I bring
up this option is that the list of states is pretty short and appears
to be intended to indicate the high-level status of the server.  Most
of the states are over 20 years old, and the newest one is over 10
years old, so I don't think new states can be added willy-nilly.

Of course, I could be off-base and others might agree that this new
state would be nice to have.

Nathan



RE: [PATCH]Comment improvement in publication.sql

2021-12-07 Thread tanghy.f...@fujitsu.com
On Wednesday, December 8, 2021 1:49 PM, vignesh C  wrote:

> The patch no longer applies; could you post a rebased patch?

Thanks for your kind reminder. Attached is a rebased patch.
Some of the changes in the v4 patch were already fixed by 5a28324, so I deleted those changes.

Regards,
Tang 


v5-0001-Fix-comments-in-publication.sql.patch
Description: v5-0001-Fix-comments-in-publication.sql.patch


Re: Data is copied twice when specifying both child and parent table in publication

2021-12-07 Thread Amit Kapila
On Tue, Dec 7, 2021 at 5:53 PM vignesh C  wrote:
>
> On Fri, Dec 3, 2021 at 11:24 AM houzj.f...@fujitsu.com
>  wrote:
> >
>
> 2) Any particular reason why the code and tests are back-patched but
> not the documentation changes?
>

I am not sure whether we need the doc change or not, as this is not a
new feature; and even if we want it as an improvement to the docs,
shall we consider backpatching it? I felt that the code changes are
required to fix a known issue, so the case for backpatching them is
clear.

-- 
With Regards,
Amit Kapila.



