Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-26 Thread Michael Paquier
On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
> The way this is written, it would change whenever we add/remove fields in
> DecodedBkpBlock, for example. That's fragile; if you added a field in a
> back-branch, you could accidentally make the new minor version unable to
> read maximum-sized WAL records generated with an older version. I'd like the
> maximum to be more explicit.

That's a good point.

> How large exactly is the maximum size that this gives? I'd prefer to set the
> limit conservatively to 1020 MB, for example, with a compile-time static
> assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).

Something like that would work, I guess.
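As a rough sketch of what that could look like (the constant name and the
exact headroom are not final, just to illustrate the idea):

/* Sketch only: cap the length of any WAL record we will ever try to decode. */
#define XLogRecordMaxSize	(1020 * 1024 * 1024)	/* 1020 MB */

/*
 * Keep a few MB of headroom under MaxAllocSize (utils/memutils.h) so that
 * the decode buffer for a maximum-sized record, which needs a bit more than
 * the raw record length, still fits in a palloc'able chunk.
 */
StaticAssertDecl(XLogRecordMaxSize < MaxAllocSize,
				 "maximum WAL record size out of bounds");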

>  Among the two problems to solve at hand, the parts where the APIs are
>  changed and made more robust with unsigned types and where block data
>  is not overflowed with its 16-byte limit are committable, so I'd like
>  to do that first (still need to check its performance with some micro
>  benchmark on XLogRegisterBufData()).
> 
> +1. I'm not excited about adding the "unlikely()" hints, though. We have a
> pg_attribute_cold hint in ereport(), that should be enough.

Okay, that makes sense.  FWIW, I have been wondering about the
addition of the extra condition in XLogRegisterBufData() and I did not
see a difference on HEAD in terms of execution time or profile, with a
micro-benchmark doing a couple of million calls in a row, roughly like
the following:
Relation	rel;
Buffer		buffer;
char		buf[10] = {0};		/* dummy payload for XLogRegisterBufData() */
int			i;

/* Can be anything, really.. */
rel = relation_open(RelationRelationId, AccessShareLock);
buffer = ReadBuffer(rel, 0);
for (i = 0; i < WAL_MAX_CALLS; i++)
{
	XLogBeginInsert();
	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
	XLogRegisterBufData(0, buf, 10);
	XLogResetInsertion();
}
ReleaseBuffer(buffer);
relation_close(rel, AccessShareLock);
--
Michael




Re: Refactoring postgres_fdw/connection.c

2022-07-26 Thread Kyotaro Horiguchi
At Tue, 26 Jul 2022 00:54:47 +0900, Fujii Masao  
wrote in 
> Hi,
> 
> When reviewing the postgres_fdw parallel-abort patch [1], I found that
> there are several duplicate codes in postgres_fdw/connection.c.
> Which seems to make it harder to review the patch changing
> connection.c.
> So I'd like to remove such duplicate codes and refactor the functions
> in connection.c. I attached the following three patches.
> 
> There are two functions, pgfdw_get_result() and
> pgfdw_get_cleanup_result(),
> to get a query result. They have almost the same code, call
> PQisBusy(),
> WaitLatchOrSocket(), PQconsumeInput() and PQgetResult() in the loop,
> but only pgfdw_get_cleanup_result() allows its callers to specify the
> timeout.
> 0001 patch transforms pgfdw_get_cleanup_result() to the common
> function
> to get a query result and makes pgfdw_get_result() use it instead of
> its own (duplicate) code. The patch also renames
> pgfdw_get_cleanup_result()
> to pgfdw_get_result_timed().

Agree to that refactoring.  And it looks fine to me.

> pgfdw_xact_callback() and pgfdw_subxact_callback() have similar codes
> to
> issue COMMIT or RELEASE SAVEPOINT commands. 0002 patch adds the common
> function,
> pgfdw_exec_pre_commit(), for that purpose, and changes those functions
> so that they use the common one.

I'm not sure the two are really similar to each other.  The new function
pgfdw_exec_pre_commit() looks like a merger of two isolated code paths
intended to share a seven-line codelet.  I feel the code gets a bit
harder to understand after the change.  I mildly oppose this part.

> pgfdw_finish_pre_commit_cleanup() and
> pgfdw_finish_pre_subcommit_cleanup()
> have similar codes to wait for the results of COMMIT or RELEASE
> SAVEPOINT commands.
> 0003 patch adds the common function, pgfdw_finish_pre_commit(), for
> that purpose,
> and replaces those functions with the common one.
> That is, pgfdw_finish_pre_commit_cleanup() and
> pgfdw_finish_pre_subcommit_cleanup()
> are no longer necessary and 0003 patch removes them.

It gives me the same feeling as 0002.  Considering that
pending_deallocate becomes non-NIL only at toplevel, 38 of the 66 lines
of the function are dedicated to the toplevel case.

> [1] https://commitfest.postgresql.org/38/3392/

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: remove more archiving overhead

2022-07-26 Thread Fujii Masao




On 2022/07/09 2:18, Nathan Bossart wrote:

On Fri, Jul 08, 2022 at 01:02:51PM -0400, David Steele wrote:

I think I wrote this before I'd had enough coffee. "fully persisted to
storage" can mean many things depending on the storage (Posix, CIFS, S3,
etc.) so I think this is fine. The basic_archive module is there for people
who would like implementation details for Posix.


Yes. But some users might want to use basic_archive without any changes.
For them, how about documenting how basic_archive works in the basic_archive
docs? I'm just thinking that it's worth exposing the information from the
comment at the top of basic_archive.c.

Anyway, since the patches look good to me, I pushed them. Thanks a lot!
If necessary, we can keep improving the docs later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-07-26 Thread Richard Guo
On Tue, Jul 26, 2022 at 7:38 AM David Rowley  wrote:

> On Fri, 22 Jul 2022 at 21:33, Richard Guo  wrote:
> > I can see this problem with
> > the query below:
> >
> > select max(b order by b), max(a order by a) from t group by a;
> >
> > When processing the first aggregate, we compose the 'currpathkeys' as
> > {a, b} and mark this aggregate in 'aggindexes'. When it comes to the
> > second aggregate, we compose its pathkeys as {a} and decide that it is
> > not stronger than 'currpathkeys'. So the second aggregate is not
> > recorded in 'aggindexes'. As a result, we fail to mark aggpresorted for
> > the second aggregate.
>
> Yeah, you're right. I have a missing check to see if currpathkeys are
> better than the pathkeys for the current aggregate. In your example
> case we'd have still processed the 2nd aggregate the old way instead
> of realising we could take the new pre-sorted path for faster
> processing.
>
> I've adjusted that in the attached to make it properly include the
> case where currpathkeys are better.


Thanks. I verified the problem is solved in the v8 patch.

Also, I'm wondering if it's possible to take into consideration the
ordering indicated by existing indexes when determining the pathkeys, so
that for the query below we can avoid the Incremental Sort node, given
that there is an index on t(a, c):

# explain (costs off) select max(b order by b), max(c order by c) from t
group by a;
                  QUERY PLAN
----------------------------------------------
 GroupAggregate
   Group Key: a
   ->  Incremental Sort
         Sort Key: a, b
         Presorted Key: a
         ->  Index Scan using t_a_c_idx on t
(6 rows)

Thanks
Richard


Re: Introduce wait_for_subscription_sync for TAP tests

2022-07-26 Thread Masahiko Sawada
On Tue, Jul 26, 2022 at 2:01 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 7:07 AM Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > In tap tests for logical replication, we have the following code in many 
> > places:
> >
> > $node_publisher->wait_for_catchup('tap_sub');
> > my $synced_query =
> >   "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
> > IN ('r', 's');";
> > $node_subscriber->poll_query_until('postgres', $synced_query)
> >   or die "Timed out while waiting for subscriber to synchronize data";
> >
> > Also, we sometimes forgot to check either one, like we fixed in commit
> > 1f50918a6fb02207d151e7cb4aae4c36de9d827c.
> >
> > I think we can have a new function to wait for all subscriptions to
> > synchronize data. The attached patch introduces a new function
> > wait_for_subscription_sync(). With this function, we can replace the
> > above code with this one function as follows:
> >
> > $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub');
> >
>
> +1. This reduces quite some code in various tests and will make it
> easier to write future tests.
>
> Few comments/questions:
> 
> 1.
> -$node_publisher->wait_for_catchup('mysub1');
> -
> -# Also wait for initial table sync to finish.
> -my $synced_query =
> -  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
> IN ('r', 's');";
> -$node_subscriber->poll_query_until('postgres', $synced_query)
> -  or die "Timed out while waiting for subscriber to synchronize data";
> -
>  # Also wait for initial table sync to finish.
> -$node_subscriber->poll_query_until('postgres', $synced_query)
> -  or die "Timed out while waiting for subscriber to synchronize data";
> +$node_subscriber->wait_for_subscription_sync($node_publisher, 'mysub1');
>
> It seems to me without your patch there is an extra poll in the above
> test. If so, we can probably remove that in a separate patch?

Agreed.

>
> 2.
> +# wait for the replication to catchup if required.
> +if (defined($publisher))
> +{
> + croak 'subscription name must be specified' unless defined($subname);
> + $publisher->wait_for_catchup($subname, 'replay');
> +}
> +
> +# then, wait for all table states to be ready.
> +print "Waiting for all subscriptions in \"$name\" to synchronize data\n";
> +my $query = qq[SELECT count(1) = 0
> + FROM pg_subscription_rel
> + WHERE srsubstate NOT IN ('r', 's');];
> +$self->poll_query_until($dbname, $query)
> + or croak "timed out waiting for subscriber to synchronize data";
>
> In the tests, I noticed that a few places did wait_for_catchup after
> the subscription check, and at other places, we did that check before
> as you have it here. Ideally, I think wait_for_catchup should be after
> confirming the initial sync is over as without initial sync, the
> publisher node won't be completely in sync with the subscriber.

What do you mean by the last sentence? I thought the order doesn't
matter here. Even if we do wait_for_catchup first and then the
subscription check, we can make sure that the apply worker has caught up
and table synchronization has been done, no?

>
> 3. In the code quoted in the previous point, why did you pass the
> second parameter as 'replay' when we have not used that in the tests
> otherwise?

It was to make sure to use the (current) default value of $mode in
wait_for_catchup(). But it's probably not necessary, so I've removed it.

I've attached an updated patch as well as a patch to remove duplicated
waits in 007_ddl.pl.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
From 35c7b74b66a179b14fc97a012309f3d3ee9a7e26 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 26 Jul 2022 16:37:26 +0900
Subject: [PATCH v2 1/2] Remove duplicated wait for subscription
 synchronization in 007_ddl.pl

---
 src/test/subscription/t/007_ddl.pl | 4 
 1 file changed, 4 deletions(-)

diff --git a/src/test/subscription/t/007_ddl.pl b/src/test/subscription/t/007_ddl.pl
index cdd6b119ff..01df54229c 100644
--- a/src/test/subscription/t/007_ddl.pl
+++ b/src/test/subscription/t/007_ddl.pl
@@ -57,10 +57,6 @@ my $synced_query =
 $node_subscriber->poll_query_until('postgres', $synced_query)
   or die "Timed out while waiting for subscriber to synchronize data";
 
-# Also wait for initial table sync to finish.
-$node_subscriber->poll_query_until('postgres', $synced_query)
-  or die "Timed out while waiting for subscriber to synchronize data";
-
 # Specifying non-existent publication along with add publication.
 ($ret, $stdout, $stderr) = $node_subscriber->psql('postgres',
 	"ALTER SUBSCRIPTION mysub1 ADD PUBLICATION non_existent_pub1, non_existent_pub2"
-- 
2.24.3 (Apple Git-128)

From 229f8fe990f56a6558311de5eb75e35275b50609 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 26 Jul 2022 09:43:17 +0900
Subject: [PATCH v2 2/2] Introduce wait_for_subscription_sync for TAP tests

---
 src/test/perl/PostgreSQL/Test/Cluster.pm  | 43 +
 src/test/su

SI-read predicate locks on materialized views

2022-07-26 Thread Yugo NAGATA
Hi,

I propose to acquire SI-read predicate locks on materialized views
as the attached patch.

Currently, materialized views do not participate in predicate locking,
but I think this causes a serialization anomaly when `REFRESH
MATERIALIZED VIEW CONCURRENTLY` is used.

For example, suppose that there is a table "orders" which contains
order information, and a materialized view "order_summary" which contains
a summary of the order information.

 CREATE TABLE orders (date date, item text, num int);

 CREATE MATERIALIZED VIEW order_summary AS
   SELECT date, item, sum(num) FROM orders GROUP BY date, item;

"order_summary" is refreshed once per day in the following transaction.

 T1:
 REFRESH MATERIALIZED VIEW CONCURRENTLY order_summary;

"orders" has a date column, and when a new item is inserted, the date
value is determined as the next day of the last date recorded in
"order_summary" as in the following transaction.
 
 T2:
 SELECT max(date) + 1 INTO today FROM order_summary;
 INSERT INTO orders(date, item, num) VALUES (today, 'apple', 1);

If these two transactions run concurrently, a write skew anomaly occurs,
and the result of order_summary refreshed in T1 will not contain the
record inserted in T2.

On the other hand, if the materialized view participates in predicate
locking and the transaction isolation level is SERIALIZABLE, this
anomaly can be avoided; one of the transactions will be aborted and
suggested to be retried.

The problem doesn't occur when we use REFRESH MATERIALIZED VIEW
(not CONCURRENTLY) because it acquires the strongest lock and
any concurrent transactions are prevented from reading the materialized view.
I think this is the reason why materialized views didn't have to
participate in predicate locking. However, this is no longer the case
because now we support REFRESH ... CONCURRENTLY, which refreshes the
materialized view using DELETE and INSERT and also allows concurrent
transactions to read it. I think we can regard these operations the same
as DELETE, INSERT, and SELECT on regular tables and acquire predicate
locks on materialized views as well.

What do you think about it?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5136da6ea3..6d414bfcc9 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -489,14 +489,13 @@ static void ReleasePredicateLocksLocal(void);
 
 /*
  * Does this relation participate in predicate locking? Temporary and system
- * relations are exempt, as are materialized views.
+ * relations are exempt.
  */
 static inline bool
 PredicateLockingNeededForRelation(Relation relation)
 {
 	return !(relation->rd_id < FirstUnpinnedObjectId ||
-			 RelationUsesLocalBuffers(relation) ||
-			 relation->rd_rel->relkind == RELKIND_MATVIEW);
+			 RelationUsesLocalBuffers(relation));
 }
 
 /*


Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-26 Thread Masahiko Sawada
On Tue, Jul 26, 2022 at 2:18 PM Amit Kapila  wrote:
>
> On Tue, Jul 26, 2022 at 7:00 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jul 25, 2022 at 7:57 PM shiy.f...@fujitsu.com
> >  wrote:
> > >
> > > Hi,
> > >
> > > I did some performance test for the master branch patch (based on v6 
> > > patch) to
> > > see if the bsearch() added by this patch will cause any overhead.
> >
> > Thank you for doing performance tests!
> >
> > >
> > > I tested them three times and took the average.
> > >
> > > The results are as follows, and attach the bar chart.
> > >
> > > case 1
> > > -
> > > No catalog modifying transaction.
> > > Decode 800k pgbench transactions. (8 clients, 100k transactions per 
> > > client)
> > >
> > > master  7.5417
> > > patched 7.4107
> > >
> > > case 2
> > > -
> > > There's one catalog modifying transaction.
> > > Decode 100k/500k/1M transactions.
> > >
> > >         100k    500k    1M
> > > master  0.0576  0.1491  0.4346
> > > patched 0.0586  0.1500  0.4344
> > >
> > > case 3
> > > -
> > > There are 64 catalog modifying transactions.
> > > Decode 100k/500k/1M transactions.
> > >
> > >         100k    500k    1M
> > > master  0.0600  0.1666  0.4876
> > > patched 0.0620  0.1653  0.4795
> > >
> > > (Because the result of case 3 shows that there is an overhead of about
> > > 3% in the case of decoding 100k transactions with 64 catalog modifying
> > > transactions, I tested the next run of 100k xacts with or without
> > > catalog modifying transactions, to see if it affects subsequent
> > > decoding.)
> > >
> > > case 4.1
> > > -
> > > After the test steps in case 3 (64 catalog modifying transactions, decode 
> > > 100k
> > > transactions), run 100k xacts and then decode.
> > >
> > > master  0.3699
> > > patched 0.3701
> > >
> > > case 4.2
> > > -
> > > After the test steps in case 3 (64 catalog modifying transactions, decode 
> > > 100k
> > > transactions), run 64 DDLs(without checkpoint) and 100k xacts, then 
> > > decode.
> > >
> > > master  0.3687
> > > patched 0.3696
> > >
> > > Summary of the tests:
> > > After applying this patch, there is an overhead of about 3% in the case
> > > decoding 100k transactions with 64 catalog modifying transactions. This
> > > is an extreme case, so maybe it's okay.
> >
> > Yes. If we're worried about the overhead and doing bsearch() is the
> > cause, probably we can try simplehash instead of the array.
> >
>
> I am not sure if we need to go that far for this extremely corner
> case. Let's first try your below idea.
>
> > An improvement idea is that we pass the parsed->xinfo down to
> > SnapBuildXidHasCatalogChanges(), and then return from that function
> > > before doing bsearch() if the parsed->xinfo doesn't have
> > XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
> > non-catalog-modifying transactions. Is it worth trying?
> >
>
> I think this is worth trying and this might reduce some of the
> overhead as well in the case presented by Shi-San.

Okay, I've attached an updated patch that does the above idea. Could
you please do the performance tests again to see if the idea can help
reduce the overhead, Shi yu?
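
The shape of the change is roughly like the following (a simplified sketch,
not the exact patch; the signature and the array-lookup helper below are
placeholders):

bool
SnapBuildXidHasCatalogChanges(SnapBuild *builder, TransactionId xid,
							  uint32 xinfo)
{
	if (ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
		return true;

	/*
	 * Without invalidation messages, this transaction cannot have been
	 * added to the initial running-xacts array as catalog-modifying, so
	 * skip the bsearch() entirely.
	 */
	if ((xinfo & XACT_XINFO_HAS_INVALS) == 0)
		return false;

	/* Placeholder name for the bsearch() over the initial xacts array. */
	return SnapBuildXidHasCatalogChangesInInitialXacts(builder, xid);
}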

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


master-v9-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


RE: Handle infinite recursion in logical replication setup

2022-07-26 Thread shiy.f...@fujitsu.com
On Sun, Jul 24, 2022 1:28 AM vignesh C  wrote:
> 
> Added a  note for the same and referred it to the conflicts section.
> 
> Thanks for the comments, the attached v38 patch has the changes for the
> same.
> 

Thanks for updating the patch. A comment on the test in 0001 patch.

+# Alter subscription ... refresh publication should fail when a new table is
+# subscribing data from a different publication should fail
+($result, $stdout, $stderr) = $node_A->psql(
+   'postgres', "
+ALTER SUBSCRIPTION tap_sub_A2 REFRESH PUBLICATION");
+like(
+   $stderr,
+   qr/ERROR: ( [A-Z0-9]+:)? could not replicate table "public.tab_new"/,
+   "Create subscription with origin and copy_data having replicated table 
in publisher"
+);

The comment says "should fail" twice; the latter one can be removed.

Besides, "Create subscription with origin and copy_data" should be changed to
"Alter subscription with origin and copy_data" I think.

Regards,
Shi yu


Re: Make name optional in CREATE STATISTICS

2022-07-26 Thread Alvaro Herrera
On 2022-Jul-26, Michael Paquier wrote:

> On Mon, Jul 25, 2022 at 12:55:54PM +0200, Alvaro Herrera wrote:
> > Agreed.  I think you already have the query for that elsewhere in the
> > test, so it's just a matter of copying it from there.
> 
> I actually already wrote most of it in 2cbc3c1, and I just needed to
> extend things a bit to detect the OID changes :)
> 
> So, applied, with all the extra tests.

Thank you!

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)




Re: make -C libpq check fails obscurely if tap tests are disabled

2022-07-26 Thread Alvaro Herrera
Ah, I found what the actual problem is: we have sprinkled a few
dependencies on "...-recurse" throughout the tree, but the patch I posted
yesterday changes the manufactured target to "-recursive", as it was
prior to 1bd201214965; so essentially these manually added dependencies
all became silent no-ops.

With this version I keep the target name as -recurse, and at least the
ecpg<->libpq problem is no more AFAICT.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From 90a874bdf0e6f56d9c4ac1370d1765285db57e6b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 25 Jul 2022 12:26:49 +0200
Subject: [PATCH v2] Try to undo 1bd201214965

---
 src/Makefile.global.in | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 14fdd4ef7b..26ecfb2087 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -882,19 +882,14 @@ endif
 # using $(eval).  It will set up a target so that it recurses into a
 # given subdirectory.  For the tree-wide all/install/check/installcheck cases,
 # ensure we do our one-time tasks before recursing (see targets above).
-# Note that to avoid a nasty bug in make 3.80,
-# this function was written to not use any complicated constructs (like
-# multiple targets on a line) and also not contain any lines that expand
-# to more than about 200 bytes.  This is why we make it apply to just one
-# subdirectory at a time, rather than to a list of subdirectories.
 # $1: target name, e.g., all
 # $2: subdir name
 # $3: target to run in subdir, usually same as $1
 define _create_recursive_target
-.PHONY: $(1)-$(2)-recurse
-$(1): $(1)-$(2)-recurse
-$(1)-$(2)-recurse: $(if $(filter all install check installcheck, $(3)), submake-generated-headers) $(if $(filter check, $(3)), temp-install)
-	$$(MAKE) -C $(2) $(3)
+.PHONY: $(patsubst %,$(1)-%-recurse,$(2))
+$(1): $(patsubst %,$(1)-%-recurse,$(2))
+$(2:%=$(1)-%-recurse):
+	$$(MAKE) -C $$(patsubst $(1)-%-recurse,%,$$@) $(3)
 endef
 # Note that the use of $$ on the last line above is important; we want
 # $(MAKE) to be evaluated when the rule is run, not when the $(eval) is run
@@ -905,7 +900,7 @@ endef
 # $1: targets to make recursive (defaults to list of standard targets)
 # $2: list of subdirs (defaults to SUBDIRS variable)
 # $3: target to run in subdir (defaults to current element of $1)
-recurse = $(foreach target,$(if $1,$1,$(standard_targets)),$(foreach subdir,$(if $2,$2,$(SUBDIRS)),$(eval $(call _create_recursive_target,$(target),$(subdir),$(if $3,$3,$(target))
+recurse = $(foreach target,$(if $1,$1,$(standard_targets)),$(eval $(call _create_recursive_target,$(target),$(if $2,$2,$(SUBDIRS)),$(if $3,$3,$(target)
 
 # If a makefile's list of SUBDIRS varies depending on configuration, then
 # any subdirectories excluded from SUBDIRS should instead be added to
-- 
2.30.2



Re: Pluggable toaster

2022-07-26 Thread Aleksander Alekseev
Hi Nikita,

Thanks for an update!

> 0002_toaster_interface_v10 contains TOAST API with Dummy toaster as an 
> example (but I would,
> as recommended, remove Dummy toaster and provide it as an extension), and 
> default Toaster
> was left as-is (reference implementation).
>
> 0003_toaster_default_v9 implements reference TOAST as Default Toaster via 
> TOAST API,
> so Heap AM calls Toast only via API, and does not have direct calls to Toast 
> functionality.
>
> 0004_toaster_snapshot_v8 continues refactoring and has some important changes 
> (added
> into README.toastapi new part related TOAST API extensibility - the virtual 
> functions table).

This numbering is confusing. Please use a command like:

```
git format-patch origin/master -v 42
```

This will produce a patchset with a consistent naming like:

```
v42-0001-foo-bar.patch
v42-0002-baz-qux.patch
... etc ...
```

Also cfbot [1] will know in which order to apply them.

> GIT branch with this patch resides here:
> https://github.com/postgrespro/postgres/tree/toasterapi_clean

Unfortunately the three patches in question from this branch don't
pass `make check`. Please update
src/test/regress/expected/publication.out and make sure the patchset
passes the rest of the tests at least on one platform before
submitting.

Personally I have a little set of scripts for this [2]. The following
commands should pass:

```
# quick check
./quick-build.sh && ./single-install.sh && make installcheck

# full check
./full-build.sh && ./single-install.sh && make installcheck-world
```

Finally, please update the commit messages. Each commit message should
include a brief description (one line), a detailed description (the
body), the list of authors and reviewers, and a link to the discussion.
Please use [3] as a template.

[1]: http://cfbot.cputube.org/
[2]: https://github.com/afiskon/pgscripts/
[3]: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=784cedda0604ee4ac731fd0b00cd8b27e78c02d3

-- 
Best regards,
Aleksander Alekseev




RE: Handle infinite recursion in logical replication setup

2022-07-26 Thread wangw.f...@fujitsu.com
On Sun, Jul 24, 2022 1:28 AM vignesh C  wrote:
> Added a  note for the same and referred it to the conflicts section.
> 
> Thanks for the comments, the attached v38 patch has the changes for the same.

Thanks for your patches.

Two slight comments on the below message in the 0001 patch:

The error message in the function check_pub_table_subscribed().
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("could not replicate table \"%s.%s\"",
+						   nspname, relname),
+					errdetail("CREATE/ALTER SUBSCRIPTION with origin = none and copy_data = on is not allowed when the publisher has subscribed same table."),
+					errhint("Use CREATE/ALTER SUBSCRIPTION with copy_data = off/force."));

1.
I think it might be better to use true/false here than on/off, just for
consistency with another error message (in the function
parse_subscription_options) and the description of this parameter in the
PG documentation.

If you agree with this, please also kindly consider the attached
"slight_modification.diff" file.
(This is a slight modification for the second patch, just replace "off" with
"false" in PG document.)

2.
How about replacing "origin = none" and "copy_data = on" in the message with
"%s"? I think this might be better for translation. Just like the following
error message in the function parse_subscription_options:
```
		if (opts->copy_data &&
			IsSet(opts->specified_opts, SUBOPT_COPY_DATA))
			ereport(ERROR,
					(errcode(ERRCODE_SYNTAX_ERROR),
					 errmsg("%s and %s are mutually exclusive options",
							"connect = false", "copy_data = true/force")));
```
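
For example, the message in check_pub_table_subscribed() could then look
roughly like this (sketch only; the final wording is up to you):

```
ereport(ERROR,
		errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
		errmsg("could not replicate table \"%s.%s\"",
			   nspname, relname),
		errdetail("CREATE/ALTER SUBSCRIPTION with %s and %s is not allowed when the publisher has subscribed same table.",
				  "origin = none", "copy_data = on"),
		errhint("Use CREATE/ALTER SUBSCRIPTION with %s.",
				"copy_data = off/force"));
```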

Regards,
Wang wei


slight_modification.diff
Description: slight_modification.diff


Re: Perform streaming logical transactions by background workers and parallel apply

2022-07-26 Thread Dilip Kumar
On Fri, Jul 22, 2022 at 8:27 AM wangw.f...@fujitsu.com
 wrote:
>
> On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > Attach the news patches.
>
> Not able to apply patches cleanly because the change in HEAD (366283961a).
> Therefore, I rebased the patch based on the changes in HEAD.
>
> Attach the new patches.

+	/* Check the foreign keys. */
+	fkeys = RelationGetFKeyList(entry->localrel);
+	if (fkeys)
+		entry->parallel_apply = PARALLEL_APPLY_UNSAFE;

So if there is a foreign key on any of the tables that are part of a
subscription, then we do not allow changes for that subscription to be
applied in parallel?  I think this is a big limitation, because having a
foreign key on a table is very normal, right?  I agree that if we allow
them then there could be failures due to out-of-order apply, but IMHO we
should not impose the restriction; instead, let it fail if such a
conflict ever occurs, because if there is a conflict the transaction
will be sent again.  Do we see that there could be wrong or inconsistent
results if we allow such things to be executed in parallel?  If not,
then IMHO we are restricting very normal cases just to avoid some
corner-case failures.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-26 Thread shiy.f...@fujitsu.com
On Tue, Jul 26, 2022 3:52 PM Masahiko Sawada  wrote:
> 
> On Tue, Jul 26, 2022 at 2:18 PM Amit Kapila 
> wrote:
> >
> > On Tue, Jul 26, 2022 at 7:00 AM Masahiko Sawada
>  wrote:
> > >
> > > On Mon, Jul 25, 2022 at 7:57 PM shiy.f...@fujitsu.com
> > >  wrote:
> > > >
> > > >
> > > > case 3
> > > > -
> > > > There are 64 catalog modifying transactions.
> > > > Decode 100k/500k/1M transactions.
> > > >
> > > >         100k    500k    1M
> > > > master  0.0600  0.1666  0.4876
> > > > patched 0.0620  0.1653  0.4795
> > > >
> > > >
> > > > Summary of the tests:
> > > > After applying this patch, there is an overhead of about 3% in the
> > > > case decoding 100k transactions with 64 catalog modifying transactions.
> > > > This is an extreme case, so maybe it's okay.
> > >
> > > Yes. If we're worried about the overhead and doing bsearch() is the
> > > cause, probably we can try simplehash instead of the array.
> > >
> >
> > I am not sure if we need to go that far for this extremely corner
> > case. Let's first try your below idea.
> >
> > > An improvement idea is that we pass the parsed->xinfo down to
> > > SnapBuildXidHasCatalogChanges(), and then return from that function
> > > before doing bsearch() if the parsed->xinfo doesn't have
> > > XACT_XINFO_HAS_INVALS. That would save calling bsearch() for
> > > non-catalog-modifying transactions. Is it worth trying?
> > >
> >
> > I think this is worth trying and this might reduce some of the
> > overhead as well in the case presented by Shi-San.
> 
> Okay, I've attached an updated patch that does the above idea. Could
> you please do the performance tests again to see if the idea can help
> reduce the overhead, Shi yu?
> 

Thanks for your improvement. I have tested the case which showed overhead
before (decoding 100k transactions with 64 catalog modifying transactions)
with the v9 patch; the results are as follows.

master  0.0607
patched 0.0613

There's almost no difference compared with master (less than 1%), which looks
good to me.

Regards,
Shi yu


Re: Refactoring postgres_fdw/connection.c

2022-07-26 Thread Fujii Masao




On 2022/07/26 16:25, Kyotaro Horiguchi wrote:

Agree to that refactoring.  And it looks fine to me.


Thanks for reviewing the patches!



I'm not sure the two are really similar to each other.  The new function
pgfdw_exec_pre_commit() looks like a merger of two isolated code paths
intended to share a seven-line codelet.  I feel the code gets a bit
harder to understand after the change.  I mildly oppose this part.


If so, we can split pgfdw_exec_pre_commit() into two: one is the common
function that sends or executes the command (i.e., calls
do_sql_command_begin() or do_sql_command()), and the other is a function
only for the toplevel case. The latter calls the common function and
then handles the DEALLOCATE ALL part.

But this is not the way other functions like pgfdw_abort_cleanup()
are implemented. Those functions contain the code for both toplevel and
!toplevel (i.e., subxact) and run the processing depending
on the argument "toplevel". So I'm thinking that implementing
pgfdw_exec_pre_commit() in the same way is better.



It gives me the same feeling as 0002.  Considering that
pending_deallocate becomes non-NIL only at toplevel, 38 of the 66 lines
of the function are dedicated to the toplevel case.


Same as above.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Perform streaming logical transactions by background workers and parallel apply

2022-07-26 Thread Dilip Kumar
On Tue, Jul 26, 2022 at 2:30 PM Dilip Kumar  wrote:
>
> On Fri, Jul 22, 2022 at 8:27 AM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Tues, Jul 19, 2022 at 10:29 AM I wrote:
> > > Attach the news patches.
> >
> > Not able to apply patches cleanly because the change in HEAD (366283961a).
> > Therefore, I rebased the patch based on the changes in HEAD.
> >
> > Attach the new patches.
>
> +	/* Check the foreign keys. */
> +	fkeys = RelationGetFKeyList(entry->localrel);
> +	if (fkeys)
> +		entry->parallel_apply = PARALLEL_APPLY_UNSAFE;
>
> So if there is a foreign key on any of the tables that are part of a
> subscription, then we do not allow changes for that subscription to be
> applied in parallel?  I think this is a big limitation, because having a
> foreign key on a table is very normal, right?  I agree that if we allow
> them then there could be failures due to out-of-order apply, but IMHO we
> should not impose the restriction; instead, let it fail if such a
> conflict ever occurs, because if there is a conflict the transaction
> will be sent again.  Do we see that there could be wrong or inconsistent
> results if we allow such things to be executed in parallel?  If not,
> then IMHO we are restricting very normal cases just to avoid some
> corner-case failures.

some more comments..
1.
+		/*
+		 * If we have found a free worker or if we are already applying this
+		 * transaction in an apply background worker, then we pass the data to
+		 * that worker.
+		 */
+		if (first_segment)
+			apply_bgworker_send_data(stream_apply_worker, s->len, s->data);

The comment says that if we have found a free worker, or we are already
applying in a worker, then we pass the changes to that worker; but as
per the code here, we only pass them in the first_segment case?

I think what you are trying to say is that if it is first segment then send the

2.
+	/*
+	 * This is the main apply worker. Check if there is any free apply
+	 * background worker we can use to process this transaction.
+	 */
+	if (first_segment)
+		stream_apply_worker = apply_bgworker_start(stream_xid);
+	else
+		stream_apply_worker = apply_bgworker_find(stream_xid);

So currently, whenever we get a new streamed transaction we try to
start a new background worker for it.  Why do we need to start/stop the
background apply worker every time we get a new streamed transaction?
I mean, we could keep the worker in a pool for the time being, and if
there is a new transaction looking for a worker then we could find one
there.  Starting a worker is a costly operation, and since we are using
parallelism here we expect that there would be frequent streamed
transactions needing a parallel apply worker, so why not let the worker
wait for a certain amount of time, so that if the load is low it will
stop anyway, and if the load is high it will be reused for the next
streamed transaction.


3.
Why are we restricting parallel apply workers to only the streamed
transactions?  Streaming depends upon the size of
logical_decoding_work_mem, so making streaming and parallel apply
tightly coupled seems too restrictive to me.  Do we see some obvious
problems in applying other transactions in parallel?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Max compact as an FSM strategy

2022-07-26 Thread Simon Riggs
A long time ago, Tom Lane came up with the idea that when tables get
bloated, tables might be allowed to shrink down again in size
naturally by altering the way FSM allocates blocks. That's a very good
idea, but we didn't implement it back then...

This patch allows the Heap to specify what FreeSpaceStrategy it would
like to see.

(extract from attached patch...)
+typedef enum FreeSpaceStrategy
+{
+	FREESPACE_STRATEGY_MAX_CONCURRENCY = 0,
+	/*
+	 * Each time we ask for a new block with freespace this will set
+	 * the advancenext flag which increments the next block by one.
+	 * The effect of this is to ensure that all backends are given
+	 * a separate block, minimizing block contention and thereby
+	 * maximising concurrency. This is the default strategy used by
+	 * PostgreSQL since at least PostgreSQL 8.4.
+	 */
+	FREESPACE_STRATEGY_MAX_COMPACT
+	/*
+	 * All backends are given the earliest block in the table with
+	 * sufficient freespace for the insert. This could cause block
+	 * contention for concurrent inserts, but ensures maximum data
+	 * compaction, which will then allow vacuum truncation to release
+	 * as much space as possible.  This strategy may be appropriate
+	 * for short periods if a table becomes bloated.
+	 */
+} FreeSpaceStrategy;

All we need is a simple heuristic to allow us to choose between
various strategies.
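
To make that concrete, one very rough possibility (purely illustrative;
the GUC and function names below are made up) would be to key the switch
off the table's dead-tuple statistics:

/* Illustrative sketch only -- not part of the patch. */
static FreeSpaceStrategy
heap_choose_freespace_strategy(Relation rel)
{
	PgStat_StatTabEntry *tabentry =
		pgstat_fetch_stat_tabentry(RelationGetRelid(rel));

	/* fsm_compact_threshold would be a (made-up) GUC, e.g. 0.2 */
	if (tabentry != NULL &&
		tabentry->n_live_tuples > 0 &&
		tabentry->n_dead_tuples > fsm_compact_threshold * tabentry->n_live_tuples)
		return FREESPACE_STRATEGY_MAX_COMPACT;

	return FREESPACE_STRATEGY_MAX_CONCURRENCY;
}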

Your input is welcome! Please read the short patch.

-- 
Simon Riggs                http://www.EnterpriseDB.com/


freespace_strategy.v2.patch
Description: Binary data


Re: PANIC: wrong buffer passed to visibilitymap_clear

2022-07-26 Thread 王伟(学弈)
On Fri, Jul 22, 2022 at 14:49 Peter Geoghegan wrote:
> The line numbers from your stack trace don't match up with
> REL_14_STABLE. Is this actually a fork of Postgres 14? (Oh, looks like
> it's an old beta release.)

Yeah, I was testing on the 14beta2 branch. So I took your advice and
tested on the REL_14_STABLE branch, but I haven't been lucky enough to
reproduce this crash during the past few days.

> It would also be helpful if you told us about the specific table
> involved. Though the important thing (the essential thing) is to test
> today's REL_14_STABLE. There have been *lots* of bug fixes since
> Postgres 14 beta2 was current.

Besides, the logic of the heap_update and RelationGetBufferForTuple functions
is mostly the same between the 14beta2 and REL_14_STABLE versions.
So, I took a deeper look at heap_update and RelationGetBufferForTuple and
found one code path that misses rechecking the all-visible flag after
locking the buffer with BUFFER_LOCK_EXCLUSIVE.

The following is the code path:
(1) heap_update calls RelationGetBufferForTuple to get a new suitable buffer;
(2) during the 'loop' loop of RelationGetBufferForTuple, it looks up one
suitable new buffer via the FSM. But assume that it fails to find an
available buffer here, and the old buffer ('otherBuffer') does not have the
all-visible flag set at this point.
(3) Next, it decides to bulk-extend the relation via RelationAddExtraBlocks
and gets one new locked empty buffer via ReadBufferBI while holding the
relation extension lock (ExclusiveLock).
(4) Then, it succeeds in ConditionalLockBuffer on the old buffer
('otherBuffer') and returns the new buffer number without rechecking the
all-visibility flag of the old buffer.
(5) Finally, heap_update does the real update work and clears the
all-visibility flag of both the old buffer and the new buffer. It finds that
the old buffer now has the all-visibility flag set, but vmbuffer is still 0
(invalid). At last, visibilitymap_clear reports the error 'wrong buffer
passed to visibilitymap_clear'.

I propose one patch which rechecks the all-visibility flag of the old buffer
after ConditionalLockBuffer on the old buffer succeeds. I have been testing
this patch for a couple of days and it seems to work well for me so far.
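
In rough terms, the recheck looks like the following (a simplified sketch of
the idea, not the exact patch; note that the real code must also consider
that visibilitymap_pin() may need to do I/O while the buffer lock is held):

	/* otherBuffer is now exclusively locked; re-verify the VM pin. */
	if (otherBuffer != InvalidBuffer &&
		PageIsAllVisible(BufferGetPage(otherBuffer)) &&
		!visibilitymap_pin_ok(otherBlock, *vmbuffer_other))
		visibilitymap_pin(relation, otherBlock, vmbuffer_other);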

--
Regards,
rogers.ww

fix_missed_visibility_map_pin_for_old_buffer_after_extend.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-07-26 Thread Peter Smith
Here are some review comment for patch v19-0001:

==

1.1 Missing docs for protocol version

Since you bumped the logical replication protocol version for this
patch, now there is some missing documentation to describe this new
protocol version. e.g. See here [1]

==

1.2 doc/src/sgml/config.sgml

+   
+Maximum number of apply background workers per subscription. This
+parameter controls the amount of parallelism of the streaming of
+in-progress transactions when subscription parameter
+streaming = parallel.
+   

SUGGESTION
Maximum number of apply background workers per subscription. This
parameter controls the amount of parallelism for streaming of
in-progress transactions with subscription parameter
streaming = parallel.

==

1.3 src/sgml/protocol.sgml

@@ -6809,6 +6809,25 @@ psql "dbname=postgres replication=database" -c
"IDENTIFY_SYSTEM;"

   

+  
+   Int64 (XLogRecPtr)
+   
+
+ The LSN of the abort.
+
+   
+  
+
+  
+   Int64 (TimestampTz)
+   
+
+ Abort timestamp of the transaction. The value is in number
+ of microseconds since PostgreSQL epoch (2000-01-01).
+
+   
+  

There are missing notes on these new fields. They both should say
something like "This field is available since protocol version 4."
(See similar examples on the same docs page)

==

1.4 src/backend/replication/logical/applybgworker.c - apply_bgworker_start

Previously [1] I wrote:
> The Assert that the entries in the free-list are FINISHED seems like
> unnecessary checking. IIUC, code is already doing the Assert that
> entries are FINISHED before allowing them into the free-list in the
> first place.

IMO this Assert just causes unnecessary doubts, but if you really want
to keep it then I think it belongs logically *above* the
list_delete_last.

~~~

1.5 src/backend/replication/logical/applybgworker.c - apply_bgworker_start

+ server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
+ wstate->shared->server_version =
+ server_version >= 16 ? LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
+ server_version >= 15 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+ server_version >= 14 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+ LOGICALREP_PROTO_VERSION_NUM;

It makes no sense to assign a protocol version to a server_version.
Perhaps it is just a simple matter of a poorly named struct member.
e.g., maybe everything is OK if it said something like
wstate->shared->proto_version.

~~~

1.6 src/backend/replication/logical/applybgworker.c - LogicalApplyBgwLoop

+/* Apply Background Worker main loop */
+static void
+LogicalApplyBgwLoop(shm_mq_handle *mqh, volatile ApplyBgworkerShared *shared)

'shared' seems a very vague param name. Maybe can be 'bgw_shared' or
'parallel_shared' or something better?

~~~

1.7 src/backend/replication/logical/applybgworker.c - ApplyBgworkerMain

+/*
+ * Apply Background Worker entry point
+ */
+void
+ApplyBgworkerMain(Datum main_arg)
+{
+ volatile ApplyBgworkerShared *shared;

'shared' seems a very vague var name. Maybe can be 'bgw_shared' or
'parallel_shared' or something better?

~~~

1.8 src/backend/replication/logical/applybgworker.c - apply_bgworker_setup_dsm

+static void
+apply_bgworker_setup_dsm(ApplyBgworkerState *wstate)
+{
+ shm_toc_estimator e;
+ Size segsize;
+ dsm_segment *seg;
+ shm_toc*toc;
+ ApplyBgworkerShared *shared;
+ shm_mq*mq;

'shared' seems a very vague var name. Maybe can be 'bgw_shared' or
'parallel_shared' or something better?

~~~

1.9 src/backend/replication/logical/applybgworker.c - apply_bgworker_setup_dsm

+ server_version = walrcv_server_version(LogRepWorkerWalRcvConn);
+ shared->server_version =
+ server_version >= 16 ? LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM :
+ server_version >= 15 ? LOGICALREP_PROTO_TWOPHASE_VERSION_NUM :
+ server_version >= 14 ? LOGICALREP_PROTO_STREAM_VERSION_NUM :
+ LOGICALREP_PROTO_VERSION_NUM;

Same as earlier review comment #1.5

==

1.10 src/backend/replication/logical/worker.c

@@ -22,8 +22,28 @@
  * STREAMED TRANSACTIONS
  * -
  * Streamed transactions (large transactions exceeding a memory limit on the
- * upstream) are not applied immediately, but instead, the data is written
- * to temporary files and then applied at once when the final commit arrives.
+ * upstream) are applied using one of two approaches.
+ *
+ * 1) Separate background workers

"two approaches." --> "two approaches:"

~~~

1.11 src/backend/replication/logical/worker.c - apply_handle_stream_abort

+ /* Check whether the publisher sends abort_lsn and abort_time. */
+ if (am_apply_bgworker())
+ read_abort_lsn = MyParallelShared->server_version >=
+ LOGICALREP_PROTO_STREAM_PARALLEL_VERSION_NUM;

IMO makes no sense to compare a server version with a protocol
version. Same as review comment #1.5

==

1.12 src/include/replication/worker_internal.h

+typedef stru

Re: SI-read predicate locks on materialized views

2022-07-26 Thread Richard Guo
On Tue, Jul 26, 2022 at 3:44 PM Yugo NAGATA  wrote:

> If these two transactions run concurrently, a write skew anomaly occurs,
> and the result of order_summary refreshed in T1 will not contain the
> record inserted in T2.


Indeed we have write skew anomaly here between the two transactions.


> On the other hand, if the materialized view participates in predicate
> locking and the transaction isolation level is SERIALIZABLE, this
> anomaly can be avoided; one of the transactions will be aborted and
> suggested to be retried.


The idea works for me.

Thanks
Richard


Re: Max compact as an FSM strategy

2022-07-26 Thread Dilip Kumar
On Tue, Jul 26, 2022 at 3:04 PM Simon Riggs
 wrote:
>
> A long time ago, Tom Lane came up with the idea that when tables get
> bloated, tables might be allowed to shrink down again in size
> naturally by altering the way FSM allocates blocks. That's a very good
> idea, but we didn't implement it back then...
>
> This patch allows the Heap to specify what FreeSpaceStrategy it would
> like to see.
>
> (extract from attached patch...)
> +typedef enum FreeSpaceStrategy
> +{
> +	FREESPACE_STRATEGY_MAX_CONCURRENCY = 0,
> +	/*
> +	 * Each time we ask for a new block with freespace this will set
> +	 * the advancenext flag which increments the next block by one.
> +	 * The effect of this is to ensure that all backends are given
> +	 * a separate block, minimizing block contention and thereby
> +	 * maximising concurrency. This is the default strategy used by
> +	 * PostgreSQL since at least PostgreSQL 8.4.
> +	 */
> +	FREESPACE_STRATEGY_MAX_COMPACT
> +	/*
> +	 * All backends are given the earliest block in the table with
> +	 * sufficient freespace for the insert. This could cause block
> +	 * contention for concurrent inserts, but ensures maximum data
> +	 * compaction, which will then allow vacuum truncation to release
> +	 * as much space as possible.  This strategy may be appropriate
> +	 * for short periods if a table becomes bloated.
> +	 */
> +} FreeSpaceStrategy;

I think this is a really interesting idea.  So IIUC this patch enables
an option to select between the strategies but doesn't yet decide on that.

> All we need is a simple heuristic to allow us to choose between
> various strategies.

I think it would be really interesting to see what would be the exact
deciding point between these strategies.  Because when we switch from
CONCURRENCY to COMPACT it would immediately affect the insert/update
performance but it would control the bloat.  So I am not sure whether
the selection should be completely based on the heuristic or there
should be some GUC parameter where the user can decide at what point
we should switch to the COMPACT strategy or it should not at all
switch?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-07-26 Thread Bharath Rupireddy
On Fri, May 13, 2022 at 6:41 PM Robert Haas  wrote:
>
> On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
>  wrote:
> > Here's the rebased v9 patch.
>
> This seems like it has enormous overlap with the existing
> functionality that we have from log_startup_progress_interval.
>
> I think that facility is also better-designed than this one. It prints
> out a message based on elapsed time, whereas this patch prints out a
> message based progress through the WAL. That means that if WAL replay
> isn't actually advancing for some reason, you just won't get any log
> messages and you don't know whether it's advancing slowly or not at
> all or the server is just hung. With that facility you can distinguish
> those cases.
>
> Also, if for some reason we do think that amount of WAL replayed is
> the right metric, rather than time, why would we only allow high=1
> segment and low=128 segments, rather than say any number of MB or GB
> that the user would like to configure?
>
> I suggest that if log_startup_progress_interval doesn't meet your
> needs here, we should try to understand why not and maybe enhance it,
> instead of adding a separate facility.

After thinking for a while, I agree with Robert and others that we
could leverage the existing log_startup_progress_interval mechanism
for reporting which WAL file currently is being replayed. I added
current TLI (helps to construct the WAL file name from current LSN)
and current WAL file source (helps to know where the WAL file was
fetched from) to the existing "redo in progress, elapsed time:..."
message. This very well serves the purpose of identifying the issues
such as the restore command taking a lot of time (>
log_startup_progress_interval for instance), WAL replay rate on
standby or primary for long recoveries and so on.
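
Concretely, the call in the redo apply loop would then look something like
this (just a sketch; the TLI and source variables shown are illustrative,
not necessarily the final names):

	ereport_startup_progress("redo in progress, elapsed time: %ld.%02d s, "
							 "current LSN: %X/%X, TLI: %u, source: %s",
							 LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
							 replayTLI,
							 xlogSourceNames[currentSource]);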

However, ereport_startup_progress isn't enabled on standby to not let
it bloat the server logs. I believe the "redo in progress, elapsed
time:..." message can provide some important info/metric for standby
too and there's no way for the users to enable it on standbys today.
For instance, users can know how well the standby fares in replaying;
they can figure this out by looking at two or more such messages. If
enabled, with default value of 10 sec for
log_startup_progress_interval, the standby can emit 8640 messages per
day which is too much. I'm not sure if we are okay to change the
default value of log_startup_progress_interval to say 1min or 5min so
that 1440 messages are emitted. In production environments, typically
users may or may not be interested if recovery takes just 10sec, but
they really are interested if it takes in the order of minutes.
Basically, I would like to enable "redo in progress, elapsed time:..."
message for standbys too.

Thoughts?

PSA v10 patch with enhanced "redo in progress, elapsed time:..."
message. Note that it's not a final patch though.

Regards,
Bharath Rupireddy.


v10-0001-Add-WAL-recovery-info-to-startup-progress-log-me.patch
Description: Binary data


Re: Fast COPY FROM based on batch insert

2022-07-26 Thread Etsuro Fujita
On Fri, Jul 22, 2022 at 5:42 PM Andrey Lepikhov
 wrote:
> So, maybe switch
> status of this patch to 'Ready for committer'?

Yeah, I think the patch is getting better, but I noticed some issues,
so I'm working on them.  I think I can post a new version in the next
few days.

Best regards,
Etsuro Fujita




Re: Refactoring postgres_fdw/connection.c

2022-07-26 Thread Etsuro Fujita
Fujii-san,

On Tue, Jul 26, 2022 at 12:55 AM Fujii Masao
 wrote:
> When reviewing the postgres_fdw parallel-abort patch [1], I found that
> there are several duplicate codes in postgres_fdw/connection.c.
> Which seems to make it harder to review the patch changing connection.c.
> So I'd like to remove such duplicate codes and refactor the functions
> in connection.c. I attached the following three patches.
>
> There are two functions, pgfdw_get_result() and pgfdw_get_cleanup_result(),
> to get a query result. They have almost the same code, call PQisBusy(),
> WaitLatchOrSocket(), PQconsumeInput() and PQgetResult() in the loop,
> but only pgfdw_get_cleanup_result() allows its callers to specify the timeout.
> 0001 patch transforms pgfdw_get_cleanup_result() to the common function
> to get a query result and makes pgfdw_get_result() use it instead of
> its own (duplicate) code. The patch also renames pgfdw_get_cleanup_result()
> to pgfdw_get_result_timed().
>
> pgfdw_xact_callback() and pgfdw_subxact_callback() have similar codes to
> issue COMMIT or RELEASE SAVEPOINT commands. 0002 patch adds the common 
> function,
> pgfdw_exec_pre_commit(), for that purpose, and changes those functions
> so that they use the common one.
>
> pgfdw_finish_pre_commit_cleanup() and pgfdw_finish_pre_subcommit_cleanup()
> have similar codes to wait for the results of COMMIT or RELEASE SAVEPOINT 
> commands.
> 0003 patch adds the common function, pgfdw_finish_pre_commit(), for that 
> purpose,
> and replaces those functions with the common one.
> That is, pgfdw_finish_pre_commit_cleanup() and 
> pgfdw_finish_pre_subcommit_cleanup()
> are no longer necessary and 0003 patch removes them.
>
> [1] https://commitfest.postgresql.org/38/3392/

Thanks for working on this!  I'd like to review this after the end of
the current CF.  Could you add this to the next CF?

Best regards,
Etsuro Fujita




Re: Refactoring postgres_fdw/connection.c

2022-07-26 Thread Fujii Masao




On 2022/07/26 19:26, Etsuro Fujita wrote:

Thanks for working on this!  I'd like to review this after the end of
the current CF.


Thanks!



 Could you add this to the next CF?


Yes.
https://commitfest.postgresql.org/39/3782/

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Max compact as an FSM strategy

2022-07-26 Thread Simon Riggs
On Tue, 26 Jul 2022 at 11:02, Dilip Kumar  wrote:
>
> On Tue, Jul 26, 2022 at 3:04 PM Simon Riggs
>  wrote:
> >
> > A long time ago, Tom Lane came up with the idea that when tables get
> > bloated, tables might be allowed to shrink down again in size
> > naturally by altering the way FSM allocates blocks. That's a very good
> > idea, but we didn't implement it back then...
> >
> > This patch allows the Heap to specify what FreeSpaceStrategy it would
> > like to see.

> I think this is a really interesting idea.  So IIUC this patch enables
> an option to select between the strategies but doesn't yet decide on that.

Correct

> > All we need is a simple heuristic to allow us to choose between
> > various strategies.
>
> I think it would be really interesting to see what would be the exact
> deciding point between these strategies.  Because when we switch from
> CONCURRENCY to COMPACT it would immediately affect the insert/update
> performance but it would control the bloat.  So I am not sure whether
> the selection should be completely based on the heuristic or there
> should be some GUC parameter where the user can decide at what point
> we should switch to the COMPACT strategy or it should not at all
> switch?

How and when is the right question. I am happy to hear thoughts and
opinions from others before coming up with a specific scheme to do
that.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: Race between KeepFileRestoredFromArchive() and restartpoint

2022-07-26 Thread David Steele

Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:

On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path.  I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries.  In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off).  XLogFileInit() will fail.


Here's the implementation.  Patches 1-4 suffice to stop the user-visible
ERROR.  Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation.  Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit.  Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint.  Before commit 63653f7 (2002), preallocation created more files.


This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

[1] 
https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com





Re: make -C libpq check fails obscurely if tap tests are disabled

2022-07-26 Thread Andres Freund
Hi,

On 2022-07-22 19:50:02 -0400, Tom Lane wrote:
> I wrote:
> > So it'll work in 3.81 (released 2006) and later, but not 3.80.
> 
> Confirmed that things are fine with 3.81.

Thanks for looking into this Alvaro, Andrew, Justin, Tom - I was on
vacation...

Greetings,

Andres Freund




Re: Cygwin cleanup

2022-07-26 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Jul 26, 2022 at 4:34 PM Tom Lane  wrote:
>> If that's an accurate statement, shouldn't we just drop Cygwin support?

> This thread rejected the idea last time around:
> https://www.postgresql.org/message-id/flat/136712b0-0619-5619-4634-0f0286acaef7%402ndQuadrant.com

I think maybe we should re-open the discussion.  I've certainly
reached the stage of fed-up-ness.  That platform seems seriously
broken, upstream is making no progress on fixing it, and there
doesn't seem to be any real-world use-case.  The only positive
argument for it is that Readline doesn't work in the other
Windows builds --- but we've apparently not rechecked that
statement in eighteen years, so maybe things are better now.

If we could just continue to blithely ignore lorikeet's failures,
I wouldn't mind so much; but doing any significant amount of new
code development work for the platform seems like throwing away
developer time.

regards, tom lane




Re: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?

2022-07-26 Thread David Steele

On 7/25/22 22:49, Kyotaro Horiguchi wrote:

At Mon, 25 Jul 2022 14:21:38 +0530, Bharath Rupireddy 
 wrote in

Hm. I think we must take this opportunity to clean it up. You are
right, we don't need to parse the label file contents (just like we
used to do previously after reading it from the file) in
do_pg_backup_stop(), instead we can just pass a structure. Also,
do_pg_backup_stop() isn't modifying any labelfile contents, but using
startxlogfilename, startpoint and backupfrom from the labelfile
contents. I think this information can easily be passed as a single
structure. In fact, I might think a bit more here and wrap label_file,
tblspc_map_file to a single structure something like below and pass it
across the functions.

typedef struct BackupState
{
StringInfo label_file;
StringInfo tblspc_map_file;
char startxlogfilename[MAXFNAMELEN];
XLogRecPtr startpoint;
char backupfrom[20];
} BackupState;

This way, the code is more readable, structured and we can remove 2
sscanf() calls, 2 "invalid data in file" errors, 1 strchr() call, 1
strstr() call. Only thing is that it creates code diff from the
previous PG versions which is fine IMO. If okay, I'm happy to prepare
a patch.

Thoughts?


It is more or less what was in my mind, but it seems that we don't
need StringInfo there, or should avoid it to signal the strings are
not editable.


I would prefer to have all the components of backup_label stored 
separately and then generate backup_label from them in pg_backup_stop().


For PG16 I am planning to add some fields to backup_label that are not 
known when pg_backup_start() is called, e.g. min recovery time.


Regards,
-David




Re: Allow parallel plan for referential integrity checks?

2022-07-26 Thread Frédéric Yhuel




On 4/14/22 14:25, Frédéric Yhuel wrote:



On 3/19/22 01:57, Imseih (AWS), Sami wrote:
I looked at your patch and it's a good idea to make foreign key validation
use parallel query on large relations.

It would be valuable to add logging to ensure that the ActiveSnapshot
and TransactionSnapshot is the same for the leader and the workers.
This logging could be tested in the TAP test.

Also, inside RI_Initial_Check you may want to set max_parallel_workers to
max_parallel_maintenance_workers.

Currently the work_mem is set to maintenance_work_mem. This will also
require a doc change to call out.

/*
 * Temporarily increase work_mem so that the check query can be executed
 * more efficiently.  It seems okay to do this because the query is simple
 * enough to not use a multiple of work_mem, and one typically would not
 * have many large foreign-key validations happening concurrently.  So
 * this seems to meet the criteria for being considered a "maintenance"
 * operation, and accordingly we use maintenance_work_mem.  However, we




Hello Sami,

Thank you for your review!

I will try to do as you say, but it will take time, since my daily job 
as a database consultant takes most of my time and energy.




Hi,

As suggested by Jacob, here is a quick message to say that I didn't find 
enough time to work further on this patch, but I didn't completely 
forget it either. I moved it to the next commitfest. Hopefully I will 
find enough time and motivation in the coming months :-)


Best regards,
Frédéric




Re: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?

2022-07-26 Thread Bharath Rupireddy
On Tue, Jul 26, 2022 at 5:22 PM David Steele  wrote:
>
> On 7/25/22 22:49, Kyotaro Horiguchi wrote:
> > At Mon, 25 Jul 2022 14:21:38 +0530, Bharath Rupireddy 
> >  wrote in
> >> Hm. I think we must take this opportunity to clean it up. You are
> >> right, we don't need to parse the label file contents (just like we
> >> used to do previously after reading it from the file) in
> >> do_pg_backup_stop(), instead we can just pass a structure. Also,
> >> do_pg_backup_stop() isn't modifying any labelfile contents, but using
> >> startxlogfilename, startpoint and backupfrom from the labelfile
> >> contents. I think this information can easily be passed as a single
> >> structure. In fact, I might think a bit more here and wrap label_file,
> >> tblspc_map_file to a single structure something like below and pass it
> >> across the functions.
> >>
> >> typedef struct BackupState
> >> {
> >> StringInfo label_file;
> >> StringInfo tblspc_map_file;
> >> char startxlogfilename[MAXFNAMELEN];
> >> XLogRecPtr startpoint;
> >> char backupfrom[20];
> >> } BackupState;
> >>
> >> This way, the code is more readable, structured and we can remove 2
> >> sscanf() calls, 2 "invalid data in file" errors, 1 strchr() call, 1
> >> strstr() call. Only thing is that it creates code diff from the
> >> previous PG versions which is fine IMO. If okay, I'm happy to prepare
> >> a patch.
> >>
> >> Thoughts?
> >
> > It is more or less what was in my mind, but it seems that we don't
> > need StringInfo there, or should avoid it to signal the strings are
> > not editable.
>
> I would prefer to have all the components of backup_label stored
> separately and then generate backup_label from them in pg_backup_stop().

+1, because pg_backup_stop is the one that's returning backup_label
contents, so it does make sense for it to prepare it once and for all
and return.
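
As a rough illustration of that direction (purely a sketch built on the
BackupState fields proposed above, assuming stringinfo.h and the xlog
headers are available; the label lines shown are abbreviated, not the
full backup_label format):

static char *
build_backup_label(const BackupState *state)
{
	StringInfoData buf;

	initStringInfo(&buf);
	appendStringInfo(&buf, "START WAL LOCATION: %X/%X (file %s)\n",
					 LSN_FORMAT_ARGS(state->startpoint),
					 state->startxlogfilename);
	appendStringInfo(&buf, "BACKUP FROM: %s\n", state->backupfrom);
	/* ... remaining lines (CHECKPOINT LOCATION, LABEL, START TIME, ...) ... */

	return buf.data;
}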

> For PG16 I am planning to add some fields to backup_label that are not
> known when pg_backup_start() is called, e.g. min recovery time.

Can you please point to your patch that does above?

Yes, right now, backup_label or tablespace_map contents are being
filled in by pg_backup_start and are never changed again. But if your
above proposal is for fixing some issue, then it would make sense for
us to carry all the info in a structure to pg_backup_stop and then let
it prepare the backup_label and tablespace_map contents.

If the approach is okay for the hackers, I would like to spend time on it.

Regards,
Bharath Rupireddy.




Re: Proposal to provide the facility to set binary format output for specific OID's per session

2022-07-26 Thread Dave Cramer
Hi Sehrope,




On Mon, 25 Jul 2022 at 17:53, Dave Cramer  wrote:

> Hi Sehrope,
>
>
> On Mon, 25 Jul 2022 at 17:22, Sehrope Sarkuni  wrote:
>
>> Idea here makes sense and I've seen this brought up repeatedly on the
>> JDBC lists.
>>
>> Does the driver need to be aware that this SET command was executed? I'm
>> wondering what happens if an end user executes this with an OID the driver
>> does not actually know how to handle.
>>
> I suppose there would be a failure to read the attribute correctly.
>
>>
>> > + Oid *tmpOids = palloc(length+1);
>> > ...
>> > + tmpOids = repalloc(tmpOids, length+1);
>>
>> These should be: sizeof(Oid) * (length + 1)
>>
>
> Yes they should, thanks!
>
>>
>> Also, I think you need to specify an explicit context via
>> MemoryContextAlloc or the allocated memory will be in the default context
>> and released at the end of the command.
>>
>
> Also good catch
>
> Thanks,
>

Attached patch to correct these deficiencies.
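
For anyone skimming the thread, a minimal sketch of the corrected
allocation (the variable and function names here are hypothetical, and
TopMemoryContext is only an assumption about the intended lifetime of
the array):

#include "postgres.h"
#include "utils/memutils.h"

/* hypothetical session-lifetime array of OIDs to send in binary format */
static Oid *binaryFormatOids = NULL;
static int	nBinaryFormatOids = 0;

static void
add_binary_format_oid(Oid oid)
{
	if (binaryFormatOids == NULL)
		binaryFormatOids = (Oid *)
			MemoryContextAlloc(TopMemoryContext, sizeof(Oid));
	else
		binaryFormatOids = (Oid *)
			repalloc(binaryFormatOids,
					 sizeof(Oid) * (nBinaryFormatOids + 1));

	/* repalloc keeps the chunk in its original (long-lived) context */
	binaryFormatOids[nBinaryFormatOids++] = oid;
}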

Thanks again,


>
> Dave
>
>>


0002-allocate-the-correct-amount-of-memory-in-permanent-s.patch
Description: Binary data


Re: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?

2022-07-26 Thread David Steele




On 7/26/22 07:59, Bharath Rupireddy wrote:

On Tue, Jul 26, 2022 at 5:22 PM David Steele  wrote:


On 7/25/22 22:49, Kyotaro Horiguchi wrote:

At Mon, 25 Jul 2022 14:21:38 +0530, Bharath Rupireddy 
 wrote in

Hm. I think we must take this opportunity to clean it up. You are
right, we don't need to parse the label file contents (just like we
used to do previously after reading it from the file) in
do_pg_backup_stop(), instead we can just pass a structure. Also,
do_pg_backup_stop() isn't modifying any labelfile contents, but using
startxlogfilename, startpoint and backupfrom from the labelfile
contents. I think this information can easily be passed as a single
structure. In fact, I might think a bit more here and wrap label_file,
tblspc_map_file to a single structure something like below and pass it
across the functions.

typedef struct BackupState
{
StringInfo label_file;
StringInfo tblspc_map_file;
char startxlogfilename[MAXFNAMELEN];
XLogRecPtr startpoint;
char backupfrom[20];
} BackupState;

This way, the code is more readable, structured and we can remove 2
sscanf() calls, 2 "invalid data in file" errors, 1 strchr() call, 1
strstr() call. Only thing is that it creates code diff from the
previous PG versions which is fine IMO. If okay, I'm happy to prepare
a patch.

Thoughts?


It is more or less what was in my mind, but it seems that we don't
need StringInfo there, or should avoid it to signal the strings are
not editable.


I would prefer to have all the components of backup_label stored
separately and then generate backup_label from them in pg_backup_stop().


+1, because pg_backup_stop is the one that's returning backup_label
contents, so it does make sense for it to prepare it once and for all
and return.


For PG16 I am planning to add some fields to backup_label that are not
known when pg_backup_start() is called, e.g. min recovery time.


Can you please point to your patch that does above?


Currently it is a plan, not a patch. So there is nothing to show yet.


Yes, right now, backup_label or tablespace_map contents are being
filled in by pg_backup_start and are never changed again. But if your
above proposal is for fixing some issue, then it would make sense for
us to carry all the info in a structure to pg_backup_stop and then let
it prepare the backup_label and tablespace_map contents.


I think this makes sense even if I don't get these changes into PG16.


If the approach is okay for the hackers, I would like to spend time on it.


+1 from me.

Regards,
-David




Re: making relfilenodes 56 bits

2022-07-26 Thread Ashutosh Sharma
/*
 * If relfilenumber is unspecified by the caller then create storage
-* with oid same as relid.
+* with relfilenumber same as relid if it is a system table
otherwise
+* allocate a new relfilenumber.  For more details read comments
atop
+* FirstNormalRelFileNumber declaration.
 */
if (!RelFileNumberIsValid(relfilenumber))
-   relfilenumber = relid;
+   {
+   relfilenumber = relid < FirstNormalObjectId ?
+   relid : GetNewRelFileNumber();

Above code says that in the case of system table we want relfilenode to be
the same as object id. This technically means that the relfilenode or oid
for the system tables would not be exceeding 16383. However in the below
lines of code added in the patch, it says there is some chance for the
storage path of the user tables from the old cluster conflicting with the
storage path of the system tables in the new cluster. Assuming that the
OIDs for the user tables on the old cluster would start with 16384 (the
first object ID), I see no reason why there would be a conflict.

+/* --
+ * RelFileNumber zero is InvalidRelFileNumber.
+ *
+ * For the system tables (OID < FirstNormalObjectId) the initial storage
+ * will be created with the relfilenumber same as their oid.  And, later
for
+ * any storage the relfilenumber allocated by GetNewRelFileNumber() will
start
+ * at 10.  Thus, when upgrading from an older cluster, the relation
storage
+ * path for the user table from the old cluster will not conflict with the
+ * relation storage path for the system table from the new cluster.
Anyway,
+ * the new cluster must not have any user tables while upgrading, so we
needn't
+ * worry about them.
+ * --
+ */
+#define FirstNormalRelFileNumber   ((RelFileNumber) 10)

==

When WAL logging the next object id we have chosen the xlog threshold
value of 8192, whereas for relfilenode it is 512. Any reason for choosing
this low arbitrary value in the case of relfilenumber?

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 1:32 PM Dilip Kumar  wrote:

> On Thu, Jul 21, 2022 at 9:53 AM Thomas Munro 
> wrote:
> >
> > On Wed, Jul 20, 2022 at 11:27 PM Dilip Kumar 
> wrote:
> > > [v10 patch set]
> >
> > Hi Dilip, I'm experimenting with these patches and will hopefully have
> > more to say soon, but I just wanted to point out that this builds with
> > warnings and failed on 3/4 of the CI OSes on cfbot's last run.  Maybe
> > there is the good kind of uninitialised data on Linux, and the bad
> > kind of uninitialised data on those other pesky systems?
>
> Here is the patch to fix the issue, basically, while asserting for the
> file existence it was not setting the relfilenumber in the
> relfilelocator before generating the path so it was just checking for
> the existence of the random path so it was asserting randomly.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: log_line_prefix: make it possible to add the search_path

2022-07-26 Thread Pierre
On Tuesday, July 26, 2022 3:08:01 AM CEST Lukas Fittl wrote:
> On Mon, Jul 25, 2022 at 12:38 AM Pierre Ducroquet 
> 
> wrote:
> > usecase by not showing the schema, one of them being log_line_prefix.
> > It is possible to work around this using the application_name, but a
> > mistake
> > on the application side would be fatal, while the search_path would still
> > indicate the real tables used in a query.
> 
> I'm assuming this is mostly referring to STATEMENT log lines and other
> situations where the original query is output (e.g. auto_explain).
> 
> +1 on the benefit of solving this (I've had this use case before), but I
> think we can keep this more specific than a general log_line_prefix option.
> The search_path isn't relevant to any log line that doesn't reference a
> query, since e.g. autovacuum log output fully qualifies its relation names,
> and many other common log lines have nothing to do with tables or queries.
> 
> What if we instead had something like this, as an extra CONTEXT (or DETAIL)
> log line:
> 
> LOG: duration: 4079.697 ms execute :
> SELECT * FROM x WHERE y = $1 LIMIT $2
> DETAIL: parameters: $1 = 'long string', $2 = '1'
> CONTEXT: settings: search_path = 'my_tenant_schema, "$user", public'
> 
> That way you could determine that the slow query was affecting the "x"
> table in "my_tenant_schema".
> 
> This log output would be controlled by a new GUC, e.g.
> "log_statement_search_path" with three settings: (1) never, (2)
> non_default, (3) always.
> 
> The default would be "never" (same as today). "non_default" would output
> the search path when a SET has modified it in the current session (and so
> we couldn't infer it from the config or the role/database overrides).
> "always" would always output the search path for statement-related log
> lines.
> 
> Thanks,
> Lukas

Hi

This is a good idea. I've hacked up a first implementation of it (lacking
documentation, and several log sites are still missing), attached to this email.
The biggest issue I had was knowing where the setting comes from, since no
guc.h function exposes that information. I worked around this a bit, but I'm
sure it would be preferable to do it another way.

Thanks for your feedback

Regards

 Pierre
From 11bbdec931481ffc02ce147db2c7c8b08e6503e4 Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Tue, 26 Jul 2022 14:39:59 +0200
Subject: [PATCH] log the search_path of statements (configurable)

Introduce a new GUC, log_statement_search_path, to make it possible
to log the search_path of queries.
---
 src/backend/catalog/namespace.c |  4 
 src/backend/tcop/postgres.c | 34 -
 src/backend/utils/misc/guc.c| 11 +++
 src/include/catalog/namespace.h |  2 ++
 src/include/tcop/tcopprot.h |  8 
 src/include/utils/guc.h |  1 +
 6 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index fafb9349cc..ec6c260f15 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -197,6 +197,7 @@ static SubTransactionId myTempNamespaceSubID = InvalidSubTransactionId;
  * of the GUC variable 'search_path'.
  */
 char	   *namespace_search_path = NULL;
+GucSource	namespace_search_path_source = 0;
 
 
 /* Local functions */
@@ -4348,6 +4349,9 @@ check_search_path(char **newval, void **extra, GucSource source)
 	pfree(rawname);
 	list_free(namelist);
 
+	/* Store the source */
+	namespace_search_path_source = source;
+
 	return true;
 }
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index d0bbd30d2b..4fa6260f63 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -39,6 +39,7 @@
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
@@ -96,6 +97,15 @@ bool		Log_disconnections = false;
 
 int			log_statement = LOGSTMT_NONE;
 
+const struct config_enum_entry log_statement_search_path_options[] = {
+{"never", LOG_STATEMENT_SEARCH_PATH_NEVER, false},
+{"non_default", LOG_STATEMENT_SEARCH_PATH_NON_DEFAULT, false},
+{"always", LOG_STATEMENT_SEARCH_PATH_ALWAYS, false},
+{NULL, 0, false}
+};
+
+int			log_statement_search_path = LOG_STATEMENT_SEARCH_PATH_NEVER;
+
 /* GUC variable for maximum stack depth (measured in kilobytes) */
 int			max_stack_depth = 100;
 
@@ -1328,11 +1338,25 @@ exec_simple_query(const char *query_string)
 	 errhidestmt(true)));
 			break;
 		case 2:
-			ereport(LOG,
-	(errmsg("duration: %s ms  statement: %s",
-			msec_str, query_string),
-	 errhidestmt(true),
-	 errdetail_execute(parsetree_list)));
+			if ((log_statement_search_path > LOG_STATEMENT_SEARCH_PATH_NEVER)
+&& (
+	namespace_search_path_source > PGC_S_INTERACTIVE
+	||
+	log_statement_search_path > LOG_STATEMENT_SEARCH_PATH_NON_DEFAUL

RE: Collect ObjectAddress for ATTACH DETACH PARTITION to use in event trigger

2022-07-26 Thread houzj.f...@fujitsu.com
On Monday, July 25, 2022 6:26 PM Michael Paquier  wrote:
> 
> On Mon, Jul 25, 2022 at 09:25:07AM +, houzj.f...@fujitsu.com wrote:
> > BTW, while reviewing it, I found there are some more subcommands that
> > the
> > get_altertable_subcmdtypes() doesn't handle (e.g., ADD/DROP/SET
> > IDENTITY and re ADD STAT). Shall we fix them all while we are at it?
> >
> > Attach a minor patch to fix those which is based on the v2 patch set.
> 
> @@ -300,6 +300,18 @@ get_altertable_subcmdinfo(PG_FUNCTION_ARGS)
> [ ... ]
> default:
> strtype = "unrecognized";
> break;
> 
> Removing the "default" clause would help here as we would get compiler
> warnings if there is anything missing.  One way to do things is to set 
> strtype to
> NULL before going through the switch and have a failsafe as some commands
> are internal so they may not be worth adding to the output.

Thanks for the suggestion. I have removed the default and found some missed
subcommands in the 0003 patch. Attached is the new version of the patch
(The 0001 and 0002 is unchanged).

Best regards,
Hou zj




v3-0003-Add-support-for-some-missed-commands-in-test_ddl_.patch
Description:  v3-0003-Add-support-for-some-missed-commands-in-test_ddl_.patch


v3-0001-Collect-ObjectAddress-for-ATTACH-DETACH-PARTITION.patch
Description:  v3-0001-Collect-ObjectAddress-for-ATTACH-DETACH-PARTITION.patch


v3-0002-Extend-test_ddl_deparse-for-ALTER-TABLE-.-ATTACH-.patch
Description:  v3-0002-Extend-test_ddl_deparse-for-ALTER-TABLE-.-ATTACH-.patch


Re: Restructure ALTER TABLE notes to clarify table rewrites and verification scans

2022-07-26 Thread Justin Pryzby
I think this patch is missing "SET [UN]LOGGED", defaults of identity columns
and domains, and access method.

And tablespace, even though that rewrites the *files* but not the tuples
(maybe these docs should say that).




Re: making relfilenodes 56 bits

2022-07-26 Thread Dilip Kumar
On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma  wrote:

Hi,
Note: please avoid top posting.

> /*
>  * If relfilenumber is unspecified by the caller then create storage
> -* with oid same as relid.
> +* with relfilenumber same as relid if it is a system table otherwise
> +* allocate a new relfilenumber.  For more details read comments atop
> +* FirstNormalRelFileNumber declaration.
>  */
> if (!RelFileNumberIsValid(relfilenumber))
> -   relfilenumber = relid;
> +   {
> +   relfilenumber = relid < FirstNormalObjectId ?
> +   relid : GetNewRelFileNumber();
>
> Above code says that in the case of system table we want relfilenode to be 
> the same as object id. This technically means that the relfilenode or oid for 
> the system tables would not be exceeding 16383. However in the below lines of 
> code added in the patch, it says there is some chance for the storage path of 
> the user tables from the old cluster conflicting with the storage path of the 
> system tables in the new cluster. Assuming that the OIDs for the user tables 
> on the old cluster would start with 16384 (the first object ID), I see no 
> reason why there would be a conflict.


Basically, the above comment says that the initial system table
storage will be created with the same relfilenumber as the Oid, so you are
right that it will not exceed 16383.  The code below explains the
reason: we do it this way to avoid conflicts with user tables from an
older cluster.  Otherwise, in the new design, we have no intention of
keeping the relfilenode the same as the Oid.  But an older cluster being
upgraded, which does not follow this new design, might have user table
relfilenodes that conflict with the system tables in the new cluster.
So we have to ensure that, even with the new design, the initial cluster
keeps system table relfilenodes in a low range, and directly using the
Oid is the simplest way to do that, instead of defining a completely new
range and maintaining a separate counter for it.

> +/* --
> + * RelFileNumber zero is InvalidRelFileNumber.
> + *
> + * For the system tables (OID < FirstNormalObjectId) the initial storage
> + * will be created with the relfilenumber same as their oid.  And, later for
> + * any storage the relfilenumber allocated by GetNewRelFileNumber() will 
> start
> + * at 10.  Thus, when upgrading from an older cluster, the relation 
> storage
> + * path for the user table from the old cluster will not conflict with the
> + * relation storage path for the system table from the new cluster.  Anyway,
> + * the new cluster must not have any user tables while upgrading, so we 
> needn't
> + * worry about them.
> + * --
> + */
> +#define FirstNormalRelFileNumber   ((RelFileNumber) 10)
>
> ==
>
> When WAL logging the next object id we have chosen the xlog threshold
> value of 8192, whereas for relfilenode it is 512. Any reason for choosing this
> low arbitrary value in the case of relfilenumber?

For Oid, when we cross the max value we simply wrap around, whereas for
relfilenumber we cannot expect a wraparound within the lifetime of the
cluster.  So it is better not to log ahead a really large number of
relfilenumbers as we do for Oid.  OTOH, if we make it really low, like 64,
then RelFileNumberGenLock starts showing up as a wait event under very
high concurrency, where 32 backends are continuously creating/dropping
tables.  So we chose 512: not so low that it creates lock contention, and
not so high that we need to worry about wasting that many relfilenumbers
on a crash.
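
In other words, the scheme follows the XLogPutNextOid "log ahead"
pattern with a smaller batch; very roughly (the field and function names
below are guesses at the patch's naming, not verified against it):

#define VAR_RELNUMBER_PREFETCH	512

RelFileNumber
GetNewRelFileNumber(void)
{
	RelFileNumber	result;

	LWLockAcquire(RelFileNumberGenLock, LW_EXCLUSIVE);

	result = ShmemVariableCache->nextRelFileNumber;

	/* if the pre-logged range is used up, log another batch ahead */
	if (result >= ShmemVariableCache->loggedRelFileNumber)
	{
		XLogPutNextRelFileNumber(result + VAR_RELNUMBER_PREFETCH);
		ShmemVariableCache->loggedRelFileNumber =
			result + VAR_RELNUMBER_PREFETCH;
	}

	ShmemVariableCache->nextRelFileNumber++;

	LWLockRelease(RelFileNumberGenLock);

	return result;
}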

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: doc: Clarify Savepoint Behavior

2022-07-26 Thread David G. Johnston
On Thu, Jul 14, 2022 at 12:44 PM Bruce Momjian  wrote:

> On Sat, Jul  9, 2022 at 12:59:23PM -0400, Bruce Momjian wrote:
> > On Sun, Jun 26, 2022 at 09:14:56AM -0700, David G. Johnston wrote:
> > > So leave the "release" behavior implied from the rollback behavior?
> > >
> > > On the whole I'm slightly in favor of your proposed wording (mostly
> due to the
> > > better fitting of the ROLLBACK command, though at the omission of
> RELEASE...)
> > > but are you seeing anything beyond personal style as to why you feel
> one is
> > > better than the other?  Is there some existing wording in the docs
> that I
> > > should be conforming to here?
> >
> > I have developed the attached patch based on the discussion here.  I
> > tried to simplify the language and example to clarify the intent.
> >
> > I was confused why the first part of the patch added a mention of
> > releasing savepoints to the ROLLBACK TO manual page --- I have removed
> > that and improved the text in RELEASE SAVEPOINT.
>
> Patch applied to all supported versions.
>
>
Bruce,

Thanks for committing this and the other patches.  Should I go into the
commitfest and mark the entries for these as committed or does protocol
dictate I remind you and you do that?

David J.


Re: make -C libpq check fails obscurely if tap tests are disabled

2022-07-26 Thread Alvaro Herrera
On 2022-Jul-26, Alvaro Herrera wrote:

> With this version I keep the target name as -recurse, and at least the
> ecpg<->libpq problem is no more AFAICT.

... but I think we're missing some more dependencies, because if I
remove everything (beyond make clean), then a "make -j8 world-bin"
fails, per Cirrus.
  https://cirrus-ci.com/build/6305635338813440
With that, I'm going to set this aside for the time being.  If somebody
wants to play with these Makefile rules, be my guest.  It sounds like
there's some compile time gains to be had, but it may require some
fiddling and it's not clear to me if the move to Meson is going to make
this moot.

Running it locally, I get this:

[...]
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 
-Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g 
-O2 -DFRONTEND -I. -I/pgsql/source/master/src/common -I../../src/include 
-I/pgsql/source/master/src/include  -D_GNU_SOURCE  -DVAL_CC="\"gcc\"" 
-DVAL_CPPFLAGS="\"-D_GNU_SOURCE\"" -DVAL_CFLAGS="\"-Wall -Wmissing-prototypes 
-Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard 
-Wno-format-truncation -Wno-stringop-truncation -g -O2\"" 
-DVAL_CFLAGS_SL="\"-fPIC\"" -DVAL_LDFLAGS="\"-L/usr/lib/llvm-11/lib 
-Wl,--as-needed -Wl,-rpath,'/pgsql/install/master/lib',--enable-new-dtags\"" 
-DVAL_LDFLAGS_EX="\"\"" -DVAL_LDFLAGS_SL="\"\"" -DVAL_LIBS="\"-lpgcommon 
-lpgport -llz4 -lssl -lcrypto -lz -lreadline -lpthread -lrt -ldl -lm \""  -c -o 
hashfn.o /pgsql/source/master/src/common/hashfn.c -MMD -MP -MF .deps/hashfn.Po
[...]
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 
-Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -g 
-O2 -DFRONTEND -I. -I/pgsql/source/master/src/common -I../../src/include 
-I/pgsql/source/master/src/include  -D_GNU_SOURCE  -DVAL_CC="\"gcc\"" 
-DVAL_CPPFLAGS="\"-D_GNU_SOURCE\"" -DVAL_CFLAGS="\"-Wall -Wmissing-prototypes 
-Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard 
-Wno-format-truncation -Wno-stringop-truncation -g -O2\"" 
-DVAL_CFLAGS_SL="\"-fPIC\"" -DVAL_LDFLAGS="\"-L/usr/lib/llvm-11/lib 
-Wl,--as-needed -Wl,-rpath,'/pgsql/install/master/lib',--enable-new-dtags\"" 
-DVAL_LDFLAGS_EX="\"\"" -DVAL_LDFLAGS_SL="\"\"" -DVAL_LIBS="\"-lpgcommon 
-lpgport -llz4 -lssl -lcrypto -lz -lreadline -lpthread -lrt -ldl -lm \""  -c -o 
relpath.o /pgsql/source/master/src/common/relpath.c -MMD -MP -MF 
.deps/relpath.Po
[...]
In file included from /pgsql/source/master/src/include/postgres.h:47,
 from /pgsql/source/master/src/common/hashfn.c:24:
/pgsql/source/master/src/include/utils/elog.h:75:10: fatal error: 
utils/errcodes.h: No such file or directory
   75 | #include "utils/errcodes.h"
  |  ^~
compilation terminated.
[...]
make[2]: *** [../../src/Makefile.global:945: hashfn.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/pgsql/source/master/src/common/relpath.c:21:10: fatal error: 
catalog/pg_tablespace_d.h: No such file or directory
   21 | #include "catalog/pg_tablespace_d.h"
  |  ^~~
compilation terminated.
make[2]: *** [../../src/Makefile.global:945: relpath.o] Error 1
make[2]: Leaving directory '/home/alvherre/Code/pgsql-build/master/src/common'
make[1]: *** [Makefile:42: all-common-recurse] Error 2
make[1]: Leaving directory '/home/alvherre/Code/pgsql-build/master/src'
make: *** [GNUmakefile:21: world-bin-src-recurse] Error 2


-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I must say, I am absolutely impressed with what pgsql's implementation of
VALUES allows me to do. It's kind of ridiculous how much "work" goes away in
my code.  Too bad I can't do this at work (Oracle 8/9)."   (Tom Allison)
   http://archives.postgresql.org/pgsql-general/2007-06/msg00016.php




Re: Unprivileged user can induce crash by using an SUSET param in PGOPTIONS

2022-07-26 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> Dagfinn Ilmari Mannsåker  writes:
>> Tom Lane  writes:
>>> I wonder if it'd be a good idea to convert
>>> auto_explain's TAP test to load auto_explain via session_preload_libraries
>>> instead of shared_preload_libraries, and then pass in the settings for
>>> each test via PGOPTIONS instead of constantly rewriting postgresql.conf.
>
>> That whole config-file rewriting did feel a bit icky when I added more
>> tests recently, but I completely forgot about PGOPTIONS and -c.
>> Something like the attached is indeed much nicer.
>
> Thanks!  I added a test to verify the permissions-checking issue
> and pushed it.

Thanks!  Just one minor nitpick: setting an %ENV entry to `undef`
doesn't unset the environment variable, it sets it to the empty string.
To unset a variable it needs to be deleted from %ENV, i.e. `delete
$ENV{PGUSER};`.  Alternatively, wrap the relevant tests in a block and
use `local`, like in the `query_log` function.

>   regards, tom lane

- ilmari




Re: Unprivileged user can induce crash by using an SUSET param in PGOPTIONS

2022-07-26 Thread Tom Lane
Dagfinn Ilmari Mannsåker  writes:
> Thanks!  Just one minor nitpick: setting an %ENV entry to `undef`
> doesn't unset the environment variable, it sets it to the empty string.
> To unset a variable it needs to be deleted from %ENV, i.e. `delete
> $ENV{PGUSER};`.

Ah.  Still, libpq doesn't distinguish, so the test works anyway.
Not sure if it's worth changing.

regards, tom lane




[Refactor]Avoid to handle FORCE_NOT_NULL/FORCE_NULL options when COPY TO

2022-07-26 Thread Zhang Mingli
Hi,

FORCE_NOT_NULL and FORCE_NULL are only used by COPY FROM, and copyto.c and copyfrom.c were split in commit https://github.com/postgres/postgres//commit/c532d15dddff14b01fe9ef1d465013cb8ef186df. There is no need to handle these options in COPY TO.

vn-0001-Avoid-to-handle-FORCE_NOT_NULL-FORCE_NULL-options.patch
Description: Binary data

Regards,
Zhang Mingli





Re: remove more archiving overhead

2022-07-26 Thread Nathan Bossart
On Tue, Jul 26, 2022 at 04:26:23PM +0900, Fujii Masao wrote:
> Anyway, since the patches look good to me, I pushed them. Thanks a lot!
> If necessary, we can keep improving the docs later.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Fix annotations nextFullXid

2022-07-26 Thread Zhang Mingli
Thanks!

Regards,
Zhang Mingli



> On Jul 26, 2022, at 10:08, Fujii Masao  wrote:
> 
> 
> 
> On 2022/07/23 14:01, Zhang Mingli wrote:
>> Hi,
>> VariableCacheData.nextFullXid is renamed to nextXid in commit 
>> https://github.com/postgres/postgres//commit/fea10a64340e529805609126740a540c8f9daab4
>>  
>> 
>> Fix the annotations for less confusion.
> 
> Thanks for the patch! LGTM. I will commit it.
> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION





Re: failures in t/031_recovery_conflict.pl on CI

2022-07-26 Thread Alvaro Herrera
On 2022-May-08, Andres Freund wrote:

> On 2022-05-08 13:59:09 -0400, Tom Lane wrote:

> > No one is going to thank us for shipping a known-unstable test case.
> 
> IDK, hiding failures indicating bugs isn't really better, at least if it
> doesn't look like a bug in the test. But you seem to have a stronger opinion
> on this than me, so I'll skip the entire test for now :/

Hey, I just noticed that these tests are still disabled.  The next
minors are coming soon; should we wait until *those* are done and then
re-enable; or re-enable them now to see how they fare and then
re-disable before the next minors if there's still problems we don't
find fixes for?

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it." (ncm, http://lwn.net/Articles/174769/)




Re: pg_auth_members.grantor is bunk

2022-07-26 Thread Robert Haas
On Wed, Jul 20, 2022 at 3:11 PM Robert Haas  wrote:
> The previous version of the patch makes both DROP OWNED and REASSIGN
> OWNED cascade to grantors, but I now think that, for consistency, I'd
> better look into changing it so that only DROP OWNED cascades. I think
> perhaps I should be using SHARED_DEPENDENCY_ACL instead of
> SHARED_DEPENDENCY_OWNER.

All right, here's a new patch set, now with a second patch added to the series.

0001, as before, is a minimal fix for $SUBJECT, but it now uses
SHARED_DEPENDENCY_ACL instead of SHARED_DEPENDENCY_OWNER, because that
gives behavior which is more like what we do for other object types.
However, it confines itself to making sure that
pg_auth_members.grantor is a valid user, and that's it.

0002 then revises the behavior substantially further to make role
grants work like other grants. The grantor of record is required to be
a user with ADMIN OPTION on the grant, or the bootstrap superuser,
just as for other object types the grantor of record must have GRANT
OPTION or be the object owner (but roles don't have owners). Dependent
grants are tracked and must be revoked before the grants upon which
they depend, but REVOKE .. CASCADE now works. Dependent grants must be
acyclic: you can't have alice getting ADMIN OPTION from bob and bob
getting it from alice; somebody's got to get it from the bootstrap
superuser. This is all just by analogy with what we do for grants on
object types, and making role grants do something similar instead of
the completely random treatment we have at present.

I believe that these patches are mostly complete, but I think that
dumpRoleMembership() probably needs some more work. I don't know what
exactly, but there's nothing to cause it to dump the role grants in an
order that will create dependent grants after the things that they
depend on, which seems essential.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-Ensure-that-pg_auth_members.grantor-is-always-val.patch
Description: Binary data


v2-0002-Make-role-grant-system-more-consistent-with-other.patch
Description: Binary data


Re: failures in t/031_recovery_conflict.pl on CI

2022-07-26 Thread Tom Lane
Alvaro Herrera  writes:
> Hey, I just noticed that these tests are still disabled.  The next
> minors are coming soon; should we wait until *those* are done and then
> re-enable; or re-enable them now to see how they fare and then
> re-disable before the next minors if there's still problems we don't
> find fixes for?

Maybe I missed it, but I don't think anything's been done to fix the
test's problems in the back branches.

regards, tom lane




Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

2022-07-26 Thread Matthias van de Meent
On Tue, 26 Jul 2022 at 09:20, Michael Paquier  wrote:
>
> On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
> >  Among the two problems to solve at hand, the parts where the APIs are
> >  changed and made more robust with unsigned types and where block data
> >  is not overflowed with its 16-byte limit are committable, so I'd like
> >  to do that first (still need to check its performance with some micro
> >  benchmark on XLogRegisterBufData()).
> >
> > +1. I'm not excited about adding the "unlikely()" hints, though. We have a
> > pg_attribute_cold hint in ereport(), that should be enough.
>
> Okay, that makes sense.  FWIW, I have been wondering about the
> addition of the extra condition in XLogRegisterBufData() and I did not
> see a difference on HEAD in terms of execution time or profile, with a
> micro-benchmark doing a couple of million calls in a row as of the
> following, roughly:
> [...]

Thanks for testing.

> > How large exactly is the maximum size that this gives? I'd prefer to set the
> > limit conservatively to 1020 MB, for example, with a compile-time static
> > assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).
>
> Something like that would work, I guess.

I've gone over the patch and reviews again, and updated those places
that received comments:

- updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
macros (now in xlogrecord.h), with a max length of the somewhat
arbitrary 1020MiB (see the sketch after this list).
  This leaves room for approx. 4MiB of per-record allocation overhead
before you'd hit MaxAllocSize, and also detaches the dependency on
memutils.h.

- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)
- Dropped any changes in xlogreader.h/c
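
For concreteness, the limit described in the first item could be
expressed roughly like this (a sketch; the exact definitions in the
attached patch may differ):

/* in xlogrecord.h */
#define MaxXLogRecordSize	(1020 * 1024 * 1024)	/* 1020 MiB */

/* record lengths are unsigned, so only an upper bound needs checking */
#define XLogRecordLengthIsValid(len)	((len) <= MaxXLogRecordSize)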

Kind regards,

Matthias van de Meent


v8-0001-Add-protections-in-xlog-record-APIs-against-large.patch
Description: Binary data


Re: making relfilenodes 56 bits

2022-07-26 Thread Ashutosh Sharma
Thanks Dilip. Here are a few comments that I could find upon quickly reviewing
the v11 patch:

 /*
+ * Similar to the XLogPutNextOid but instead of writing NEXTOID log record it
+ * writes a NEXT_RELFILENUMBER log record.  If '*prevrecptr' is a valid
+ * XLogRecPtrthen flush the wal upto this record pointer otherwise flush upto

XLogRecPtrthen -> XLogRecPtr then

==

+   switch (relpersistence)
+   {
+   case RELPERSISTENCE_TEMP:
+   backend = BackendIdForTempRelations();
+   break;
+   case RELPERSISTENCE_UNLOGGED:
+   case RELPERSISTENCE_PERMANENT:
+   backend = InvalidBackendId;
+   break;
+   default:
+   elog(ERROR, "invalid relpersistence: %c", relpersistence);
+   return InvalidRelFileNumber;/* placate compiler */
+   }


I think the above check should be added at the beginning of the function,
so that if we hit the default switch case we don't acquire the lwlock and
do the other work needed to get a new relfilenumber.

==

-   newrelfilenumber = GetNewRelFileNumber(newTableSpace, NULL,
+* Generate a new relfilenumber.  We cannot reuse the old relfilenumber
+* because of the possibility that that relation will be moved back to the

that that relation -> that relation.

==

+ * option_parse_relfilenumber
+ *
+ * Parse relfilenumber value for an option.  If the parsing is successful,
+ * returns; if parsing fails, returns false.
+ */

If parsing is successful, returns true;

--
With Regards,
Ashutosh Sharma.

On Tue, Jul 26, 2022 at 7:33 PM Dilip Kumar  wrote:

> On Tue, Jul 26, 2022 at 6:06 PM Ashutosh Sharma 
> wrote:
>
> Hi,
> Note: please avoid top posting.
>
> > /*
> >  * If relfilenumber is unspecified by the caller then create
> storage
> > -* with oid same as relid.
> > +* with relfilenumber same as relid if it is a system table
> otherwise
> > +* allocate a new relfilenumber.  For more details read comments
> atop
> > +* FirstNormalRelFileNumber declaration.
> >  */
> > if (!RelFileNumberIsValid(relfilenumber))
> > -   relfilenumber = relid;
> > +   {
> > +   relfilenumber = relid < FirstNormalObjectId ?
> > +   relid : GetNewRelFileNumber();
> >
> > Above code says that in the case of system table we want relfilenode to
> be the same as object id. This technically means that the relfilenode or
> oid for the system tables would not be exceeding 16383. However in the
> below lines of code added in the patch, it says there is some chance for
> the storage path of the user tables from the old cluster conflicting with
> the storage path of the system tables in the new cluster. Assuming that the
> OIDs for the user tables on the old cluster would start with 16384 (the
> first object ID), I see no reason why there would be a conflict.
>
>
> Basically, the above comment says that the initial system table
> storage will be created with the same relfilenumber as the Oid, so you are
> right that it will not exceed 16383.  The code below explains the
> reason: we do it this way to avoid conflicts with user tables from an
> older cluster.  Otherwise, in the new design, we have no intention of
> keeping the relfilenode the same as the Oid.  But an older cluster being
> upgraded, which does not follow this new design, might have user table
> relfilenodes that conflict with the system tables in the new cluster.
> So we have to ensure that, even with the new design, the initial cluster
> keeps system table relfilenodes in a low range, and directly using the
> Oid is the simplest way to do that, instead of defining a completely new
> range and maintaining a separate counter for it.
>
> > +/* --
> > + * RelFileNumber zero is InvalidRelFileNumber.
> > + *
> > + * For the system tables (OID < FirstNormalObjectId) the initial storage
> > + * will be created with the relfilenumber same as their oid.  And,
> later for
> > + * any storage the relfilenumber allocated by GetNewRelFileNumber()
> will start
> > + * at 10.  Thus, when upgrading from an older cluster, the relation
> storage
> > + * path for the user table from the old cluster will not conflict with
> the
> > + * relation storage path for the system table from the new cluster.
> Anyway,
> > + * the new cluster must not have any user tables while upgrading, so we
> needn't
> > + * worry about them.
> > + * --
> > + */
> > +#define FirstNormalRelFileNumber   ((RelFileNumber) 10)
> >
> > ==
> >
> > When WAL logging the next object id we have chosen the xlog
> > threshold value of 8192, whereas for relfilenode it is 512. Any reason for
> > choosing this low arbitrary value in the case of relfilenumber?
>
> For Oid when we cross the max value we will wraparound, whereas for
> relfilenumber we can not expect the wraparound for cluster lifetime.
> So it is better not to log forward a 

Re: Cygwin cleanup

2022-07-26 Thread Robert Haas
On Tue, Jul 26, 2022 at 7:40 AM Tom Lane  wrote:
> I think maybe we should re-open the discussion.  I've certainly
> reached the stage of fed-up-ness.  That platform seems seriously
> broken, upstream is making no progress on fixing it, and there
> doesn't seem to be any real-world use-case.  The only positive
> argument for it is that Readline doesn't work in the other
> Windows builds --- but we've apparently not rechecked that
> statement in eighteen years, so maybe things are better now.
>
> If we could just continue to blithely ignore lorikeet's failures,
> I wouldn't mind so much; but doing any significant amount of new
> code development work for the platform seems like throwing away
> developer time.

I agree with that. All things being equal, I like the idea of
supporting a bunch of different platforms, and Cygwin doesn't really
look that dead. It has recent releases. But if blocking signals
doesn't actually work on that platform, making PostgreSQL work
reliably there seems really difficult.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Password reset link / 'less' does not exit in psql version 13.4

2022-07-26 Thread Justin Pryzby
On Mon, Jul 25, 2022 at 11:02:26AM +0200, Michael J. Baars wrote:
> Hello,
> 
> I have two very simple questions:
> 
> 1) I have an account at postgresql.org, but a link to a 'forgot password' 
> seems to be missing on the login page. I have my password stored only on an 
> old Fedora 32 computer. To change the password
> when logged in, you need to supply the old password. In short, I have no way 
> to migrate this postgresql.org account to my new Fedora 35 and Fedora 36 
> computers. What can be done about this?

It says this:
| If you have a postgresql.org community account with a password, please use 
the form below to sign in. If you have one but have lost your password, you can 
use the password reset form. 

(BTW, the development list isn't the right place; pgsql-www is better).

> 2) I have three psql clients running, a version 12.6, a version 13.4 and a 
> version 14.3. Until now a 'select * from table;' showed the output in 'less' 
> or something alike and exited from 'less' when
> the output was complete. Both version 12.6 and version 13.4 work that way. 
> Version 14.3 does not exit from 'less' when the output is complete. Did 
> anyone notice this already?

Is it actually running less or some other pager ?

Do you have all 3 versions of psql installed and the same (different) behavior
happening today ?  How was postgres installed ? Compiled locally or from which
packages ?  Please show pg_config for each.

Could you check how the pager is being run ?
Check the commandline in ps -fC less or similar, and the environment in 
"cat /proc/PID/environ" or "ps wwe -C less"

-- 
Justin




out of date comment in commit_ts.c

2022-07-26 Thread Nathan Bossart
Hi hackers,

I noticed that commit_ts.c has the following comment:

 * XLOG interactions: this module generates an XLOG record whenever a new
 * CommitTs page is initialized to zeroes.  Also, one XLOG record is
 * generated for setting of values when the caller requests it; this allows
 * us to support values coming from places other than transaction commit.
 * Other writes of CommitTS come from recording of transaction commit in
 * xact.c, which generates its own XLOG records for these events and will
 * re-perform the status update on redo; so we need make no additional XLOG
 * entry here.

IIUC the ability for callers to request WAL record generation is no longer
possible as of 08aa89b [0].  Should the second sentence be removed?

[0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=08aa89b

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: predefined role(s) for VACUUM and ANALYZE

2022-07-26 Thread Robert Haas
On Mon, Jul 25, 2022 at 9:47 PM Kyotaro Horiguchi
 wrote:
> One arguable point would be whether we will need to put restriction
> the target relations that Bob can vacuum/analyze.

Yeah. pg_checkpoint makes sense because you can either CHECKPOINT or
you can't. But for a command with a target, you really ought to have a
permission on the object, not just a general permission. On the other
hand, we do have things like pg_read_all_tables, so we could have
pg_vacuum_all_tables too. Still, it seems somewhat appealing to give
people fine-grained control over this, rather than just "on" or "off".

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: failures in t/031_recovery_conflict.pl on CI

2022-07-26 Thread Andres Freund
Hi,

On 2022-07-26 12:47:38 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > Hey, I just noticed that these tests are still disabled.  The next
> > minors are coming soon; should we wait until *those* are done and then
> > re-enable; or re-enable them now to see how they fare and then
> > re-disable before the next minors if there's still problems we don't
> > find fixes for?
> 
> Maybe I missed it, but I don't think anything's been done to fix the
> test's problems in the back branches.

Yea, I don't think either. What's worse, there's several actual problems
in the recovery conflict code in all branches - which causes occasional
failures of the test in HEAD as well.

There's one big set of fixes in
https://www.postgresql.org/message-id/CA%2BhUKGKrLKx7Ky1T_FHk-Y729K0oie-gOXKCbxCXyjbPDJAOOw%40mail.gmail.com

and I suspect it'll be hard to have test be fully reliable without
addressing
https://postgr.es/m/20220715172938.t7uivghdx5vj36cn%40awork3.anarazel.de

Greetings,

Andres Freund




Re: predefined role(s) for VACUUM and ANALYZE

2022-07-26 Thread David G. Johnston
On Tue, Jul 26, 2022 at 10:37 AM Robert Haas  wrote:

> On Mon, Jul 25, 2022 at 9:47 PM Kyotaro Horiguchi
>  wrote:
> > One arguable point would be whether we will need to put restriction
> > the target relations that Bob can vacuum/analyze.
>


> But for a command with a target, you really ought to have a
> permission on the object, not just a general permission. On the other
> hand, we do have things like pg_read_all_tables, so we could have
> pg_vacuum_all_tables too.


I'm still more likely to create a specific security definer function owned
by the relevant table owner to give out ANALYZE (and maybe VACUUM)
permission to ETL-performing roles.

> Still, it seems somewhat appealing to give
> people fine-grained control over this, rather than just "on" or "off".
>
>
Appealing enough to consume a couple of permission bits?

https://www.postgresql.org/message-id/CAKFQuwZ6dhjTFV7Bwmehe1N3%3Dk484y4mM22zuYjVEU2dq9V1aQ%40mail.gmail.com

David J.


Re: Transparent column encryption

2022-07-26 Thread Robert Haas
On Thu, Jul 21, 2022 at 2:30 PM Jacob Champion  wrote:
> On Mon, Jul 18, 2022 at 9:07 AM Robert Haas  wrote:
> > Even there, what can be accomplished with a feature that only encrypts
> > individual column values is by nature somewhat limited. If you have a
> > text column that, for one row, stores the value 'a', and for some
> > other row, stores the entire text of Don Quixote in the original
> > Spanish, it is going to be really difficult to keep an adversary who
> > can read from the disk from distinguishing those rows. If you want to
> > fix that, you're going to need to do block-level encryption or
> > something of that sort.
>
> A minimum padding option would fix the leak here, right? If every
> entry is the same length then there's no information to be gained, at
> least in an offline analysis.

Sure, but padding every text column that you have, even the ones
containing only 'a', out to the length of Don Quixote in the original
Spanish, is unlikely to be an appealing option.

> I think some work around that is probably going to be needed for
> serious use of this encryption, in part because of the use of text
> format as the canonical input. If the encrypted values of 1, 10, 100,
> and 1000 hypothetically leaked their exact lengths, then an encrypted
> int wouldn't be very useful. So I'd want to quantify (and possibly
> configure) exactly how much data you can encrypt in a single message
> before the length starts being leaked, and then make sure that my
> encrypted values stay inside that bound.

I think most ciphers these days are block ciphers, so you're going to
get output that is a multiple of the block size anyway - e.g. I think
for AES it's 128 bits = 16 bytes. So small differences in length will
be concealed naturally, which may be good enough for some use cases.
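
As a back-of-the-envelope sketch (not part of the patch): with a
16-byte block and PKCS#7-style padding, the ciphertext length only
reveals which 16-byte bucket the plaintext length falls into.

/* plaintexts of 0..15 bytes pad to 16, 16..31 to 32, and so on,
 * ignoring any IV/authentication-tag overhead */
static size_t
padded_ciphertext_len(size_t plaintext_len)
{
	return ((plaintext_len / 16) + 1) * 16;
}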

I'm not really convinced that it's worth putting a lot of effort into
bolstering the security of this kind of tech above what it naturally
gives. I think it's likely to be a wild goose chase. If you have major
worries about someone reading your disk in its entirety, use full-disk
encryption. Selective encryption is only suitable when you want to add
a modest level of protection for individual value and are willing to
accept that some information leakage is likely if an adversary can in
fact read the full disk. Padding values to try to further obscure
things may be situationally useful, but if you find yourself worrying
too much about that sort of thing, you likely should have picked
stronger medicine initially.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: predefined role(s) for VACUUM and ANALYZE

2022-07-26 Thread Robert Haas
On Tue, Jul 26, 2022 at 1:50 PM David G. Johnston
 wrote:
>> Still, it seems somewhat appealing to give
>> people fine-grained control over this, rather than just "on" or "off".
> Appealing enough to consume a couple of permission bits?
> https://www.postgresql.org/message-id/CAKFQuwZ6dhjTFV7Bwmehe1N3%3Dk484y4mM22zuYjVEU2dq9V1aQ%40mail.gmail.com

I think we're down to 0 remaining now, so it'd be hard to justify
consuming 2 of 0 remaining bits. However, I maintain that the solution
to this is either (1) change the aclitem representation to get another
32 bits or (2) invent a different system for less-commonly used
permission bits. Checking permissions for SELECT or UPDATE has to be
really fast, because most queries will need to do that sort of thing.
If we represented VACUUM or ANALYZE in some other way in the catalogs
that was more scalable but less efficient, it wouldn't be a big deal
(although there's the issue of code duplication to consider).
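
To illustrate option (1) only as a sketch: today AclMode is a 32-bit
word whose low 16 bits hold the rights and high 16 bits their grant
options, so widening the type would double the budget.  The definitions
below are illustrative, not a proposal (ai_privs itself would have to be
widened as well):

typedef uint64 AclMode;			/* currently uint32 */

#define ACLITEM_GET_PRIVS(item)		((item).ai_privs & 0xFFFFFFFFULL)
#define ACLITEM_GET_GOPTIONS(item)	(((item).ai_privs >> 32) & 0xFFFFFFFFULL)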

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Unstable tests for recovery conflict handling

2022-07-26 Thread Tom Lane
I wrote:
>> It's been kind of hidden by other buildfarm noise, but
>> 031_recovery_conflict.pl is not as stable as it should be [1][2][3][4].

> After digging around in the code, I think this is almost certainly
> some manifestation of the previously-complained-of problem [1] that
> RecoveryConflictInterrupt is not safe to call in a signal handler,
> leading the conflicting backend to sometimes decide that it's not
> the problem.

I happened to notice that while skink continues to fail off-and-on
in 031_recovery_conflict.pl, the symptoms have changed!  What
we're getting now typically looks like [1]:

[10:45:11.475](0.023s) ok 14 - startup deadlock: lock acquisition is waiting
Waiting for replication conn standby's replay_lsn to pass 0/33FB8B0 on primary
done
timed out waiting for match: (?^:User transaction caused buffer deadlock with 
recovery.) at t/031_recovery_conflict.pl line 367.

where absolutely nothing happens in the standby log, until we time out:

2022-07-24 10:45:11.452 UTC [1468367][client backend][2/4:0] LOG:  statement: 
SELECT * FROM test_recovery_conflict_table2;
2022-07-24 10:45:11.472 UTC [1468547][client backend][3/2:0] LOG:  statement: 
SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
2022-07-24 10:48:15.860 UTC [1468362][walreceiver][:0] FATAL:  could not 
receive data from WAL stream: server closed the connection unexpectedly

So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
the startup process hasn't detected the buffer conflict in the first
place.  Don't know what to make of that, but I vaguely suspect a test
timing problem.  gull has shown this once as well, although at a different
step in the script [2].

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-07-24%2007%3A00%3A29
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gull&dt=2022-07-23%2009%3A34%3A54




Re: Handle infinite recursion in logical replication setup

2022-07-26 Thread vignesh C
On Sun, Jul 24, 2022 at 10:21 PM Jonathan S. Katz  wrote:
>
> On 7/22/22 12:47 AM, Amit Kapila wrote:
> > On Fri, Jul 22, 2022 at 1:39 AM Jonathan S. Katz  
> > wrote:
>
> >> 1. I'm concerned by calling this "Bidirectional replication" in the docs
> >> that we are overstating the current capabilities. I think this is
> >> accentuated int he opening paragraph:
> >>
> >> ==snip==
> >>Bidirectional replication is useful for creating a multi-master database
> >>environment for replicating read/write operations performed by any of 
> >> the
> >>member nodes.
> >> ==snip==
> >>
> >> For one, we're not replicating reads, we're replicating writes. Amongst
> >> the writes, at this point we're only replicating DML. A reader could
> >> think that deploying can work for a full bidirectional solution.
> >>
> >> (Even if we're aspirationally calling this section "Bidirectional
> >> replication", that does make it sound like we're limited to two nodes,
> >> when we can support more than two).
> >>
> >
> > Right, I think the system can support N-Way replication.
>
> I did some more testing of the feature, i.e. doing 3-node and 4-node
> tests. While logical replication today can handle replicating between
> multiple nodes (N-way), the "origin = none" does require setting up
> subscribers between each of the nodes.
>
> For example, if I have 4 nodes A, B, C, D and I want to replicate the
> same table between all of them, I need to set up subscribers between all
> of them (A<=>B, A<=>C, A<=>D, B<=>C, B<=>D, C<=>D). However, each node
> can replicate between each other in a way that's convenient (vs. having
> to do something funky with partitions) so this is still a big step forward.
>
> This is a long way of saying that I do think it's fair to say we support
> "N-way" replication so long as you are set up in a mesh (vs. a ring,
> e.g. A=>B=>C=>D=>A).
>
> > Among the above "Replicating changes between primaries" sounds good to
> > me or simply "Replication between primaries". As this is a sub-section
> > on the Logical Replication page, I feel it is okay to not use Logical
> > in the title.
>
> Agreed, I think that's fine.
>
> >> At a minimum, I think we should reference the documentation we have in
> >> the logical replication section on conflicts. We may also want to advise
> >> that a user is responsible for designing their schemas in a way to
> >> minimize the risk of conflicts.
> >>
> >
> > This sounds reasonable to me.
> >
> > One more point about docs, it appears to be added as the last
> > sub-section on the Logical Replication page. Is there a reason for
> > doing so? I feel this should be third sub-section after describing
> > Publication and Subscription.
>
> When I first reviewed, I had not built the docs. Did so on this pass.
>
> I agree with the positioning argument, i.e. it should go after
> "Subscription" in the table of contents -- but it makes me question a
> couple of things:
>
> 1. The general ordering of the docs
> 2. How we describe that section (more on that in a sec)
> 3. If "row filters" should be part of "subscription" instead of its own
> section.
>
> If you look at the current order, "Quick setup" is the last section; one
> would think the "quick" portion goes first :) Given a lot of this is for
> the current docs, I may start a separate discussion on -docs for this part.
>
> For the time being, I agree it should be moved to the section after
> "Subscription".
>
> I think what this section describes is "Configuring Replication Between
> Nodes" as it covers a few different scenarios.
>
> I do think we need to iterate on these docs -- the examples with the
> commands are generally OK and easy to follow, but a few things I noticed:
>
> 1. The general description of the section needs work. We may want to
> refine the description of the use cases, and in the warning, link to
> instructions on how to take backups.

Modified

> 2. We put the "case not supported" in the middle, not at the end.

Modified

> 3. The "generic steps for adding a new node..." section uses a
> convention for steps that is not found in the docs. We also don't
> provide an example for this section, and this is the most complicated
> scenario to set up.

Modified

Thanks a lot for the suggestions, I have made the changes  for the
same in the v39 patch attached.

Regards,
Vignesh
From 29c2206e2440bee0ee9818b6a2af66dad5a5d5d4 Mon Sep 17 00:00:00 2001
From: Vigneshwaran C 
Date: Tue, 26 Jul 2022 22:59:15 +0530
Subject: [PATCH v39 1/2] Check and throw an error if publication tables were
 also subscribing from other publishers and support force value for copy_data
 parameter.

This patch does a couple of things:
1) Checks and throws an error if 'copy_data = on' and 'origin =
none' but the publication tables were also replicated from other publishers.
2) Adds 'force' value for copy_data parameter.

---
The steps below help to demonstrate how the new ex

Re: Handle infinite recursion in logical replication setup

2022-07-26 Thread vignesh C
On Tue, Jul 26, 2022 at 7:16 AM Peter Smith  wrote:
>
> Here are some review comments for the patch v38-0002:
>
> ==
>
>  - terminology
>
> There seemed to be an inconsistent alternation of the terms
> "primaries" and "nodes"... For example "Setting replication between
> two primaries" versus "Adding a new node..." (instead of "Adding a new
> primary..."?). I have included suggested minor rewording changes in
> the review comments below, but please check in case I miss something.
> Because I suggested changes to some titles maybe you will also want to
> change the section ids too.
>
> ~~~
>
> 1. Commit message
>
> The documentation was recently modified to remove the term
> "bidirectional replication" and replace it all with "replication
> between primaries", so this commit message (and also the patch name
> itself) should be similarly modified.

Modified

> ~~~
>
> 2.
> +   
> +Replication between primaries is useful for creating a multi-master
> +database environment for replicating write operations performed by any of
> +the member nodes. The steps to create replication between primaries in
> +various scenarios are given below. Note: User is responsible for 
> designing
> +their schemas in a way to minimize the risk of conflicts. See
> + for the details of 
> logical
> +replication conflicts. The logical replication restrictions applies to
> +the replication between primaries also. See
> + for the details of
> +logical replication restrictions.
> +   
>
> 2a.
> "User" -> "The user"

Modified

> 2b.
> "The logical replication restrictions applies to..." --> "The logical
> replication restrictions apply to..."

Modified

> 2c.
> These are important notes. Instead of just being part of the text
> blurb, perhaps these should be rendered as SGML  (or put them
> both in a single  if you want)

Modified

> ~~~
>
> 3. Setting replication between two primaries
>
> +   Setting replication between two primaries
> +   
> +The following steps demonstrate how to setup replication between two
> +primaries when there is no table data present on both nodes
> +node1 and node2:
> +   
>
> SUGGESTED
> The following steps demonstrate how to set up replication between two
> primaries (node1 and node2) when there is no table data present on
> both nodes:.

Modified

> ~~~
>
> 4.
> +   
> +Now the replication setup between two primaries node1
> +and node2 is complete. Any incremental changes from
> +node1 will be replicated to node2,
> +and any incremental changes from node2 will be
> +replicated to node1.
> +   
>
> "between two primaries" -> "between primaries"

Modified

> ~~~
>
> 5. Adding a new node when there is no table data on any of the nodes
>
> SUGGESTION (title)
> Adding a new primary when there is no table data on any of the primaries

Modified

> ~~~
>
> 6.
> +   
> +The following steps demonstrate adding a new node 
> node3
> +to the existing node1 and node2 
> when
> +there is no t1 data on any of the nodes. This requires
>
> SUGGESTION
> The following steps demonstrate adding a new primary (node3) to the
> existing primaries (node1 and node2) when there is no t1 data on any
> of the nodes.

Modified

> ~~~
>
> 7. Adding a new node when table data is present on the existing nodes
>
> SUGGESTION (title)
> Adding a new primary when table data is present on the existing primaries

Modified

> ~~~
>
> 8.
> +
> + The following steps demonstrate adding a new node 
> node3
> + which has no t1 data to the existing
> + node1 and node2 where
> + t1 data is present. This needs similar steps; the 
> only
>
> SUGGESTION
> The following steps demonstrate adding a new primary (node3) that has
> no t1 data to the existing primaries (node1 and node2) where t1 data
> is present.

Modified

> ~~~
>
> 9. Adding a new node when table data is present on the new node
>
> SUGGESTION (title)
> Adding a new primary that has existing table data

Modified

> ~~~
>
> 10.
> +   
> +
> + Adding a new node when table data is present on the new node is not
> + supported.
> +
> +   
>
> SUGGESTION
> Adding a new primary that has existing table data is not supported.

Modified

> ~~~
>
> 11. Generic steps for adding a new node to an existing set of primaries
>
> SUGGESTION (title)
> Generic steps for adding a new primary to an existing set of primaries

Modified

Thanks for the comments, the v39 patch shared at [1] has the changes
for the same.
[1] - 
https://www.postgresql.org/message-id/CALDaNm2POATc_jwQ-8MBJgGCVZGdUNhnTv8zkBuGzLaY03dM%3DA%40mail.gmail.com

Regards,
Vignesh




Re: Handle infinite recursion in logical replication setup

2022-07-26 Thread vignesh C
On Tue, Jul 26, 2022 at 1:35 PM shiy.f...@fujitsu.com
 wrote:
>
> On Sun, Jul 24, 2022 1:28 AM vignesh C  wrote:
> >
> > Added a  note for the same and referred it to the conflicts section.
> >
> > Thanks for the comments, the attached v38 patch has the changes for the
> > same.
> >
>
> Thanks for updating the patch. A comment on the test in 0001 patch.
>
> +# Alter subscription ... refresh publication should fail when a new table is
> +# subscribing data from a different publication should fail
> +($result, $stdout, $stderr) = $node_A->psql(
> +   'postgres', "
> +ALTER SUBSCRIPTION tap_sub_A2 REFRESH PUBLICATION");
> +like(
> +   $stderr,
> +   qr/ERROR: ( [A-Z0-9]+:)? could not replicate table "public.tab_new"/,
> +   "Create subscription with origin and copy_data having replicated 
> table in publisher"
> +);
>
> The comment says "should fail" twice, the latter one can be removed.

Modified

> Besides, "Create subscription with origin and copy_data" should be changed to
> "Alter subscription with origin and copy_data" I think.

Modified to "Refresh publication"

Thanks for the comments, the v39 patch shared at [1] has the changes
for the same.
[1] - 
https://www.postgresql.org/message-id/CALDaNm2POATc_jwQ-8MBJgGCVZGdUNhnTv8zkBuGzLaY03dM%3DA%40mail.gmail.com

Regards,
Vignesh




Re: Handle infinite recursion in logical replication setup

2022-07-26 Thread vignesh C
On Tue, Jul 26, 2022 at 2:12 PM wangw.f...@fujitsu.com
 wrote:
>
> On Sun, Jul 24, 2022 1:28 AM vignesh C  wrote:
> > Added a  note for the same and referred it to the conflicts section.
> >
> > Thanks for the comments, the attached v38 patch has the changes for the 
> > same.
>
> Thanks for your patches.
>
> Two slight comments on the below message in the 0001 patch:
>
> The error message in the function check_pub_table_subscribed().
> +   ereport(ERROR,
> +   
> errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +   errmsg("could not replicate table \"%s.%s\"",
> +  nspname, relname),
> +   errdetail("CREATE/ALTER SUBSCRIPTION with 
> origin = none and copy_data = on is not allowed when the publisher has 
> subscribed same table."),
> +   errhint("Use CREATE/ALTER SUBSCRIPTION with 
> copy_data = off/force."));
>
> 1.
> I think it might be better to use true/false here than on/off.
> Just for consistency with another error message
> (in function parse_subscription_options) and the description of this parameter
> in the PG document.

Modified

> If you agree with this, please also kindly consider the attached
> "slight_modification.diff" file.
> (This is a slight modification for the second patch, just replace "off" with
> "false" in PG document.)

I have updated the documentation similarly

> 2.
> How about replacing "origin = none" and "copy_data = on" in the message with
> "%s"? I think this might be better for translation. Just like the following
> error message in the function parse_subscription_options:
> ```
> if (opts->copy_data &&
> IsSet(opts->specified_opts, SUBOPT_COPY_DATA))
> ereport(ERROR,
> (errcode(ERRCODE_SYNTAX_ERROR),
>  errmsg("%s and %s are mutually 
> exclusive options",
> "connect = false", 
> "copy_data = true/force")));
> ```

Modified

Thanks for the comments, the v39 patch shared at [1] has the changes
for the same.
[1] - 
https://www.postgresql.org/message-id/CALDaNm2POATc_jwQ-8MBJgGCVZGdUNhnTv8zkBuGzLaY03dM%3DA%40mail.gmail.com

Regards,
Vignesh




Re: Slow standby snapshot

2022-07-26 Thread Andrey Borodin



> On 20 Jul 2022, at 02:12, Michail Nikolaev  wrote:
> 
>> I've looked into v5.
> Thanks!
> 
> Patch is updated according to your remarks.

The patch seems Ready for Committer from my POV.

Thanks!

Best regards, Andrey Borodin.



Re: Handle infinite recursion in logical replication setup

2022-07-26 Thread vignesh C
On Mon, Jul 25, 2022 at 12:58 PM Peter Smith  wrote:
>
> Firstly, I have some (case-sensitivity) questions about the previous
> patch which was already pushed [1].
>
> Q1. create_subscription docs
>
> I did not understand why the docs refer to slot_name = NONE, yet the
> newly added option says origin = none/any. I think that saying origin
> = NONE/ANY would be more consistent with the existing usage of NONE in
> this documentation.
>
> ~~~
>
> Q2. parse_subscription_options
>
> Similarly, in the code (parse_subscription_options), I did not
> understand why the checks for special name values are implemented
> differently:
>
> The new 'origin' code is using pg_strcmpcase to check special values
> (none/any), and the old 'slot_name' code uses case-sensitive strcmp to
> check the special value (none).
>
> FWIW, here I thought the new origin code is the correct one.
>
> ==
>
> Now, here are some review comments for the patch v38-0001:
>
> 1. src/backend/commands/subscriptioncmds.c - check_pub_table_subscribed
>
> @@ -1781,6 +1858,121 @@ AlterSubscriptionOwner_oid(Oid subid, Oid newOwnerId)
>   table_close(rel, RowExclusiveLock);
>  }
>
> +/*
> + * Check and throw an error if the publisher has subscribed to the same table
> + * from some other publisher. This check is required only if copydata is ON 
> and
> + * the origin is local for CREATE SUBSCRIPTION and
> + * ALTER SUBSCRIPTION ... REFRESH statements to avoid replicating remote data
> + * from the publisher.
> + *
> + * This check need not be performed on the tables that are already added as
> + * incremental sync for such tables will happen through WAL and the origin of
> + * the data can be identified from the WAL records.
> + *
> + * subrel_local_oids contains the list of relation oids that are already
> + * present on the subscriber.
> + */
> +static void
> +check_pub_table_subscribed(WalReceiverConn *wrconn, List *publications,
> +CopyData copydata, char *origin,
> +Oid *subrel_local_oids, int subrel_count)
>
> 1a.
> "copydata is ON" --> "copy_data = on" (because the comment is talking
> about the CREATE/ALTER statements, so it seemed a bit confusing to
> refer to the copydata function param instead of the copy_data
> subscription parameter)

Modified

> 1b.
> "the origin is local" ?? But, "local" was the old special name value.
> Now it is "none", so I think this part needs minor rewording.

Modified

>
> ~~~
>
> 2.
>
> + if (copydata != COPY_DATA_ON || !origin ||
> + (pg_strcasecmp(origin, "none") != 0))
> + return;
>
> Should this be using the constant LOGICALREP_ORIGIN_NONE?

Modified

> ~~~
>
> 3.
>
> + /*
> + * Throw an error if the publisher has subscribed to the same table
> + * from some other publisher. We cannot differentiate between the
> + * origin and non-origin data that is present in the HEAP during the
> + * initial sync. Identification of non-origin data can be done only
> + * from the WAL by using the origin id.
> + *
> + * XXX: For simplicity, we don't check whether the table has any data
> + * or not. If the table doesn't have any data then we don't need to
> + * distinguish between local and non-local data so we can avoid
> + * throwing an error in that case.
> + */
>
> 3a.
> When the special origin value changed from "local" to "none" this
> comment's first part seems to have got a bit lost in translation.

Modified

> SUGGESTION:
> Throw an error if the publisher has subscribed to the same table from
> some other publisher. We cannot know the origin of data during the
> initial sync. Data origins can be found only from the WAL by looking
> at the origin id.

Modified

> 3b.
> I think referring to "local and non-local" data in the XXX part of
> this comment also needs some minor rewording now that "local" is not a
> special origin name anymore.

Modified

Thanks for the comments, the v39 patch shared at [1] has the changes
for the same.
[1] - 
https://www.postgresql.org/message-id/CALDaNm2POATc_jwQ-8MBJgGCVZGdUNhnTv8zkBuGzLaY03dM%3DA%40mail.gmail.com

Regards,
Vignesh




Re: Unstable tests for recovery conflict handling

2022-07-26 Thread Andres Freund
Hi,

On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
> I happened to notice that while skink continues to fail off-and-on
> in 031_recovery_conflict.pl, the symptoms have changed!  What
> we're getting now typically looks like [1]:
> 
> [10:45:11.475](0.023s) ok 14 - startup deadlock: lock acquisition is waiting
> Waiting for replication conn standby's replay_lsn to pass 0/33FB8B0 on primary
> done
> timed out waiting for match: (?^:User transaction caused buffer deadlock with 
> recovery.) at t/031_recovery_conflict.pl line 367.
> 
> where absolutely nothing happens in the standby log, until we time out:
> 
> 2022-07-24 10:45:11.452 UTC [1468367][client backend][2/4:0] LOG:  statement: 
> SELECT * FROM test_recovery_conflict_table2;
> 2022-07-24 10:45:11.472 UTC [1468547][client backend][3/2:0] LOG:  statement: 
> SELECT 'waiting' FROM pg_locks WHERE locktype = 'relation' AND NOT granted;
> 2022-07-24 10:48:15.860 UTC [1468362][walreceiver][:0] FATAL:  could not 
> receive data from WAL stream: server closed the connection unexpectedly
> 
> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
> the startup process hasn't detected the buffer conflict in the first
> place.

I wonder if this, at least partially, could be due to the elog thing
I was complaining about nearby. I.e. we decide to FATAL as part of a
recovery conflict interrupt, and then during that ERROR out as part of
another recovery conflict interrupt (because nothing holds interrupts as
part of FATAL).

Greetings,

Andres Freund




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-26 Thread Andres Freund
On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote:
> On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote:
> > On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
> >> I wonder if we additionally / alternatively could use a faster method of
> >> searching the array linearly, e.g. using SIMD.
> > 
> > I looked into using SIMD.  The patch is attached, but it is only intended
> > for benchmarking purposes and isn't anywhere close to being worth serious
> > review.  There may be a simpler/better way to implement the linear search,
> > but this seemed to work well.  Overall, SIMD provided a decent improvement.
> > I had to increase the number of writers quite a bit in order to demonstrate
> > where the hash tables began winning.  Here are the numbers:
> > 
> > writers    head    simd    hash
> > 256        663     632     694
> > 512        530     618     637
> > 768        489     544     573
> > 1024       364     508     562
> > 2048       185     306     485
> > 4096       146     197     441
> > 
> > While it is unsurprising that the hash tables perform the best, there are a
> > couple of advantages to SIMD that might make that approach worth
> > considering.  For one, there's really no overhead (i.e., you don't need to
> > sort the array or build a hash table), so we can avoid picking an arbitrary
> > threshold and just have one code path.  Also, a SIMD implementation for a
> > linear search through an array of integers could likely be easily reused
> > elsewhere.
> 
> From the discussion thus far, it seems there is interest in optimizing
> [sub]xip lookups, so I'd like to spend some time moving it forward.  I
> think the biggest open question is which approach to take.  Both the SIMD
> and hash table approaches seem viable, but I think I prefer the SIMD
> approach at the moment (see the last paragraph of quoted text for the
> reasons).  What do folks think?

Agreed on all points.
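
To make the idea concrete, a rough sketch of such a SIMD linear search (an
illustration only, not the benchmarked patch; the function and variable names
are placeholders) could look like this with SSE2 intrinsics:

#include <emmintrin.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true if "key" appears anywhere in the unsorted array "arr". */
static bool
xid_in_array(uint32_t key, const uint32_t *arr, int n)
{
	__m128i		keys = _mm_set1_epi32((int) key);
	int			i = 0;

	/* Compare four 32-bit elements per iteration. */
	for (; i + 4 <= n; i += 4)
	{
		__m128i		block = _mm_loadu_si128((const __m128i *) &arr[i]);
		__m128i		cmp = _mm_cmpeq_epi32(block, keys);

		if (_mm_movemask_epi8(cmp) != 0)
			return true;
	}

	/* Fall back to a scalar loop for the remaining tail. */
	for (; i < n; i++)
	{
		if (arr[i] == key)
			return true;
	}
	return false;
}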




Re: Transparent column encryption

2022-07-26 Thread Jacob Champion
On Tue, Jul 26, 2022 at 10:52 AM Robert Haas  wrote:
> On Thu, Jul 21, 2022 at 2:30 PM Jacob Champion  
> wrote:
> > A minimum padding option would fix the leak here, right? If every
> > entry is the same length then there's no information to be gained, at
> > least in an offline analysis.
>
> Sure, but padding every text column that you have, even the ones
> containing only 'a', out to the length of Don Quixote in the original
> Spanish, is unlikely to be an appealing option.

If you are honestly trying to conceal Don Quixote, I suspect you are
already in the business of making unappealing decisions. I don't think
that's necessarily an argument against hiding the length for
real-world use cases.

> > I think some work around that is probably going to be needed for
> > serious use of this encryption, in part because of the use of text
> > format as the canonical input. If the encrypted values of 1, 10, 100,
> > and 1000 hypothetically leaked their exact lengths, then an encrypted
> > int wouldn't be very useful. So I'd want to quantify (and possibly
> > configure) exactly how much data you can encrypt in a single message
> > before the length starts being leaked, and then make sure that my
> > encrypted values stay inside that bound.
>
> I think most ciphers these days are block ciphers, so you're going to
> get output that is a multiple of the block size anyway - e.g. I think
> for AES it's 128 bits = 16 bytes. So small differences in length will
> be concealed naturally, which may be good enough for some use cases.

Right. My point is, if you have a column that has exactly one
important value that is 17 bytes long when converted to text, you're
going to want to know that block size exactly, because the encryption
will be effectively useless for that value. That size needs to be
documented, and it'd be helpful to know that it's longer than, say,
the longest text representation of our fixed-length column types.
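
To put a number on that, here is a back-of-the-envelope illustration (not code
from the patch, which may use a different cipher mode) of how a 16-byte-block,
PKCS#7-padded scheme such as AES-CBC rounds plaintext lengths up:

#include <stdio.h>

#define BLOCK_SIZE 16

/* PKCS#7 always adds at least one byte of padding. */
static int
padded_len(int plaintext_len)
{
	return (plaintext_len / BLOCK_SIZE + 1) * BLOCK_SIZE;
}

int
main(void)
{
	/* The text forms of 1, 10, 100 and 1000 all come out the same size... */
	printf("%d %d %d %d\n",
		   padded_len(1), padded_len(2), padded_len(3), padded_len(4));
	/* ...but a 17-byte text value crosses a block boundary and stands out. */
	printf("%d\n", padded_len(17));
	return 0;
}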

> I'm not really convinced that it's worth putting a lot of effort into
> bolstering the security of this kind of tech above what it naturally
> gives. I think it's likely to be a wild goose chase.

If the goal is to provide real encryption, and not just a toy, I think
you're going to need to put a *lot* of effort into analysis. Even if
the result of the analysis is "we don't plan to address this in v1".

Crypto is inherently a cycle of
make-it-and-break-it-and-fix-it-and-break-it-again. If that's
considered a "wild goose chase" and not seriously pursued at some
level, then this implementation will probably not last long in the
face of real abuse. (That doesn't mean you have to take my advice; I'm
just a dude with opinions -- but you will need to have real
cryptographers look at this, and you're going to need to think about
how the system will evolve when it's broken.)

> If you have major
> worries about someone reading your disk in its entirety, use full-disk
> encryption.

This patchset was designed to protect against the evil DBA case, I
think. Full disk encryption doesn't help.

> Selective encryption is only suitable when you want to add
> a modest level of protection for individual value and are willing to
> accept that some information leakage is likely if an adversary can in
> fact read the full disk.

...but there's a known countermeasure to this particular leakage,
right? Which would make it more suitable for that case.

> Padding values to try to further obscure
> things may be situationally useful, but if you find yourself worrying
> too much about that sort of thing, you likely should have picked
> stronger medicine initially.

In my experience, this entire field is the application of
situationally useful protection. That's one of the reasons it's hard,
and designing this sort of patch is going to be hard too. Putting that
on the user isn't quite fair when you're the ones designing the
system; you determine what they have to worry about when you choose
the crypto.

--Jacob




Re: Unstable tests for recovery conflict handling

2022-07-26 Thread Tom Lane
Andres Freund  writes:
> On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
>> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
>> the startup process hasn't detected the buffer conflict in the first
>> place.

> I wonder if this, at least partially, could be due to the elog thing
> I was complaining about nearby. I.e. we decide to FATAL as part of a
> recovery conflict interrupt, and then during that ERROR out as part of
> another recovery conflict interrupt (because nothing holds interrupts as
> part of FATAL).

There are all sorts of things one could imagine going wrong in the
backend receiving the recovery conflict interrupt, but AFAICS in these
failures, the startup process hasn't sent a recovery conflict interrupt.
It certainly hasn't logged anything suggesting it noticed a conflict.

regards, tom lane




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2022-07-26 Thread Simon Riggs
On Fri, 15 Jul 2022 at 12:29, Aleksander Alekseev
 wrote:
> Personally I didn't expect that
> merging patches in this thread would take that long. They are in
> "Ready for Committer" state for a long time now and there are no known
> issues with them other than unit tests for SLRU [1] should be merged
> first.

These patches look ready to me, including the SLRU tests.

Even though they do very little, these patches touch many aspects of
the code, so it would make sense to apply these as the last step in
the CF.

To encourage committers to take that next step, let's have a
democratic vote on moving this forwards:
+1 from me.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: making relfilenodes 56 bits

2022-07-26 Thread Robert Haas
On Tue, Jul 26, 2022 at 2:07 AM Dilip Kumar  wrote:
> I have thought about it while doing so but I am not sure whether it is
> a good idea or not, because before my change these all were macros
> with 2 naming conventions so I just changed to inline function so why
> to change the name.

Well, the reason to change the name would be for consistency. It feels
weird to have some NAMES_LIKETHIS() and other NamesLikeThis().

Now, an argument against that is that it will make back-patching more
annoying, if any code using these functions/macros is touched. But
since the calling sequence is changing anyway (you now have to pass a
pointer rather than the object itself) that argument doesn't really
carry any weight. So I would favor ClearBufferTag(), InitBufferTag(),
etc.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: let's disallow ALTER ROLE bootstrap_superuser NOSUPERUSER

2022-07-26 Thread Robert Haas
On Thu, Jul 21, 2022 at 1:27 PM Nathan Bossart  wrote:
> Given the current assumptions the code makes about the bootstrap superuser,
> I think it makes sense to disallow removing its superuser attribute (at
> least via ALTER ROLE NOSUPERUSER).  It seems like there is much work to do
> before a no-superuser configuration could be formally supported.  If/when
> such support materializes, it might be possible to remove the restriction
> that Robert is proposing.

Reaction to this patch seems tentatively positive so far, so I have
committed it. Maybe someone will still show up to complain ... but I
think it's a good change, so I hope not.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: let's disallow ALTER ROLE bootstrap_superuser NOSUPERUSER

2022-07-26 Thread Tom Lane
Robert Haas  writes:
> Reaction to this patch seems tentatively positive so far, so I have
> committed it. Maybe someone will still show up to complain ... but I
> think it's a good change, so I hope not.

I had not actually read the patch, but now that I have, it's got
a basic typing error:

+   bool        should_be_super = BoolGetDatum(boolVal(dissuper->arg));
+
+   if (!should_be_super && roleid == BOOTSTRAP_SUPERUSERID)
+   ereport(ERROR,

The result of BoolGetDatum is not bool, it's Datum.  This is
probably harmless, but it's still a typing violation.
You want something like

bool        should_be_super = boolVal(dissuper->arg);
...
new_record[Anum_pg_authid_rolsuper - 1] = BoolGetDatum(should_be_super);

regards, tom lane




Re: let's disallow ALTER ROLE bootstrap_superuser NOSUPERUSER

2022-07-26 Thread Robert Haas
On Tue, Jul 26, 2022 at 2:59 PM Tom Lane  wrote:
> I had not actually read the patch, but now that I have, it's got
> a basic typing error:
>
> +   bool        should_be_super = BoolGetDatum(boolVal(dissuper->arg));
> +
> +   if (!should_be_super && roleid == BOOTSTRAP_SUPERUSERID)
> +   ereport(ERROR,
>
> The result of BoolGetDatum is not bool, it's Datum.  This is
> probably harmless, but it's still a typing violation.
> You want something like
>
> bool        should_be_super = boolVal(dissuper->arg);
> ...
> new_record[Anum_pg_authid_rolsuper - 1] = 
> BoolGetDatum(should_be_super);

Oops. Will fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: making relfilenodes 56 bits

2022-07-26 Thread Robert Haas
On Tue, Jul 12, 2022 at 4:35 PM Robert Haas  wrote:
> > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME,
> > since it includes fsync etc?
>
> Sure, I had it that way for a while and changed it at the last minute.
> I can change it back.

Committed that way, also with the fix for the typo Dilip found.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




[Commitfest 2022-07] Patch Triage: Waiting on Author

2022-07-26 Thread Jacob Champion
Hello all,

I'm making my way through some stalled patches in Waiting on Author. If
nothing changes by the end of this CF, I'd recommend marking these
as Returned with Feedback.

Patch authors CC'd.

- jsonpath syntax extensions
  https://commitfest.postgresql.org/38/2482/

  As a few people pointed out, this has not seen much review/interest in
  the roughly two years it's been posted, and it's needed a rebase since
  last CF. Daniel suggested that the featureset be split up for easier
  review during the 2021-11 triage. I recommend RwF with that
  suggestion.

- Consider parallel for LATERAL subqueries having LIMIT/OFFSET
  https://commitfest.postgresql.org/38/2851/

  Does this have a path forward? It's been Waiting on Author since the
  beginning of last CF, with open concerns from Tom about safety.

- Parallel INSERT SELECT take 2
  https://commitfest.postgresql.org/38/3143/

  There was a lot of discussion early on in this patchset's life, and
  then it got caught in a rebase loop without review in August 2021. The
  thread has mostly gone dark since then and the patch does not apply.

- Add callback table access method to reset filenode when dropping relation
  https://commitfest.postgresql.org/38/3073/

  Heikki had some feedback back in Februrary but it hasn't been answered
  yet. It needs a rebase too.

- Avoid orphaned dependencies
  https://commitfest.postgresql.org/38/3106/

  Tom notes that this cannot be committed as-is; the thread has been
  silent since last CF. Last Author comment in January and needs a
  rebase.

- Allow multiple recursive self-references
  https://commitfest.postgresql.org/38/3046/

  There appears to be agreement that this is useful, but it looks like
  the patch needs some changes before it's committable. Last post from
  the Author was in January.

- Push down time-related SQLValue functions to foreign server
  https://commitfest.postgresql.org/38/3289/

  There's interest and engagement, but it's not committable as-is and
  needs a rebase. Last Author post in January.

- Parallelize correlated subqueries that execute within each worker
  https://commitfest.postgresql.org/38/3246/

  Patch needs to be fixed for FreeBSD; last Author post in January.

- libpq compression
  https://commitfest.postgresql.org/38/3499/

  Needs a rebase and response to feedback; mostly quiet since January.

Thanks,
--Jacob




Re: Implementing Incremental View Maintenance

2022-07-26 Thread huyajun
Hi, Nagata-san

Thank you for your answer, I agree with your opinion, and found some new 
problems to discuss with you
> 
>> 3. Consider truncate base tables, IVM will not refresh, maybe raise an error 
>> will be better
> 
> I fixed to support TRUNCATE on base tables in our repository.
> https://github.com/sraoss/pgsql-ivm/commit/a1365ed69f34e1adbd160f2ce8fd1e80e032392f
> 
> When a base table is truncated, the view content will be empty if the
> view definition query does not contain an aggregate without a GROUP clause.
> Therefore, such views can be truncated. 
> 
> Aggregate views without a GROUP clause always have one row. Therefore,
> if a base table is truncated, the view will not be empty and will contain
> a row with NULL value (or 0 for count()). So, in this case, we refresh the
> view instead of truncating it.
> 
> The next version of the patch-set will include this change. 
> 
I read your patch and think this processing is great, but there is a risk of 
deadlock. 
Although I have not thought of a suitable way to handle it for the time being, 
it is also acceptable for truncate scenarios. The deadlock scenario is as follows:

The MV definition is: select * from base_a,base_b;
S1: truncate base_a; — only takes AccessExclusiveLock on base_a and has not yet
run into the after trigger
S2: insert into base_b; — the update has completed and the incremental
refresh has started in the after trigger, taking RowExclusive on base_b and
ExclusiveLock on mv
S1: continues truncating mv, waits for AccessExclusiveLock on mv, waits for S2
S2: continues refreshing mv, waits for AccessShareLock on base_a, waits for S1
So a deadlock occurs

I also found some new issues that I would like to discuss with you
1. Concurrent DML causes imv data error, case like below
Setup:
Create table t( a int);
Insert into t select 1 from generate_series(1,3);
create incremental materialized view s  as select count(*) from t;

S1: begin;delete from t where ctid in (select ctid from t limit 1);
S2: begin;delete from t where ctid in (select ctid from t limit 1 offset 1);
S1: commit;
S2: commit;

After this, The count data of s becomes 2 but correct data is 1.
I found out that the problem is probably due to our use of ctid updates.
Consider user behavior unrelated to IMV:

Create table t( a int);
Insert into t select 1;
s1: BEGIN
s1: update t set a = 2 where ctid in (select ctid from t); -- UPDATE 1
s2: BEGIN
s2: update t set a = 3 where ctid in (select ctid from t); -- wait row lock
s1: COMMIT
s2: -- UPDATE 0 -- the ctid changed so it can't UPDATE any rows
So we lost the s2 update

2. Sometimes it will crash when the columns of the created materialized view do 
not match
Create table t( a int);
create incremental materialized view s(z)  as select sum(1) as a, sum(1) as b 
from t;

The problem should be that colNames in rewriteQueryForIMMV does not consider 
this situation

3. Sometimes no error when the columns of the created materialized view do not 
match
 Create table t( a int);
 create incremental materialized view s(y,z)  as select  count(1) as b from t;

But the hidden column of the IMV is overwritten to z, which will cause the 
refresh to fail.

The problem is likely that checkRuleResultList should skip the check only for 
the IMV hidden columns

4. A unique index should not be created in the case of a Cartesian product

create table base_a (i int primary key, j varchar);
create table base_b (i int primary key, k varchar);
INSERT INTO base_a VALUES
(1,10),
(2,20),
(3,30),
(4,40),
(5,50);
INSERT INTO base_b VALUES
(1,101),
(2,102),
(3,103),
(4,104);
CREATE incremental MATERIALIZED VIEW s as
select base_a.i,base_a.j from base_a,base_b; — creation fails because of the 
unique index

5. Besides, I would like to ask you if you have considered implementing an IMV 
with delayed refresh?
The advantage of delayed refresh is that it will not have much impact on write 
performance
I probably have some ideas about it now, do you think it works?
1. After the base table is updated, the delayed IMV's after trigger is used to 
record the delta information in another table, similar to an incremental log 
of the base table
2. When doing an incremental refresh, use the data in the log instead of the 
data in the transient table of the after trigger
3. We need to merge the incremental information in advance to ensure that the 
base table after transaction filtering UNION ALL old_delta is the state before 
the base table was updated
Case like below:
Create table t( a int);
—begin to record log
Insert into t select 1; — newlog: 1 oldlog: empty
Delete from t; —newlog:1, oldlog:1
— begin to incremental refresh
Select * from t where xmin < xid or (xmin = xid and cmin < cid); — empty
So this UNION ALL with oldlog is not equal to the state before the base table 
was updated.
We need to merge the incremental log in advance to make newlog: empty, oldlog: 
empty

If implemented, incremental refresh must still be serialized, but DML on the 
base table is not blocked; that is to say, the base table can still record 
logs during an incremental refresh

Re: pg15b2: large objects lost on upgrade

2022-07-26 Thread Robert Haas
On Mon, Jul 18, 2022 at 2:57 PM Robert Haas  wrote:
> Well, it took a while to figure out how to make that work, but I
> believe I've got it now. Attached please find a couple of patches that
> should get the job done. They might need a bit of polish, but I think
> the basic concepts are sound.

So, would people like these patches (1) committed to master only, (2)
committed to master and back-patched into v15, or (3) not committed at
all? Michael argued upthread that it was too risky to be tinkering
with things at this stage in the release cycle and, certainly, the
more time goes by, the more true that gets. But I'm not convinced that
these patches involve an inordinate degree of risk, and using beta as
a time to fix things that turned out to be buggy is part of the point
of the whole thing. On the other hand, the underlying issue isn't that
serious either, and nobody seems to have reviewed the patches in
detail, either. I don't mind committing them on my own recognizance if
that's what people would prefer; I can take responsibility for fixing
anything that is further broken, as I suppose would be expected even
if someone else did review. But, I don't want to do something that
other people feel is the wrong thing to have done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Postgres do not allow to create many tables with more than 63-symbols prefix

2022-07-26 Thread Tom Lane
Andrey Lepikhov  writes:
> On 6/27/22 06:38, Masahiko Sawada wrote:
 Regarding the patch, I think we can merge makeUniqueTypeName() to
 makeArrayTypeName() as there is no caller of makeUniqueTypeName() who
 pass tryOriginal = true.

>>> I partially agree with you. But I have one reason to leave
>>> makeUniqueTypeName() separated:
>>> It may be used in other codes with auto generated types. For example, I
>>> think, the DefineRelation routine should choose composite type instead
>>> of using the same name as the table.

No, this is an absolutely horrid idea.  The rule that "_foo" means "array
of foo" is quite well known to a lot of client code, and as long as they
don't use any type names approaching NAMEDATALEN, it's solid.  So we must
not build backend code that uses "_foo"-like type names for any other
purpose than arrays.

I suspect in fact that the reason we ended up with this orphaned logic
is that somebody pointed out this problem somewhere along the development
of multiranges, whereupon makeMultirangeTypeName was changed to not
use the shared code --- but the breakup of makeArrayTypeName wasn't
undone altogether.  It should have been, because it just tempts other
people to make the same wrong choice.

Pushed with re-merging of the code into makeArrayTypeName and some
work on the comments.

regards, tom lane




Re: Unstable tests for recovery conflict handling

2022-07-26 Thread Andres Freund
Hi,

On 2022-07-26 14:30:30 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2022-07-26 13:57:53 -0400, Tom Lane wrote:
> >> So this is not a case of RecoveryConflictInterrupt doing the wrong thing:
> >> the startup process hasn't detected the buffer conflict in the first
> >> place.
> 
> > I wonder if this, at least partially, could be due to the elog thing
> > I was complaining about nearby. I.e. we decide to FATAL as part of a
> > recovery conflict interrupt, and then during that ERROR out as part of
> > another recovery conflict interrupt (because nothing holds interrupts as
> > part of FATAL).
> 
> There are all sorts of things one could imagine going wrong in the
> backend receiving the recovery conflict interrupt, but AFAICS in these
> failures, the startup process hasn't sent a recovery conflict interrupt.
> It certainly hasn't logged anything suggesting it noticed a conflict.

I don't think we reliably emit a log message before the recovery
conflict is resolved.

I've wondered a couple times now about making tap test timeouts somehow
trigger a core dump of all processes. Certainly would make it easier to
debug some of these kinds of issues.

Greetings,

Andres Freund




Re: Transparent column encryption

2022-07-26 Thread Robert Haas
On Tue, Jul 26, 2022 at 2:27 PM Jacob Champion  wrote:
> Right. My point is, if you have a column that has exactly one
> important value that is 17 bytes long when converted to text, you're
> going to want to know that block size exactly, because the encryption
> will be effectively useless for that value. That size needs to be
> documented, and it'd be helpful to know that it's longer than, say,
> the longest text representation of our fixed-length column types.

I certainly have no objection to being clear about such details in the
documentation.

> If the goal is to provide real encryption, and not just a toy, I think
> you're going to need to put a *lot* of effort into analysis. Even if
> the result of the analysis is "we don't plan to address this in v1".
>
> Crypto is inherently a cycle of
> make-it-and-break-it-and-fix-it-and-break-it-again. If that's
> considered a "wild goose chase" and not seriously pursued at some
> level, then this implementation will probably not last long in the
> face of real abuse. (That doesn't mean you have to take my advice; I'm
> just a dude with opinions -- but you will need to have real
> cryptographers look at this, and you're going to need to think about
> how the system will evolve when it's broken.)

Well, I'm just a dude with opinions, too. I fear the phenomenon where
doing anything about a problem makes you responsible for the whole
problem. If we disclaim the ability to hide the length of values,
that's clear enough. But if we start padding to try to hide the length
of values, then people might expect it to work in all cases, and I
don't see how it ever can. Moreover, I think that the padding might
need to be done in a "cryptographically intelligent" way rather than
just, say, adding trailing blanks. Now that being said, if Peter wants
to implement something around padding that he has reason to believe
will not create cryptographic weaknesses, I have no issue with that. I
just don't view it as an essential part of the feature, because hiding
such things doesn't seem like it can ever be the main point of a
feature like this.

> > Padding values to try to further obscure
> > things may be situationally useful, but if you find yourself worrying
> > too much about that sort of thing, you likely should have picked
> > stronger medicine initially.
>
> In my experience, this entire field is the application of
> situationally useful protection. That's one of the reasons it's hard,
> and designing this sort of patch is going to be hard too.

Agreed.

> Putting that
> on the user isn't quite fair when you're the ones designing the
> system; you determine what they have to worry about when you choose
> the crypto.

I guess my view on this is that, if you're trying to hide something
like a credit card number, most likely every value in the system is
the same length, and then this is a non-issue. On the other hand, if
the secret column is a person's name, then there is an issue, but
you're not going to pad every value out the maximum length of a
varlena, so you have to make an estimate of how long a name someone
might reasonably have to decide how much padding to include. You also
have to decide whether the storage cost of padding every value is
worth it to you given the potential information leakage. Only the
human user can make those decisions, so some amount of "putting that
on the user" feels inevitable. Now, if we don't have a padding system
built into the feature, then that does put even more on the user; it's
hard to argue with that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Question about ExplainOneQuery_hook

2022-07-26 Thread Zhihong Yu
Hi,
I was looking at ExplainOneQuery() where ExplainOneQuery_hook is called.

Currently the call to the hook is in the if block and normal processing is in
the else block.

What if the hook doesn't want to duplicate the whole code for printing the
execution plan?

Please advise.

Thanks


Re: Expand palloc/pg_malloc API

2022-07-26 Thread Tom Lane
Peter Eisentraut  writes:
> On 17.05.22 20:43, Tom Lane wrote:
>> So I think Peter's got a good idea here (I might quibble with the details
>> of some of these macros).  But it's not really going to move the
>> safety goalposts very far unless we make a concerted effort to make
>> these be the style everywhere.  Are we willing to do that?  What
>> will it do to back-patching difficulty?  Dare I suggest back-patching
>> these changes?

> I think it could go like the castNode() introduction: first we adopt it 
> sporadically for new code, then we change over some larger pieces of 
> code, then we backpatch the API, then someone sends in a big patch to 
> change the rest.

OK, that seems like a reasonable plan.

I've now studied this a little more closely, and I think the
functionality is fine, but I have naming quibbles.

1. Do we really want distinct names for the frontend and backend
versions of the macros?  Particularly since you're layering the
frontend ones over pg_malloc, which has exit-on-OOM behavior?
I think we've found that notational discrepancies between frontend
and backend code are generally more a hindrance than a help,
so I'm inclined to drop the pg_malloc_xxx macros and just use
"palloc"-based names across the board.

2. I don't like the "palloc_ptrtype" name at all.  I see that you
borrowed that name from talloc, but I doubt that's a precedent that
very many people are familiar with.  To me it sounds like it might
allocate something that's the size of a pointer, not the size of the
pointed-to object.  I have to confess though that I don't have an
obviously better name to suggest.  "palloc_pointed_to" would be
clear perhaps, but it's kind of long.

3. Likewise, "palloc_obj" is perhaps less clear than it could be.
I find palloc_array just fine though.  Maybe palloc_object or
palloc_struct?  (If "_obj" can be traced to talloc, I'm not
seeing where.)
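
For context, the intended usage of the array/object-style macros looks roughly
like this (a sketch against the proposed API, so the names may still change; it
assumes the backend's palloc infrastructure):

typedef struct MyState
{
	int			count;
	double	   *values;
} MyState;

static MyState *
make_state(int n)
{
	/* allocate one struct, with the cast and sizeof derived from the type */
	MyState    *state = palloc_obj(MyState);

	state->count = n;
	/* allocate an array of n doubles, again without spelling out sizeof */
	state->values = palloc_array(double, n);

	return state;
}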


One thought that comes to mind is that palloc_ptrtype is almost
surely going to be used in the style

myvariable = palloc_ptrtype(myvariable);

and if it's not that it's very likely wrong.  So maybe we should cut
out the middleman and write something like

#define palloc_instantiate(ptr) ((ptr) = (typeof(ptr)) palloc(sizeof(*(ptr))))
...
palloc_instantiate(myvariable);

I'm not wedded to "instantiate" here, there's probably better names.

regards, tom lane




Re: Skip partition tuple routing with constant partition key

2022-07-26 Thread David Rowley
Thank for looking at this.

On Sat, 23 Jul 2022 at 01:23, Amit Langote  wrote:
> +   /*
> +* The Datum has changed.  Zero the number of times we've
> +* found last_found_datum_index in a row.
> +*/
> +   partdesc->last_found_count = 0;
>
> +   /* Zero the "winning streak" on the cache hit count */
> +   partdesc->last_found_count = 0;
>
> Might it be better for the two comments to say the same thing?  Also,
> I wonder which one do you intend as the resetting of last_found_count:
> setting it to 0 or 1?  I can see that the stanza at the end of the
> function sets to 1 to start a new cycle.

I think I've addressed all of your comments.  The above one in
particular caused me to make some larger changes.

The reason I was zeroing the last_found_count in LIST partitioned
tables when the Datum was not equal to the previous found Datum was
due to the fact that the code at the end of the function was only
checking the partition indexes matched rather than the bound_offset vs
last_found_datum_index.  The reason I wanted to zero this was that if
you had a partition FOR VALUES IN(1,2), and you received rows with
values alternating between 1 and 2 then we'd match to the same
partition each time, however the equality test with the current
'values' and the Datum at last_found_datum_index would have been false
each time.  If we didn't zero the last_found_count we'd have kept
using the cache path even though the Datum and last Datum wouldn't
have been equal each time. That would have resulted in always doing
the cache check and failing, then doing the binary search anyway.

I've now changed the code so that instead of checking the last found
partition is the same as the last one, I'm now checking if
bound_offset is the same as last_found_datum_index.  This will be
false in the "values alternating between 1 and 2" case from above.
This caused me to have to change how the caching works for LIST
partitions with a NULL partition which is receiving NULL values.  I've
coded things now to just skip the cache for that case. Finding the
correct LIST partition for a NULL value is cheap and no need to cache
that.  I've also moved all the code which updates the cache fields to
the bottom of get_partition_for_tuple(). I'm only expecting to do that
when bound_offset is set by the lookup code in the switch statement.
Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
values shouldn't reach the code which updates the partition fields.
I've added an Assert(bound_offset >= 0) to ensure that stays true.

There's probably a bit more to optimise here too, but not much. I
don't think the partdesc->last_found_part_index = -1; is needed when
we're in the code block that does return boundinfo->default_index;
However, that only might very slightly speedup the case when we're
inserting continuously into the DEFAULT partition. That code path is
also used when we fail to find any matching partition. That's not one
we need to worry about making go faster.

I also ran the benchmarks again and saw that most of the use of
likely() and unlikely() no longer did what I found them to do earlier.
So the weirdness we saw there most likely was just down to random code
layout changes. In this patch, I just dropped the use of either of
those two macros.

David
diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index e03ea27299..6a323436d5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1332,10 +1332,48 @@ FormPartitionKeyDatum(PartitionDispatch pd,
elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values
+ * belong to the last found partition.  This must be above 0.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD        16
+
 /*
  * get_partition_for_tuple
  * Finds partition of relation which accepts the partition key 
specified
- * in values and isnull
+ * in values and isnull.
+ *
+ * Calling this function can be quite expensive when LIST and RANGE
+ * partitioned tables have many partitions.  This is due to the binary search
+ * that's done to find the correct partition.  Many of the use cases for LIST
+ * and RANGE partitioned tables make it likely that the same partition is
+ * found in subsequent ExecFindPartition() calls.  This is especially true for
+ * cases such as RANGE partitioned tables on a TIMESTAMP column where the
+ * partition key is the current time.  When asked to find a partition for a
+ * RANGE or LIST partitioned table, we record the partition index and datum
+ * offset we've found for the given 'values' in the PartitionDesc (which is
+ * stored in relcache), and if we keep finding the sam

Re: fairywren hung in pg_basebackup tests

2022-07-26 Thread Thomas Munro
On Tue, Jul 26, 2022 at 4:03 AM Andrew Dunstan  wrote:
> On 2022-07-25 Mo 11:24, Thomas Munro wrote:
> > On Tue, Jul 26, 2022 at 3:08 AM Tom Lane  wrote:
> >> Hmm ... an alternative theory is that the test is fine, and what
> >> it's telling us is that get_dirent_type() is still wrong on msys.
> >> Would that end in this symptom?
> > Hmm, possibly yes (if it sees a non-symlink, it'll skip it).   If
> > someone can run the test on an msys system, perhaps they could put a
> > debugging elog() into the code modified by 9d3444dc to log d_name and
> > the d_type that is returned?  I'm struggling to understand why msys
> > would change the answer though.
>
> I have no idea either. The link exists and it is a junction. I'll see
> about logging details.

From the clues so far, it seems like pgwin32_is_junction(fullpath) was
returning true, but the following code in get_dirent_type(), which was
supposed to be equivalent, is not reached on MSYS (though it
apparently is on MSVC?):

+   if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 &&
+   (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT))
d->ret.d_type = DT_LNK;

pgwin32_is_junction() uses GetFileAttributes() and tests (attr &
FILE_ATTRIBUTE_REPARSE_POINT) == FILE_ATTRIBUTE_REPARSE_POINT, which
is like the first condition but lacks the dwReserved0 part.  What is
that part doing, and why would it be doing something different in MSVC
and MSYS builds?  That code came from 87e6ed7c, recently I was just
trying to fix it by reordering the checks; oh, there was some
discussion about that field[1].
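
For clarity, the two classification strategies being compared boil down to
roughly the following (a simplified sketch, not the actual dirent.c code):

#include <windows.h>
#include <stdbool.h>

/* Attribute-bit-only test, as pgwin32_is_junction() does. */
static bool
is_reparse_point(const char *path)
{
	DWORD		attr = GetFileAttributesA(path);

	if (attr == INVALID_FILE_ATTRIBUTES)
		return false;
	return (attr & FILE_ATTRIBUTE_REPARSE_POINT) != 0;
}

/* Attribute bit plus reparse tag, as in the new get_dirent_type() path. */
static bool
is_mount_point(const char *path)
{
	WIN32_FIND_DATAA fd;
	HANDLE		h = FindFirstFileA(path, &fd);
	bool		result;

	if (h == INVALID_HANDLE_VALUE)
		return false;
	result = (fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0 &&
		fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT;
	FindClose(h);
	return result;
}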

One idea is that something about dwReserved0 or
IO_REPARSE_TAG_MOUNT_POINT is different in the open source replacement
system headers supplied by the MinGW project used by MSYS builds
(right?), compared to the "real" Windows SDK's headers used by MSVC
builds.

Or perhaps there is some other dumb mistake, or perhaps the reparse
point really is different, or ... I dunno, I'd probably shove a load
of log messages in there and see what's going on.

[1] 
https://www.postgresql.org/message-id/flat/CABUevEzURN%3DwC95JHvTKFJtEy0eY9rWO42yU%3D59-q8xSwm-Dug%40mail.gmail.com#ac54acd782fc849c0fe6c2c05db101dc




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-07-26 Thread David Rowley
On Tue, 26 Jul 2022 at 12:01, Zhihong Yu  wrote:
> sort order the the planner chooses is simply : there is duplicate `the`

I think the first "the" should be "that"

> +   /* mark this aggregate is covered by 'currpathkeys' */
>
> is covered by -> as covered by

I think it was shortened from "mark that this aggregate", but I
dropped "that" to get the comment to fit on a single line. Swapping
"is" for "as" makes it better. Thanks.

David




Re: explain_regress, explain(MACHINE), and default to explain(BUFFERS) (was: BUFFERS enabled by default in EXPLAIN (ANALYZE))

2022-07-26 Thread David G. Johnston
On Mon, Jan 24, 2022 at 9:54 AM Justin Pryzby  wrote:

> I'm renaming this thread for better visibility, since buffers is a small,
> optional part of the patches I sent.
>
> I made a CF entry here.
> https://commitfest.postgresql.org/36/3409/
>
> On Wed, Dec 01, 2021 at 06:58:20PM -0600, Justin Pryzby wrote:
> > On Mon, Nov 15, 2021 at 01:09:54PM -0600, Justin Pryzby wrote:
> > > Some time ago, I had a few relevant patches:
> > > 1) add explain(REGRESS) which is shorthand for (BUFFERS OFF, TIMING
> OFF, COSTS OFF, SUMMARY OFF)
> > > 2) add explain(MACHINE) which elides machine-specific output from
> explain;
> > >for example, Heap Fetches, sort spaceUsed, hash nbuckets, and
> tidbitmap stuff.
> > >
> > >
> https://www.postgresql.org/message-id/flat/20200306213310.gm...@telsasoft.com
> >
> > The attached patch series now looks like this (some minor patches are not
> > included in this list):
> >
> >  1. add GUC explain_regress, which disables TIMING, SUMMARY, COSTS;
>


> >  2. enable explain(BUFFERS) by default (but disabled by explain_regress);
>

+1


> >  3. Add explain(MACHINE) - which is disabled by explain_regress.
> > This elides various machine-specific output like Memory and Disk use.
> > Maybe it should be called something else like "QUIET" or
> "VERBOSE_MINUS_1"
> > or ??
>

INCIDENTALS ?

Aside from being 4 syllables and, for me at least, a pain to remember how
to spell, it seems to fit.

RUNTIME is probably easier, and we just need to point out the difficulty of
naming things, since actual rows are still shown and timings have their own
independent option.

>
> > The regression tests now automatically run with explain_regress=on,
> which is
> > shorthand for TIMING OFF, SUMMARY OFF, COSTS OFF, and then BUFFERS OFF.
> >
> > There's a further option called explain(MACHINE) which can be disabled
> to hide
> > portions of the output that are unstable, like Memory/Disk/Batches/
> > Heap Fetches/Heap Blocks.  This allows "explain analyze" to be used more
> easily
> > in regression tests, and simplifies some existing tests that currently
> use
> > plpgsql functions to filter the output.  But it doesn't handle all the
> > variations from parallel workers.
> >
> > (3) is optional, but simplifies some regression tests.  The patch series
> could
> > be rephrased with (3) first.
>

I like the idea of encapsulating the logic about what to show into the
generation code instead of a post-processing routine.

I don't see any particular ordering of the three changes being most clear.

>
> > Unfortunately, "COSTS OFF" breaks postgres_fdw remote_estimate.  If
> specifying
> > "COSTS ON" in postgres_fdw.c is considered to be a poor fix


Given that we have version tests scattered throughout the postgres_fdw.c
code I'd say protect it with version >= 9.0 and call it good.

But I'm assuming that we are somehow also sending the GUC over to the
remote server (which will be v16+), though I am unsure why that would be
the case.  I'm done code reading for now, so I'm leaving this as an open
request/question.

, then I suppose
> > this patch series could do as suggested and enable buffers by default
> only when
> > ANALYZE is specified.  Then postgres_fdw is not affected, and the
> > explain_regress GUC is optional: instead, we'd need to specify BUFFERS
> OFF in
> > ~100 regression tests which use EXPLAIN ANALYZE.

I'm not following the transition from the prior sentences about COSTS to
this one regarding BUFFERS.

  (3) still seems useful on its
> > own.
>

It is an orthogonal feature with a different section of our user base being
catered to, so I'd agree by default.

Patch Review Comments:


The following change in the machine commit seems out-of-place; are we
fixing a bug here?

explain.c:
   /* Show buffer usage in planning */
- if (bufusage)
+ if (bufusage && es->buffers)



Typo (trailing comment // without comment):

ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es); //


I did a pretty thorough look-through of the locations of the various
changes and everything seems consistent with the established code in this
area (took me a bit to get to the level of comprehension though).  Whether
there may be something missing I'm less able to judge.

From reading the other main thread, I'm not finding this particular choice
of GUC usage to be a problem anyone has raised, and it does seem the
cleanest solution, both for us and for anyone out there with a similar
testing framework.

The machine output option covers an obvious need since we've already
written ad-hoc parsing code to deal with the problem.

And, as someone who sees first-hand the kinds of conversations that occur
around beginners' questions, and increasingly performance questions
specifically, I'm sympathetic to the desire to have BUFFERS be output by
default.  Given existing buy-in for that idea
I'd hope that at minimum the "update 100 .sql and expected output files,
then change the default" commit happens even if the rest 

Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-07-26 Thread David Rowley
On Tue, 26 Jul 2022 at 19:39, Richard Guo  wrote:
> Also I'm wondering if it's possible to take into consideration the
> ordering indicated by existing indexes when determining the pathkeys. So
> that for the query below we can avoid the Incremental Sort node if we
> consider that there is an index on t(a, c):
>
> # explain (costs off) select max(b order by b), max(c order by c) from t 
> group by a;
>  QUERY PLAN
> -
>  GroupAggregate
>Group Key: a
>->  Incremental Sort
>  Sort Key: a, b
>  Presorted Key: a
>  ->  Index Scan using t_a_c_idx on t
> (6 rows)

That would be nice but I'm not going to add anything to this patch
which does anything like that. I think the patch, as it is, is a good
meaningful step forward to improve the performance of ordered
aggregates.

There are other things in the planner that could gain from what you
talk about. For example, choosing the evaluation order of WindowFuncs.
Perhaps it would be better to try to tackle those two problems
together rather than try to sneak something half-baked along with this
patch.

David




Re: Question about ExplainOneQuery_hook

2022-07-26 Thread David G. Johnston
On Tue, Jul 26, 2022 at 1:54 PM Zhihong Yu  wrote:

> Hi,
> I was looking at ExplainOneQuery() where ExplainOneQuery_hook is called.
>
> Currently the call to the hook is in if block and normal processing is in
> else block.
>
> What if the hook doesn't want to duplicate the whole code printing
> execution plan ?
>
> Please advise.
>
>
What kind of advice are you looking for, especially given that we don't
know anything except that you find the existing hook unusable?

https://github.com/postgres/postgres/commit/604ffd280b955100e5fc24649ee4d42a6f3ebf35

My advice is to pretend the hook doesn't even exist, since it was created
15 years ago for a specific purpose that isn't what you are doing.

I'm hoping that you already have some idea of how to interact with the open
source PostgreSQL project when it doesn't have a feature that you want.
Otherwise that generic discussion probably is best done on -general with a
better subject line.

David J.


Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-07-26 Thread Thomas Munro
On Sat, Jul 2, 2022 at 11:18 AM Andres Freund  wrote:
> On 2022-07-01 13:14:23 -0700, Andres Freund wrote:
> > - the pg_usleep(5000) in ResolveRecoveryConflictWithVirtualXIDs() can
> >   completely swamp the target(s) on a busy system. This surely is 
> > exascerbated
> >   by the usleep I added RecoveryConflictInterrupt() but a 5ms signalling 
> > pace
> >   does seem like a bad idea.
>
> This one is also implicated in
> https://postgr.es/m/20220701154105.jjfutmngoedgiad3%40alvherre.pgsql
> and related issues.
>
> Besides being very short, it also seems like a bad idea to wait when we might
> not need to? Seems we should only wait if we subsequently couldn't get the
> lock?
>
> Misleadingly WaitExceedsMaxStandbyDelay() also contains a usleep, which at
> least I wouldn't expect given its name.
>
>
> A minimal fix would be to increase the wait time, similar how it is done with
> standbyWait_us?
>
> Medium term it seems we ought to set the startup process's latch when handling
> a conflict, and use a latch wait. But avoiding races probably isn't quite
> trivial.

Yeah, I had the same thought; it's easy to criticise the current
collateral-damage-maximising design, but it's a whole project to come up
with a good, race-free, precise design.  We should do that, though.

> I guess the reason we first get an ERROR and then a FATAL is that the second
> iteration hits the if (RecoveryConflictPending && DoingCommandRead) bit,
> because we end up there after handling the first error? And that's a FATAL.
>
> I suspect that Thomas' fix will address at least part of this, as the check
> whether we're still waiting for a lock will be made just before the error is
> thrown.

That seems right.

> >   reporting a FATAL error in process of reporting a FATAL error. Yeha.
> >
> >   I assume this could lead to sending out the same message quite a few
> >   times.
>
> This seems like it needs to be fixed in elog.c. ISTM that at the very least we
> ought to HOLD_INTERRUPTS() before the EmitErrorReport() for FATAL.

That seems to make sense.

About my patch... even though it solves a couple of problems now
identified, I found an architectural problem that I don't have a
solution for yet, which stopped me in my tracks a few weeks back.  I
need to find a way forward that is back-patchable.

Recap:  The basic concept here is to kick all "real work" out of
signal handlers, because that work is unsafe in that context.  So
instead of deciding whether we need to cancel the current query at the
next CFI by setting QueryCancelPending, we defer the whole decision to
the next CFI.  Sometimes the decision is that we don't need to do
anything, and the CFI returns and execution continues normally.

The problem is that there are a couple of parts of our tree that don't
use a standard CFI, but are interrupted by looking for
QueryCancelPending directly.  syncrep.c is one, but I don't believe
you could be blocked there while recovery is in progress, and
regcomp.c is another.  (There was a third case relating to that
posix_fallocate() problem report you mentioned above, but 4518c798
removed that).  The regular expression machinery is capable of
consuming a lot of CPU, and does CANCEL_REQUESTED(nfa->v->re)
frequently to avoid getting stuck.  With the patch as it stands, that
would never be true.
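(For reference, the cancel check in the regex code is, if memory serves,
just a thin macro over those same flags, roughly:

/* src/include/regex/regcustom.h, approximately */
#define CANCEL_REQUESTED(re) \
    (InterruptPending && (QueryCancelPending || ProcDiePending))

so with the flag-setting moved out of the signal handler, it can never
become true there.)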




Re: Skip partition tuple routing with constant partition key

2022-07-26 Thread Zhihong Yu
On Tue, Jul 26, 2022 at 3:28 PM David Rowley  wrote:

> Thank for looking at this.
>
> On Sat, 23 Jul 2022 at 01:23, Amit Langote 
> wrote:
> > +   /*
> > +* The Datum has changed.  Zero the number of times
> we've
> > +* found last_found_datum_index in a row.
> > +*/
> > +   partdesc->last_found_count = 0;
> >
> > +   /* Zero the "winning streak" on the cache hit count
> */
> > +   partdesc->last_found_count = 0;
> >
> > Might it be better for the two comments to say the same thing?  Also,
> > I wonder which one do you intend as the resetting of last_found_count:
> > setting it to 0 or 1?  I can see that the stanza at the end of the
> > function sets to 1 to start a new cycle.
>
> I think I've addressed all of your comments.  The above one in
> particular caused me to make some larger changes.
>
> The reason I was zeroing the last_found_count in LIST partitioned
> tables when the Datum was not equal to the previous found Datum was
> due to the fact that the code at the end of the function was only
> checking the partition indexes matched rather than the bound_offset vs
> last_found_datum_index.  The reason I wanted to zero this was that if
> you had a partition FOR VALUES IN(1,2), and you received rows with
> values alternating between 1 and 2 then we'd match to the same
> partition each time, however the equality test with the current
> 'values' and the Datum at last_found_datum_index would have been false
> each time.  If we didn't zero the last_found_count we'd have kept
> using the cache path even though the Datum and last Datum wouldn't
> have been equal each time. That would have resulted in always doing
> the cache check and failing, then doing the binary search anyway.
>
> I've now changed the code so that instead of checking the last found
> partition is the same as the last one, I'm now checking if
> bound_offset is the same as last_found_datum_index.  This will be
> false in the "values alternating between 1 and 2" case from above.
> This caused me to have to change how the caching works for LIST
> partitions with a NULL partition which is receiving NULL values.  I've
> coded things now to just skip the cache for that case. Finding the
> correct LIST partition for a NULL value is cheap and no need to cache
> that.  I've also moved all the code which updates the cache fields to
> the bottom of get_partition_for_tuple(). I'm only expecting to do that
> when bound_offset is set by the lookup code in the switch statement.
> Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
> values shouldn't reach the code which updates the partition fields.
> I've added an Assert(bound_offset >= 0) to ensure that stays true.
>
> There's probably a bit more to optimise here too, but not much. I
> don't think the partdesc->last_found_part_index = -1; is needed when
> we're in the code block that does return boundinfo->default_index;
> However, that only might very slightly speedup the case when we're
> inserting continuously into the DEFAULT partition. That code path is
> also used when we fail to find any matching partition. That's not one
> we need to worry about making go faster.
>
> I also ran the benchmarks again and saw that most of the use of
> likely() and unlikely() no longer did what I found them to do earlier.
> So the weirdness we saw there most likely was just down to random code
> layout changes. In this patch, I just dropped the use of either of
> those two macros.
>
> David
>
Hi,

+   return boundinfo->indexes[last_datum_offset + 1];
+
+   else if (cmpval < 0 && last_datum_offset + 1 <
boundinfo->ndatums)

nit: the `else` keyword is not needed.
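Also, just to confirm my reading of the end-of-function bookkeeping
described above, it seems to amount to roughly this (field names as quoted
in this thread; the threshold and part_index here are placeholders):

Assert(bound_offset >= 0);

if (bound_offset == partdesc->last_found_datum_index)
    partdesc->last_found_count++;       /* extend the "winning streak" */
else
    partdesc->last_found_count = 1;     /* start a new streak */

partdesc->last_found_datum_index = bound_offset;
partdesc->last_found_part_index = part_index;

with the next call first comparing the incoming Datum against
boundinfo->datums[last_found_datum_index] once last_found_count reaches
the threshold, and only falling back to the binary search on a mismatch.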

Cheers


Re: Transparent column encryption

2022-07-26 Thread Jacob Champion
On 7/26/22 13:25, Robert Haas wrote:
> I certainly have no objection to being clear about such details in the
> documentation.

Cool.

> I fear the phenomenon where
> doing anything about a problem makes you responsible for the whole
> problem. If we disclaim the ability to hide the length of values,
> that's clear enough.

I don't think disclaiming responsibility absolves you of it here, in
part because choices are being made (text format) that make length
hiding even more important than it otherwise would be. A user who
already knows that encryption doesn't hide length might still reasonably
expect a fixed-length column type like bigint to be protected in all
cases. It won't be (at least not with your 16-byte example).

And sure, you can document that caveat too, but said user might then
reasonably wonder how they're supposed to actually make it safe.

> But if we start padding to try to hide the length
> of values, then people might expect it to work in all cases, and I
> don't see how it ever can.

Well, that's where I agree with you on the value of solid documentation.
But there are other things we can do as well. In general we should

- choose a default that will protect most people out of the box,
- document the heck out of the default's limitations,
- provide guardrails that warn the user when they're outgrowing those
  limitations, and
- give people a way to tune it to their own use cases.

As an example, a naive guardrail in this instance could be to simply
have the client refuse to encrypt data past the padding maximum, if
you've gone so far as to set one up. It'd suck to hit that maximum in
production and have to rewrite the column, but you did want your
encryption to hide your data, right?

Maybe that's way too complex to think about for a v1, but it'll be
easier to maintain this into the future if there's at least a plan to
create a v2. If you declare it out of scope, instead of considering it a
potential TODO, then I think it'll be a lot harder for people to improve it.

> Moreover, I think that the padding might
> need to be done in a "cryptographically intelligent" way rather than
> just, say, adding trailing blanks.

Possibly. I think that's where AEAD comes in -- if you've authenticated
your ciphertext sufficiently, padding oracles should be prohibitively
difficult(?). (But see below; I think we also have other things to worry
about in terms of authentication and oracles.)

> Now that being said, if Peter wants
> to implement something around padding that he has reason to believe
> will not create cryptographic weaknesses, I have no issue with that. I
> just don't view it as an essential part of the feature, because hiding
> such things doesn't seem like it can ever be the main point of a
> feature like this.

I think that side channel consideration has to be an essential part of
any cryptography feature. Recent history has shown "obscure" side
channels gaining power to the point of completely breaking crypto schemes.

And it's not like TLS where we have to protect an infinite stream of
arbitrary bytes; this is going to be used on small values that probably
get repeated often and have (comparatively) very little entropy.
Cryptanalysis based on length seems to me like part and parcel of the
problem space.

> I guess my view on this is that, if you're trying to hide something
> like a credit card number, most likely every value in the system is
> the same length, and then this is a non-issue.

Agreed.

> On the other hand, if
> the secret column is a person's name, then there is an issue, but
> you're not going to pad every value out the maximum length of a
> varlena, so you have to make an estimate of how long a name someone
> might reasonably have to decide how much padding to include. You also
> have to decide whether the storage cost of padding every value is
> worth it to you given the potential information leakage. Only the
> human user can make those decisions, so some amount of "putting that
> on the user" feels inevitable.

Agreed.

> Now, if we don't have a padding system
> built into the feature, then that does put even more on the user; it's
> hard to argue with that.

Right. If they can even fix it at all. Having a well-documented padding
feature would not only help mitigate that, it would conveniently hang a
big sign on the caveats that exist.

--

Speaking of oracles and side channels: users may want to use associated
data to further lock an encrypted value to its column type, too.
Otherwise it seems like an active DBA could feed an encrypted text blob
to a client in place of, say, an int column, to see whether or not that
text blob is a number. Seems like AD is going to be important to prevent
active attacks in general.
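To make that concrete, here is a rough sketch (not from the patch; error
handling omitted) of binding a ciphertext to its column by feeding, say,
the column's type OID to AES-GCM as associated data via OpenSSL:

#include <openssl/evp.h>

/* Returns ciphertext length; tag must have room for 16 bytes. */
static int
encrypt_with_column_aad(const unsigned char *key,   /* 32 bytes */
                        const unsigned char *iv,    /* 12 bytes */
                        const unsigned char *plaintext, int ptlen,
                        unsigned int column_type_oid,
                        unsigned char *ciphertext, unsigned char *tag)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int     len, ctlen;

    EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);

    /* associated data: authenticated but not encrypted */
    EVP_EncryptUpdate(ctx, NULL, &len,
                      (const unsigned char *) &column_type_oid,
                      sizeof(column_type_oid));

    EVP_EncryptUpdate(ctx, ciphertext, &len, plaintext, ptlen);
    ctlen = len;
    EVP_EncryptFinal_ex(ctx, ciphertext + len, &len);
    ctlen += len;

    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag);
    EVP_CIPHER_CTX_free(ctx);
    return ctlen;
}

Decryption would pass the same AAD before EVP_DecryptUpdate(), so a
ciphertext swapped into a differently-typed column fails the tag check.
(In a real design the AAD encoding would need to be pinned down:
endianness, versioning, and so on.)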

--Jacob




Re: [Commitfest 2022-07] Patch Triage: Waiting on Author

2022-07-26 Thread Justin Pryzby
On Tue, Jul 26, 2022 at 12:26:59PM -0700, Jacob Champion wrote:
> Hello all,
> 
> I'm making my way through some stalled patches in Waiting on Author. If
> nothing changes by the end of this CF, I'd recommend marking these
> as Returned with Feedback.

+1

I suggest that, if you send an email when marking a patch as RWF, you mention
that the existing patch record can be re-opened and moved to the next CF.

I'm aware that people may think that this isn't always a good idea, but it's
nice to mention that it's possible.  It's somewhat comparable to starting a new
thread (preferably including a link to the earlier one).

-- 
Justin




Re: Is RecoveryConflictInterrupt() entirely safe in a signal handler?

2022-07-26 Thread Tom Lane
Thomas Munro  writes:
> ... The regular expression machinery is capable of
> consuming a lot of CPU, and does CANCEL_REQUESTED(nfa->v->re)
> frequently to avoid getting stuck.  With the patch as it stands, that
> would never be true.

Surely that can't be too hard to fix.  We might have to refactor
the code around QueryCancelPending a little bit so that callers
can ask "do we need a query cancel now?" without actually triggering
a longjmp ... but why would that be problematic?
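Something with roughly this shape, perhaps (all names invented here, just
to illustrate):

/*
 * Sketch: let long-running code ask whether a cancel is wanted without
 * letting CHECK_FOR_INTERRUPTS() throw.
 */
static inline bool
QueryCancelWanted(void)
{
    if (!InterruptPending)
        return false;

    /*
     * Here is where any deferred recovery-conflict decision would be
     * resolved (the work the patch moves out of the signal handler),
     * reporting the result instead of calling ereport().
     */
    return QueryCancelPending || ProcDiePending;
}

The regex code's cancel macro could then test that instead of peeking at
QueryCancelPending directly.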

regards, tom lane




Re: [Commitfest 2022-07] Patch Triage: Waiting on Author

2022-07-26 Thread Jacob Champion
On 7/26/22 16:20, Justin Pryzby wrote:
> I suggest that, if you send an email when marking as RWF, to mention that the
> existing patch record can be re-opened and moved to next CF.
> 
> I'm aware that people may think that this isn't always a good idea, but it's
> nice to mention that it's possible.  It's somewhat comparable to starting a 
> new
> thread (preferably including a link to the earlier one).

Thanks, will do!

--Jacob





Re: [Commitfest 2022-07] Patch Triage: Waiting on Author

2022-07-26 Thread James Coleman
On Tue, Jul 26, 2022 at 3:27 PM Jacob Champion  wrote:
...
> - Consider parallel for LATERAL subqueries having LIMIT/OFFSET
>   https://commitfest.postgresql.org/38/2851/
>
>   Does this have a path forward? It's been Waiting on Author since the
>   beginning of last CF, with open concerns from Tom about safety.
...
> - Parallelize correlated subqueries that execute within each worker
>   https://commitfest.postgresql.org/38/3246/
>
>   Patch needs to be fixed for FreeBSD; last Author post in January.

These are both mine, and I'd hoped to work on them this CF, but I've
been sufficiently busy that that hasn't happened.

I'd like to just move these to the next CF.

Thanks,
James Coleman




Re: Expand palloc/pg_malloc API

2022-07-26 Thread David G. Johnston
On Tue, Jul 26, 2022 at 2:32 PM Tom Lane  wrote:

>
> 2. I don't like the "palloc_ptrtype" name at all.  I see that you
> borrowed that name from talloc, but I doubt that's a precedent that
> very many people are familiar with.



> To me it sounds like it might
> allocate something that's the size of a pointer, not the size of the
> pointed-to object.  I have to confess though that I don't have an
> obviously better name to suggest.  "palloc_pointed_to" would be
> clear perhaps, but it's kind of long.
>

I agree that "ptrtype" reads as "the type of a pointer".

This may not be a C-idiom but the pointed-to thing is a "reference" (hence
pass by value vs pass by reference).  So:

palloc_ref(myvariablepointer)

will allocate using the type of the referenced object.  That's just like
_array and _obj, which name the thing being used as a size template, as
opposed to "instantiate", which seems more like another word for
"allocate/palloc".

David J.
P.S.

Admittedly I'm still getting my head around reading pointer-heavy code (I
get the general concept but haven't had to write much of it myself)

- lockrelid = palloc(sizeof(*lockrelid));
+ lockrelid = palloc_ptrtype(lockrelid);

// This definitely seems like an odd idiom until I remembered short-lived
memory contexts, where the lost pointers are soon destroyed anyway.

So lockrelid (no star) is a pointer that has an underlying reference that
the macro (and the original code) resolves via the *

I cannot reason out whether the following would be equivalent to the above:

lockrelid = palloc_obj(*lockrelid);

I assume not because:  typeof(lockrelid) != (*lockrelid *)
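For what it's worth, if the macros have roughly the shapes I'd guess (not
necessarily the patch's exact definitions):

/* guessed shapes, for illustration only */
#define palloc_obj(type)      ((type *) palloc(sizeof(type)))
#define palloc_array(type, n) ((type *) palloc(sizeof(type) * (n)))
#define palloc_ptrtype(ptr)   ((typeof(ptr)) palloc(sizeof(*(ptr))))

then, assuming lockrelid is declared as LockRelId *:

lockrelid = palloc_ptrtype(lockrelid);   /* OK: takes the pointer variable */
lockrelid = palloc_obj(LockRelId);       /* OK: takes the type name */
lockrelid = palloc_obj(*lockrelid);      /* not OK: *lockrelid is an
                                          * expression, not a type name,
                                          * so the (type *) cast can't be
                                          * formed */

which matches the intuition above.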


Re: pg15b2: large objects lost on upgrade

2022-07-26 Thread Bruce Momjian
On Tue, Jul 26, 2022 at 03:45:11PM -0400, Robert Haas wrote:
> On Mon, Jul 18, 2022 at 2:57 PM Robert Haas  wrote:
> > Well, it took a while to figure out how to make that work, but I
> > believe I've got it now. Attached please find a couple of patches that
> > should get the job done. They might need a bit of polish, but I think
> > the basic concepts are sound.
> 
> So, would people like these patches (1) committed to master only, (2)
> committed to master and back-patched into v15, or (3) not committed at
> all? Michael argued upthread that it was too risky to be tinkering
> with things at this stage in the release cycle and, certainly, the
> more time goes by, the more true that gets. But I'm not convinced that
> these patches involve an inordinate degree of risk, and using beta as
> a time to fix things that turned out to be buggy is part of the point
> of the whole thing. On the other hand, the underlying issue isn't that
> serious either, and nobody seems to have reviewed the patches in
> detail, either. I don't mind committing them on my own recognizance if
> that's what people would prefer; I can take responsibility for fixing
> anything that is further broken, as I suppose would be expected even
> if someone else did review. But, I don't want to do something that
> other people feel is the wrong thing to have done.

This behavior is new in PG 15, and I would be worried about having one new
behavior in PG 15 and another one in PG 16.  Therefore, I would like to
see it in PG 15 and master.  I also think that doing nothing and leaving
these zero-length files around would be risky.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Refactoring postgres_fdw/connection.c

2022-07-26 Thread Kyotaro Horiguchi
At Tue, 26 Jul 2022 18:33:04 +0900, Fujii Masao  
wrote in 
> > I'm not sure the two are similar with each other.  The new function
> > pgfdw_exec_pre_commit() looks like a merger of two isolated code paths
> > intended to share a seven-line codelet.  I feel the code gets a bit
> > harder to understand after the change.  I mildly oppose to this part.
> 
> If so, we can pgfdw_exec_pre_commit() into two, one is the common
> function that sends or executes the command (i.e., calls
> do_sql_command_begin() or do_sql_command()), and another is
> the function only for toplevel. The latter function calls
> the common function and then executes DEALLOCATE ALL things.
> 
> But this is not the way that other functions like
> pgfdw_abort_cleanup()
> is implemented. Those functions have both codes for toplevel and
> !toplevel (i.e., subxact), and run the processings depending
> on the argument "toplevel". So I'm thinking that
> pgfdw_exec_pre_commit() implemented in the same way is better.

I didn't see it from that viewpoint, but I don't think that
unconditionally justifies the other refactoring.  If we merge the
pgfdw_finish_pre_(sub)?commit_cleanup()s this way, then in turn
pgfdw_subxact_callback() and pgfdw_xact_callback() are going to be
almost identical except for the event IDs they handle.  But I don't
think we would want to merge them.

A concern with 0002 is that it hides the subxact-specific steps from
the subxact callback.  It would look reasonable if it were called from
two or more places for each of toplevel and !toplevel, but actually it
has only one caller for each.  So I think that pgfdw_exec_pre_commit()
should not do that and should be renamed to pgfdw_commit_remote() or
something.  On the other hand, pgfdw_finish_pre_commit() hides
toplevel-specific steps from the caller, so the same argument holds.

Another point that concerns me about the patch is that the new
function takes an SQL statement, along with the toplevel flag. I guess
the reason is that the command for a subxact (RELEASE SAVEPOINT %d)
requires the current transaction nesting level.  However, that value
is obtainable very cheaply within the cleanup functions. So I propose to
get rid of the parameter "sql" from the two functions.
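Something like this inside the cleanup function, perhaps (just a sketch):

char    sql[100];

if (toplevel)
    snprintf(sql, sizeof(sql), "COMMIT TRANSACTION");
else
{
    int     curlevel = GetCurrentTransactionNestLevel();

    snprintf(sql, sizeof(sql), "RELEASE SAVEPOINT s%d", curlevel);
}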

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



