Re: Retry in pgbench

2021-04-13 Thread Thomas Munro
On Tue, Apr 13, 2021 at 5:51 PM Tatsuo Ishii  wrote:
> Currently the standard pgbench scenario produces transaction
> serialization errors ("could not serialize access due to concurrent
> update") if PostgreSQL runs at the REPEATABLE READ or SERIALIZABLE
> isolation level, and the session aborts. In order to achieve meaningful
> results even at these transaction isolation levels, I would like to
> propose an automatic retry feature for when the "could not serialize
> access due to concurrent update" error occurs.
>
> Probably just adding a switch to retry is not enough; a retry method
> (random interval etc.) and a maximum retry count may also need to be
> added.
>
> I would like to hear your thoughts,

See also:

https://www.postgresql.org/message-id/flat/72a0d590d6ba06f242d75c2e641820ec%40postgrespro.ru
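For context, a minimal two-session sketch of how the error in question
arises under REPEATABLE READ (the table and values are hypothetical):

-- Session 1:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts SET balance = balance + 1 WHERE id = 1;

-- Session 2, concurrently:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE accounts SET balance = balance + 1 WHERE id = 1;  -- blocks on session 1's row lock

-- Session 1:
COMMIT;

-- Session 2 now fails with:
-- ERROR:  could not serialize access due to concurrent update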




Re: Retry in pgbench

2021-04-13 Thread Tatsuo Ishii
> On Tue, Apr 13, 2021 at 5:51 PM Tatsuo Ishii  wrote:
>> Currently the standard pgbench scenario produces transaction
>> serialization errors ("could not serialize access due to concurrent
>> update") if PostgreSQL runs at the REPEATABLE READ or SERIALIZABLE
>> isolation level, and the session aborts. In order to achieve meaningful
>> results even at these transaction isolation levels, I would like to
>> propose an automatic retry feature for when the "could not serialize
>> access due to concurrent update" error occurs.
>>
>> Probably just adding a switch to retry is not enough; a retry method
>> (random interval etc.) and a maximum retry count may also need to be
>> added.
>>
>> I would like to hear your thoughts,
> 
> See also:
> 
> https://www.postgresql.org/message-id/flat/72a0d590d6ba06f242d75c2e641820ec%40postgrespro.ru

Thanks for the pointer. It seems we need to resume the discussion.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: TRUNCATE on foreign table

2021-04-13 Thread Fujii Masao




On 2021/04/13 14:22, Kohei KaiGai wrote:

Let me remind the discussion at the design level.

If postgres_fdw (and other FDW drivers) needs to consider whether an
ONLY clause is given on its foreign tables, what does a foreign table
represent in the PostgreSQL system?

My assumption is that a foreign table provides a view to external data,
as if it performed like a table. The TRUNCATE command eliminates all of
a table's segment files, even if the table contains multiple underlying
files; it never eliminates them partially. If a foreign table is
equivalent to a table at the SQL operation level, then the ONLY clause
controls which tables are picked up by the TRUNCATE command but never
controls which portion of the data shall be eliminated. So I conclude
that ExecForeignTruncate() shall eliminate the entire external data on
behalf of a foreign table, regardless of the ONLY clause.

I think it is more important to clarify this prior to the
implementation details. How about your opinions?


I'm still thinking that it's better to pass all information about the
TRUNCATE command, including the ONLY clause, to the FDW and leave it to
the FDW to determine how to use it. How postgres_fdw should use the
information about ONLY is debatable. But for now, IMO users who
explicitly specify the ONLY clause for foreign tables understand the
structure of the remote tables and want ONLY to be used in the TRUNCATE
command issued by postgres_fdw. But my opinion might be in the minority,
so I'd like to hear more opinions about this from other developers.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pgsql: Move tablespace path re-creation from the makefiles to pg_regres

2021-04-13 Thread Michael Paquier
On Mon, Apr 12, 2021 at 10:36:10PM -0700, Noah Misch wrote:
> Christoph Berg's first message on this thread reported doing that.  If
> supporting server_user!=pg_regress_user is unwarranted and Christoph Berg
> should stop, then already-committed code suffices.

Not sure that we have ever claimed that.  It is unfortunate that this
has broken a case that used to work, perhaps accidentally.  At the  
same time, Christoph has a workaround for the Debian suite, so it does
not seem like there is anything to do now, anyway.

> The race that commit 6c788d9 fixed is not inherent to Makefile rules.  For
> example, you could have instead caused test.sh to issue 'make
> installcheck-parallel TABLESPACEDIR="$outputdir"/testtablespace' and have the
> makefiles consult $(TABLESPACEDIR) rather than hard-code ./testtablespace.
> (That said, I like how 6c788d9 consolidated Windows and non-Windows paths.)

FWIW, I don't really want to split this code path across platforms
again.  Better to have one to rule them all.

> I don't know.  I never considered server_user!=pg_regress_user before this
> thread, and I don't plan to use it myself.  Your latest patch originated to
> make that case work, and my last message was reporting that the patch works
> for a rather-narrow interpretation of server_user!=pg_regress_user, failing on
> variations thereof.  That might be fine.

Okay.  So...  As I am not sure there is anything that needs to be
acted on here for the moment, I will just close the case.  If there
are more voices in favor of the SQL function using mkdir(), I would
not object to using that, as it looks to work for all the cases where
pg_regress is used.
--
Michael




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-13 Thread Michael Paquier
On Mon, Apr 12, 2021 at 01:40:52AM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> However, I'd like to think that we can do better than what's proposed
>> in the patch.  There are a couple of things to consider here:
>> - Should the parameter be renamed to reflect that it should only be
>> used for testing purposes?
> 
> -1 to that part, because it would break a bunch of buildfarm animals'
> configurations.  I doubt that any gain in clarity would be worth it.

Okay.

>> - Should we make more general the description of the developer options
>> in the docs?
> 
> Perhaps ... what did you have in mind?

The first sentence of the page now says that:
"The following parameters are intended for work on the PostgreSQL
source code, and in some cases to assist with recovery of severely
damaged databases."

That does not quite fit force_parallel_mode, IMO.  Maybe:
"The following parameters are intended for development work related to
PostgreSQL.  Some of them work on the PostgreSQL source code, some of
them can be used to control the run-time behavior of the server, and
in some cases they can be used to assist with the recovery of severely
damaged databases."
--
Michael




Re: Simplify backend terminate and wait logic in postgres_fdw test

2021-04-13 Thread Michael Paquier
On Mon, Apr 12, 2021 at 11:29:28AM +0530, Bharath Rupireddy wrote:
> I changed to 5min. If at all there's any server that would take more
> than 5min to remove a process from the system processes list, then it
> would see a warning on timeout.

Looks fine to me.  Let's wait a bit first to see if Fujii-san has any
objections to this cleanup as that's his code originally, from
32a9c0bd.
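For reference, the cleanup leans on the optional timeout argument (in
milliseconds) that PG14's pg_terminate_backend() gained, which waits
for the backend to exit and emits a warning on timeout. A sketch, with
the application_name filter being a placeholder:

SELECT pg_terminate_backend(pid, 300000)  -- wait up to 5min, WARNING on timeout
FROM pg_stat_activity
WHERE application_name = 'fdw_retry_check';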
--
Michael




Re: vacuum freeze - possible improvements

2021-04-13 Thread Virender Singla
Thanks Masahiko for the response.

"What is
the use case where users want to freeze fewer transactions, meaning
invoking anti-wraparound frequently?"

My overall focus here is anti-wraparound vacuum on huge tables in
emergency situations (where we have reached very close to 2B
transactions or are already in the outage window). In this situation we
want to recover ASAP instead of having many hours of outage. The purpose
of increasing "vacuum_freeze_min_age" to a high value is that the
anti-wraparound vacuum will have to do less work, because we are asking
it to freeze fewer transactions/tuples (of course a subsequent vacuum
has to do the remaining work).

"So the vacuum freeze will still have to
process tuples that are inserted/modified during consuming 1 billion
transactions. It seems to me that it’s not fewer transactions."

Yes, another thing here is that anti-wraparound vacuum also cleans dead
tuples, but I am not sure what we can do to avoid that.
Could there be a vacuum that only freezes the tuples?

Thanks for sharing the PG14 improvements; those are nice to have. But
the anti-wraparound vacuum will still have to scan all the pages (per
the visibility map) even if we are freezing fewer transactions, because
currently there is no way to know which block/tuple contains which
transaction id. If there were a way, it would be easier to directly
freeze those tuples quickly and advance the relfrozenxid for the table.
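To make the emergency scenario concrete, here is a sketch of the kind
of session described above (the table name is a placeholder; note that
vacuum_freeze_min_age is capped at half of autovacuum_freeze_max_age,
as mentioned in the quoted reply below):

-- How close is the table to wraparound?
SELECT relname, age(relfrozenxid) FROM pg_class WHERE relname = 'big_table';

-- Ask VACUUM to freeze only the oldest tuples, reducing the work done now:
SET vacuum_freeze_min_age = 1000000000;
VACUUM (VERBOSE) big_table;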


On Tue, Apr 13, 2021 at 7:52 AM Masahiko Sawada 
wrote:

> On Mon, Apr 12, 2021 at 5:38 PM Virender Singla 
> wrote:
> >
> > Hi Postgres Community,
> >
> > Regarding anti-wraparound vacuums (to freeze tuples), I see it has to
> scan all the pages which are not all-frozen (looking at the visibility map).
> That means even if we want to freeze fewer transactions (for example, by
> increasing the parameter vacuum_freeze_min_age to 1B), it will still scan
> all the pages in the visibility map, and that is a time-consuming process.
>
>  If vacuum_freeze_min_age is 1 billion, autovacuum_freeze_max_age is 2
> billion (vacuum_freeze_min_age is limited to the half of
> autovacuum_freeze_max_age). So vacuum freeze will still have to
> process tuples that are inserted/modified during consuming 1 billion
> transactions. It seems to me that it’s not fewer transactions. What is
> the use case where users want to freeze fewer transactions, meaning
> invoking anti-wraparound frequently?
>
> >
> > Can there be any improvement on this process so VACUUM knows the
> tuple/pages of those transactions which need to freeze up.
> >
> > Benefit of such an improvement is that if we are reaching transaction id
> close to 2B (and downtime), that time we can quickly recover the database
> with vacuuming freeze only a few millions rows with quick lookup rather
> than going all the pages from visibility map.
>
> Apart from this idea, in terms of speeding up vacuum, the
> vacuum_failsafe_age parameter, introduced in PG14[1], would also be
> helpful. When the failsafe is triggered, cost-based delay is no longer
> applied, and index vacuuming is bypassed in order to finish the vacuum
> work and advance relfrozenxid as quickly as possible.
>
> Regards
>
> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e55e7d1755cefbb44982fbacc7da461fa8684e6
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>


Unresolved replication hang and stop problem.

2021-04-13 Thread Krzysztof Kois
Hello,
After upgrading the cluster from 10.x to 13.1 we've started getting a problem
described on pgsql-general:
https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
We've noticed a similar issue being described on this list in
https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
with a fix being rolled out in 13.2.

After the 13.2 release, we upgraded to it and unfortunately this did not
solve the issue - the replication still stalls just as described in the
original report.
Please advise how to debug and solve this issue.


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-13 Thread Julien Rouhaud
On Mon, Apr 12, 2021 at 10:12:46PM -0400, Bruce Momjian wrote:
> On Thu, Apr  8, 2021 at 01:01:42PM -0400, Bruce Momjian wrote:
> > On Thu, Apr  8, 2021 at 12:51:06PM -0400, Álvaro Herrera wrote:
> > > On 2021-Apr-08, Bruce Momjian wrote:
> > > 
> > > > pg_stat_activity.queryid is new, but I can imagine cases where you would
> > > > join pg_stat_activity to pg_stat_statements to get an estimate of how
> > > > long the query will take --- having one using an underscore and another
> > > > one not seems odd.
> > > 
> > > OK.  So far, you have one vote for queryid (your own) and two votes for
> > > query_id (mine and Julien's).  And even yourself were hesitating about
> > > it earlier in the thread.
> > 
> > OK, if people are fine with pg_stat_activity.query_id not matching
> > pg_stat_statements.queryid, I am fine with that.  I just don't want
> > someone to say it was a big mistake later.  ;-)
> 
> OK, the attached patch renames pg_stat_activity.queryid to 'query_id'. I
> have not changed any of the APIs which existed before this feature was
> added, and are called "queryid" or "queryId" --- it is kind of a mess. 
> I assume I should leave those unchanged.  It will also need a catversion
> bump.

-   uint64  st_queryid;
+   uint64  st_query_id;

I thought we would internally keep queryid/queryId, at least for the variable
names as this is the name of the saved field in PlannedStmt.

-extern void pgstat_report_queryid(uint64 queryId, bool force);
+extern void pgstat_report_query_id(uint64 queryId, bool force);

But if we don't then it should be "uint64 query_id".
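For reference, the pg_stat_activity/pg_stat_statements join mentioned
upthread would look roughly like this with the renamed column (a
sketch, assuming query id computation is enabled and pg_stat_statements
is installed):

SELECT a.pid, a.query_id, s.mean_exec_time
FROM pg_stat_activity a
JOIN pg_stat_statements s ON s.queryid = a.query_id
WHERE a.state = 'active';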




Re: Replication slot stats misgivings

2021-04-13 Thread vignesh C
On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  wrote:
>
> On Mon, Apr 12, 2021 at 9:36 PM Amit Kapila  wrote:
> >
> > On Mon, Apr 12, 2021 at 5:29 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Apr 12, 2021 at 8:08 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 4:34 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 12, 2021 at 6:19 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 10:27 AM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > It seems Vignesh has changed patches based on the latest set of
> > > > > > > > comments so you might want to rebase.
> > > > > > >
> > > > > > > I've merged my patch into the v6 patch set Vignesh submitted.
> > > > > > >
> > > > > > > I've attached the updated version of the patches. I didn't change
> > > > > > > anything in the patch that changes char[NAMEDATALEN] to NameData 
> > > > > > > (0001
> > > > > > > patch) and patches that add tests.
> > > > > > >
> > > > > >
> > > > > > I think we can push 0001. What do you think?
> > > > >
> > > > > +1
> > > > >
> > > > > >
> > > > > > > In 0003 patch I reordered the
> > > > > > > output parameters of pg_stat_replication_slots; showing total 
> > > > > > > number
> > > > > > > of transactions and total bytes followed by statistics for 
> > > > > > > spilled and
> > > > > > > streamed transactions seems appropriate to me.
> > > > > > >
> > > > > >
> > > > > > I am not sure about this because I think we might want to add some
> > > > > > info of stream/spill bytes in total_bytes description (something 
> > > > > > like
> > > > > > stream/spill bytes are not in addition to total_bytes).
> > >
> > > BTW doesn't it confuse users that stream/spill bytes are not in
> > > addition to total_bytes? User will need to do "total_bytes +
> > > spill/stream_bytes" to know the actual total amount of data sent to
> > > the decoding output plugin, is that right?
> > >
> >
> > No, total_bytes includes the spill/stream bytes. So, the user doesn't
> > need to do any calculation to compute the total_bytes sent to the
> > output plugin.
>
> The following test for the latest v8 patch seems to show different.
> total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> something?
>
> postgres(1:85969)=# select pg_create_logical_replication_slot('s',
> 'test_decoding');
>  pg_create_logical_replication_slot
> -------------------------------------
>  (s,0/1884468)
> (1 row)
>
> postgres(1:85969)=# create table a (i int);
> CREATE TABLE
> postgres(1:85969)=# insert into a select generate_series(1, 10);
> INSERT 0 10
> postgres(1:85969)=# set logical_decoding_work_mem to 64;
> SET
> postgres(1:85969)=# select * from pg_stat_replication_slots ;
>  slot_name | total_txns | total_bytes | spill_txns | spill_count |
> spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> ---++-++-+-+-+--+--+-
>  s |  0 |   0 |  0 |   0 |
>   0 |   0 |0 |0 |
> (1 row)
>
> postgres(1:85969)=# select count(*) from
> pg_logical_slot_peek_changes('s', NULL, NULL);
>  count
> -------
>  14
> (1 row)
>
> postgres(1:85969)=# select * from pg_stat_replication_slots ;
>  slot_name | total_txns | total_bytes | spill_txns | spill_count |
> spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> ---++-++-+-+-+--+--+-
>  s |  2 |1808 |  1 | 202 |
> 1320 |   0 |0 |0 |
> (1 row)
>

Thanks for identifying this issue. While spilling the transactions, the
reorder buffer changes get released, so we will not be able to get the
total size for spilled transactions from the reorder buffer size. I
have fixed it by adding the spilled bytes to the total bytes in the
case of spilled transactions. The attached patch has the fix for this.
Thoughts?
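A quick sanity check of the fixed behavior could look like the
following sketch: re-run the spill scenario quoted above and confirm
that the spilled bytes are already accounted for in the total.

SELECT slot_name, total_bytes, spill_bytes,
       total_bytes >= spill_bytes AS spill_included_in_total
FROM pg_stat_replication_slots;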

Regards,
Vignesh
From f22ecb366a461a5203ec07dde1450f2c269ee675 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Sat, 10 Apr 2021 08:14:52 +0530
Subject: [PATCH v9 1/5] Changed char datatype to NameData datatype for
 slotname.

Changed char datatype to NameData datatype for slotname.
---
 src/backend/postmaster/pgstat.c   | 32 +++
 src/backend/replication/logical/logical.c | 13 ++---
 src/backend/replication/slot.c|  7 -
 src/backend/utils/adt/pgstatfuncs.c   |  2 +-
 src/include/pgstat.h  | 11 +++-
 5 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f4467625f7..666ce95d08 100644
--- a/src/backend/pos

Re: TRUNCATE on foreign table

2021-04-13 Thread Kyotaro Horiguchi
At Tue, 13 Apr 2021 16:17:12 +0900, Fujii Masao  
wrote in 
> 
> 
> On 2021/04/13 14:22, Kohei KaiGai wrote:
> > Let me remind the discussion at the design level.
> > If postgres_fdw (and other FDW drivers) needs to consider whether
> > ONLY-clause is given
> > on the foreign tables of them, what does a foreign table represent in
> > PostgreSQL system?
> > My assumption is, a foreign table provides a view to external data, as
> > if it performs like a table.
> > TRUNCATE command eliminates all the segment files, even if a table
> > contains multiple
> > underlying files, never eliminate them partially.
> > If a foreign table is equivalent to a table in SQL operation level,
> > indeed, ONLY-clause controls
> > which tables are picked up by the TRUNCATE command, but never controls
> > which portion of
> > the data shall be eliminated. So, I conclude that
> > ExecForeignTruncate() shall eliminate the entire
> > external data on behalf of a foreign table, regardless of ONLY-clause.
> > I think it is more significant to clarify prior to the implementation
> > details.
> > How about your opinions?
> 
> I'm still thinking that it's better to pass all information including
> ONLY clause about TRUNCATE command to FDW and leave FDW to determine
> how to use them. How postgres_fdw should use the information about
> ONLY
> is debatable. But for now IMO users who explicitly specify the ONLY
> clause for
> foreign tables understand the structure of remote tables and want to
> use ONLY
> in TRUNCATE command issued by postgres_fdw. But my opinion might be
> minority,
> so I'd like to hear more opinion about this, from other developers.

From the syntactical point of view, my opinion on this is that the
"ONLY" in "TRUNCATE ONLY" is assumed to be consumed to tell the command
to disregard local children, so it cannot be propagated further, whether
or not the target relation has children.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: TRUNCATE on foreign table

2021-04-13 Thread Kohei KaiGai
On Tue, Apr 13, 2021 at 4:17 PM Fujii Masao  wrote:
>
> On 2021/04/13 14:22, Kohei KaiGai wrote:
> > Let me remind the discussion at the design level.
> >
> > If postgres_fdw (and other FDW drivers) needs to consider whether
> > ONLY-clause is given
> > on the foreign tables of them, what does a foreign table represent in
> > PostgreSQL system?
> >
> > My assumption is, a foreign table provides a view to external data, as
> > if it performs like a table.
> > TRUNCATE command eliminates all the segment files, even if a table
> > contains multiple
> > underlying files, never eliminate them partially.
> > If a foreign table is equivalent to a table in SQL operation level,
> > indeed, ONLY-clause controls
> > which tables are picked up by the TRUNCATE command, but never controls
> > which portion of
> > the data shall be eliminated. So, I conclude that
> > ExecForeignTruncate() shall eliminate the entire
> > external data on behalf of a foreign table, regardless of ONLY-clause.
> >
> > I think it is more significant to clarify prior to the implementation 
> > details.
> > How about your opinions?
>
> I'm still thinking that it's better to pass all information including
> ONLY clause about TRUNCATE command to FDW and leave FDW to determine
> how to use them. How postgres_fdw should use the information about ONLY
> is debatable. But for now IMO users who explicitly specify the ONLY clause 
> for
> foreign tables understand the structure of remote tables and want to use ONLY
> in TRUNCATE command issued by postgres_fdw. But my opinion might be minority,
> so I'd like to hear more opinion about this, from other developers.
>
Here are two points to discuss.

Regarding the FDW APIs, yes, nobody can deny that someone may want to
implement their own FDW module that adds special handling when its
foreign table is specified with the ONLY clause, even if we usually
ignore it.


On the other hand, when we consider a foreign table as an abstraction
of an external data source, at least the current postgres_fdw behavior
is not consistent.

When a foreign table by postgres_fdw that maps a remote parent table
has a local child table, this command shows all the rows from both the
local and remote tables.

postgres=# select * from f_table ;
 id |  v
+-
  1 | remote table t_parent id=1
  2 | remote table t_parent id=2
  3 | remote table t_parent id=3
 10 | remote table t_child1 id=10
 11 | remote table t_child1 id=11
 12 | remote table t_child1 id=12
 20 | remote table t_child2 id=20
 21 | remote table t_child2 id=21
 22 | remote table t_child2 id=22
 50 | it is l_child id=50
 51 | it is l_child id=51
 52 | it is l_child id=52
 53 | it is l_child id=53
(13 rows)

If f_table is specified with "ONLY", it picks up only the parent table
(f_table); however, the ONLY clause is not pushed down to the remote
side.

postgres=# select * from only f_table ;
 id |  v
+-
  1 | remote table t_parent id=1
  2 | remote table t_parent id=2
  3 | remote table t_parent id=3
 10 | remote table t_child1 id=10
 11 | remote table t_child1 id=11
 12 | remote table t_child1 id=12
 20 | remote table t_child2 id=20
 21 | remote table t_child2 id=21
 22 | remote table t_child2 id=22
(9 rows)

On the other hand, TRUNCATE ONLY f_table works as follows...

postgres=# truncate only f_table;
TRUNCATE TABLE
postgres=# select * from f_table ;
 id |  v
+-
 10 | remote table t_child1 id=10
 11 | remote table t_child1 id=11
 12 | remote table t_child1 id=12
 20 | remote table t_child2 id=20
 21 | remote table t_child2 id=21
 22 | remote table t_child2 id=22
 50 | it is l_child id=50
 51 | it is l_child id=51
 52 | it is l_child id=52
 53 | it is l_child id=53
(10 rows)

It eliminates the rows only from the remote parent table, although it
is a part of the foreign table.

My expectation is that the above command would show only the rows from
the local child table (id=50...53).
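For reference, a sketch of the setup that produces the output above
(names follow the example; the server and user mapping definitions are
omitted, and "loopback_srv" is a placeholder):

-- On the remote server:
CREATE TABLE t_parent (id int, v text);
CREATE TABLE t_child1 () INHERITS (t_parent);
CREATE TABLE t_child2 () INHERITS (t_parent);

-- On the local server:
CREATE FOREIGN TABLE f_table (id int, v text)
    SERVER loopback_srv OPTIONS (table_name 't_parent');
CREATE TABLE l_child () INHERITS (f_table);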

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Performance Evaluation of Result Cache by using TPC-DS

2021-04-13 Thread Yuya Watari
Hello,

Recently, the result cache feature was committed to PostgreSQL. I
tested its performance by executing TPC-DS. As a result, I found that
there were some regressions in query performance.

I used the TPC-DS scale factor 100 in the evaluation. I executed all
of the 99 queries in TPC-DS, and the result cache was used in 21 of
them. However, some queries took too much time, so I skipped their
execution. I set work_mem to 256MB, and
max_parallel_workers_per_gather to 0.

Evaluation results are as follows. A negative speedup ratio indicates
that the execution time increased with the result cache.

Query No | Execution time with result cache | Execution time without result cache | Speedup ratio
       7 |                            142.1 |                               142.2 |         0.03%
       8 |                            144.0 |                               142.8 |        -0.82%
      13 |                            164.6 |                               162.0 |        -1.65%
      27 |                            138.9 |                               138.7 |        -0.16%
      34 |                            135.7 |                               137.1 |         1.02%
      43 |                            209.5 |                               207.2 |        -1.10%
      48 |                            181.5 |                               170.7 |        -6.32%
      55 |                            130.6 |                               123.8 |        -5.48%
      61 |                            0.014 |                               0.037 |        62.06%
      62 |                             66.7 |                                59.9 |       -11.36%
      68 |                            131.3 |                               127.2 |        -3.17%
      72 |                            567.0 |                               563.4 |        -0.65%
      73 |                            130.1 |                               129.8 |        -0.29%
      88 |                           1044.5 |                              1048.7 |         0.40%
      91 |                              1.2 |                                 1.1 |        -7.93%
      96 |                            132.2 |                               131.7 |        -0.37%

As you can see from these results, many queries have a negative
speedup ratio, which means that there are negative impacts on the
query performance. In query 62, the execution time increased by
11.36%. I guess these regressions are due to the misestimation of the
cost in the planner. I attached the execution plan of query 62.

The result cache is currently enabled by default. However, if this
kind of performance regression is common, we have to change its
default behavior.

Best regards,
Yuya Watari
  
QUERY PLAN  

--
 Limit  (cost=2816266.54..2816267.79 rows=100 width=110) (actual 
time=59613.454..59613.473 rows=100 loops=1)
   ->  Sort  (cost=2816266.54..2816267.38 rows=336 width=110) (actual 
time=59613.453..59613.464 rows=100 loops=1)
 Sort Key: (substr((warehouse.w_warehouse_name)::text, 1, 20)), 
ship_mode.sm_type, web_site.web_name
 Sort Method: top-N heapsort  Memory: 49kB
 ->  HashAggregate  (cost=2816248.24..2816252.44 rows=336 width=110) 
(actual time=59612.622..59612.730 rows=360 loops=1)
   Group Key: substr((warehouse.w_warehouse_name)::text, 1, 20), 
ship_mode.sm_type, web_site.web_name
   Batches: 1  Memory Usage: 157kB
   ->  Hash Join  (cost=2510.74..2794150.54 rows=368295 width=78) 
(actual time=7.597..45818.495 rows=14336926 loops=1)
 Hash Cond: (web_sales.ws_ship_mode_sk = 
ship_mode.sm_ship_mode_sk)
 ->  Hash Join  (cost=2509.29..2792056.55 rows=368356 
width=36) (actual time=7.571..38820.092 rows=14337390 loops=1)
   Hash Cond: (web_sales.ws_web_site_sk = 
web_site.web_site_sk)
   ->  Hash Join  (cost=2506.75..2790916.14 rows=368430 
width=33) (actual time=7.554..35314.217 rows=14338265 loops=1)
 Hash Cond: (web_sales.ws_warehouse_sk = 
warehouse.w_warehouse_sk)
 ->  Hash Join  (cost=2505.41..2789674.19 
rows=368516 width=20) (actual time=7.544..31214.782 rows=14340028 loops=1)
   Hash Cond: (web_sales.ws_ship_date_sk = 
date_dim.d_date_sk)
   ->  Seq Scan on web_sales  
(cost=0.00..2598153.08 rows=72001808 width=20) (actual time=0.026..17405.391 
rows=72001237 loops=1)
   ->  Hash  (cost=2500.73..2500.73 
rows=374 width=4) (actual time=7.505..7.506 rows=365 loops=1)
 Buckets: 1024  Batches: 1  Memory 
Usage: 21kB
 ->  Seq Scan on date_dim  
(cost=0.00..2500.73 rows=374 width=4) (actual time=3.599..7.462 rows=365 
loops=1)
   Filter: ((d_month_seq >= 
1212) AND (d_month_seq <= 1223))
   Rows Removed by Filter: 72684
 ->  Hash  (cost=1.15..1.15 rows=15 width=21) 
(actual time=0.007..0.007 rows=15 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 
9kB
   ->  Seq Scan on warehouse  
(cost=0.00..1.15 rows=15 width=21) (actual time=0.003..0.004 rows=15 loops=1)
   ->  Hash  (cost=2.24..2.24 rows=24 width=11) (actual 
time=0.01

Re: Performance Evaluation of Result Cache by using TPC-DS

2021-04-13 Thread David Rowley
On Tue, 13 Apr 2021 at 21:29, Yuya Watari  wrote:
> I used the TPC-DS scale factor 100 in the evaluation. I executed all
> of the 99 queries in the TPC-DS, and the result cache worked in the 21
> queries of them. However, some queries took too much time, so I
> skipped their execution. I set work_mem to 256MB, and
> max_parallel_workers_per_gather to 0.

Many thanks for testing this.

> As you can see from these results, many queries have a negative
> speedup ratio, which means that there are negative impacts on the
> query performance. In query 62, the execution time increased by
> 11.36%. I guess these regressions are due to the misestimation of the
> cost in the planner. I attached the execution plan of query 62.

Can you share if these times were to run EXPLAIN ANALYZE or if they
were just the queries being executed normally?

The times in the two files you attached do look very similar to the
times in your table, so I suspect either TIMING ON is not that high an
overhead on your machine, or the results are that of EXPLAIN ANALYZE.

It would be really great if you could show the EXPLAIN (ANALYZE,
TIMING OFF) for query 62.   There's a chance that the slowdown comes
from the additional EXPLAIN ANALYZE timing overhead with the Result
Cache version.
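For reference, the requested instrumentation mode looks like this,
shown on a stand-in query since query 62 itself is too long to repeat
here:

-- Per-node timing disabled; row counts and total runtime are still
-- measured, removing most of the clock-reading overhead from the
-- comparison:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM generate_series(1, 1000000);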

> The result cache is currently enabled by default. However, if this
> kind of performance regression is common, we have to change its
> default behavior.

Yes, the feedback we get during the beta period will help drive that
decision or if the costing needs to be adjusted.

David




Re: Old Postgresql version on i7-1165g7

2021-04-13 Thread Yura Sokolov

Yura Sokolov wrote on 2021-04-09 16:28:

Good day, hackers.

I've got an HP ProBook 640g8 with an i7-1165g7. I've installed Ubuntu
20.04 LTS on it and started to play with PostgreSQL sources.

Occasionally I found I'm not able to `make check` old PostgreSQL
versions, at least 9.6 and 10. They fail at the initdb stage in the
call to postgres.

Raw PostgreSQL versions 9.6.8 and 10.0 fail in the bootstrap stage:

running bootstrap script ... 2021-04-09 12:33:26.424 MSK [161121]
FATAL:  could not find tuple for opclass 1
2021-04-09 12:33:26.424 MSK [161121] PANIC:  cannot abort
transaction 1, it was already committed
Aborted (core dumped)
child process exited with exit code 134

Our modified custom version 9.6 fails inside libc's __strncmp_avx2
during post-bootstrap with a segmentation fault:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __strncmp_avx2 ()
#1  0x557168a7eeda in nameeq
#2  0x557168b4c4a0 in FunctionCall2Coll
#3  0x557168659555 in heapgettup_pagemode
#4  0x55716865a617 in heap_getnext
#5  0x557168678cf1 in systable_getnext
#6  0x557168b5651c in GetDatabaseTuple
#7  0x557168b574a4 in InitPostgres
#8  0x5571689dcb7d in PostgresMain
#9  0x5571688844d5 in main

I've bisected between REL_11_0 and "Rename pg_rewind's
copy_file_range()" and found 372728b0d49552641f0ea83d9d2e08817de038fa
"Replace our traditional initial-catalog-data format with a better
design."


https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=372728b0d49552641f0ea83d9d2e08817de038fa

This is the first commit where `make check` doesn't fail during initdb
on my machine. Therefore 02f3e558f21c0fbec9f94d5de9ad34f321eb0e57 is
the last one where `make check` fails.

I've tried with gcc9, gcc10 and clang10.
I've configured either without parameters or with `CFLAGS=-O0
./configure --enable-debug`.

The problem doesn't happen on 10th-series Intel CPUs (i7-10510U and
i9-10900K). Unfortunately, I have no fellows or colleagues with an
11th-series Intel CPU, therefore I can't tell whether this is a bug of
the 11th series or of the concrete CPU installed in this notebook.

It would be great if someone with an i7-11* could try `make check` and
report whether it also fails.


BTW, the problem remains in Debian stable (10.4) inside Docker on the
same machine.




With regards,
Yura Sokolov
PostgresPro





Re: Core dump happens when execute sql CREATE VIEW v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10));

2021-04-13 Thread Yulin PEI
After reading the code and the patch, I think the patch is good. If the 
type is non-collatable, we do not add a CollateExpr node as a 'parent' node to 
the coerced node.


From: Tom Lane 
Sent: April 13, 2021 0:59
To: Yulin PEI 
Cc: pgsql-hackers@lists.postgresql.org 
Subject: Re: Core dump happens when execute sql CREATE VIEW v1(c1) AS (SELECT ('4' 
COLLATE "C")::INT FROM generate_series(1, 10));

Yulin PEI  writes:
> I found it could cause a crash when executing sql statement: `CREATE VIEW 
> v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10)); ` in 
> postgres 13.2 release.

Nice catch.  I don't think the code in DefineVirtualRelation is wrong:
exprCollation shouldn't report any collation for an expression of a
non-collatable type.  Rather the problem is with an old kluge in
coerce_type(), which will push a type coercion underneath a CollateExpr
... without any mind for the possibility that the coercion result isn't
collatable.  So the right fix is more or less the attached.

regards, tom lane



Re: Core dump happens when execute sql CREATE VIEW v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10));

2021-04-13 Thread Yulin PEI
I think it is better to add this test case to regress.
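A minimal sketch of such a test (the view name is made up here; it is
expected to complete without crashing once the coerce_type() fix is
in):

CREATE VIEW collate_coerce_v(c1) AS
    (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10));
SELECT * FROM collate_coerce_v LIMIT 1;
DROP VIEW collate_coerce_v;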

From: Tom Lane 
Sent: April 13, 2021 0:59
To: Yulin PEI 
Cc: pgsql-hackers@lists.postgresql.org 
Subject: Re: Core dump happens when execute sql CREATE VIEW v1(c1) AS (SELECT ('4' 
COLLATE "C")::INT FROM generate_series(1, 10));

Yulin PEI  writes:
> I found it could cause a crash when executing sql statement: `CREATE VIEW 
> v1(c1) AS (SELECT ('4' COLLATE "C")::INT FROM generate_series(1, 10)); ` in 
> postgres 13.2 release.

Nice catch.  I don't think the code in DefineVirtualRelation is wrong:
exprCollation shouldn't report any collation for an expression of a
non-collatable type.  Rather the problem is with an old kluge in
coerce_type(), which will push a type coercion underneath a CollateExpr
... without any mind for the possibility that the coercion result isn't
collatable.  So the right fix is more or less the attached.

regards, tom lane



0001-add-regress.patch
Description: 0001-add-regress.patch


Re: vacuum freeze - possible improvements

2021-04-13 Thread David Rowley
On Tue, 13 Apr 2021 at 19:48, Virender Singla  wrote:
> Yes, another thing here is that anti-wraparound vacuum also cleans dead
> tuples, but I am not sure what we can do to avoid that.
> Could there be a vacuum that only freezes the tuples?

You might want to have a look at [1], which was just pushed for PG14.

In particular:

> When the failsafe triggers, VACUUM takes extraordinary measures to
> finish as quickly as possible so that relfrozenxid and/or relminmxid can
> be advanced.  VACUUM will stop applying any cost-based delay that may be
> in effect.  VACUUM will also bypass any further index vacuuming and heap
> vacuuming -- it only completes whatever remaining pruning and freezing
> is required.  Bypassing index/heap vacuuming is enabled by commit
> 8523492d, which made it possible to dynamically trigger the mechanism
> already used within VACUUM when it is run with INDEX_CLEANUP off.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e55e7d1755cefbb44982fbacc7da461fa8684e6




Re: Old Postgresql version on i7-1165g7

2021-04-13 Thread Justin Pryzby
On Fri, Apr 09, 2021 at 04:28:25PM +0300, Yura Sokolov wrote:
> Good day, hackers.
> 
> I've got HP ProBook 640g8 with i7-1165g7. I've installed Ubuntu 20.04 LTS on
> it
> and started to play with PostgreSQL sources.
> 
> Occasionally I found I'm not able to `make check` old PostgreSQL versions.

Do you mean that HEAD works consistently, but v9.6 and v10 sometimes
work but sometimes fail?

> #5  0x557168678cf1 in systable_getnext
> #6  0x557168b5651c in GetDatabaseTuple
> #7  0x557168b574a4 in InitPostgres
> #8  0x5571689dcb7d in PostgresMain
> #9  0x5571688844d5 in main
> 
> I've bisected between REL_11_0 and "Rename pg_rewind's copy_file_range()"
> and
> found 372728b0d49552641f0ea83d9d2e08817de038fa
> > Replace our traditional initial-catalog-data format with a better
> > design.
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=372728b0d49552641f0ea83d9d2e08817de038fa
> 
> This is first commit where `make check` doesn't fail during initdb on my
> machine. Therefore 02f3e558f21c0fbec9f94d5de9ad34f321eb0e57 is the last one 
> where
> `make check` fails.

This doesn't make much sense or help much, since 372728b doesn't actually
change the catalogs, or any .c file.

> I've tried with gcc9, gcc10 and clang10.
> I've configured either without parameters or with `CFLAGS=-O0 ./configure
> --enable-debug`.

You used make clean too, right?

I would also use --enable-cassert, since it might catch problems you'd
otherwise miss.

If that doesn't expose anything, maybe try to #define USE_VALGRIND in
src/include/pg_config_manual.h, and run with valgrind --trace-children=yes

-- 
Justin




Re: TRUNCATE on foreign table

2021-04-13 Thread Bharath Rupireddy
On Tue, Apr 13, 2021 at 2:37 PM Kohei KaiGai  wrote:
> Here are two points to discuss.
>
> Regarding to the FDW-APIs, yes, nobody can deny someone want to implement
> their own FDW module that adds special handling when its foreign table
> is specified
> with ONLY-clause, even if we usually ignore.
>
>
> On the other hand, when we consider a foreign table is an abstraction
> of an external
> data source, at least, the current postgres_fdw's behavior is not consistent.
>
> When a foreign table by postgres_fdw that maps a remote parent table,
> has a local
> child table,
>
> This command shows all the rows from both of local and remote.
>
> postgres=# select * from f_table ;
>  id |  v
> +-
>   1 | remote table t_parent id=1
>   2 | remote table t_parent id=2
>   3 | remote table t_parent id=3
>  10 | remote table t_child1 id=10
>  11 | remote table t_child1 id=11
>  12 | remote table t_child1 id=12
>  20 | remote table t_child2 id=20
>  21 | remote table t_child2 id=21
>  22 | remote table t_child2 id=22
>  50 | it is l_child id=50
>  51 | it is l_child id=51
>  52 | it is l_child id=52
>  53 | it is l_child id=53
> (13 rows)
>
> If f_table is specified with "ONLY", it picks up only the parent table
> (f_table),
> however, ONLY-clause is not push down to the remote side.
>
> postgres=# select * from only f_table ;
>  id |  v
> +-
>   1 | remote table t_parent id=1
>   2 | remote table t_parent id=2
>   3 | remote table t_parent id=3
>  10 | remote table t_child1 id=10
>  11 | remote table t_child1 id=11
>  12 | remote table t_child1 id=12
>  20 | remote table t_child2 id=20
>  21 | remote table t_child2 id=21
>  22 | remote table t_child2 id=22
> (9 rows)
>
> On the other hand, TRUNCATE ONLY f_table works as follows...
>
> postgres=# truncate only f_table;
> TRUNCATE TABLE
> postgres=# select * from f_table ;
>  id |  v
> +-
>  10 | remote table t_child1 id=10
>  11 | remote table t_child1 id=11
>  12 | remote table t_child1 id=12
>  20 | remote table t_child2 id=20
>  21 | remote table t_child2 id=21
>  22 | remote table t_child2 id=22
>  50 | it is l_child id=50
>  51 | it is l_child id=51
>  52 | it is l_child id=52
>  53 | it is l_child id=53
> (10 rows)
>
> It eliminates the rows only from the remote parent table although it
> is a part of the foreign table.
>
> My expectation at the above command shows rows from the local child
> table (id=50...53).

Yeah, the ONLY clause is not pushed to the remote server in the case of
SELECT commands. This is also true for DELETE and UPDATE commands on
foreign tables. I'm not sure if it wasn't thought necessary, or if
there is an issue with pushing it, or if I may be missing something
here. I think we can start a separate thread to see other hackers'
opinions on this.

I'm not sure whether all the clauses that are possible for
SELECT/UPDATE/DELETE/INSERT with local tables are pushed to the remote
server by postgres_fdw.

Well, right now foreign TRUNCATE pushes the ONLY clause to the remote
server, which is inconsistent compared to the SELECT/UPDATE/DELETE
commands. If we were to keep it consistent across all foreign commands,
with the ONLY clause never pushed to the remote server, then we could
restrict it for TRUNCATE too: even if "TRUNCATE ONLY foreign_tbl" is
specified, just pass "TRUNCATE foreign_tbl" to the remote server.
Having said that, I don't see any real problem in pushing the ONLY
clause, at least in the case of TRUNCATE.
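To spell out the two alternatives, given a local "TRUNCATE ONLY
f_table" and assuming the remote table is named t_parent, postgres_fdw
would send one of the following:

-- Alternative 1: push ONLY down (the current behavior under discussion);
-- only the remote parent's own rows are removed:
TRUNCATE ONLY public.t_parent;

-- Alternative 2: consume ONLY locally, consistent with SELECT/UPDATE/DELETE;
-- the remote parent and its children are all truncated:
TRUNCATE public.t_parent;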

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-13 Thread Justin Pryzby
On Tue, Apr 13, 2021 at 04:34:23PM +0900, Michael Paquier wrote:
> On Mon, Apr 12, 2021 at 01:40:52AM -0400, Tom Lane wrote:
> >> - Should we make more general the description of the developer options
> >> in the docs?
> > 
> > Perhaps ... what did you have in mind?
> 
> The first sentence of the page now says that:
> "The following parameters are intended for work on the PostgreSQL
> source code, and in some cases to assist with recovery of severely
> damaged databases."
> 
> That does not stick with force_parallel_mode IMO.  Maybe:

Good point.

-- 
Justin
>From b5355f3384df1aab87e4cb89afcf4d9703c14bdb Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 3 Apr 2021 19:06:37 -0500
Subject: [PATCH v3 1/3] track_activity_query_size is STATS_COLLECTOR category

Not Resource Usage / Memory, as since 995fb7420
---
 src/backend/utils/misc/guc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d0a51b507d..e65d62e71d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3506,7 +3506,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"track_activity_query_size", PGC_POSTMASTER, RESOURCES_MEM,
+		{"track_activity_query_size", PGC_POSTMASTER, STATS_COLLECTOR,
 			gettext_noop("Sets the size reserved for pg_stat_activity.query, in bytes."),
 			NULL,
 			GUC_UNIT_BYTE
-- 
2.17.0

>From b491b4520f157fb1328925f500f15fe2ef8e457c Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 3 Apr 2021 19:17:03 -0500
Subject: [PATCH v3 2/3] track_commit_timestamp is REPLICATION_SENDING

If I'm not wrong, this was missed at 4bd8ed31b
---
 src/backend/utils/misc/guc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e65d62e71d..a3c2ebbc53 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1188,7 +1188,7 @@ static struct config_bool ConfigureNamesBool[] =
 		check_bonjour, NULL, NULL
 	},
 	{
-		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION,
+		{"track_commit_timestamp", PGC_POSTMASTER, REPLICATION_SENDING,
 			gettext_noop("Collects transaction commit time."),
 			NULL
 		},
-- 
2.17.0

>From e9b556f2bc75dc8350759e4b32e588325086218e Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 3 Apr 2021 19:24:50 -0500
Subject: [PATCH v3 3/3] Change force_parallel_mode to a DEVELOPER GUC, and
 remove it from sample config..

..to discourage users from changing this option in hopes that it'll make their
queries faster, but without reading the documentation or understanding what it
does.
---
 doc/src/sgml/config.sgml  | 96 ++-
 src/backend/utils/misc/guc.c  |  2 +-
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 3 files changed, 50 insertions(+), 49 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f749fe9ce7..a7350c6ddc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5912,51 +5912,6 @@ SELECT * FROM parent WHERE key = 2400;
   
  
 
- 
-  force_parallel_mode (enum)
-  
-   force_parallel_mode configuration parameter
-  
-  
-  
-   
-Allows the use of parallel queries for testing purposes even in cases
-where no performance benefit is expected.
-The allowed values of force_parallel_mode are
-off (use parallel mode only when it is expected to improve
-performance), on (force parallel query for all queries
-for which it is thought to be safe), and regress (like
-on, but with additional behavior changes as explained
-below).
-   
-
-   
-More specifically, setting this value to on will add
-a Gather node to the top of any query plan for which this
-appears to be safe, so that the query runs inside of a parallel worker.
-Even when a parallel worker is not available or cannot be used,
-operations such as starting a subtransaction that would be prohibited
-in a parallel query context will be prohibited unless the planner
-believes that this will cause the query to fail.  If failures or
-unexpected results occur when this option is set, some functions used
-by the query may need to be marked PARALLEL UNSAFE
-(or, possibly, PARALLEL RESTRICTED).
-   
-
-   
-Setting this value to regress has all of the same effects
-as setting it to on plus some additional effects that are
-intended to facilitate automated regression testing.  Normally,
-messages from a parallel worker include a context line indicating that,
-but a setting of regress suppresses this line so that the
-output is the same as in non-parallel execution.  Also,
-the Gather nodes added to plans by this setting are hidden
-in EXPLAIN output so that the outp

Re: vacuum freeze - possible improvements

2021-04-13 Thread Masahiko Sawada
On Tue, Apr 13, 2021 at 1:51 PM Virender Singla  wrote:
>
> Thanks Masahiko for the response.
>
> "What is
> the use case where users want to freeze fewer transactions, meaning
> invoking anti-wraparound frequently?"
>
> My overall focus here is anti-wraparound vacuum on huge tables in emergency
> situations (where we have reached very close to 2B transactions or are
> already in the outage window). In this situation we want to recover ASAP
> instead of having many hours of outage. The purpose of increasing
> "vacuum_freeze_min_age" to a high value is that the anti-wraparound vacuum
> will have to do less work, because we are asking it to freeze fewer
> transactions/tuples (of course a subsequent vacuum has to do the remaining
> work).

I think I understood your proposal. For example, if we insert 500GB of
tuples during the first 1 billion transactions and then insert another
500GB of tuples into other blocks during the next 1 billion
transactions, vacuum freeze scans 1TB, whereas we would scan only the
500GB modified by the first insertions if we were able to directly
freeze tuples older than the cut-off. Is that right?

>
> "So the vacuum freeze will still have to
> process tuples that are inserted/modified during consuming 1 billion
> transactions. It seems to me that it’s not fewer transactions."
>
> Yes, another thing here is that anti-wraparound vacuum also cleans dead
> tuples, but I am not sure what we can do to avoid that.
> Could there be a vacuum that only freezes the tuples?

I think it's a good idea to skip all work except for freezing tuples
in emergency cases. Thanks to vacuum_failsafe_age we can avoid index
vacuuming, index cleanup, and heap vacuuming.

>
> Thanks for sharing the PG14 improvements; those are nice to have. But the
> anti-wraparound vacuum will still have to scan all the pages (per the
> visibility map) even if we are freezing fewer transactions, because
> currently there is no way to know which block/tuple contains which
> transaction id.

Yes, that feature is to speed up vacuum by dynamically disabling both
cost-based delay and some cleanup work, whereas your idea is to do that
by speeding up the heap scan.

> If there were a way, it would be easier to directly freeze those tuples
> quickly and advance the relfrozenxid for the table.

Maybe we can track the oldest xid per page in a map like the visibility
map, or integrate it with the visibility map. We would need to freeze
only pages that are all-visible and whose oldest xid is older than the
cut-off xid. I think we would need to track both xids and multixids.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Monitoring stats docs inconsistency

2021-04-13 Thread vignesh C
Hi,

A few of the statistics descriptions in the monitoring_stats.sgml doc
are not consistent. I have made all the descriptions consistent by
including punctuation marks at the end of each description.
Thoughts?

Regards,
Vignesh
From b74179aec11eb1f2439ef43e1830531c2cde78a2 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 12 Apr 2021 20:49:52 +0530
Subject: [PATCH] Monitoring stats docs inconsistency.

Few of the statistics description in monitoring_stats.sgml doc is not
consistent. Made all the description consistent by including fullstop.
---
 doc/src/sgml/monitoring.sgml | 354 +--
 1 file changed, 177 insertions(+), 177 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8287587f61..c3339c6753 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -695,7 +695,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
datid oid
   
   
-   OID of the database this backend is connected to
+   OID of the database this backend is connected to.
   
  
 
@@ -704,7 +704,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
datname name
   
   
-   Name of the database this backend is connected to
+   Name of the database this backend is connected to.
   
  
 
@@ -713,7 +713,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
pid integer
   
   
-   Process ID of this backend
+   Process ID of this backend.
   
  
 
@@ -733,7 +733,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
usesysid oid
   
   
-   OID of the user logged into this backend
+   OID of the user logged into this backend.
   
  
 
@@ -742,7 +742,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
usename name
   
   
-   Name of the user logged into this backend
+   Name of the user logged into this backend.
   
  
 
@@ -752,7 +752,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
   
Name of the application that is connected
-   to this backend
+   to this backend.
   
  
 
@@ -819,7 +819,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
Time when the currently active query was started, or if
state is not active, when the last query
-   was started
+   was started.
   
  
 
@@ -828,7 +828,7 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
state_change timestamp with time zone
   
   
-   Time when the state was last changed
+   Time when the state was last changed.
   
  
 
@@ -2302,7 +2302,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
pid integer
   
   
-   Process ID of a WAL sender process
+   Process ID of a WAL sender process.
   
  
 
@@ -2311,7 +2311,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
usesysid oid
   
   
-   OID of the user logged into this WAL sender process
+   OID of the user logged into this WAL sender process.
   
  
 
@@ -2320,7 +2320,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
usename name
   
   
-   Name of the user logged into this WAL sender process
+   Name of the user logged into this WAL sender process.
   
  
 
@@ -2330,7 +2330,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
   
Name of the application that is connected
-   to this WAL sender
+   to this WAL sender.
   
  
 
@@ -2362,7 +2362,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
   
TCP port number that the client is using for communication
-   with this WAL sender, or -1 if a Unix socket is used
+   with this WAL sender, or -1 if a Unix socket is used.
   
  
 
@@ -2372,7 +2372,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
   
Time when this process was started, i.e., when the
-   client connected to this WAL sender
+   client connected to this WAL sender.
   
  
 
@@ -2430,7 +2430,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
sent_lsn pg_lsn
   
   
-   Last write-ahead log location sent on this connection
+   Last write-ahead log location sent on this connection.
   
  
 
@@ -2440,7 +2440,7 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
   
Last write-ahead log location wri

Re: Identify missing publications from publisher while create/alter subscription.

2021-04-13 Thread vignesh C
On Thu, Apr 8, 2021 at 12:13 PM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 7, 2021 at 10:37 PM vignesh C  wrote:
> > > I think, we can also have validate_publication option allowed for
> > > ALTER SUBSCRIPTION SET PUBLICATION and REFRESH PUBLICATION commands
> > > with the same behaviour i.e. error out when specified publications
> > > don't exist in the publisher. Thoughts?
> >
> > Sorry for the delayed reply, I was working on a few other projects so
> > I was not able to reply quickly.
> > Since we are getting the opinion that if we make the check
> > publications by default it might affect the existing users, I'm fine
> > with having an option validate_option to check if the publication
> > exists in the publisher, that way there is no change for the existing
> > user. I have made a patch in similar lines, attached patch has the
> > changes for the same.
> > Thoughts?
>
> Here are some comments on v3 patch:
>

Thanks for the comments

> 1) Please mention what's the default value of the option
> +   
> +validate_publication
> (boolean)
> +
> + 
> +  Specifies whether the subscriber must verify if the specified
> +  publications are present in the publisher. By default, the 
> subscriber
> +  will not check if the publications are present in the publisher.
> + 
> +
> +   
>

Modified.

> 2) How about
> +  Specifies whether the subscriber must verify the
> publications that are
> +   being subscribed to are present in the publisher. By default,
> the subscriber
> instead of
> +  Specifies whether the subscriber must verify if the specified
> +  publications are present in the publisher. By default, the 
> subscriber
>

Slightly reworded and modified.

> 3) I think we can make below common code into a single function with
> flags to differentiate processing for both, something like:
> StringInfoData *get_publist_str(List *publicaitons, bool use_quotes,
> bool is_fetch_table_list);
> check_publications:
> +/* Convert the publications which does not exist into a string. */
> +initStringInfo(&nonExistentPublications);
> +foreach(lc, publicationsCopy)
> +{
> and get_appended_publications_query:
>  foreach(lc, publications)
>
> With the new function that only prepares comma separated list of
> publications, you can get rid of get_appended_publications_query and
> just append the returned list to the query.
> fetch_table_list: get_publist_str(publications, true, true);
> check_publications: for select query preparation
> get_publist_str(publications, true, false); and for error string
> preparation get_publist_str(publications, false, false);
>
> And also let the new function get_publist_str allocate the string and
> just mention as a note in the function comment that the callers should
> pfree the returned string.
>

I felt the existing code looks better. If we had a common function, we
would need a lot of if conditions, as the two functions are not quite
the same: they operate on different data types and do the preparation
accordingly. fetch_table_list gets nspname & relname and converts them
into a RangeVar to add to the list, while the other function prepares
text and deletes the present entries from the list. So I did not fix
this. Thoughts?

> 4) We could do following,
> ereport(ERROR,
> (errcode(ERRCODE_TOO_MANY_ARGUMENTS),
>  errmsg_plural("publication %s does not exist in the 
> publisher",
>"publications %s do not exist in the publisher",
> list_length(publicationsCopy),
> nonExistentPublications.data)));
> instead of
> +ereport(ERROR,
> +(errmsg("publication(s) %s does not exist in the publisher",
> +nonExistentPublications.data)));
>
> if (list_member(cte->ctecolnames,
> makeString(cte->search_clause->search_seq_column)))
>

Modified.

> 5) I think it's better with
> + * Check the specified publication(s) is(are) present in the publisher.
> instead of
> + * Verify that the specified publication(s) exists in the publisher.
>

Modified.

> 6) Instead of such a longer variable name "nonExistentPublications"
> how about just "pubnames" and add a comment there saying "going to
> error out with the list of non-existent publications" with that the
> variable and that part of code's context is clear.
>

Modified.

> 7) You can just do
> publications = list_copy(publications);
> instead of using another variable publicationsCopy
> publicationsCopy = list_copy(publications);

publications is an input list to this function, and I did not want this
function to change that list. I felt the existing code is fine. Thoughts?

> 8) If you have done StringInfoData *cmd = makeStringInfo();, then no
> need of initStringInfo(cmd);

Modified.

Attached v4 patch has the fixes for the comments.
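For reference, a sketch of the option's intended use per the patch (the
connection string and publication names are placeholders):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=remote_host dbname=postgres'
    PUBLICATION pub1, pub2
    WITH (validate_publication = true);
-- errors out if pub1 or pub2 does not exist on the publisher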

Regards,
Vignesh
From ffce0a164730bf872149d04d

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Tue, 13 Apr 2021 at 11:06, Andres Freund  wrote:

> > Each backend can have different tranche IDs (right?)
>
> No, they have to be the same in each. Note how the tranche ID is part of
> struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock
> etc.

Ah. I misunderstood that at some point.

That makes it potentially more sensible to skip reporting tranche
names. Not great, because it's much less convenient to work with trace
data full of internal ordinals that must be re-mapped in
post-processing. But I'm generally OK with deferring runtime costs to
tooling rather than the db itself so long as doing so is moderately
practical.

In this case, I think we could likely get away with removing the
tranche names from the tracepoints if we instead emit a trace event on
each dynamic tranche registration that reports the tranche id -> name
mapping. It still sucks for tools, since they have to scrape up the
static tranche registrations from somewhere else, but ... it'd be
tolerable.
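
To sketch what I mean (the probe name here is hypothetical, not an
existing tracepoint; the function shape follows the current
LWLockRegisterTranche()):

/*
 * Hypothetical probe, e.g. in src/backend/utils/probes.d:
 *     probe lwlock__tranche__registered(int, const char *);
 * emitted once per dynamic registration so tools can build an
 * id -> name map without scraping backend memory.
 */
void
LWLockRegisterTranche(int tranche_id, const char *tranche_name)
{
    /* ... existing code that records tranche_name for tranche_id ... */

    TRACE_POSTGRESQL_LWLOCK_TRANCHE_REGISTERED(tranche_id, tranche_name);
}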

> > The kernel is packed with extremely useful trace events, and for very
> > good reasons. Some on very hot paths.
>
> IIRC those aren't really comparable - the kernel actually does modify
> the executable code to replace the tracepoints with nops.

Same with userspace static trace markers (USDTs).

A followup mail will contain a testcase and samples to demonstrate this.

> Although I still don't really buy that static tracepoints are the best
> way to measure this kind of thing, given the delay introducing them and
> the cost of having them around. I think I pointed out
> https://postgr.es/m/20200813004233.hdsdfvufqrbdwzgr%40alap3.anarazel.de
> before.

Yeah. Semaphores are something hot enough that I'd hesitate to touch them.

> > LWLock lock-ordering deadlocks
>
> This seems unrelated to tracepoints to me.

If I can observe which locks are acquired in which order by each proc,
I can then detect excessive waits and report the stack of held locks
of both procs and their order of acquisition.

Since LWLocks shmem state doesn't AFAICS track any information on the
lock holder(s) I don't see a way to do this in-process.

It's not vital, it's just one of the use cases I have in mind. I
suspect that any case where such deadlocks are possible represents a
misuse of LWLocks anyway.

> > and there's no way to know what a given non-built-in tranche ID means
> > for any given backend without accessing backend-specific in-memory
> > state. Including for non-user-accessible backends like bgworkers and
> > auxprocs, where it's not possible to just query the state from a view
> > directly.
>
> The only per-backend part is that some backends might not know the
> tranche name for dynamically registered tranches where the
> LWLockRegisterTranche() hasn't been executed in a backend. Which should
> pretty much never be an aux process or such. And even for bgworkers it
> seems like a pretty rare thing, because those need to be started by
> something...
>
> It might be worth proposing a shared hashtable with tranche names and
> just reserving enough space for ~hundred entries...

Yeah, that'd probably work and be cheap enough not to really matter.
Might even save us a chunk of memory by not turning CoW pages into
private mappings for each backend during registration.
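
As a rough sketch, and purely as an assumption about shape (all names
and sizes here are made up, not existing code):

typedef struct TrancheNameEntry
{
    int     tranche_id;             /* hash key */
    char    name[NAMEDATALEN];      /* tranche name */
} TrancheNameEntry;

static HTAB *TrancheNamesHash = NULL;

void
TrancheNamesShmemInit(void)
{
    HASHCTL     info;

    info.keysize = sizeof(int);
    info.entrysize = sizeof(TrancheNameEntry);

    /* ~hundred entries as you suggest; a fixed-size table is fine here */
    TrancheNamesHash = ShmemInitHash("LWLock tranche names",
                                     128, 128,
                                     &info,
                                     HASH_ELEM | HASH_BLOBS);
}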

> > And you can always build without `--enable-dtrace` and ... just not care.
>
> Practically speaking, distributions enable it, which then incurs the
> cost for everyone.

Yep. That's part of why I was so surprised to notice the
GetLWTrancheName() function call in LWLock tracepoints. Nearly
anywhere else it wouldn't matter at all, but LWLocks are hot enough
that it just might matter for the no-wait fastpath.




CTE push down

2021-04-13 Thread Alexander Pyhalov

Hi.

Currently PostgreSQL supports CTE push down for SELECT statements, but 
it is implemented by turning each CTE reference into a subquery.


When a CTE is referenced multiple times, we have a choice: materialize 
the CTE (and disable distribution of quals into the CTE query) or 
inline it (and so run the CTE query multiple times, which can be 
inefficient, for example, when the CTE references foreign tables).


I was looking into whether it is possible to collect the quals 
referencing a CTE, combine them into an OR qual, and add them to the 
CTE query.


So far I am considering the following changes.

1) Modify SS_process_ctes() to add a list of RestrictInfo* to 
PlannerInfo - one NULL RestrictInfo pointer per CTE (let's call this 
list cte_restrictinfos for now).
2) In distribute_restrictinfo_to_rels(), when we get a rel of RTE_CTE 
kind and are sure that we can safely push down the restrictinfo, 
preserve the restrictinfo in cte_restrictinfos, converting multiple 
restrictions into "OR" RestrictInfos.
3) At the end of subquery_planner() (after inheritance_planner() or 
grouping_planner()) we can check whether cte_restrictinfos contains 
any non-null RestrictInfo pointers and recreate the plans for the 
corresponding CTEs, distributing the quals to relations inside the CTE 
queries.


For now I'm not sure how to handle vars mapping when we push 
restrictinfos up to the level of the CTE root or down into the CTE 
plan, but properly mapping vars seems to be doable.
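
To make step 2 above concrete, here is a minimal sketch of the
OR-combination (cte_restrictinfos and this helper are proposed
additions, not existing code; for simplicity the sketch stores bare
clauses rather than RestrictInfos):

static void
cte_remember_qual(PlannerInfo *root, int cte_index, Expr *clause)
{
    ListCell   *lc = list_nth_cell(root->cte_restrictinfos, cte_index);
    Expr       *prev = (Expr *) lfirst(lc);

    if (prev == NULL)
        lfirst(lc) = clause;    /* first pushable qual for this CTE */
    else
        lfirst(lc) = make_orclause(list_make2(prev, clause));
}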


Is there something else I am missing?
Does somebody work on an alternative solution, or does anyone see 
issues with this approach?


--
Best regards,
Alexander Pyhalov,
Postgres Professional




potential deadlock in parallel hashjoin grow-buckets-barrier and blocking nodes?

2021-04-13 Thread Luc Vlaming

Hi,

Whilst trying to debug a deadlock in some TPC-DS query I noticed 
something that could cause problems in the hashjoin implementation and 
potentially cause deadlocks (if my analysis is right).


Whilst building the inner hash table, the grow barriers are attached 
the whole time (the PHJ_BUILD_HASHING_INNER phase).
Usually this is not a problem; however, if one of the nodes blocks 
somewhere further down in the plan whilst trying to fill the inner hash 
table, whilst the others are trying to e.g. extend the number of buckets 
using ExecParallelHashIncreaseNumBuckets, they would all wait until the 
blocked process comes back to the hashjoin node and also joins the effort.
Wouldn't this create potential deadlock situations? Or why would a worker 
that is hashing the inner be required to come back and join the effort 
in growing the hash buckets?


With very skewed workloads (one node providing all data) I was at least 
able to have e.g. 3 out of 4 workers waiting in 
ExecParallelHashIncreaseNumBuckets, whilst one was in 
execprocnode(outernode). I tried to detach and reattach the barrier but 
this proved to be a bad idea :)


Regards,
Luc




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Tue, 13 Apr 2021 at 21:05, Craig Ringer
 wrote:
> On Tue, 13 Apr 2021 at 11:06, Andres Freund  wrote:
> > IIRC those aren't really comparable - the kernel actually does modify
> > the executable code to replace the tracepoints with nops.
>
> Same with userspace static trace markers (USDTs).
>
> A followup mail will contain a testcase and samples to demonstrate this.

Demo follows, with source attached too. gcc 10.2 compiling with -O2,
using the dtrace script and <sys/sdt.h> from systemtap 4.4.

Trivial empty function definition:

__attribute__((noinline))
void
no_args(void)
{
SDT_NOOP_NO_ARGS();
}

Disassembly when SDT_NOOP_NO_ARGS is defined as

#define SDT_NOOP_NO_ARGS()

is:

<no_args>:
retq

When built with a probes.d definition processed by the dtrace script
instead, the disassembly becomes:

<no_args>:
nop
retq

So ... yup, it's a nop.

Now, if we introduce semaphores that changes.

__attribute__((noinline))
void
no_args(void)
{
if (SDT_NOOP_NO_ARGS_ENABLED())
SDT_NOOP_NO_ARGS();
}

disassembles to:

<no_args>:
cmpw   $0x0,0x2ec4(%rip)# 
jne
retq
nopl   0x0(%rax,%rax,1)
nop
retq

so the semaphore test is actually quite harmful and wasteful in this
case. That's not surprising since this SDT is a simple marker point.
But what if we supply arguments to it? It turns out that the
disassembly is the same if args are passed, whether locals or globals,
including globals assigned based on program input that can't be
determined at compile time. Still just a nop.

If I pass a function call as an argument expression to a probe, e.g.

__attribute__((noinline)) static int
compute_probe_argument(void)
{
return 100;
}

void
with_computed_arg(void)
{
SDT_NOOP_WITH_COMPUTED_ARG(compute_probe_argument());
}

then the disassembly with SDTs is:

<with_computed_arg>:
callq  <compute_probe_argument>
nop
retq

so the function call isn't elided even if it's unused. That's somewhat
expected. The same will be true if the arguments to a probe require
pointer chasing or non-trivial marshalling.

If a semaphore guard is added this becomes:

<with_computed_arg>:
cmpw   $0x0,0x2e2e(%rip)# 
jne
retq
nopl   0x0(%rax,%rax,1)
callq  <compute_probe_argument>
nop
retq

so now the call to compute_probe_argument() is skipped unless the
probe is enabled, but the function is longer and requires a test and
jump.

If I dummy up a function that does some pointer chasing, without
semaphores I get

<with_pointer_chasing>:
mov(%rdi),%rax
mov(%rax),%rax
mov(%rax),%rax
nop
retq

so the arguments are marshalled then ignored.

with semaphores I get:

<with_pointer_chasing>:
cmpw   $0x0,0x2d90(%rip)# 
jne
retq
nopl   0x0(%rax,%rax,1)
mov(%rdi),%rax
mov(%rax),%rax
mov(%rax),%rax
nop
retq

so again the probe's argument marshalling is inline in the function
body, but at the end, and skipped over.

Findings:

* A probe without arguments or with simple arguments is just a 'nop' instruction
* Probes that require function calls, pointer chasing, other
expression evaluation etc may impose a fixed cost to collect up
arguments even if the probe is disabled.
* SDT semaphores can avoid that cost but add a branch, so should
probably be avoided unless preparing probe arguments is likely to be
expensive.
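
In PostgreSQL terms, the guarded form of a hot probe would look like
this (sketch only; the _ENABLED() macro is only generated when the
probes are built with semaphore support):

    /* Skip the tranche-name lookup entirely unless the probe is attached */
    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(GetLWTrancheName(lock->tranche),
                                        mode);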

Hideous but effective demo code attached.
provider sdt_noop {
	probe no_args();
	probe with_args(int arg1, int arg2, int arg3);
	probe with_global_arg(int arg1);
	probe with_volatile_arg(int arg1);
	probe with_many_args(int arg1, int arg2, int arg3, int64_t arg4, int64_t arg5, int64_t arg6, int64_t arg7, int64_t arg8);
	probe with_computed_arg(int arg1);
	probe with_pointer_chasing(int arg1);
};


#include 

#ifdef USE_SDT
#include "sdt_noop_probes_enabled.h"
#else
#include "sdt_noop_probes_disabled.h"
#endif

void no_args(void);
int with_args(void);
int with_global_arg(void);
int with_volatile_arg(void);
void with_many_args(void);
void with_computed_arg(void);
void with_pointer_chasing(int arg);

__attribute__((noinline))
void
no_args(void)
{
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_NO_ARGS_ENABLED())
#endif
		SDT_NOOP_NO_ARGS();
}

__attribute__((noinline))
int
with_args(void)
{
	int arg1 = 0;
	int arg2 = 1;
	int arg3 = 2;
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_WITH_ARGS_ENABLED())
#endif
		SDT_NOOP_WITH_ARGS(arg1, arg2, arg3);

	return arg1 + arg2 + arg3;
}

int some_global;

__attribute__((noinline))
int
with_global_arg(void)
{
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_WITH_GLOBAL_ARG_ENABLED())
#endif
		SDT_NOOP_WITH_GLOBAL_ARG(some_global);

	return some_global;
}

__attribute__((noinline))
int
with_volatile_arg(void)
{
	volatile int arg1;
	arg1 = 42;
#ifdef USE_SDT_SEMAPHORES
	if (SDT_NOOP_WITH_VOLATILE_ARG_ENABLED())
#endif
		SDT_NOOP_WITH_VOLATILE_ARG(arg1);

	return arg1;
}

__att

RE: Truncate in synchronous logical replication failed

2021-04-13 Thread osumi.takami...@fujitsu.com
On Monday, April 12, 2021 3:58 PM Amit Kapila  wrote:
> On Mon, Apr 12, 2021 at 10:03 AM osumi.takami...@fujitsu.com
>  wrote:
> > but if we take a measure to fix the doc, we have to be careful for the
> > description, because when we remove the primary keys of 'test' tables on the
> scenario in [1], we don't have this issue.
> > It means TRUNCATE in synchronous logical replication is not always
> blocked.
> >
> 
> The problem happens only when we try to fetch IDENTITY_KEY attributes
> because pgoutput uses RelationGetIndexAttrBitmap() to get that information
> which locks the required indexes. Now, because TRUNCATE has already
> acquired an exclusive lock on the index, it seems to create a sort of deadlock
> where the actual Truncate operation waits for logical replication of 
> operation to
> complete and logical replication waits for actual Truncate operation to 
> finish.
> 
> Do we really need to use RelationGetIndexAttrBitmap() to build IDENTITY_KEY
> attributes? During decoding, we don't even lock the main relation, we just 
> scan
> the system table and build that information using a historic snapshot. Can't 
> we
> do something similar here?
I think we can build the IDENTITY_KEY attributes with NoLock
instead of calling RelationGetIndexAttrBitmap().

When we trace back the callers of logicalrep_write_attrs(),
doing the equivalent of RelationGetIndexAttrBitmap()
for INDEX_ATTR_BITMAP_IDENTITY_KEY impacts only pgoutput_truncate.

OTOH, I can't find code similar to RelationGetIndexAttrBitmap()
in the pgoutput_* functions or in relcache.c.
Therefore, I'd like to discuss how to address the hang.

My first idea is to extract the parts of RelationGetIndexAttrBitmap()
needed for INDEX_ATTR_BITMAP_IDENTITY_KEY and implement them
either in logicalrep_write_attrs() or as a new function.
RelationGetIndexAttrBitmap() has a 'restart' label for a goto statement
that ensures it returns up-to-date attribute bitmaps, so
I prefer having a new function if we choose this direction.
Having that goto in logicalrep_write_attrs() would make it a little bit
messy, I felt.

The other direction might be to extend RelationGetIndexAttrBitmap()'s
definition to accept a lockmode, so that logicalrep_write_attrs() can
pass NoLock.
But this change impacts several other callers, so it is not as good as
the first direction above, I think.

If someone has any better idea, please let me know.

Best Regards,
Takamichi Osumi



Re: vacuum freeze - possible improvements

2021-04-13 Thread Virender Singla
Exactly my point: we want to scan only 500GB of data instead of 1TB. That
can be handy for vacuum freeze at a dangerous stage (approaching the 2B
limit).

"Maybe we can track the oldest xid per page in a map like visiblity map
or integrate it with visibility map. We need to freeze only pages that
are all-visible and whose oldest xid is older than the cut-off xid. I
think we need to track both xid and multi xid."

Yes, I thought of that (keeping track of the oldest xid per page instead
of per tuple); the only thing here is that every time there is some
modification on a page, the oldest xid needs to be recalculated for that
page. Still, that makes sense with a kind of BRIN-type structure to keep
the xid per page. With a btree-like structure, new transactions/tuples
would land on the right side (as those are the newest transactions until
wraparound at 2B), and the left-side leaf blocks could be removed with
every vacuum freeze.
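
As a strawman, the per-page entry for such a map could be as small as
this (purely hypothetical; nothing like it exists today):

typedef struct PageFreezeMapEntry
{
    TransactionId   oldest_xid;     /* oldest unfrozen xmin/xmax on the page */
    MultiXactId     oldest_mxid;    /* oldest unfrozen multixact on the page */
} PageFreezeMapEntry;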




On Tue, Apr 13, 2021 at 6:02 PM Masahiko Sawada 
wrote:

> On Tue, Apr 13, 2021 at 1:51 PM Virender Singla 
> wrote:
> >
> > Thanks Masahiko for the response.
> >
> > "What is
> > the use case where users want to freeze fewer transactions, meaning
> > invoking anti-wraparound frequently?"
> >
> > My overall focus here is anti wraparound vacuum on huge tables in
> emergency situations (where we reached very close to  2B transactions or
> already in outage window). In this situation we want to recover ASAP
> instead of having many hours of outage.The Purpose of increasing
> "vacuum_freeze_min_age" to high value is that anti wraparound vacuum will
> have to do less work because we are asking less transactions/tuples to
> freeze (Of Course subsequent vacuum has to do the remaining work).
>
> I think I understood your proposal. For example, if we insert 500GB
> tuples during the first 1 billion transactions and then insert more
> 500GB tuples into another 500GB blocks during the next 1 billion
> transactions, vacuum freeze scans 1TB whereas we scan only 500GB that
> are modified by the first insertions if we’re able to freeze directly
> tuples that are older than the cut-off. Is that right?
>
> >
> > "So the vacuum freeze will still have to
> > process tuples that are inserted/modified during consuming 1 billion
> > transactions. It seems to me that it’s not fewer transactions."
> >
> > Yes another thing here is anti wraparound vacuum also cleans dead tuples
> > but I am not sure what we can do to avoid that.
> > There can be a vacuum that only freezes the tuples?
>
> I think it's a good idea to skip all work except for freezing tuples
> in emergency cases. Thanks to vacuum_failsafe_age we can avoid index
> vacuuming, index cleanup, and heap vacuuming.
>
> >
> > Thanks for sharing PG14 improvements, those are nice to have. But still
> the anti wraparound vacuum will have to scan all the pages (from visibility
> map) even if we are freezing fewer transactions because currently there is
> no way to know what block/tuple contains which transaction id.
>
> Yes, that feature is to speed up vacuum by dynamically disabling both
> cost-based delay and some cleanup work whereas your idea is to do that
> by speeding up heap scan.
>
> > If there is a way then it would be easier to directly freeze those
> tuples quickly and advance the relfrozenxid for the table.
>
> Maybe we can track the oldest xid per page in a map like visibility map
> or integrate it with visibility map. We need to freeze only pages that
> are all-visible and whose oldest xid is older than the cut-off xid. I
> think we need to track both xid and multi xid.
>
> Regards,
>
>
> --
> Masahiko Sawada
> EDB:  https://www.enterprisedb.com/
>


Re: Have I found an interval arithmetic bug?

2021-04-13 Thread Zhihong Yu
On Mon, Apr 12, 2021 at 4:22 PM Bruce Momjian  wrote:

> On Mon, Apr 12, 2021 at 03:09:48PM -0700, Bryn Llewellyn wrote:
> > I showed you all this example a long time ago:
> >
> > select (
> > '
> >   3.853467 years
> > '::interval
> >   )::text as i;
> >
> > This behavior is the same in the env. of Bruce’s patch as in unpatched
> PG 13.2. This is the result.
> >
> > 3 years 10 mons
> >
> > Notice that "3.853467 years" is "3 years" plus "10.241604 months". This
> explains the "10 mons" in the result. But the 0.241604 months remainder did
> not spill down into days.
> >
> > Can anybody defend this quirk? An extension of this example with a real
> number of month in the user input is correspondingly yet more quirky. The
> rules can be written down. But they’re too tortuous to allow an ordinary
> mortal confidently to design code that relies on them.
> >
> > (I was unable to find any rule statement that lets the user predict this
> in the doc. But maybe that’s because of my feeble searching skills.)
> >
> > If there is no defense (and I cannot imagine one) might Bruce’s patch
> normalize this too to follow this rule:
> >
> > — convert 'y years m months' to the real number y*12 + m.
> >
> > — record truc( y*12 + m) in the "months" field of the internal
> representation
> >
> > — flow the remainder down to days (but no further)
> >
> > After all, you've bitten the bullet now and changed the behavior. This
> means that the semantics of some extant applications will change. So... in
> for a penny, in for a pound?
>
> The docs now say:
>
>  Field values can have fractional parts;  for example, '1.5
>  weeks' or '01:02:03.45'.  The fractional
> -->  parts are used to compute appropriate values for the next lower-order
>  internal fields (months, days, seconds).
>
> meaning fractional years flows to the next lower internal unit, months,
> and no further.  Fractional months would flow to days.  The idea of not
> flowing past the next lower-order internal field is that the
> approximations between units are not precise enough to flow accurately.
>
> With my patch, the output is now:
>
> SELECT INTERVAL '3 years 10.241604 months';
> interval
> 
>  3 years 10 mons 7 days
>
> It used to flow to seconds.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   If only the physical world exists, free will is an illusion.
>
>
Based on the results of more samples, I restore my +1 for Bruce's latest patch.

Cheers


Re: [PATCH] force_parallel_mode and GUC categories

2021-04-13 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Apr 12, 2021 at 01:40:52AM -0400, Tom Lane wrote:
>> Perhaps ... what did you have in mind?

> The first sentence of the page now says that:
> "The following parameters are intended for work on the PostgreSQL
> source code, and in some cases to assist with recovery of severely
> damaged databases."

> That does not stick with force_parallel_mode IMO.  Maybe:
> "The following parameters are intended for development work related to
> PostgreSQL.  Some of them work on the PostgreSQL source code, some of
> them can be used to control the run-time behavior of the server, and
> in some cases they can be used to assist with the recovery of severely
> damaged databases."

I think that's overly wordy.  Maybe

The following parameters are intended for developer testing, and
should never be enabled for production work.  However, some of
them can be used to assist with the recovery of severely
damaged databases.

regards, tom lane




Re: TRUNCATE on foreign table

2021-04-13 Thread Kohei KaiGai
On Tue, 13 Apr 2021 at 21:03, Bharath Rupireddy  wrote:
>
> On Tue, Apr 13, 2021 at 2:37 PM Kohei KaiGai  wrote:
> > Here are two points to discuss.
> >
> > Regarding the FDW APIs, yes, nobody can deny that someone may want to
> > implement their own FDW module that adds special handling when its
> > foreign table is specified with an ONLY clause, even if we usually
> > ignore it.
> >
> >
> > On the other hand, when we consider that a foreign table is an
> > abstraction of an external data source, at least the current
> > postgres_fdw behavior is not consistent.
> >
> > When a foreign table by postgres_fdw that maps a remote parent table
> > has a local child table,
> >
> > This command shows all the rows from both of local and remote.
> >
> > postgres=# select * from f_table ;
> >  id |  v
> > +-
> >   1 | remote table t_parent id=1
> >   2 | remote table t_parent id=2
> >   3 | remote table t_parent id=3
> >  10 | remote table t_child1 id=10
> >  11 | remote table t_child1 id=11
> >  12 | remote table t_child1 id=12
> >  20 | remote table t_child2 id=20
> >  21 | remote table t_child2 id=21
> >  22 | remote table t_child2 id=22
> >  50 | it is l_child id=50
> >  51 | it is l_child id=51
> >  52 | it is l_child id=52
> >  53 | it is l_child id=53
> > (13 rows)
> >
> > If f_table is specified with "ONLY", it picks up only the parent table
> > (f_table),
> > however, ONLY-clause is not push down to the remote side.
> >
> > postgres=# select * from only f_table ;
> >  id |  v
> > +-
> >   1 | remote table t_parent id=1
> >   2 | remote table t_parent id=2
> >   3 | remote table t_parent id=3
> >  10 | remote table t_child1 id=10
> >  11 | remote table t_child1 id=11
> >  12 | remote table t_child1 id=12
> >  20 | remote table t_child2 id=20
> >  21 | remote table t_child2 id=21
> >  22 | remote table t_child2 id=22
> > (9 rows)
> >
> > On the other hand, TRUNCATE ONLY f_table works as follows...
> >
> > postgres=# truncate only f_table;
> > TRUNCATE TABLE
> > postgres=# select * from f_table ;
> >  id |  v
> > +-
> >  10 | remote table t_child1 id=10
> >  11 | remote table t_child1 id=11
> >  12 | remote table t_child1 id=12
> >  20 | remote table t_child2 id=20
> >  21 | remote table t_child2 id=21
> >  22 | remote table t_child2 id=22
> >  50 | it is l_child id=50
> >  51 | it is l_child id=51
> >  52 | it is l_child id=52
> >  53 | it is l_child id=53
> > (10 rows)
> >
> > It eliminates the rows only from the remote parent table although it
> > is a part of the foreign table.
> >
> > My expectation is that the above command would show only the rows from
> > the local child table (id=50...53).
>
> Yeah, ONLY clause is not pushed to the remote server in case of SELECT
> commands. This is also true for DELETE and UPDATE commands on foreign
> tables. I'm not sure if it wasn't thought necessary or if there is an
> issue to push it or I may be missing something here. I think we can
> start a separate thread to see other hackers' opinions on this.
>
> I'm not sure whether all the clauses that are possible for
> SELECT/UPDATE/DELETE/INSERT with local tables are pushed to the remote
> server by postgres_fdw.
>
> Well, now foreign TRUNCATE pushes the ONLY clause to the remote server
> which is inconsistent when compared to SELECT/UPDATE/DELETE commands.
> If we were to keep it consistent across all foreign commands that
> ONLY clause is not pushed to remote server, then we can restrict for
> TRUNCATE too and even if "TRUNCATE ONLY foreign_tbl" is specified,
> just pass "TRUNCATE foreign_tbl" to remote server. Having said that, I
> don't see any real problem in pushing the ONLY clause, at least in
> case of TRUNCATE.
>
If the ONLY clause were pushed down to the remote query by postgres_fdw,
what would the foreign table represent in the local system?

In my understanding, a local foreign table by postgres_fdw is a
representation of the entire tree of the remote parent table and its
children.
Thus, we have assumed that a DML command fetches rows from the remote
parent table without an ONLY clause, once PostgreSQL has picked the
foreign table as a scan target.
I don't think we need to adjust the definition of the role of a
foreign table, even if it represents non-RDBMS data sources.

If a postgres_fdw foreign table supported a special table option
indicating that an ONLY clause should be added when the remote query
references remote tables, it would be suitable to add the ONLY clause
to the remote TRUNCATE command as well, not only to
SELECT/INSERT/UPDATE/DELETE.
In other words, if a foreign table represents only the remote parent
table, it is suitable to truncate only the remote parent table.

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: Identify missing publications from publisher while create/alter subscription.

2021-04-13 Thread Bharath Rupireddy
On Tue, Apr 13, 2021 at 6:22 PM vignesh C  wrote:
> > 2) How about
> > +  Specifies whether the subscriber must verify the
> > publications that are
> > +   being subscribed to are present in the publisher. By default,
> > the subscriber
> > instead of
> > +  Specifies whether the subscriber must verify if the specified
> > +  publications are present in the publisher. By default, the 
> > subscriber
> >
>
> Slightly reworded and modified.

+ 
+  When true, the command will try to verify if the specified
+  publications that are subscribed is present in the publisher.
+  The default is false.
+ 

"publications that are subscribed" is not right as the subscriber is
not yet subscribed, it is "trying to subscribing", and it's not that
the command "will try to verify", it actually verifies. So you can
modify as follows:

+ 
+  When true, the command verifies if all the specified
publications that are being subscribed to are present in the publisher
and throws an error if any of the publications doesn't exist. The
default is false.
+ 

> > 3) I think we can make below common code into a single function with
> > flags to differentiate processing for both, something like:
> > StringInfoData *get_publist_str(List *publications, bool use_quotes,
> > bool is_fetch_table_list);
> > check_publications:
> > +/* Convert the publications which does not exist into a string. */
> > +initStringInfo(&nonExistentPublications);
> > +foreach(lc, publicationsCopy)
> > +{
> > and get_appended_publications_query:
> >  foreach(lc, publications)
> >
> > With the new function that only prepares comma separated list of
> > publications, you can get rid of get_appended_publications_query and
> > just append the returned list to the query.
> > fetch_table_list: get_publist_str(publications, true, true);
> > check_publications: for select query preparation
> > get_publist_str(publications, true, false); and for error string
> > preparation get_publist_str(publications, false, false);
> >
> > And also let the new function get_publist_str allocate the string and
> > just mention as a note in the function comment that the callers should
> > pfree the returned string.
> >
>
> I felt the existing code looks better. If we have a common function,
> we would need a lot of if conditions, because the two functions are
> not the same: they operate on different data types and do their
> preparation differently. fetch_table_list gets nspname & relname,
> converts them into a RangeVar, and adds it to a list, while the other
> function prepares a string and deletes the matching entries from the
> list. So I did not fix this. Thoughts?

I was actually thinking we could move the following duplicate code
into a function:
foreach(lc, publicationsCopy)
{
char   *pubname = strVal(lfirst(lc));

if (first)
first = false;
else
appendStringInfoString(&pubnames, ", ");
appendStringInfoString(&pubnames, "\"");
appendStringInfoString(&pubnames, pubname);
appendStringInfoString(&pubnames, "\"");
}
and
foreach(lc, publications)
{
char   *pubname = strVal(lfirst(lc));

if (first)
first = false;
else
appendStringInfoString(cmd, ", ");

appendStringInfoString(cmd, quote_literal_cstr(pubname));
}
that function can be:
static void
get_publications_str(List *publications, StringInfo dest, bool quote_literal)
{
    ListCell   *lc;
    bool        first = true;

    Assert(list_length(publications) > 0);

    foreach(lc, publications)
    {
        char       *pubname = strVal(lfirst(lc));

        if (first)
            first = false;
        else
            appendStringInfoString(dest, ", ");

        if (quote_literal)
            appendStringInfoString(dest, quote_literal_cstr(pubname));
        else
        {
            appendStringInfoChar(dest, '"');
            appendStringInfoString(dest, pubname);
            appendStringInfoChar(dest, '"');
        }
    }
}

This way, we can get rid of get_appended_publications_query and use
the above function to return the appended list of publications. We
just need to pass quote_literal as true while preparing the publication
list string for the query, and append it to the query outside the
function. While preparing the publication list string for the error
message, pass quote_literal as false. Thoughts?
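
The call sites would then reduce to something like this (sketch only,
query text abbreviated):

    StringInfoData cmd;
    StringInfoData pubnames;

    /* in the query path: quoted-literal form */
    initStringInfo(&cmd);
    appendStringInfoString(&cmd,
                           "SELECT pubname FROM pg_catalog.pg_publication"
                           " WHERE pubname IN (");
    get_publications_str(publications, &cmd, true);
    appendStringInfoChar(&cmd, ')');

    /* in the error path: double-quoted form for the message */
    initStringInfo(&pubnames);
    get_publications_str(publications, &pubnames, false);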

> > 7) You can just do
> > publications = list_copy(publications);
> > instead of using another variable publicationsCopy
> > publicationsCopy = list_copy(publications);
>
> publications is an input list to this function, I did not want this
> function to change this list. I felt existing is fine. Thoughts?

Okay.

Typo - it's not "subcription" +# Create subcription for a publication
which does not exist.

I think we can remove e

More sepgsql weirdness

2021-04-13 Thread Dave Page
On a system with selinux and sepgsql configured, search path resolution
appears to fail if sepgsql is in enforcing mode, but selinux is in
permissive mode (which, as I understand it, should cause sepgsql to behave
as if it's in permissive mode anyway - and does for other operations).
Regardless of whether my understanding of the interaction of the two
permissive modes is correct, I don't believe the following should happen:

mls=# SELECT current_user;
 current_user
--------------
 postgres
(1 row)

mls=# SHOW search_path;
   search_path
-----------------
 "$user", public
(1 row)

mls=# \dn+ public
                          List of schemas
  Name  |  Owner   |  Access privileges   |      Description
--------+----------+----------------------+------------------------
 public | postgres | postgres=UC/postgres+| standard public schema
        |          | =UC/postgres         |
(1 row)

mls=# CREATE TABLE tb_users(uid int primary key, name text, mail text,
address text, salt text, phash text);
ERROR:  no schema has been selected to create in
LINE 1: CREATE TABLE tb_users(uid int primary key, name text, mail t...
                     ^
mls=# CREATE TABLE public.tb_users(uid int primary key, name text, mail
text, address text, salt text, phash text);
CREATE TABLE
mls=# drop table tb_users;
ERROR:  table "tb_users" does not exist
mls=# drop table public.tb_users;
DROP TABLE

This is on head, pulled yesterday.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Truncate in synchronous logical replication failed

2021-04-13 Thread Petr Jelinek


> On 12 Apr 2021, at 08:58, Amit Kapila  wrote:
> 
> On Mon, Apr 12, 2021 at 10:03 AM osumi.takami...@fujitsu.com
>  wrote:
>> 
>>> I checked the PG-DOC, found it says that “Replication of TRUNCATE
>>> commands is supported”[1], so maybe TRUNCATE is not supported in
>>> synchronous logical replication?
>>> 
>>> If my understanding is right, maybe PG-DOC can be modified like this. Any
>>> thought?
>>> Replication of TRUNCATE commands is supported
>>> ->
>>> Replication of TRUNCATE commands is supported in asynchronous mode
>> I'm not sure if this becomes the final solution,
>> 
> 
> I think unless the solution is not possible or extremely complicated
> going via this route doesn't seem advisable.
> 
>> but if we take a measure to fix the doc, we have to be careful for the 
>> description,
>> because when we remove the primary keys of 'test' tables on the scenario in 
>> [1], we don't have this issue.
>> It means TRUNCATE in synchronous logical replication is not always blocked.
>> 
> 
> The problem happens only when we try to fetch IDENTITY_KEY attributes
> because pgoutput uses RelationGetIndexAttrBitmap() to get that
> information which locks the required indexes. Now, because TRUNCATE
> has already acquired an exclusive lock on the index, it seems to
> create a sort of deadlock where the actual Truncate operation waits
> for logical replication of operation to complete and logical
> replication waits for actual Truncate operation to finish.
> 
> Do we really need to use RelationGetIndexAttrBitmap() to build
> IDENTITY_KEY attributes? During decoding, we don't even lock the main
> relation, we just scan the system table and build that information
> using a historic snapshot. Can't we do something similar here?
> 
> Adding Petr J. and Peter E. to know their views as this seems to be an
> old problem (since the decoding of Truncate operation is introduced).

We used RelationGetIndexAttrBitmap because it already existed, no other reason. 
I am not sure what exact locking we need but I would have guessed at least 
AccessShareLock would be needed.

--
Petr





Re: Old Postgresql version on i7-1165g7

2021-04-13 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Apr 09, 2021 at 04:28:25PM +0300, Yura Sokolov wrote:
>> Occasionally I found I'm not able to `make check` old Postgresql versions.

>> I've bisected between REL_11_0 and "Rename pg_rewind's copy_file_range()"
>> and
>> found 372728b0d49552641f0ea83d9d2e08817de038fa
>>> Replace our traditional initial-catalog-data format with a better
>>> design.
>> This is first commit where `make check` doesn't fail during initdb on my
>> machine.

> This doesn't make much sense or help much, since 372728b doesn't actually
> change the catalogs, or any .c file.

It could make sense if some part of the toolchain that was previously
used to generate postgres.bki doesn't work right on that machine.
Overall though I'd have thought that 372728b would increase not
decrease our toolchain footprint.  It also seems unlikely that a
recent Ubuntu release would contain toolchain bugs that we hadn't
already heard about.

> You used make clean too, right ?

Really, when bisecting, you need to use "make distclean" or even
"git clean -dfx" between steps, or you may get bogus results,
because our makefiles aren't that great about tracking dependencies,
especially when you move backwards in the history.

So perhaps a more plausible theory is that this bisection result
is wrong because you weren't careful enough.

regards, tom lane




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Tue, 13 Apr 2021 at 21:40, Craig Ringer
 wrote:

> Findings:
>
> * A probe without arguments or with simple arguments is just a 'nop' 
> instruction
> * Probes that require function calls, pointer chasing, other
> expression evaluation etc may impose a fixed cost to collect up
> arguments even if the probe is disabled.
> * SDT semaphores can avoid that cost but add a branch, so should
> probably be avoided unless preparing probe arguments is likely to be
> expensive.

Back to the topic directly at hand.

Attached a disassembly of what LWLockAcquire looks like now on my
current build of git master @ 5fe83adad9efd5e3929f0465b44e786dc23c7b55
. This is an --enable-debug --enable-cassert --enable-dtrace build
with -Og -ggdb3.

The three tracepoints in it are all of the form:

   movzwl 0x0(%r13),%edi
   call   0x801c49 <GetLWTrancheName>
   nop

so it's clear we're doing redundant calls to GetLWTrancheName(), as
expected. Not ideal.

Now if I patch it to add the _ENABLED() guards on all the tracepoints,
the probes look like this:

   0x00803176 <+200>:   cmpw   $0x0,0x462da8(%rip)        # 0xc65f26
   0x0080317e <+208>:   jne    0x80328b <LWLockAcquire+477>
    ... other interleaved code ...
   0x0080328b <+477>:   movzwl 0x0(%r13),%edi
   0x00803290 <+482>:   call   0x801c49 <GetLWTrancheName>
   0x00803295 <+487>:   nop
   0x00803296 <+488>:   jmp    0x803184 <LWLockAcquire+214>

so we avoid the GetLWTrancheName() call at the cost of a test and
possible branch, and a small expansion in total function size. Without
the semaphores, LWLockAcquire is 463 bytes. With them, it's 524 bytes,
which is nothing to sneeze at for code that doesn't do anything
99.999% of the time, but we avoid a bunch of GetLWTrancheName() calls.

If I instead replace T_NAME() with NULL for all tracepoints in
LWLockAcquire, the disassembly shows that the tracepoints now become a
simple

   0x00803176 <+200>:   nop

which is pretty hard to be concerned about.

So at the very least we should be calling GetLWTrancheName() once at
the start of the function if built with dtrace support and remembering
the value, instead of calling it for each tracepoint.
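
i.e. something like this (sketch; the rest of the function body is
elided):

    /* once, at function entry */
    const char *t_name = GetLWTrancheName(lock->tranche);

    /* then each tracepoint reuses it */
    TRACE_POSTGRESQL_LWLOCK_WAIT_START(t_name, mode);
    TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(t_name, mode);
    TRACE_POSTGRESQL_LWLOCK_ACQUIRE(t_name, mode);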

And omitting the tranche name looks like it might be sensible for the
LWLock code. In most places it won't matter, but LWLocks are hot
enough that it possibly might. A simple pg_regress run hits
LWLockAcquire 25 million times, so that's 75 million calls to
GetLWTrancheName().




Re: Extensions not dumped when --schema is used

2021-04-13 Thread Noah Misch
On Tue, Apr 13, 2021 at 02:43:11PM +0900, Michael Paquier wrote:
> On Sun, Apr 04, 2021 at 03:08:02PM -0700, Noah Misch wrote:
> > I noticed the patch's behavior for relations that are members of non-dumped
> > extensions and are also registered using pg_extension_config_dump().  It
> > depends on the schema:
> > 
> > - If extschema='public', "pg_dump -e plpgsql" makes no mention of the
> >   relations.
> 
> This one is expected to me.  The caller of pg_dump is not specifying
> the extension that should be dumped, hence it looks logical to me not
> to dump the tables marked with pg_extension_config_dump() as part of
> an extension that is not listed.

Agreed.

> > - If extschema='public', "pg_dump -e plpgsql --schema=public" includes
> >   commands to dump the relation data.  This surprised me.  (The
> >   --schema=public argument causes selectDumpableNamespace() to set
> >   nsinfo->dobj.dump=DUMP_COMPONENT_ALL instead of DUMP_COMPONENT_ACL.)
> 
> This one would be expected to me.  Per the discussion upthread, we
> want --schema and --extension to be two separate and exclusive
> switches.  So, once the caller specifies --schema we should dump the
> contents of the schema, even if its extension is not listed with
> --extension.

I may disagree with this later, but I'm setting it aside for the moment.

> Anyway, the behavior to select if a schema can be dumped
> or not is not really related to this new code, right?  And "public" is
> a mixed beast, being treated as a system object and a user object by
> selectDumpableNamespace().

Correct.

> > - If extschema is not any sort of built-in schema, "pg_dump -e plpgsql"
> >   includes commands to dump the relation data.  This surprised me.
> 
> Hmm.  But you are right that this one is inconsistent with the first
> case where the extension is not listed.  I would have said that as the
> extension is not directly specified and that the schema is not passed
> down either then we should not dump it at all, but this combination
> actually does so.  Maybe we should add extra logic into
> selectDumpableNamespace(), close to the end of it, to decide if,
> depending on the contents of the extensions to include, we should dump
> its associated schema or not?  Another solution would be to make use
> of schema_include_oids to filter out the schemas we don't want, but
> that would mean that --extension gets priority over --schema or
> --table but we did not want that per the original discussion.

No, neither of those solutions apply.  "pg_dump -e plpgsql" selects all
schemas.  That is consistent with its documentation; plain "pg_dump" has long
selected all schemas, and the documentation for "-e" does not claim that "-e"
changes the selection of non-extension objects.  We're not going to solve the
problem by making selectDumpableNamespace() select some additional aspect of
schema foo, because it's already selecting every available aspect.  Like you
say, we're also not going to solve the problem by removing some existing
aspect of schema foo from selection, because that would remove dump material
unrelated to any extension.

This isn't a problem of selecting schemas for inclusion in the dump.  This is
a problem of associating extensions with their pg_extension_config_dump()
relations and omitting those extension-member relations when "-e" causes
omission of the extension.
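
Concretely, I'd expect the fix to live where pg_dump walks each
extension's extconfig list, along these lines (sketch only; the exact
dump-flag test is my assumption):

	for (i = 0; i < numExtensions; i++)
	{
		ExtensionInfo *curext = &(extinfo[i]);

		/* honor -e: skip config tables of extensions we aren't dumping */
		if (curext->dobj.dump == DUMP_COMPONENT_NONE)
			continue;

		/* ... existing handling of curext->extconfig ... */
	}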




Re: TRUNCATE on foreign table

2021-04-13 Thread Fujii Masao




On 2021/04/13 23:25, Kohei KaiGai wrote:

On Tue, 13 Apr 2021 at 21:03, Bharath Rupireddy  wrote:

Yeah, ONLY clause is not pushed to the remote server in case of SELECT
commands. This is also true for DELETE and UPDATE commands on foreign
tables.


This sounds like a reasonable reason why ONLY should be ignored in TRUNCATE on
foreign tables, for now. If there is an existing rule about how to treat the
ONLY clause for foreign tables, TRUNCATE should basically follow it at this
stage. Maybe we can change the rule, but that's an item for v15 or later?



I'm not sure if it wasn't thought necessary or if there is an
issue to push it or I may be missing something here.


I could not find any past discussion about foreign tables and the ONLY clause.
I guess that ONLY is ignored in the SELECT-on-foreign-tables case because ONLY
is interpreted outside the executor, and it's not easy to change the executor
so that ONLY is passed to the FDW. Maybe...



I think we can
start a separate thread to see other hackers' opinions on this.

I'm not sure whether all the clauses that are possible for
SELECT/UPDATE/DELETE/INSERT with local tables are pushed to the remote
server by postgres_fdw.

Well, now foreign TRUNCATE pushes the ONLY clause to the remote server
which is inconsistent when compared to SELECT/UPDATE/DELETE commands.
If we were to keep it consistent across all foreign commands that
ONLY clause is not pushed to remote server, then we can restrict for
TRUNCATE too and even if "TRUNCATE ONLY foreign_tbl" is specified,
just pass "TRUNCATE foreign_tbl" to remote server. Having said that, I
don't see any real problem in pushing the ONLY clause, at least in
case of TRUNCATE.


If ONLY-clause would be pushed down to the remote query of postgres_fdw,
what does the foreign-table represent in the local system?

In my understanding, a local foreign table by postgres_fdw is a
representation of
entire tree of the remote parent table and its children.


If so, I'm still wondering why CASCADE/RESTRICT (i.e., DropBehavior) needs to
be passed to FDW. IOW, if a foreign table is an abstraction of an external
data source, ISTM that postgres_fdw should always issue TRUNCATE with
CASCADE. Why do we need to allow RESTRICT to be specified for a foreign table
even though it's an abstraction of an external data source?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Proposal: Save user's original authenticated identity for logging

2021-04-13 Thread Jacob Champion
On Wed, 2021-04-07 at 10:20 +0900, Michael Paquier wrote:
> Anyway, using a FATAL in this code path is fine by me at the end, so I
> have applied the patch.  Let's see now what the buildfarm thinks about
> it.

Looks like the farm has gone green, after some test fixups. Thanks for
all the reviews!

--Jacob


Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-13 Thread Alvaro Herrera
On 2021-Apr-13, Amit Langote wrote:

> Actually it occurred to me this morning that CLOBBER_CACHE_ALWAYS is
> what exposed this problem on this animal (not sure if other such
> animals did too though).  With CLOBBER_CACHE_ALWAYS, a PartitionDesc
> will be built afresh on most uses.  In this particular case, the RI
> query executed by the insert has to build a new one (for d4_primary),
> correctly excluding the detach-pending partition (d4_primary1) per the
> snapshot with which it is run.  In normal builds, it would reuse the
> one built by an earlier query in the transaction, which does include
> the detach-pending partition, thus allowing the insert trying to
> insert a row referencing that partition to succeed.  There is a
> provision in RelationGetPartitionDesc() to rebuild if any
> detach-pending partitions in the existing copy of PartitionDesc are
> not to be seen by the current query, but a bug mentioned in my earlier
> reply prevents that from kicking in.

Right -- that explanation makes perfect sense: the problem stems from
the fact that the cached copy of the partition descriptor is not valid
depending on the visibility of detached partitions for the operation
that requests the descriptor.  I think your patch is a critical part to
a complete solution, but one thing is missing: we don't actually know
that the detached partitions we see now are the same detached partitions
we saw a moment ago.  After all, a partitioned table can have several
partitions in the process of being detached; so if you just go with the
boolean "does it have any detached or not" bit, you could end up with a
descriptor that doesn't include/ignore *all* detached partitions, just
the older one(s).

I think you could fix this aspect easily by decreeing that you can
have only one partition-being-detached at any one time.  So if you try to
DETACH CONCURRENTLY and there's another one in that state, raise an
error.  Maybe for simplicity we should do that anyway.
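
The check itself would be cheap, something like this (sketch; the
helper is assumed, e.g. a scan of pg_inherits for a pending-detach
flag):

	if (PartitionHasPendingDetach(RelationGetRelid(rel)))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot detach partition: another partition of \"%s\" is being detached",
						RelationGetRelationName(rel))));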

But I think there's another hidden assumption in your patch, which is
that the descriptor is rebuilt every now and then *anyway* because the
flag for detached flips between parser and executor, and because we send
invalidation messages for each detach.  I don't think we would ever
change things that would break this flipping (it sounds like planner and
executor necessarily have to be doing things differently all the time),
but it seems fragile as heck.  I would feel much safer if we just
avoided caching the wrong thing ... or perhaps keep a separate cache
entry (one descriptor including detached, another one not), to avoid
pointless rebuilds.
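
For the separate-entry idea, the lookup side could be as simple as this
(sketch; rd_partdesc_nodetached and the omit_detached parameter are
hypothetical, not existing code):

PartitionDesc
RelationGetPartitionDesc(Relation rel, bool omit_detached)
{
	PartitionDesc *cached = omit_detached ?
		&rel->rd_partdesc_nodetached :	/* assumed new relcache field */
		&rel->rd_partdesc;

	if (*cached == NULL)
		*cached = RelationBuildPartitionDesc(rel, omit_detached);

	return *cached;
}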

-- 
Álvaro Herrera   Valdivia, Chile




Proposal for working open source with PostgreSQL

2021-04-13 Thread Nandni Mehla
  Hello Sir/Madam,
I'm Nandni Mehla, a sophomore currently pursuing B.Tech in IT from Indira
Gandhi Delhi Technical University for Women, Delhi. I've recently started
working on open source and I think I will be a positive addition to
your organization for working on projects using C and SQL, as I have
experience in these, and I am willing to learn more from you.
I am attaching my proposal in this email for your reference, please guide
me through this.
Regards.

Proposal Link:
https://docs.google.com/document/d/1H84WmzZbMERPrjsnXbvoQ7W2AaKsM8eJU02SNw7vQBk/edit?usp=sharing


Re: More sepgsql weirdness

2021-04-13 Thread Robert Haas
On Tue, Apr 13, 2021 at 10:33 AM Dave Page  wrote:
> On a system with selinux and sepgsql configured, search path resolution 
> appears to fail if sepgsql is in enforcing mode, but selinux is in permissive 
> mode (which, as I understand it, should cause sepgsql to behave as if it's in 
> permissive mode anyway - and does for other operations). Regardless of 
> whether my understanding of the interaction of the two permissive modes is 
> correct, I don't believe the following should happen:

I agree that this sounds like something which shouldn't happen if the
system is in permissive mode, but I think the behavior itself is
deliberate. See OAT_NAMESPACE_SEARCH and commit
e965e6344cfaff0708a032721b56f61eea777bc5.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade check for invalid role-specific default config

2021-04-13 Thread Charlie Hornsby
Tom wrote:
> I do find it interesting that we now have two reports of somebody
> doing "ALTER ROLE SET role = something".  In the older thread,
> I was skeptical that that had any real use-case, so I wonder if
> Charlie has a rationale for having done that.

Unfortunately I haven't heard back from the original developer
who set up this role configuration, but if I do then I will share
their intentions.  In any case the invalid configuration had been
removed from every other role except one (certainly by mistake),
which led to me rediscovering this issue.

I tested the above patch with the invalid data locally and it avoids
the restore error that we ran into previously.  Also, unlike my initial
idea of adding a check, it requires no intervention to proceed with
pg_upgrade, so it is definitely simpler from a user perspective.

Thank you for taking a deep look into this and finding a better
solution.

Best regards,
Charlie Hornsby



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-04-13 Thread Alvaro Herrera
On 2021-Apr-12, Bruce Momjian wrote:

> OK, the attached patch renames pg_stat_activity.queryid to 'query_id'. I
> have not changed any of the APIs which existed before this feature was
> added, and are called "queryid" or "queryId" --- it is kind of a mess. 
> I assume I should leave those unchanged.  It will also need a catversion
> bump.

I think it is fine actually.  These names appear in structs Query and
PlannedStmt, and every single member of those already uses camelCase
naming.  Changing those to use "query_id" would look out of place.
You did change the one in PgBackendStatus to st_query_id, which also
matches the naming style in that struct, so that looks fine also.

So I'm -1 on Julien's first proposed change, and +1 on his second
proposed change (the name of the first argument of
pgstat_report_query_id should be query_id).

-- 
Álvaro Herrera   Valdivia, Chile




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2021-04-13 Thread James Coleman
On Mon, Apr 12, 2021 at 10:07 PM James Coleman  wrote:
>
> On Mon, Apr 12, 2021 at 7:49 PM David Rowley  wrote:
> >
> > On Tue, 13 Apr 2021 at 11:42, Tom Lane  wrote:
> > >
> > > David Rowley  writes:
> > > > I realised when working on something unrelated last night that we can
> > > > also do hash lookups for NOT IN too.
> > >
> > > ... and still get the behavior right for nulls?
> >
> > Yeah, it will. There are already some special cases for NULLs in the
> > IN version. Those would need to be adapted for NOT IN.
>
> I hadn't thought about using the negator operator directly that way
> when I initially wrote the patch.
>
> But also I didn't think a whole lot about the NOT IN case at all --
> and there's no mention of such that I see in this thread or the
> precursor thread. It's pretty obvious that it wasn't part of my
> immediate need, but obviously it'd be nice to have the consistency.
>
> All that to say this: my vote would be to put it into PG15 also.

...and here's a draft patch. I can take this to a new thread if you'd
prefer; the one here already got committed, on the other hand this is
pretty strongly linked to this discussion, so I figured it made sense
to post it here.

James
From 08c37c5b228a0a7e9546a481a789eb1c384fcfc7 Mon Sep 17 00:00:00 2001
From: jcoleman 
Date: Tue, 13 Apr 2021 13:36:38 -0400
Subject: [PATCH v1] Add HashedScalarArrayOp support for NOT IN

---
 src/backend/optimizer/prep/prepqual.c |  1 +
 src/backend/optimizer/util/clauses.c  |  4 +-
 src/backend/parser/parse_oper.c   |  1 +
 src/include/nodes/primnodes.h |  1 +
 src/test/regress/expected/expressions.out | 90 +++
 src/test/regress/sql/expressions.sql  | 31 
 6 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/src/backend/optimizer/prep/prepqual.c b/src/backend/optimizer/prep/prepqual.c
index 42c3e4dc04..2c29993b13 100644
--- a/src/backend/optimizer/prep/prepqual.c
+++ b/src/backend/optimizer/prep/prepqual.c
@@ -129,6 +129,7 @@ negate_clause(Node *node)
 	newopexpr->opfuncid = InvalidOid;
 	newopexpr->hashfuncid = InvalidOid;
 	newopexpr->useOr = !saopexpr->useOr;
+	newopexpr->isNegator = true;
 	newopexpr->inputcollid = saopexpr->inputcollid;
 	newopexpr->args = saopexpr->args;
 	newopexpr->location = saopexpr->location;
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 526997327c..99e688426e 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -2137,8 +2137,8 @@ convert_saop_to_hashed_saop_walker(Node *node, void *context)
 		Oid			lefthashfunc;
 		Oid			righthashfunc;
 
-		if (saop->useOr && arrayarg && IsA(arrayarg, Const) &&
-			!((Const *) arrayarg)->constisnull &&
+		if ((saop->useOr || (!saop->useOr && saop->isNegator)) && arrayarg &&
+			IsA(arrayarg, Const) && !((Const *) arrayarg)->constisnull &&
 			get_op_hash_functions(saop->opno, &lefthashfunc, &righthashfunc) &&
 			lefthashfunc == righthashfunc)
 		{
diff --git a/src/backend/parser/parse_oper.c b/src/backend/parser/parse_oper.c
index c379d72fce..222f15719a 100644
--- a/src/backend/parser/parse_oper.c
+++ b/src/backend/parser/parse_oper.c
@@ -896,6 +896,7 @@ make_scalar_array_op(ParseState *pstate, List *opname,
 	result->opfuncid = opform->oprcode;
 	result->hashfuncid = InvalidOid;
 	result->useOr = useOr;
+	result->isNegator = strcmp("<>", NameStr(opform->oprname)) == 0;
 	/* inputcollid will be set by parse_collate.c */
 	result->args = args;
 	result->location = location;
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 9ae851d847..819856395e 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -592,6 +592,7 @@ typedef struct ScalarArrayOpExpr
 	Oid			opfuncid;		/* PG_PROC OID of comparison function */
 	Oid			hashfuncid;		/* PG_PROC OID of hash func or InvalidOid */
 	bool		useOr;			/* true for ANY, false for ALL */
+	bool		isNegator;		/* true if NOT has been applied to opno */
 	Oid			inputcollid;	/* OID of collation that operator should use */
 	List	   *args;			/* the scalar and array operands */
 	int			location;		/* token location, or -1 if unknown */
diff --git a/src/test/regress/expected/expressions.out b/src/test/regress/expected/expressions.out
index 5944dfd5e1..2e88f1ca19 100644
--- a/src/test/regress/expected/expressions.out
+++ b/src/test/regress/expected/expressions.out
@@ -216,6 +216,61 @@ select return_text_input('a') in ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', '
  t
 (1 row)
 
+-- NOT IN
+select return_int_input(1) not in (10, 9, 2, 8, 3, 7, 4, 6, 5, 1);
+ ?column? 
+--
+ f
+(1 row)
+
+select return_int_input(1) not in (10, 9, 2, 8, 3, 7, 4, 6, 5, 0);
+ ?column? 
+--
+ t
+(1 row)
+
+select return_int_input(1) not in (10, 9, 2, 8, 3, 7, 4, 6, 5, 2, null);
+ ?column? 
+--
+ 
+(1 row)
+
+select return_int_input(1) not in (10, 9, 2, 8, 3, 7, 4, 6, 5, 1, null);
+ ?col

Re: Have I found an interval arithmetic bug?

2021-04-13 Thread Bryn Llewellyn
> On 12-Apr-2021, at 17:25, Bruce Momjian  wrote:
> 
> On Mon, Apr 12, 2021 at 05:20:43PM -0700, Bryn Llewellyn wrote:
>> I’d argue that the fact that this:
>> 
>> ('0.3 months'::interval) + ('0.7 months'::interval)
>> 
>> Is reported as '30 days' and not '1 month' is yet another
>> bug—precisely because of what I said in my previous email (sorry
>> that I forked the thread) where I referred to the fact that, in the
>> right test, adding 1 month gets a different answer than adding 30
>> days. 
> 
> Flowing _up_ is what these functions do:
> 
>   \df *justify*
>  List of functions
>  Schema   |   Name   | Result data type | Argument data types 
> | Type
>   
> +--+--+-+--
>pg_catalog | justify_days | interval | interval
> | func
>pg_catalog | justify_hours| interval | interval
> | func
>pg_catalog | justify_interval | interval | interval
> | func
> 
>> Yet another convincing reason to get rid of this flow down
>> business altogether.
> 
> We can certainly get rid of all downflow, which in the current patch is
> only when fractional internal units are specified.
> 
>> If some application wants to model flow-down, then it can do so with
>> trivial programming and full control over its own definition of the
>> rules.

“Yes please!” re Bruce’s “We can certainly get rid of all downflow, which in 
the current patch is only when fractional internal units are specified.”

Notice that a user who creates interval values explicitly only by using 
“make_interval()” will see no behavior change.

There’s another case of up-flow. When you subtract one timestamp value from 
another, and when they’re far enough apart, then the (internal representation 
of the) resulting interval value has both a seconds component and a days 
component. (But never, in my tests, a months component.) I assume that the 
implementation first gets the difference between the two timestamp values in 
seconds using (the moral equivalent of) “extract epoch”. And then, if this is 
greater than 24*60*60, it implements up-flow using the “rule-of-24”, never 
mind that this means that adding the answer back to the timestamp value that 
you subtracted might not reproduce the timestamp value from which you 
subtracted. (This happens around a DST change and has been discussed earlier 
in the thread.)
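
For example (a sketch, not from the thread; plain timestamps, no time zone):

select timestamp '2021-04-13 12:00' - timestamp '2021-03-01 00:00';
-- => '43 days 12:00:00': a days component and a seconds component,
-- but no months component, however far apart the inputs are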

The purpose of these three “justify” functions is dubious. I think that it’s 
best to think of the 3-component interval vector like an [x, y, z] vector in 3d 
geometric space, where the three coordinates are not mutually convertible 
because each has unique semantics. This point has been rehearsed many times in 
this long thread.

Having said this, it isn’t hard to understand the rules that the functions 
implement. And, most importantly, their use is voluntary. They are, though, no 
more than shipped and documented wrappers for a few special cases. A user could 
so easily write their own function like this:

1. Compute the values of the three components of the internal representation of 
the passed-in interval value using the “extract” feature and some simple 
arithmetic.

2. Derive the [months, days, seconds] values of a new representation using 
whatever rules you prefer.

3. Use these new values to create the return interval value.

For example, I might follow a discipline to use interval values that have only 
one of the three components of the internal representation non-zero. It’s easy 
to use the “domain” feature for this. (I can use these in any context where I 
can use the shipped interval.) I could then write a function to convert a “pure 
seconds” value of my_interval to a “pure years” value. And I could implement my 
own rules:

— Makes sense only for a large number of seconds that comes out to at least 
five years (else assert failure).

— Converts seconds to years using the rule that 1 year is, on average, 
365.25*24*60*60 seconds, and then truncates it.

There’s no shipped function that does this, and this makes me suspect that I’d 
prefer to roll my own for any serious purpose.
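
A minimal sketch of such a function (the name and the assertion style are 
mine, purely illustrative):

create function seconds_to_years(i interval)
returns interval
language plpgsql
as $$
declare
  secs double precision := extract(epoch from i);
  yrs  int;
begin
  -- expects a "pure seconds" interval of at least five years
  assert extract(year from i) = 0 and extract(month from i) = 0
         and extract(day from i) = 0, 'expects a pure-seconds interval';
  assert secs >= 5 * 365.25 * 24 * 60 * 60, 'needs at least five years';
  -- 1 year is, on average, 365.25*24*60*60 seconds; truncate the result
  yrs := trunc(secs / (365.25 * 24 * 60 * 60));
  return make_interval(years => yrs);
end;
$$;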

The existence of the three “justify” functions is, therefore, harmless.



Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-13 Thread Justin Pryzby
On Sat, Apr 10, 2021 at 01:42:26PM -0500, Justin Pryzby wrote:
> On Sun, Mar 21, 2021 at 03:01:15PM -0300, Alvaro Herrera wrote:
> > > But note that it doesn't check if an existing constraint "implies" the new
> > > constraint - maybe it should.
> > 
> > Hm, I'm not sure I want to do that, because that means that if I later
> > have to attach the partition again with the same partition bounds, then
> > I might have to incur a scan to recheck the constraint.  I think we want
> > to make the new constraint be as tight as possible ...
> 
> If it *implies* the partition constraint, then it's at least as tight (and
> maybe tighter), yes ?
> 
> I think you're concerned with the case that someone has a partition with
> "tight" bounds like (a>=200 and a<300) and a check constraint that's "less
> tight" like (a>=100 AND a<400).  In that case, the loose check constraint
> doesn't imply the tighter partition constraint, so your patch would add a
> non-redundant constraint.
> 
> I'm interested in the case that someone has a check constraint that almost but
> not exactly matches the partition constraint, like (a<300 AND a>=200). In
> that case, your patch adds a redundant constraint.
> 
> I wrote a patch which seems to effect my preferred behavior - please check.

On Sat, Apr 10, 2021 at 02:13:26PM -0500, Justin Pryzby wrote:
> I suppose the docs should be updated for the exact behavior, maybe even
> without this patch:
> 
> |A CHECK constraint
> |that duplicates the partition constraint is added to the partition.

I added this as an Open Item, since it affects user-visible behavior:
whether or not a redundant, non-equal constraint is added.
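
For illustration, the two cases with hypothetical names (not from the patch):

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (200) TO (300);

-- Equivalent to the partition constraint, but not textually equal to it:
ALTER TABLE p1 ADD CONSTRAINT p1_check CHECK (a < 300 AND a >= 200);

-- Currently adds CHECK (a IS NOT NULL AND a >= 200 AND a < 300) as well,
-- even though p1_check already implies it:
ALTER TABLE p DETACH PARTITION p1 CONCURRENTLY;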

-- 
Justin




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Robert Haas
On Mon, Apr 12, 2021 at 11:06 PM Andres Freund  wrote:
> No, they have to be the same in each. Note how the tranche ID is part of
> struct LWLock. Which is why LWLockNewTrancheId() has to acquire a lock
> etc.

More precisely, if a tranche ID is defined in multiple backends, it
needs to be defined the same way in all of them. But it is possible to
have an extension loaded into some backends and not others and have it
define a tranche ID that other backends know nothing about.

Another point to note is that, originally, I had an idea that each
tranche of lwlocks was situated in a single array somewhere in
memory. Perhaps that was an array of something else, like buffer
descriptors, and the lwlocks were just one element of the struct, or
maybe it was an array specifically of LWLocks, but one way or the
other, there was definitely one array that had all the LWLocks from
that tranche in it. So before the commit in question --
3761fe3c20bb040b15f0e8da58d824631da00caa -- T_ID() used to compute an
offset for a lock within the tranche that was supposed to uniquely
identify the lock. However, the whole idea of an array per tranche
turns out to be broken by design.

Consider parallel query. You could, perhaps, arrange for all the
LWLocks that a particular query needs to be in one tranche. And that's
all fine. But what if there are multiple parallel contexts in
existence at the same time? I think right now that may be impossible
as a practical matter, since for example an SQL function that is
called by a parallel query is supposed to run any SQL statements
inside of it without parallelism. But, that's basically a policy
decision. There's nothing in the parallel context machinery itself
which prevents multiple parallel contexts from being active at the
same time. And if that happens, then you'd have multiple arrays with
the same tranche ID, so how do you identify the locks then? The
pre-3761fe3c20bb040b15f0e8da58d824631da00caa data structure doesn't
work because it has only one place to store an array base, but having
multiple places to store an array base doesn't fix it either because
now you've just given the same identifier to multiple locks.

You could maybe fix it by putting a limit on how many parallel
contexts can be open at the same time, and then having N copies of
each parallelism-related tranche. But that seems ugly and messy and a
burden on extension authors and not really what anybody wants.

You could try to identify locks by pointer addresses, but that's got
security hazards and the addresses aren't portable across all the
backends involved in the parallel query because of how DSM works, so
it's not really that helpful in terms of matching stuff up.

You could identify every lock by a tranche ID + an array offset + a
"tranche instance ID". But where would you store the tranche instance
ID to make it readily accessible, other than in the lock itself?
Andres wasn't thrilled about using even 2 bytes to identify the
LWLock, so he'll probably like having more bytes in there for that
kind of thing even less. And to be honest I wouldn't blame him. We
only need 12 bytes to implement the lock itself -- we can't justify
having more than a couple of additional bytes for debugging purposes.

On a broader level, I agree that being able to find out what the
system is doing is really important. But I'm also not entirely
convinced that having really fine-grained information here to
distinguish between one lock and another is the way to get there.
Personally, I've never run into a problem where I really needed to
know anything more than the tranche name. For example, I've seen cases
where there's a lot of contention on SubtransSLRULock, or problems
with WALInsertLock. But I can't
really see why I'd need to know which WALInsertLock was experiencing
contention. If we were speaking of buffer content locks, I suppose I
can imagine wanting more details, but it's not really the buffer
number I'd want to know. I'd want to know the database OID, the
relfilenode, the fork number, and the block number. You can argue that
we should just expose the buffer number and let the user sort out the
rest with dtrace/systemtap magic, but that makes it useless in
practice to an awful lot of people, including me.
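
(For context, the tranche name is what is already visible at the SQL level
today, e.g. via wait events; a sketch:)

SELECT wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
GROUP BY wait_event
ORDER BY count(*) DESC;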

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade check for invalid role-specific default config

2021-04-13 Thread Tom Lane
Charlie Hornsby  writes:
> I tested the above patch with the invalid data locally and it avoids
> the restore error that we ran into previously.  Also it requires no
> intervention to progress with pg_upgrade, unlike my initial idea of
> adding a check, so it is definitely simpler from a user perspective.

Thanks for testing!  I've pushed this, so it will be in the May
minor releases.

regards, tom lane




Re: New IndexAM API controlling index vacuum strategies

2021-04-13 Thread Peter Geoghegan
On Mon, Apr 12, 2021 at 11:05 PM Masahiko Sawada  wrote:
> I realized that when the failsafe is triggered, we don't bypass heap
> truncation that is performed before updating relfrozenxid. I think
> it's better to bypass it too. What do you think?

I agree. Heap truncation is exactly the kind of thing that risks
adding significant, unpredictable delay at a time when we need to
advance relfrozenxid as quickly as possible.

I pushed a trivial commit that makes the failsafe bypass heap
truncation as well just now.

Thanks
-- 
Peter Geoghegan




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Andres Freund
Hi,

On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> On Mon, Apr 12, 2021 at 11:06 PM Andres Freund  wrote:
> You could identify every lock by a tranche ID + an array offset + a
> "tranche instance ID". But where would you store the tranche instance
> ID to make it readily accessible, other than in the lock itself?
> Andres wasn't thrilled about using even 2 bytes to identify the
> LWLock, so he'll probably like having more bytes in there for that
> kind of thing even less.

I still don't like the two bytes, fwiw ;). Especially because it's 4
bytes due to padding right now.

I'd like to move the LWLock->waiters list to outside of the lwlock
itself - at most TotalProcs LWLocks can be waited for, so we don't need
millions of empty proclist_heads. That way we could also remove the
proclist indirection - which shows up a fair bit in contended workloads.

And if we had a separate "lwlocks being waited for" structure, we could
also add more information to it if we wanted to...

The difficulty of course is having space to indicate which of these
"waiting for" lists are being used - there's not enough space in ->state
right now to represent that. Two possible approaches:

- We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
  while I personally think that's a sane upper limit, I already had a
  hard time getting consensus to lower the limit to 2^18-1.

- Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
  well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.


It's not so much that I want to lower the overall memory usage (although
it doesn't hurt). It's more about being able to fit more data into one
cacheline together with the lwlock. E.g. being able to fit more into
BufferDesc would be very useful.

A secondary benefit of such an approach would be that it makes it a
lot easier to add efficient adaptive spinning on contended locks. I did
experiment with that, and there's some considerable potential for
performance benefits there. But for it to scale well we need something
similar to "mcs locks", to avoid causing too much contention. And that
pretty much requires some separate space to store wait information
anyway.

With an 8 bytes state we probably could also stash the tranche inside
that...


> On a broader level, I agree that being able to find out what the
> system is doing is really important. But I'm also not entirely
> convinced that having really fine-grained information here to
> distinguish between one lock and another is the way to get there.
> Personally, I've never run into a problem where I really needed to
> know anything more than the tranche name.

I think it's quite useful for relatively simple things like analyzing
the total amount of time spent in individual locks, without incurring
much overhead when not doing so (for which you need to identify
individual locks, otherwise your end - start time is going to be
meaningless). And, slightly more advanced, for analyzing what the stack
was when the lock was released - which then allows you to see what work
you're blocked on, something I found hard to figure out otherwise.

I found that that's mostly quite doable with dynamic probes though.


> For example, I've seen cases where there's a lot of contention on
> SubtransSLRULock, or problems with WALInsertLock. But I can't really
> see why I'd need to know which
> WALInsertLock was experiencing contention.

Well, but you might want to know what the task blocking you was
doing. What to optimize might differ depending on whether the other
task is e.g. a log switch (which acquires all insert locks) or WAL
writes by VACUUM.


> If we were speaking of buffer content locks, I suppose I can imagine
> wanting more details, but it's not really the buffer number I'd want
> to know. I'd want to know the database OID, the relfilenode, the fork
> number, and the block number. You can argue that we should just expose
> the buffer number and let the user sort out the rest with
> dtrace/systemtap magic, but that makes it useless in practice to an
> awful lot of people, including me.

I have wondered if we ought to put some utilities for that in contrib or
such. It's a lot easier to address something new with a decent starting
point...

Greetings,

Andres Freund




Re: Retry in pgbench

2021-04-13 Thread Jehan-Guillaume de Rorthais
Hi,

On Tue, 13 Apr 2021 16:12:59 +0900 (JST)
Tatsuo Ishii  wrote:

>  [...]  
>  [...]  
>  [...]  
> 
> Thanks for the pointer. It seems we need to resume the discussion.

By the way, I've been playing with the idea of failing gracefully and retrying
indefinitely (or until the given -T elapses) on SQL errors AND connection
issues.

It would be useful to test replicating clusters with a (switch|fail)over
procedure.

Regards,




Converting contrib SQL functions to new style

2021-04-13 Thread Tom Lane
Attached are some draft patches to convert almost all of the
contrib modules' SQL functions to use SQL-standard function bodies.
The point of this is to remove the residual search_path security
hazards that we couldn't fix in commits 7eeb1d986 et al.  Since
a SQL-style function body is fully parsed at creation time,
its object references are not subject to capture by the run-time
search path.  Possibly there are small performance benefits too,
though I've not tried to measure that.
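
For illustration, the difference on a toy function (not one of the contrib
functions):

-- old style: the body is a string, parsed again at each call, so
-- unqualified names are resolved using the run-time search_path
CREATE FUNCTION add_one(int) RETURNS int
LANGUAGE SQL IMMUTABLE STRICT
AS 'SELECT $1 + 1';

-- new style (SQL-standard body): parsed once at creation time, so
-- object references cannot be captured by the run-time search path
CREATE OR REPLACE FUNCTION add_one(int) RETURNS int
LANGUAGE SQL IMMUTABLE STRICT
RETURN $1 + 1;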

I've not touched the documentation yet.  I suppose that we can
tone down the warnings added by 7eeb1d986 quite a bit, maybe
replacing them with just "be sure to use version x.y or later".
However I think we may still need an assumption that earthdistance
and cube are in the same schema --- any comments on that?

I'd like to propose squeezing these changes into v14, even though
we're past feature freeze.  Reason one is that this is less a
new feature than a security fix; reason two is that this provides
some non-artificial test coverage for the SQL-function-body feature.

BTW, there still remain a couple of old-style SQL functions in
contrib/adminpack and contrib/lo.  AFAICS those are unconditionally
secure, so I didn't bother with them.

Thoughts?

regards, tom lane

diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile
index a7de52928d..d879f4eb0b 100644
--- a/contrib/citext/Makefile
+++ b/contrib/citext/Makefile
@@ -4,6 +4,7 @@ MODULES = citext
 
 EXTENSION = citext
 DATA = citext--1.4.sql \
+	citext--1.6--1.7.sql \
 	citext--1.5--1.6.sql \
 	citext--1.4--1.5.sql \
 	citext--1.3--1.4.sql \
diff --git a/contrib/citext/citext--1.6--1.7.sql b/contrib/citext/citext--1.6--1.7.sql
new file mode 100644
index 00..51a7edf2bc
--- /dev/null
+++ b/contrib/citext/citext--1.6--1.7.sql
@@ -0,0 +1,65 @@
+/* contrib/citext/citext--1.6--1.7.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION citext UPDATE TO '1.7'" to load this file. \quit
+
+--
+-- Matching citext in string comparison functions.
+-- XXX TODO Ideally these would be implemented in C.
+--
+
+CREATE OR REPLACE FUNCTION regexp_match( citext, citext ) RETURNS TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_match( $1::pg_catalog.text, $2::pg_catalog.text, 'i' );
+
+CREATE OR REPLACE FUNCTION regexp_match( citext, citext, text ) RETURNS TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_match( $1::pg_catalog.text, $2::pg_catalog.text, CASE WHEN pg_catalog.strpos($3, 'c') = 0 THEN  $3 || 'i' ELSE $3 END );
+
+CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS SETOF TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE ROWS 1
+RETURN pg_catalog.regexp_matches( $1::pg_catalog.text, $2::pg_catalog.text, 'i' );
+
+CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text ) RETURNS SETOF TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE ROWS 10
+RETURN pg_catalog.regexp_matches( $1::pg_catalog.text, $2::pg_catalog.text, CASE WHEN pg_catalog.strpos($3, 'c') = 0 THEN  $3 || 'i' ELSE $3 END );
+
+CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text ) returns TEXT
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_replace( $1::pg_catalog.text, $2::pg_catalog.text, $3, 'i');
+
+CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text, text ) returns TEXT
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_replace( $1::pg_catalog.text, $2::pg_catalog.text, $3, CASE WHEN pg_catalog.strpos($4, 'c') = 0 THEN  $4 || 'i' ELSE $4 END);
+
+CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext ) RETURNS TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_split_to_array( $1::pg_catalog.text, $2::pg_catalog.text, 'i' );
+
+CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext, text ) RETURNS TEXT[]
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_split_to_array( $1::pg_catalog.text, $2::pg_catalog.text, CASE WHEN pg_catalog.strpos($3, 'c') = 0 THEN  $3 || 'i' ELSE $3 END );
+
+CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext ) RETURNS SETOF TEXT
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_split_to_table( $1::pg_catalog.text, $2::pg_catalog.text, 'i' );
+
+CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext, text ) RETURNS SETOF TEXT
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.regexp_split_to_table( $1::pg_catalog.text, $2::pg_catalog.text, CASE WHEN pg_catalog.strpos($3, 'c') = 0 THEN  $3 || 'i' ELSE $3 END );
+
+CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT
+LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
+RETURN pg_catalog.strpos( pg_catalog.lower( $1::pg_catalog.text ), pg_catalog.lower( $2::pg_catalog.text ) );
+
+CREATE OR REPLACE FUNCTION replace( citext, citext, citext ) RETURNS TEXT
+LANGUAGE SQL IMMUTABLE 

Re: Uninitialized scalar variable (UNINIT) (src/backend/statistics/extended_stats.c)

2021-04-13 Thread Tomas Vondra



On 4/12/21 7:04 PM, Tom Lane wrote:
> Ranier Vilela  writes:
> On Mon, Apr 12, 2021 at 03:04, Tom Lane  wrote:
>>> It would be wrong, though, or at least not have the same effect.
> 
>> I think you mean that filling pointers with 0 is not the same as filling
>> pointers with NULL.
> 
> No, I mean that InvalidBlockNumber isn't 0.
> 
>> I was confused here: does the patch follow the pattern and fix the problem
>> or not?
> 
> Your patch seems fine.  Justin's proposed improvement isn't.
> 

Pushed.

> (I'm not real sure whether there's any *actual* bug here --- would we
> really be looking at either ctid or tableoid of this temporary tuple?
> But it's probably best to ensure that they're valid anyway.)

Yeah, the tuple is only built so that we can pass it to the various
selectivity estimators. I don't think anything will be actually looking
at those fields, but initializing them seems easy enough.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: New IndexAM API controlling index vacuum strategies

2021-04-13 Thread Masahiko Sawada
On Wed, Apr 14, 2021 at 4:59 AM Peter Geoghegan  wrote:
>
> On Mon, Apr 12, 2021 at 11:05 PM Masahiko Sawada  
> wrote:
> > I realized that when the failsafe is triggered, we don't bypass heap
> > truncation that is performed before updating relfrozenxid. I think
> > it's better to bypass it too. What do you think?
>
> I agree. Heap truncation is exactly the kind of thing that risks
> adding significant, unpredictable delay at a time when we need to
> advance relfrozenxid as quickly as possible.
>
> I pushed a trivial commit that makes the failsafe bypass heap
> truncation as well just now.

Great, thanks!

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Extensions not dumped when --schema is used

2021-04-13 Thread Michael Paquier
On Tue, Apr 13, 2021 at 08:00:34AM -0700, Noah Misch wrote:
> On Tue, Apr 13, 2021 at 02:43:11PM +0900, Michael Paquier wrote:
>>> - If extschema='public', "pg_dump -e plpgsql --schema=public" includes
>>>   commands to dump the relation data.  This surprised me.  (The
>>>   --schema=public argument causes selectDumpableNamespace() to set
>>>   nsinfo->dobj.dump=DUMP_COMPONENT_ALL instead of DUMP_COMPONENT_ACL.)
>> 
>> This one would be expected to me.  Per the discussion upthread, we
>> want --schema and --extension to be two separate and exclusive
>> switches.  So, once the caller specifies --schema we should dump the
>> contents of the schema, even if its extension is not listed with
>> --extension.
> 
> I may disagree with this later, but I'm setting it aside for the moment.
>
> This isn't a problem of selecting schemas for inclusion in the dump.  This is
> a problem of associating extensions with their pg_extension_config_dump()
> relations and omitting those extension-member relations when "-e" causes
> omission of the extension.

At the code level, the decision to dump the data of any extension's
dumpable table is made in processExtensionTables().  I have to admit
that your approach keeps the code simpler than what I had been
considering, which applied extra filtering based on the list of
extensions in this code path.  So I can see the value in your argument
not to dump the data of an extension's dumpable table at all as long
as its extension is not listed, even if its schema is explicitly
listed.
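
(For background: an extension marks a configuration table as dumpable from
its install script with a call like the following; the table name here is
illustrative.)

SELECT pg_catalog.pg_extension_config_dump('my_config_table', '');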

So I set out to make the behavior more consistent with the attached
patch.  This passes your case.  It is worth noting that if a table is
part of a schema created by an extension, but the table itself does not
depend on the extension, we would still dump its data when using
--schema with that table's schema while the extension is not in the
--extension list.  In the attached, that's the extra test
without_extension_implicit_schema.

(By the way, good catch with the duplicated --no-sync in the new
tests.)

What do you think?
--
Michael
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d0ea489614..391947340f 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -18271,7 +18271,8 @@ processExtensionTables(Archive *fout, ExtensionInfo extinfo[],
 	 * Note that we create TableDataInfo objects even in schemaOnly mode, ie,
 	 * user data in a configuration table is treated like schema data. This
 	 * seems appropriate since system data in a config table would get
-	 * reloaded by CREATE EXTENSION.
+	 * reloaded by CREATE EXTENSION.  If the extension is not listed in the
+	 * list of extensions to be included, none of its data is dumped.
 	 */
 	for (i = 0; i < numExtensions; i++)
 	{
@@ -18283,6 +18284,15 @@ processExtensionTables(Archive *fout, ExtensionInfo extinfo[],
 		int			nconfigitems = 0;
 		int			nconditionitems = 0;
 
+		/*
+		 * Check if this extension is listed as one to include in the dump.  If
+		 * not, any table data associated with it is discarded.
+		 */
+		if (extension_include_oids.head != NULL &&
+			!simple_oid_list_member(&extension_include_oids,
+	curext->dobj.catId.oid))
+			continue;
+
 		if (strlen(extconfig) != 0 || strlen(extcondition) != 0)
 		{
 			int			j;
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index ef98c08493..51450f1b3a 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -208,6 +208,30 @@ my %pgdump_runs = (
 			'pg_dump', '--no-sync', "--file=$tempdir/without_extension.sql",
 			'--extension=plpgsql', 'postgres',
 		],
+	},
+
+	# plpgsql in the list of extensions blocks the dump of extension
+	# test_pg_dump.
+	without_extension_explicit_schema => {
+		dump_cmd => [
+			'pg_dump',
+			'--no-sync',
+			"--file=$tempdir/without_extension_explicit_schema.sql",
+			'--extension=plpgsql',
+			'--schema=public',
+			'postgres',
+		],
+	},
+
+	without_extension_implicit_schema => {
+		dump_cmd => [
+			'pg_dump',
+			'--no-sync',
+			"--file=$tempdir/without_extension_implicit_schema.sql",
+			'--extension=plpgsql',
+			'--schema=regress_pg_dump_schema',
+			'postgres',
+		],
 	},);
 
 ###
@@ -632,6 +656,8 @@ my %tests = (
 			pg_dumpall_globals => 1,
 			section_data   => 1,
 			section_pre_data   => 1,
+			# Excludes this schema as extension is not listed.
+			without_extension_explicit_schema => 1,
 		},
 	},
 
@@ -646,6 +672,8 @@ my %tests = (
 			pg_dumpall_globals => 1,
 			section_data   => 1,
 			section_pre_data   => 1,
+			# Excludes this schema as extension is not listed.
+			without_extension_explicit_schema => 1,
 		},
 	},
 
@@ -662,6 +690,8 @@ my %tests = (
 			%full_runs,
 			schema_only  => 1,
 			section_pre_data => 1,
+			# Excludes the extension and keeps the schema's data.
+			without_extension_impl

Re: Replication slot stats misgivings

2021-04-13 Thread vignesh C
On Tue, Apr 13, 2021 at 10:46 AM Masahiko Sawada 
wrote:
>
> On Mon, Apr 12, 2021 at 9:16 PM vignesh C  wrote:
> >
> > On Mon, Apr 12, 2021 at 4:46 PM Amit Kapila wrote:
> > >
> > > On Sat, Apr 10, 2021 at 6:51 PM vignesh C  wrote:
> > > >
> > >
> > > Thanks, 0001 and 0002 look good to me. I have a minor comment for 0002.
> > >
> > > 
> > > +total_bytes bigint
> > > +
> > > +Amount of decoded transactions data sent to the decoding output plugin
> > > +while decoding the changes from WAL for this slot. This can be used to
> > > +gauge the total amount of data sent during logical decoding.
> > >
> > > Can we slightly extend it to say something like: Note that this
> > > includes the bytes streamed and/or spilled. Similarly, we can extend
> > > it for total_txns.
> > >
> >
> > Thanks for the comments; they are fixed in the attached v8 patch.
> > Thoughts?
>
> Here are review comments on new TAP tests:

Thanks for the comments.

> +# Create replication slots.
> +$node->safe_psql('postgres',
> +   "SELECT 'init' FROM
> pg_create_logical_replication_slot('regression_slot1',
> 'test_decoding')");
> +$node->safe_psql('postgres',
> +   "SELECT 'init' FROM
> pg_create_logical_replication_slot('regression_slot2',
> 'test_decoding')");
> +$node->safe_psql('postgres',
> +   "SELECT 'init' FROM
> pg_create_logical_replication_slot('regression_slot3',
> 'test_decoding')");
> +$node->safe_psql('postgres',
> +   "SELECT 'init' FROM
> pg_create_logical_replication_slot('regression_slot4',
> 'test_decoding')");
>
> and
>
> +
> +$node->safe_psql('postgres',
> +   "SELECT data FROM
> pg_logical_slot_get_changes('regression_slot1', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')");
> +$node->safe_psql('postgres',
> +   "SELECT data FROM
> pg_logical_slot_get_changes('regression_slot2', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')");
> +$node->safe_psql('postgres',
> +"SELECT data FROM
> pg_logical_slot_get_changes('regression_slot3', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')");
> +$node->safe_psql('postgres',
> +   "SELECT data FROM
> pg_logical_slot_get_changes('regression_slot4', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1')");
>
> I think we can run those similar queries in a single psql connection
> as follows:
>
>  # Create replication slots.
>  $node->safe_psql('postgres',
> qq[
> SELECT pg_create_logical_replication_slot('regression_slot1', 'test_decoding');
> SELECT pg_create_logical_replication_slot('regression_slot2', 'test_decoding');
> SELECT pg_create_logical_replication_slot('regression_slot3', 'test_decoding');
> SELECT pg_create_logical_replication_slot('regression_slot4', 'test_decoding');
> ]);
>
> and
>
> $node->safe_psql('postgres',
>qq[
> SELECT data FROM pg_logical_slot_get_changes('regression_slot1', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> SELECT data FROM pg_logical_slot_get_changes('regression_slot2', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> SELECT data FROM pg_logical_slot_get_changes('regression_slot3', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
> SELECT data FROM pg_logical_slot_get_changes('regression_slot4', NULL,
> NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
>   ]);
>

Modified.

> ---
> +# Wait for the statistics to be updated.
> +my $slot1_stat_check_query =
> +  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
> = 'regression_slot1' AND total_txns > 0 AND total_bytes > 0;";
> +my $slot2_stat_check_query =
> +  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
> = 'regression_slot2' AND total_txns > 0 AND total_bytes > 0;";
> +my $slot3_stat_check_query =
> +  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
> = 'regression_slot3' AND total_txns > 0 AND total_bytes > 0;";
> +my $slot4_stat_check_query =
> +  "SELECT count(1) = 1 FROM pg_stat_replication_slots WHERE slot_name
> = 'regression_slot4' AND total_txns > 0 AND total_bytes > 0;";
> +
> +# Verify that the statistics have been updated.
> +$node->poll_query_until('postgres', $slot1_stat_check_query)
> +  or die "Timed out while waiting for statistics to be updated";
> +$node->poll_query_until('postgres', $slot2_stat_check_query)
> +  or die "Timed out while waiting for statistics to be updated";
> +$node->poll_query_until('postgres', $slot3_stat_check_query)
> +  or die "Timed out while waiting for statistics to be updated";
> +$node->poll_query_until('postgres', $slot4_stat_check_query)
> +  or die "Timed out while waiting for statistics to be updated";
>
> We can simplify the above code to something like:
>
> $node->poll_query_until(
>'postgres', qq[
> SELECT count(slot_name) >= 4
> FROM pg_stat_replication_slots
> WHERE slot_name ~ 'regression_slot'
> AND total_txns > 0
> AND total_bytes > 0;
> ]) or die "Timed out while waiting for 

Re: Converting contrib SQL functions to new style

2021-04-13 Thread Noah Misch
On Tue, Apr 13, 2021 at 06:26:34PM -0400, Tom Lane wrote:
> Attached are some draft patches to convert almost all of the
> contrib modules' SQL functions to use SQL-standard function bodies.
> The point of this is to remove the residual search_path security
> hazards that we couldn't fix in commits 7eeb1d986 et al.  Since
> a SQL-style function body is fully parsed at creation time,
> its object references are not subject to capture by the run-time
> search path.

Are there any inexact matches in those function/operator calls?  Will that
matter more or less than it does today?

> However I think we may still need an assumption that earthdistance
> and cube are in the same schema --- any comments on that?

That part doesn't change, indeed.

> I'd like to propose squeezing these changes into v14, even though
> we're past feature freeze.  Reason one is that this is less a
> new feature than a security fix; reason two is that this provides
> some non-artificial test coverage for the SQL-function-body feature.

Dogfooding like this is good.  What about the SQL-language functions that
initdb creates?




Re: Replication slot stats misgivings

2021-04-13 Thread Masahiko Sawada
On Tue, Apr 13, 2021 at 5:07 PM vignesh C  wrote:
>
> On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 12, 2021 at 9:36 PM Amit Kapila  wrote:
> > >
> > > On Mon, Apr 12, 2021 at 5:29 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 8:08 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 12, 2021 at 4:34 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 6:19 PM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, Apr 12, 2021 at 10:27 AM Masahiko Sawada 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > It seems Vignesh has changed patches based on the latest set 
> > > > > > > > > of
> > > > > > > > > comments so you might want to rebase.
> > > > > > > >
> > > > > > > > I've merged my patch into the v6 patch set Vignesh submitted.
> > > > > > > >
> > > > > > > > I've attached the updated version of the patches. I didn't 
> > > > > > > > change
> > > > > > > > anything in the patch that changes char[NAMEDATALEN] to 
> > > > > > > > NameData (0001
> > > > > > > > patch) and patches that add tests.
> > > > > > > >
> > > > > > >
> > > > > > > I think we can push 0001. What do you think?
> > > > > >
> > > > > > +1
> > > > > >
> > > > > > >
> > > > > > > > In 0003 patch I reordered the
> > > > > > > > output parameters of pg_stat_replication_slots; showing total 
> > > > > > > > number
> > > > > > > > of transactions and total bytes followed by statistics for 
> > > > > > > > spilled and
> > > > > > > > streamed transactions seems appropriate to me.
> > > > > > > >
> > > > > > >
> > > > > > > I am not sure about this because I think we might want to add some
> > > > > > > info of stream/spill bytes in total_bytes description (something 
> > > > > > > like
> > > > > > > stream/spill bytes are not in addition to total_bytes).
> > > >
> > > > BTW doesn't it confuse users that stream/spill bytes are not in
> > > > addition to total_bytes? User will need to do "total_bytes +
> > > > spill/stream_bytes" to know the actual total amount of data sent to
> > > > the decoding output plugin, is that right?
> > > >
> > >
> > > No, total_bytes includes the spill/stream bytes. So, the user doesn't
> > > need to do any calculation to compute total_bytes sent to the output
> > > plugin.
> >
> > The following test for the latest v8 patch seems to show different.
> > total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> > something?
> >
> > postgres(1:85969)=# select pg_create_logical_replication_slot('s',
> > 'test_decoding');
> >  pg_create_logical_replication_slot
> > -------------------------------------
> >  (s,0/1884468)
> > (1 row)
> >
> > postgres(1:85969)=# create table a (i int);
> > CREATE TABLE
> > postgres(1:85969)=# insert into a select generate_series(1, 10);
> > INSERT 0 10
> > postgres(1:85969)=# set logical_decoding_work_mem to 64;
> > SET
> > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> >  slot_name | total_txns | total_bytes | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > -----------+------------+-------------+------------+-------------+-------------+-------------+--------------+--------------+-------------
> >  s         |          0 |           0 |          0 |           0 |           0 |           0 |            0 |            0 |
> > (1 row)
> >
> > postgres(1:85969)=# select count(*) from
> > pg_logical_slot_peek_changes('s', NULL, NULL);
> >  count
> > -------
> >  14
> > (1 row)
> >
> > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> >  slot_name | total_txns | total_bytes | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > -----------+------------+-------------+------------+-------------+-------------+-------------+--------------+--------------+-------------
> >  s         |          2 |        1808 |          1 |         202 |        1320 |           0 |            0 |            0 |
> > (1 row)
> >
>
> > Thanks for identifying this issue. While spilling the transactions,
> > the reorder buffer changes get released, so we cannot get the total
> > size of spilled transactions from the reorder buffer size. I have
> > fixed it by adding the spilled bytes to the total bytes for spilled
> > transactions. The attached patch has the fix for this.
> Thoughts?

I've not looked at the patches yet but as Amit mentioned before[1],
it's better to move 0002 patch to after 0004. That is, 0001 patch
changes data type to NameData, 0002 patch adds total_txn and
total_bytes, and 0003 patch adds regression tests. 0004 patch will be
the patch using HTAB (was 0002 patch) and get reviewed after pushing
0001, 0002, and 0003 patches. 0005 patch adds more regression tests
for the problem 0004 patch addresses.

Regards,

[1] 
https

Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Wed, 14 Apr 2021 at 04:46, Andres Freund  wrote:
>
> On 2021-04-13 14:25:23 -0400, Robert Haas wrote:
> > On Mon, Apr 12, 2021 at 11:06 PM Andres Freund  wrote:
> > You could identify every lock by a tranche ID + an array offset + a
> > "tranche instance ID". But where would you store the tranche instance
> > ID to make it readily accessible, other than in the lock itself?
> > Andres wasn't thrilled about using even 2 bytes to identify the
> > LWLock, so he'll probably like having more bytes in there for that
> > kind of thing even less.
>
> I still don't like the two bytes, fwiw ;). Especially because it's 4
> bytes due to padding right now.

Aha, did I hear you say "there are two free bytes for me to shove
something marginally useful and irrelevant into"?

(*grin*)

> I'd like to move the LWLock->waiters list to outside of the lwlock
> itself - at most TotalProcs LWLocks can be waited for, so we don't need
> millions of empty proclist_heads. That way we could also remove the
> proclist indirection - which shows up a fair bit in contended workloads.
>
> And if we had a separate "lwlocks being waited for" structure, we could
> also add more information to it if we wanted to...

Having the ability to observe LWLock waiters would be nice. But you're
right to constantly emphasise that LWLocks need to be very slim. We
don't want to turn them into near-heavyweight locks by saddling them
with overhead that's not strictly necessary. A simple pg_regress run
(with cassert, admittedly) takes 25,000,000 LWLocks and does 24,000
LWLock waits and 130,000 condvar waits. All in less than a minute.
OTOH, once someone's waiting we don't care nearly as much about
bookkeeping cost, it's the un-contested fast paths that're most
critical.

> - We could make it work if we restricted MAX_BACKENDS to be 2**14 - but
>   while I personally think that's a sane upper limit, I already had a
>   hard time getting consensus to lower the limit to 2^18-1.

16384 backends is totally fine in sane real world deployments. But
it'll probably upset marketing people when OtherDatabaseVendor starts
shouting that they support 14 million connections, and postgres only
has 16k. Sigh.

The real answer here in the long term probably needs to be decoupling
of executors from connection state inside postgres. But while we're on
that topic, how about we convert the entire codebase to Rust while
riding on a flying rainbow unicorn? We're stuck with the 1:1
connection to executor mapping for the foreseeable future.

> - Use a 64bit integer. Then we can easily fit MAX_BACKENDS lockers, as
>   well as an offset to one of MAX_BACKENDS "wait lists" into LWLock.

You know much more than me about the possible impacts of that on
layout and caching, but I gather that it's probably undesirable to
make LWLocks any bigger.

> I think it's quite useful for relatively simple things like analyzing
> the total amount of time spent in individual locks, without incuring
> much overhead when not doing so (for which you need to identify
> individual locks, otherwise your end - start time is going to be
> meaningless).

Yep.

That's why the removal of the lock offset is a bit frustrating,
because you can't identify an LWLock instance-wide by LWLock* due to
the possibility of different per-backend DSM base-address mappings.
Well, and ASLR on EXEC_BACKEND systems, but who cares about those?

The removal was for good reasons though. And it only affects LWLocks
in DSM, for everything else the LWLock* is good enough. If identifying
LWLocks in DSM ever matters enough to bother to solve that problem, we
can emit trace events on DSM mapping attach in each backend, and trace
tools can do the work to track which LWLocks are in DSM and convert
their addresses to a reference base address. Pg shouldn't have to pay
the price for that unless it's something a lot of people need.

> And, slightly more advanced, for analyzing what the stack
> was when the lock was released - which then allows you to see what work
> you're blocked on, something I found hard to figure out otherwise.
>
> I found that that's mostly quite doable with dynamic probes though.

Yeah, it is.

That's part of why my patchset here doesn't try to do a lot to LWLock
tracepoints - I didn't think it was necessary to add a lot.
The LWLock code is fairly stable, not usually something you have to
worry about in production unless you're debugging badly behaved
extensions, and usually somewhat probe-able with DWARF based dynamic
probes. However, the way the wait-loop and fast-path are in the same
function is a serious pain for dynamic probing; you can't tell the
difference between a fast-path acquire and an acquire after a wait
without using probes on function+offset or probing by source line.
Both those are fine for dev work but useless in tool/library scripts.

I almost wonder if we should test out moving the LWLock wait-loops out
of LWLockAcquire(), LWLockAcquireOrWait() and LWLockWaitForVar()
anyway, so the hot parts of the fun

Re: Replication slot stats misgivings

2021-04-13 Thread vignesh C
On Wed, Apr 14, 2021 at 7:52 AM Masahiko Sawada  wrote:
>
> On Tue, Apr 13, 2021 at 5:07 PM vignesh C  wrote:
> >
> > On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Apr 12, 2021 at 9:36 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 5:29 PM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 12, 2021 at 8:08 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 4:34 PM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, Apr 12, 2021 at 6:19 PM Amit Kapila 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, Apr 12, 2021 at 10:27 AM Masahiko Sawada 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Sat, Apr 10, 2021 at 9:53 PM Amit Kapila 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It seems Vignesh has changed patches based on the latest 
> > > > > > > > > > set of
> > > > > > > > > > comments so you might want to rebase.
> > > > > > > > >
> > > > > > > > > I've merged my patch into the v6 patch set Vignesh submitted.
> > > > > > > > >
> > > > > > > > > I've attached the updated version of the patches. I didn't 
> > > > > > > > > change
> > > > > > > > > anything in the patch that changes char[NAMEDATALEN] to 
> > > > > > > > > NameData (0001
> > > > > > > > > patch) and patches that add tests.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think we can push 0001. What do you think?
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > >
> > > > > > > > > In 0003 patch I reordered the
> > > > > > > > > output parameters of pg_stat_replication_slots; showing total 
> > > > > > > > > number
> > > > > > > > > of transactions and total bytes followed by statistics for 
> > > > > > > > > spilled and
> > > > > > > > > streamed transactions seems appropriate to me.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I am not sure about this because I think we might want to add 
> > > > > > > > some
> > > > > > > > info of stream/spill bytes in total_bytes description 
> > > > > > > > (something like
> > > > > > > > stream/spill bytes are not in addition to total_bytes).
> > > > >
> > > > > BTW doesn't it confuse users that stream/spill bytes are not in
> > > > > addition to total_bytes? User will need to do "total_bytes +
> > > > > spill/stream_bytes" to know the actual total amount of data sent to
> > > > > the decoding output plugin, is that right?
> > > > >
> > > >
> > > > No, total_bytes includes the spill/stream bytes. So, the user doesn't
> > > > need to do any calculation to compute total_bytes sent to the output
> > > > plugin.
> > >
> > > The following test for the latest v8 patch seems to show different.
> > > total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> > > something?
> > >
> > > postgres(1:85969)=# select pg_create_logical_replication_slot('s',
> > > 'test_decoding');
> > >  pg_create_logical_replication_slot
> > > -------------------------------------
> > >  (s,0/1884468)
> > > (1 row)
> > >
> > > postgres(1:85969)=# create table a (i int);
> > > CREATE TABLE
> > > postgres(1:85969)=# insert into a select generate_series(1, 10);
> > > INSERT 0 10
> > > postgres(1:85969)=# set logical_decoding_work_mem to 64;
> > > SET
> > > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> > >  slot_name | total_txns | total_bytes | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > > -----------+------------+-------------+------------+-------------+-------------+-------------+--------------+--------------+-------------
> > >  s         |          0 |           0 |          0 |           0 |           0 |           0 |            0 |            0 |
> > > (1 row)
> > >
> > > postgres(1:85969)=# select count(*) from
> > > pg_logical_slot_peek_changes('s', NULL, NULL);
> > >  count
> > > -------
> > >  14
> > > (1 row)
> > >
> > > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> > >  slot_name | total_txns | total_bytes | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > > -----------+------------+-------------+------------+-------------+-------------+-------------+--------------+--------------+-------------
> > >  s         |          2 |        1808 |          1 |         202 |        1320 |           0 |            0 |            0 |
> > > (1 row)
> > >
> >
> > > Thanks for identifying this issue. While spilling the transactions,
> > > the reorder buffer changes get released, so we cannot get the total
> > > size of spilled transactions from the reorder buffer size. I have
> > > fixed it by adding the spilled bytes to the total bytes for spilled
> > > transactions. The attached patch has the fix for this.
> > Thoughts?
>
> I've not looked at the patches yet but as Amit mentioned before[1],
> it's better to move 0002 patch to after 0004. That is, 0001 patch
> changes

Re: Truncate in synchronous logical replication failed

2021-04-13 Thread Japin Li


On Tue, 13 Apr 2021 at 21:54, osumi.takami...@fujitsu.com 
 wrote:
> On Monday, April 12, 2021 3:58 PM Amit Kapila  wrote:
>> On Mon, Apr 12, 2021 at 10:03 AM osumi.takami...@fujitsu.com
>>  wrote:
>> > but if we take a measure to fix the doc, we have to be careful for the
>> > description, because when we remove the primary keys of 'test' tables
>> > on the scenario in [1], we don't have this issue.
>> > It means TRUNCATE in synchronous logical replication is not always
>> > blocked.
>> >
>> 
>> The problem happens only when we try to fetch IDENTITY_KEY attributes
>> because pgoutput uses RelationGetIndexAttrBitmap() to get that information
>> which locks the required indexes. Now, because TRUNCATE has already
>> acquired an exclusive lock on the index, it seems to create a sort of 
>> deadlock
>> where the actual Truncate operation waits for logical replication of 
>> operation to
>> complete and logical replication waits for actual Truncate operation to 
>> finish.
>> 
>> Do we really need to use RelationGetIndexAttrBitmap() to build IDENTITY_KEY
>> attributes? During decoding, we don't even lock the main relation, we
>> just scan the system table and build that information using a historic
>> snapshot. Can't we do something similar here?
> I think we can build the IDENTITY_KEY attributes with NoLock
> instead of calling RelationGetIndexAttrBitmap().
>
> When we trace back the caller side of logicalrep_write_attrs(),
> doing the thing equivalent to RelationGetIndexAttrBitmap()
> for INDEX_ATTR_BITMAP_IDENTITY_KEY impacts only pgoutput_truncate.
>
> OTOH, I can't find codes similar to RelationGetIndexAttrBitmap()
> in pgoutput_* functions and in the file of relcache.c.
> Therefore, I'd like to discuss how to address the hang.
>
> My first idea is to extract some parts of RelationGetIndexAttrBitmap()
> only for INDEX_ATTR_BITMAP_IDENTITY_KEY and implement those
> either in a logicalrep_write_attrs() or as a new function.
> RelationGetIndexAttrBitmap() has 'restart' label for goto statement
> in order to ensure to return up-to-date attribute bitmaps, so
> I prefer having a new function when we choose this direction.
> Having that goto in logicalrep_write_attrs() makes it a little bit messy, I 
> felt.
>
> The other direction might be to extend RelationGetIndexAttrBitmap's function 
> definition
> to accept lockmode to give NoLock from logicalrep_write_attrs().
> But this change impacts several other callers, so it is not as good as
> the first direction above, I think.
>
> If someone has any better idea, please let me know.
>

I think the first idea is better than the second.  OTOH, can we release the
locks before SyncRepWaitForLSN(), since the changes have already been flushed
to the local WAL files?
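
For reference, a minimal sketch of the reported hang (assuming the table is
in a publication and the logical subscriber is listed in
synchronous_standby_names):

CREATE TABLE test (id int PRIMARY KEY);
-- The commit waits for the synchronous logical subscriber, but decoding
-- the TRUNCATE calls RelationGetIndexAttrBitmap(), which tries to lock
-- the primary key index that TRUNCATE itself still holds exclusively:
TRUNCATE test;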

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Wed, 14 Apr 2021 at 02:25, Robert Haas  wrote:

> So before the commit in question --
> 3761fe3c20bb040b15f0e8da58d824631da00caa -- T_ID() used to compute an
> offset for a lock within the tranche that was supposed to uniquely
> identify the lock. However, the whole idea of an array per tranche
> turns out to be broken by design.

Yeah, I understand that.

I'd really love it if a committer could add an explanatory comment or
two in the area though. I'm happy to draft a comment patch if anyone
agrees that my suggestion is sensible. The key things I needed to know when
studying the code were:

* A LWLock* is always part of a tranche, but locks within a given
tranche are not necessarily co-located in memory, cross-referenced, or
associated in any way.
* A LWLock tranche does not track how many LWLocks are in the tranche
or where they are in memory. It only groups up LWLocks into categories
and maps the tranche ID to a name.
* Not all LWLocks are part of the main LWLock array; others can be
embedded in shmem structs elsewhere, including in DSM segments.
* LWLocks in DSM segments may not have the same address between
different backends, because the DSM base address can vary, so a
LWLock* cannot be reliably compared between backends unless you know
it's in the main LWLock array or in static shmem.

Having that info in lwlock.c near the tranche management code or the
tranche and main lwlock arrays would've been very handy.


> You could try to identify locks by pointer addresses, but that's got
> security hazards and the addreses aren't portable across all the
> backends involved in the parallel query because of how DSM works, so
> it's not really that helpful in terms of matching stuff up.

What I'm doing now is identifying them by LWLock* across backends. I
keep track of DSM segment mappings in each backend inside the trace
script and I relocate LWLock* pointers known to be inside DSM segments
relative to a dummy base address so they're equal across backends.

It's a real pain, but it works. The main downside is that the trace
script has to observe the DSM attach; if it's started after a backend
has already attached the DSM segment, it has no idea the LWLock is in
a DSM segment or how to remap it. But that's not a serious issue.

> On a broader level, I agree that being able to find out what the
> system is doing is really important. But I'm also not entirely
> convinced that having really fine-grained information here to
> distinguish between one lock and another is the way to get there.

At the start of this thread I would've disagreed with you.

But yeah, you and Andres are right, because the costs outweigh the
benefits here.

I'm actually inclined to revise the patch I sent in order to *remove*
the LWLock name from the tracepoint argument. At least for the
fast-path tracepoints on start-of-acquire and end-of-acquire. I think
it's probably OK to report it in the lock wait tracepoints, which is
where it's most important to have anyway. So the tracepoint will
always report the LWLock* and tranche ID, but it won't report the
tranche name for the fast-path. I'll add trace events for tranche ID
registration, so trace tools can either remember the tranche ID->name
mappings from there, or capture them from lock wait events and
remember them.

Reasonable? That way we retain the most important trace functionality,
but we reduce the overheads.




Re: [PATCH] Identify LWLocks in tracepoints

2021-04-13 Thread Craig Ringer
On Wed, 14 Apr 2021 at 10:41, Craig Ringer
 wrote:
> On Wed, 14 Apr 2021 at 02:25, Robert Haas  wrote:
> > You could try to identify locks by pointer addresses, but that's got
> > security hazards and the addresses aren't portable across all the
> > backends involved in the parallel query because of how DSM works, so
> > it's not really that helpful in terms of matching stuff up.
>
> What I'm doing now is identifying them by LWLock* across backends. I
> keep track of DSM segment mappings in each backend inside the trace
> script and I relocate LWLock* pointers known to be inside DSM segments
> relative to a dummy base address so they're equal across backends.

BTW, one of the reasons I did this was to try to identify BDR and
pglogical code that blocks or sleeps while holding a LWLock. I got
stuck on that for other reasons, so it didn't go anywhere, but those
issues are now resolved so I should probably return to it at some
point.

It'd be a nice thing to be able to run on postgres itself too.




Re: Replication slot stats misgivings

2021-04-13 Thread Amit Kapila
On Wed, Apr 14, 2021 at 8:04 AM vignesh C  wrote:
>
> On Wed, Apr 14, 2021 at 7:52 AM Masahiko Sawada  wrote:
> >
> > I've not looked at the patches yet but as Amit mentioned before[1],
> > it's better to move 0002 patch to after 0004. That is, 0001 patch
> > changes data type to NameData, 0002 patch adds total_txn and
> > total_bytes, and 0003 patch adds regression tests. 0004 patch will be
> > the patch using HTAB (was 0002 patch) and get reviewed after pushing
> > 0001, 0002, and 0003 patches. 0005 patch adds more regression tests
> > for the problem 0004 patch addresses.
>
> I will make the change and post a patch for it.
> Currently we have kept total_txns and total_bytes at the beginning of
> pg_stat_replication_slots; I did not see any conclusion on this. I
> prefer them at the beginning.
> Thoughts?
>

I prefer those two fields after the spill and stream fields, as I
mentioned in one of the emails above.

-- 
With Regards,
Amit Kapila.




Re: ModifyTable overheads in generic plans

2021-04-13 Thread Amit Langote
On Wed, Apr 7, 2021 at 5:18 PM Amit Langote  wrote:
> On Wed, Apr 7, 2021 at 8:24 AM Tom Lane  wrote:
> > I also could not get excited about postponing initialization of RETURNING
> > or WITH CHECK OPTIONS expressions.  I grant that that can be helpful
> > when those features are used, but I doubt that RETURNING is used that
> > heavily, and WITH CHECK OPTIONS is surely next door to nonexistent
> > in performance-critical queries.  If the feature isn't used, the cost
> > of the existing code is about zero.  So I couldn't see that it was worth
> > the amount of code thrashing and risk of new bugs involved.
>
> Okay.
>
> > The bit you
> > noted about EXPLAIN missing a subplan is pretty scary in this connection;
> > I'm not at all sure that that's just cosmetic.
>
> Yeah, this and...

I looked into this and can't see how it would be anything more than
cosmetic as far as ModifyTable is concerned.

"EXPLAIN missing a subplan" here just means that
ModifyTableState.PlanState.subPlan is not set. Besides ExplainNode(),
only ExecReScan() looks at PlanState.subPlan, and that does not seem
relevant to ModifyTable, because it doesn't support rescanning.

I don't see any such problems with creating RETURNING projections
on-demand either.

> > (Having said that, I'm wondering if there are bugs in these cases for
> > cross-partition updates that target a previously-not-used partition.
> > So we might have things to fix anyway.)
>
> ...this would need to be looked at a bit more closely, which I'll try
> to do sometime later this week.

Given the above, I can't think of any undiscovered problems related to
WCO and RETURNING expression states in the cases where cross-partition
updates target partitions that need to be initialized by
ExecInitPartitionInfo().  Here is the result for the test case in
updatable_views.sql modified to use partitioning and cross-partition
updates:

CREATE TABLE base_tbl (a int) partition by range (a);
CREATE TABLE base_tbl1 PARTITION OF base_tbl FOR VALUES FROM (1) TO (6);
CREATE TABLE base_tbl2 PARTITION OF base_tbl FOR VALUES FROM (6) TO (11);
CREATE TABLE base_tbl3 PARTITION OF base_tbl FOR VALUES FROM (11) TO (15);
CREATE TABLE ref_tbl (a int PRIMARY KEY);
INSERT INTO ref_tbl SELECT * FROM generate_series(1,10);
CREATE VIEW rw_view1 AS
  SELECT * FROM base_tbl b
  WHERE EXISTS(SELECT 1 FROM ref_tbl r WHERE r.a = b.a)
  WITH CHECK OPTION;

INSERT INTO rw_view1 VALUES (1);
INSERT 0 1

INSERT INTO rw_view1 VALUES (11);
ERROR:  new row violates check option for view "rw_view1"
DETAIL:  Failing row contains (11).

-- Both are cross-partition updates where the target relation is
-- lazily initialized in ExecInitPartitionInfo(), along with the WCO
-- qual ExprState
UPDATE rw_view1 SET a = a + 5 WHERE a = 1;
UPDATE 1

UPDATE rw_view1 SET a = a + 5 WHERE a = 6;
ERROR:  new row violates check option for view "rw_view1"
DETAIL:  Failing row contains (11).

EXPLAIN (costs off) INSERT INTO rw_view1 VALUES (5);
      QUERY PLAN
----------------------
 Insert on base_tbl b
   ->  Result
(2 rows)

EXPLAIN (costs off) UPDATE rw_view1 SET a = a + 5 WHERE a = 1;
                   QUERY PLAN
-------------------------------------------------
 Update on base_tbl b
   Update on base_tbl1 b_1
   ->  Nested Loop
 ->  Index Scan using ref_tbl_pkey on ref_tbl r
   Index Cond: (a = 1)
 ->  Seq Scan on base_tbl1 b_1
   Filter: (a = 1)
(7 rows)

EXPLAIN (costs off) UPDATE rw_view1 SET a = a + 5 WHERE a = 6;
                   QUERY PLAN
-------------------------------------------------
 Update on base_tbl b
   Update on base_tbl2 b_1
   ->  Nested Loop
 ->  Index Scan using ref_tbl_pkey on ref_tbl r
   Index Cond: (a = 6)
 ->  Seq Scan on base_tbl2 b_1
   Filter: (a = 6)
(7 rows)

Patch attached.  I tested the performance benefit of doing this by
modifying the update query used in earlier benchmarks to have a
RETURNING * clause, getting the following TPS numbers:

-Mprepared (plan_cache_mode=force_generic_plan)

nparts   10cols   20cols   40cols

HEAD
64        10909     9067     7171
128        6903     5624     4161
256        3748     3056     2219
1024        953      738      427

Patched
64        13817    13395    12754
128        9271     9102     8279
256        5345     5207     5083
1024       1463     1443     1389

Also, I don't see much impact of checking if (node->returningLists) in
the per-result-rel initialization loop in the common cases where
there's no RETURNING.

--
Amit Langote
EDB: http://www.enterprisedb.com


0001-Initialize-WITH-CHECK-OPTIONS-and-RETURNING-expressi.patch
Description: Binary data


Re: Converting contrib SQL functions to new style

2021-04-13 Thread Tom Lane
Noah Misch  writes:
> On Tue, Apr 13, 2021 at 06:26:34PM -0400, Tom Lane wrote:
>> Attached are some draft patches to convert almost all of the
>> contrib modules' SQL functions to use SQL-standard function bodies.
>> The point of this is to remove the residual search_path security
>> hazards that we couldn't fix in commits 7eeb1d986 et al.  Since
>> a SQL-style function body is fully parsed at creation time,
>> its object references are not subject to capture by the run-time
>> search path.

> Are there any inexact matches in those function/operator calls?  Will that
> matter more or less than it does today?

I can't claim to have looked closely for inexact matches.  It should
matter less than today, since there's a hazard only during creation
(with a somewhat-controlled search path) and not during use.  But
that doesn't automatically eliminate the issue.
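
To illustrate the shape of the conversion (a made-up example, not one
of the actual contrib functions):

    -- old style: the body is a string, parsed at run time, so lower()
    -- and btrim() are looked up via the caller's search_path
    CREATE FUNCTION normalize_name(text) RETURNS text
    AS 'SELECT lower(btrim($1))'
    LANGUAGE SQL IMMUTABLE;

    -- new style: SQL-standard body, parsed at creation time, so the
    -- references are resolved once and can't be captured later
    CREATE FUNCTION normalize_name(text) RETURNS text
    LANGUAGE SQL IMMUTABLE
    RETURN lower(btrim($1));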

>> I'd like to propose squeezing these changes into v14, even though
>> we're past feature freeze.  Reason one is that this is less a
>> new feature than a security fix; reason two is that this provides
>> some non-artificial test coverage for the SQL-function-body feature.

> Dogfooding like this is good.  What about the SQL-language functions that
> initdb creates?

Hadn't thought about those, but converting them seems like a good idea.

regards, tom lane




Re: TRUNCATE on foreign table

2021-04-13 Thread Bharath Rupireddy
On Tue, Apr 13, 2021 at 8:30 PM Fujii Masao  wrote:
> On 2021/04/13 23:25, Kohei KaiGai wrote:
> > 2021年4月13日(火) 21:03 Bharath Rupireddy 
> > :
> >> Yeah, ONLY clause is not pushed to the remote server in case of SELECT
> >> commands. This is also true for DELETE and UPDATE commands on foreign
> >> tables.
>
> This sounds like a reasonable reason why ONLY should be ignored in TRUNCATE on
> foreign tables, for now. If there is an existing rule about how to treat the
> ONLY clause for foreign tables, TRUNCATE should basically follow it at this
> stage. Maybe we can change the rule, but that's an item for v15 or later?
>
> >> I'm not sure if it wasn't thought necessary or if there is an
> >> issue to push it or I may be missing something here.
>
> I could not find the past discussion about foreign tables and ONLY clause.
> I guess that ONLY is ignored in SELECT on foreign tables case because ONLY
> is interpreted outside the executor and it's not easy to change the executor
> so that ONLY is passed to FDW. Maybe..
>
>
> >> I think we can
> >> start a separate thread to see other hackers' opinions on this.
> >>
> >> I'm not sure whether all the clauses that are possible for
> >> SELECT/UPDATE/DELETE/INSERT with local tables are pushed to the remote
> >> server by postgres_fdw.
> >>
> >> Well, now foreign TRUNCATE pushes the ONLY clause to the remote server
> >> which is inconsistent when compared to SELECT/UPDATE/DELETE commands.
> >> If we were to keep it consistent across all foreign commands that
> >> ONLY clause is not pushed to remote server, then we can restrict for
> >> TRUNCATE too and even if "TRUNCATE ONLY foreign_tbl" is specified,
> >> just pass "TRUNCATE foreign_tbl" to remote server. Having said that, I
> >> don't see any real problem in pushing the ONLY clause, at least in
> >> case of TRUNCATE.
> >>
> > If ONLY-clause would be pushed down to the remote query of postgres_fdw,
> > what does the foreign-table represent in the local system?
> >
> > In my understanding, a local foreign table by postgres_fdw is a
> > representation of
> > entire tree of the remote parent table and its children.
>
> If so, I'm still wondering why CASCADE/RESTRICT (i.e., DropBehavior) needs to
> be passed to FDW. IOW, if a foreign table is an abstraction of an external
> data source, ISTM that postgres_fdw should always issue TRUNCATE with
> CASCADE. Why do we need to allow RESTRICT to be specified for a foreign table
> even though it's an abstraction of an external data source?

IMHO, we can push all the TRUNCATE options (ONLY, RESTRICT, CASCADE,
RESTART/CONTINUE IDENTITY) to the remote server, because doing so
poses no major implementation challenge, unlike pushing some clauses
of SELECT/UPDATE/DELETE, and we already do this on master. It doesn't
look good and may confuse users if we push some options and withhold
others. We should have an explicit note in the documentation saying
that we push all these options to the remote server, and we can leave
it to the user to write TRUNCATE for foreign tables with the
appropriate options. If somebody complains about a problem they face
with this behavior, we can revisit it. This is my opinion; others may
disagree.
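
As a concrete illustration (table names made up; the exact deparsed
text comes from the patch's deparsing code), a local statement such as

    TRUNCATE ONLY ft RESTART IDENTITY CASCADE;

would then be forwarded to the remote server as the equivalent

    TRUNCATE ONLY remote_tbl RESTART IDENTITY CASCADE;

against the table the foreign table points at.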

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: TRUNCATE on foreign table

2021-04-13 Thread Kohei KaiGai
2021年4月14日(水) 0:00 Fujii Masao :
>
> On 2021/04/13 23:25, Kohei KaiGai wrote:
> > 2021年4月13日(火) 21:03 Bharath Rupireddy 
> > :
> >> Yeah, ONLY clause is not pushed to the remote server in case of SELECT
> >> commands. This is also true for DELETE and UPDATE commands on foreign
> >> tables.
>
> This sounds like a reasonable reason why ONLY should be ignored in TRUNCATE on
> foreign tables, for now. If there is an existing rule about how to treat the
> ONLY clause for foreign tables, TRUNCATE should basically follow it at this
> stage. Maybe we can change the rule, but that's an item for v15 or later?
>
>
> >> I'm not sure if it wasn't thought necessary or if there is an
> >> issue to push it or I may be missing something here.
>
> I could not find the past discussion about foreign tables and ONLY clause.
> I guess that ONLY is ignored in SELECT on foreign tables case because ONLY
> is interpreted outside the executor and it's not easy to change the executor
> so that ONLY is passed to FDW. Maybe..
>
>
> >> I think we can
> >> start a separate thread to see other hackers' opinions on this.
> >>
> >> I'm not sure whether all the clauses that are possible for
> >> SELECT/UPDATE/DELETE/INSERT with local tables are pushed to the remote
> >> server by postgres_fdw.
> >>
> >> Well, now foreign TRUNCATE pushes the ONLY clause to the remote server
> >> which is inconsistent when compared to SELECT/UPDATE/DELETE commands.
> >> If we were to keep it consistent across all foreign commands that
> >> ONLY clause is not pushed to remote server, then we can restrict for
> >> TRUNCATE too and even if "TRUNCATE ONLY foreign_tbl" is specified,
> >> just pass "TRUNCATE foreign_tbl" to remote server. Having said that, I
> >> don't see any real problem in pushing the ONLY clause, at least in
> >> case of TRUNCATE.
> >>
> > If ONLY-clause would be pushed down to the remote query of postgres_fdw,
> > what does the foreign-table represent in the local system?
> >
> > In my understanding, a local foreign table by postgres_fdw is a
> > representation of
> > entire tree of the remote parent table and its children.
>
> If so, I'm still wondering why CASCADE/RESTRICT (i.e., DropBehavior) needs to
> be passed to FDW. IOW, if a foreign table is an abstraction of an external
> data source, ISTM that postgres_fdw should always issue TRUNCATE with
> CASCADE. Why do we need to allow RESTRICT to be specified for a foreign table
> even though it's an abstraction of an external data source?
>
Please assume that the internal heap data is managed by the PostgreSQL
core, and the external data source is managed by postgres_fdw (or
another FDW driver). The TRUNCATE command requires these object
managers to eliminate the data on behalf of the foreign tables picked up.

Even though an object manager tries to eliminate the managed data, it
may be restricted for some reason; FK constraints, in the case of
PostgreSQL's internal data. In this case, the CASCADE/RESTRICT option
tells the object manager how to handle the target data.

The ONLY clause controls whose data shall be eliminated.
On the other hand, CASCADE/RESTRICT and CONTINUE/RESTART control
how data shall be eliminated. That is a fundamental difference.
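
The distinction is easy to see with plain inheritance (a local,
made-up example):

    CREATE TABLE parent (a int);
    CREATE TABLE child () INHERITS (parent);
    INSERT INTO parent VALUES (1);
    INSERT INTO child VALUES (2);
    TRUNCATE ONLY parent;   -- ONLY picks the tables: child keeps its row
    TRUNCATE parent;        -- without ONLY, parent and child are both emptied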

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: TRUNCATE on foreign table

2021-04-13 Thread Kyotaro Horiguchi
At Wed, 14 Apr 2021 13:17:55 +0900, Kohei KaiGai  wrote in 
> 2021年4月14日(水) 0:00 Fujii Masao :
> >
> > On 2021/04/13 23:25, Kohei KaiGai wrote:
> > > 2021年4月13日(火) 21:03 Bharath Rupireddy 
> > > :
> > >> Yeah, ONLY clause is not pushed to the remote server in case of SELECT
> > >> commands. This is also true for DELETE and UPDATE commands on foreign
> > >> tables.
> >
> > This sounds like a reasonable reason why ONLY should be ignored in TRUNCATE on
> > foreign tables, for now. If there is an existing rule about how to treat the
> > ONLY clause for foreign tables, TRUNCATE should basically follow it at this
> > stage. Maybe we can change the rule, but that's an item for v15 or later?
> >
> >
> > >> I'm not sure if it wasn't thought necessary or if there is an
> > >> issue to push it or I may be missing something here.
> >
> > I could not find the past discussion about foreign tables and ONLY clause.
> > I guess that ONLY is ignored in SELECT on foreign tables case because ONLY
> > is interpreted outside the executor and it's not easy to change the executor
> > so that ONLY is passed to FDW. Maybe..
> >
> >
> > >> I think we can
> > >> start a separate thread to see other hackers' opinions on this.
> > >>
> > >> I'm not sure whether all the clauses that are possible for
> > >> SELECT/UPDATE/DELETE/INSERT with local tables are pushed to the remote
> > >> server by postgres_fdw.
> > >>
> > >> Well, now foreign TRUNCATE pushes the ONLY clause to the remote server
> > >> which is inconsistent when compared to SELECT/UPDATE/DELETE commands.
> > >> If we were to keep it consistent across all foreign commands that
> > >> ONLY clause is not pushed to remote server, then we can restrict for
> > >> TRUNCATE too and even if "TRUNCATE ONLY foreign_tbl" is specified,
> > >> just pass "TRUNCATE foreign_tbl" to remote server. Having said that, I
> > >> don't see any real problem in pushing the ONLY clause, at least in
> > >> case of TRUNCATE.
> > >>
> > > If ONLY-clause would be pushed down to the remote query of postgres_fdw,
> > > what does the foreign-table represent in the local system?
> > >
> > > In my understanding, a local foreign table by postgres_fdw is a
> > > representation of
> > > entire tree of the remote parent table and its children.
> >
> > If so, I'm still wondering why CASCADE/RESTRICT (i.e., DropBehavior) needs to
> > be passed to FDW. IOW, if a foreign table is an abstraction of an external
> > data source, ISTM that postgres_fdw should always issue TRUNCATE with
> > CASCADE. Why do we need to allow RESTRICT to be specified for a foreign table
> > even though it's an abstraction of an external data source?
> >
> Please assume that the internal heap data is managed by the PostgreSQL
> core, and the external data source is managed by postgres_fdw (or
> another FDW driver). The TRUNCATE command requires these object
> managers to eliminate the data on behalf of the foreign tables picked up.
> 
> Even though an object manager tries to eliminate the managed data, it
> may be restricted for some reason; FK constraints, in the case of
> PostgreSQL's internal data. In this case, the CASCADE/RESTRICT option
> tells the object manager how to handle the target data.
> 
> The ONLY clause controls whose data shall be eliminated.
> On the other hand, CASCADE/RESTRICT and CONTINUE/RESTART control
> how data shall be eliminated. That is a fundamental difference.

I object to unconditionally pushing ONLY to the remote. As KaiGai-san
said, it works in an apparently wrong way when a user wants to truncate
only the specified foreign table in an inheritance tree, and there is
no way to avoid that behavior.

I also don't think it is right to push down CASCADE/RESTRICT. The
options are meant to propagate truncation to *local* referrer tables
from the *foreign* table, not to the remote referrer tables from the
original table on the remote side. If a user wants to allow that
behavior, it should be specified by a foreign table option. (That is
bothersome when someone wants to specify the behavior on the fly.)

alter foreign table ft1 options (add truncate_cascade 'true');

Also, CONTINUE/RESTART IDENTITY should not work, since foreign tables
don't have an identity sequence. However, we might be able to push
these options down, since they affect only the target table.

I would accept that behavior if the command were "TRUNCATE FOREIGN
TABLE", which would explicitly target a foreign table. But I'm not
sure such syntax can be added in a reasonable way.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] force_parallel_mode and GUC categories

2021-04-13 Thread Michael Paquier
On Tue, Apr 13, 2021 at 10:12:35AM -0400, Tom Lane wrote:
> The following parameters are intended for developer testing, and
> should never be enabled for production work.  However, some of
> them can be used to assist with the recovery of severely
> damaged databases.

Okay, that's fine by me.
--
Michael


signature.asc
Description: PGP signature


Re: Performance Evaluation of Result Cache by using TPC-DS

2021-04-13 Thread Yuya Watari
Hello David,

Thank you for your reply.

> Can you share if these times were to run EXPLAIN ANALYZE or if they
> were just the queries being executed normally?

These times were from running EXPLAIN ANALYZE. I executed each query
twice, and the **average** execution time is what was shown in the
table in my last e-mail. Therefore, the results in that table do not
simply match those in the attached files. I'm sorry for the
insufficient explanation.

> It would be really great if you could show the EXPLAIN (ANALYZE,
> TIMING OFF) for query 62.   There's a chance that the slowdown comes
> from the additional EXPLAIN ANALYZE timing overhead with the Result
> Cache version.

I ran query 62 with EXPLAIN (ANALYZE, TIMING OFF) and also normally,
and attached those execution results to this e-mail. This time, I
executed each query only once (not twice). The results are as follows.

Method                        | With result cache (s) | Without result cache (s) | Speedup ratio
EXPLAIN (ANALYZE, TIMING ON)  | 67.161                | 59.615                   | -12.66%
EXPLAIN (ANALYZE, TIMING OFF) | 66.142                | 60.660                   | -9.04%
Normal                        | 66.611                | 60.955                   | -9.28%

Although there is variation in the execution times, the speedup ratio
is around -10%; that is, the result cache causes a roughly 10%
regression in query 62. The overhead of EXPLAIN ANALYZE and TIMING ON
does not seem to be significant.
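
For reproduction, the two configurations can be compared by toggling
the planner GUC (name as of the v14 development cycle; substitute the
actual TPC-DS query for the trivial one below):

    SET enable_resultcache = off;              -- plan without Result Cache
    EXPLAIN (ANALYZE, TIMING OFF) SELECT 1;    -- run query 62 here
    SET enable_resultcache = on;               -- back to the default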

Best regards,
Yuya Watari

On Tue, Apr 13, 2021 at 7:13 PM David Rowley  wrote:
>
> On Tue, 13 Apr 2021 at 21:29, Yuya Watari  wrote:
> > I used the TPC-DS scale factor 100 in the evaluation. I executed all
> > of the 99 queries in the TPC-DS, and the result cache worked in the 21
> > queries of them. However, some queries took too much time, so I
> > skipped their execution. I set work_mem to 256MB, and
> > max_parallel_workers_per_gather to 0.
>
> Many thanks for testing this.
>
> > As you can see from these results, many queries have a negative
> > speedup ratio, which means that there are negative impacts on the
> > query performance. In query 62, the execution time increased by
> > 11.36%. I guess these regressions are due to the misestimation of the
> > cost in the planner. I attached the execution plan of query 62.
>
> Can you share if these times were to run EXPLAIN ANALYZE or if they
> were just the queries being executed normally?
>
> The times in the two files you attached do look very similar to the
> times in your table, so I suspect either TIMING ON is not that high an
> overhead on your machine, or the results are that of EXPLAIN ANALYZE.
>
> It would be really great if you could show the EXPLAIN (ANALYZE,
> TIMING OFF) for query 62.   There's a chance that the slowdown comes
> from the additional EXPLAIN ANALYZE timing overhead with the Result
> Cache version.
>
> > The result cache is currently enabled by default. However, if this
> > kind of performance regression is common, we have to change its
> > default behavior.
>
> Yes, the feedback we get during the beta period will help drive that
> decision or if the costing needs to be adjusted.
>
> David
SET
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Limit  (cost=2816266.54..2816267.79 rows=100 width=110) (actual rows=100 loops=1)
   ->  Sort  (cost=2816266.54..2816267.38 rows=336 width=110) (actual rows=100 loops=1)
         Sort Key: (substr((warehouse.w_warehouse_name)::text, 1, 20)), ship_mode.sm_type, web_site.web_name
         Sort Method: top-N heapsort  Memory: 49kB
         ->  HashAggregate  (cost=2816248.24..2816252.44 rows=336 width=110) (actual rows=360 loops=1)
               Group Key: substr((warehouse.w_warehouse_name)::text, 1, 20), ship_mode.sm_type, web_site.web_name
               Batches: 1  Memory Usage: 157kB
               ->  Hash Join  (cost=2510.74..2794150.54 rows=368295 width=78) (actual rows=14336926 loops=1)
                     Hash Cond: (web_sales.ws_ship_mode_sk = ship_mode.sm_ship_mode_sk)
                     ->  Hash Join  (cost=2509.29..2792056.55 rows=368356 width=36) (actual rows=14337390 loops=1)
                           Hash Cond: (web_sales.ws_web_site_sk = web_site.web_site_sk)
                           ->  Hash Join  (cost=2506.75..2790916.14 rows=368430 width=33) (actual rows=14338265 loops=1)
                                 Hash Cond: (web_sales.ws_warehouse_sk = warehouse.w_warehouse_sk)
                                 ->  Hash Join  (cost=2505.41..2789674.19 rows=368516 width=20) (actual rows=14340028 loops=1)
                                       Hash Cond: (web_sales.ws_ship_date_sk = date_dim.d_date_sk)
                                       ->  Seq Scan on web_sales  (cost=0.00..2598153.08 rows=72001808 width=20) (actual rows=72001237 loops=1)

Re: Simplify backend terminate and wait logic in postgres_fdw test

2021-04-13 Thread Michael Paquier
On Tue, Apr 13, 2021 at 04:39:58PM +0900, Michael Paquier wrote:
> Looks fine to me.  Let's wait a bit first to see if Fujii-san has any
> objections to this cleanup as that's his code originally, from
> 32a9c0bd.

And hearing nothing, done.  The tests of postgres_fdw are getting much
faster for me now, from basically 6s to 4s (actually that's roughly
1.8s of gain as pg_wait_until_termination waits at least 100ms,
twice), so that's a nice gain.
--
Michael


signature.asc
Description: PGP signature


jsonb subscripting assignment performance

2021-04-13 Thread Joel Jacobson
Hi,

commit 676887a3 added support for jsonb subscripting.

Many thanks for working on this. I really like the improved syntax.

I was also hoping for some performance benefits,
but my testing shows that

   jsonb_value['existing_key'] = new_value;

takes just as long as

   jsonb_value := jsonb_set(jsonb_value, ARRAY['existing_key'], new_value);

which is a bit surprising to me. Shouldn't subscripting be a lot faster, since 
it could modify the existing data structure in-place? What am I missing here?

I came to think of this new functionality when trying to optimize some
PL/pgSQL code where the bottleneck turned out to be lots of calls to
jsonb_set() for large jsonb objects.

Here is the output from attached bench:

n=1
00:00:00.002628 jsonb := jsonb_set(jsonb, ARRAY[existing key], value);
00:00:00.002778 jsonb := jsonb_set(jsonb, ARRAY[new key], value);
00:00:00.002332 jsonb[existing key] := value;
00:00:00.002794 jsonb[new key] := value;
n=10
00:00:00.042843 jsonb := jsonb_set(jsonb, ARRAY[existing key], value);
00:00:00.046515 jsonb := jsonb_set(jsonb, ARRAY[new key], value);
00:00:00.044974 jsonb[existing key] := value;
00:00:00.075429 jsonb[new key] := value;
n=100
00:00:00.420808 jsonb := jsonb_set(jsonb, ARRAY[existing key], value);
00:00:00.449622 jsonb := jsonb_set(jsonb, ARRAY[new key], value);
00:00:00.31834 jsonb[existing key] := value;
00:00:00.527904 jsonb[new key] := value;
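
For concreteness, the kind of loop being timed looks like this (a
trimmed sketch, not the attached script itself; key name made up):

    DO $$
    DECLARE
        j jsonb := '{"existing_key": 0}';
    BEGIN
        FOR i IN 1..100000 LOOP
            j['existing_key'] := to_jsonb(i);  -- subscripting assignment
            -- versus: j := jsonb_set(j, ARRAY['existing_key'], to_jsonb(i));
        END LOOP;
    END $$;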

Many thanks for clarifying.

Best regards,

Joel

jsonb-bench.sql
Description: Binary data


Re: Replication slot stats misgivings

2021-04-13 Thread Amit Kapila
On Tue, Apr 13, 2021 at 1:37 PM vignesh C  wrote:
>
> On Mon, Apr 12, 2021 at 7:03 PM Masahiko Sawada  wrote:
> >
> >
> > The following test for the latest v8 patch seems to show different.
> > total_bytes is 1808 whereas spill_bytes is 1320. Am I missing
> > something?
> >
> > postgres(1:85969)=# select pg_create_logical_replication_slot('s',
> > 'test_decoding');
> >  pg_create_logical_replication_slot
> > -------------------------------------
> >  (s,0/1884468)
> > (1 row)
> >
> > postgres(1:85969)=# create table a (i int);
> > CREATE TABLE
> > postgres(1:85969)=# insert into a select generate_series(1, 10);
> > INSERT 0 10
> > postgres(1:85969)=# set logical_decoding_work_mem to 64;
> > SET
> > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> >  slot_name | total_txns | total_bytes | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > -----------+------------+-------------+------------+-------------+-------------+-------------+--------------+--------------+-------------
> >  s         |          0 |           0 |          0 |           0 |           0 |           0 |            0 |            0 |
> > (1 row)
> >
> > postgres(1:85969)=# select count(*) from
> > pg_logical_slot_peek_changes('s', NULL, NULL);
> >  count
> > 
> >  14
> > (1 row)
> >
> > postgres(1:85969)=# select * from pg_stat_replication_slots ;
> >  slot_name | total_txns | total_bytes | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | stats_reset
> > -----------+------------+-------------+------------+-------------+-------------+-------------+--------------+--------------+-------------
> >  s         |          2 |        1808 |          1 |         202 |        1320 |           0 |            0 |            0 |
> > (1 row)
> >
>
> Thanks for identifying this issue. While spilling the transactions,
> the reorder buffer changes get released, so we will not be able to get
> the total size for spilled transactions from the reorder buffer size.
> I have fixed it by adding the spilled bytes to the total bytes in the
> case of spilled transactions. The attached patch has the fix for this.
> Thoughts?
>

I am not sure that is the best way to fix it, because sometimes we
clear the serialized flag, in which case it might not give the correct
answer. Another way to fix it could be that, before we try to restore a
new set of changes, we update the totalBytes counter. See the attached
patch, which applies atop your v6-0002-* patch.

-- 
With Regards,
Amit Kapila.


fix_spilled_stats_1.patch
Description: Binary data


Re: [PATCH] force_parallel_mode and GUC categories

2021-04-13 Thread Michael Paquier
On Tue, Apr 13, 2021 at 07:31:39AM -0500, Justin Pryzby wrote:
> Good point.

Thanks.  I have used the wording that Tom has proposed upthread, added
one GUC_NOT_IN_SAMPLE that you forgot, and applied the
force_parallel_mode patch.
--
Michael


signature.asc
Description: PGP signature