Attach to shared memory after fork()

2021-04-27 Thread 邱宇航(烛远)
Fork is an expensive operation[1]. The major cost is copying the
memory-management state (VMAs, PTEs, ...).

ARM is especially weak on fork, because it invalidates TLB entries one by one,
and this is an expensive operation[2]. We can easily hit 100% CPU on an ARM
machine. We also see the fork problem on x86, but it is not as serious as on ARM.

We can mitigate this by enabling huge pages (and 2MB pages don't help us on
ARM, since we run with a huge shared buffer pool), but we still think it is a
problem.

So I propose to remove shared buffers from the postmaster and shmat() them after
fork. Not all of them: we would still keep the necessary shared memory segments
mapped in the postmaster. Or maybe we just need to give up fork(), as on Windows?

Any good ideas about this?
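
For illustration, here is a minimal standalone sketch of the idea (this is not
a patch against PostgreSQL, just a toy program: the parent only creates the
System V segment, and each child maps it with shmat() after fork, so the
parent's page tables never contain the large mapping that fork() would
otherwise have to copy):

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	size_t		size = 64 * 1024 * 1024;	/* stand-in for shared_buffers */
	int			shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
	pid_t		pid;

	if (shmid < 0)
	{
		perror("shmget");
		return 1;
	}

	pid = fork();
	if (pid == 0)
	{
		/* child: attach only now, after fork() */
		void	   *base = shmat(shmid, NULL, 0);

		if (base == (void *) -1)
		{
			perror("shmat");
			_exit(1);
		}
		memset(base, 0, size);		/* touch the segment */
		shmdt(base);
		_exit(0);
	}

	waitpid(pid, NULL, 0);
	shmctl(shmid, IPC_RMID, NULL);	/* clean up the segment */
	return 0;
}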

[1]. https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/
[2]. https://developer.arm.com/documentation/ddi0487/latest/
D5.10 TLB maintenance requirements and the TLB maintenance instructions:
Break-before-make sequence on changing from an old translation table entry to a
new translation table entry requires the following steps:
1. Replace the old translation table entry with an invalid entry, and execute a 
DSB instruction.
2. Invalidate the translation table entry with a broadcast TLB invalidation 
instruction, and execute a DSB instruction to ensure the completion of that 
invalidation.
3. Write the new translation table entry, and execute a DSB instruction to 
ensure that the new entry is visible.

Regards.
Yuhang Qiu.

Re: [BUG] "FailedAssertion" reported when streaming in logical replication

2021-04-27 Thread Amit Kapila
On Tue, Apr 27, 2021 at 12:22 PM Dilip Kumar  wrote:
>
> On Tue, Apr 27, 2021 at 12:05 PM Amit Kapila  wrote:
> > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is
> > > > ensured in ReorderBufferSetBaseSnapshot that we always assign
> > > > base_snapshot to a top-level transaction if the current is a known
> > > > subxact. I think that will be true because we always form xid-subxid
> > > > relation before processing each record in
> > > > LogicalDecodingProcessRecord.
> > >
> > > Yeah, we can do that, but here we are only interested in top
> > > transactions and this list will give us sub-transaction as well so we
> > > will have to skip it in the below if condition.
> > >
> >
> > I am not so sure about this point. I have explained above why I think
> > there won't be any subtransactions in this. Can you please let me know
> > what am I missing if anything?
>
> Got your point, yeah this will only have top transactions so we can
> use this.  I will change this in the next patch.  In fact we can put
> an assert that it should not be an sub transaction?
>

Right. It is good to have an assert.

-- 
With Regards,
Amit Kapila.




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-27 Thread Amit Langote
Thanks for the updated patch.  I've been reading it, but I noticed a
bug in 8aba9322511f, which I thought you'd want to know to make a note
of when committing this one.

So we decided in 8aba9322511f that it is okay to make the memory
context in which a transient partdesc is allocated a child of
PortalContext so that it disappears when the portal does.  But
PortalContext is apparently NULL when the planner runs, at least in
the "simple" query protocol, so any transient partdescs built by the
planner would effectively leak.  Not good.

With this patch, partdesc_nodetached is no longer transient, so the
problem doesn't exist.

I will write more about the updated patch in a bit...

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: SQL-standard function body

2021-04-27 Thread Peter Eisentraut

On 18.04.21 23:33, Tom Lane wrote:

... BTW, a dependency loop is also possible without using this feature,
by abusing default-value expressions:

create function f1(x int, y int) returns int language sql
as 'select $1 + $2';
create function f2(x int, y int default f1(1,2)) returns int language sql
as 'select $1 + $2';
create or replace function f1(x int, y int default f2(11,12)) returns int 
language sql
as 'select $1 + $2';

The actual use-case for that seems pretty thin, so we never bothered
to worry about it before.  But if we're going to build loop-breaking
logic to handle function body dependencies, it should deal with this
too.  I think that all that's required is for the initial dummy
function declaration to omit defaults as well as providing a dummy
body.


I have studied this a bit.  I'm not sure where the dummy function 
declaration should be created.  The current dependency-breaking logic in 
pg_dump_sort.c doesn't appear to support injecting additional objects 
into the set of dumpable objects.  So we would need to create it perhaps 
in dumpFunc() and then later set flags that indicate whether it will be 
required.


Another option would be that we disallow this at creation time.  It 
seems we could detect dependency loops using findDependentObjects(), so 
this might not be so difficult.  The use case for recursive SQL 
functions is probably low, at least with the current limited set of 
control flow options in SQL.  (And you can always use a quoted body to 
work around it.)






Re: [PATCH] Re: pg_identify_object_as_address() doesn't support pg_event_trigger oids

2021-04-27 Thread Michael Paquier
On Tue, Apr 27, 2021 at 07:16:25AM +0200, Joel Jacobson wrote:
> I've added a test at the end of event_trigger.sql,
> reusing the three event triggers already in existence,
> just before they are dropped.

Cool, thanks.  I have been looking at it and I'd still like to
cross-check the output data of pg_get_object_address() to see if
pg_identify_object() remains consistent through it.  See for example
the attached that uses a trick based on LATERAL, a bit different than
what's done in object_address.sql but that gives the same amount of
coverage (I could also use two ROW()s and an equality, but well..).
--
Michael
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 6d88b690d8..ad9740098e 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -5607,10 +5607,7 @@ getObjectIdentityParts(const ObjectAddress *object,
 			{
 HeapTuple	tup;
 Form_pg_event_trigger trigForm;
-
-/* no objname support here */
-if (objname)
-	*objname = NIL;
+char	   *evtname;
 
 tup = SearchSysCache1(EVENTTRIGGEROID,
 	  ObjectIdGetDatum(object->objectId));
@@ -5622,8 +5619,10 @@ getObjectIdentityParts(const ObjectAddress *object,
 	break;
 }
 trigForm = (Form_pg_event_trigger) GETSTRUCT(tup);
-appendStringInfoString(&buffer,
-	   quote_identifier(NameStr(trigForm->evtname)));
+evtname = NameStr(trigForm->evtname);
+appendStringInfoString(&buffer, quote_identifier(evtname));
+if (objname)
+	*objname = list_make1(evtname);
 ReleaseSysCache(tup);
 break;
 			}
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index bdd0ffcdaf..1aca794d6d 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -533,6 +533,23 @@ DROP POLICY p2 ON event_trigger_test;
 NOTICE:  DROP POLICY - ddl_command_start
 NOTICE:  DROP POLICY - sql_drop
 NOTICE:  DROP POLICY - ddl_command_end
+-- Check the object address of those event triggers
+SELECT
+e.evtname,
+pg_describe_object('pg_event_trigger'::regclass, e.oid, 0) as descr,
+b.type, b.object_names, b.object_args,
+pg_identify_object(a.classid, a.objid, a.objsubid) as ident
+  FROM pg_event_trigger as e,
+LATERAL pg_identify_object_as_address('pg_event_trigger'::regclass, e.oid, 0) as b,
+LATERAL pg_get_object_address(b.type, b.object_names, b.object_args) as a
+  ORDER BY e.evtname;
+  evtname  |  descr  | type  |object_names | object_args | ident  
+---+-+---+-+-+
+ end_rls_command   | event trigger end_rls_command   | event trigger | {end_rls_command}   | {}  | ("event trigger",,end_rls_command,end_rls_command)
+ sql_drop_command  | event trigger sql_drop_command  | event trigger | {sql_drop_command}  | {}  | ("event trigger",,sql_drop_command,sql_drop_command)
+ start_rls_command | event trigger start_rls_command | event trigger | {start_rls_command} | {}  | ("event trigger",,start_rls_command,start_rls_command)
+(3 rows)
+
 DROP EVENT TRIGGER start_rls_command;
 DROP EVENT TRIGGER end_rls_command;
 DROP EVENT TRIGGER sql_drop_command;
diff --git a/src/test/regress/sql/event_trigger.sql b/src/test/regress/sql/event_trigger.sql
index 18b2a267cb..f7347ffbb1 100644
--- a/src/test/regress/sql/event_trigger.sql
+++ b/src/test/regress/sql/event_trigger.sql
@@ -426,6 +426,17 @@ ALTER POLICY p1 ON event_trigger_test USING (TRUE);
 ALTER POLICY p1 ON event_trigger_test RENAME TO p2;
 DROP POLICY p2 ON event_trigger_test;
 
+-- Check the object address of those event triggers
+SELECT
+e.evtname,
+pg_describe_object('pg_event_trigger'::regclass, e.oid, 0) as descr,
+b.type, b.object_names, b.object_args,
+pg_identify_object(a.classid, a.objid, a.objsubid) as ident
+  FROM pg_event_trigger as e,
+LATERAL pg_identify_object_as_address('pg_event_trigger'::regclass, e.oid, 0) as b,
+LATERAL pg_get_object_address(b.type, b.object_names, b.object_args) as a
+  ORDER BY e.evtname;
+
 DROP EVENT TRIGGER start_rls_command;
 DROP EVENT TRIGGER end_rls_command;
 DROP EVENT TRIGGER sql_drop_command;


signature.asc
Description: PGP signature


Re: [HACKERS] logical decoding of two-phase transactions

2021-04-27 Thread vignesh C
On Wed, Apr 21, 2021 at 12:13 PM Peter Smith  wrote:
>
> On Tue, Apr 20, 2021 at 3:45 PM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v73`*
> >
> > Differences from v72* are:
> >
> > * Rebased to HEAD @ today (required because v72-0001 no longer applied 
> > cleanly)
> >
> > * Minor documentation correction for protocol messages for Commit Prepared 
> > ('K')
> >
> > * Non-functional code tidy (mostly proto.c) to reduce overloading
> > different meanings to same member names for prepare/commit times.
>
>
> Please find attached a re-posting of patch set v73*
>
> This is the same as yesterday's v73 but with a contrib module compile
> error fixed.

A few comments on the
v73-0002-Add-prepare-API-support-for-streaming-transactio.patch patch:
1) There are slight differences in the error messages in the case of ALTER
SUBSCRIPTION ... DROP PUBLICATION; we could keep the error messages
consistent:
postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH
(refresh = false, copy_data=true, two_phase=true);
ERROR:  unrecognized subscription parameter: "copy_data"
postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH
(refresh = false, two_phase=true, streaming=true);
ERROR:  cannot alter two_phase option

2) We are sending txn->xid twice; I feel we should send it only once in
logicalrep_write_stream_prepare:
+   /* transaction ID */
+   Assert(TransactionIdIsValid(txn->xid));
+   pq_sendint32(out, txn->xid);
+
+   /* send the flags field */
+   pq_sendbyte(out, flags);
+
+   /* send fields */
+   pq_sendint64(out, prepare_lsn);
+   pq_sendint64(out, txn->end_lsn);
+   pq_sendint64(out, txn->u_op_time.prepare_time);
+   pq_sendint32(out, txn->xid);
+

3) We could remove the local variable xid and instead return prepare_data->xid:
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in,
LogicalRepPreparedTxnData *prepare_data)
+{
+   TransactionId xid;
+   uint8   flags;
+
+   xid = pq_getmsgint(in, 4);

4) This comment could be moved above apply_spooled_messages for better readability:
+   /*
+* 1. Replay all the spooled operations - Similar code as for
+* apply_handle_stream_commit (i.e. non two-phase stream commit)
+*/
+
+   ensure_transaction();
+
+   nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn);
+

5) Similarly, this comment could be moved above PrepareTransactionBlock:
+   /*
+* 2. Mark the transaction as prepared. - Similar code as for
+* apply_handle_prepare (i.e. two-phase non-streamed prepare)
+*/
+
+   /*
+* BeginTransactionBlock is necessary to balance the EndTransactionBlock
+* called within the PrepareTransactionBlock below.
+*/
+   BeginTransactionBlock();
+   CommitTransactionCommand();
+
+   /*
+* Update origin state so we can restart streaming from correct position
+* in case of crash.
+*/
+   replorigin_session_origin_lsn = prepare_data.end_lsn;
+   replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+   PrepareTransactionBlock(gid);
+   CommitTransactionCommand();
+
+   pgstat_report_stat(false);

6) There is a lot of common code between apply_handle_stream_prepare
and apply_handle_prepare; if possible, factor it into a common function
so fixes don't have to be made in both places.
+   /*
+* 2. Mark the transaction as prepared. - Similar code as for
+* apply_handle_prepare (i.e. two-phase non-streamed prepare)
+*/
+
+   /*
+* BeginTransactionBlock is necessary to balance the EndTransactionBlock
+* called within the PrepareTransactionBlock below.
+*/
+   BeginTransactionBlock();
+   CommitTransactionCommand();
+
+   /*
+* Update origin state so we can restart streaming from correct position
+* in case of crash.
+*/
+   replorigin_session_origin_lsn = prepare_data.end_lsn;
+   replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+   PrepareTransactionBlock(gid);
+   CommitTransactionCommand();
+
+   pgstat_report_stat(false);
+
+   store_flush_position(prepare_data.end_lsn);

7) "streaming two-phase commit" is slightly misleading here; we can just
say streaming prepare.
+ * PREPARE callback (for streaming two-phase commit).
+ *
+ * Notify the downstream to prepare the transaction.
+ */
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+   ReorderBufferTXN *txn,
+   XLogRecPtr prepare_lsn)

8) Should we include an Assert on in_streaming, similar to the other
pgoutput_stream*** functions?
+static void
+pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+   ReorderBufferTXN *txn,
+   XLogRecPtr prepare_lsn)
+{
+   Assert(rbtxn_is_streamed(txn));
+
+   OutputP

Re: Bug fix for tab completion of ALTER TABLE ... VALIDATE CONSTRAINT ...

2021-04-27 Thread Aleksander Alekseev
Hi David,

> I noticed that $subject completes with already valid constraints,
> please find attached a patch that fixes it. I noticed that there are
> other places constraints can be validated, but didn't check whether
> similar bugs exist there yet.

There was a typo in the patch ("... and and not convalidated"). I've fixed
it. Otherwise the patch passes all the tests and works as expected:

eax=# create table foo (x int);
CREATE TABLE
eax=# alter table foo add constraint bar check (x < 3) not valid;
ALTER TABLE
eax=# alter table foo add constraint baz check (x <> 5) not valid;
ALTER TABLE
eax=# alter table foo validate constraint ba
bar baz
eax=# alter table foo validate constraint bar;
ALTER TABLE

-- 
Best regards,
Aleksander Alekseev


v2-0001-tab-completion-of-ALTER-TABLE.VALIDATE-CONSTRAINT.patch
Description: Binary data


Re: TRUNCATE on foreign table

2021-04-27 Thread Fujii Masao




On 2021/04/27 15:02, Bharath Rupireddy wrote:

On Tue, Apr 27, 2021 at 11:19 AM Fujii Masao
 wrote:

In docs v4 patch, I think we can combine below two lines into a single line:
+   supported by the foreign data wrapper,
  see .


You mean "supported by the foreign data wrapper "?

I was thinking that it's better to separate them because postgres_fdw
is just an example of the foreign data wrapper supporting TRUNCATE.
This makes me think again; isn't it better to add "for example" or
"for instance" after "data wrapper"? That is,

  TRUNCATE can be used for foreign tables if
  supported by the foreign data wrapper, for instance,
  see .


+1.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




CREATE COLLATION - check for duplicate options and error out if found one

2021-04-27 Thread Bharath Rupireddy
Hi,

While reviewing [1], I found that CREATE COLLATION doesn't throw an
error if duplicate options are specified; see [2] for a few cases tested
on HEAD. This can end up accepting some weird commands, see [2]. That is
inconsistent with other option-checking code in the server, where a duplicate
option is detected and an error is thrown. Attached is a patch doing the same
here. I chose the error message "option \"%s\" specified more than once"
and parser_errposition because that's more or less what was agreed on in [3].

Thoughts?

[1]
https://www.postgresql.org/message-id/CALj2ACWVd%3D-E6uG5AdHD0MvHY6e4mVzkapT%3DvLDnJJseGjaJLQ%40mail.gmail.com

[2]
CREATE COLLATION coll_dup_chk (LC_COLLATE = "POSIX", LC_COLLATE =
"NONSENSE", LC_CTYPE = "POSIX"); -- ERROR
CREATE COLLATION coll_dup_chk (LC_COLLATE = "NONSENSE", LC_COLLATE =
"POSIX", LC_CTYPE = "POSIX"); -- OK but it's weird
CREATE COLLATION coll_dup_chk (LC_CTYPE = "POSIX", LC_CTYPE = "NONSENSE",
LC_COLLATE = "POSIX"); -- ERROR
CREATE COLLATION coll_dup_chk (LC_CTYPE = "NONSENSE", LC_CTYPE = "POSIX",
LC_COLLATE = "POSIX",); -- OK but it's weird
CREATE COLLATION coll_dup_chk (PROVIDER = icu, PROVIDER = NONSENSE,
LC_COLLATE = "POSIX", LC_CTYPE = "POSIX"); -- ERROR
CREATE COLLATION coll_dup_chk (PROVIDER = NONSENSE, PROVIDER = icu,
LC_COLLATE = "POSIX", LC_CTYPE = "POSIX"); -- OK but it's weird
CREATE COLLATION case_sensitive (LOCALE = '', LOCALE = "NONSENSE"); -- ERROR
CREATE COLLATION coll_dup_chk (LOCALE = "NONSENSE", LOCALE = ''); -- OK but
it's weird
CREATE COLLATION coll_dup_chk (DETERMINISTIC = TRUE, DETERMINISTIC =
NONSENSE, LOCALE = ''); -- ERROR
CREATE COLLATION coll_dup_chk (DETERMINISTIC = NONSENSE, DETERMINISTIC =
TRUE, LOCALE = ''); -- OK but it's weird

[3]
https://www.postgresql.org/message-id/CALj2ACUa%3DZM8QtOLPCHc7%3DWgFrx9P6-AgKQs8cmKLvNCvu7arQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 555ac67a94d899b107e27a691f3a2a7340a14f64 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 27 Apr 2021 12:01:44 +0530
Subject: [PATCH v1] Check duplicate options and error out for CREATE COLLATION
 command

CREATE COLLATION doesn't throw an error if duplicate options are
specified. Modify the option parsing loop in DefineCollation to
check for duplicate options and error out if found one.
---
 src/backend/commands/collationcmds.c  | 35 +++
 src/test/regress/expected/collate.out | 35 +++
 src/test/regress/sql/collate.sql  | 17 +
 3 files changed, 87 insertions(+)

diff --git a/src/backend/commands/collationcmds.c b/src/backend/commands/collationcmds.c
index b8ec6f5756..8d07b21eda 100644
--- a/src/backend/commands/collationcmds.c
+++ b/src/backend/commands/collationcmds.c
@@ -86,15 +86,50 @@ DefineCollation(ParseState *pstate, List *names, List *parameters, bool if_not_e
 		if (strcmp(defel->defname, "from") == 0)
 			defelp = &fromEl;
 		else if (strcmp(defel->defname, "locale") == 0)
+		{
+			if (localeEl)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("option \"%s\" specified more than once", "locale"),
+		 parser_errposition(pstate, defel->location)));
 			defelp = &localeEl;
+		}
 		else if (strcmp(defel->defname, "lc_collate") == 0)
+		{
+			if (lccollateEl)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("option \"%s\" specified more than once", "lc_collate"),
+		 parser_errposition(pstate, defel->location)));
 			defelp = &lccollateEl;
+		}
 		else if (strcmp(defel->defname, "lc_ctype") == 0)
+		{
+			if (lcctypeEl)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("option \"%s\" specified more than once", "lc_ctype"),
+		 parser_errposition(pstate, defel->location)));
 			defelp = &lcctypeEl;
+		}
 		else if (strcmp(defel->defname, "provider") == 0)
+		{
+			if (providerEl)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("option \"%s\" specified more than once", "provider"),
+		 parser_errposition(pstate, defel->location)));
 			defelp = &providerEl;
+		}
 		else if (strcmp(defel->defname, "deterministic") == 0)
+		{
+			if (deterministicEl)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("option \"%s\" specified more than once", "deterministic"),
+		 parser_errposition(pstate, defel->location)));
 			defelp = &deterministicEl;
+		}
 		else
 		{
 			ereport(ERROR,
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0b60b55514..85a10c8198 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -701,6 +701,41 @@ View definition:
  SELECT ss.c1 + 1 AS c1p
FROM ( SELECT 4 AS c1) ss;
 
+-- Check duplicate options in CREATE COLLATION command and throw error
+-- LC_COLLATE
+CREATE COLLATION coll_dup_chk (LC_COLLATE = "POSIX", LC_COLLATE = "NONSENSE", LC_CTYPE = "POSIX");
+ERROR:  option "lc_collate" specifie

Re: Bug fix for tab completion of ALTER TABLE ... VALIDATE CONSTRAINT ...

2021-04-27 Thread Aleksander Alekseev
Hi hackers,

> Otherwise the patch passes all the tests and works as expected

I've noticed there is no tab completion for ALTER TABLE xxx ADD. Here
is an alternative version of the patch that fixes this as well. Not
sure if this should be in the same commit though.

-- 
Best regards,
Aleksander Alekseev


v3-0001-tab-completion-of-ALTER-TABLE.VALIDATE-CONSTRAINT.patch
Description: Binary data


Re: Better sanity checking for message length words

2021-04-27 Thread Aleksander Alekseev
Hi Tom,

> ...
> Given the small number of complaints to date, I'm hesitant to
> back-patch this: if there's anybody out there with valid use for
> long messages that I didn't think should be long, this might break
> things for them.  But I think it'd be reasonable to sneak it into
> v14, since we've not started beta yet.
>
> Thoughts?

I'm having slight issues applying your patch to the `master` branch.
Is it the right target?

Regarding the idea, I think extra checks are a good thing and
definitely something that can be introduced in the next major version.
If we receive a complaint during beta-testing we can revert the patch
or maybe increase the limit for small messages.

-- 
Best regards,
Aleksander Alekseev




Re: Attach to shared memory after fork()

2021-04-27 Thread Andrew Dunstan


On 4/26/21 11:56 PM, 邱宇航(烛远) wrote:
> Fork is an expensive operation[1]. The major cost is the mm(VMA
> PTE...) copy.
>
> ARM is especially weak on fork, which will invalid TLB entries one by
> one, and this is an expensive operation[2]. We could easily got 100%
> CPU on ARM machine. We also meet fork problem in x86, but not as
> serious as arm.
>
> We can avoid this by enable hugepage(and 2MB doesn’t help us under
> arm, we got a huge shared buffer), but we still think it is a problem.
>
> So I propose to remove shared buffers from postmaster and shmat them
> after fork. Not all of them, we still keep necessary shared memories
> in postmaster. Or maybe we just need to give up fork like under Windows?
>

Windows has CreateProcess, which isn't available elsewhere. If you build
with EXEC_BACKEND on *nix it will do fork() followed by exec(), the classic
Unix pattern. You can benchmark that, but I doubt you will like the results.

This is one of the reasons for using a connection pooler like pgbouncer,
which can vastly reduce the number of new process creations Postgres has
to do.

Better shared memory management might be more promising.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Enhanced error message to include hint messages for redundant options error

2021-04-27 Thread Bharath Rupireddy
On Tue, Apr 27, 2021 at 6:23 AM Bharath Rupireddy
 wrote:
>
> On Mon, Apr 26, 2021 at 9:10 PM Bharath Rupireddy
>  wrote:
> >
> > I found another problem with collationcmds.c is that it doesn't error
> > out if some of the options are specified more than once, something
> > like below. I think the option checking "for loop" in DefineCollation
> > needs to be reworked.
> > CREATE COLLATION case_insensitive (provider = icu, provider =
> > someother locale = '@colStrength=secondary', deterministic = false,
> > deterministic = true);
>
> I'm thinking that the above problem should be discussed separately. I
> will start a new thread soon on this.

I started a separate thread -
https://www.postgresql.org/message-id/CALj2ACWtL6fTLdyF4R_YkPtf1YEDb6FUoD5DGAki3rpD%2BsWqiA%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [BUG] "FailedAssertion" reported when streaming in logical replication

2021-04-27 Thread Dilip Kumar
On Tue, Apr 27, 2021 at 12:55 PM Amit Kapila  wrote:
>
> On Tue, Apr 27, 2021 at 12:22 PM Dilip Kumar  wrote:
> >
> > On Tue, Apr 27, 2021 at 12:05 PM Amit Kapila  
> > wrote:
> > > > > Can't we use 'txns_by_base_snapshot_lsn' list for this purpose? It is
> > > > > ensured in ReorderBufferSetBaseSnapshot that we always assign
> > > > > base_snapshot to a top-level transaction if the current is a known
> > > > > subxact. I think that will be true because we always form xid-subxid
> > > > > relation before processing each record in
> > > > > LogicalDecodingProcessRecord.
> > > >
> > > > Yeah, we can do that, but here we are only interested in top
> > > > transactions and this list will give us sub-transaction as well so we
> > > > will have to skip it in the below if condition.
> > > >
> > >
> > > I am not so sure about this point. I have explained above why I think
> > > there won't be any subtransactions in this. Can you please let me know
> > > what am I missing if anything?
> >
> > Got your point, yeah this will only have top transactions so we can
> > use this.  I will change this in the next patch.  In fact we can put
> > an assert that it should not be an sub transaction?
> >
>
> Right. It is good to have an assert.

I have modified the patch based on the above comments.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 6fcd9be88bd85df4af0508f082dfdad0f8bbf518 Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Tue, 27 Apr 2021 13:57:21 +0530
Subject: [PATCH v3] Assorted bug fix while selecting the largest top
 transaction for streaming

There were mainly two problems. 1) Ideally, if we haven't selected any
transaction yet, we should take the next available transaction without
comparing sizes; but the condition was wrong, and instead it took the next
available transaction without comparing sizes when a transaction had already
been selected. 2) Another problem was that we were selecting transactions
without checking their base snapshot; if the base snapshot is NULL then we
cannot stream any change, so we hit the assert that txn->size should be 0
after streaming. The solution is to never select for streaming a transaction
that doesn't have a base snapshot, as such a transaction cannot be streamed.
---
 src/backend/replication/logical/reorderbuffer.c | 31 ++---
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cb484f..ea217ef 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3362,16 +3362,17 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
  * iterate over the limited number of toplevel transactions.
  *
  * Note that, we skip transactions that contains incomplete changes. There
- * is a scope of optimization here such that we can select the largest transaction
- * which has complete changes.  But that will make the code and design quite complex
- * and that might not be worth the benefit.  If we plan to stream the transactions
- * that contains incomplete changes then we need to find a way to partially
- * stream/truncate the transaction changes in-memory and build a mechanism to
- * partially truncate the spilled files.  Additionally, whenever we partially
- * stream the transaction we need to maintain the last streamed lsn and next time
- * we need to restore from that segment and the offset in WAL.  As we stream the
- * changes from the top transaction and restore them subtransaction wise, we need
- * to even remember the subxact from where we streamed the last change.
+ * is a scope of optimization here such that we can select the largest
+ * transaction which has incomplete changes.  But that will make the code and
+ * design quite complex and that might not be worth the benefit.  If we plan to
+ * stream the transactions that contains incomplete changes then we need to
+ * find a way to partially stream/truncate the transaction changes in-memory
+ * and build a mechanism to partially truncate the spilled files.
+ * Additionally, whenever we partially stream the transaction we need to
+ * maintain the last streamed lsn and next time we need to restore from that
+ * segment and the offset in WAL.  As we stream the changes from the top
+ * transaction and restore them subtransaction wise, we need to even remember
+ * the subxact from where we streamed the last change.
  */
 static ReorderBufferTXN *
 ReorderBufferLargestTopTXN(ReorderBuffer *rb)
@@ -3381,13 +3382,17 @@ ReorderBufferLargestTopTXN(ReorderBuffer *rb)
 	ReorderBufferTXN *largest = NULL;
 
 	/* Find the largest top-level transaction. */
-	dlist_foreach(iter, &rb->toplevel_by_lsn)
+	dlist_foreach(iter, &rb->txns_by_base_snapshot_lsn)
 	{
 		ReorderBufferTXN *txn;
 
-		txn = dlist_container(ReorderBufferTXN, node, iter.cur);
+		txn = dlist_container(ReorderBufferTXN, base_snapshot_node, iter.c

Re: Replication slot stats misgivings

2021-04-27 Thread Amit Kapila
On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada  wrote:
>
> On Tue, Apr 27, 2021 at 11:45 AM Amit Kapila  wrote:
> >
> >
> > Sawada-San, I would like to go ahead with your
> > "Use-HTAB-for-replication-slot-statistics" unless you think otherwise?
>
> I agree that it's better to use the stats collector. So please go ahead.
>

I have pushed this patch and seeing one buildfarm failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14

  starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2
s2_alter_tbl1_char s1_commit s2_get_changes
+ isolationtester: canceling step s1_init after 314 seconds
  step s1_init: SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
  ?column?

I am analyzing this. Do let me know if you have any thoughts on the same?

-- 
With Regards,
Amit Kapila.




Re: Asynchronous Append on postgres_fdw nodes.

2021-04-27 Thread Etsuro Fujita
On Mon, Apr 26, 2021 at 3:01 PM Andrey V. Lepikhov
 wrote:
> While studying the capabilities of AsyncAppend, I noticed an
> inconsistency with the cost model of the optimizer:

> Here I see two problems:
> 1. Cost of an AsyncAppend is the same as cost of an Append. But
> execution time of the AsyncAppend for three remote partitions has more
> than halved.
> 2. Cost of an AsyncAppend looks as a sum of the child ForeignScan costs.

Yeah, we don’t adjust the cost for async Append; it’s the same as that
for sync Append.  But I don’t see any issue as-is, either.  (It’s not
that easy to adjust the cost to an appropriate value in the case of
postgres_fdw, because in that case the cost would vary depending on
which connections are used for scanning foreign tables [1].)

> I haven't ideas why it may be a problem right now. But I can imagine
> that it may be a problem in future if we have alternative paths: complex
> pushdown in synchronous mode (a few rows to return) or simple
> asynchronous append with a large set of rows to return.

Yeah, I think it’s better if we could consider async append paths and
estimate the costs for them accurately at path-creation time, not
plan-creation time, because that would make it possible to use async
execution in more cases, as you pointed out.  But I left that for
future work, because I wanted to make the first cut simple.

Thanks for the review!

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/CAPmGK15i-OyCesd369P8zyBErjN_T18zVYu27714bf_L%3DCOXew%40mail.gmail.com




Re: Asynchronous Append on postgres_fdw nodes.

2021-04-27 Thread Etsuro Fujita
On Mon, Apr 26, 2021 at 7:35 PM Andrey V. Lepikhov
 wrote:
> Small mistake i found. If no tuple was received from a foreign
> partition, explain shows that we never executed node. For example,
> if we have 0 tuples in f1 and 100 tuples in f2:
>
> Query:
> EXPLAIN (ANALYZE, VERBOSE, TIMING OFF, COSTS OFF)
> SELECT * FROM (SELECT * FROM f1 UNION ALL SELECT * FROM f2) AS q1
> LIMIT 101;
>
> Explain:
>   Limit (actual rows=100 loops=1)
> Output: f1.a
> ->  Append (actual rows=100 loops=1)
>   ->  Async Foreign Scan on public.f1 (never executed)
> Output: f1.a
> Remote SQL: SELECT a FROM public.l1
>   ->  Async Foreign Scan on public.f2 (actual rows=100 loops=1)
> Output: f2.a
> Remote SQL: SELECT a FROM public.l2
>
> The patch in the attachment fixes this.

Thanks for the report and patch!  Will look into this.

Best regards,
Etsuro Fujita




Re: [PATCH] Re: pg_identify_object_as_address() doesn't support pg_event_trigger oids

2021-04-27 Thread Joel Jacobson
On Tue, Apr 27, 2021, at 09:48, Michael Paquier wrote:
> On Tue, Apr 27, 2021 at 07:16:25AM +0200, Joel Jacobson wrote:
> > I've added a test at the end of event_trigger.sql,
> > reusing the three event triggers already in existence,
> > just before they are dropped.
> 
> Cool, thanks.  I have been looking at it and I'd still like to
> cross-check the output data of pg_get_object_address() to see if
> pg_identify_object() remains consistent through it.  See for example
> the attached that uses a trick based on LATERAL, a bit different than
> what's done in object_address.sql but that gives the same amount of
> coverage (I could also use two ROW()s and an equality, but well..).

Neat trick, looks good to me.

I've successfully tested fix_event_trigger_pg_identify_object_as_address3.patch.

/Joel

RE: Forget close an open relation in ReorderBufferProcessTXN()

2021-04-27 Thread osumi.takami...@fujitsu.com
On Tuesday, April 20, 2021 12:07 PM Amit Langote  
wrote:
> On Sat, Apr 17, 2021 at 1:30 PM Amit Kapila 
> wrote:
> > On Fri, Apr 16, 2021 at 11:24 PM Andres Freund 
> > wrote:> > This made me take a brief look at pgoutput.c - maybe I am
> > missing
> > > something, but how is the following not a memory leak?
> > >
> > > static void
> > > maybe_send_schema(LogicalDecodingContext *ctx,
> > >   ReorderBufferTXN *txn, ReorderBufferChange
> *change,
> > >   Relation relation, RelationSyncEntry *relentry) {
> > > ...
> > > /* Map must live as long as the session does. */
> > > oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > > relentry->map =
> convert_tuples_by_name(CreateTupleDescCopy(indesc),
> > >
> CreateTupleDescCopy(outdesc));
> > > MemoryContextSwitchTo(oldctx);
> > > send_relation_and_attrs(ancestor, xid, ctx);
> > > RelationClose(ancestor);
> > >
> > > If - and that's common - convert_tuples_by_name() won't have to do
> > > anything, the copied tuple descs will be permanently leaked.
> > >
> >
> > I also think this is a permanent leak. I think we need to free all the
> > memory associated with this map on the invalidation of this particular
> > relsync entry (basically in rel_sync_cache_relation_cb).
> 
> I agree there's a problem here.
> 
> Back in:
> 
> https://www.postgresql.org/message-id/CA%2BHiwqEeU19iQgjN6HF1HTP
> U0L5%2BJxyS5CmxaOVGNXBAfUY06Q%40mail.gmail.com
> 
> I had proposed to move the map creation from maybe_send_schema() to
> get_rel_sync_entry(), mainly because the latter is where I realized it
> belongs, though a bit too late.   Attached is the part of the patch
> for this particular issue.  It also takes care to release the copied 
> TupleDescs
> if the map is found to be unnecessary, thus preventing leaking into
> CacheMemoryContext.
Thank you for sharing the patch.
Your patch looks correct to me, and make check-world passes with it.
Also, I agree with the idea of moving the code that sets the map into
get_rel_sync_entry().

One thing I'd like to ask about is a more rigorous way to confirm that
the memory leak is fixed by the patch, beyond just running make check-world.

I used OSS HEAD with valgrind, expecting to see leak reports in the valgrind
logs whose call stacks contain CreateTupleDescCopy reached from both
pgoutput_change and pgoutput_truncate, and expecting those reports to
disappear after applying the patch.

But I cannot find any report that pairs the pgoutput functions with
CreateTupleDescCopy even on OSS HEAD, which means I need a better test to
check whether the CreateTupleDescCopy leak is addressed.
I collected the logs while running the regression tests in
src/test/subscription, so the code paths of interest should have been
exercised.

Could someone give me advice on how to confirm that the memory leak is
solved?


Best Regards,
Takamichi Osumi



Result Cache node shows per-worker info even for workers not launched

2021-04-27 Thread Amit Khandekar
Hi,

If planned parallel workers do not get launched, the Result Cache plan
node shows all-0 stats for each of those workers:

tpch=# set max_parallel_workers TO 0;
SET
tpch=# explain analyze
select avg(l_discount) from orders, lineitem
where
l_orderkey = o_orderkey
and o_orderdate < date '1995-03-09'
and l_shipdate > date '1995-03-09';


QUERY PLAN
-
 Finalize Aggregate  (cost=315012.87..315012.88 rows=1 width=32)
(actual time=27533.482..27533.598 rows=1 loops=1)
   ->  Gather  (cost=315012.44..315012.85 rows=4 width=32) (actual
time=27533.471..27533.587 rows=1 loops=1)
 Workers Planned: 4
 Workers Launched: 0
 ->  Partial Aggregate  (cost=314012.44..314012.45 rows=1
width=32) (actual time=27533.177..27533.178 rows=1 loops=1)
   ->  Nested Loop  (cost=0.44..309046.68 rows=1986303
width=4) (actual time=0.400..27390.835 rows=748912 loops=1)
 ->  Parallel Seq Scan on lineitem
(cost=0.00..154513.66 rows=4120499 width=12) (actual
time=0.044..7910.399 rows=16243662 loops=1)
   Filter: (l_shipdate > '1995-03-09'::date)
   Rows Removed by Filter: 13756133
 ->  Result Cache  (cost=0.44..0.53 rows=1
width=4) (actual time=0.001..0.001 rows=0 loops=16243662)
   Cache Key: lineitem.l_orderkey
   Hits: 12085749  Misses: 4157913  Evictions:
3256424  Overflows: 0  Memory Usage: 65537kB
   Worker 0:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   Worker 1:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   Worker 2:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   Worker 3:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   ->  Index Scan using orders_pkey on orders
(cost=0.43..0.52 rows=1 width=4) (actual time=0.002..0.002 rows=0
loops=4157913)
 Index Cond: (o_orderkey = lineitem.l_orderkey)
 Filter: (o_orderdate < '1995-03-09'::date)
 Rows Removed by Filter: 1
 Planning Time: 0.211 ms
 Execution Time: 27553.477 ms
(22 rows)

By looking at the other cases like  show_sort_info() or printing
per-worker jit info, I could see that the general policy is that we
skip printing info for workers that are not launched. Attached is a
patch to do the same for Result Cache.

I was earlier thinking about using (instrument[n].nloops == 0) to
check for not-launched workers. But we are already using "if
(rcstate->stats.cache_misses == 0)" for the leader process, so for
consistency I used the same method for workers.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies


skip_notlaunched_workers.patch
Description: Binary data


Re: wal stats questions

2021-04-27 Thread Fujii Masao




On 2021/04/26 10:11, Masahiro Ikeda wrote:



On 2021/04/23 16:30, Fujii Masao wrote:



On 2021/04/23 10:25, Andres Freund wrote:

Hi,

On 2021-04-23 09:26:17 +0900, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64,
because there aren't any counters with negative numbers?

On second thought, it's ok even if the counters like wal_records get
overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?


Why long? It's of a platform dependent size, which doesn't seem useful here?


I think it's OK to unify on uint64. Although a smaller size would be better for
the cache, that idea only works well on platforms where "long" is 4 bytes.


(Although I'm not familiar with the standard...)
It seems that we would need to adopt "unsigned long" if we use "long", which
may be 4 bytes.

I thought WalUsageAccumDiff() seemed to return the right value too. But I
researched more deeply and found that ISO/IEC 9899:1999 specifies that
unsigned integers never overflow (6.2.5 Types, paragraph 9), although it
doesn't define the overflow behaviour of signed integer types.

If my understanding is right, the standard only guarantees that
WalUsageAccumDiff() returns the right value if the type is an unsigned
integer. So, it's safe to change "long" to "unsigned long".
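
As a tiny standalone illustration of that point (this is not PostgreSQL code;
a 32-bit counter is assumed just to make the wraparound easy to show):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t	prev = UINT32_MAX - 5;	/* counter sampled just before wrapping */
	uint32_t	cur = 10;				/* same counter after 16 more increments */

	/* unsigned subtraction is defined modulo 2^32, so the diff is still right */
	printf("diff = %u\n", (unsigned) (cur - prev));	/* prints 16 */
	return 0;
}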


I think this should just use 64bit counters. Emitting more than 4
billion records in one transaction isn't actually particularly rare. And


Right. I agree to change the types of the counters to int64.

I think that it's better to make this change as a separate patch from
the changes for pg_stat_wal view.


Thanks for your comments.
OK, I separate two patches.


Thanks!




First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")


+   pgWalUsage.wal_records == prevWalUsage.wal_records &&
+   walStats.wal_write == 0 && walStats.wal_sync == 0 &&

WalStats.m_wal_write should be checked here instead of walStats.wal_write?

Is there really a case where the number of syncs is larger than zero while
the number of writes is zero? If not, is it enough to check only the number
of writes?

+* wal records weren't generated. So, the counters of 'wal_fpi',
+* 'wal_bytes', 'm_wal_buffers_full' are not updated neither.

It's better to add the assertion check that confirms
m_wal_buffers_full == 0 whenever wal_records is larger than zero?



The second one has the changes for the types of the BufferUsage and WalUsage
members. I changed the type from long to int64. Is it better to start a new thread?
("v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patch")


Will review the patch later. I'm ok to discuss that in this thread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Add reset information to pg_stat_statements_info

2021-04-27 Thread Seino Yuki

Hi.

This is a proposal for a new feature in pg_stat_statements extension.
I think we need to add some statistics to pg_stat_statements_info view.

"pg_stat_statements_info.stats_reset" will only be logged if 
"pg_statements_reset()" or "pg_statements_reset(0,0,0)" is executed.

How about changing it to the following ?

[before]
-[ RECORD 1 ]--
dealloc | 0
stats_reset | 2021-04-27 21:30:00

[after]
-[ RECORD 1 ]--
dealloc | 0
last_reset_all_time | 2021-04-27 21:30:00
last_reset_userid   | 10
last_reset_userid_time  | 2021-04-27 22:30:00
last_reset_dbid | 13974
last_reset_dbid_time| 2021-04-27 23:30:00
last_reset_queryid  | 8436481539005031698
last_reset_queryid_time | 2021-04-27 23:30:00

If "pg_statements_reset(10,0,0)" is executed, then "last_reset_userid", 
"last_reset_userid" are updated, but do not update 
"last_reset_all_time".
If "pg_statements_reset(0,0,0)" is executed, then "last_reset_userid", 
"last_reset_userid" and others are null.


What do you think?

Regards,
Seino Yuki




Re: Skip temporary table schema name from explain-verbose output.

2021-04-27 Thread Ashutosh Bapat
On Tue, Apr 27, 2021 at 12:23 PM Amul Sul  wrote:
> >
> > How about using an explain filter to replace the unstable text
> > pg_temp_3 to pg_temp_N instead of changing it in the core? Following
> > are the existing explain filters: explain_filter,
> > explain_parallel_append, explain_analyze_without_memory,
> > explain_resultcache, explain_parallel_sort_stats, explain_sq_limit.
> >
>
> Well, yes eventually, that will be the kludge. I was wondering if that
> table is accessible in a query via pg_temp schema then why should
> bother about printing the pg_temp_N schema name which is an internal
> purpose.

Although only the associated session can access objects from that
schema, I think the entries in pg_class have different namespace oids
and are accessible from other sessions. So knowing the actual schema
name is useful for debugging purposes. Using auto_explain, the explain
output goes to server log, where access to two temporary tables with
the same name from different sessions can be identified by the actual
schema name easily.

I am not sure whether we should change explain output only for the
sake of stable tests.

You could add a flag to EXPLAIN to mask pg_temp name but that's
probably an overkill. Filtering is a better option for tests.

-- 
Best Wishes,
Ashutosh Bapat




Re: Skip temporary table schema name from explain-verbose output.

2021-04-27 Thread Bharath Rupireddy
On Tue, Apr 27, 2021 at 6:59 PM Ashutosh Bapat
 wrote:
>
> On Tue, Apr 27, 2021 at 12:23 PM Amul Sul  wrote:
> > >
> > > How about using an explain filter to replace the unstable text
> > > pg_temp_3 to pg_temp_N instead of changing it in the core? Following
> > > are the existing explain filters: explain_filter,
> > > explain_parallel_append, explain_analyze_without_memory,
> > > explain_resultcache, explain_parallel_sort_stats, explain_sq_limit.
> > >
> >
> > Well, yes eventually, that will be the kludge. I was wondering if that
> > table is accessible in a query via pg_temp schema then why should
> > bother about printing the pg_temp_N schema name which is an internal
> > purpose.
>
> Although only the associated session can access objects from that
> schema, I think, the entries in pg_class have different namespace oids
> and are accessible from other sessions. So knowing the actual schema
> name is useful for debugging purposes. Using auto_explain, the explain
> output goes to server log, where access to two temporary tables with
> the same name from different sessions can be identified by the actual
> schema name easily.
>
> I am not sure whether we should change explain output only for the
> sake of stable tests.

I agree to not change the explain code, just for tests.

> You could add a flag to EXPLAIN to mask pg_temp name but that's
> probably an overkill.

IMO, you are right, it will be an overkill. We might end up having
requests to add flags for other cases as well.

> Filtering is a better option for tests.

+1. EXPLAIN output filtering is not something new, we have already
stabilized a few tests.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Tomas Vondra




On 4/27/21 7:34 AM, Masahiko Sawada wrote:

On Tue, Apr 27, 2021 at 8:07 AM Andres Freund  wrote:


Hi,

On 2021-04-26 23:59:17 +0200, Tomas Vondra wrote:

On 4/26/21 9:27 PM, Andres Freund wrote:

On 2021-04-26 15:31:02 +0200, Tomas Vondra wrote:

I'm not sure what to do about this :-( I don't have any ideas about how to
eliminate this overhead, so the only option I see is reverting the changes
in heap_insert. Unfortunately, that'd mean inserts into TOAST tables won't
be frozen ...


ISTM that the fundamental issue here is not that we acquire pins that we
shouldn't, but that we do so at a much higher frequency than needed.

It's probably too invasive for 14, but I think it might be worth exploring
passing down a BulkInsertState in nodeModifyTable.c's table_tuple_insert() iff
the input will be more than one row.

And then add the vm buffer of the target page to BulkInsertState, so that
hio.c can avoid re-pinning the buffer.



Yeah. The question still is what to do about 14, though. Shall we leave the
code as it is now, or should we change it somehow? It seem a bit unfortunate
that a COPY FREEZE optimization should negatively influence other (more)
common use cases, so I guess we can't just keep the current code ...


I'd suggest prototyping the use of BulkInsertState in nodeModifyTable.c
and see whether that fixes the regression.


Is this idea to have RelationGetBufferForTuple() skip re-pinning
vmbuffer? If so, is this essentially the same as the one in the v3
patch?



I don't think it is the same approach - it's a bit hard to follow what 
exactly happens in RelationGetBufferForTuple, but AFAICS it always 
starts with vmbuffer = InvalidBuffer, so it may pin the vmbuffer quite 
often, no?


What Andres is suggesting (I think) is to modify ExecInsert() to pass a 
valid bistate to table_tuple_insert, instead of just NULL, and store the 
vmbuffer in it. Not sure how to identify when inserting more than just a 
single row, though ...



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Attach to shared memory after fork()

2021-04-27 Thread Tom Lane
"=?UTF-8?B?6YKx5a6H6IiqKOeDm+i/nCk=?="  writes:
> Fork is an expensive operation[1].

Yeah, it's not hugely cheap.

> So I propose to remove shared buffers from postmaster and shmat them
> after fork.

This proposal seems moderately insane.  In the first place, it
introduces failure modes we could do without, and in the second place,
how is it not strictly *more* expensive than what happens now?  You
still have to end up with all those TLB entries mapped in the child.

(If your kernel is unable to pass down shared-memory TLBs effectively,
ISTM that's a kernel shortcoming not a Postgres architectural problem.)

regards, tom lane




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Bharath Rupireddy
On Tue, Apr 27, 2021 at 7:13 PM Tomas Vondra
 wrote:
> What Andres is suggesting (I think) is to modify ExecInsert() to pass a
> valid bistate to table_tuple_insert, instead of just NULL, and store the
> vmbuffer in it. Not sure how to identify when inserting more than just a
> single row, though ...

I think the thread "should INSERT SELECT use a BulkInsertState?" [1],
has a simple dynamic mechanism [with a GUC defining the threshold
tuples] to switch over to using BulkInsertState. Maybe it's worth
having a look at the patch -
0001-INSERT-SELECT-to-use-BulkInsertState-and-multi_i.patch?

+/* Use bulk insert after a threshold number of tuples */
+// XXX: maybe this should only be done if it's not a partitioned table or
+// if the partitions don't support miinfo, which uses its own bistates
+mtstate->ntuples++;
+if (mtstate->bistate == NULL &&
+mtstate->operation == CMD_INSERT &&
+mtstate->ntuples > bulk_insert_ntuples &&
+bulk_insert_ntuples >= 0)
+{
+elog(DEBUG1, "enabling bulk insert");
+mtstate->bistate = GetBulkInsertState();
+}

[1] https://www.postgresql.org/message-id/20210222030158.GS14772%40telsasoft.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Justin Pryzby
On Tue, Apr 27, 2021 at 03:43:07PM +0200, Tomas Vondra wrote:
> On 4/27/21 7:34 AM, Masahiko Sawada wrote:
> > On Tue, Apr 27, 2021 at 8:07 AM Andres Freund  wrote:
> > > On 2021-04-26 23:59:17 +0200, Tomas Vondra wrote:
> > > > On 4/26/21 9:27 PM, Andres Freund wrote:
> > > > > On 2021-04-26 15:31:02 +0200, Tomas Vondra wrote:
> > > > > > I'm not sure what to do about this :-( I don't have any ideas about 
> > > > > > how to
> > > > > > eliminate this overhead, so the only option I see is reverting the 
> > > > > > changes
> > > > > > in heap_insert. Unfortunately, that'd mean inserts into TOAST 
> > > > > > tables won't
> > > > > > be frozen ...
> > > > > 
> > > > > ISTM that the fundamental issue here is not that we acquire pins that 
> > > > > we
> > > > > shouldn't, but that we do so at a much higher frequency than needed.
> > > > > 
> > > > > It's probably too invasive for 14, but I think it might be worth 
> > > > > exploring
> > > > > passing down a BulkInsertState in nodeModifyTable.c's 
> > > > > table_tuple_insert() iff
> > > > > the input will be more than one row.
> > > > > 
> > > > > And then add the vm buffer of the target page to BulkInsertState, so 
> > > > > that
> > > > > hio.c can avoid re-pinning the buffer.
> > > > > 
> > > > 
> > > > Yeah. The question still is what to do about 14, though. Shall we leave 
> > > > the
> > > > code as it is now, or should we change it somehow? It seem a bit 
> > > > unfortunate
> > > > that a COPY FREEZE optimization should negatively influence other (more)
> > > > common use cases, so I guess we can't just keep the current code ...
> > > 
> > > I'd suggest prototyping the use of BulkInsertState in nodeModifyTable.c
> > > and see whether that fixes the regression.
> 
> What Andres is suggesting (I think) is to modify ExecInsert() to pass a
> valid bistate to table_tuple_insert, instead of just NULL, and store the
> vmbuffer in it. Not sure how to identify when inserting more than just a
> single row, though ...

Maybe this is relevant.
https://commitfest.postgresql.org/33/2553/
| INSERT SELECT: BulkInsertState and table_multi_insert

The biistate part is small - Simon requested to also use table_multi_insert,
which makes the patch much bigger, and there's probably lots of restrictions I
haven't even thought to check.

This uses a GUC threshold for bulk insert, but I'm still not sure it's really
problematic to use a biistate for a single row.

/* Use bulk insert after a threshold number of tuples */
// XXX: maybe this should only be done if it's not a partitioned table 
or
// if the partitions don't support miinfo, which uses its own bistates
mtstate->ntuples++;
if (mtstate->bistate == NULL &&
mtstate->ntuples > bulk_insert_ntuples &&
bulk_insert_ntuples >= 0)
{
elog(DEBUG1, "enabling bulk insert");
mtstate->bistate = GetBulkInsertState();

-- 
Justin




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-27 Thread Amit Langote
On Tue, Apr 27, 2021 at 4:34 PM Amit Langote  wrote:
> Thanks for the updated patch.  I've been reading it, but I noticed a
> bug in 8aba9322511f, which I thought you'd want to know to make a note
> of when committing this one.
>
> So we decided in 8aba9322511f that it is okay to make the memory
> context in which a transient partdesc is allocated a child of
> PortalContext so that it disappears when the portal does.  But
> PortalContext is apparently NULL when the planner runs, at least in
> the "simple" query protocol, so any transient partdescs built by the
> planner would effectively leak.  Not good.
>
> With this patch, partdesc_nodetached is no longer transient, so the
> problem doesn't exist.
>
> I will write more about the updated patch in a bit...

The first thing that struck me about partdesc_nodetached is that it's
not handled in RelationClearRelation(), where we (re)set a regular
partdesc to NULL so that the next RelationGetPartitionDesc() has to
build it from scratch.  I think partdesc_nodetached and the xmin
should likewise be reset.  Also in load_relcache_init_file(), although
nothing serious there.

That said, I think I may have found a couple of practical problems
with partdesc_nodetached, or more precisely with having it
side-by-side with the regular partdesc.  Maybe they can be fixed, so the
problems are not necessarily deal breakers for the patch's main idea.  The
problems can be seen when different queries in a serializable
transaction have to use both the regular partdesc and
partdesc_nodetached in a given relcache.  For example, try the following
after first creating a situation where the table p has a
detach-pending partition p2 (for values in (2)) and a live partition p1
(for values in (1)).

begin isolation level serializable;
insert into p values (1);
select * from p where a = 1;
insert into p values (1);

The 1st insert succeeds but the 2nd fails with:

ERROR:  no partition of relation "p" found for row
DETAIL:  Partition key of the failing row contains (a) = (1).

I haven't analyzed this error very closely but there is another
situation which causes a crash due to what appears to be a conflict
with rd_pdcxt's design:

-- run in a new session
begin isolation level serializable;
select * from p where a = 1;
insert into p values (1);
select * from p where a = 1;

The 2nd select crashes:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The crash occurs because the planner gets handed a stale copy of
partdesc_nodetached for the 2nd select.  It gets stale, because the
context it's allocated in gets made a child of rd_pdcxt, which is in
turn assigned the context of the regular partdesc when it is built for
the insert query.  Any child contexts of rd_pdcxt are deleted as soon
as the Relation's refcount goes to zero, taking it with
partdesc_nodetached.  Note it is this code in
RelationBuildPartitionDesc():

/*
 * But first, a kluge: if there's an old rd_pdcxt, it contains an old
 * partition descriptor that may still be referenced somewhere.
 * Preserve it, while not leaking it, by reattaching it as a child
 * context of the new rd_pdcxt.  Eventually it will get dropped by
 * either RelationClose or RelationClearRelation.
 */
if (rel->rd_pdcxt != NULL)
MemoryContextSetParent(rel->rd_pdcxt, new_pdcxt);
rel->rd_pdcxt = new_pdcxt;

I think we may need a separate context for partdesc_nodetached, likely
with the same kludges as rd_pdcxt.  Maybe the first problem will go
away with that as well.
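
For illustration, something along these lines in RelationBuildPartitionDesc()
is what I have in mind (only a sketch; rd_pddcxt and the context name are
hypothetical, mirroring the rd_pdcxt kluge quoted above):

    /* hypothetical: build the no-detached partdesc in its own context tree */
    new_pddcxt = AllocSetContextCreate(CacheMemoryContext,
                                       "partition descriptor (nodetached)",
                                       ALLOCSET_SMALL_SIZES);
    /* ... build partdesc_nodetached in new_pddcxt ... */

    /* keep any older copy alive, exactly as done for rd_pdcxt */
    if (rel->rd_pddcxt != NULL)
        MemoryContextSetParent(rel->rd_pddcxt, new_pddcxt);
    rel->rd_pddcxt = new_pddcxt;
    rel->rd_partdesc_nodetached = partdesc;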

Few other minor things I noticed:

+* Store it into relcache.  For snapshots built excluding detached
+* partitions, which we save separately, we also record the
+* pg_inherits.xmin of the detached partition that was omitted; this
+* informs a future potential user of such a cached snapshot.

The "snapshot" in the 1st and the last sentence should be "partdesc"?

+ * We keep two partdescs in relcache: rd_partdesc_nodetached excludes
+ * partitions marked concurrently being detached, while rd_partdesc includes
+ * them.

IMHO, describing rd_partdesc first in the sentence would be better.
Like: rd_partdesc includes all partitions including any that are being
concurrently detached, while rd_partdesc_nodetached excludes them.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Should we document the behaviour of TRUNCATE skipping repeated relations?

2021-04-27 Thread Bharath Rupireddy
Hi,

The TRUNCATE command currently skips processing repeated relations
(see if (list_member_oid(relids, myrelid)) continue; in
ExecuteTruncate) because the same relation can't be truncated more
than once as it will be under "use" during the txn. For instance, in
the following use cases 1) TRUNCATE foo, foo; 2) TRUNCATE foo, ONLY
foo, foo; the first instance of relation foo is taken into consideration
for processing and the other relation instances (and any options specified
on them) are ignored.
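
For example, with plain inheritance (just a sketch to make the point
concrete):

    CREATE TABLE parent (a int);
    CREATE TABLE child () INHERITS (parent);
    INSERT INTO child VALUES (1);

    -- The second entry is skipped because parent was already processed,
    -- so ONLY has no effect here and child gets truncated as well.
    TRUNCATE parent, ONLY parent;
    SELECT count(*) FROM child;   -- returns 0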

I feel that users should be aware of this behaviour so that they can
correct the commands if written in such a way and don't report
unexpected behaviour especially for the use cases like (2) where they
might expect ONLY foo behaviour but it is skipped by the server.
AFAICS, I don't find it anywhere in the docs, should we document it as
a note?

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Replication slot stats misgivings

2021-04-27 Thread Amit Kapila
On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila  wrote:
>
> On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada  wrote:
>
> I have pushed this patch and seeing one buildfarm failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14
>
>   starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2
> s2_alter_tbl1_char s1_commit s2_get_changes
> + isolationtester: canceling step s1_init after 314 seconds
>   step s1_init: SELECT 'init' FROM
> pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
>   ?column?
>
> I am analyzing this.
>

After checking the logs below corresponding to this test, it seems the test
has been executed and create_slot was successful:
2021-04-27 11:06:43.770 UTC [17694956:52] isolation/concurrent_ddl_dml
STATEMENT:  SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
2021-04-27 11:07:11.748 UTC [5243096:9] LOG:  checkpoint starting: time
2021-04-27 11:09:24.332 UTC [5243096:10] LOG:  checkpoint complete:
wrote 14 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
write=0.716 s, sync=0.001 s, total=132.584 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=198 kB, estimate=406 kB
2021-04-27 11:09:40.116 UTC [6226046:1] [unknown] LOG:  connection
received: host=[local]
2021-04-27 11:09:40.117 UTC [17694956:53] isolation/concurrent_ddl_dml
LOG:  statement: BEGIN;
2021-04-27 11:09:40.117 UTC [17694956:54] isolation/concurrent_ddl_dml
LOG:  statement: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
2021-04-27 11:09:40.118 UTC [17694956:55] isolation/concurrent_ddl_dml
LOG:  statement: INSERT INTO tbl2 (val1, val2) VALUES (1, 1);
2021-04-27 11:09:40.119 UTC [10944636:49] isolation/concurrent_ddl_dml
LOG:  statement: ALTER TABLE tbl1 ALTER COLUMN val2 TYPE character
varying;

I am not sure, but there is some possibility that even though the create
slot was successful, the isolation tester still succeeded in canceling
it, maybe because create_slot finished just at the same time. As we
can see from the logs, a checkpoint also happened during this test, which
could also contribute to the slowness of this particular command.

Also, I see a lot of messages like below which indicate stats
collector is also quite slow:
2021-04-27 10:57:59.385 UTC [18743536:1] LOG:  using stale statistics
instead of current ones because stats collector is not responding

I am not sure if the timeout happened because the machine is slow or
is it in any way related to code. I am seeing some previous failures
due to timeout on this machine [1][2]. In those failures, I see the
"using stale stats" message. Also, I am not able to see why it can
fail due to this patch?

[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-02-23%2004%3A23%3A56
[2] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-12-24%2005%3A31%3A43

-- 
With Regards,
Amit Kapila.




Re: Better sanity checking for message length words

2021-04-27 Thread Tom Lane
Aleksander Alekseev  writes:
> I'm having slight issues applying your patch to the `master` branch.
> Is it the right target?

[ scratches head ... ]  The patch still applies perfectly cleanly
for me, using either "patch" or "git apply".

> Regarding the idea, I think extra checks are a good thing and
> definitely something that can be introduced in the next major version.
> If we receive a complaint during beta-testing we can revert the patch
> or maybe increase the limit for small messages.

Actually, I did some more testing yesterday and found that
"make check-world" passes with PQ_SMALL_MESSAGE_LIMIT values
as small as 16.  That may say more about our testing than
anything else --- for example, it implies we're not using long
statement or portal names anywhere.  But still, it suggests
that 30000 is between one and two orders of magnitude too large.
I'm now thinking that 10000 would be a good conservative setting,
or we could try 1000 if we want to be a bit more aggressive.
As you say, beta-testing feedback could result in further
modifications.

regards, tom lane




Re: Replication slot stats misgivings

2021-04-27 Thread Masahiko Sawada
On Tue, Apr 27, 2021 at 11:29 PM Amit Kapila  wrote:
>
> On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila  wrote:
> >
> > On Tue, Apr 27, 2021 at 8:58 AM Masahiko Sawada  
> > wrote:
> >
> > I have pushed this patch and seeing one buildfarm failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-27%2009%3A23%3A14
> >
> >   starting permutation: s1_init s1_begin s1_insert_tbl1 s1_insert_tbl2
> > s2_alter_tbl1_char s1_commit s2_get_changes
> > + isolationtester: canceling step s1_init after 314 seconds
> >   step s1_init: SELECT 'init' FROM
> > pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
> >   ?column?
> >
> > I am analyzing this.
> >
>
> After checking the logs below corresponding to this test, it seems the test
> has been executed and create_slot was successful:

The pg_create_logical_replication_slot() was executed at 11:04:25:

2021-04-27 11:04:25.494 UTC [17694956:49] isolation/concurrent_ddl_dml
LOG:  statement: SELECT 'init' FROM
pg_create_logical_replication_slot('isolation_slot', 'test_decoding');

Therefore this command took 314 sec, which matches the number the
isolation test reported. And the following logs come next:

2021-04-27 11:06:43.770 UTC [17694956:50] isolation/concurrent_ddl_dml
LOG:  logical decoding found consistent point at 0/17F9078
2021-04-27 11:06:43.770 UTC [17694956:51] isolation/concurrent_ddl_dml
DETAIL:  There are no running transactions.

> 2021-04-27 11:06:43.770 UTC [17694956:52] isolation/concurrent_ddl_dml
> STATEMENT:  SELECT 'init' FROM
> pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
> 2021-04-27 11:07:11.748 UTC [5243096:9] LOG:  checkpoint starting: time
> 2021-04-27 11:09:24.332 UTC [5243096:10] LOG:  checkpoint complete:
> wrote 14 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled;
> write=0.716 s, sync=0.001 s, total=132.584 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=198 kB, estimate=406 kB
> 2021-04-27 11:09:40.116 UTC [6226046:1] [unknown] LOG:  connection
> received: host=[local]
> 2021-04-27 11:09:40.117 UTC [17694956:53] isolation/concurrent_ddl_dml
> LOG:  statement: BEGIN;
> 2021-04-27 11:09:40.117 UTC [17694956:54] isolation/concurrent_ddl_dml
> LOG:  statement: INSERT INTO tbl1 (val1, val2) VALUES (1, 1);
> 2021-04-27 11:09:40.118 UTC [17694956:55] isolation/concurrent_ddl_dml
> LOG:  statement: INSERT INTO tbl2 (val1, val2) VALUES (1, 1);
> 2021-04-27 11:09:40.119 UTC [10944636:49] isolation/concurrent_ddl_dml
> LOG:  statement: ALTER TABLE tbl1 ALTER COLUMN val2 TYPE character
> varying;
>
> I am not sure, but there is some possibility that even though the create
> slot was successful, the isolation tester still succeeded in canceling
> it, maybe because create_slot finished just at the same time.

Yeah, we see the test log "canceling step s1_init after 314 seconds"
but don't see any server log entry indicating the query cancellation.

>  As we
> can see from the logs, a checkpoint also happened during this test, which
> could also contribute to the slowness of this particular command.

Yes. I also think the checkpoint could somewhat lead to the slowness.
And since create_slot() took 2min to find a consistent snapshot the
system might have already been busy.

>
> Also, I see a lot of messages like below which indicate stats
> collector is also quite slow:
> 2021-04-27 10:57:59.385 UTC [18743536:1] LOG:  using stale statistics
> instead of current ones because stats collector is not responding
>
> I am not sure if the timeout happened because the machine is slow or
> is it in any way related to code. I am seeing some previous failures
> due to timeout on this machine [1][2]. In those failures, I see the
> "using stale stats" message.

It seems like a time-dependent issue but I'm wondering why the logical
decoding test failed at this time.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Masahiko Sawada
On Tue, Apr 27, 2021 at 10:43 PM Tomas Vondra
 wrote:
>
>
>
> On 4/27/21 7:34 AM, Masahiko Sawada wrote:
> > On Tue, Apr 27, 2021 at 8:07 AM Andres Freund  wrote:
> >>
> >> Hi,
> >>
> >> On 2021-04-26 23:59:17 +0200, Tomas Vondra wrote:
> >>> On 4/26/21 9:27 PM, Andres Freund wrote:
>  On 2021-04-26 15:31:02 +0200, Tomas Vondra wrote:
> > I'm not sure what to do about this :-( I don't have any ideas about how 
> > to
> > eliminate this overhead, so the only option I see is reverting the 
> > changes
> > in heap_insert. Unfortunately, that'd mean inserts into TOAST tables 
> > won't
> > be frozen ...
> 
>  ISTM that the fundamental issue here is not that we acquire pins that we
>  shouldn't, but that we do so at a much higher frequency than needed.
> 
>  It's probably too invasive for 14, but I think it might be worth 
>  exploring
>  passing down a BulkInsertState in nodeModifyTable.c's 
>  table_tuple_insert() iff
>  the input will be more than one row.
> 
>  And then add the vm buffer of the target page to BulkInsertState, so that
>  hio.c can avoid re-pinning the buffer.
> 
> >>>
> >>> Yeah. The question still is what to do about 14, though. Shall we leave 
> >>> the
> >>> code as it is now, or should we change it somehow? It seem a bit 
> >>> unfortunate
> >>> that a COPY FREEZE optimization should negatively influence other (more)
> >>> common use cases, so I guess we can't just keep the current code ...
> >>
> >> I'd suggest prototyping the use of BulkInsertState in nodeModifyTable.c
> >> and see whether that fixes the regression.
> >
> > Is this idea to have RelationGetBufferForTuple() skip re-pinning
> > vmbuffer? If so, is this essentially the same as the one in the v3
> > patch?
> >
>
> I don't think it is the same approach - it's a bit hard to follow what
> exactly happens in RelationGetBufferForTuple, but AFAICS it always
> starts with vmbuffer = InvalidBuffer, so it may pin the vmbuffer quite
> often, no?

With that patch, we pin the vmbuffer only when inserting a frozen
tuple into an empty page. That is, when inserting a frozen tuple into
an empty page, we pin the vmbuffer and heap_insert() will mark the
page all-visible and set all-frozen bit on vm. And from the next
insertion (into the same page) until the page gets full, since the
page is already all-visible, we skip pinning the vmbuffer. IOW, if the
target page is not empty but all-visible, we skip pinning the
vmbuffer. We pin the vmbuffer only once per heap page used during
insertion.

>
> What Andres is suggesting (I think) is to modify ExecInsert() to pass a
> valid bistate to table_tuple_insert, instead of just NULL, and store the
> vmbuffer in it.

Understood. This approach keeps using the same vmbuffer until we need
another vm page corresponding to the target heap page, which seems
better.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Masahiko Sawada
On Wed, Apr 28, 2021 at 12:26 AM Masahiko Sawada  wrote:
>
> On Tue, Apr 27, 2021 at 10:43 PM Tomas Vondra
>  wrote:
> >
> >
> >
> > On 4/27/21 7:34 AM, Masahiko Sawada wrote:
> > > On Tue, Apr 27, 2021 at 8:07 AM Andres Freund  wrote:
> > >>
> > >> Hi,
> > >>
> > >> On 2021-04-26 23:59:17 +0200, Tomas Vondra wrote:
> > >>> On 4/26/21 9:27 PM, Andres Freund wrote:
> >  On 2021-04-26 15:31:02 +0200, Tomas Vondra wrote:
> > > I'm not sure what to do about this :-( I don't have any ideas about 
> > > how to
> > > eliminate this overhead, so the only option I see is reverting the 
> > > changes
> > > in heap_insert. Unfortunately, that'd mean inserts into TOAST tables 
> > > won't
> > > be frozen ...
> > 
> >  ISTM that the fundamental issue here is not that we acquire pins that 
> >  we
> >  shouldn't, but that we do so at a much higher frequency than needed.
> > 
> >  It's probably too invasive for 14, but I think it might be worth 
> >  exploring
> >  passing down a BulkInsertState in nodeModifyTable.c's 
> >  table_tuple_insert() iff
> >  the input will be more than one row.
> > 
> >  And then add the vm buffer of the target page to BulkInsertState, so 
> >  that
> >  hio.c can avoid re-pinning the buffer.
> > 
> > >>>
> > >>> Yeah. The question still is what to do about 14, though. Shall we leave 
> > >>> the
> > >>> code as it is now, or should we change it somehow? It seem a bit 
> > >>> unfortunate
> > >>> that a COPY FREEZE optimization should negatively influence other (more)
> > >>> common use cases, so I guess we can't just keep the current code ...
> > >>
> > >> I'd suggest prototyping the use of BulkInsertState in nodeModifyTable.c
> > >> and see whether that fixes the regression.
> > >
> > > Is this idea to have RelationGetBufferForTuple() skip re-pinning
> > > vmbuffer? If so, is this essentially the same as the one in the v3
> > > patch?
> > >
> >
> > I don't think it is the same approach - it's a bit hard to follow what
> > exactly happens in RelationGetBufferForTuple, but AFAICS it always
> > starts with vmbuffer = InvalidBuffer, so it may pin the vmbuffer quite
> > often, no?
>
> With that patch, we pin the vmbuffer only when inserting a frozen
> tuple into an empty page. That is, when inserting a frozen tuple into
> an empty page, we pin the vmbuffer and heap_insert() will mark the
> page all-visible and set all-frozen bit on vm. And from the next
> insertion (into the same page) until the page gets full, since the
> page is already all-visible, we skip pinning the vmbuffer. IOW, if the
> target page is not empty but all-visible, we skip pinning the
> vmbuffer. We pin the vmbuffer only once per heap page used during
> insertion.
>
> >
> > What Andres is suggesting (I think) is to modify ExecInsert() to pass a
> > valid bistate to table_tuple_insert, instead of just NULL, and store the
> > vmbuffer in it.
>
> Understood. This approach keeps using the same vmbuffer until we need
> another vm page corresponding to the target heap page, which seems
> better.

But how is ExecInsert() related to REFRESH MATERIALIZED VIEW?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-27 Thread Alvaro Herrera
On 2021-Apr-27, Amit Langote wrote:

> On Tue, Apr 27, 2021 at 4:34 PM Amit Langote  wrote:

> I think we may need a separate context for partdesc_nodetached, likely
> with the same kludges as rd_pdcxt.  Maybe the first problem will go
> away with that as well.

Ooh, seems I completely misunderstood what RelationClose was doing.  I
thought it was deleting the whole rd_pdcxt, *including* the "current"
partdesc.  But that's not at all what it does: it only deletes the
*children* memcontexts, so the partdesc that is currently valid remains
valid.  I agree that your proposed fix appears to be promising, in that
a separate "context tree" rd_pddcxt (?) can be used for this.  I'll try
it out now.

> Few other minor things I noticed:
> 
> +* Store it into relcache.  For snapshots built excluding detached
> +* partitions, which we save separately, we also record the
> +* pg_inherits.xmin of the detached partition that was omitted; this
> +* informs a future potential user of such a cached snapshot.
> 
> The "snapshot" in the 1st and the last sentence should be "partdesc"?

Doh, yeah.

> + * We keep two partdescs in relcache: rd_partdesc_nodetached excludes
> + * partitions marked concurrently being detached, while rd_partdesc includes
> + * them.
> 
> IMHO, describing rd_partdesc first in the sentence would be better.
> Like: rd_partdesc includes all partitions including any that are being
> concurrently detached, while rd_partdesc_nodetached excludes them.

Makes sense.

-- 
Álvaro Herrera   Valdivia, Chile




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2021-04-27 Thread Tom Lane
I wrote:
>> Perhaps it'd be worth documenting that you can get the standard
>> astronomical definition of Julian date by transposing to time zone UTC-12
>> before converting.

BTW ... I'd first thought that the way to do this was to rotate to
time zone UTC+12.  I convinced myself on two separate days that UTC-12
was correct instead, but now I'm thinking I was right the first time.
In particular, the results I'm getting with UTC-12 don't square with
the example on Wikipedia [1], which says "the Julian Date for
00:30:00.0 UT January 1, 2013, is 2 456 293.520 833":

regression=# select extract(julian from '2013-01-01 00:30+00'::timestamptz at time zone 'utc-12');
   extract
--
 2456294.5208
(1 row)

But using UTC+12 does match:

regression=# select extract(julian from '2013-01-01 00:30+00'::timestamptz at time zone 'utc+12');
   extract
--
 2456293.5208
(1 row)
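
For whatever it's worth, here is a quick sketch of the arithmetic (leaning on
the POSIX reading of 'utc+12', i.e. twelve hours west of Greenwich):

    astronomical JD rolls over at noon UT, so JD 2456293.5 = 2013-01-01 00:00 UT
    00:30 UT = 2456293.5 + 30.0/1440 = 2456293.520833   (the Wikipedia value)
    extract(julian ...) rolls over at local midnight, so we need local = UT - 12,
    which is what a POSIX-style 'utc+12' means to us.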

Of course Wikipedia has been known to contain errors, but now
I'm inclined to think I blew this.  Anyone want to check my work?

regards, tom lane

[1] https://en.wikipedia.org/wiki/Julian_day




Re: SQL-standard function body

2021-04-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 18.04.21 23:33, Tom Lane wrote:
>> The actual use-case for that seems pretty thin, so we never bothered
>> to worry about it before.  But if we're going to build loop-breaking
>> logic to handle function body dependencies, it should deal with this
>> too.  I think that all that's required is for the initial dummy
>> function declaration to omit defaults as well as providing a dummy
>> body.

> I have studied this a bit.  I'm not sure where the dummy function 
> declaration should be created.  The current dependency-breaking logic in 
> pg_dump_sort.c doesn't appear to support injecting additional objects 
> into the set of dumpable objects.  So we would need to create it perhaps 
> in dumpFunc() and then later set flags that indicate whether it will be 
> required.

Hmm, good point.  The existing code that breaks loops involving views
depends on the fact that the view relation and the view's ON SELECT
rule are already treated as distinct objects within pg_dump.  So we
just need to mark the rule object to indicate whether to emit it or
not.  To make it work for functions, there would have to be a secondary
object representing the function body (and the default expressions,
I guess).

That's kind of a lot of complication, and inefficiency, for a corner case
that may never arise in practice.  We've ignored the risk for default
expressions, and AFAIR have yet to receive any field complaints about it.
So maybe it's okay to do the same for SQL-style function bodies, at least
for now.

> Another option would be that we disallow this at creation time.

Don't like that one much.  The backend shouldn't be in the business
of rejecting valid commands just because pg_dump might be unable
to cope later.

regards, tom lane




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Tomas Vondra

On 4/27/21 5:44 PM, Masahiko Sawada wrote:

On Wed, Apr 28, 2021 at 12:26 AM Masahiko Sawada  wrote:


On Tue, Apr 27, 2021 at 10:43 PM Tomas Vondra
 wrote:




On 4/27/21 7:34 AM, Masahiko Sawada wrote:

On Tue, Apr 27, 2021 at 8:07 AM Andres Freund  wrote:


Hi,

On 2021-04-26 23:59:17 +0200, Tomas Vondra wrote:

On 4/26/21 9:27 PM, Andres Freund wrote:

On 2021-04-26 15:31:02 +0200, Tomas Vondra wrote:

I'm not sure what to do about this :-( I don't have any ideas about how to
eliminate this overhead, so the only option I see is reverting the changes
in heap_insert. Unfortunately, that'd mean inserts into TOAST tables won't
be frozen ...


ISTM that the fundamental issue here is not that we acquire pins that we
shouldn't, but that we do so at a much higher frequency than needed.

It's probably too invasive for 14, but I think it might be worth exploring
passing down a BulkInsertState in nodeModifyTable.c's table_tuple_insert() iff
the input will be more than one row.

And then add the vm buffer of the target page to BulkInsertState, so that
hio.c can avoid re-pinning the buffer.



Yeah. The question still is what to do about 14, though. Shall we leave the
code as it is now, or should we change it somehow? It seem a bit unfortunate
that a COPY FREEZE optimization should negatively influence other (more)
common use cases, so I guess we can't just keep the current code ...


I'd suggest prototyping the use of BulkInsertState in nodeModifyTable.c
and see whether that fixes the regression.


Is this idea to have RelationGetBufferForTuple() skip re-pinning
vmbuffer? If so, is this essentially the same as the one in the v3
patch?



I don't think it is the same approach - it's a bit hard to follow what
exactly happens in RelationGetBufferForTuple, but AFAICS it always
starts with vmbuffer = InvalidBuffer, so it may pin the vmbuffer quite
often, no?


With that patch, we pin the vmbuffer only when inserting a frozen
tuple into an empty page. That is, when inserting a frozen tuple into
an empty page, we pin the vmbuffer and heap_insert() will mark the
page all-visible and set all-frozen bit on vm. And from the next
insertion (into the same page) until the page gets full, since the
page is already all-visible, we skip pinning the vmbuffer. IOW, if the
target page is not empty but all-visible, we skip pinning the
vmbuffer. We pin the vmbuffer only once per heap page used during
insertion.



What Andres is suggesting (I think) is to modify ExecInsert() to pass a
valid bistate to table_tuple_insert, instead of just NULL, and store the
vmbuffer in it.


Understood. This approach keeps using the same vmbuffer until we need
another vm page corresponding to the target heap page, which seems
better.


But how is ExecInsert() related to REFRESH MATERIALIZED VIEW?



TBH I haven't looked into the details, but Andres talked about 
nodeModifyTable and table_tuple_insert, and ExecInsert is the only place 
calling it. But maybe I'm just confused and Andres meant something else?



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-27 Thread Alvaro Herrera
This v3 handles things as you suggested and works correctly AFAICT.  I'm
going to add some more tests cases to verify the behavior in the
scenarios you showed, and get them to run under cache-clobber options to
make sure it's good.

Thanks!

-- 
Álvaro Herrera   Valdivia, Chile
From 145ed63c4338d7c7804329e736019a00c97f1a7a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 26 Apr 2021 14:53:04 -0400
Subject: [PATCH v3] Allow a partdesc-omitting-partitions to be cached

Makes partition descriptor acquisition faster during the transient
period in which a partition is in the process of being detached.

Per gripe from Amit Langote
Discussion: https://postgr.es/m/ca+hiwqfgpp1lxjzobygt9rpvtjxxkg5qg2+xch2z1q7krqz...@mail.gmail.com
---
 src/backend/catalog/pg_inherits.c |  52 -
 src/backend/commands/tablecmds.c  |  66 +++-
 src/backend/commands/trigger.c|   3 +-
 src/backend/partitioning/partdesc.c   | 101 --
 src/backend/utils/cache/relcache.c|  33 +-
 src/include/catalog/pg_inherits.h |   6 +-
 src/include/utils/rel.h   |   5 +
 .../detach-partition-concurrently-3.out   |  57 +-
 .../detach-partition-concurrently-3.spec  |   9 +-
 9 files changed, 254 insertions(+), 78 deletions(-)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 6447b52854..c373faf2d6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -52,6 +52,22 @@ typedef struct SeenRelsEntry
  * then no locks are acquired, but caller must beware of race conditions
  * against possible DROPs of child relations.
  *
+ * Partitions marked as being detached are omitted; see
+ * find_inheritance_children_extended for details.
+ */
+List *
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+{
+	return find_inheritance_children_extended(parentrelId, true, lockmode,
+			  NULL, NULL);
+}
+
+/*
+ * find_inheritance_children_extended
+ *
+ * As find_inheritance_children, with more options regarding detached
+ * partitions.
+ *
  * If a partition's pg_inherits row is marked "detach pending",
  * *detached_exist (if not null) is set true.
  *
@@ -60,11 +76,13 @@ typedef struct SeenRelsEntry
  * marked "detach pending" is visible to that snapshot, then that partition is
  * omitted from the output list.  This makes partitions invisible depending on
  * whether the transaction that marked those partitions as detached appears
- * committed to the active snapshot.
+ * committed to the active snapshot.  In addition, *detached_xmin (if not null)
+ * is set to the xmin of the row of the detached partition.
  */
 List *
-find_inheritance_children(Oid parentrelId, bool omit_detached,
-		  LOCKMODE lockmode, bool *detached_exist)
+find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
+   LOCKMODE lockmode, bool *detached_exist,
+   TransactionId *detached_xmin)
 {
 	List	   *list = NIL;
 	Relation	relation;
@@ -132,7 +150,32 @@ find_inheritance_children(Oid parentrelId, bool omit_detached,
 snap = GetActiveSnapshot();
 
 if (!XidInMVCCSnapshot(xmin, snap))
+{
+	if (detached_xmin)
+	{
+		/*
+		 * Two detached partitions should not occur (see
+		 * checks in MarkInheritDetached), but if they do,
+		 * track the newer of the two.  Make sure to warn the
+		 * user, so that they can clean up.  Since this is
+		 * just a cross-check against potentially corrupt
+		 * catalogs, we don't make it a full-fledged error
+		 * message.
+		 */
+		if (*detached_xmin != InvalidTransactionId)
+		{
+			elog(WARNING, "more than one partition pending detach found for table with OID %u",
+ parentrelId);
+			if (TransactionIdFollows(xmin, *detached_xmin))
+*detached_xmin = xmin;
+		}
+		else
+			*detached_xmin = xmin;
+	}
+
+	/* Don't add the partition to the output list */
 	continue;
+}
 			}
 		}
 
@@ -247,8 +290,7 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
 		ListCell   *lc;
 
 		/* Get the direct children of this rel */
-		currentchildren = find_inheritance_children(currentrel, true,
-	lockmode, NULL);
+		currentchildren = find_inheritance_children(currentrel, lockmode);
 
 		/*
 		 * Add to the queue only those children not already seen. This avoids
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8e717ada28..d9ba87a2a3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3492,7 +3492,7 @@ renameatt_internal(Oid myrelid,
 		 * expected_parents will only be 0 if we are not already recursing.
 		 */
 		if (expected_parents == 0 &&
-			find_inheritance_children(myrelid, true, NoLock, NULL) != NIL)
+			find_inheritance_children(myrelid, NoLock) != NIL)
 			ereport(ERROR,
 	(

Re: Addition of authenticated ID to pg_stat_activity

2021-04-27 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Mon, Apr 26, 2021 at 03:21:46PM -0400, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> >> I'm getting a bit worried about the incremental increase in
> >> pg_stat_activity width - it's probably by far the view that's most
> >> viewed interactively. I think we need to be careful not to add too niche
> >> things to it. This is especially true for columns that may be wider.
> >> 
> >> It'd be bad for discoverability, but perhaps something like this, that's
> >> not that likely to be used interactively, would be better done as a
> >> separate function that would need to be used explicitly?
> > 
> > I mean.. we already have separate functions and views for this, though
> > they're auth-method-specific currently, but also provide more details,
> > since it isn't actually a "one size fits all" kind of thing like this
> > entire approach is imagining it to be.
> 
> Referring to pg_stat_ssl and pg_stat_gssapi here, right?  Yes, that
> would be very limited as this leads to no visibility for LDAP, all
> password-based authentications and more.

Yes, of course.  The point being made was that we could do the same for
the other auth methods rather than adding something to pg_stat_activity.

> I am wondering if we should take this as an occasion to move some data
> out of pg_stat_activity into a separate view, dedicated to the data
> related to the connection that remains set to the same value for the
> duration of a backend's life, as of the following set:
> - the backend PID
> - client_addr
> - client_hostname
> - client_port
> - authenticated ID
> - application_name?  (well, this one could change on reload, so I am
> lying).

application_name certainly changes, as pointed out elsewhere.

> It would be tempting to move the database name and the username but
> these are popular fields with monitoring.  Maybe we could name that
> pg_stat_connection_status, pg_stat_auth_status or just
> pg_stat_connection?

I don't know that there's really any of the existing fields that
aren't "popular fields with monitoring"..  The issue that Andres brought
up wasn't about monitoring though- it was about users looking
interactively.  Monitoring systems can adjust their queries for the new
major version to do whatever joins, et al, they need and that's a
once-per-major-version to do.  On the other hand, people doing:

table pg_stat_activity;

Would like to get the info they really want out of that and not anything
else.  If we're going to adjust the fields returned from that then
that's really the lens we should use.

So, what fields are people really looking at when querying
pg_stat_activity interactively?  User, database, pid, last query,
transaction start, query start, state, wait event info, maybe backend
xmin/xid?  I doubt most people looking at pg_stat_activity interactively
actually care about the non-user backends (autovacuum, et al).
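
Just to make that concrete, the sort of ad-hoc query I have in mind is
something like (illustrative only):

    SELECT pid, usename, datname, state, wait_event_type, wait_event,
           xact_start, query_start, backend_xmin, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE backend_type = 'client backend';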

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Addition of authenticated ID to pg_stat_activity

2021-04-27 Thread Justin Pryzby
On Tue, Apr 27, 2021 at 09:59:18AM +0900, Michael Paquier wrote:
> On Mon, Apr 26, 2021 at 03:21:46PM -0400, Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> >> I'm getting a bit worried about the incremental increase in
> >> pg_stat_activity width - it's probably by far the view that's most
> >> viewed interactively. I think we need to be careful not to add too niche
> >> things to it. This is especially true for columns that may be wider.
> >> 
> >> It'd be bad for discoverability, but perhaps something like this, that's
> >> not that likely to be used interactively, would be better done as a
> >> separate function that would need to be used explicitly?
> > 
> > I mean.. we already have separate functions and views for this, though
> > they're auth-method-specific currently, but also provide more details,
> > since it isn't actually a "one size fits all" kind of thing like this
> > entire approach is imagining it to be.
> 
> I am wondering if we should take this as an occasion to move some data
> out of pg_stat_activity into a separate view, dedicated to the data
> related to the connection that remains set to the same value for the
> duration of a backend's life, as of the following set:
> - the backend PID
> - client_addr
> - client_hostname
> - client_port
> - authenticated ID
> - application_name?  (well, this one could change on reload, so I am
> lying).

+backend type
+leader_PID

> It would be tempting to move the database name and the username but
> these are popular fields with monitoring.  Maybe we could name that
> pg_stat_connection_status, pg_stat_auth_status or just
> pg_stat_connection?

Maybe - there could also be a trivial view which JOINs pg_stat_activity and
pg_stat_connection ON (pid).
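
Something like this, say (sketch only; pg_stat_connection and its columns are
hypothetical at this point):

    CREATE VIEW pg_stat_session AS
    SELECT a.*,
           c.client_addr, c.client_hostname, c.client_port, c.authenticated_id
    FROM pg_stat_activity a
         JOIN pg_stat_connection c USING (pid);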

Technically I think it could also move backend_start/backend_xmin, but it'd be
odd to move them if the other timestamp/xid columns stayed in pg_stat_activity.

There's no reason that pg_stat_connection would *have* to be "static" per
connection, right?  That's just how you're defining what would be included.

Stephen wrote:
> Would like to get the info they really want out of that and not anything
> else.  If we're going to adjust the fields returned from that then
> that's really the lens we should use.
> 
> So, what fields are people really looking at when querying
> pg_stat_activity interactively?  User, database, pid, last query,
> transaction start, query start, state, wait event info, maybe backend
> xmin/xid?  I doubt most people looking at pg_stat_activity interactively
> actually care about the non-user backends (autovacuum, et al).

I think the narrow/userfacing view would exclude only the OID/XID fields:

 datid| oid  |   |  |
 usesysid | oid  |   |  |
 backend_xid  | xid  |   |  |
 backend_xmin | xid  |   |  |

I think interactive users *would* care about other backend types - they're
frequently wondering "what's going on?"

TBH, query text is often so long that I have to write left(query,33), and then
the idea of a "userfacing" variant loses its appeal, since it's necessary to
enumerate columns anyway.

-- 
Justin




Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-04-27 Thread Andres Freund
Hi,

On 2021-04-28 00:44:47 +0900, Masahiko Sawada wrote:
> On Wed, Apr 28, 2021 at 12:26 AM Masahiko Sawada  
> wrote:
> > > What Andres is suggesting (I think) is to modify ExecInsert() to pass a
> > > valid bistate to table_tuple_insert, instead of just NULL, and store the
> > > vmbuffer in it.
> >
> > Understood. This approach keeps using the same vmbuffer until we need
> > another vm page corresponding to the target heap page, which seems
> > better.
> 
> But how is ExecInsert() related to REFRESH MATERIALIZED VIEW?

I was thinking of the CONCURRENTLY path for REFRESH MATERIALIZED VIEW I
think. Or something.

That actually makes it easier - we already pass in a bistate in the relevant
paths. So if we add a current_vmbuf to BulkInsertStateData, we can avoid
needing to pin so often. It seems that'd end up with a good bit cleaner and
less risky code than the skip_vmbuffer_for_frozen_tuple_insertion_v3.patch
approach.

The current RelationGetBufferForTuple() interface / how it's used in heapam.c
doesn't make this quite as trivial as it could be... Attached is a quick hack
implementing this. For me it reduces the overhead noticeably:

REFRESH MATERIALIZED VIEW mv;
before:
Time: 26542.333 ms (00:26.542)
after:
Time: 23105.047 ms (00:23.105)

Greetings,

Andres Freund
diff --git i/src/include/access/hio.h w/src/include/access/hio.h
index 1d611287c08..6d8f18152f1 100644
--- i/src/include/access/hio.h
+++ w/src/include/access/hio.h
@@ -30,6 +30,7 @@ typedef struct BulkInsertStateData
 {
 	BufferAccessStrategy strategy;	/* our BULKWRITE strategy object */
 	Buffer		current_buf;	/* current insertion target page */
+	Buffer		current_vmbuf;
 } BulkInsertStateData;
 
 
diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 13396eb7f2c..5a63efc4386 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -2011,6 +2011,7 @@ GetBulkInsertState(void)
 	bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
 	bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
 	bistate->current_buf = InvalidBuffer;
+	bistate->current_vmbuf = InvalidBuffer;
 	return bistate;
 }
 
@@ -2020,8 +2021,7 @@ GetBulkInsertState(void)
 void
 FreeBulkInsertState(BulkInsertState bistate)
 {
-	if (bistate->current_buf != InvalidBuffer)
-		ReleaseBuffer(bistate->current_buf);
+	ReleaseBulkInsertStatePin(bistate);
 	FreeAccessStrategy(bistate->strategy);
 	pfree(bistate);
 }
@@ -2033,8 +2033,15 @@ void
 ReleaseBulkInsertStatePin(BulkInsertState bistate)
 {
 	if (bistate->current_buf != InvalidBuffer)
+	{
 		ReleaseBuffer(bistate->current_buf);
-	bistate->current_buf = InvalidBuffer;
+		bistate->current_buf = InvalidBuffer;
+	}
+	if (bistate->current_vmbuf != InvalidBuffer)
+	{
+		ReleaseBuffer(bistate->current_vmbuf);
+		bistate->current_vmbuf = InvalidBuffer;
+	}
 }
 
 
@@ -2277,8 +2284,12 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	}
 
 	UnlockReleaseBuffer(buffer);
-	if (vmbuffer != InvalidBuffer)
+
+	if (vmbuffer != InvalidBuffer &&
+		(!bistate || bistate->current_vmbuf != vmbuffer))
+	{
 		ReleaseBuffer(vmbuffer);
+	}
 
 	/*
 	 * If tuple is cachable, mark it for invalidation from the caches in case
@@ -2655,8 +2666,11 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	}
 
 	/* We're done with inserting all tuples, so release the last vmbuffer. */
-	if (vmbuffer != InvalidBuffer)
+	if (vmbuffer != InvalidBuffer &&
+		(!bistate || bistate->current_vmbuf != vmbuffer))
+	{
 		ReleaseBuffer(vmbuffer);
+	}
 
 	/*
 	 * We're done with the actual inserts.  Check for conflicts again, to
diff --git i/src/backend/access/heap/hio.c w/src/backend/access/heap/hio.c
index ffc89685bff..a573322bb6d 100644
--- i/src/backend/access/heap/hio.c
+++ w/src/backend/access/heap/hio.c
@@ -136,7 +136,8 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
  * be less than block2.
  */
 static void
-GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
+GetVisibilityMapPins(Relation relation, BulkInsertState bistate,
+	 Buffer buffer1, Buffer buffer2,
 	 BlockNumber block1, BlockNumber block2,
 	 Buffer *vmbuffer1, Buffer *vmbuffer2)
 {
@@ -157,6 +158,12 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 		if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
 			return;
 
+		if (bistate && bistate->current_vmbuf != InvalidBuffer)
+		{
+			ReleaseBuffer(bistate->current_vmbuf);
+			bistate->current_vmbuf = InvalidBuffer;
+		}
+
 		/* We must unlock both buffers before doing any I/O. */
 		LockBuffer(buffer1, BUFFER_LOCK_UNLOCK);
 		if (buffer2 != InvalidBuffer && buffer2 != buffer1)
@@ -269,6 +276,26 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 	FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
 }
 
+static void
+visibilitymap_pin_bi(Relation rel, BlockNumber targetBlock, BulkInsertState bistate, Buffer *vmbuffer)
+{
+	if (bistate != NULL)
+	{
+		if (*vmbuf

Re: Addition of authenticated ID to pg_stat_activity

2021-04-27 Thread Andres Freund
Hi,

On 2021-04-27 12:40:29 -0400, Stephen Frost wrote:
> So, what fields are people really looking at when querying
> pg_stat_activity interactively?  User, database, pid, last query,
> transaction start, query start, state, wait event info, maybe backend
> xmin/xid?  I doubt most people looking at pg_stat_activity interactively
> actually care about the non-user backends (autovacuum, et al).

Not representative, but I personally am about as often interested in one
of the non-connection processes as the connection
ones. E.g. investigating what is autovacuum's bottleneck, are
checkpointer / wal writer / bgwriter io bound or keeping up, etc.

Greetings,

Andres Freund




Release 14 Beta 1

2021-04-27 Thread Andrew Dunstan


Greetings.

The Release Management Team (Pete Geoghegan, Michael Paquier and myself) 
proposes that the date of the Beta 1 release will be **Thursday May 20, 2021**, 
which aligns with past practice.


cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-04-27 Thread Alvaro Herrera
On 2021-Apr-27, Alvaro Herrera wrote:

> This v3 handles things as you suggested and works correctly AFAICT.  I'm
> going to add some more tests cases to verify the behavior in the
> scenarios you showed, and get them to run under cache-clobber options to
> make sure it's good.

Yep, it seems to work.  Strangely, the new isolation case doesn't
actually crash before the fix -- it merely throws a memory allocation
error.

-- 
Álvaro Herrera   Valdivia, Chile
"Linux transformó mi computadora, de una `máquina para hacer cosas',
en un aparato realmente entretenido, sobre el cual cada día aprendo
algo nuevo" (Jaime Salinas)
From 8e41eb137b835f614ddeeb7c79e70d7e0740f522 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 27 Apr 2021 19:04:37 -0400
Subject: [PATCH v4] Allow a partdesc-omitting-partitions to be cached

Makes partition descriptor acquisition faster during the transient
period in which a partition is in the process of being detached.

(This incidentally fixes an alleged bug in 8aba9322511 whereby a
memory context holding a transient partdesc is reparented to NULL ...
leading to permanent leak of that memory.  Reported by Amit Langote.)

Per gripe from Amit Langote
Discussion: https://postgr.es/m/ca+hiwqfgpp1lxjzobygt9rpvtjxxkg5qg2+xch2z1q7krqz...@mail.gmail.com
---
 src/backend/catalog/pg_inherits.c |  52 -
 src/backend/commands/tablecmds.c  |  66 ++-
 src/backend/commands/trigger.c|   3 +-
 src/backend/partitioning/partdesc.c   | 110 --
 src/backend/utils/cache/relcache.c|  33 +-
 src/include/catalog/pg_inherits.h |   6 +-
 src/include/utils/rel.h   |   5 +
 .../detach-partition-concurrently-3.out   | 110 +-
 .../detach-partition-concurrently-3.spec  |  14 ++-
 9 files changed, 320 insertions(+), 79 deletions(-)

diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 6447b52854..c373faf2d6 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -52,6 +52,22 @@ typedef struct SeenRelsEntry
  * then no locks are acquired, but caller must beware of race conditions
  * against possible DROPs of child relations.
  *
+ * Partitions marked as being detached are omitted; see
+ * find_inheritance_children_extended for details.
+ */
+List *
+find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
+{
+	return find_inheritance_children_extended(parentrelId, true, lockmode,
+			  NULL, NULL);
+}
+
+/*
+ * find_inheritance_children_extended
+ *
+ * As find_inheritance_children, with more options regarding detached
+ * partitions.
+ *
  * If a partition's pg_inherits row is marked "detach pending",
  * *detached_exist (if not null) is set true.
  *
@@ -60,11 +76,13 @@ typedef struct SeenRelsEntry
  * marked "detach pending" is visible to that snapshot, then that partition is
  * omitted from the output list.  This makes partitions invisible depending on
  * whether the transaction that marked those partitions as detached appears
- * committed to the active snapshot.
+ * committed to the active snapshot.  In addition, *detached_xmin (if not null)
+ * is set to the xmin of the row of the detached partition.
  */
 List *
-find_inheritance_children(Oid parentrelId, bool omit_detached,
-		  LOCKMODE lockmode, bool *detached_exist)
+find_inheritance_children_extended(Oid parentrelId, bool omit_detached,
+   LOCKMODE lockmode, bool *detached_exist,
+   TransactionId *detached_xmin)
 {
 	List	   *list = NIL;
 	Relation	relation;
@@ -132,7 +150,32 @@ find_inheritance_children(Oid parentrelId, bool omit_detached,
 snap = GetActiveSnapshot();
 
 if (!XidInMVCCSnapshot(xmin, snap))
+{
+	if (detached_xmin)
+	{
+		/*
+		 * Two detached partitions should not occur (see
+		 * checks in MarkInheritDetached), but if they do,
+		 * track the newer of the two.  Make sure to warn the
+		 * user, so that they can clean up.  Since this is
+		 * just a cross-check against potentially corrupt
+		 * catalogs, we don't make it a full-fledged error
+		 * message.
+		 */
+		if (*detached_xmin != InvalidTransactionId)
+		{
+			elog(WARNING, "more than one partition pending detach found for table with OID %u",
+ parentrelId);
+			if (TransactionIdFollows(xmin, *detached_xmin))
+*detached_xmin = xmin;
+		}
+		else
+			*detached_xmin = xmin;
+	}
+
+	/* Don't add the partition to the output list */
 	continue;
+}
 			}
 		}
 
@@ -247,8 +290,7 @@ find_all_inheritors(Oid parentrelId, LOCKMODE lockmode, List **numparents)
 		ListCell   *lc;
 
 		/* Get the direct children of this rel */
-		currentchildren = find_inheritance_children(currentrel, true,
-	lockmode, NULL);
+		currentchildren = find_inheritance_children(currentrel, lockmode);
 
 		/*
 		 * Add to the qu

Re: wal stats questions

2021-04-27 Thread Masahiro Ikeda



On 2021/04/27 21:56, Fujii Masao wrote:
> 
> 
> On 2021/04/26 10:11, Masahiro Ikeda wrote:
>>
>> First patch has only the changes for pg_stat_wal view.
>> ("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
>>
> 
> +    pgWalUsage.wal_records == prevWalUsage.wal_records &&
> +    walStats.wal_write == 0 && walStats.wal_sync == 0 &&
> > WalStats.m_wal_write should be checked here instead of walStats.wal_write?

Thanks! Yes, I'll fix it.


> Is there really the case where the number of sync is larger than zero when
> the number of writes is zero? If not, it's enough to check only the number
> of writes?

I thought that there is such a case if "wal_sync_method" is fdatasync, fsync or
fsync_writethrough. An example case follows.

(1) backend-1 writes the WAL data because the WAL buffer has no space, but it
doesn't sync the WAL data.
(2) backend-2 reads data pages. During the execution, it needs to write and sync
the WAL because dirty pages are selected as victim pages. backend-2 needs to
only sync the WAL data because the WAL data was already written by backend-1,
but it wasn't synced yet.

I'm OK to change it since it's a rare case.


> + * wal records weren't generated. So, the counters of 'wal_fpi',
> + * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
> 
> It's better to add the assertion check that confirms
> m_wal_buffers_full == 0 whenever wal_records is larger than zero?

Sorry, I couldn't understand yet. I thought that m_wal_buffers_full can be
larger than 0 if wal_records > 0.

Do you suggest that the following assertion is needed?

-   if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-   return false;
+   if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+   WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+   {
+   Assert(pgWalUsage.wal_fpi == 0 && pgWalUsage.wal_bytes == 0 &&
+   WalStats.m_wal_buffers_full == 0 &&
WalStats.m_wal_write_time == 0 &&
+   WalStats.m_wal_sync_time == 0);
+   return;
+   }


>> Second one has the changes for the type of the BufferUsage's and WalUsage's
>> members. I change the type from long to int64. Is it better to make new 
>> thread?
>> ("v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patch")
> 
> Will review the patch later. I'm ok to discuss that in this thread.

Thanks!


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: Unresolved repliaction hang and stop problem.

2021-04-27 Thread Alvaro Herrera
On 2021-Apr-14, Amit Kapila wrote:

> On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
>  wrote:
> >
> > Hello,
> > After upgrading the cluster from 10.x to 13.1 we've started getting a 
> > problem describe pgsql-general:
> > https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> > We've noticed similar issue being described on this list in
> > https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> > with a fix being rolled out in 13.2.
> 
> The fix for the problem discussed in the above threads is committed
> only in PG-14, see [1]. I don't know what makes you think it is fixed
> in 13.2. Also, it is not easy to back-patch that because this fix
> depends on some of the infrastructure introduced in PG-14.
> 
> [1] - 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc

Hmm ... On what does it depend (other than plain git conflicts, which
are aplenty)?  On a quick look to the commit, it's clear that we need to
be careful in order not to cause an ABI break, but that doesn't seem
impossible to solve, but I'm wondering if there is more to it than that.

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Bug fix for tab completion of ALTER TABLE ... VALIDATE CONSTRAINT ...

2021-04-27 Thread David Fetter
On Tue, Apr 27, 2021 at 12:33:25PM +0300, Aleksander Alekseev wrote:
> Hi David,
> 
> > I noticed that $subject completes with already valid constraints,
> > please find attached a patch that fixes it. I noticed that there are
> > other places constraints can be validated, but didn't check whether
> > similar bugs exist there yet.
> 
> There was a typo in the patch ("... and and not convalidated"). I've fixed
> it. Otherwise the patch passes all the tests and works as expected:
> 
> eax=# create table foo (x int);
> CREATE TABLE
> eax=# alter table foo add constraint bar check (x < 3) not valid;
> ALTER TABLE
> eax=# alter table foo add constraint baz check (x <> 5) not valid;
> ALTER TABLE
> eax=# alter table foo validate constraint ba
> bar baz
> eax=# alter table foo validate constraint bar;
> ALTER TABLE

Sorry about that typo, and thanks for poking at this!
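
For reference, the completion now effectively offers only the constraints
that still need validating, i.e. roughly (a sketch, not the exact query in
tab-complete.c):

    SELECT conname
    FROM pg_catalog.pg_constraint
    WHERE conrelid = 'foo'::pg_catalog.regclass
      AND NOT convalidated;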

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Error in libpq docs for target_session_attrs, prefer-standby mode

2021-04-27 Thread Greg Nancarrow
Hi,

I spotted an error in the development version documentation for
libpq's connection parameter "target_session_attrs" (34.1.2 Parameter
Key Words).
In the description for the "prefer-standby" mode, it says "... but if
none of the listed hosts is a standby server, try again in all mode".
There is no such "all" mode. It should instead say "any" mode.
Patch is attached.
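
For reference, the mode in question is used like this (illustrative
connection string only):

    psql "postgresql://host1,host2/mydb?target_session_attrs=prefer-standby"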

Regards,
Greg Nancarrow
Fujitsu Australia


libpq_target_session_attrs_doc_fix.patch
Description: Binary data


Re: Replication slot stats misgivings

2021-04-27 Thread Amit Kapila
On Tue, Apr 27, 2021 at 8:28 PM Masahiko Sawada  wrote:
>
> On Tue, Apr 27, 2021 at 11:29 PM Amit Kapila  wrote:
> >
> > On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila  wrote:
> > >
> >
> > I am not sure if the timeout happened because the machine is slow or
> > is it in any way related to code. I am seeing some previous failures
> > due to timeout on this machine [1][2]. In those failures, I see the
> > "using stale stats" message.
>
> It seems like a time-dependent issue but I'm wondering why the logical
> decoding test failed at this time.
>

As per the analysis done till now, it appears that the machine is slow,
which leads to the timeout, and there appear to be some prior failures
related to timeouts as well. I think it is better to wait for another
run (or a few runs) to see if this occurs again.

-- 
With Regards,
Amit Kapila.




Re: Replication slot stats misgivings

2021-04-27 Thread vignesh C
On Wed, Apr 28, 2021 at 8:28 AM Amit Kapila  wrote:
>
> On Tue, Apr 27, 2021 at 8:28 PM Masahiko Sawada  wrote:
> >
> > On Tue, Apr 27, 2021 at 11:29 PM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Apr 27, 2021 at 5:40 PM Amit Kapila  
> > > wrote:
> > > >
> > >
> > > I am not sure if the timeout happened because the machine is slow or
> > > is it in any way related to code. I am seeing some previous failures
> > > due to timeout on this machine [1][2]. In those failures, I see the
> > > "using stale stats" message.
> >
> > It seems like a time-dependent issue but I'm wondering why the logical
> > decoding test failed at this time.
> >
>
> As per the analysis done till now, it appears that the machine is slow,
> which leads to the timeout, and there appear to be some prior failures
> related to timeouts as well. I think it is better to wait for another
> run (or a few runs) to see if this occurs again.
>

Yes, checkpoint seems to take a lot of time, could be because the
machine is slow. Let's wait for the next run and see.

Regards,
Vignesh




Re: [PATCH] Re: pg_identify_object_as_address() doesn't support pg_event_trigger oids

2021-04-27 Thread Michael Paquier
On Tue, Apr 27, 2021 at 02:33:36PM +0200, Joel Jacobson wrote:
> I've successfully tested 
> fix_event_trigger_pg_identify_object_as_address3.patch.

Thanks.  Applied down to 9.6 then.
--
Michael


signature.asc
Description: PGP signature


Re: Unresolved repliaction hang and stop problem.

2021-04-27 Thread Amit Kapila
On Wed, Apr 28, 2021 at 6:48 AM Alvaro Herrera  wrote:
>
> On 2021-Apr-14, Amit Kapila wrote:
>
> > On Tue, Apr 13, 2021 at 1:18 PM Krzysztof Kois
> >  wrote:
> > >
> > > Hello,
> > > After upgrading the cluster from 10.x to 13.1 we've started getting a 
> > > problem describe pgsql-general:
> > > https://www.postgresql.org/message-id/8bf8785c-f47d-245c-b6af-80dc1eed40db%40unitygroup.com
> > > We've noticed similar issue being described on this list in
> > > https://www.postgresql-archive.org/Logical-replication-CPU-bound-with-TRUNCATE-DROP-CREATE-many-tables-tt6155123.html
> > > with a fix being rolled out in 13.2.
> >
> > The fix for the problem discussed in the above threads is committed
> > only in PG-14, see [1]. I don't know what makes you think it is fixed
> > in 13.2. Also, it is not easy to back-patch that because this fix
> > depends on some of the infrastructure introduced in PG-14.
> >
> > [1] - 
> > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d7eb52d7181d83cf2363570f7a205b8eb1008dbc
>
> Hmm ... On what does it depend (other than plain git conflicts, which
> are aplenty)?  On a quick look to the commit, it's clear that we need to
> be careful in order not to cause an ABI break, but that doesn't seem
> impossible to solve, but I'm wondering if there is more to it than that.
>

As mentioned in the commit message, we need the changes from another
commit [1] to make this work.

[1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c55040ccd0

-- 
With Regards,
Amit Kapila.




Re: Replication slot stats misgivings

2021-04-27 Thread Amit Kapila
On Tue, Apr 27, 2021 at 11:02 AM vignesh C  wrote:
>
> On Tue, Apr 27, 2021 at 9:48 AM vignesh C  wrote:
> >
>
> Attached patch has the changes to update statistics during
> spill/stream which prevents the statistics from being lost during
> interrupt.
>

 void
-UpdateDecodingStats(LogicalDecodingContext *ctx)
+UpdateDecodingStats(ReorderBuffer *rb)

I don't think you need to change this interface because
reorderbuffer->private_data points to LogicalDecodingContext. See
StartupDecodingContext. Other than that there is a comment in the code
"Update the decoding stats at transaction prepare/commit/abort...".
This patch should extend that comment by saying something like
"Additionally we send the stats when we spill or stream the changes to
avoid losing them in case the decoding is interrupted."
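
For what it's worth, a minimal sketch of what I mean (not tested): the call
sites in reorderbuffer.c can keep the existing signature and just go through
private_data, e.g.

    /*
     * rb->private_data is set to the LogicalDecodingContext in
     * StartupDecodingContext, so the existing signature still works.
     */
    UpdateDecodingStats((LogicalDecodingContext *) rb->private_data);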


-- 
With Regards,
Amit Kapila.




Re: Skip temporary table schema name from explain-verbose output.

2021-04-27 Thread Amul Sul
On Tue, Apr 27, 2021 at 7:08 PM Bharath Rupireddy
 wrote:
>
> On Tue, Apr 27, 2021 at 6:59 PM Ashutosh Bapat
>  wrote:
> >
> > On Tue, Apr 27, 2021 at 12:23 PM Amul Sul  wrote:
> > > >
> > > > How about using an explain filter to replace the unstable text
> > > > pg_temp_3 to pg_temp_N instead of changing it in the core? Following
> > > > are the existing explain filters: explain_filter,
> > > > explain_parallel_append, explain_analyze_without_memory,
> > > > explain_resultcache, explain_parallel_sort_stats, explain_sq_limit.
> > > >
> > >
> > > Well, yes, eventually that will be the kludge. I was wondering: if the
> > > table is accessible in a query via the pg_temp schema, then why should we
> > > bother about printing the pg_temp_N schema name, which is an internal
> > > detail?
> >
> > Although only the associated session can access objects from that
> > schema, I think, the entries in pg_class have different namespace oids
> > and are accessible from other sessions. So knowing the actual schema
> > name is useful for debugging purposes. Using auto_explain, the explain
> > output goes to server log, where access to two temporary tables with
> > the same name from different sessions can be identified by the actual
> > schema name easily.
> >

Makes sense; we would lose the ability to differentiate temporary
tables in the auto_explain logs.

Regards,
Amul




Re: Replication slot stats misgivings

2021-04-27 Thread vignesh C
On Wed, Apr 28, 2021 at 8:59 AM Amit Kapila  wrote:
>
> On Tue, Apr 27, 2021 at 11:02 AM vignesh C  wrote:
> >
> > On Tue, Apr 27, 2021 at 9:48 AM vignesh C  wrote:
> > >
> >
> > Attached patch has the changes to update statistics during
> > spill/stream which prevents the statistics from being lost during
> > interrupt.
> >
>
>  void
> -UpdateDecodingStats(LogicalDecodingContext *ctx)
> +UpdateDecodingStats(ReorderBuffer *rb)
>
> I don't think you need to change this interface because
> reorderbuffer->private_data points to LogicalDecodingContext. See
> StartupDecodingContext. Other than that there is a comment in the code
> "Update the decoding stats at transaction prepare/commit/abort...".
> This patch should extend that comment by saying something like
> "Additionally we send the stats when we spill or stream the changes to
> avoid losing them in case the decoding is interrupted."

Thanks for the comments. Please find attached the v4 patch with the
fixes.

Regards,
Vignesh
From 533c45c4cbd4545350d464bbf7a2df91ee668a75 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Tue, 27 Apr 2021 10:56:02 +0530
Subject: [PATCH v4] Update replication statistics after every stream/spill.

Currently, replication slot statistics are updated at prepare, commit, and
rollback. Now, if the transaction is interrupted the stats might not get
updated. Fixed this by updating replication statistics after every
stream/spill.
---
 src/backend/replication/logical/decode.c| 14 --
 src/backend/replication/logical/reorderbuffer.c |  6 ++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7924581cdc..888e064ec0 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -746,9 +746,10 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	}
 
 	/*
-	 * Update the decoding stats at transaction prepare/commit/abort. It is
-	 * not clear that sending more or less frequently than this would be
-	 * better.
+	 * Update the decoding stats at transaction prepare/commit/abort.
+	 * Additionally we send the stats when we spill or stream the changes to
+	 * avoid losing them in case the decoding is interrupted. It is not clear
+	 * that sending more or less frequently than this would be better.
 	 */
 	UpdateDecodingStats(ctx);
 }
@@ -828,9 +829,10 @@ DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 	ReorderBufferPrepare(ctx->reorder, xid, parsed->twophase_gid);
 
 	/*
-	 * Update the decoding stats at transaction prepare/commit/abort. It is
-	 * not clear that sending more or less frequently than this would be
-	 * better.
+	 * Update the decoding stats at transaction prepare/commit/abort.
+	 * Additionally we send the stats when we spill or stream the changes to
+	 * avoid losing them in case the decoding is interrupted. It is not clear
+	 * that sending more or less frequently than this would be better.
 	 */
 	UpdateDecodingStats(ctx);
 }
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index c27f710053..ceb83bcbf9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -3551,6 +3551,9 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 		/* don't consider already serialized transactions */
 		rb->spillTxns += (rbtxn_is_serialized(txn) || rbtxn_is_serialized_clear(txn)) ? 0 : 1;
+
+		/* update the decoding stats */
+		UpdateDecodingStats(rb->private_data);
 	}
 
 	Assert(spilled == txn->nentries_mem);
@@ -3920,6 +3923,9 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	/* Don't consider already streamed transaction. */
 	rb->streamTxns += (txn_is_streamed) ? 0 : 1;
 
+	/* update the decoding stats */
+	UpdateDecodingStats(rb->private_data);
+
 	Assert(dlist_is_empty(&txn->changes));
 	Assert(txn->nentries == 0);
 	Assert(txn->nentries_mem == 0);
-- 
2.25.1



Re: Replication slot stats misgivings

2021-04-27 Thread Masahiko Sawada
On Wed, Apr 28, 2021 at 12:29 PM Amit Kapila  wrote:
>
> On Tue, Apr 27, 2021 at 11:02 AM vignesh C  wrote:
> >
> > On Tue, Apr 27, 2021 at 9:48 AM vignesh C  wrote:
> > >
> >
> > Attached patch has the changes to update statistics during
> > spill/stream which prevents the statistics from being lost during
> > interrupt.
> >
>
>  void
> -UpdateDecodingStats(LogicalDecodingContext *ctx)
> +UpdateDecodingStats(ReorderBuffer *rb)
>
> I don't think you need to change this interface because
> reorderbuffer->private_data points to LogicalDecodingContext. See
> StartupDecodingContext.

+1

With this approach, we could still miss the totalTxns and totalBytes
updates if decoding of a transaction that is large but still below
logical_decoding_work_mem is interrupted, right?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Replication slot stats misgivings

2021-04-27 Thread vignesh C
On Wed, Apr 28, 2021 at 9:37 AM Masahiko Sawada  wrote:
>
> On Wed, Apr 28, 2021 at 12:29 PM Amit Kapila  wrote:
> >
> > On Tue, Apr 27, 2021 at 11:02 AM vignesh C  wrote:
> > >
> > > On Tue, Apr 27, 2021 at 9:48 AM vignesh C  wrote:
> > > >
> > >
> > > Attached patch has the changes to update statistics during
> > > spill/stream which prevents the statistics from being lost during
> > > interrupt.
> > >
> >
> >  void
> > -UpdateDecodingStats(LogicalDecodingContext *ctx)
> > +UpdateDecodingStats(ReorderBuffer *rb)
> >
> > I don't think you need to change this interface because
> > reorderbuffer->private_data points to LogicalDecodingContext. See
> > StartupDecodingContext.
>
> +1
>
> With this approach, we could still miss the totalTxns and totalBytes
> updates if decoding of a transaction that is large but still below
> logical_decoding_work_mem is interrupted, right?

Yes, you are right. I felt that is reasonable, and that way it reduces
the frequency of calls to the stats collector to update the stats.

Regards,
Vignesh




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2021-04-27 Thread Amit Kapila
Tom Lane has raised a complaint on pgsql-committers [1] about one of
the commits related to this work [2]. The new buildfarm member wrasse is
showing a warning:

"/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
ReorderBufferProcessTXN

The warning is for the line:
curtxn->concurrent_abort = true;

Now, we can simply fix this warning by adding an if check like:
if (curtxn)
curtxn->concurrent_abort = true;

However, on further discussion, it seems that is not sufficient here
because the callbacks can throw that same error code
(ERRCODE_TRANSACTION_ROLLBACK) for a completely different scenario, in which
case we would set the concurrent_abort flag incorrectly. I think we need a
stronger check here to ensure that we set the concurrent_abort flag and do
the other things in that branch only when we are decoding non-committed
xacts. The idea I have is to additionally check that we are decoding a
streaming or prepared transaction (the same check as we have for setting
curtxn), or we can check whether CheckXidAlive is a valid transaction id.
What do you think?
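
To make the CheckXidAlive idea concrete, a rough sketch (illustration only,
not a tested patch; this would go in the PG_CATCH block of
ReorderBufferProcessTXN, where errdata and curtxn are already available):

    /*
     * Treat ERRCODE_TRANSACTION_ROLLBACK as a concurrent abort only while
     * we are decoding a not-yet-committed transaction, i.e. while
     * CheckXidAlive is a valid transaction id.
     */
    if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK &&
        TransactionIdIsValid(CheckXidAlive))
    {
        /* curtxn can still be NULL here, so keep the NULL check as well */
        if (curtxn)
            curtxn->concurrent_abort = true;
        ...
    }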

[1] - https://www.postgresql.org/message-id/2752962.1619568098%40sss.pgh.pa.us
[2] - 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb

-- 
With Regards,
Amit Kapila.




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2021-04-27 Thread Dilip Kumar
On Wed, Apr 28, 2021 at 11:00 AM Amit Kapila  wrote:
>
> Tom Lane has raised a complaint on pgsql-committers [1] about one of
> the commits related to this work [2]. The new buildfarm member wrasse is
> showing a warning:
>
> "/export/home/nm/farm/studio64v12_6/HEAD/pgsql.build/../pgsql/src/backend/replication/logical/reorderbuffer.c",
> line 2510: Warning: Likely null pointer dereference (*(curtxn+272)):
> ReorderBufferProcessTXN
>
> The warning is for the line:
> curtxn->concurrent_abort = true;
>
> Now, we can simply fix this warning by adding an if check like:
> if (curtxn)
> curtxn->concurrent_abort = true;
>
> However, on further discussion, it seems that is not sufficient here
> because the callbacks can throw that same error code
> (ERRCODE_TRANSACTION_ROLLBACK) for a completely different scenario, in which
> case we would set the concurrent_abort flag incorrectly. I think we need a
> stronger check here to ensure that we set the concurrent_abort flag and do
> the other things in that branch only when we are decoding non-committed
> xacts.

That makes sense.

> The idea I have is to additionally check that we are decoding a
> streaming or prepared transaction (the same check as we have for setting
> curtxn), or we can check whether CheckXidAlive is a valid transaction id.
> What do you think?

I think a check based on CheckXidAlive looks good to me.  That will also
protect against a similar error being raised from any other path, as
you mentioned above.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2021-04-27 Thread Thomas Munro
On Wed, Apr 28, 2021 at 3:56 AM Tom Lane  wrote:
> Of course Wikipedia has been known to contain errors, but now
> I'm inclined to think I blew this.  Anyone want to check my work?

I tried a couple of examples not from Wikipedia.  First, from the
definition of Julian days as used by astronomers[1], counting from
noon on 4713-01-01 BC Julian AKA 4714-11-24 BC Gregorian, days 0 and 1
look right with 'utc+12':

postgres=# select extract(julian from '4714-11-24 11:00:00+00
BC'::timestamptz at time zone 'utc+12');
ERROR:  timestamp out of range
postgres=# select extract(julian from '4714-11-24 12:00:00+00
BC'::timestamptz at time zone 'utc+12');
extract

 0.
(1 row)

postgres=# select extract(julian from '4714-11-25 11:00:00+00
BC'::timestamptz at time zone 'utc+12');
extract

 0.9583
(1 row)

postgres=# select extract(julian from '4714-11-25 12:00:00+00
BC'::timestamptz at time zone 'utc+12');
extract

 1.
(1 row)

Next I found a worked example in an aerospace textbook[2] and it agrees, too:

postgres=# select extract(julian from '2004-05-12
14:45:30+00'::timestamptz at time zone 'utc+12');
   extract
--
 2453138.11493056
(1 row)
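
As a quick sanity check of that last value by hand (my own arithmetic, not
taken from the textbook):

    14:45:30 UT is 2 h 45 min 30 s = 2.758333 h past noon
    2.758333 / 24 = 0.1149306
    2453138 + 0.1149306 = 2453138.1149306

which matches the value above.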

[1] 
http://curious.astro.cornell.edu/people-and-astronomy/125-observational-astronomy/timekeeping/calendars/763-how-was-the-starting-point-for-the-julian-date-system-chosen-advanced
[2] https://www.sciencedirect.com/topics/engineering/julian-day-number




pg_hba.conf.sample wording improvement

2021-04-27 Thread Peter Eisentraut
I propose the attached patch to shake up the wording in the connection 
type section of pg_hba.conf.sample a bit.  After the hostgssenc part was 
added on, the whole thing became a bit wordy, and it's also a bit 
inaccurate for example in that the current wording for "host" appears to 
say that it does not apply to GSS-encrypted connections.
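
For illustration, hypothetical entries using each of these connection types
could look like the following (databases, users, addresses and auth methods
are just placeholders):

    # TYPE        DATABASE  USER  ADDRESS        METHOD
    local         all       all                  peer
    host          all       all   10.0.0.0/8     scram-sha-256
    hostssl       all       all   0.0.0.0/0      scram-sha-256
    hostnossl     all       all   10.0.0.0/8     scram-sha-256
    hostgssenc    all       all   10.0.0.0/8     gss
    hostnogssenc  all       all   10.0.0.0/8     scram-sha-256
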
From dc64f4826c4dbf3bcd1cdadb1e9f351ce45f9074 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 28 Apr 2021 07:46:48 +0200
Subject: [PATCH] pg_hba.conf.sample: Reword connection type section

---
 src/backend/libpq/pg_hba.conf.sample | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/libpq/pg_hba.conf.sample 
b/src/backend/libpq/pg_hba.conf.sample
index b6de12b298..ead092ffab 100644
--- a/src/backend/libpq/pg_hba.conf.sample
+++ b/src/backend/libpq/pg_hba.conf.sample
@@ -18,12 +18,13 @@
 #
 # (The uppercase items must be replaced by actual values.)
 #
-# The first field is the connection type: "local" is a Unix-domain
-# socket, "host" is either a plain or SSL-encrypted TCP/IP socket,
-# "hostssl" is an SSL-encrypted TCP/IP socket, and "hostnossl" is a
-# non-SSL TCP/IP socket.  Similarly, "hostgssenc" uses a
-# GSSAPI-encrypted TCP/IP socket, while "hostnogssenc" uses a
-# non-GSSAPI socket.
+# The first field is the connection type:
+# - "local" is a Unix-domain socket
+# - "host" is a TCP/IP socket (encrypted or not)
+# - "hostssl" is an SSL-encrypted TCP/IP socket
+# - "hostnossl" is a non-SSL TCP/IP socket
+# - "hostgssenc" is a GSSAPI-encrypted TCP/IP socket
+# - "hostnogssenc" is a not GSSAPI-encrypted socket
 #
 # DATABASE can be "all", "sameuser", "samerole", "replication", a
 # database name, or a comma-separated list thereof. The "all"
-- 
2.31.1



Re: Replication slot stats misgivings

2021-04-27 Thread Amit Kapila
On Wed, Apr 28, 2021 at 9:37 AM Masahiko Sawada  wrote:
>
> On Wed, Apr 28, 2021 at 12:29 PM Amit Kapila  wrote:
> >
> > On Tue, Apr 27, 2021 at 11:02 AM vignesh C  wrote:
> > >
> > > On Tue, Apr 27, 2021 at 9:48 AM vignesh C  wrote:
> > > >
> > >
> > > Attached patch has the changes to update statistics during
> > > spill/stream which prevents the statistics from being lost during
> > > interrupt.
> > >
> >
> >  void
> > -UpdateDecodingStats(LogicalDecodingContext *ctx)
> > +UpdateDecodingStats(ReorderBuffer *rb)
> >
> > I don't think you need to change this interface because
> > reorderbuffer->private_data points to LogicalDecodingContext. See
> > StartupDecodingContext.
>
> +1
>
> With this approach, we could still miss the totalTxns and totalBytes
> updates if decoding of a transaction that is large but still below
> logical_decoding_work_mem is interrupted, right?
>

Right, but is there some simple way to avoid that? I see two
possibilities: (a) store the stats in ReplicationSlot and then send them at
ReplicationSlotRelease, but that will lead to an increase in shared
memory usage, which, as per the discussion above, we don't want; or (b)
send intermediate stats after decoding, say, every N changes, but for that
we would need to additionally compute the size of each change, which might
add some overhead.
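
To make (b) a bit more concrete, the sketch I have in mind is something like
the following (purely illustrative; the counter field and the threshold do
not exist today):

    /* hypothetical new ReorderBuffer field: int64 changesSinceStatsUpdate */

    static void
    MaybeUpdateDecodingStats(ReorderBuffer *rb)
    {
        if (++rb->changesSinceStatsUpdate >= STATS_UPDATE_EVERY_N_CHANGES)
        {
            UpdateDecodingStats(rb->private_data);
            rb->changesSinceStatsUpdate = 0;
        }
    }

though, as said, accounting the size of each change for totalBytes is the
part that might add overhead.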

I am not sure if any of these alternatives are a good idea. What do
you think? Do you have any other ideas for this?

-- 
With Regards,
Amit Kapila.




Re: pg_hba.conf.sample wording improvement

2021-04-27 Thread Laurenz Albe
On Wed, 2021-04-28 at 07:51 +0200, Peter Eisentraut wrote:
> I propose the attached patch to shake up the wording in the connection 
> type section of pg_hba.conf.sample a bit.  After the hostgssenc part was 
> added on, the whole thing became a bit wordy, and it's also a bit 
> inaccurate for example in that the current wording for "host" appears to 
> say that it does not apply to GSS-encrypted connections.

+1

Thanks for taking care of things like that.

Yours,
Laurenz Albe





RE: [BUG] "FailedAssertion" reported when streaming in logical replication

2021-04-27 Thread tanghy.f...@fujitsu.com
> I have modified the patch based on the above comments.

Thanks for your patch.
I tested again after applying your patch and the problem is fixed.

Regards
Tang