Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
Hi,


On Wed, Sep 9, 2020 at 9:43 AM Andrey M. Borodin 
wrote:

>
>
> > 9 сент. 2020 г., в 00:05, Heikki Linnakangas 
> написал(а):
> >
> > I've been reviewing the patch today. The biggest changes I've made have
> been in restructuring the code in gistbuild.c for readability, but there
> are a bunch of smaller changes throughout. Attached is what I've got so
> far, squashed into one patch.
> Thanks!
>
> > I'm continuing to review it, but a couple of questions so far:
> >
> > In the gistBuildCallback(), you're skipping the tuple if 'tupleIsAlive
> == false'. That seems fishy, surely we need to index recently-dead tuples,
> too. The normal index build path isn't skipping them either.
> That's an oversight.
> >
> > How does the 'sortsupport' routine interact with
> 'compress'/'decompress'? Which representation is passed to the comparator
> routine: the original value from the table, the compressed representation,
> or the decompressed representation? Do the comparetup_index_btree() and
> readtup_index() routines agree with that?
>
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before
> decompression.
> Darafei, please, correct me if I'm wrong. Also can you please provide link
> on PostGIS B-tree sorting functions?
>

We were expecting to reuse the btree opclass for this. That way the
btree_gist extension will become a lot thinner. :)

The core routine of the current sorting implementation is a Hilbert curve,
computed from the 2D center of a box and used for the abbreviated sort:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gbox.c#L893


All the btree functions are wrappers around gserialized_cmp which just adds
a bunch of tiebreakers that don't matter in practice:
https://github.com/postgis/postgis/blob/2a7ebd0111b02aed3aa24752aad0ba89aef5d431/liblwgeom/gserialized.c#L313

The base representation for the compressed index datatype is GIDX, which is
also a box. We can make the sort work on top of that instead of the original
representation.
Unfortunately there is no such thing as a "decompressed representation", as
the compression is lossy.
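To make the sort-key idea above concrete, here is a minimal sketch (not the PostGIS code linked above) of deriving a sortable key for a 2D box by quantizing its center onto a grid and mapping it along a Hilbert curve, the way an abbreviated sort key would be built. The function and parameter names (`box_sort_key`, `extent`) are made up for illustration.

```python
def xy2d(n, x, y):
    """Map grid cell (x, y) to its distance along a Hilbert curve.
    n is the grid size per side and must be a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the curve stays continuous
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

def box_sort_key(box, extent=1024.0, n=1024):
    """box = (xmin, ymin, xmax, ymax); quantize its center into an n x n
    grid over [0, extent) and return the Hilbert distance as the key."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    gx = min(n - 1, int(cx / extent * n))
    gy = min(n - 1, int(cy / extent * n))
    return xy2d(n, gx, gy)

boxes = [(500, 500, 510, 510), (0, 0, 1, 1), (10, 0, 11, 1)]
ordered = sorted(boxes, key=box_sort_key)
```

Sorting by such a key keeps spatially close boxes close in the sorted order, which is what makes the sorted GiST build produce a well-clustered index.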


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 10:54 AM Fujii Masao  wrote:
>
> On 2020/09/09 12:04, Amit Kapila wrote:
> >
> > No, before patch as well, if we can't read the DB entry say because
> > the file is corrupt, we return true and use timestamp of global stats
> > file and this is what is established by the original commit 187492b6.
> > So, if we consider that as correct then the functionality is something
> > like once we have read the timestamp of the global statfile, we use
> > that if we can't read the actual db entry for whatever reason. It
> > seems if we return false, the caller will call this function again in
> > the loop. Now, I see the point that if we can't read some part of the
> > file we should return false instead but not sure if that is helpful
> > from the perspective of the caller of
> > pgstat_read_db_statsfile_timestamp.
>
> When false is returned, the caller backend_read_statsfile() seems to
> request the stats collector to write a fresh stats data into the file,
> and then pgstat_read_db_statsfile_timestamp() can try to read the fresh
> file that may not be corrupted. So returning false in that case seems ok
> to me...
>

Hmm, the request to get fresh data is sent irrespective of true or
false. We send it to get the latest data if it is not present, and it
is not guaranteed that the request will lead to a write of the stats
file. So, I am not sure if we can override that with the corrupted-file
case; there is something to it, but if we really want to rely on that
mechanism we should explicitly send such a request on detection of a
corrupted file.
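The control flow being debated can be sketched as follows. This is a toy simulation, not the actual pgstat API — all names (`request_fresh_write`, `backend_read_statsfile`) are made up: when the reader detects a corrupt stats file it returns false, and the caller explicitly asks the collector for a fresh write before retrying, rather than relying on the ordinary "send me latest data" request that may not rewrite the file at all.

```python
class StatsCollector:
    def __init__(self):
        self.file = {"ts": 100, "db_entry": "corrupt"}  # simulated statfile
        self.requests = []

    def request_fresh_write(self, reason):
        # explicit request on corruption detection, as suggested above
        self.requests.append(reason)
        self.file = {"ts": 200, "db_entry": "ok"}       # collector rewrites

def read_db_statsfile_timestamp(statfile):
    """Return (ok, timestamp); ok is False if the db entry is unreadable."""
    if statfile["db_entry"] == "corrupt":
        return False, None
    return True, statfile["ts"]

def backend_read_statsfile(collector, max_tries=2):
    for _ in range(max_tries):
        ok, ts = read_db_statsfile_timestamp(collector.file)
        if ok:
            return ts
        # corruption detected: explicitly request a rewrite, then retry
        collector.request_fresh_write("corrupt statfile")
    raise RuntimeError("stats file still unreadable")
```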

-- 
With Regards,
Amit Kapila.




Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-09 Thread gkokolatos






‐‐‐ Original Message ‐‐‐
On Tuesday, 8 September 2020 16:49, John Naylor  
wrote:

> On Thu, Aug 27, 2020 at 9:39 AM gkokola...@pm.me wrote:
>
> > Hi all,
> > this minor patch is attempting to force the use of the tableam api in 
> > dbsize where ever it is required.
> > Apparently something similar was introduced for toast relations only. 
> > Intuitively it seems that the distinction between a table and a toast table 
> > is not needed.
>
> I suspect the reason is found in the comment for table_block_relation_size():
>
> -   If a table AM uses the various relation forks as the sole place where data
> -   is stored, and if it uses them in the expected manner (e.g. the actual 
> data
> -   is in the main fork rather than some other), it can use this 
> implementation
> -   of the relation_size callback rather than implementing its own.


Thank you for your answer and interest in the patch.

I agree with the comment above. However, I do not see why it is relevant here. 
When issuing:

SELECT pg_table_size('foo'::regclass);

I should not have to care about the on disk layout of the relation 'foo'.
Without this patch, one will get a correct result only when 'foo' is a heap 
table.
For custom layouts the result can potentially be wrong.



>
> --
> John Naylor https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>






Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-09 Thread gkokolatos






‐‐‐ Original Message ‐‐‐
On Tuesday, 8 September 2020 22:26, David Zhang  wrote:

>
>
> I found the function "table_relation_size" is only used by buffer
> manager for "RELKIND_RELATION", "RELKIND_TOASTVALUE" and
> "RELKIND_MATVIEW", i.e.
>
>         case RELKIND_RELATION:
>         case RELKIND_TOASTVALUE:
>         case RELKIND_MATVIEW:
>             {
>                 /*
>                  * Not every table AM uses BLCKSZ wide fixed size blocks.
>                  * Therefore tableam returns the size in bytes - but
> for the
>                  * purpose of this routine, we want the number of blocks.
>                  * Therefore divide, rounding up.
>                  */
>                 uint64        szbytes;
>
>                 szbytes = table_relation_size(relation, forkNum);
>
>                 return (szbytes + (BLCKSZ - 1)) / BLCKSZ;
>             }
>
> So using "calculate_relation_size" and "calculate_toast_table_size" in
> "calculate_table_size" is easy to understand and the original logic is
> simple.
>

You are correct. This is the logic that this patch attempts to apply
in dbsize.c.

So what do you think of the patch?
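The rounding in the bufmgr code quoted above is plain ceiling division: since a table AM may not use BLCKSZ-wide fixed-size blocks, `table_relation_size` returns bytes, and the caller converts to blocks by rounding up. A quick illustration of the idiom (the block size is a parameter precisely because a non-default AM may differ):

```python
BLCKSZ = 8192  # PostgreSQL's default block size

def size_in_blocks(szbytes, blcksz=BLCKSZ):
    # mirrors (szbytes + (BLCKSZ - 1)) / BLCKSZ in the C code:
    # any partial trailing block still counts as one block
    return (szbytes + blcksz - 1) // blcksz
```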




Re: shared-memory based stats collector

2020-09-09 Thread Kyotaro Horiguchi
At Tue, 08 Sep 2020 17:55:57 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 08 Sep 2020 17:01:47 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost  wrote 
> > in> > I've looked through (some of) this thread and through the patches also
> > > and hope to provide a review of the bits that should be targetting v14
> > > (unlike the above) soon.
> > 
> > Thanks. Current the patch is found to lead to write contention heavier
> > than the current stats collector when nearly a thousand backend
> > heavily write to the same table. I need to address that.
> > 
> > - Collect stats via sockets (in the same way as the current implement)
> >   and write/read to/from shared memory.
> > 
> > - Use our own lock-free (maybe) ring-buffer before stats-write enters
> >   lock-waiting mode, then new stats collector(!) process consumes the
> >   queue.
> > 
> > - Some other measures..

- Make dshash searches less frequent. To find the actual source of the
  contention, we're going to measure performance with the attached patch
  (applied on top of the latest patch), which lets sessions cache the
  results of dshash searches for the session lifetime. (A table-dropping
  vacuum clears the local hash.)
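The idea in the attached experimental patch can be modeled roughly like this (a simulation, not the C code): each backend caches the results of shared-hash (dshash) lookups in a local hash, and throws the whole local cache away whenever the shared deletion counter (`del_ent_count` in the patch) has advanced, since cached entries may then be stale.

```python
class SharedStats:
    def __init__(self):
        self.entries = {}          # (type, dbid, objid) -> stats entry
        self.del_ent_count = 0     # bumped whenever entries are deleted
        self.lookups = 0           # counts locked lookups, for observation

    def lookup(self, key):
        self.lookups += 1          # stands in for taking the dshash lock
        return self.entries.get(key)

    def delete(self, key):
        self.entries.pop(key, None)
        self.del_ent_count += 1

class Backend:
    def __init__(self, shared):
        self.shared = shared
        self.cache = {}
        self.cache_age = shared.del_ent_count

    def get_stat_entry(self, key):
        if self.cache_age != self.shared.del_ent_count:
            self.cache = {}        # entries may have been dropped: reset
            self.cache_age = self.shared.del_ent_count
        if key in self.cache:
            return self.cache[key]  # no shared lock taken
        ent = self.shared.lookup(key)
        if ent is not None:
            self.cache[key] = ent
        return ent
```

Repeated lookups of the same entry then avoid the dshash lock entirely until something is deleted, which is why this is a plausible way to reduce the write contention described above.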

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0ce7b7a891326a3fdcf85b001586f721bef569f9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 9 Sep 2020 16:44:57 +0900
Subject: [PATCH v36 8/8] Experimental local cache for dshash.

This patch suffers more LWLock contention than unpatched while very many
sessions commit frequently. As an effort to identify the source, this
patch caches dshash information in a local hash to avoid dshash locks.
---
 src/backend/postmaster/pgstat.c | 67 +
 1 file changed, 67 insertions(+)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b07add0a4f..caa067e9fe 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -35,6 +35,7 @@
 #include "access/xact.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_proc.h"
+#include "common/hashfn.h"
 #include "libpq/libpq.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -112,6 +113,8 @@ typedef struct StatsShmemStruct
dsa_pointer slru_stats; /* Ditto for SLRU stats */
int refcount;   /* # of processes that 
is attaching the shared
 * stats memory 
*/
+   pg_atomic_uint64 del_ent_count; /* # of entries deleted. not protected 
by
+  
StatsLock */
 }  StatsShmemStruct;
 
 /* BgWriter global statistics counters */
@@ -254,6 +257,7 @@ typedef struct PgStatSharedSLRUStats
 typedef struct PgStatLocalHashEntry
 {
PgStatHashEntryKey key; /* hash key */
+   char   status;  /* for simplehash use */
PgStatEnvelope *env;/* pointer to stats envelope in heap */
 }  PgStatLocalHashEntry;
 
@@ -280,6 +284,18 @@ static const dshash_parameters dsh_rootparams = {
LWTRANCHE_STATS
 };
 
+/* define hashtable for dshash caching */
+#define SH_PREFIX pgstat_cache
+#define SH_ELEMENT_TYPE PgStatLocalHashEntry
+#define SH_KEY_TYPE PgStatHashEntryKey
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) hash_bytes((unsigned char *)&key, 
sizeof(PgStatHashEntryKey))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(PgStatHashEntryKey)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
+
 /* The shared hash to index activity stats entries. */
 static dshash_table *pgStatSharedHash = NULL;
 
@@ -326,6 +342,8 @@ typedef struct TwoPhasePgStatRecord
 } TwoPhasePgStatRecord;
 
 /* Variables for backend status snapshot */
+static pgstat_cache_hash *pgStatEntHash = NULL;
+static int  pgStatEntHashAge = 0;
 static MemoryContext pgStatLocalContext = NULL;
 static MemoryContext pgStatSnapshotContext = NULL;
 static HTAB *pgStatSnapshotHash = NULL;
@@ -437,6 +455,7 @@ StatsShmemInit(void)
Assert(!found);
 
StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+   pg_atomic_init_u64(&StatsShmem->del_ent_count, 0);
}
 }
 
@@ -1432,6 +1451,9 @@ pgstat_vacuum_stat(void)
 
delete_stat_entry(p->type, p->databaseid, p->objectid, true);
}
+
+   if (nvictims > 0)
+   pg_atomic_add_fetch_u64(&StatsShmem->del_ent_count, 1);
 }
 
 
@@ -1493,6 +1515,8 @@ pgstat_drop_database(Oid databaseid)
delete_stat_entry((*p)->type, (*p)->databaseid, (*p)->objectid, 
true);
 
pfree(envlist);
+
+   pg_atomic_add_fetch_u64(&StatsShmem->del_ent_count, 1);
 }
 
 
@@ -5119,6 +5143,7 @@ get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
 {
boolcreate = (initfunc != NULL);
PgSta

Re: Improving connection scalability: GetSnapshotData()

2020-09-09 Thread Ian Barwick

On 2020/09/08 13:23, Ian Barwick wrote:

On 2020/09/08 13:11, Andres Freund wrote:

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

(...)

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

 *** a/src/backend/access/transam/xact.c
 --- b/src/backend/access/transam/xact.c
 *** xact_redo_commit(xl_xact_parsed_commit *
 *** 5915,5920 
 --- 5915,5924 
  */
 if (XactCompletionApplyFeedback(parsed->xinfo))
 XLogRequestWalReceiverReply();
 +
 +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 +   ShmemVariableCache->xactCompletionCount++;
 +   LWLockRelease(ProcArrayLock);
   }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).


We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).


Yup.


At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?


Sure, I'll give it a go as I have some time right now.



Attached, though bear in mind I'm not very familiar with parts of this,
particularly 2PC stuff, so consider it educated guesswork.
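A toy model of why the redo paths need to bump the counter (a simulation, not the actual GetSnapshotData() code): a backend reuses its cached snapshot as long as `xactCompletionCount` is unchanged, so every replay path that completes a transaction — commit, abort, 2PC — must advance the counter, or a standby backend would keep serving a stale snapshot.

```python
class Standby:
    def __init__(self):
        self.xact_completion_count = 0
        self.committed = set()
        self.cached_snapshot = None
        self.cached_count = -1

    def redo_commit(self, xid):
        self.committed.add(xid)
        self.xact_completion_count += 1   # the fix: advance during replay

    def get_snapshot(self):
        if (self.cached_snapshot is not None
                and self.cached_count == self.xact_completion_count):
            return self.cached_snapshot   # fast path: nothing completed
        self.cached_snapshot = frozenset(self.committed)
        self.cached_count = self.xact_completion_count
        return self.cached_snapshot
```

If `redo_commit` forgot the increment, the second `get_snapshot()` call after a replayed commit would return the old snapshot — exactly the bug the existing tests missed because each psql invocation started with no cached snapshot.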


Regards


Ian Barwick

--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
commit 544e2b1661413fe08e3083f03063c12c0d7cf3aa
Author: Ian Barwick 
Date:   Tue Sep 8 12:24:14 2020 +0900

Fix snapshot caching on standbys

Addresses issue introduced in 623a9ba.

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..227d03bbce 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3320,6 +3320,14 @@ multixact_redo(XLogReaderState *record)
 	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 Datum
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index af6afcebb1..04ca858918 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5915,6 +5915,14 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
 	 */
 	if (XactCompletionApplyFeedback(parsed->xinfo))
 		XLogRequestWalReceiverReply();
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 /*
@@ -5978,6 +5986,14 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
 
 	/* Make sure files supposed to be dropped are dropped */
 	DropRelationFiles(parsed->xnodes, parsed->nrels, true);
+
+	/*
+	 * Advance xactCompletionCount so rebuilds of snapshot contents
+	 * can be triggered.
+	 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->xactCompletionCount++;
+	LWLockRelease(ProcArrayLock);
 }
 
 void


Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-09 Thread Peter Eisentraut

On 2020-09-07 01:46, Tom Lane wrote:

Peter Eisentraut  writes:

Here is a new patch series version.
I have created a new internal function for converting integers to
numeric, to make the implementation a bit more elegant and compact.


I reviewed the 0002 patch, finding one bug (in int8_sum)


Ouch, no test coverage.  Should we perhaps remove this function, since 
it's obsolete and unused?



and a few
more calls of int8_numeric that could be converted.  I think the
attached updated version is committable, and I'd recommend going
ahead with that regardless of the rest of this.  I hadn't realized
how many random calls of int8_numeric and int4_numeric we'd grown,
but there are a lot, so this is nice cleanup.


Yes, please go ahead with it.


I continue to think that we can't commit 0003 in this form, because
of the breakage that will ensue in stored views.  As I said upthread,
we should leave the existing SQL-exposed functions alone, invent
new ones that return numeric, and alter the parser to translate
EXTRACT constructs to the new functions.  This approach would also
provide an "out" for anyone who does complain about the performance
cost --- they can just continue to use the old functions.


Okay, I will continue looking into this.
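The precision problem behind this thread's subject line can be illustrated with Python floats, which are IEEE doubles like float8: near epoch 1,000,000,000 (i.e. '2001-09-09 01:46:40') the spacing between adjacent doubles is about 1.2e-7 seconds — the same order of magnitude as one microsecond — so a float8 epoch can no longer represent microsecond timestamps exactly, while an arbitrary-precision type (numeric; `Decimal` here as a stand-in) has no such limit.

```python
import math
from decimal import Decimal

# spacing between adjacent doubles at 1e9: about 1.19e-7 seconds
ulp_at_1e9 = math.ulp(1.0e9)

# exact microsecond arithmetic with a numeric-like type
exact = Decimal("1000000000.000001") - Decimal("1000000000")

# what the double actually stores for the same literal: Decimal(float)
# expands the exact binary value, which is not the decimal we wrote
inexact = Decimal(1000000000.000001)
```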

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: extension patch of CREATE OR REPLACE TRIGGER

2020-09-09 Thread Peter Smith
On Tue, Sep 8, 2020 at 9:36 PM osumi.takami...@fujitsu.com
 wrote:
>
> I've fixed all except one point.

Thanks for addressing my previous review comments in your new v09 patch.

Those are fixed OK now, but I found 2 new review points.



COMMENT trigger.c (typo)

+ ereport(ERROR,
+ (errcode(ERRCODE_DUPLICATE_OBJECT),
+ errmsg("trigger \"%s\" for relation \"%s\" is a constraint trigger",
+ stmt->trigname, RelationGetRelationName(rel)),
+ errhint("use CREATE OR REPLACE CONSTRAINT TRIGGER to replace a
costraint trigger")));


Typo in the errhint text.
"costraint" -> "constraint"



COMMENT create_trigger.sgml (add more help?)

I noticed that the CREATE OR REPLACE FUNCTION help [1] describes the
OR REPLACE syntax ("Description" section) and also mentions some of
the restrictions when using REPLACE ("Notes" section).
[1] - https://www.postgresql.org/docs/current/sql-createfunction.html

~~
OTOH this trigger patch adds almost nothing to the trigger help.

Shouldn't the "Description" at least say something like:
"CREATE OR REPLACE will either create a new trigger, or replace an
existing definition."

Shouldn't the "Notes" include information about restrictions when
using OR REPLACE
e.g. cannot replace triggers with triggers of a different kind
e.g. cannot replace triggers with pending events

What do you think?



Kind Regards,
Peter Smith.
Fujitsu Australia




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-09 Thread Tomas Vondra

Hi,

while looking at the streaming code I noticed two minor issues:

1) logicalrep_read_stream_stop is never defined/called, so the prototype
in logicalproto.h is unnecessary

2) minor typo in one of the comments

Patch attached.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
diff --git a/src/backend/replication/pgoutput/pgoutput.c 
b/src/backend/replication/pgoutput/pgoutput.c
index c29c088813..343f03129f 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -77,7 +77,7 @@ static void send_relation_and_attrs(Relation relation, 
TransactionId xid,
  * and with streamed transactions the commit order may be different from
  * the order the transactions are sent in. Also, the (sub) transactions
  * might get aborted so we need to send the schema for each (sub) transaction
- * so that we don't loose the schema information on abort. For handling this,
+ * so that we don't lose the schema information on abort. For handling this,
  * we maintain the list of xids (streamed_txns) for those we have already sent
  * the schema.
  *
diff --git a/src/include/replication/logicalproto.h 
b/src/include/replication/logicalproto.h
index 53905ee608..607a728508 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -133,7 +133,6 @@ extern void logicalrep_write_stream_start(StringInfo out, 
TransactionId xid,
 extern TransactionId logicalrep_read_stream_start(StringInfo in,

  bool *first_segment);
 extern void logicalrep_write_stream_stop(StringInfo out);
-extern TransactionId logicalrep_read_stream_stop(StringInfo in);
 extern void logicalrep_write_stream_commit(StringInfo out, ReorderBufferTXN 
*txn,

   XLogRecPtr commit_lsn);
 extern TransactionId logicalrep_read_stream_commit(StringInfo out,


Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Andrey V. Lepikhov

On 9/8/20 8:34 PM, Alexey Kondratov wrote:

On 2020-09-08 17:00, Amit Langote wrote:

 wrote:

On 2020-09-08 10:34, Amit Langote wrote:
Another ambiguous part of the refactoring was in changing
InitResultRelInfo() arguments:

@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be
better to set resultRelInfo->ri_usesMultiInsert in the
InitResultRelInfo() unconditionally like it is done for
ri_usesFdwDirectModify? And after that it will be up to the caller
whether to use multi-insert or not based on their own circumstances.
Otherwise now we have a flag to indicate that we want to check for
another flag, while this check doesn't look costly.


Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then
execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?


No, you're right. If someone wants to share state and use ResultRelInfo 
(RRI) for that purpose, then it's fine, but CopyFrom() may simply 
override RRI->ri_usesMultiInsert if needed and pass this RRI further.


This is how it's done for RRI->ri_usesFdwDirectModify. 
InitResultRelInfo() initializes it to false and then 
ExecInitModifyTable() changes the flag if needed.


Probably this is just a matter of personal choice, but for me the 
current implementation with additional argument in InitResultRelInfo() 
doesn't look completely right. Maybe because a caller now should pass an 
additional argument (as false) even if it doesn't care about 
ri_usesMultiInsert at all. It also adds additional complexity and feels 
like abstractions leaking.
At first I didn't see what the problem was, so I prepared a patch version 
according to Alexey's suggestion (see Alternate.patch).
But this does not seem very convenient and will lead to errors in the 
future. So, I agree with Amit.
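The design point under discussion can be sketched like this (a loose Python model, not the executor code; the helper names are made up): if the multi-insert decision lives in InitResultRelInfo(), a partition initialized later (in ExecInitPartitionInfo()) sees the same answer as its parent, whereas a CopyFrom()-local flag would be invisible there.

```python
class ResultRelInfo:
    def __init__(self, relation, caller_allows_multi_insert):
        self.relation = relation
        # the decision is made centrally, from relation properties plus
        # the caller's intent, so there is a single source of truth
        self.ri_usesMultiInsert = (
            caller_allows_multi_insert
            and not relation.get("has_before_insert_row_trig", False)
            and not relation.get("is_foreign_without_copy", False)
        )

def init_partition_info(parent_rri, partition):
    # a partition inherits the caller's intent from the parent's flag;
    # no separate CopyFrom()-local variable needs to be consulted here
    return ResultRelInfo(partition, parent_rri.ri_usesMultiInsert)
```

With two flags (a ResultRelInfo field plus a caller-local variable), this partition path would have to know about both — which is the bug-prone scenario Amit describes.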


--
regards,
Andrey Lepikhov
Postgres Professional
>From 73705843d300ad1016384e6cb8893c80246372a6 Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Mon, 24 Aug 2020 15:08:37 +0900
Subject: [PATCH 1/2] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly better readability.
---
 src/backend/commands/copy.c  | 189 +--
 src/backend/commands/tablecmds.c |   1 +
 src/backend/executor/execMain.c  |  40 +
 src/backend/executor/execPartition.c |   7 +
 src/backend/replication/logical/worker.c |   1 +
 src/include/nodes/execnodes.h|   9 +-
 6 files changed, 127 insertions(+), 120 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..94f6e71a94 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -85,16 +85,6 @@ typedef enum EolType
 	EOL_CRNL
 } EolType;
 
-/*
- * Represents the heap insert method to be used during COPY FROM.
- */
-typedef enum CopyInsertMethod
-{
-	CIM_SINGLE,	/* use table_tuple_insert or fdw routine */
-	CIM_MULTI,	/* always use table_multi_insert */
-	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
-} CopyInsertMethod;
-
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2715,12 +2705,10 @@ CopyFrom(CopyState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 
@@ -2833,6 +2821,57 @@ CopyFrom(CopyState cstate)
 	  0);
 	target_resultRelInfo = resultRelInfo;
 
+	/*
+	 * It's ge

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra
 wrote:
>
> Hi,
>
> while looking at the streaming code I noticed two minor issues:
>
> 1) logicalrep_read_stream_stop is never defined/called, so the prototype
> in logicalproto.h is unnecessary
>
> 2) minor typo in one of the comments
>
> Patch attached.
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-09 Thread Dilip Kumar
On Wed, Sep 9, 2020 at 2:13 PM Tomas Vondra 
wrote:

> Hi,
>
> while looking at the streaming code I noticed two minor issues:
>
> 1) logicalrep_read_stream_stop is never defined/called, so the prototype
> in logicalproto.h is unnecessary
>
>
Yeah, right.


> 2) minor typo in one of the comments
>
> Patch attached.
>

 Looks good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Online checksums verification in the backend

2020-09-09 Thread Julien Rouhaud
On Mon, Sep 07, 2020 at 05:50:38PM +0900, Michael Paquier wrote:
> On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote:
> > Did you mean creating a new checksumfuncs.c file? I don't find any
> > such file in the current tree.
> 
> Your patch adds checksumfuncs.c, so the subroutines grabbing a given
> block could just be moved there.
> 

Sorry, I was in the middle of a rebase for another patch and missed the new
files added in this one.  I added a new checksumfuncs.h for the required
include that should not be seen by client code.  I kept checksumfuncs.c and
checksums.c so that the SQL-visible declarations are separated from the rest
of the implementation, as this is what we already do elsewhere, I think.  If
that's a problem I'll change it and put everything in checksumfuncs.[ch].

I also moved the TAP tests to src/test/modules and renamed the test file to
use a three-digit prefix.  For the record, I initially copied
src/test/modules/brin, which is apparently the only subdirectory that uses a
two-digit pattern.

I also added a new WAIT_EVENT_CHECK_DELAY wait event.

> > I'm not sure I understand.  Unless I missed something this approach
> > *cannot* raise a false positive.  What it does is force a 2nd check
> > with stronger lock *to make sure it's actually a corruption*, so we
> > don't raise false positive.  The only report that can happen in this
> > 1st loop is if smgread raises an error, which AFAICT can only happen
> > (at least with mdread) if the whole block couldn't be read, which is a
> > sign of a very bad problem.  This should clearly be reported, as this
> > cannot be caused by the locking heuristics used here.
> 
> We don't know how much this optimization matters though?  Could it be
> possible to get an idea of that?  For example, take the case of one
> relation with a fixed size in a read-only workload and a read-write
> workload (as long as autovacuum and updates make the number of
> relation blocks rather constant for the read-write case), doing a
> checksum verification in parallel of multiple clients working on the
> relation concurrently.  Assuming that the relation is fully in the OS
> cache, we could get an idea of the impact with multiple
> (shared_buffers / relation size) rates to make the eviction more
> aggressive?  The buffer partition locks, knowing that
> NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it
> seems to me that it would be good to see if we have a difference.
> What do you think?

I assumed that the odds of having to check the buffer twice were low enough,
and that avoiding holding a bufmapping lock while doing some IO was an
uncontroversial enough optimisation, but maybe that's only wishful thinking.

I'll do some benchmarking and see if I can get some figures, but it'll
probably take some time.  In the meantime I'm attaching v11 of the patch,
which should address all other comments.
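The locking heuristic being defended above — and why it cannot raise false positives — can be modeled like this (a simulation with a CRC standing in for the page checksum, not the patch's C code): read the block optimistically without the buffer-mapping lock; only if the checksum fails, re-read under the lock and report corruption only when the failure persists. A concurrent write can make the first, unlocked read fail, but never the locked one.

```python
import zlib

def checksum_ok(page):
    data, stored = page
    return zlib.crc32(data) == stored

def check_block(read_unlocked, read_locked):
    """read_unlocked/read_locked are callables returning a (data, checksum)
    page image; the locked read is guaranteed a consistent image."""
    if checksum_ok(read_unlocked()):
        return None                      # fast path, no lock taken
    page = read_locked()                 # rare path: recheck under the lock
    if checksum_ok(page):
        return None                      # first read was torn/in-flight
    return "corrupt"                     # persists under lock: real corruption
```

The benchmarking question raised above is about the fast path: how much is gained by not holding the bufmapping lock across the (common, clean) first read.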
>From 08b28331e11160ab842b7b72a61c4478e7b3561a Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v11] Add a pg_check_relation() function.

This functions checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  51 +++
 src/backend/postmaster/pgstat.c   |   3 +
 src/backend/storage/page/checksum.c   | 322 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 218 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/pgstat.h  |   3 +-
 src/include/utils/checksumfuncs.h |  31 ++
 src/include/utils/guc_tables.h|   1 +
 src/test/Makefile |   3 +-
 src/test/modules/Makefile |   1 +
 src/test/modules/check_relation/.gitignore|   2 +
 src/test/modules/check_relation/Makefile  |  14 +
 src/test/modules/check_relation/README|  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++
 20 files changed, 1098 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Andrey V. Lepikhov

Version 8 split into two patches (in accordance with Amit suggestion).
Also I eliminate naming inconsistency (thanks to Alexey).
Based on master, f481d28232.

--
regards,
Andrey Lepikhov
Postgres Professional
>From 21b11f4ec0bec71bc7226014ef15c58dee9002da Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Mon, 24 Aug 2020 15:08:37 +0900
Subject: [PATCH 1/2] Move multi-insert decision logic into executor

When 0d5f05cde introduced support for using multi-insert mode when
copying into partitioned tables, it introduced a single variable of
enum type CopyInsertMethod shared across all potential target
relations (partitions) that, along with some target relation
properties, dictated whether to engage multi-insert mode for a given
target relation.

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly better readability.
---
 src/backend/commands/copy.c  | 186 ---
 src/backend/commands/tablecmds.c |   1 +
 src/backend/executor/execMain.c  |  49 ++
 src/backend/executor/execPartition.c |   3 +-
 src/backend/replication/logical/worker.c |   2 +-
 src/include/executor/executor.h  |   1 +
 src/include/nodes/execnodes.h|   9 +-
 7 files changed, 129 insertions(+), 122 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..4e63926cb7 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -85,16 +85,6 @@ typedef enum EolType
 	EOL_CRNL
 } EolType;
 
-/*
- * Represents the heap insert method to be used during COPY FROM.
- */
-typedef enum CopyInsertMethod
-{
-	CIM_SINGLE,	/* use table_tuple_insert or fdw routine */
-	CIM_MULTI,	/* always use table_multi_insert */
-	CIM_MULTI_CONDITIONAL		/* use table_multi_insert only if valid */
-} CopyInsertMethod;
-
 /*
  * This struct contains all the state variables used throughout a COPY
  * operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2715,12 +2705,11 @@ CopyFrom(CopyState cstate)
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			ti_options = 0; /* start with default options for insert */
 	BulkInsertState bistate = NULL;
-	CopyInsertMethod insertMethod;
+	bool		use_multi_insert;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
-	bool		leafpart_use_multi_insert = false;
 
 	Assert(cstate->rel);
 
@@ -2820,6 +2809,52 @@ CopyFrom(CopyState cstate)
 		ti_options |= TABLE_INSERT_FROZEN;
 	}
 
+	/*
+	 * It's generally more efficient to prepare a bunch of tuples for
+	 * insertion and insert them in bulk, for example with one
+	 * table_multi_insert() call, than to call table_tuple_insert()
+	 * separately for every tuple. However, there are a number of reasons
+	 * why we might not be able to do this.  We check some conditions below,
+	 * while some other target relation properties are left for
+	 * InitResultRelInfo() to check, because they must also be checked for
+	 * partitions, which are initialized later.
+	 */
+	if (cstate->volatile_defexprs || list_length(cstate->attnumlist) == 0)
+	{
+		/*
+		 * Can't support multi-inserts if the command copies no columns or
+		 * if there are any volatile default expressions in the table.
+		 * Similarly to the trigger case above, such expressions may query
+		 * the table we're inserting into.
+		 *
+		 * Note: It does not matter if any partitions have any volatile
+		 * default expressions as we use the defaults from the target of the
+		 * COPY command.
+		 */
+		use_multi_insert = false;
+	}
+	else if (contain_volatile_functions(cstate->whereClause))
+	{
+		/*
+		 * Can't support multi-inserts if there are any volatile function
+		 * expressions in WHERE clause.  Similarly to the trigger case above,
+		 * such expressions may query the table we're inserting into.
+		 */
+		use_multi_insert = false;
+	}
+	else
+	{
+		/*
+		 * Looks okay to try multi-insert, but that may change once we
+		 * check a few more properties in InitResultRelInfo().
+		 *
+		 * For partitioned tables, whether or not to use multi-insert depends
+		 * on the individual partition's properties, which are also checked in
+		 * InitResultRelInfo().
+		use_multi_insert = true;
+	}
+
 	/*
 	 * We need a ResultRelInfo so we can use the regular executor's
 	 * index-entry-making machinery.  (There used to be a huge amount of code
@@ -2830,6 +2865,7 @@ CopyFrom(CopyState cstate)
 	  cstate->rel,
 	  1,		/* must match rel's position in range_table */
 	  NULL,
+	  use_multi_insert,
 	  0);
 	target_resultRelInfo = resultRelInfo;
 
@@ -2854,10 +2890,14 @@ CopyFrom(CopyState cstate)
 	mts

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 11:45, Andrey V. Lepikhov wrote:

On 9/8/20 8:34 PM, Alexey Kondratov wrote:

On 2020-09-08 17:00, Amit Langote wrote:

 wrote:

On 2020-09-08 10:34, Amit Langote wrote:
Another ambiguous part of the refactoring was in changing
InitResultRelInfo() arguments:

@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo 
*resultRelInfo,

  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be
better to set resultRelInfo->ri_usesMultiInsert in the
InitResultRelInfo() unconditionally like it is done for
ri_usesFdwDirectModify? And after that it will be up to the caller
whether to use multi-insert or not based on their own circumstances.
Otherwise now we have a flag to indicate that we want to check for
another flag, while this check doesn't look costly.


Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then
execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?


No, you're right. If someone wants to share state and use 
ResultRelInfo (RRI) for that purpose, then it's fine, but CopyFrom() 
may simply override RRI->ri_usesMultiInsert if needed and pass this 
RRI further.


This is how it's done for RRI->ri_usesFdwDirectModify. 
InitResultRelInfo() initializes it to false and then 
ExecInitModifyTable() changes the flag if needed.


Probably this is just a matter of personal choice, but for me the 
current implementation with additional argument in InitResultRelInfo() 
doesn't look completely right. Maybe because a caller now should pass 
an additional argument (as false) even if it doesn't care about 
ri_usesMultiInsert at all. It also adds additional complexity and 
feels like abstractions leaking.

I didn't see what the problem was, and prepared a patch version
according to Alexey's suggestion (see Alternate.patch).


Yes, that's very close to what I've meant.

+	leaf_part_rri->ri_usesMultiInsert = (leaf_part_rri->ri_usesMultiInsert &&
+   rootResultRelInfo->ri_usesMultiInsert) ? true : false;

This could be just:

+	leaf_part_rri->ri_usesMultiInsert = (leaf_part_rri->ri_usesMultiInsert &&
+   rootResultRelInfo->ri_usesMultiInsert);


This does not seem very convenient and will lead to errors in the
future. So, I agree with Amit.


And InitResultRelInfo() may set ri_usesMultiInsert to false by default, 
since it's used only by COPY now. Then you won't need this in several 
places:


+   resultRelInfo->ri_usesMultiInsert = false;

While the logic of turning multi-insert on with all the validations 
required could be factored out of InitResultRelInfo() to a separate 
routine.


Anyway, I don't insist at all and think it's fine to stick to the 
original v7's logic.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila  wrote:

> On Tue, Sep 8, 2020 at 7:03 PM Magnus Hagander 
> wrote:
> >
> > On Tue, Sep 8, 2020 at 3:11 PM Fujii Masao 
> wrote:
> >>
> >>
> >>
> >> On 2020/09/08 19:28, Magnus Hagander wrote:
> >> >
> >> >
> >> > On Tue, Sep 8, 2020 at 8:10 AM Amit Kapila  > wrote:
> >> >
> >> > We use the timestamp of the global statfile if we are not able to
> >> > determine it for a particular database either because the entry
> for
> >> > that database doesn't exist or there is an error while reading the
> >> > specific database entry. This was not taken care of while reading
> >> > other entries like ArchiverStats or SLRUStats. See
> >> > pgstat_read_db_statsfile_timestamp. I have observed this while
> >> > reviewing Sawada-San's patch related to logical replication stats
> [1].
> >> >
> >> > I think this can only happen if due to some reason the statfile
> got
> >> > corrupt or we
> >> > have some bug in stats writing code, the chances of both are rare
> and even
> >> > if that happens we will use stale statistics.
> >> >
> >> > The attached patch by Sawada-San will fix this issue. As the chances
> >> > of this problem are rare and nobody has reported any issue related
> >> > to this, it might be okay not to backpatch it.
> >> >
> >> > Thoughts?
> >> >
> >> >
> >> > Why are we reading the archiver stats and SLRU stats in
> "pgstat_read_db_statsfile_timestamp" in the first place?
> >>
> >> Maybe because they are written before database stats entries? That is,
> >> to read the target database stats entry and get the timestamp from it,
> >> we need to read (or lseek) all the global stats entries written before
> them.
> >>
> >
> > Oh meh. Yeah, I'm reading this thing completely wrong :/ Ignore my
> comments :)
> >
> >
> >
> >> > That seems well out of scope for that function.
> >> >
> >> > If nothing else the comment at the top of that function is out of
> touch with reality. We do seem to read it into local buffers and then
> ignore the contents. I guess we're doing it just to verify that it's not
> corrupt -- so perhaps the function should actually have a different name
> than it does now, since it certainly seems to do more than actually read
> the statsfile timestamp.
> >> >
> >> > Specifically in this patch it looks wrong -- in the case of say the
> SLRU stats being corrupt, we will now return the timestamp of the global
> stats file even if there is one existing for the database requested, which
> definitely breaks the contract of the function.
> >>
> >> Yes.
> >> We should return false when fread() for database entry fails, instead?
> That is,
> >>
> >> 1. If corrupted stats file is found, the function always returns false.
> >> 2. If the file is not corrupted and the target database entry is found,
> its timestamp is returned.
> >> 3. If the file is not corrupted and the target is NOT found, the
> timestamp of global entry is returned
> >
> >
> > Yeah, with more coffee and re-reading it, I'm not sure how we could have
> the database stats being OK if the archiver or slru stats are not.
> >
> > That said, I think it still makes sense to return false if the stats
> file is corrupt. How much can we trust the first block, if the block right
> after it is corrupt? So yeah, I think your described order seems correct.
> But that's also what the code already did before this patch, isn't it?
> >
>
> No, before patch as well, if we can't read the DB entry say because
> the file is corrupt, we return true and use timestamp of global stats
> file and this is what is established by the original commit 187492b6.
>

Uh, the patch changes:
-   return false;
+   return true;

In the "codepath of corruption". After also setting the value.

So AFAICT before the patch, it returns false if the file is corrupt
(there's a set of such scenarios, all returning false), just after logging
that it's corrupt. The patch changes it to log that it's corrupt and then
return true.

The fact that it doesn't find the database doesn't mean the file is
corrupt, it just means the database is inactive. But missing global,
archiver or slru stats means it's corrupt.



> So, if we consider that as correct then the functionality is something
> like once we have read the timestamp of the global statfile, we use
> that if we can't read the actual db entry for whatever reason. It
> seems if we return false, the caller will call this function again in
> the loop. Now, I see the point that if we can't read some part of the
> file we should return false instead but not sure if that is helpful
> from the perspective of the caller of
> pgstat_read_db_statsfile_timestamp.
>

But we are not talking about the "if we can't read the actual db entry"
case, we are talking about the *global* parts of the file.

Though in fact the one inconsistent place in the code now is that if it is
corrupt in the db entry par

Re: PG 13 release notes, first draft

2020-09-09 Thread Amit Langote
On Tue, May 12, 2020 at 10:28 AM Amit Langote  wrote:
>
> On Tue, May 12, 2020 at 8:51 AM Bruce Momjian  wrote:
> > On Fri, May  8, 2020 at 12:07:09PM +0900, Amit Langote wrote:
> > > I have attached a patch with my suggestions above.
> >
> > OK, I slightly modified the wording of your first change, patch
> > attached.
>
> Thanks.  I checked that what you committed looks fine.

Sorry about not having reported this earlier, but I had noticed that
the wording of the partitioned tables logical replication item isn't
correct grammatically, which I noticed again while going through the
webpage.  Attached fixes it as follows:

-to be automatically published.  Addition/removal of partitions from
-partitioned tables are automatically added/removed from publications.
+to be automatically published.  Adding/removing partitions from
+a partitioned table automatically adds/removes them from publications.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


release-13-wording.patch
Description: Binary data


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 9:11 AM Amit Kapila  wrote:

> On Wed, Sep 9, 2020 at 10:54 AM Fujii Masao 
> wrote:
> >
> > On 2020/09/09 12:04, Amit Kapila wrote:
> > >
> > > No, before patch as well, if we can't read the DB entry say because
> > > the file is corrupt, we return true and use timestamp of global stats
> > > file and this is what is established by the original commit 187492b6.
> > > So, if we consider that as correct then the functionality is something
> > > like once we have read the timestamp of the global statfile, we use
> > > that if we can't read the actual db entry for whatever reason. It
> > > seems if we return false, the caller will call this function again in
> > > the loop. Now, I see the point that if we can't read some part of the
> > > file we should return false instead but not sure if that is helpful
> > > from the perspective of the caller of
> > > pgstat_read_db_statsfile_timestamp.
> >
> > When false is returned, the caller backend_read_statsfile() seems to
> > request the stats collector to write a fresh stats data into the file,
> > and then pgstat_read_db_statsfile_timestamp() can try to read the fresh
> > file that may not be corrupted. So returning false in that case seems ok
> > to me...
> >
>
> Hmm, the request to get fresh data is sent irrespective of true or
> false. We send it to get the latest data if it is not present and it
>

No we don't. Just before we request it, there is:
/* Normal acceptance case: file is not older than cutoff time */
if (ok && file_ts >= min_ts)
break;

So it only requests a new file in the case that it returned false.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Resetting spilled txn statistics in pg_stat_replication

2020-09-09 Thread Amit Kapila
On Tue, Sep 8, 2020 at 7:02 PM Amit Kapila  wrote:
>
> Comments on the latest patch:
> =
>

Apart from the comments I gave yesterday, another thing I was
wondering is how to write some tests for this patch. The two ideas I
could think of are as follows:

1. One idea was to keep these stats for each WALSender as it was in
the commit that we reverted as b074813d48. If we had that, then we could
query the stats for tests added in commit 58b5ae9d62. I am not sure
whether we want to display it in view pg_stat_replication but it would
be a really good way to test the streamed and serialized transactions
in a predictable manner.

2. Then the second way is to try doing something similar to what we do
in src/test/regress/sql/stats.sql

I think we should do both if possible.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-09-09 Thread Ajin Cherian
On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila  wrote:

>
> Nikhil has a test for the same
> (0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
> email [1]. You might want to use it to test this behavior. I think you
> can also keep the tests as a separate patch as Nikhil had.
>

Done. I've added the tests and also tweaked the code to make sure that
aborts during two-phase commits are also handled.

>
> I don't know why the patch has used this way to implement an option to
> enable two-phase. Can't we use how we implement 'stream-changes'
> option in commit 7259736a6e? Just refer how we set ctx->streaming and
> you can use a similar way to set this parameter.
>

Done, I've moved the checks for callbacks to inside the corresponding
wrappers.

Regards,
Ajin Cherian
Fujitsu Australia


0001-Support-decoding-of-two-phase-transactions-at-PREPAR.patch
Description: Binary data


0002-Tap-test-to-test-concurrent-aborts-during-2-phase-co.patch
Description: Binary data


Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander  wrote:
>
> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila  wrote:
>>
>
> Though in fact the one inconsistent place in the code now is that if it is 
> corrupt in the db entry part of the file it returns true and the global 
> timestamp, which I would argue is perhaps incorrect and it should return 
> false.
>

Yeah, this is exactly the case I was pointing out where we return true
before the patch, basically the code below:
case 'D':
if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
  fpin) != offsetof(PgStat_StatDBEntry, tables))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
statfile)));
goto done;
}

done:
FreeFile(fpin);
return true;

Now, if we decide to return 'false' here, then surely there is no
argument and we should return false in other cases as well. Basically,
I think we should be consistent in handling the corrupt file case.

-- 
With Regards,
Amit Kapila.




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Amit Kapila
On Wed, Sep 9, 2020 at 3:17 PM Magnus Hagander  wrote:
>
> On Wed, Sep 9, 2020 at 9:11 AM Amit Kapila  wrote:
>>
>> On Wed, Sep 9, 2020 at 10:54 AM Fujii Masao  
>> wrote:
>> >
>> > On 2020/09/09 12:04, Amit Kapila wrote:
>> > >
>> > > No, before patch as well, if we can't read the DB entry say because
>> > > the file is corrupt, we return true and use timestamp of global stats
>> > > file and this is what is established by the original commit 187492b6.
>> > > So, if we consider that as correct then the functionality is something
>> > > like once we have read the timestamp of the global statfile, we use
>> > > that if we can't read the actual db entry for whatever reason. It
>> > > seems if we return false, the caller will call this function again in
>> > > the loop. Now, I see the point that if we can't read some part of the
>> > > file we should return false instead but not sure if that is helpful
>> > > from the perspective of the caller of
>> > > pgstat_read_db_statsfile_timestamp.
>> >
>> > When false is returned, the caller backend_read_statsfile() seems to
>> > request the stats collector to write a fresh stats data into the file,
>> > and then pgstat_read_db_statsfile_timestamp() can try to read the fresh
>> > file that may not be corrupted. So returning false in that case seems ok
>> > to me...
>> >
>>
>> Hmm, the request to get fresh data is sent irrespective of true or
>> false. We send it to get the latest data if it is not present and it
>
>
> No we don't. Just before we request it, there is:
> /* Normal acceptance case: file is not older than cutoff time */
> if (ok && file_ts >= min_ts)
> break;
>
> So it only requests a new file in the case that it returned false.
>

What if the second part of the above 'if' statement is false? Then,
even when pgstat_read_db_statsfile_timestamp() has returned
true, we can send a request. IIUC, the basic idea here is that if the stats
in the file were updated before the cutoff time, then we do send the
request and wait for updated stats.

-- 
With Regards,
Amit Kapila.




Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin
Thanks Darafei!

> On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski  
> wrote:
> 
> > How does the 'sortsupport' routine interact with 'compress'/'decompress'? 
> > Which representation is passed to the comparator routine: the original 
> > value from the table, the compressed representation, or the decompressed 
> > representation? Do the comparetup_index_btree() and readtup_index() 
> > routines agree with that?
> 
> Currently we pass compressed values, which seems not very good.
> But there was a request from PostGIS maintainers to pass values before 
> decompression.
> Darafei, please, correct me if I'm wrong. Also can you please provide link on 
> PostGIS B-tree sorting functions?
> 
> We were expecting to reuse btree opclass for this thing. This way btree_gist 
> extension will become a lot thinner. :)

I think if we aim at reusing B-tree sort support functions we have to pass 
uncompressed values. They can be a lot bigger and slower in case of PostGIS. We 
will be sorting actual geometries instead of MBRs.

In my view it's better to implement GiST-specific sort support in btree_gist, 
rather than trying to reuse existing sort supports.

Best regards, Andrey Borodin.



Re: file_fdw vs relative paths

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 3:39 AM Ian Barwick 
wrote:

> Hi
>
> On 2020/09/07 2:31, Magnus Hagander wrote:
> > On Mon, Aug 31, 2020 at 5:03 PM Bruce Momjian  br...@momjian.us>> wrote:
> >
> > On Mon, Aug 31, 2020 at 01:16:05PM +0200, Magnus Hagander wrote:
> >  > Bruce, I've applied and backpatched your docs patch for this.
> >  >
> >  > Gah, and of course right after doing that, I remembered I wanted
> to get a
> >  > second change in :) To solve the "who's this Josh" question, I
> suggest we also
> >  > change the example to point to the data/log directory which is
> likely to exist
> >  > in a lot more of the cases. I keep getting people who ask "who is
> josh" based
> >  > on the /home/josh path. Not that it's that important, but...
> >
> > Thanks, and agreed.
> >
> >
> > Thanks, applied. I backpatched to 13 but didn't bother with the rest as
> > it's not technically *wrong* before..
>
> It's missing the leading single quote from the filename parameter:
>
>  diff --git a/doc/src/sgml/file-fdw.sgml b/doc/src/sgml/file-fdw.sgml
>  (...)
>  -OPTIONS ( filename '/home/josh/data/log/pglog.csv', format 'csv' );
>  +OPTIONS ( filename log/pglog.csv', format 'csv' );
>  (...)
>

GAH.

Thanks!


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Global snapshots

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 08:35, Masahiko Sawada wrote:

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
 wrote:


On 2020-09-08 14:48, Fujii Masao wrote:
>
> IIUC, yes, the information required for automatic resolution is
> WAL-logged and the standby tries to resolve those orphan transactions
> from WAL after the failover. But Sawada-san's patch provides
> the special function for manual resolution, so there may be some cases
> where manual resolution is necessary.
>

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.


Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that is not resolved for some reasons, bugs etc.



OK, thank you for the explanation!



Once the transaction is committed locally any ERROR (or higher level
message) will be escalated to PANIC.


I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?



It's not actually related to critical-section error escalation. Any 
error in the backend after the local commit and 
ProcArrayEndTransaction() will try to abort the current transaction and 
do RecordTransactionAbort(), but it's too late to do so and a PANIC will 
be raised:


/*
 * Check that we haven't aborted halfway through
 * RecordTransactionCommit.
 */
if (TransactionIdDidCommit(xid))
	elog(PANIC, "cannot abort transaction %u, it was already committed",
		 xid);

At least that's how I understand it.



And I do see possible ERROR level
messages in the postgresCommitForeignTransaction() for example:

+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+   ereport(ERROR, (errmsg("could not commit transaction on server %s",
+  frstate->server->servername)));


I don't think that it's very convenient to get a PANIC every time we
fail to commit one of the prepared foreign xacts, since that may not be
so rare in a distributed system. That's why I tried to get rid of
possible ERRORs as far as possible in my proposed patch.



In my patch, the second phase of 2PC is executed only by the resolver
process. Therefore, even if an error happens while committing a
foreign prepared transaction, we just need to relaunch the resolver
process and try again. During that, the backend process will be
just waiting. If a backend process raises an error after the local
commit, the client will see transaction failure despite the local
transaction having been committed. An error could happen even in
palloc. So the patch uses a background worker to commit prepared
foreign transactions, rather than the backend itself.



Yes, if it's a background process, then it seems to be safe.

BTW, it seems that I've chosen a wrong thread for posting my patch and 
starting a discussion :) Activity from this thread moved to [1] and your 
solution with a built-in resolver is discussed in [2]. I'll try to take a 
closer look at v25 and write to [2] instead.



[1] 
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca


[2] 
https://www.postgresql.org/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 13:28, Andrey M. Borodin wrote:

Thanks Darafei!


On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski
 wrote:


How does the 'sortsupport' routine interact with
'compress'/'decompress'? Which representation is passed to the
comparator routine: the original value from the table, the
compressed representation, or the decompressed representation? Do
the comparetup_index_btree() and readtup_index() routines agree
with that?


Currently we pass compressed values, which seems not very good. But
there was a request from PostGIS maintainers to pass values before
decompression. Darafei, please, correct me if I'm wrong. Also can
you please provide link on PostGIS B-tree sorting functions?

We were expecting to reuse btree opclass for this thing. This way
btree_gist extension will become a lot thinner. :)


I think if we aim at reusing B-tree sort support functions we have to
pass uncompressed values. They can be a lot bigger and slower in case
of PostGIS. We will be sorting actual geometries instead of MBRs.

In my view it's better to implement GiST-specific sort support in
btree_gist, rather than trying to reuse existing sort supports.


Yeah, I don't think reusing existing sortsupport functions directly is 
important. The comparison function should be short anyway for 
performance reasons, so it won't be a lot of code to copy-paste. And if 
there are some common subroutines, you can put them in separate 
internal functions for reuse.


Using the 'compressed' format seems reasonable to me. It's natural to 
the gistbuild.c code, and the comparison routine can 'decompress' itself 
if it wishes. If the decompression is somewhat expensive, it's 
unfortunate if you need to do it repeatedly in the comparator, but 
tuplesort.c would need pretty big changes to keep around a separate 
in-memory representation to compare. However, you could use the sort 
"abbreviation" functionality to mitigate that.


Come to think of it, the point z-order comparator could benefit a lot 
from key abbreviation, too. You could do the point -> zorder conversion 
in the abbreviation routine.


- Heikki




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Peter Eisentraut

On 2020-09-09 07:40, Fujii Masao wrote:

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Is using given-name-first order our consensus? I was thinking we had not
reached that yet, and our "vague" consensus was to use the name that each
contributor prefers, for example the name used in their email signature, etc.

BTW, if possible I'd like to use family-name-first, i.e., "Fujii Masao".


See commit 53c89aed7b38ab412fddc1d6118822ce5d962acd for when this was 
changed.


At least it's the current practice.  It can be debated whether it's a 
good practice.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Peter Eisentraut

On 2020-09-09 07:15, Etsuro Fujita wrote:

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Given existing practice, this patch looks okay.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another fast GiST build

2020-09-09 Thread Komяpa
On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:

> On 09/09/2020 13:28, Andrey M. Borodin wrote:
> > Thanks Darafei!
> >
> >> On 9 Sep 2020, at 12:05, Darafei Komяpa Praliaskouski
> >> wrote:
> >>
> >>> How does the 'sortsupport' routine interact with
> >>> 'compress'/'decompress'? Which representation is passed to the
> >>> comparator routine: the original value from the table, the
> >>> compressed representation, or the decompressed representation? Do
> >>> the comparetup_index_btree() and readtup_index() routines agree
> >>> with that?
> >>
> 
>
> Come to think of it, the point z-order comparator could benefit a lot
> from key abbreviation, too. You could do the point -> zorder conversion
> in the abbreviation routine.
>

That's how it works in PostGIS, only that we moved to a more
efficient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Michael Paquier
On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:
> Initially I added List *params, and Michael suggested to retire
> ReindexStmt->concurrent.  I provided a patch to do so, initially by leaving 
> int
> options and then, after this, removing it to "complete the thought", and get
> rid of the remnants of the "old way" of doing it.  This is also how vacuum and
> explain are done.
> https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com

Defining a set of DefElem when parsing and then using the int
"options" with bitmasks where necessary at the beginning of the
execution looks like a good balance to me.  This way, you can extend
the grammar to use things like (verbose = true), etc.

By the way, skimming through the patch set, I was wondering if we
could do the refactoring of patch 0005 as a first step, until I
noticed this part:
+common_option_name:
NonReservedWord { $$ = $1; }
| analyze_keyword { $$ = "analyze"; }
This is not a good idea as you make ANALYZE an option available for
all the commands involved in the refactoring.  A portion of that could
be considered though, like the use of common_option_arg.
--
Michael


signature.asc
Description: PGP signature


Re: Online checksums verification in the backend

2020-09-09 Thread Michael Paquier
On Wed, Sep 09, 2020 at 11:25:24AM +0200, Julien Rouhaud wrote:
> I assumed that the odds of having to check the buffer twice were so low, and
> that avoiding holding a bufmapping lock while doing some IO was an
> uncontroversial enough optimisation, but maybe that's only wishful thinking.

Perhaps it is worth it, but it would be good to make sure of that and
see if it actually justifies the extra complication.  This also depends
on whether the page is in the OS cache when it is not in shared buffers,
meaning that smgrread() is needed to fetch the page to check.  I would
be most curious to see if there is an actual difference when the page is
in the OS cache.

> I'll do some benchmarking and see if I can get some figures, but it'll
> probably take some time.  In the meantime I'm attaching v11 of the patch
> that should address all other comments.

Thanks.

Another thing that was bothering me is the introduction of 3 GUCs and
one new category for the sake of two SQL functions.  For VACUUM we
have many things relying on the GUC delays, with autovacuum and manual
vacuum.  Perhaps it would make sense to have these if we some day have
a background worker doing checksum verifications, though that could
perfectly well live in contrib/, and it would be kind of hard to tune
as well.  The patch enabling checksums on-the-fly could also be a good
enough reason.  Another thing we could consider is to pass down those
parameters as function arguments, at the cost of not being able to
reload them.
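The VACUUM-style delay mechanism under debate can be sketched in isolation. The parameter names mirror the GUCs from the patch (checksum_cost_page, checksum_cost_limit, checksum_cost_delay), but they are passed as plain arguments here, and the sleep itself is left to the caller:

```c
/* Accumulated cost since the last nap; in the real backend this would
 * be process-local state driven by GUCs. */
static int	checksum_cost_balance = 0;

/* Account for one verified page.  Returns the number of milliseconds
 * the caller should sleep (0 most of the time); crossing the limit
 * resets the balance, much like vacuum_delay_point() does for VACUUM. */
static double
checksum_cost_account(int cost_page, int cost_limit, double delay_ms)
{
	if (delay_ms <= 0)
		return 0;				/* feature disabled by default */

	checksum_cost_balance += cost_page;
	if (checksum_cost_balance >= cost_limit)
	{
		checksum_cost_balance = 0;
		return delay_ms;		/* caller sleeps, e.g. via pg_usleep() */
	}
	return 0;
}
```

Turning these three knobs into function arguments, as suggested, would mean the loop above simply receives different values per call instead of re-reading reloadable GUCs.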
--
Michael


signature.asc
Description: PGP signature


Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Amit Langote
On Wed, Sep 9, 2020 at 6:42 PM Alexey Kondratov
 wrote:
> On 2020-09-09 11:45, Andrey V. Lepikhov wrote:
> > This does not seem very convenient and will lead to errors in the
> > future. So, I agree with Amit.
>
> And InitResultRelInfo() may set ri_usesMultiInsert to false by default,
> since it's used only by COPY now. Then you won't need this in several
> places:
>
> +   resultRelInfo->ri_usesMultiInsert = false;
>
> While the logic of turning multi-insert on with all the validations
> required could be factored out of InitResultRelInfo() to a separate
> routine.

Interesting idea.  Maybe it is better to have a separate routine, as Alexey says.
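A sketch of what such a separate routine could look like, with entirely hypothetical conditions (the real decision in the patch depends on details like triggers and whether the target is a foreign table):

```c
#include <stdbool.h>

/* Hypothetical factored-out check, replacing scattered assignments of
 * resultRelInfo->ri_usesMultiInsert.  The conditions are illustrative
 * only: multi-insert must be avoided whenever each row has to be
 * processed individually. */
static bool
relation_supports_multi_insert(bool has_before_row_triggers,
							   bool has_instead_of_triggers,
							   bool is_foreign_table,
							   bool fdw_supports_multi_insert)
{
	if (has_before_row_triggers || has_instead_of_triggers)
		return false;			/* triggers see rows one at a time */
	if (is_foreign_table && !fdw_supports_multi_insert)
		return false;			/* FDW lacks a batching API */
	return true;
}
```

With such a routine, InitResultRelInfo() can default the field to false and only COPY (or other bulk paths) needs to call the validation.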


--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Online checksums verification in the backend

2020-09-09 Thread Julien Rouhaud
On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier  wrote:
>
> Another thing that was itching me is the introduction of 3 GUCs with
> one new category for the sake of two SQL functions.  For VACUUM we
> have many things relying on the GUC delays, with autovacuum and manual
> vacuum.  Perhaps it would make sense to have these if we have some day
> a background worker doing checksum verifications, still that could
> perfectly be in contrib/, and that would be kind of hard to tune as
> well.  The patch enabling checksums on-the-fly could also be a reason
> good enough.  Another thing we could consider is to pass down those
> parameters as function arguments, at the cost of not being able to
> reload them.

I'm not terribly happy with adding that for now, but it's quite clear
that there'll eventually be a lot of new stuff added that will benefit
from either the category or the GUCs.  FTR, once we reach an agreement
on how to do this check (I'm wondering if it'll stay an SQL function
or become a plain backend command, in which case GUCs would be
mandatory), I'll also be happy to work on a background worker to help
people run the check regularly.  So in my opinion it's better to add
them now, so we won't have to change the SQL function definition
later when other facilities come to rely on them.




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 15:22, Michael Paquier wrote:

On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:

Initially I added List *params, and Michael suggested to retire
ReindexStmt->concurrent.  I provided a patch to do so, initially by
leaving int options and then, after this, removing it to "complete the
thought", and get rid of the remnants of the "old way" of doing it.
This is also how vacuum and explain are done.
https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com


Defining a set of DefElem when parsing and then using the int
"options" with bitmasks where necessary at the beginning of the
execution looks like a good balance to me.  This way, you can extend
the grammar to use things like (verbose = true), etc.

By the way, skimming through the patch set, I was wondering if we
could do the refactoring of patch 0005 as a first step



Yes, I did it with the intention of putting it first, but wanted to get
some feedback. It's easier to refactor the last patch without rebasing
the others.




until I
noticed this part:
+common_option_name:
NonReservedWord { $$ = $1; }
| analyze_keyword { $$ = "analyze"; }
This is not a good idea as you make ANALYZE an option available for
all the commands involved in the refactoring.  A portion of that could
be considered though, like the use of common_option_arg.



From the grammar perspective, ANY option is available for any command
that uses a parenthesized option list. All the checks and validations
are performed in the corresponding command code.
This analyze_keyword rule only normalizes the ANALYZE word when it is
used as an option. Why could it be harmful?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Amit Kapila
On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel BTree 
> scans, which causes the parallel-scan state machine to get confused.
>
>
> To reproduce, you need a query that:
>
>
>
> 1. Executes parallel BTree index scan;
>
> 2. Has an IN-list of size > 1;
>
> 3. Has an additional index filter that makes it impossible to satisfy the
>
> first IN-list condition.
>
>
>
> (We encountered such a query, and therefore the bug, on a production 
> instance.)
>
>

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

-- 
With Regards,
Amit Kapila.




RE: extension patch of CREATE OR REPLACE TRIGGER

2020-09-09 Thread osumi.takami...@fujitsu.com
Hi


> Those are fixed OK now, but I found 2 new review points.
> 
> 
> 
> COMMENT trigger.c (typo)
> 
> + ereport(ERROR,
> + (errcode(ERRCODE_DUPLICATE_OBJECT),
> + errmsg("trigger \"%s\" for relation \"%s\" is a constraint trigger",
> + stmt->trigname, RelationGetRelationName(rel)),
> + errhint("use CREATE OR REPLACE CONSTRAINT TRIGGER to replace a
> costraint trigger")));
> 
> 
> Typo in the errhint text.
> "costraint" -> "constraint"
Fixed. Thank you.

> 
> 
> COMMENT create_trigger.sgmg (add more help?)
> 
> I noticed that the CREATE OR REPLACE FUNCTION help [1] describes the OR
> REPLACE syntax ("Description" section) and also mentions some of the
> restrictions when using REPLACE ("Notes" section).
> [1] - https://www.postgresql.org/docs/current/sql-createfunction.html
> 
> ~~
> OTOH this trigger patch does not add anything much at all in the trigger help.
> 
> Shouldn't the "Description" at least say something like:
> "CREATE OR REPLACE will either create a new trigger, or replace an existing
> definition."
> 
> Shouldn't the "Notes" include information about restrictions when using OR
> REPLACE e.g. cannot replace triggers with triggers of a different kind e.g.
> cannot replace triggers with pending events
> 
> What do you think?
That's a great idea. I've applied this idea to the latest patch v10.

Regards,
Takamichi Osumi


CREATE_OR_REPLACE_TRIGGER_v10.patch
Description: CREATE_OR_REPLACE_TRIGGER_v10.patch


Minor cleanup of partbounds.c

2020-09-09 Thread Etsuro Fujita
Here is a patch for minor cleanup of the partbounds.c changes made by
commit c8434d64c: 1) removes a useless assignment (in normal builds)
and 2) improves comments a little.

Best regards,
Etsuro Fujita


partbounds-cleanup.patch
Description: Binary data


Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-09 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-09-07 01:46, Tom Lane wrote:
>> I reviewed the 0002 patch, finding one bug (in int8_sum)

> Ouch, no test coverage.  Should we perhaps remove this function, since 
> it's obsolete and unused?

I don't feel a need to.

>> and a few
>> more calls of int8_numeric that could be converted.  I think the
>> attached updated version is committable, and I'd recommend going
>> ahead with that regardless of the rest of this.  I hadn't realized
>> how many random calls of int8_numeric and int4_numeric we'd grown,
>> but there are a lot, so this is nice cleanup.

> Yes, please go ahead with it.

It's your patch, I figured you'd want to commit it.

regards, tom lane




Re: Online checksums verification in the backend

2020-09-09 Thread Julien Rouhaud
On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:
> 
> I'll do some becnhmarking and see if I can get some figures, but it'll 
> probably
> take some time.  In the meantime I'm attaching v11 of the patch that should
> address all other comments.

I just realized that I forgot to update one of the Makefile when moving the TAP
test folder.  v12 attached should fix that.
From 84d0a4c5ec5d6b6f832fad02d421c204f1bee98b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v12] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  51 +++
 src/backend/postmaster/pgstat.c   |   3 +
 src/backend/storage/page/checksum.c   | 322 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 218 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/pgstat.h  |   3 +-
 src/include/utils/checksumfuncs.h |  31 ++
 src/include/utils/guc_tables.h|   1 +
 src/test/modules/Makefile |   1 +
 src/test/modules/check_relation/.gitignore|   2 +
 src/test/modules/check_relation/Makefile  |  14 +
 src/test/modules/check_relation/README|  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++
 19 files changed, 1096 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..b7629fde60 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2137,6 +2137,91 @@ include_dir 'conf.d'
  
 
 
+
+ Cost-based Checksum Verification Delay
+
+ 
+  During the execution of 
+  function, the system maintains an internal counter that keeps track of
+  the estimated cost of the various I/O operations that are performed.
+  When the accumulated cost reaches a limit (specified by
+  checksum_cost_limit), the process performing the
+  operation will sleep for a short period of time, as specified by
+  checksum_cost_delay. Then it will reset the counter
+  and continue execution.
+ 
+
+ 
+  This feature is disabled by default. To enable it, set the
+  checksum_cost_delay variable to a nonzero
+  value.
+ 
+
+ 
+  
+   checksum_cost_delay (floating 
point)
+   
+checksum_cost_delay configuration 
parameter
+   
+   
+   
+
+ The amount of time that the process will sleep
+ when the cost limit has been exceeded.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is zero, which disables the cost-based checksum
+ verification delay feature.  Positive values enable cost-based
+ checksum verification.
+
+   
+  
+
+  
+   checksum_cost_page (integer)
+   
+checksum_cost_page configuration 
parameter
+   
+   
+   
+
+ The estimated cost for verifying a buffer, whether it's found in the
+ shared buffer cache or not. It represents the cost to lock the buffer
+ pool, lookup the shared hash table, read the content of the page from
+ disk and compute its checksum.  The default value is 10.
+
+   
+  
+
+  
+   checksum_cost_limit (integer)
+   
+checksum_cost_limit configuration 
parameter
+   
+   
+   
+
+ The accumulated cost that will cause the verification process to 
sleep.
+ The default value is 200.
+
+   
+  
+ 
+
+ 
+  
+   There are certain operations that hold critical locks and should
+   therefore complete as quickly as possible.  Cost-based checksum
+   verification delays do not occur during such operations.  Therefore i

Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:

On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander  wrote:


On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila  wrote:




Though in fact the one inconsistent place in the code now is that if it is 
corrupt in the db entry part of the file it returns true and the global 
timestamp, which I would argue is perhaps incorrect and it should return false.



Yeah, this is exactly the case I was pointing out where we return true
before the patch, basically the code below:
case 'D':
if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
 fpin) != offsetof(PgStat_StatDBEntry, tables))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
statfile)));
goto done;
}

done:
FreeFile(fpin);
return true;

Now, if we decide to return 'false' here, then surely there is no
argument and we should return false in other cases as well. Basically,
I think we should be consistent in handling the corrupt file case.



FWIW I do agree with this - we should return false here, to make it
return false like in the other data corruption cases. AFAICS that's the
only inconsistency here.
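The agreed direction, reduced to a standalone sketch (the stream stands in for fpin and the names are simplified from pgstat.c):

```c
#include <stdbool.h>
#include <stdio.h>

/* Read one database entry from the stats file.  A short read means the
 * file is corrupt; report failure instead of the previous behavior of
 * warning and then returning true anyway via "goto done". */
static bool
read_db_entry(FILE *fpin, void *entry, size_t expected)
{
	if (fread(entry, 1, expected, fpin) != expected)
	{
		/* the backend would ereport "corrupted statistics file" here */
		return false;
	}
	return true;
}
```

The caller then treats a corrupt 'D' entry exactly like the other corruption cases: FreeFile() the handle and return false.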


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 3:56 PM Tomas Vondra 
wrote:

> On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:
> >On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander 
> wrote:
> >>
> >> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila 
> wrote:
> >>>
> >>
> >> Though in fact the one inconsistent place in the code now is that if it
> is corrupt in the db entry part of the file it returns true and the global
> timestamp, which I would argue is perhaps incorrect and it should return
> false.
> >>
> >
> >Yeah, this is exactly the case I was pointing out where we return true
> >before the patch, basically the code below:
> >case 'D':
> >if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
> >  fpin) != offsetof(PgStat_StatDBEntry, tables))
> >{
> >ereport(pgStatRunningInCollector ? LOG : WARNING,
> >(errmsg("corrupted statistics file \"%s\"",
> >statfile)));
> >goto done;
> >}
> >
> >done:
> >FreeFile(fpin);
> >return true;
> >
> >Now, if we decide to return 'false' here, then surely there is no
> >argument and we should return false in other cases as well. Basically,
> >I think we should be consistent in handling the corrupt file case.
> >
>
> FWIW I do agree with this - we should return false here, to make it
> return false like in the other data corruption cases. AFAICS that's the
> only inconsistency here.
>

+1, I think that's the place to fix, rather than reversing all the other
places.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Minor fixes for upcoming version 13

2020-09-09 Thread Alexander Lakhin
Hello hackers,

Please consider fixing minor defects in the source code and documentation.
1. a missing dot in guc.c
"SELECT name, short_desc FROM pg_settings WHERE short_desc NOT LIKE
'%.'" finds only this one instance.
2. inrange -> in_range
the former spelling is unique
3. sigature -> signature
4. "start timeline %u not found history of timeline %u" -> "start
timeline %u not found in history of timeline %u"
it seems that a preposition is missing
5. BackgroundWorker *worker -> BackgroundWorker
*worker
incorrect Docbook tag
6. "unhandled previous state in xqs" -> "unhandled previous state at end
quote"?
"xqs" is rather cryptic for a user-visible message

I'm not sure about 6, so the attached patch covers only 1-5.

Best regards,
Alexander
diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 6e1cf121de0..1876dfbd5a5 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -34,11 +34,11 @@
   PostgreSQL is started by including the module name in
   shared_preload_libraries.  A module wishing to run a background
   worker can register it by calling
-  RegisterBackgroundWorker(BackgroundWorker *worker)
+  RegisterBackgroundWorker(BackgroundWorker *worker)
   from its _PG_init().  Background workers can also be started
   after the system is up and running by calling the function
-  RegisterDynamicBackgroundWorker(BackgroundWorker
-  *worker, BackgroundWorkerHandle **handle).  Unlike
+  RegisterDynamicBackgroundWorker(BackgroundWorker *worker,
+  BackgroundWorkerHandle **handle).  Unlike
   RegisterBackgroundWorker, which can only be called from within
   the postmaster, RegisterDynamicBackgroundWorker must be
   called from a regular backend or another background worker.
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index d03ee4d6fa0..4a1ac40297c 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -263,7 +263,7 @@

   
   
-   inrange
+   in_range

 
  in_range support functions
diff --git a/doc/src/sgml/intarray.sgml b/doc/src/sgml/intarray.sgml
index 9d2eb52eeb4..956c01d7493 100644
--- a/doc/src/sgml/intarray.sgml
+++ b/doc/src/sgml/intarray.sgml
@@ -452,7 +452,7 @@
 -- a message can be in one or more sections
 CREATE TABLE message (mid INT PRIMARY KEY, sections INT[], ...);
 
--- create specialized index with sigature length of 32 bytes
+-- create specialized index with signature length of 32 bytes
 CREATE INDEX message_rdtree_idx ON message USING GIST (sections gist__int_ops(siglen=32));
 
 -- select messages in section 1 OR 2 - OVERLAP operator
diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index b6260049271..a43c793e289 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -272,7 +272,7 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
 	 */
 	if (!found_start_timeline)
 		ereport(ERROR,
-errmsg("start timeline %u not found history of timeline %u",
+errmsg("start timeline %u not found in history of timeline %u",
 	   starttli, endtli));
 
 	/* Terminate the list of WAL ranges. */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 20d4784b335..44c3e5051df 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3133,7 +3133,7 @@ static struct config_int ConfigureNamesInt[] =
 	},
 	{
 		{"autovacuum_vacuum_insert_threshold", PGC_SIGHUP, AUTOVACUUM,
-			gettext_noop("Minimum number of tuple inserts prior to vacuum, or -1 to disable insert vacuums"),
+			gettext_noop("Minimum number of tuple inserts prior to vacuum, or -1 to disable insert vacuums."),
 			NULL
 		},
 		&autovacuum_vac_ins_thresh,


Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Jameson, Hunter 'James'
Hi, I spent some time trying to create a repro (other than testing it on the 
production instance where we encountered the bug), but was unable to create one 
within a reasonable time.

The tricky part is that the bug symptoms are run-time symptoms -- so not only 
do you need, first, to satisfy conditions (1), (2), and (3), without the query 
optimizer optimizing them away! -- but you also need, second, a query that runs 
long enough for one or more of the parallel workers' state machines to get 
confused. (This wasn't a problem on the production instance where we 
encountered the bug and I tested the fix.)

Also, third-- passing InvalidBlockNumber to ReadBuffer() generally just appends 
a new block to the relation, so the bug doesn't even result in an error 
condition on an RW instance. (The production instance was RO...) So the bug, 
although very small!, is annoying!

James

On 9/9/20, 6:14 AM, "Amit Kapila"  wrote:




On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel 
BTree scans, which causes the parallel-scan state machine to get confused.
>
>
> To reproduce, you need a query that:
>
>
>
> 1. Executes parallel BTree index scan;
>
> 2. Has an IN-list of size > 1;
>
> 3. Has an additional index filter that makes it impossible to satisfy the
>
> first IN-list condition.
>
>
>
> (We encountered such a query, and therefore the bug, on a production 
instance.)
>
>

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.



Re: More aggressive vacuuming of temporary tables

2020-09-09 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2020-08-28 11:46:49 -0400, Tom Lane wrote:
> > It strikes me that when we are vacuuming a temporary table (which
> > necessarily will be one of our own session), we don't really need
> > to care what the global xmin horizon is.  If we're not inside a
> > user transaction block, then there are no tuples in the table that
> > could be in-doubt anymore.  Neither are there any snapshots in our
> > session that could see any dead tuples.  Nor do we give a fig what
> > other sessions might think of those tuples.  So we could just set
> > the xmin cutoff as aggressively as possible, which is to say
> > equal to the nextXid counter.  While vacuuming a temp table is
> > perhaps not something people do very often, I think when they do
> > do it they would like us to clean out all the dead tuples not just
> > some.
> 
> That seems like a good idea.

Agreed.

> I've been toying with a patch that introduces more smarts about when a
> row is removable, by looking more closely whether a specific row
> versions are visible (e.g. in the common case of one old snapshot and
> lots of newer rows). But that's orders of magnitude more complicated. So
> going for something as simple as this seems like a good idea.

I've wondered about this for a long time- very cool that you've found
time to actually work on a patch.  A couple of different ideas were
discussed previously about how to do that kind of a check- mind talking
about what method you're using, or perhaps just sharing that patch? :)

The potential of such an improvement is huge.
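Reduced to its core, Tom's idea is just a different choice of removal horizon; the function below is a standalone sketch with invented names, not the actual vacuum_set_xid_limits() logic:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* When vacuuming a session-local temp table outside a transaction
 * block, no snapshot anywhere can still see its dead tuples, so the
 * removal cutoff can be the next XID to be assigned rather than the
 * global xmin horizon. */
static TransactionId
choose_oldest_xmin(bool is_own_temp_table, bool in_xact_block,
				   TransactionId global_xmin, TransactionId next_xid)
{
	if (is_own_temp_table && !in_xact_block)
		return next_xid;		/* most aggressive possible cutoff */
	return global_xmin;			/* the usual conservative horizon */
}
```

With the aggressive cutoff, a VACUUM of a temp table removes all dead tuples regardless of what long-running transactions exist in other sessions.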

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: unsupportable composite type partition keys

2020-09-09 Thread Jobin Augustine
Is there a way out if someone accidentally executes the same test case
against PG12?

testdb=# create table partitioned (a int, b int)
testdb-#   partition by list ((row(a, b)::partitioned));
CREATE TABLE
testdb=# DROP TABLE partitioned;
ERROR:  cache lookup failed for type 18269


>
> Ah, I wasn't sure that additional tests on a table would be worthwhile
> enough.
> Thanks for tweaking and pushing!
>
>
>


Re: Online checksums patch - once again

2020-09-09 Thread Daniel Gustafsson
> On 7 Sep 2020, at 09:17, Michael Paquier  wrote:

> Daniel, could you look at that?

I believe this boils down to a timing issue, I've included a fix in the v21
patch attached to a previous mail upthread.

cheers ./daniel




Re: Reduce the time required for a database recovery from archive.

2020-09-09 Thread Stephen Frost
Greetings,

* Dmitry Shulga (d.shu...@postgrespro.ru) wrote:
> Overall archive file processing is done one by one, and this might
> create a performance bottleneck if archived WAL files are delivered slowly,
> because the database server has to wait for arrival of the next
> WAL segment before applying its records.
> 
> To address this issue it is proposed to receive archived WAL files in parallel
> so that when the next WAL segment file is required for processing of redo log
> records it would be already available.

Yes, pgbackrest already does exactly this (if configured)- uses parallel
processes to fetch the WAL and have it be available ahead of time.

> Implementation of this approach assumes running several background processes 
> (bgworkers)
> each of which runs a shell command specified by the parameter restore_command
> to deliver an archived WAL file. Number of running parallel processes is 
> limited
> by the new parameter max_restore_command_workers. If this parameter has value 0
> then WAL file delivery is performed using the original algorithm, that is in a
> one-by-one manner. If this parameter has a value greater than 0 then the 
> database
> server starts several bgworker processes up to the limit specified by
> the parameter max_restore_command_workers and passes to every process
> WAL file name to deliver. Active processes start prefetching of specified
> WAL files and store received files in the directory pg_wal/pgsql_tmp. After
> bgworker process finishes receiving a file it marks itself as a free process
> and waits for a new request to receive a next WAL file. The main process
> performing database recovery still handles WAL files in one-by-one manner,
> but instead of waiting for a next required WAL file's availability it checks 
> for
> that file in the prefetched directory. If a new file is present there,
> the main process starts its processing.

I'm a bit confused about this description- surely it makes sense for the
parallel workers to continue to loop and fetch up to some specified
max..?  Then to monitor and to fetch more when the amount pre-fetched so
far drops before that level?  The description above makes it sound like
X WAL will be fetched ahead of time, and then the recovery process will
go through those until it runs out and then it'll have to wait for the
next X WAL to be fetched, which means it's still going to end up being
delayed even with these parallel processes, which isn't good.
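The looping behavior suggested here amounts to keeping a fixed-depth prefetch window topped up. A standalone sketch with invented names (segments identified by plain integers, bgworkers abstracted away):

```c
/* Keep the prefetch window full: request every segment from the
 * highest one already requested up to next_needed + window - 1.
 * Returns how many new requests were queued.  Called each time
 * recovery advances to a new segment, so delivery and replay overlap
 * instead of alternating in batches. */
static int
top_up_prefetch(int next_needed, int *highest_requested, int window)
{
	int			target = next_needed + window - 1;
	int			queued = 0;

	while (*highest_requested < target)
	{
		(*highest_requested)++;	/* hand this segment to a free bgworker */
		queued++;
	}
	return queued;
}
```

Calling this after every applied segment keeps roughly `window` files in flight at all times, rather than fetching a batch, draining it, and stalling while the next batch arrives.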

Does this also properly handle timeline switches..?

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Optimising compactify_tuples()

2020-09-09 Thread David Rowley
On Wed, 9 Sep 2020 at 05:38, Thomas Munro  wrote:
>
> On Wed, Sep 9, 2020 at 3:47 AM David Rowley  wrote:
> > On Tue, 8 Sep 2020 at 12:08, Thomas Munro  wrote:
> > > One thought is that if we're going to copy everything out and back in
> > > again, we might want to consider doing it in a
> > > memory-prefetcher-friendly order.  Would it be a good idea to
> > > rearrange the tuples to match line pointer order, so that the copying
> > > work and also later sequential scans are in a forward direction?
> >
> > That's an interesting idea but wouldn't that require both the copy to
> > the separate buffer *and* a qsort? That's the worst of both
> > implementations. We'd need some other data structure too in order to
> > get the index of the sorted array by reverse lineitem point, which
> > might require an additional array and an additional sort.
>
> Well I may not have had enough coffee yet but I thought you'd just
> have to spin though the item IDs twice.  Once to compute sum(lp_len)
> so you can compute the new pd_upper, and the second time to copy the
> tuples from their random locations on the temporary page to new
> sequential locations, so that afterwards item ID order matches offset
> order.

I think you were adequately caffeinated.  You're right that this is
fairly simple to do, but it looks even simpler than looping twice
over the array.  I think it's just a matter of looping over the
itemidbase backwards and putting the higher itemid tuples at the end
of the page. I've done it this way in the attached patch.

I also added a presorted path which falls back on doing memmoves
without the temp buffer when the itemidbase array indicates that
higher lineitems all have higher offsets.  I'm doing the presort check
in the calling function since that loops over the lineitems already.
We can just memmove the tuples in reverse order without overwriting
any yet to be moved tuples when these are in order.
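The backward pass described above can be shown in miniature. This toy keeps item offsets in plain structs rather than real ItemIds and always copies through a snapshot of the page (the patch's presorted path skips that and memmoves in place):

```c
#include <string.h>

#define PAGESZ 128				/* toy page size */

typedef struct
{
	int			offset;			/* where the tuple currently lives */
	int			len;			/* tuple length */
} ItemIdData;					/* simplified stand-in for real ItemIds */

/* Walk the itemid array backwards, packing each tuple against the
 * current end of free space.  Copying from a snapshot of the page
 * means source and destination can never overlap, and afterwards
 * item-id order matches physical offset order.  Returns the new
 * pd_upper. */
static int
compactify(char *page, ItemIdData *items, int nitems)
{
	char		tmp[PAGESZ];
	int			upper = PAGESZ;

	memcpy(tmp, page, PAGESZ);
	for (int i = nitems - 1; i >= 0; i--)
	{
		upper -= items[i].len;
		memcpy(page + upper, tmp + items[i].offset, items[i].len);
		items[i].offset = upper;
	}
	return upper;
}
```

Because the last item is placed first at the very end of the page, higher item ids end up at higher offsets, which is what makes later sequential scans prefetcher-friendly.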

Also, I added code to collapse the memcpy and memmoves for adjacent
tuples so that we perform the minimal number of calls to those
functions. Once we've previously compacted a page it seems that the
code is able to reduce the number of calls significantly.  I added
some logging and reviewed it after a run of the benchmark, and saw that
for about 192 tuples we're mostly just doing 3-4 memcpys in the
non-presorted path and just 2 memmoves for the presorted code path.
I also found that in my test the presorted path was only taken 12.39%
of the time. Trying with 120 million UPDATEs instead of 12 million in
the test ended up reducing this to just 10.89%.  It seems that it'll
just be 1 or 2 tuples spoiling it since new tuples will still be added
earlier in the page after we free up space to add more.

I also experimented with what would happen if I also tried to
collapse the memcpys for copying to the temp buffer. The performance
got a little worse from doing that, so I left that code #ifdef'd out.

With the attached v3, performance is better. The test now completes
recovery in 65.6 seconds, vs master's 148.5 seconds, so about 2.2x
faster.

We should probably consider what else can be done to try to write
pages with tuples for earlier lineitems earlier in the page.  VACUUM
FULLs and friends will switch back to the opposite order when
rewriting the heap.

Also fixed my missing libc debug symbols:

  24.90%  postgres  postgres            [.] PageRepairFragmentation
  15.26%  postgres  libc-2.31.so        [.] __memmove_avx_unaligned_erms
   9.61%  postgres  postgres            [.] hash_search_with_hash_value
   8.03%  postgres  postgres            [.] compactify_tuples
   6.25%  postgres  postgres            [.] XLogReadBufferExtended
   3.74%  postgres  postgres            [.] PinBuffer
   2.25%  postgres  postgres            [.] hash_bytes
   1.79%  postgres  postgres            [.] heap_xlog_update
   1.47%  postgres  postgres            [.] LWLockRelease
   1.44%  postgres  postgres            [.] XLogReadRecord
   1.33%  postgres  postgres            [.] PageGetHeapFreeSpace
   1.16%  postgres  postgres            [.] DecodeXLogRecord
   1.13%  postgres  postgres            [.] pg_comp_crc32c_sse42
   1.12%  postgres  postgres            [.] LWLockAttemptLock
   1.09%  postgres  postgres            [.] StartupXLOG
   0.90%  postgres  postgres            [.] ReadBuffer_common
   0.84%  postgres  postgres            [.] SlruSelectLRUPage
   0.74%  postgres  libc-2.31.so        [.] __memcmp_avx2_movbe
   0.68%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
   0.66%  postgres  postgres            [.] PageAddItemExtended
   0.66%  postgres  postgres            [.] PageIndexTupleOverwrite
   0.62%  postgres  postgres            [.] smgropen
   0.60%  postgres  postgres            [.] ReadPageInternal
   0.57%  postgres  postgres            [.] GetPrivateRefCountEntry
   0.52%  postgres  postgres            [.] heap_redo
   0.51%  postgres  postgres            [.] AdvanceNextFullTransactionIdPastXid

David
diff --git

Re: [UNVERIFIED SENDER] Re: Fix for parallel BTree initialization bug

2020-09-09 Thread Jameson, Hunter 'James'
Also, the behavior (=line of code) added by the bug fix is the same as existing 
code in the same function, _bt_first(), at lines 898, 1096, 1132, 1367. And the 
calls to _bt_parallel_readpage(), line 903, and _bt_steppage(), line 1416, will 
also ultimately call _bt_parallel_done(). So the bug seems to be a pretty 
simple oversight: in 6 out of 7 cases in _bt_first(), we call 
_bt_parallel_done() before returning "false"; but in the 7th case (fixed in 
this bug fix), we do not. The fix is to make case #7 the same as the other 6.
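As a toy model (not the actual PostgreSQL code), the invariant is that every path out of the scan-start routine that returns "false" must first mark the shared parallel-scan state as done, or the other workers' state machines are left waiting:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the shared parallel-scan state. */
static bool scan_done;

static void
parallel_done(void)
{
    scan_done = true;       /* stand-in for _bt_parallel_done() */
}

/*
 * Toy model of _bt_first(): when the quals cannot be satisfied, we must
 * mark the parallel scan done before returning false -- the fix makes
 * case #7 match the other six early-exit cases.
 */
static bool
first_tuple(bool qual_satisfiable)
{
    if (!qual_satisfiable)
    {
        parallel_done();    /* without this, workers wait forever */
        return false;
    }
    return true;            /* real code would continue the scan here */
}
```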

James

On 9/9/20, 7:11 AM, "Jameson, Hunter 'James'"  wrote:

Hi, I spent some time trying to create a repro (other than testing it on 
the production instance where we encountered the bug), but was unable to create 
one within a reasonable time.

The tricky part is that the bug symptoms are run-time symptoms -- so not 
only do you need, first, to satisfy conditions (1), (2), and (3), without the 
query optimizer optimizing them away! -- but you also need, second, a query 
that runs long enough for one or more of the parallel workers' state machines 
to get confused. (This wasn't a problem on the production instance where we 
encountered the bug and I tested the fix.)

Also, third-- passing InvalidBlockNumber to ReadBuffer() generally just 
appends a new block to the relation, so the bug doesn't even result in an error 
condition on an RW instance. (The production instance was RO...) So the bug, 
although very small!, is annoying!

James

On 9/9/20, 6:14 AM, "Amit Kapila"  wrote:

On Tue, Sep 8, 2020 at 11:55 PM Jameson, Hunter 'James'
 wrote:
>
> Hi, I ran across a small (but annoying) bug in initializing parallel 
BTree scans, which causes the parallel-scan state machine to get confused.
>
>
> To reproduce, you need a query that:
>
>
>
> 1. Executes parallel BTree index scan;
>
> 2. Has an IN-list of size > 1;
>
> 3. Has an additional index filter that makes it impossible to satisfy 
the
>
> first IN-list condition.
>
>
>
> (We encountered such a query, and therefore the bug, on a production 
instance.)
>
>

I think I can understand what you are pointing out here but it would
be great if you can have a reproducible test case because that will
make it apparent and we might want to include that in the regression
tests if possible.

--
With Regards,
Amit Kapila.




Re: doc review for v13

2020-09-09 Thread Justin Pryzby
I've added a few more.

-- 
Justin
From 137321a0d476f66b5e5f21c2f627c407330e50b1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 19:43:22 -0500
Subject: [PATCH v8 01/14] doc: btree deduplication

commit 0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27
Author: Peter Geoghegan 
---
 doc/src/sgml/btree.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index d03ee4d6fa..69c1ee0e97 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -642,7 +642,7 @@ options(relopts local_relopts *) returns
   
   
Deduplication works by periodically merging groups of duplicate
-   tuples together, forming a single posting list tuple for each
+   tuples together, forming a single posting list tuple for each
group.  The column key value(s) only appear once in this
representation.  This is followed by a sorted array of
TIDs that point to rows in the table.  This
-- 
2.17.0

From fc883d317a895140771ce34564847bb9ca98b7e3 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 29 Mar 2020 21:22:43 -0500
Subject: [PATCH v8 02/14] doc: pg_stat_progress_basebackup

commit e65497df8f85ab9b9084c928ff69f384ea729b24
Author: Fujii Masao 
---
 doc/src/sgml/monitoring.sgml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 304c49f07b..ea8780327f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6069,8 +6069,8 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   waiting for checkpoint to finish
   
The WAL sender process is currently performing
-   pg_start_backup to set up for
-   taking a base backup, and waiting for backup start
+   pg_start_backup to prepare to
+   take a base backup, and waiting for the start-of-backup
checkpoint to finish.
   
  
-- 
2.17.0

From 2d7c3ca43db77c910e8cda4b53fd6c6b09421b40 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 7 Apr 2020 19:56:56 -0500
Subject: [PATCH v8 03/14] doc: Allow users to limit storage reserved by
 replication slots

commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera 
---
 doc/src/sgml/config.sgml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a7177c550..190157df0a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3900,9 +3900,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
 slots are allowed to retain in the pg_wal
 directory at checkpoint time.
 If max_slot_wal_keep_size is -1 (the default),
-replication slots retain unlimited amount of WAL files.  If
-restart_lsn of a replication slot gets behind more than that megabytes
-from the current LSN, the standby using the slot may no longer be able
+replication slots may retain an unlimited amount of WAL files.  Otherwise, if
+restart_lsn of a replication slot falls behind the current LSN by more
+than the given size, the standby using the slot may no longer be able
 to continue replication due to removal of required WAL files. You
 can see the WAL availability of replication slots
 in pg_replication_slots.
-- 
2.17.0

From e06f8c6f048daa9829dc3bf0450caf5c099d9e4f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 6 Apr 2020 17:16:07 -0500
Subject: [PATCH v8 04/14] doc: s/evade/avoid/

---
 src/backend/access/gin/README | 2 +-
 src/backend/utils/adt/jsonpath_exec.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index 125a82219b..41d4e1e8a0 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -413,7 +413,7 @@ leftmost leaf of the tree.
 Deletion algorithm keeps exclusive locks on left siblings of pages comprising
 currently investigated path.  Thus, if current page is to be removed, all
 required pages to remove both downlink and rightlink are already locked.  That
-evades potential right to left page locking order, which could deadlock with
+avoids potential right to left page locking order, which could deadlock with
 concurrent stepping right.
 
 A search concurrent to page deletion might already have read a pointer to the
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index f146767bfc..2d2eb7d7a3 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -35,7 +35,7 @@
  * executeItemOptUnwrapTarget() function have 'unwrap' argument, which indicates
  * whether unwrapping of array is needed.  When unwrap == true, each of array
  * members is passed to executeItemOptUnwrapTarget() again but with unwrap == false
- * in order to evade subsequent array unwrapping.
+ * in order to avoid subsequent array unwrapping.

Re: SIGQUIT handling, redux

2020-09-09 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> This is to pull together a couple of recent threads that are seemingly
> unrelated to $SUBJECT.
> 
> Over in the long thread at [1] we've agreed to try to make the backend
> code actually do what assorted comments claim it does, namely allow
> SIGQUIT to be accepted at any point after a given process has established
> its signal handlers.  (Right now, it mostly is not accepted during error
> recovery, but there's no clear reason to preserve that exception.)
> 
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

As I mentioned over there, I agree that we should do this and we should
further have the statistics collector also do so, which currently sets
up SIGQUIT with ShutdownRequestPending() and in its loop decides it's
fine to write out the stats file (which we're going to remove during
recovery anyway...) and then call exit(0).  I also think we should
back-patch these changes, as the commit mentioned in Horiguchi-san's
patch for pgarch_exit() was.

> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, agreed.

> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.  So it's slightly tempting to drop startup_die() altogether in
> favor of using SignalHandlerForCrashExit for the SIGTERM-during-auth case
> too.  However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Agreed, that's definitely no good.

> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();
> I don't want to give up trying to send a message to the client.
> Note, however, that quickdie() does end with _exit(2) so that at
> least it's not trying to execute arbitrary process-cleanup code.
> 
> In short then, I want to ensure that postmaster children's SIGQUIT
> handlers always end with _exit(2) rather than some other exit method.
> We have two exceptions now and they need to get fixed.

I agree we should fix these.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: shared-memory based stats collector

2020-09-09 Thread Stephen Frost
Greetings,

* Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > Shouldn't this:
> > 
> > a) be back-patched, as the other change was
> > b) also include a change to have the stats collector (which I realize is
> >removed later on in this patch set, but we're talking about fixing
> >existing things..) for the same reason, and because there isn't much
> >point in trying to write out the stats after we get a SIGQUIT, since
> >we're just going to blow them away again since we're going to go
> >through crash recovery..?
> > 
> > Might be good to have a separate thread to address these changes.
> 
> +1

Just FYI, Tom's started a thread which includes this over here-

https://postgr.es/m/1850884.1599601...@sss.pgh.pa.us

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
>> anywhere, so I did a quick survey of backend signal handlers to see if
>> that is true.  I immediately found pgarch_exit() which surely is not.  It
>> turns out that Horiguchi-san already noticed that and proposed to fix it,
>> within the seemingly entirely unrelated patch series for a shared-memory
>> based stats collector (see patch 0001 in [2]).  I think we should go ahead
>> and commit that, and maybe even back-patch it.  There seems no good reason
>> for the archiver to treat SIGQUIT as nonfatal when other postmaster
>> children do not.

> As I mentioned over there, I agree that we should do this and we should
> further have the statistics collector also do so, which currently sets
> up SIGQUIT with ShutdownRequestPending() and in its loop decides it's
> fine to write out the stats file (which we're going to remove during
> recovery anyway...) and then call exit(0).

I noticed that that was different from everything else, but it's not
actually signal-unsafe, so it seems like a different topic from what
I'm on about at the moment.  I don't mind if you or somebody else
wants to change it, but I don't see it as a back-patchable bug fix.

> I also think we should
> back-patch these changes, as the commit mentioned in Horiguchi-san's
> patch for pgarch_exit() was.

Agreed; I'll go make it happen.

regards, tom lane




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
> Stephen Frost  writes:
>> As I mentioned over there, I agree that we should do this and we should
>> further have the statistics collector also do so, which currently sets
>> up SIGQUIT with ShutdownRequestPending() and in its loop decides it's
>> fine to write out the stats file (which we're going to remove during
>> recovery anyway...) and then call exit(0).

> I noticed that that was different from everything else, but it's not
> actually signal-unsafe, so it seems like a different topic from what
> I'm on about at the moment.  I don't mind if you or somebody else
> wants to change it, but I don't see it as a back-patchable bug fix.

Note also that the postmaster actually uses SIGQUIT to command normal
shutdown of the stats collector (cf reaper(), around line 3125 in HEAD).
So this needs a change in signaling conventions, not just internal
tweaks in the collector.  Not a big deal, but it reinforces my feeling
that it should be a HEAD-only change.

regards, tom lane




Re: unsupportable composite type partition keys

2020-09-09 Thread Julien Rouhaud
On Wed, Sep 9, 2020 at 4:17 PM Jobin Augustine  wrote:
>
> Is there a way out if someone accidentally executes the same test case 
> against PG12?
>
> testdb=# create table partitioned (a int, b int)
> testdb-#   partition by list ((row(a, b)::partitioned));
> CREATE TABLE
> testdb=# DROP TABLE partitioned;
> ERROR:  cache lookup failed for type 18269

AFAICT this is only a side effect of that particular use case if you
try to drop it without having a relcache entry.  Do any access before
dropping it and it should be fine, for instance:

rjuju=# create table partitioned (a int, b int)
rjuju-# partition by list ((row(a, b)::partitioned));
CREATE TABLE
rjuju=# DROP TABLE partitioned;
ERROR:  cache lookup failed for type 144845
rjuju=# \d partitioned
       Partitioned table "public.partitioned"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition key: LIST ((ROW(a, b)::partitioned))
Number of partitions: 0

rjuju=# DROP TABLE partitioned;
DROP TABLE




Re: Minor fixes for upcoming version 13

2020-09-09 Thread Tom Lane
Alexander Lakhin  writes:
> Please consider fixing minor defects in the source code and documentation.

I agree with all of these, except the markup fixes in bgworker.sgml
still seem not right; it should be more like

  RegisterBackgroundWorker(BackgroundWorker *worker)

> 6. "unhandled previous state in xqs" -> "unhandled previous state at end
> quote"?
> "xqs" is rather cryptic for a user-visible message

But it's not user-visible is it?  That should surely be a can't-happen
case.

Will push in a little bit.

regards, tom lane




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Justin Pryzby
On Wed, Sep 09, 2020 at 09:22:00PM +0900, Michael Paquier wrote:
> On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:
> > Initially I added List *params, and Michael suggested to retire
> > ReindexStmt->concurrent.  I provided a patch to do so, initially by leaving 
> > int
> > options and then, after this, removing it to "complete the thought", and get
> > rid of the remnants of the "old way" of doing it.  This is also how vacuum 
> > and
> > explain are done.
> > https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com
> 
> Defining a set of DefElem when parsing and then using the int
> "options" with bitmasks where necessary at the beginning of the
> execution looks like a good balance to me.  This way, you can extend
> the grammar to use things like (verbose = true), etc.

It doesn't need to be extended - defGetBoolean already handles that.  I don't
see what good can come from storing the information in two places in the same
structure.

|postgres=# CLUSTER (VERBOSE on) pg_attribute USING pg_attribute_relid_attnum_index ;
|INFO:  clustering "pg_catalog.pg_attribute" using index scan on "pg_attribute_relid_attnum_index"
|INFO:  "pg_attribute": found 0 removable, 2968 nonremovable row versions in 55 pages
|DETAIL:  0 dead row versions cannot be removed yet.
|CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.
|CLUSTER

-- 
Justin




Re: Yet another fast GiST build

2020-09-09 Thread Heikki Linnakangas

On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:

On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:


Come to think of it, the point z-order comparator could benefit a lot
from key abbreviation, too. You could do the point -> zorder conversion
in the abbreviation routine.


That's how it works in PostGIS, only that we moved to a more
efficient Hilbert curve:
https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171
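For reference, the point -> z-order conversion mentioned above amounts to interleaving the bits of the two coordinates; a minimal standalone sketch (hypothetical helper names, unrelated to the PostGIS Hilbert code):

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of v so a zero bit sits between each pair. */
static uint32_t
part1by1(uint32_t v)
{
    v &= 0xFFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/*
 * Interleave the bits of x and y into a 32-bit Morton (z-order) code --
 * the kind of value an abbreviation routine could compute from a point.
 */
static uint32_t
zorder(uint16_t x, uint16_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

Comparing the resulting codes as plain integers then gives the locality-preserving sort order.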


Thanks, that's interesting.

I implemented the abbreviated keys for the point opclass, too, and 
noticed that the patch as it was never used it. I reworked the patch so 
that tuplesort_begin_index_gist() is responsible for looking up the 
sortsupport function, like tuplesort_begin_index_btree() does, and uses 
abbreviation when possible.


I think this is pretty much ready for commit now. I'll do a bit more 
testing (do we have regression test coverage for this?), also on a 
SIZEOF_DATUM==4 system since the abbreviation works differently with 
that, and push if nothing new comes up. And clarify the documentation 
and/or comments that the sortsupport function sees "compressed" values.


I wonder if we could use sorting to also speed up building tsvector 
indexes? The values stored there are bit signatures, what would be a 
good sort order for those?


- Heikki




From 3a4d9c14631ae54b983d75433e0286f3dfedf432 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 9 Sep 2020 18:33:18 +0300
Subject: [PATCH v17 1/1] Add sort support for point gist_point_sortsupport

Implement GiST build using sort support

We use special sorting function provided by opclass to approximate
GiST tree with B-tree-like structure. This approach allows to
radically reduce build time in some cases.

Author: Andrey Borodin
Reviewed-by: Pavel Borisov, Thomas Munro
Discussion: https://www.postgresql.org/message-id/1A36620E-CAD8-4267-9067-FB31385E7C0D%40yandex-team.ru
---
 doc/src/sgml/gist.sgml  |  66 
 src/backend/access/gist/gistbuild.c | 499 
 src/backend/access/gist/gistproc.c  | 224 +++
 src/backend/access/gist/gistutil.c  |  53 ++-
 src/backend/access/gist/gistvalidate.c  |   6 +-
 src/backend/access/transam/xloginsert.c |  57 +++
 src/backend/utils/sort/sortsupport.c|  34 ++
 src/backend/utils/sort/tuplesort.c  |  57 +++
 src/include/access/gist.h   |   3 +-
 src/include/access/gist_private.h   |   3 +
 src/include/access/xloginsert.h |   2 +
 src/include/catalog/catversion.h|   1 +
 src/include/catalog/pg_amproc.dat   |   2 +
 src/include/catalog/pg_proc.dat |   3 +
 src/include/utils/sortsupport.h |   1 +
 src/include/utils/tuplesort.h   |   4 +
 16 files changed, 912 insertions(+), 103 deletions(-)

diff --git a/doc/src/sgml/gist.sgml b/doc/src/sgml/gist.sgml
index f9226e7a35c..b049094c811 100644
--- a/doc/src/sgml/gist.sgml
+++ b/doc/src/sgml/gist.sgml
@@ -259,6 +259,8 @@ CREATE INDEX ON my_table USING GIST (my_inet_column inet_ops);
compress method is omitted. The optional tenth method
options is needed if the operator class provides
the user-specified parameters.
+   The sortsupport method is also optional and is used to speed up
+   building a GiST index.
  
 
  
@@ -1065,6 +1067,70 @@ my_compress(PG_FUNCTION_ARGS)
   
  
 
+
+
+ sortsupport
+ 
+  
+   Returns a comparator function to sort data in a way that preserves
+   locality. It is used by CREATE INDEX and
+   REINDEX. The quality of the created index depends on
+   how well the sort order determined by the comparator routine preserves
+   locality of the inputs.
+  
+  
+   The sortsupport method is optional. If it is not
+   provided, CREATE INDEX builds the index by inserting
+   each tuple to the tree using the penalty and
+   picksplit functions, which is much slower.
+  
+
+  
+   The SQL declaration of the function must look like
+   this:
+
+
+CREATE OR REPLACE FUNCTION my_sortsupport(internal)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+
+   The argument is a pointer to a SortSupport
+   struct. At a minimum, the function must fill in its comparator field,
+   the full API is defined in
+   src/include/utils/sortsupport.h.
+   
+
+   
+The matching code in the C module could then follow this skeleton:
+
+
+PG_FUNCTION_INFO_V1(my_sortsupport);
+
+static int
+my_fastcmp(Datum x, Datum y, SortSupport ssup)
+{
+  /* establish order between x and y by computing some sorting value z */
+
+  int z1 = ComputeSpatialCode(x);
+  int z2 = ComputeSpatialCode(y);
+
+  return z1 == z2 ? 0 : z1 > z2 ? 1 : -1;
+}
+
+Datum
+my_sortsupport(PG_FUNCTION_ARGS)
+{
+  SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
+
+  ssup->comparator = my_fastcmp;
+  PG_RETURN_VOID();
+}
+
+  
+ 
+
   
 
   
diff --git

Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Amit Kapila wrote:

> I have included Alvaro as he is a committer for 187492b6, so he might
> remember something and let us know if this is a mistake or there is
> some reason for doing so (return true even when the db entry we are
> trying to read is corrupt).

Thanks -- I have to excuse myself here, as I don't have too many
memories about this.  It seems y'all have derived more insight that I
could possibly offer.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread John Naylor
On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
 wrote:
>
> OK, here is a rebased version. Most of the breakage was due to changes
> to the BRIN sgml docs.

Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
  from generate_series(1,1000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 1000, b = 1000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

QUERY PLAN
--
 Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 712
   Heap Blocks: lossy=56819
   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
 Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.328 ms
 Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it, does it choose the multi-minmax index:

explain analyze select * from t
  where a between 1923300::int and 1923600::int;

   QUERY PLAN
-
 Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
   Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
   Rows Removed by Index Recheck: 22240
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
 Index Cond: ((a >= 1923300) AND (a <= 1923600))
 Planning Time: 0.250 ms
 Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.

The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
  where a between 1923300::int and 1923600::int;
  QUERY PLAN
---
 Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)
 Filter: ((a >= 1923300) AND (a <= 1923600))
 Rows Removed by Filter: 237
 Planning Time: 0.089 ms
 Execution Time: 380.434 ms
(8 rows)

And just to compare size:

BRIN 32kB
BRIN multi  136kB
Btree   188MB

[1] 
https://www.postgresql.org/message-id/459eef3e-48c7-0f5a-8356-992442a78bb6%402ndquadrant.com

-- 
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistency in determining the timestamp of the db statfile.

2020-09-09 Thread Fujii Masao




On 2020/09/09 22:57, Magnus Hagander wrote:

On Wed, Sep 9, 2020 at 3:56 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

On Wed, Sep 09, 2020 at 03:53:40PM +0530, Amit Kapila wrote:
 >On Wed, Sep 9, 2020 at 3:15 PM Magnus Hagander <mag...@hagander.net> wrote:
 >>
 >> On Wed, Sep 9, 2020 at 5:04 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
 >>>
 >>
 >> Though in fact the one inconsistent place in the code now is that if it 
is corrupt in the db entry part of the file it returns true and the global timestamp, 
which I would argue is perhaps incorrect and it should return false.
 >>
 >
 >Yeah, this is exactly the case I was pointing out where we return true
 >before the patch, basically the code below:
 >case 'D':
 >if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
 >  fpin) != offsetof(PgStat_StatDBEntry, tables))
 >{
 >ereport(pgStatRunningInCollector ? LOG : WARNING,
 >(errmsg("corrupted statistics file \"%s\"",
 >statfile)));
 >goto done;
 >}
 >
 >done:
 >FreeFile(fpin);
 >return true;
 >
 >Now, if we decide to return 'false' here, then surely there is no
 >argument and we should return false in other cases as well. Basically,
 >I think we should be consistent in handling the corrupt file case.
 >

FWIW I do agree with this - we should return false here, to make it
return false like in the other data corruption cases. AFAICS that's the
only inconsistency here.


+1, I think that's the place to fix, rather than reversing all the other places.


+1 as I suggested upthread!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: VACUUM (INTERRUPTIBLE)?

2020-09-09 Thread Magnus Hagander
On Wed, Sep 9, 2020 at 12:58 AM Andres Freund  wrote:

> Hi,
>
> On 2020-09-08 22:30:40 +0200, Magnus Hagander wrote:
> > One thing I've been wanting many times but never properly got around to
> > investigating how much work it would be to make happen, was to be able to
> > trigger an autovacuum manually (yeah, strange choice of terms). That is,
> > instead of the above, you'd have something like "VACUUM BACKGROUND" which
> > would trigger an autovacuum worker to do the work, and then release your
> > session. The points then being both (1) the ability to interrupt it, and
> > (2) that it'd run in the backgorund and thus the foreground session could
> > disconnect.
> >
> > I think both would probably solve your problem, and being able to
> trigger a
> > background one would add some extra value? But we'd have to figure out
> and
> > be clear about what to do if all workers are busy for example - queue or
> > error?
> >
> > Worth considering, or am I missing something?
>
> It seems like it could be useful in general. Not that much for my case
> however. It'd be much easier to test whether vacuum was successfully
> cancelled if we can see the cancellation, which we can't in the
> autovacuum case. And there'd likely be some fairly ugly logic around
>

That does bring up the other thing that I even put together some hacky
patch for at one point (long since lost or badly-rebased-into-nothingness)
which is to have the stats collector track on a per-relation basis the
number of autovacuum interruptions that have happened on a specific table :)

But yes, with that it would still not be *as easy* to use, definitely.


> needing to wait until the "autovacuum request" is processed etc,
> including the case that all workers are currently busy.


> So my INTERRUPTIBLE idea seems to be a better fit for the tests, and
> independently quite useful. E.g. wanting to know whether VACUUM errored
> out is useful for scripts that want their VACUUMs to be interruptible,
> and that doesn't work well with the "backgrounding" idea of yours.
>
> Having said that, your idea does seem like it could be helpful. The
> difficulty seems to depend a bit on the exact desired
> semantics. E.g. would the "queue" command block until vacuum started or
> not? The latter would obviously be much easier...


Yeah, that's where I stalled in my own thoughts about it I think. OTOH, why
wait for it to start, if you're not waiting for it to finish... But also,
if there is a max size of the queue, what do you do if you hit that one?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Minor fixes for upcoming version 13

2020-09-09 Thread Alexander Lakhin
Hello Tom,

09.09.2020 18:29, Tom Lane wrote:
> Alexander Lakhin  writes:
>> Please consider fixing minor defects in the source code and documentation.
> I agree with all of these, except the markup fixes in bgworker.sgml
> still seem not right; it should be more like
>
>   RegisterBackgroundWorker(BackgroundWorker *worker)
Yes, but I've tried to minimize the change and align with the adjacent
text. For example:
Once running, the process can connect to a database by calling
   BackgroundWorkerInitializeConnection(char *dbname...
>> 6. "unhandled previous state in xqs" -> "unhandled previous state at end
>> quote"?
>> "xqs" is rather cryptic for a user-visible message
> But it's not user-visible is it?  That should surely be a can't-happen
> case.
I've encountered this while translating NLS messages in postgres.po and
ecpg.po. So I think it should at least puzzle translators.

Best regards,
Alexander




Re: Minor fixes for upcoming version 13

2020-09-09 Thread John Naylor
On Wed, Sep 9, 2020 at 12:16 PM Alexander Lakhin  wrote:
>
> Hello Tom,
>
> 09.09.2020 18:29, Tom Lane wrote:
> > But it's not user-visible is it?  That should surely be a can't-happen
> > case.
> I've encountered this while translating NLS messages in postgres.po and
> ecpg.po. So I think it should at least puzzle translators.

Maybe instead Assert(0) with a comment or something like that?

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Minor fixes for upcoming version 13

2020-09-09 Thread Tom Lane
John Naylor  writes:
> On Wed, Sep 9, 2020 at 12:16 PM Alexander Lakhin  wrote:
>> 09.09.2020 18:29, Tom Lane wrote:
>>> But it's not user-visible is it?  That should surely be a can't-happen
>>> case.

>> I've encountered this while translating NLS messages in postgres.po and
>> ecpg.po. So I think it should at least puzzle translators.

> Maybe instead Assert(0) with a comment or something like that?

Maybe what we need is yyerror_internal() for messages that don't require
translation.  Or just convert it to a plain elog(ERROR) ... that would
lose the possibility of providing an error cursor, but that seems like
something we don't need here either.

regards, tom lane




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Fujii Masao




On 2020/09/02 0:14, Tom Lane wrote:

Fujii Masao  writes:

The patch looks good to me, except the following minor thing.
+   if (fgets(buf.data + buf.len, buf.maxlen - buf.len - 1, fp) == NULL)
IIUC fgets() reads the data with the specified size - 1, so ISTM that -1 of
"buf.maxlen - buf.len - 1" is not necessary.
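fgets()'s size argument already reserves room for the terminating NUL, which is why the extra `- 1` is redundant; a tiny standalone demo of that behavior (read_back is a helper invented for illustration):

```c
#include <stdio.h>
#include <string.h>

/* Write `content` to a temp file and read it back with
 * fgets(buf, bufsize, fp).  fgets stores at most bufsize - 1
 * characters plus a terminating NUL, so passing size - 1 would
 * simply waste one usable byte of the buffer.
 * Returns the number of characters actually read. */
static size_t read_back(const char *content, char *buf, int bufsize)
{
    FILE *fp = tmpfile();
    size_t n = 0;

    fputs(content, fp);
    rewind(fp);
    if (fgets(buf, bufsize, fp) != NULL)
        n = strlen(buf);
    fclose(fp);
    return n;
}
```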


Good point, I was being unduly conservative.  Thanks for reviewing
the patch!


Thanks for committing the patch!

The patch was back-patched to v13, but v13 release note still has the following 
item.

Tighten libpq's overlength-line handling and comment detection for .pgpass 
files (Fujii Masao)

This should be changed to the following or something?

Teach libpq to handle arbitrary-length lines in .pgpass files (Tom Lane)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Fujii Masao




On 2020/09/09 21:15, Peter Eisentraut wrote:

On 2020-09-09 07:40, Fujii Masao wrote:

Attached is a patch to standardize Japanese names as given-name-first
in the v13 contributors list as before.


Using given-name-first order is our consensus? I was thinking we have not
reached that yet and our "vague" consensus was to use the name that each
contributor prefers, for example the name that used in the email signature, etc.

BTW, if possible I'd like to use family-name-first, i.e., "Fujii Masao".


See commit 53c89aed7b38ab412fddc1d6118822ce5d962acd for when this was changed.

At least it's the current practice.  It can be debated whether it's a good 
practice.


Thanks for letting me know that! Ok, using family-name-first rule only
for me would make the maintenance of the contributors list harder, so I'm ok
to follow the given-name-first rule for the list.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Tom Lane
Fujii Masao  writes:
> The patch was back-patched to v13, but v13 release note still has the 
> following item.

>  Tighten libpq's overlength-line handling and comment detection for 
> .pgpass files (Fujii Masao)

> This should be changed to the following or something?

>  Teach libpq to handle arbitrary-length lines in .pgpass files (Tom Lane)

Hm.  Actually, since that went further back than v13, I don't think
it should appear in the v13 notes at all; it will be material for
the next quarterly update release notes.

We could adjust the release-note entry for your patch to say
"Improve comment detection for .pgpass files", or we could decide
it's not worth documenting that part and just drop the entry.

regards, tom lane




Re: Yet another fast GiST build

2020-09-09 Thread Andrey M. Borodin



> 9 сент. 2020 г., в 20:39, Heikki Linnakangas  написал(а):
> 
> On 09/09/2020 15:20, Darafei "Komяpa" Praliaskouski wrote:
>> On Wed, Sep 9, 2020 at 3:09 PM Heikki Linnakangas  wrote:
>>> Come to think of it, the point z-order comparator could benefit a lot
>>> from key abbreviation, too. You could do the point -> zorder conversion
>>> in the abbreviation routine.
>> That's how it works in PostGIS, only that we moved to a more
>> efficient Hilbert curve:
>> https://github.com/postgis/postgis/blob/54399b9f6b0f02e8db9444f9f042b8d4ca6d4fa4/postgis/lwgeom_btree.c#L171
> 
> Thanks, that's interesting.
> 
> I implemented the abbreviated keys for the point opclass, too, and noticed 
> that the patch as it was never used it. I reworked the patch so that 
> tuplesort_begin_index_gist() is responsible for looking up the sortsupport 
> function, like tuplesort_begin_index_btree() does, and uses abbreviation when 
> possible.
Wow, abbreviated sort made gist for points construction even 1.5x faster!
btw there is a small typo in the arg names in gist_bbox_zorder_cmp_abbrev(): z1, z2 -> a, b
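The point-to-zorder conversion done in the abbreviation routine is essentially a Morton code: interleave the bits of the two coordinates into one sortable key. A standalone sketch, assuming the float coordinates have already been mapped to unsigned 32-bit integers:

```c
#include <stdint.h>

/* Spread the 32 bits of x so they occupy the even bit positions
 * of a 64-bit word (classic Morton-code bit interleaving). */
static uint64_t spread_bits(uint64_t x)
{
    x &= 0xFFFFFFFFULL;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

/* 64-bit z-order key for a 2D point: x occupies the even bits,
 * y the odd bits.  Sorting by this key gives the z-curve order. */
static uint64_t zorder_key(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}
```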

> do we have regression test coverage for this?
Yes, sorting build for points is tested in point.sql, but with small dataset. 
index_including_gist.sql seems to be working with boxes, but triggers point 
paths too.

> , also on a SIZEOF_DATUM==4 system since the abbreviation works differently 
> with that, and push if nothing new comes up. And clarify the documentation 
> and/or comments that the sortsupport function sees "compressed" values.
> 
> I wonder if we could use sorting to also speed up building tsvector indexes? 
> The values stored there are bit signatures, what would be a good sort order 
> for those?
We need an order so that nearby values have a lot of bits in common.
What is the length of this signature?
For each 4 bytes we can compute the number of 1s in its binary representation.
Then z-order these dwords as values 0-32.

This will be very inefficient grouping, but it will tend to keep empty and 
dense 4-byte regions apart.
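A toy sketch of this grouping key, under stated assumptions (a fixed two-dword signature; `__builtin_popcount` is a GCC/Clang builtin, so not portable as-is; the function names are invented):

```c
#include <stdint.h>

/* Interleave the low 6 bits of two small values (enough for
 * popcounts in 0..32) into one z-order grouping key. */
static uint32_t interleave6(uint32_t a, uint32_t b)
{
    uint32_t key = 0;
    for (int i = 0; i < 6; i++)
    {
        key |= ((a >> i) & 1u) << (2 * i);
        key |= ((b >> i) & 1u) << (2 * i + 1);
    }
    return key;
}

/* Coarse sort key for a two-dword bit signature: z-order the
 * per-dword popcounts, so signatures with similarly dense regions
 * sort near each other (empty and dense dwords are kept apart). */
static uint32_t signature_sort_key(uint32_t sig0, uint32_t sig1)
{
    return interleave6((uint32_t) __builtin_popcount(sig0),
                       (uint32_t) __builtin_popcount(sig1));
}
```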

Thanks for working on this!


Best regards, Andrey Borodin.




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Fujii Masao




On 2020/09/10 1:48, Tom Lane wrote:

Fujii Masao  writes:

The patch was back-patched to v13, but v13 release note still has the following 
item.



  Tighten libpq's overlength-line handling and comment detection for 
.pgpass files (Fujii Masao)



This should be changed to the following or something?



  Teach libpq to handle arbitrary-length lines in .pgpass files (Tom Lane)


Hm.  Actually, since that went further back than v13, I don't think
it should appear in the v13 notes at all; it will be material for
the next quarterly update release notes.

We could adjust the release-note entry for your patch to say
"Improve comment detection for .pgpass files", or we could decide
it's not worth documenting that part and just drop the entry.


"Improve comment detection for .pgpass files" is also the material for
minor version release because that change was also back-patched?
If so, I agree to drop the entry.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Minor cleanup of partbounds.c

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Etsuro Fujita wrote:

> Here is a patch for minor cleanup of the partbounds.c changes made by
> commit c8434d64c: 1) removes a useless assignment (in normal builds)

LGTM.

> and 2) improves comments a little.

No objection to changing "bounds" to "range bounds".

I think the other point is to replace the only appearance of "dummy
relation" to better match the extensive use of "dummy partition" in this
file.  The concept of a "dummy relation" is well established in the
planner.  I didn't know if "dummy partition" is itself a concept
(apparently in the newfangled partition-wise join stuff), or just
glorified wording to say "a dummy relation that happens to be a
partition".  Looking at is_dummy_partition, apparently a dummy partition
is either a dummy relation or a partition that doesn't have a
RelOptInfo.  So my conclusion is that this wording is okay to change
too.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Tom Lane
Fujii Masao  writes:
> On 2020/09/10 1:48, Tom Lane wrote:
>> We could adjust the release-note entry for your patch to say
>> "Improve comment detection for .pgpass files", or we could decide
>> it's not worth documenting that part and just drop the entry.

> "Improve comment detection for .pgpass files" is also the material for
> minor version release because that change was also back-patched?
> If so, I agree to drop the entry.

Hm, in a quick search I only see 2eb3bc588 which was not back-patched
... which commits are you thinking of?

regards, tom lane




Re: Global snapshots

2020-09-09 Thread Fujii Masao




On 2020/09/09 2:00, Alexey Kondratov wrote:

On 2020-09-08 14:48, Fujii Masao wrote:

On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds 
a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues 
above and tries to add proper comments everywhere. I think, that 0003 should be 
rebased on the top of it, or it could be a first patch in the set, since it may 
be used independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look on the Sawada-san's patch set I think that there are two 
major differences:


Thanks for sharing your thought! As far as I read your patch quickly,
I basically agree with your this view.




1. There is a built-in foreign xacts resolver in the [1], which should be much 
more convenient from the end-user perspective. It involves huge in-core changes 
and additional complexity that is of course worth it.

However, it's still not clear for me that it is possible to resolve all foreign 
prepared xacts on the Postgres' own side with a 100% guarantee. Imagine a 
situation when the coordinator node is actually a HA cluster group (primary + 
sync + async replica) and it failed just after the PREPARE stage or after local 
COMMIT. In that case all foreign xacts will be left in the prepared state. 
After failover process complete synchronous replica will become a new primary. 
Would it have all required info to properly resolve orphan prepared xacts?


IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.



I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.



Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()).



According to the Sawada-san's v25 0002 the logic is pretty much the same there:

+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() in the 
CommitTransaction(). Thus, I don't see many difference between these approach 
and CallXactCallbacks() usage regarding this point.


IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
2-1. WAL-log the commit record
2-2. Update CLOG
2-3. Wait for sync rep
2-4. FdwXactWaitForResolution()
Wait until COMMIT PREPARED commands are issued to the remote 
servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: More aggressive vacuuming of temporary tables

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-09 10:14:04 -0400, Stephen Frost wrote:
> > I've been toying with a patch that introduces more smarts about when a
> > row is removable, by looking more closely whether a specific row
> > versions are visible (e.g. in the common case of one old snapshot and
> > lots of newer rows). But that's orders of magnitude more complicated. So
> > going for something as simple as this seems like a good idea.
> 
> I've wondered about this for a long time- very cool that you've found
> time to actually work on a patch.  A couple of different ideas were
> discussed previously about how to do that kind of a check- mind talking
> about what method you're using, or perhaps just sharing that patch? :)

It's very very early, and it doesn't really work. I basically tried to
just plug a bit more intelligence into the dead tuple detection (which
now can easily store more and incrementally built state with the recent
changes for snapshot scalability).  Detection that tuples newer than the
horizon are dead isn't that complicated - what's hard is not breaking
things due to ctid chains lacking intermediate versions.  To avoid that
I had to restrict it to inserted (not updated) tuples that were
subsequently deleted. And my heuristic supported only one old
snapshot.

Building a bsearchable list of ranges of valid (xmin-xmax] ranges isn't
that hard. Some care needs to be taken to make the list non-overlapping,
but that's easy enough by just merging entries.
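A sketch of such a structure, under simplifying assumptions (plain uint32 xids with no wraparound handling, input ranges already sorted by xmin; names invented for illustration):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

typedef struct XidRange
{
    uint32_t xmin;   /* exclusive lower bound */
    uint32_t xmax;   /* inclusive upper bound */
} XidRange;

/* Merge a sorted-by-xmin array of (xmin, xmax] ranges in place so
 * the result is non-overlapping; returns the new entry count. */
static size_t merge_ranges(XidRange *r, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++)
    {
        if (r[i].xmin <= r[out].xmax)       /* overlaps or touches */
        {
            if (r[i].xmax > r[out].xmax)
                r[out].xmax = r[i].xmax;
        }
        else
            r[++out] = r[i];
    }
    return out + 1;
}

/* Binary search: does xid fall into any (xmin, xmax] range? */
static bool xid_in_ranges(const XidRange *r, size_t n, uint32_t xid)
{
    size_t lo = 0, hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (xid <= r[mid].xmin)
            hi = mid;                   /* below this range */
        else if (xid > r[mid].xmax)
            lo = mid + 1;               /* above this range */
        else
            return true;                /* xmin < xid <= xmax */
    }
    return false;
}
```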

Obviously lookup in such a more complicated structure isn't free. Nor is
building it. So we'd need some heuristics about when to do so. It'd
probably be OK to occasionally look at the age of the oldest in-progress
statement, to infer the time for old snapshots based on that. And then
we could have a GUC that says when it's worth doing more complicated
lookups.

I don't have a handle on how to deal with the ctid chaining for
intermediate row versions.

Greetings,

Andres Freund




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 12:04:28PM -0400, John Naylor wrote:

On Sat, Sep 5, 2020 at 7:21 PM Tomas Vondra
 wrote:


OK, here is a rebased version. Most of the breakage was due to changes
to the BRIN sgml docs.


Hi Tomas,

I plan on trying some different queries on different data
distributions to get a sense of when the planner chooses a
multi-minmax index, and whether the choice is good.

Just to start, I used the artificial example in [1], but scaled down a
bit to save time. Config is at the default except for:
shared_buffers = 1GB
random_page_cost = 1.1;
effective_cache_size = 4GB;

create table t (a bigint, b int) with (fillfactor=95);

insert into t select i + 1000*random(), i+1000*random()
 from generate_series(1,1000) s(i);

update t set a = 1, b = 1 where random() < 0.001;
update t set a = 1000, b = 1000 where random() < 0.001;

analyze t;

create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)

explain analyze select * from t
 where a between 1923300::int and 1923600::int;

   QUERY PLAN
--
Bitmap Heap Scan on t  (cost=16.10..43180.43 rows=291 width=12)
(actual time=217.770..1131.366 rows=288 loops=1)
  Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
  Rows Removed by Index Recheck: 712
  Heap Blocks: lossy=56819
  ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
width=0) (actual time=3.054..3.055 rows=568320 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 0.328 ms
Execution Time: 1131.411 ms
(8 rows)

Now add the multi-minmax:

create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)

The first interesting thing is, with both BRIN indexes available, the
planner still chooses the conventional BRIN index. Only when I disable
it, does it choose the multi-minmax index:

explain analyze select * from t
 where a between 1923300::int and 1923600::int;

  QUERY PLAN
-
Bitmap Heap Scan on t  (cost=68.10..43160.86 rows=291 width=12)
(actual time=1.835..4.196 rows=288 loops=1)
  Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
  Rows Removed by Index Recheck: 22240
  Heap Blocks: lossy=128
  ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
width=0) (actual time=0.691..0.691 rows=1280 loops=1)
Index Cond: ((a >= 1923300) AND (a <= 1923600))
Planning Time: 0.250 ms
Execution Time: 4.244 ms
(8 rows)

I wonder if this is a clue that something in the costing unfairly
penalizes a multi-minmax index. Maybe not enough to matter in
practice, since I wouldn't expect a user to put different kinds of
index on the same column.



I think this is much more an estimation issue than a costing one. Notice
that in the "regular" BRIN minmax index we have this:

   ->  Bitmap Index Scan on t_a_idx  (cost=0.00..16.03 rows=22595
   width=0) (actual time=3.054..3.055 rows=568320 loops=1)

while for the multi-minmax we have this:

   ->  Bitmap Index Scan on t_a_idx1  (cost=0.00..68.03 rows=22523
   width=0) (actual time=0.691..0.691 rows=1280 loops=1)

So yes, the multi-minmax index is costed a bit higher, mostly because
the index is a bit larger. (There's also a tweak to the correlation, but
that does not make much difference because it's just 0.99 vs. 1.0.)

The main difference is that for minmax the bitmap index scan actually
matches ~586k rows (a bit confusing, considering the heap scan has to
process almost 10M rows during recheck). But the multi-minmax only
matches ~1300 rows, with a recheck of 22k.

I'm not sure how to consider this during costing, as we only see these
numbers at execution time. One way would be to also consider "size" of
the ranges (i.e. max-min) vs. range of the whole column. But that's not
something we already have.

I'm not sure how troublesome this issue really is - I don't think people
are very likely to have both minmax and multi-minmax indexes on the same
column.


The second thing is, with parallel seq scan, the query is faster than
a BRIN bitmap scan, with this pathological data distribution, but the
planner won't choose it unless forced to:

set enable_bitmapscan = 'off';
explain analyze select * from t
 where a between 1923300::int and 1923600::int;
 QUERY PLAN
---
Gather  (cost=1000.00..120348.10 rows=291 width=12) (actual
time=372.766..380.364 rows=288 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on t  (cost=0.00..119319.00 rows=121
width=12) (actual time=268.326..366.228 rows=96 loops=3)
Filte

Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-09 Thread Fujii Masao




On 2020/09/08 12:03, Amit Kapila wrote:

On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao  wrote:


On 2020/09/08 10:34, Amit Kapila wrote:

On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao  wrote:


IMO it's not easy to commit this 2PC patch at once because it's still large
and complicated. So I'm thinking it's better to separate the feature into
several parts and commit them gradually.



Hmm, I don't see that we have a consensus on the design and or
interfaces of this patch and without that proceeding for commit
doesn't seem advisable. Here are a few points which I remember offhand
that require more work.


Thanks!


1. There is a competing design proposed and being discussed in another
thread [1] for this purpose. I think both the approaches have pros and
cons but there doesn't seem to be any conclusion yet on which one is
better.


I was thinking that [1] was discussing global snapshot feature for
"atomic visibility" rather than the solution like 2PC for "atomic commit".
But if another approach for "atomic commit" was also proposed at [1],
that's good. I will check that.



Okay, that makes sense.


I read Alexey's 2PC patch 
(0001-Add-postgres_fdw.use_twophase-GUC-to-use-2PC.patch)
proposed at [1]. As Alexey told at that thread, there are two big differences
between his patch and Sawada-san's; 1) whether there is the resolver process
for foreign transactions, 2) 2PC logic is implemented only inside postgres_fdw
or both FDW and PostgreSQL core.

I think that 2) is the first decision point. Alexey's 2PC patch is very simple
and all the 2PC logic is implemented only inside postgres_fdw. But this
means that 2PC is not usable if multiple types of FDW (e.g., postgres_fdw
and mysql_fdw) participate in the transaction. This may be ok if we implement
2PC feature only for PostgreSQL sharding using postgres_fdw. But if we
implement 2PC as the improvement on FDW independently from PostgreSQL
sharding, I think that it's necessary to support other FDW. And this is our
direction, isn't it?

Sawada-san's patch supports that case by implementing some components
for that also in PostgreSQL core. For example, with the patch, all the remote
transactions that participate in the transaction are managed by PostgreSQL
core instead of postgres_fdw layer.

Therefore, at least regarding the difference 2), I think that Sawada-san's
approach is better. Thought?

[1]
https://postgr.es/m/3ef7877bfed0582019eab3d462a43...@postgrespro.ru

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: shared-memory based stats collector

2020-09-09 Thread Tom Lane
Stephen Frost  writes:
> Just FYI, Tom's started a thread which includes this over here-
> https://postgr.es/m/1850884.1599601...@sss.pgh.pa.us

Per that discussion, I'm about to go and commit the 0001 patch from
this thread, which will cause the cfbot to not be able to apply the
patchset anymore till you repost it without 0001.  However, before
reposting, you might want to fix the compile errors the cfbot is
showing currently.

On the Windows side:

src/backend/postmaster/postmaster.c(6410): error C2065: 'pgStatSock' : 
undeclared identifier [C:\projects\postgresql\postgres.vcxproj]

On the Linux side:

pgstat.c: In function ‘pgstat_vacuum_stat’:
pgstat.c:1411:7: error: ‘key’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
   if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
   ^
pgstat.c:1373:8: note: ‘key’ was declared here
   Oid *key;
        ^
pgstat.c:1411:7: error: ‘oidtab’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
   if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
   ^
pgstat.c:1372:9: note: ‘oidtab’ was declared here
   HTAB    *oidtab;
            ^
pgstat.c: In function ‘pgstat_reset_single_counter’:
pgstat.c:1625:6: error: ‘stattype’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
  env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
        ^
pgstat.c:1601:14: note: ‘stattype’ was declared here
  PgStatTypes stattype;
              ^
cc1: all warnings being treated as errors

regards, tom lane




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Fujii Masao




On 2020/09/10 2:16, Tom Lane wrote:

Fujii Masao  writes:

On 2020/09/10 1:48, Tom Lane wrote:

We could adjust the release-note entry for your patch to say
"Improve comment detection for .pgpass files", or we could decide
it's not worth documenting that part and just drop the entry.



"Improve comment detection for .pgpass files" is also the material for
minor version release because that change was also back-patched?
If so, I agree to drop the entry.


Hm, in a quick search I only see 2eb3bc588 which was not back-patched
... which commits are you thinking of?


I thought your commit b55b4dad99 included the improvement on comment detection 
and was back-patched

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Fujii Masao wrote:

> On 2020/09/09 14:15, Etsuro Fujita wrote:
> > Hi,
> > 
> > Attached is a patch to standardize Japanese names as given-name-first
> > in the v13 contributors list as before.
> 
> Using given-name-first order is our consensus? I was thinking we have not
> reached that yet and our "vague" consensus was to use the name that each
> contributor prefers, for example the name that used in the email signature, 
> etc.

That's indeed our historical practice.  See previous thread where we've
discussed this at length,
https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2

The Economist piece Peter G cited is also relevant.

The commit Peter E cited seems more anecdotical than precedence-setting,
since there was no actual discussion, and whatever little there was was
confined to pgsql-committers.


An easy way to avoid any confusion is to uppercase the family name in
the cases where it goes first.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Remove line length restriction in passwordFromFile()

2020-09-09 Thread Tom Lane
Fujii Masao  writes:
> On 2020/09/10 2:16, Tom Lane wrote:
>> Hm, in a quick search I only see 2eb3bc588 which was not back-patched
>> ... which commits are you thinking of?

> I thought your commit b55b4dad99 included the improvement on comment 
> detection and was back-patched

Oh, right, I did include the check for '#' in what was back-patched.
I debated whether to do that or not, and was misremembering my decision.

So yeah, it seems there's no need to document 2eb3bc588 in the v13
notes.  I'll try to remember to give you part credit for b55b4dad9
when I do the November notes.

regards, tom lane




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, John Naylor wrote:

> create index on t using brin (a);
> CREATE INDEX
> Time: 1631.452 ms (00:01.631)

> create index on t using brin (a int8_minmax_multi_ops);
> CREATE INDEX
> Time: 6521.026 ms (00:06.521)

It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?

> The second thing is, with parallel seq scan, the query is faster than
> a BRIN bitmap scan, with this pathological data distribution, but the
> planner won't choose it unless forced to:
> 
> set enable_bitmapscan = 'off';
> explain analyze select * from t
>   where a between 1923300::int and 1923600::int;

This is probably explained by the fact that you likely have the whole
table in shared buffers, or at least in OS cache.  I'm not sure if the
costing should necessarily account for this.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Inconsistent Japanese name order in v13 contributors list

2020-09-09 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Sep-09, Fujii Masao wrote:
>> Using given-name-first order is our consensus?

> That's indeed our historical practice.  See previous thread where we've
> discussed this at length,
> https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2

> The Economist piece Peter G cited is also relevant.

Right.  I think the decree the Economist cites might be sufficient
reason to reopen the discussion, though I surely don't want it to
turn into another long thread.

regards, tom lane




Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch

2020-09-09 Thread Peter Eisentraut

On 2020-09-09 15:38, Tom Lane wrote:

and a few
more calls of int8_numeric that could be converted.  I think the
attached updated version is committable, and I'd recommend going
ahead with that regardless of the rest of this.  I hadn't realized
how many random calls of int8_numeric and int4_numeric we'd grown,
but there are a lot, so this is nice cleanup.



Yes, please go ahead with it.


It's your patch, I figured you'd want to commit it.


ok done

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-08 17:39:24 -0400, Tom Lane wrote:
> Of course, this is only safe if the SIGQUIT handler is safe to be invoked
> anywhere, so I did a quick survey of backend signal handlers to see if
> that is true.  I immediately found pgarch_exit() which surely is not.  It
> turns out that Horiguchi-san already noticed that and proposed to fix it,
> within the seemingly entirely unrelated patch series for a shared-memory
> based stats collector (see patch 0001 in [2]).  I think we should go ahead
> and commit that, and maybe even back-patch it.  There seems no good reason
> for the archiver to treat SIGQUIT as nonfatal when other postmaster
> children do not.

+1


> Another thing I found is startup_die() in postmaster.c, which catches
> SIGQUIT during the authentication phase of a new backend.  This calls
> proc_exit(1), which (a) is unsafe on its face, and (b) potentially
> demotes a SIGQUIT into something that will not cause the parent postmaster
> to perform a DB-wide restart.  There seems no good reason for that
> behavior, either.  I propose that we use SignalHandlerForCrashExit
> for SIGQUIT here too.

Yikes, yea, that seems like an important change.


I wish startup_die() weren't named startup_ - every single time I see
the name I think it's about the startup process...


I think StartupPacketTimeoutHandler is another case?



> In passing, it's worth noting that startup_die() isn't really much safer
> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> those is that code that applies BlockSig will at least manage to block the
> former.

Which is pretty unconvincing...

I wonder if we could at least *try* to rely on CFR() for a while. It'd
not be pretty, but probably doable, to set up a timer inside the signal
handler and try to exit using normal mechanisms for a while (some
overlap with authentication timeout).

That'd leave the question of what we do once that timer expires. There's
quite a few possible ways that could reproducibly happen, e.g. if DNS
lookups are required during auth.

The long term correct way to handle this would obviously be to
restructure everything that happens covered by startup_die() in a
non-blocking manner and just rely on CFR(). But that's a tall order to
get done anytime soon, particularly things like DNS are IIRC pretty hard
without relying on custom libraries.

Shorter term I don't really see an alternative to escalating to SIGQUIT
which is obviously terrible.


> So it's slightly tempting to drop startup_die() altogether in favor of
> using SignalHandlerForCrashExit for the SIGTERM-during-auth case too.
> However, that'd have the effect of causing a "fast shutdown" to get
> converted to a panic if it happens while any sessions are in
> authentication phase, so I think this cure is likely worse than the
> disease.

Indeed.


> I don't want to give up trying to send a message to the client.

That still doesn't make much sense to me. The potential for hanging
(e.g. inside malloc) is so much worse than not sending a message... And
we already infer things about the server dying in libpq when the
connection just closes (which I am admittedly also not happy about). But
I also think we can reduce the risk a bit, see below.

I only had one coffee so far (and it looks like the sun has died
outside), so maybe I'm just slow: But, uh, we don't currently send a
message in startup_die(), right?

So that part is about quickdie()?


> Worth reading in connection with this is the thread that led up to
> commits 8e19a8264 et al [3].  We had a long discussion about how
> quickdie() and startup_die() might be made safe(r), without any
> real consensus emerging about any alternatives being better than the
> status quo.  I still don't have an improvement idea for quickdie();

I still think that we should at least mitigate the risk to hang inside
quickdie() by at least disabling translations (since the rest happens
inside the pre-allocated error memory context, which should lower the
risk).  And perhaps do similarly for the memory required for encryption,
if enabled, and where possible.


Greetings,

Andres Freund




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 03:40:41PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, John Naylor wrote:


create index on t using brin (a);
CREATE INDEX
Time: 1631.452 ms (00:01.631)



create index on t using brin (a int8_minmax_multi_ops);
CREATE INDEX
Time: 6521.026 ms (00:06.521)


It seems strange that the multi-minmax index takes so much longer to
build.  I wonder if there's some obvious part of the algorithm that can
be improved?



There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.

The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.

I see two possible optimizations - firstly, adding some sort of batch
variant of the add_value function, which would get a bunch of values
instead of just a single one, amortizing the serialization costs.

Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Tomas Vondra wrote:

> There are some minor optimizations possible - for example I noticed we
> call minmax_multi_get_strategy_procinfo often because it happens in a
> loop, and we could easily do it just once. But that saves only about 10%
> or so, it's not a ground-breaking optimization.

Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.

> The main reason for the slowness is that we pass the values one by one
> to brin_minmax_multi_add_value - and on each call we need to deserialize
> (and then sometimes also serialize) the summary, which may be quite
> expensive. The regular minmax does not have this issue, it just swaps
> the Datum value and that's it.

Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:

> Another option would be to teach add_value to keep the deserialized
> summary somewhere, and then force serialization at the end of the BRIN
> page range. The end result would be roughly the same, I think.


Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.


On a different train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.

-- 
Álvaro Herrera        https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Andres Freund  writes:
> I wish startup_die() weren't named startup_ - every single time I see
> the name I think it's about the startup process...

We could call it startup_packet_die or something?

> I think StartupPacketTimeoutHandler is another case?

Yeah.  Although it's a lot less risky, since if the timeout is reached
we're almost certainly waiting for client input.

>> In passing, it's worth noting that startup_die() isn't really much safer
>> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
>> those is that code that applies BlockSig will at least manage to block the
>> former.

> Which is pretty unconvincing...

Agreed, it'd be nice if this were less shaky.  On the other hand,
we've seen darn few complaints traceable to this AFAIR.  I'm not
really sure it's worth putting a lot of effort into.

> The long term correct way to handle this would obviously be to
> restructure everything that happens covered by startup_die() in a
> non-blocking manner and just rely on CFR(). But that's a tall order to
> get done anytime soon, particularly things like DNS are IIRC pretty hard
> without relying on custom libraries.

Not only DNS, but all the various auth libraries would have to be
contended with.  Lots of work there compared to the likely rewards.

>> I don't want to give up trying to send a message to the client.

> That still doesn't make much sense to me. The potential for hanging
> (e.g. inside malloc) is so much worse than not sending a message...

We see backends going through this code on a very regular basis in the
buildfarm, but complete hangs are rare as can be.  I think you
overestimate the severity of the problem.

> I only had one coffee so far (and it looks like the sun has died
> outside), so maybe I'm just slow: But, uh, we don't currently send a
> message in startup_die(), right?
> So that part is about quickdie()?

Right.  Note that startup_die() is pre-authentication, so I'm doubtful
that we should tell the would-be client anything about the state of
the server at that point, even ignoring these risk factors.  (I'm a
bit inclined to remove the comment suggesting that'd be desirable.)

regards, tom lane




Re: SIGQUIT handling, redux

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I wish startup_die() weren't named startup_ - every single time I see
> > the name I think it's about the startup process...
> 
> We could call it startup_packet_die or something?

Yea, I think that'd be good.


> > I think StartupPacketTimeoutHandler is another case?
> 
> Yeah.  Although it's a lot less risky, since if the timeout is reached
> we're almost certainly waiting for client input.

An adversary could control that to a significant degree - but ordinarily
I agree...


> >> In passing, it's worth noting that startup_die() isn't really much safer
> >> for SIGTERM than it is for SIGQUIT; the only argument for distinguishing
> >> those is that code that applies BlockSig will at least manage to block the
> >> former.
> 
> > Which is pretty unconvincing...
> 
> Agreed, it'd be nice if this were less shaky.  On the other hand,
> we've seen darn few complaints traceable to this AFAIR.  I'm not
> really sure it's worth putting a lot of effort into.

Not sure how many would plausibly reach us though.  A common reaction
will likely just be to force-restart the server, not to fully
investigate the issue.  Particularly because it'll often happen once
something has already gone wrong...



> >> I don't want to give up trying to send a message to the client.
> 
> > That still doesn't make much sense to me. The potential for hanging
> > (e.g. inside malloc) is so much worse than not sending a message...
> 
> We see backends going through this code on a very regular basis in the
> buildfarm, but complete hangs are rare as can be.  I think you
> overestimate the severity of the problem.

I don't think the BF exercises the problematic paths to a significant
degree. It's mostly local socket connections, and where not it's
localhost. There's no slow DNS, no more complicated authentication
methods, no packet loss. How often do we ever actually end up even
getting close to any of the paths but immediate shutdowns? And in the
SIGQUIT path, how often do we end up in the SIGKILL path, masking
potential deadlocks?


> > I only had one coffee so far (and it looks like the sun has died
> > outside), so maybe I'm just slow: But, uh, we don't currently send a
> > message in startup_die(), right?
> > So that part is about quickdie()?
> 
> Right.  Note that startup_die() is pre-authentication, so I'm doubtful
> that we should tell the would-be client anything about the state of
> the server at that point, even ignoring these risk factors.  (I'm a
> bit inclined to remove the comment suggesting that'd be desirable.)

Yea, I think just putting in an editorialized version of your paragraph
would make sense.

Greetings,

Andres Freund




Re: WIP: BRIN multi-range indexes

2020-09-09 Thread Tomas Vondra

On Wed, Sep 09, 2020 at 04:53:30PM -0300, Alvaro Herrera wrote:

On 2020-Sep-09, Tomas Vondra wrote:


There are some minor optimizations possible - for example I noticed we
call minmax_multi_get_strategy_procinfo often because it happens in a
loop, and we could easily do it just once. But that saves only about 10%
or so, it's not a ground-breaking optimization.


Well, I guess this kind of thing should be fixed regardless while we
still know it's there, just to avoid an obvious inefficiency.



Sure. I was just suggesting it's not something that'd make this very
close to the plain minmax opclass.


The main reason for the slowness is that we pass the values one by one
to brin_minmax_multi_add_value - and on each call we need to deserialize
(and then sometimes also serialize) the summary, which may be quite
expensive. The regular minmax does not have this issue, it just swaps
the Datum value and that's it.


Ah, right, that's more interesting.  The original dumb BRIN code
separates BrinMemTuple from BrinTuple so that things can be operated
efficiently in memory.  Maybe something similar can be done in this
case, which also sounds like your second suggestion:


Another option would be to teach add_value to keep the deserialized
summary somewhere, and then force serialization at the end of the BRIN
page range. The end result would be roughly the same, I think.




Well, the patch already has Ranges (memory) and SerializedRanges (disk)
but it's not very clear to me where to stash the in-memory data and
where to make the conversion.



Also, I think you could get a few initial patches pushed soon, since
they look like general improvements rather than specific to multi-range.



Yeah, I agree. I plan to review those once again in a couple days and
then push them.



On a different train of thought, I wonder if we shouldn't drop the idea
of there being two minmax opclasses; just have one (still called
"minmax") and have the multi-range version be the v2 of it.  We would
still need to keep code to operate on the old one, but if you ever
REINDEX then your index is upgraded to the new one.  I see no reason to
keep the dumb minmax version around, assuming the performance is roughly
similar.



I'm not a huge fan of that. I think it's unlikely we'll ever make this
new set of opclasses just as fast as the plain minmax, and moreover it
does have some additional requirements (e.g. the distance proc, which
may not make sense for some data types).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
Andres Freund  writes:
> On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
>> We could call it startup_packet_die or something?

> Yea, I think that'd be good.

I'll make it so.

>> We see backends going through this code on a very regular basis in the
>> buildfarm, but complete hangs are rare as can be.  I think you
>> overestimate the severity of the problem.

> I don't think the BF exercises the problematic paths to a significant
> degree. It's mostly local socket connections, and where not it's
> localhost. There's no slow DNS, no more complicated authentication
> methods, no packet loss. How often do we ever actually end up even
> getting close to any of the paths but immediate shutdowns?

Since we're talking about quickdie(), immediate shutdown/crash restart
is exactly the case of concern, and the buildfarm exercises it all the
time.

> And in the
> SIGQUIT path, how often do we end up in the SIGKILL path, masking
> potential deadlocks?

True, we can't really tell that.  I wonder if we should make the
postmaster emit a log message when it times out and goes to SIGKILL.
After a few months we could scrape the buildfarm logs and get a
pretty good handle on it.

regards, tom lane




Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
> Not only DNS, but all the various auth libraries would have to be
> contended with.  Lots of work there compared to the likely rewards.

Wait a minute.  The entire authentication cycle happens inside
InitPostgres, using the backend's normal signal handlers.  So
maybe we are overthinking the problem.  What if we simply postpone
ProcessStartupPacket into that same place, and run it under the same
rules as we do for authentication?  We would waste more cycles than
we do now for the case where the client closes the connection without
sending a startup packet, but not enormously so, I think --- and
optimizing that case doesn't seem like a high-priority goal anyway.
And cases like DNS lookup taking forever don't seem like any more of
an issue than auth lookup taking forever.

regards, tom lane




Re: PG 13 release notes, first draft

2020-09-09 Thread Jonathan S. Katz
Hi,

On 5/4/20 11:16 PM, Bruce Momjian wrote:
> I have committed the first draft of the PG 13 release notes.  You can
> see them here:
> 
>   https://momjian.us/pgsql_docs/release-13.html
> 
> It still needs markup, word wrap, and indenting.  The community doc
> build should happen in a few hours.

Thank you again for compiling and maintaining the release notes through
another major release cycle, I know it's no small undertaking!

Attached is a proposal for the "major enhancements" section. I borrowed
from the press release[1] but tried to stay true to the release notes
format and listed out the enhancements that way.

Open to suggestion, formatting changes, etc.

Thanks!

Jonathan

[1]
https://www.postgresql.org/message-id/3bd579f8-438a-ed1a-ee20-738292099aae%40postgresql.org

diff --git a/doc/src/sgml/release-13.sgml b/doc/src/sgml/release-13.sgml
index 68f9a0a9f1..95106921d7 100644
--- a/doc/src/sgml/release-13.sgml
+++ b/doc/src/sgml/release-13.sgml
@@ -6,29 +6,49 @@
 
   
Release date:
-   2020-XX-XX, CURRENT AS OF 2020-08-09
+   2020-09-24, CURRENT AS OF 2020-09-09
   
 
   
Overview
 
 
- Major enhancements in PostgreSQL 13 include:
+ PostgreSQL 13 contains many new features and
+ enhancements, including:
 
 
-   
-

-
 
-TBD
+ 
+  Space savings and performance gains from B-tree index deduplication
+ 
+
+
+ 
+  Improved response times for queries that use aggregates or partitions
+ 
+
+
+ 
+  Better query planning when using enhanced statistics
+ 
+
+
+ 
+  Parallelized vacuuming of B-tree indexes
+ 
+
+
+ 
+  Incremental sorting
+ 
 
-

 
-
- The above items are explained in more detail in the sections below.
-
+   
+and more. The above items and other new features of PostgreSQL 13 are
+explained in more detail in the sections below.
+   
 
   
 


signature.asc
Description: OpenPGP digital signature


Re: [Patch] ALTER SYSTEM READ ONLY

2020-09-09 Thread Andres Freund
Hi,

Thomas, there's one point below that could be relevant for you. You can
search for your name and/or checkpoint...


On 2020-09-01 16:43:10 +0530, Amul Sul wrote:
> diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
> index 42050ab7195..0ac826d3c2f 100644
> --- a/src/backend/nodes/readfuncs.c
> +++ b/src/backend/nodes/readfuncs.c
> @@ -2552,6 +2552,19 @@ _readAlternativeSubPlan(void)
>   READ_DONE();
>  }
>  
> +/*
> + * _readAlterSystemWALProhibitState
> + */
> +static AlterSystemWALProhibitState *
> +_readAlterSystemWALProhibitState(void)
> +{
> + READ_LOCALS(AlterSystemWALProhibitState);
> +
> + READ_BOOL_FIELD(WALProhibited);
> +
> + READ_DONE();
> +}
> +

Why do we need readfuncs support for this?

> +
> +/*
> + * AlterSystemSetWALProhibitState
> + *
> + * Execute ALTER SYSTEM READ { ONLY | WRITE } statement.
> + */
> +static void
> +AlterSystemSetWALProhibitState(AlterSystemWALProhibitState *stmt)
> +{
> + /* some code */
> + elog(INFO, "AlterSystemSetWALProhibitState() called");
> +}

As long as it's not implemented it seems better to return an ERROR.

> @@ -3195,6 +3195,16 @@ typedef struct AlterSystemStmt
>   VariableSetStmt *setstmt;   /* SET subcommand */
>  } AlterSystemStmt;
>  
> +/* --
> + *   Alter System Read Statement
> + * --
> + */
> +typedef struct AlterSystemWALProhibitState
> +{
> + NodeTag type;
> + boolWALProhibited;
> +} AlterSystemWALProhibitState;
> +

All the nearby fields use under_score_style names.



> From f59329e4a7285c5b132ca74473fe88e5ba537254 Mon Sep 17 00:00:00 2001
> From: Amul Sul 
> Date: Fri, 19 Jun 2020 06:29:36 -0400
> Subject: [PATCH v6 3/5] Implement ALTER SYSTEM READ ONLY using global barrier.
> 
> Implementation:
> 
>  1. When a user tried to change server state to WAL-Prohibited using
> ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState()
> raises request to checkpointer by marking current state to inprogress in
> shared memory.  Checkpointer, noticing that the current state is has

"is has"

> WALPROHIBIT_TRANSITION_IN_PROGRESS flag set, does the barrier request, and
> then acknowledges back to the backend who requested the state change once
> the transition has been completed.  Final state will be updated in control
> file to make it persistent across the system restarts.

What makes checkpointer the right backend to do this work?


>  2. When a backend receives the WAL-Prohibited barrier, at that moment if
> it is already in a transaction and the transaction already assigned XID,
> then the backend will be killed by throwing FATAL(XXX: need more 
> discussion
> on this)


>  3. Otherwise, if that backend running transaction which yet to get XID
> assigned we don't need to do anything special

Somewhat garbled sentence...


>  4. A new transaction (from existing or new backend) starts as a read-only
> transaction.

Maybe "(in an existing or in a new backend)"?


>  5. Autovacuum launcher as well as checkpointer will don't do anything in
> WAL-Prohibited server state until someone wakes us up.  E.g. a backend
> might later on request us to put the system back to read-write.

"will don't do anything", "might later on request us"


>  6. At shutdown in WAL-Prohibited mode, we'll skip shutdown checkpoint
> and xlog rotation. Starting up again will perform crash recovery(XXX:
> need some discussion on this as well) but the end of recovery checkpoint
> will be skipped and it will be performed when the system changed to
> WAL-Permitted mode.

Hm, this has some interesting interactions with some of Thomas' recent
hacking.


>  8. Only super user can toggle WAL-Prohibit state.

Hm. I don't quite agree with this. We try to avoid if (superuser())
style checks these days, because they can't be granted to other
users. Look at how e.g. pg_promote() - an operation of similar severity
- is handled. We just revoke the permission from public in
system_views.sql:
REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public;


>  9. Add system_is_read_only GUC show the system state -- will true when system
> is wal prohibited or in recovery.

*shows the system state. There's also some oddity in the second part of
the sentence.

Is it really correct to show system_is_read_only as true during
recovery? For one, recovery could end soon after, putting the system
into r/w mode, if it wasn't actually ALTER SYSTEM READ ONLY'd. But also,
during recovery the database state actually changes if there are changes
to replay.  ISTM it would not be a good idea to mix ASRO and
pg_is_in_recovery() into one GUC.


> --- /dev/null
> +++ b/src/backend/access/transam/walprohibit.c
> @@ -0,0 +1,321 @@
> +/*-
> + *
> + * walprohibit.c
> + *   PostgreSQL write-ahead log prohibit states
> + *
>

Re: [HACKERS] [PATCH] Generic type subscripting

2020-09-09 Thread Justin Pryzby
On Wed, Aug 05, 2020 at 04:04:22PM +0200, Dmitry Dolgov wrote:
> > On Sun, Aug 02, 2020 at 12:50:12PM +0200, Pavel Stehule wrote:
> > >
> > > > Maybe this could be salvaged by flushing 0005 in its current form and
> > > > having the jsonb subscript executor do something like "if the current
> > > > value-to-be-subscripted is a JSON array, then try to convert the textual
> > > > subscript value to an integer".  Not sure about what the error handling
> > > > rules ought to be like, though.
> > >
> > > I'm fine with the idea of separating 0005 patch and potentially prusuing
> > > it as an independent item. Just need to rebase 0006, since Pavel
> > > mentioned that it's a reasonable change he would like to see in the
> > > final result.
> >
> > +1
> 
> Here is what I had in mind. Worth noting that, as well as the original

This seems to already hit a merge conflict (8febfd185).
Would you re-rebase ?

-- 
Justin




Re: SIGQUIT handling, redux

2020-09-09 Thread Andres Freund
Hi,

On 2020-09-09 16:30:37 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2020-09-09 16:09:00 -0400, Tom Lane wrote:
> >> We could call it startup_packet_die or something?
> 
> > Yea, I think that'd be good.
> 
> I'll make it so.

Thanks!


> >> We see backends going through this code on a very regular basis in the
> >> buildfarm, but complete hangs are rare as can be.  I think you
> >> overestimate the severity of the problem.
> 
> > I don't think the BF exercises the problematic paths to a significant
> > degree. It's mostly local socket connections, and where not it's
> > localhost. There's no slow DNS, no more complicated authentication
> > methods, no packet loss. How often do we ever actually end up even
> > getting close to any of the paths but immediate shutdowns?
> 
> Since we're talking about quickdie(), immediate shutdown/crash restart
> is exactly the case of concern, and the buildfarm exercises it all the
> time.

Yea, but only in simple cases. Largely no SSL / kerberos. Largely
untranslated. Mostly the immediate shutdowns don't happen while we're
inside plpython or such.


> > And in the
> > SIGQUIT path, how often do we end up in the SIGKILL path, masking
> > potential deadlocks?
> 
> True, we can't really tell that.  I wonder if we should make the
> postmaster emit a log message when it times out and goes to SIGKILL.
> After a few months we could scrape the buildfarm logs and get a
> pretty good handle on it.

I think that'd be a good idea.

Greetings,

Andres Freund




Re: v13: show extended stats target in \d

2020-09-09 Thread Justin Pryzby
On Tue, Sep 01, 2020 at 12:35:19PM +, Georgios Kokolatos wrote:
> I will humbly disagree with the current review. I shall refrain from changing 
> the status though because I do not have very strong feeling about it.

Sorry but this was in my spam, and didn't see until now.

> 
> However the patch contains:
> 
> - "  'm' = 
> any(stxkind) AS mcv_enabled\n"
> + "  'm' = 
> any(stxkind) AS mcv_enabled,\n"
> + "  %s"
>   "FROM 
> pg_catalog.pg_statistic_ext stat "
>   "WHERE stxrelid = 
> '%s'\n"
>   "ORDER BY 1;",
> + (pset.sversion >= 
> 13 ? "stxstattarget\n" : "-1\n"),
>   oid);
> 
> This seems to be breaking a bit the pattern in describeOneTableDetails().
> First, it is customary to write the whole query for the version in its own 
> block. I do find this pattern rather helpful and clean. So in my humble 
> opinion, the ternary expression should be replaced with a distinct if block 
> that would implement stxstattarget. Second, setting the value to -1 as an 
> indication of absence breaks the pattern a bit further. More on that bellow.

Hm, I did like this using the "hastriggers" code as a template.  But you're
right that everywhere else does it differently.

> +   if (strcmp(PQgetvalue(result, i, 8), 
> "-1") != 0)
> +   appendPQExpBuffer(&buf, " 
> STATISTICS %s",
> + 
> PQgetvalue(result, i, 8));
> +
> 
> In the same function, one will find a bit of a different pattern for printing 
> the statistics, e.g.
>  if (strcmp(PQgetvalue(result, i, 7), "t") == 0) 
> I will again hold the opinion that if the query gets crafted a bit 
> differently, one can inspect if the table has `stxstattarget` and then, go 
> ahead and print the value.
> 
> Something in the lines of:
> if (strcmp(PQgetvalue(result, i, 8), "t") == 0)
> appendPQExpBuffer(&buf, " STATISTICS %s", 
> PQgetvalue(result, i, 9));

I think what you've written wouldn't give the desired behavior, since it would
show the stats target even when it's set to the default.  I don't see the point
of having additional, separate, version-specific boolean columns for 1) column
exists; 2) column is not default, in addition to 3) column value.  But I added
comment about what Peter and I agree is desirable, anyway.

> Finally, the patch might be more complete if a note is added in the 
> documentation.
> Have you considered adding something in doc/src/sgml/ref/psql-ref.sgml? If 
> no, will you consider it? If yes, why did you discard it?

I don't think the details of psql output are currently documented.  This shows
nothing about column statistics target, nor about MV stats at all.
https://www.postgresql.org/docs/13/app-psql.html

As for the discussion about a separator, I think maybe a comma is enough.  I
doubt anyone is going to think that you can get a valid command by prefixing
this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was a
valid command without the stats target - after all, that's not true of indexes.

+"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1, 
STATISTICS 0

This revision only shows the stats target in verbose mode (slash dee plus).

-- 
Justin
>From e5b351cc9b0518c9162a4b988b17ed920f09d952 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 30 Aug 2020 23:38:17 -0500
Subject: [PATCH v2] Show stats target of extended statistics

The stats target can be set since commit d06215d03, but wasn't shown by psql.
 ALTER STATISTICS .. SET STATISTICS n.
Traditional column stats targets are shown in \d+ table, so do the same for
extended/MV stats.
---
 src/bin/psql/describe.c | 14 --
 src/test/regress/expected/stats_ext.out | 18 ++
 src/test/regress/sql/stats_ext.sql  |  2 ++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 0861d74a6f..3c391fe686 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2683,8 +2683,13 @@ describeOneTableDetails(const char *schemaname,
 			  "a.attnum = s.attnum AND NOT attisdropped)) AS columns,\n"
 			  "  'd' = any(stxkind) AS ndist_enabled,\n"
 			  "  'f' = any(stxkind) AS deps_enabled,\n"
-			  "  'm' = any(stxkind) AS mcv_enabled\n"
-			  "FROM pg_catalog.pg_statistic_ext stat "
+			  "  'm' = any(stxkind) AS mcv_enabled");
+
+			if (pset.sversion >= 1300

Re: SIGQUIT handling, redux

2020-09-09 Thread Tom Lane
I wrote:
>> Not only DNS, but all the various auth libraries would have to be
>> contended with.  Lots of work there compared to the likely rewards.

> Wait a minute.  The entire authentication cycle happens inside
> InitPostgres, using the backend's normal signal handlers.  So
> maybe we are overthinking the problem.  What if we simply postpone
> ProcessStartupPacket into that same place, and run it under the same
> rules as we do for authentication?

Or, continuing to look for other answers ...

During BackendInitialize we have not yet touched shared memory.
This is not incidental but must be so, per its header comment.
Therefore it seems like we could have these signal handlers (for
SIGTERM or timeout) do _exit(1), thereby resolving the signal
unsafety while not provoking a database restart.  We don't
care whether inside-the-process state is sane, and we shouldn't
have changed anything yet in shared memory.

This is obviously not 100.00% safe, since maybe something could
violate these assumptions, but it seems a lot safer than using
proc_exit() from a signal handler.

One way to help catch any such assumption-violations is to add
an assertion that no on_shmem_exit handlers have been set up yet.
If there are none, there should be no expectation of having to
clean up shared state.

regards, tom lane




Re: v13: show extended stats target in \d

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Justin Pryzby wrote:

> As for the discussion about a separator, I think maybe a comma is enough.  I
> doubt anyone is going to think that you can get a valid command by prefixing
> this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was 
> valid
> command without the stats target - after all, that's not true of indexes.
> 
> +"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM 
> ab1, STATISTICS 0

I vote to use a semicolon instead of comma, and otherwise +1.

> This revision only shows the stats target in verbose mode (\d+).

I don't think this interferes enough with other stuff to relegate it to
the "plus" version, since it's not shown when it's the default.

Tomas -- this needs to go in pg13, right?

-- 
Álvaro Herrera   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: v13: show extended stats target in \d

2020-09-09 Thread Alvaro Herrera
On 2020-Sep-09, Justin Pryzby wrote:

> As for the discussion about a separator, I think maybe a comma is enough.  I
> doubt anyone is going to think that you can get a valid command by prefixing
> this by "CREATE STATISTICS".  Actually, it didn't even occur to me it was 
> valid
> command without the stats target - after all, that's not true of indexes.

By the way, I got into this because of your comment in
https://postgr.es/m/20200901011429.gz5...@telsasoft.com on whether we
needed CREATE STATISTICS to have a clause for this.

-- 
Álvaro Herrera   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



