"ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Peter Smith
While doing logical replication testing, we encountered a problem that
causes a deadlock error to be logged when a TRUNCATE is replicated
simultaneously to two subscriptions.
e.g.
--
2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for
ShareLock on transaction 597; blocked by process 11582.
Process 11582 waits for ShareLock on relation 16384 of database 14896;
blocked by process 11393.
--

At this time, both the subscriber (apply worker) processes are
executing inside the ExecuteTruncateGuts function simultaneously and
they become co-dependent.

e.g.
--
Process 11582
(gdb) bt
#0  0x7fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x0093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
latch.c:1450
#2  0x0093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
at latch.c:1396
#3  0x0093d8cd in WaitLatch (latch=0x7fa191042654,
wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
#4  0x009660f0 in ProcSleep (locallock=0x2fd06d8,
lockMethodTable=0xcd90a0 ) at proc.c:1361
#5  0x00954fd5 in WaitOnLock (locallock=0x2fd06d8,
owner=0x2fd9a48) at lock.c:1858
#6  0x00953c14 in LockAcquireExtended (locktag=0x7ffed5fe0370,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x7ffed5fe0368) at lock.c:1100
#7  0x009511f1 in LockRelationOid (relid=16384, lockmode=5) at
lmgr.c:117
#8  0x0049e779 in relation_open (relationId=16384, lockmode=5)
at relation.c:56
#9  0x005652ff in table_open (relationId=16384, lockmode=5) at
table.c:43
#10 0x005c8b5a in reindex_relation (relid=16384, flags=1,
params=0x7ffed5fe05f0) at index.c:3990
#11 0x006d2c85 in ExecuteTruncateGuts
(explicit_rels=0x3068aa8, relids=0x3068b00, relids_extra=0x3068b58,
relids_logged=0x3068bb0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2036
#12 0x008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
#14 0x008ec424 in LogicalRepApplyLoop (last_received=24192624)
at worker.c:2564
--
Process 11393
(gdb) bt
#0  0x7fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x7fa197917e44 in sleep () from /lib64/libc.so.6
#2  0x00950f84 in DeadLockReport () at deadlock.c:1151
#3  0x00955013 in WaitOnLock (locallock=0x2fd05d0,
owner=0x2fd9a48) at lock.c:1873
#4  0x00953c14 in LockAcquireExtended (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false, reportMemoryError=true,
locallockp=0x0) at lock.c:1100
#5  0x009531bc in LockAcquire (locktag=0x7ffed5fe01d0,
lockmode=5, sessionLock=false, dontWait=false) at lock.c:751
#6  0x00951d86 in XactLockTableWait (xid=597,
rel=0x7fa1986e9e08, ctid=0x7ffed5fe0284, oper=XLTW_Update) at
lmgr.c:674
#7  0x004f84d8 in heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, newtup=0x3067dc0, cid=0, crosscheck=0x0, wait=true,
tmfd=0x7ffed5fe03b0, lockmode=0x7ffed5fe03ac) at heapam.c:3549
#8  0x004fa5dd in simple_heap_update (relation=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at heapam.c:4222
#9  0x005c9932 in CatalogTupleUpdate (heapRel=0x7fa1986e9e08,
otid=0x3067dc4, tup=0x3067dc0) at indexing.c:312
#10 0x00af5774 in RelationSetNewRelfilenode
(relation=0x7fa1986fdc80, persistence=112 'p') at relcache.c:3707
#11 0x006d2b4d in ExecuteTruncateGuts
(explicit_rels=0x30688b8, relids=0x3068910, relids_extra=0x3068968,
relids_logged=0x30689c0, behavior=DROP_RESTRICT, restart_seqs=false)
at tablecmds.c:2012
#12 0x008ebc50 in apply_handle_truncate (s=0x7ffed5fe08d0) at
worker.c:2252
#13 0x008ebe94 in apply_dispatch (s=0x7ffed5fe08d0) at worker.c:2308
--

The essence of the trouble seems to be that the apply_handle_truncate
function never anticipated that it might end up truncating the same table
from two separate workers (subscriptions), which is what this test case
does. This is probably quite an old problem, because the
apply_handle_truncate code has not changed much for a long time. The
apply_handle_truncate function (worker.c) follows a very similar pattern
to the ExecuteTruncate function (tablecmds.c), but ExecuteTruncate uses
the stronger AccessExclusiveLock, whereas apply_handle_truncate was using
a weaker lock.

~~

PSA a patch to make apply_handle_truncate use AccessExclusiveLock, the
same as the ExecuteTruncate function does.
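To illustrate the idea only (a simplified sketch, not the attached patch;
the helper name, loop, and list handling here are made up), the change
amounts to opening each target relation with the stronger lock before
ExecuteTruncateGuts runs:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "replication/logicalrelation.h"
#include "storage/lockdefs.h"

/*
 * Sketch: open every relation named in a replicated TRUNCATE with
 * AccessExclusiveLock (the lock ExecuteTruncate takes for a local
 * TRUNCATE), so two apply workers targeting the same table serialize
 * instead of deadlocking inside ExecuteTruncateGuts.
 */
static void
open_truncate_targets(List *remote_relids, List **rels)
{
	ListCell   *lc;

	foreach(lc, remote_relids)
	{
		LogicalRepRelMapEntry *rel;

		/* Previously a weaker lock was taken here. */
		rel = logicalrep_rel_open(lfirst_oid(lc), AccessExclusiveLock);
		*rels = lappend(*rels, rel->localrel);
	}
}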

PSA a patch adding a test for this scenario.


Kind Regards,
Peter Smith.
Fujitsu Australia


v2_test_for_deadlock.patch
Description: Binary data


v2-0001-Fix-deadlock-for-multiple-replicating-truncates-o.patch
Description: Binary data


Re: wal stats questions

2021-05-17 Thread torikoshia

On 2021-05-13 09:05, Masahiro Ikeda wrote:

On 2021/05/12 19:19, Fujii Masao wrote:



On 2021/05/11 18:46, Masahiro Ikeda wrote:



On 2021/05/11 16:44, Fujii Masao wrote:



On 2021/04/28 9:10, Masahiro Ikeda wrote:



On 2021/04/27 21:56, Fujii Masao wrote:



On 2021/04/26 10:11, Masahiro Ikeda wrote:


First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")





+    pgWalUsage.wal_records == prevWalUsage.wal_records &&
+    walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of 
walStats.wal_write?


Thanks! Yes, I'll fix it.


Thanks!


Thanks for your comments!
I fixed them.


Thanks for updating the patch!

 if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+    pgWalUsage.wal_records == prevWalUsage.wal_records &&
+    WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&

I'm just wondering if the above WAL activity counters need to be checked.
Maybe it's not necessary because "pgStatXactCommit == 0 &&
pgStatXactRollback == 0" is checked? IOW, is there really a case where WAL
activity counters are updated even when both pgStatXactCommit and
pgStatXactRollback are zero?


Thanks for checking.

I haven't found such a case yet. But I added the condition because there
was a discussion suggesting that it's safer.

(It seems the mail thread chain is broken, sorry...)
I copy the discussion from that time below.

https://www.postgresql.org/message-id/20210330.172843.267174731834281075.horikyota.ntt%40gmail.com
3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
   just to figure out if there's been any changes isn't all that
   cheap. This is regularly exercised in read-only workloads too. Seems
   adding a boolean WalStats.have_pending = true or such would be
   better.
4) For plain backends pgstat_report_wal() is called by
   pgstat_report_stat() - but it is not checked as part of the "Don't
   expend a clock check if nothing to do" check at the top.  It's
   probably rare to have pending wal stats without also passing one of
   the other conditions, but ...


I added logic to pgstat_report_stat() to check whether the stats counters
have been updated, using not only the generated WAL records but also the
write/sync counters, so that it can skip calling the reporting function.


I removed the code checking whether the WAL stats counters were updated in
pgstat_report_stat(), since I couldn't find a valid case for it yet.
pgstat_report_stat() is called by backends when the transaction is
finished, which means that the condition seems to always pass.


Doesn't the same hold for all other counters?  If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain it.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than adding such an assertion.




+    if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+    {
+    WalUsage    walusage;
+
+    /*
+     * Calculate how much WAL usage counters were increased by substracting
+     * the previous counters from the current ones. Fill the results in
+     * WAL stats message.
+     */
+    MemSet(&walusage, 0, sizeof(WalUsage));
+    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

Isn't it better to move the code "prevWalUsage = pgWalUsage" into here?
Because it's necessary only when pgWalUsage.wal_records !=
prevWalUsage.wal_records.
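That is, the reporting path would roughly end up like this (a sketch of
the suggestion only, not the attached patch):

if (pgWalUsage.wal_records != prevWalUsage.wal_records)
{
	WalUsage	walusage;

	/* Calculate the delta since the last report. */
	MemSet(&walusage, 0, sizeof(WalUsage));
	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

	WalStats.m_wal_records = walusage.wal_records;
	WalStats.m_wal_fpi = walusage.wal_fpi;
	WalStats.m_wal_bytes = walusage.wal_bytes;

	/* Save the current counters only when something was reported. */
	prevWalUsage = pgWalUsage;
}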


Yes, I fixed it.


Regards,


Thanks for updating the patch!

+	 * is executed, wal records aren't nomally generated (although HOT makes

nomally -> normally?


+* It's not enough to check the number of generated wal records, for
+* example the walwriter may write/sync the WAL although it doesn't


You use both 'wal' and 'WAL' in the comments, but is there any intention
behind that?


Regards,




Re: PG 14 release notes, first draft

2021-05-17 Thread Fujii Masao




On 2021/05/10 15:03, Bruce Momjian wrote:

I have committed the first draft of the PG 14 release notes.  You can
see the most current  build of them here:

https://momjian.us/pgsql_docs/release-14.html


Thanks a lot!!



I need clarification on many items, and the document still needs its
items properly ordered, and markup added.  I also expect a lot of
feedback.




Add function pg_backend_memory_contexts() to output the memory contexts of 
arbitrary backends (Atsushi Torikoshi)


Could you fix the function name from pg_backend_memory_contexts() to
pg_log_backend_memory_contexts()? That is, "log_" is necessary just after
"pg_" in the function name.




Add lock wait time to pg_locks (Atsushi Torikoshi)


Could you change "lock wait time" to "lock wait start time"? Because
the pg_locks.waitstart column that was added in v14 reports the time
when the server process *started* waiting for the lock.


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [Patch] ALTER SYSTEM READ ONLY

2021-05-17 Thread Dilip Kumar
On Mon, May 17, 2021 at 11:48 AM Amul Sul  wrote:
>
> On Sat, May 15, 2021 at 3:12 PM Dilip Kumar  wrote:
> >
> > On Thu, May 13, 2021 at 2:56 PM Dilip Kumar  wrote:
> > >
> > > Great thanks.  I will review the remaining patch soon.
> >
> > I have reviewed v28-0003, and I have some comments on this.
> >
> > ===
> > @@ -126,9 +127,14 @@ XLogBeginInsert(void)
> >  Assert(mainrdata_last == (XLogRecData *) &mainrdata_head);
> >  Assert(mainrdata_len == 0);
> >
> > +/*
> > + * WAL permission must have checked before entering the critical 
> > section.
> > + * Otherwise, WAL prohibited error will force system panic.
> > + */
> > +Assert(walpermit_checked_state != WALPERMIT_UNCHECKED ||
> > !CritSectionCount);
> > +
> >  /* cross-check on whether we should be here or not */
> > -if (!XLogInsertAllowed())
> > -elog(ERROR, "cannot make new WAL entries during recovery");
> > +CheckWALPermitted();
> >
> > We must not call CheckWALPermitted inside the critical section,
> > instead if we are here we must be sure that
> > WAL is permitted, so better put an assert.  Even if that is ensured by
> > some other mean then also I don't
> > see any reason for calling this error generating function.
> >
>
> I understand that we should not have an error inside a critical section but
> this check is not wrong. Patch has enough checking so that errors due to WAL
> prohibited state must not hit in the critical section, see assert just before
> CheckWALPermitted().  Before entering into the critical section, we do have an
> explicit WAL prohibited check. And to make sure that check has been done for
> all current critical section for the wal writes, we have aforesaid assert
> checking, for more detail on this please have a look at the "WAL prohibited
> system state" section of src/backend/access/transam/README added in 0004 
> patch.
> This assertion also ensures that future development does not miss the WAL
> prohibited state check before entering into a newly added critical section for
> WAL writes.

I think we need the CheckWALPermitted() check in the XLogBeginInsert()
function, because XLogBeginInsert() may be called outside a critical
section, e.g. from pg_truncate_visibility_map(), and in that case we
should error out. So this check makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: wal stats questions

2021-05-17 Thread Masahiro Ikeda

Thanks for your comments!

>> + * is executed, wal records aren't nomally generated (although HOT makes
> 
> nomally -> normally?

Yes, fixed.

>> + * It's not enough to check the number of generated wal records, for
>> + * example the walwriter may write/sync the WAL although it doesn't
> 
> You use both 'wal' and 'WAL' in the comments, but is there any intention
> behind that?

No, I changed the comments to use 'WAL' only. Although some other comments
have both 'wal' and 'WAL', 'WAL' seems to be used more often.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2b106d8473..355778028e 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 249b17c92b..734a7cf802 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -852,9 +852,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the WAL record is generated or not. If a read transaction
+	 * is executed, WAL records aren't normally generated (although HOT makes
+	 * WAL records). But, just writes and syncs the WAL data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -948,7 +958,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2918,7 +2928,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3030,44 +3040,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* --
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * --
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* --
  * pgstat_send_wal() -
  *
@@ -3075,24 +3047,38 @@ pgstat_report_wal(void

Re: Different compression methods for FPI

2021-05-17 Thread Michael Paquier
On Sun, Mar 21, 2021 at 02:30:04PM -0500, Justin Pryzby wrote:
> rebased to keep cfbot happy.  This will run with default=zlib.

I have been looking a bit at this patch set, and I think that we
should do something here for 15~.

First, I think that we should make more tests and pick up one
compression method that could be used as an alternative to pglz, not
three among zlib, LZ4 or zstd.  Looking at the past archives, we could
do more tests like this one: 
https://www.postgresql.org/message-id/cab7npqsc97o-ue5paxfmukwcxe_jioyxo1m4a0pmnmyqane...@mail.gmail.com

The I/O should not be the bottleneck here, so on top of disabling
fsync we could put the whole data folder on a ramdisk and compare the
whole.  The patch set does not apply, by the way, so it needs a
rebase.

Based on my past experiences, I'd see LZ4 as a good choice in terms of
performance, speed and compression ratio, and on top of that we now
have the possibility to build PG with --with-lz4 for TOAST
compression so ./configure is already tweaked for it.  For this patch,
this is going to require a bit more in terms of library linking as the
block decompression is done in xlogreader.c, so that's one thing to
worry about.

 #define BKPIMAGE_IS_COMPRESSED    0x02    /* page image is compressed */
 #define BKPIMAGE_APPLY            0x04    /* page image should be restored
                                            * during replay */
+#define BKPIMAGE_COMPRESS_METHOD1 0x08    /* bits to encode compression method */
+#define BKPIMAGE_COMPRESS_METHOD2 0x10    /* 0=pglz; 1=zlib; */

The interface used in xlogrecord.h is confusing to me, with
BKPIMAGE_IS_COMPRESSED followed by extra bits set for the compression
method.  Wouldn't it be cleaner to have a set of BKPIMAGE_COMPRESS_XXX
(XXX={lz4,zlib,etc.})?  There is not even a need to steal one bit for
some kind of BKPIMAGE_COMPRESS_NONE, as we know that the page is
compressed if it has a method assigned, so with pglz, the default, and
one extra method we need only two bits here.
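Something along these lines, perhaps (a sketch of the suggestion only; the
flag values reuse the two spare bits from the quoted hunk, and the names
are illustrative):

/*
 * Two bits of bimg_info name the compression method directly,
 * replacing the separate "is compressed" flag.
 */
#define BKPIMAGE_COMPRESS_PGLZ  0x08    /* page image is pglz-compressed */
#define BKPIMAGE_COMPRESS_ZLIB  0x10    /* page image is zlib-compressed */

/* A block image is compressed iff any method bit is set. */
#define BKPIMAGE_IS_COMPRESSED(info) \
	(((info) & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_ZLIB)) != 0)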

+   {
+   {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
+   gettext_noop("Set the method used to compress full page
images in the WAL."),
+   NULL
+   },
+   &wal_compression_method,
+   WAL_COMPRESSION_PGLZ, wal_compression_options,
+   NULL, NULL, NULL
+   },
The interface is not quite right to me.  I think that we should just
change wal_compression to become an enum, with extra values for pglz
and the new method.  "on" would be a synonym for "pglz".

+/* This is a mapping indexed by wal_compression */
+// XXX: maybe this is better done as a GUC hook to assign the 1) method; and 2) level
+struct walcompression walmethods[] = {
+   {"pglz",WAL_COMPRESSION_PGLZ},
+   {"zlib",WAL_COMPRESSION_ZLIB},
+};
I don't think you need a hook here, but zlib, or any other method that
is not supported by the build, had better not be listed at all.  This
ensures that the value cannot be assigned if the binaries don't support
it.
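For instance, if wal_compression itself became the enum, the option list
could be built conditionally, roughly like this (a sketch; the two
WAL_COMPRESSION_* values come from the quoted patch, while
WAL_COMPRESSION_NONE and the exact entries are made up here):

static const struct config_enum_entry wal_compression_options[] = {
	{"pglz", WAL_COMPRESSION_PGLZ, false},
	{"on", WAL_COMPRESSION_PGLZ, true},	/* hidden synonym */
#ifdef HAVE_LIBZ
	{"zlib", WAL_COMPRESSION_ZLIB, false},
#endif
	{"off", WAL_COMPRESSION_NONE, false},
	{NULL, 0, false}
};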

+   {"wal_compression_method", PGC_SIGHUP, WAL_SETTINGS,
+   gettext_noop("Set the method used to compress full page
images in the WAL."),
+   NULL
+   },
+   &wal_compression_method,
+   WAL_COMPRESSION_PGLZ, wal_compression_options,
+   NULL, NULL, NULL
Any reason to not make that user-settable?  If you merge that with
wal_compression, that's not an issue.

The patch set is a gathering of various things, and not only things
associated to the compression method used for FPIs.

 my $node = get_new_node('primary');
-$node->init(allows_streaming => 1);
+$node->init();
 $node->start;
What is the point of that in patch 0002?

> Subject: [PATCH 03/12] Make sure published XIDs are persistent

Patch 0003 looks unrelated to this thread.

> Subject: [PATCH 04/12] wal_compression_method: default to zlib..

Patch 0004 could happen, however no reason is given why this change is
made.  Changing the default is not going to happen in the release where
this feature is added, anyway.

+   default:
+   report_invalid_record(record, "image at %X/%X is compressed with unsupported codec, block %d (%d/%s)",
+ (uint32) (record->ReadRecPtr >> 32),
+ (uint32) record->ReadRecPtr,
+ block_id,
+ compression_method,
+ wal_compression_name(compression_method));
+   return false;
In xlogreader.c, the error message is helpful this way.  However, we
would not know which compression method failed if there is a
decompression failure for a method supported by the build restoring
this block.  That would be good to add.

I think that what we actually need for this thread are patches 0001,
0005 and 0006 merged together to study first the performance we have
with each one of the compression methods proposed, and then let's just
pick one.  Reading around, zstd and zlib compresse more b

Winflex docs and distro

2021-05-17 Thread Magnus Hagander
Since PostgreSQL 9.3, in commit a266f7dd93b, we've added the text:

+   The obsolete "winflex" binaries distributed on the PostgreSQL FTP site
+   and referenced in older documentation will fail with "flex: fatal
+   internal error, exec failed" on 64-bit Windows hosts. Use flex from
+   msys instead.

At this point, I suggest we simply stop distributing winflex on our
download site, and just remove this note from the documentation. (This
is just a note; the general documentation still says, separately, to get
flex from msys.)

Surely a binary that doesn't work on a 64-bit system is not of help to
anybody these days. And "older documentation" now refers to 9.2, which
went EOL in 2017.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Addition of authenticated ID to pg_stat_activity

2021-05-17 Thread Magnus Hagander
On Mon, May 17, 2021 at 6:35 AM Michael Paquier  wrote:
>
> On Thu, Apr 29, 2021 at 04:56:42PM +0200, Magnus Hagander wrote:
> > I definitely use it all the time to monitor autovacuum all the time.
> > The others as well regularly, but autovacuum continuously. I also see
> > a lot of people doing things like "from pg_stat_activity where query
> > like '%mytablename%'" where they'd want both any regular queries and
> > any autovacuums currently processing the table.
>
> When it comes to development work, I also look at things different
> than backend connections, checkpointer and WAL writer included.

While I think we should optimize these primarily for users and not
developers, I definitely do those things as well. In particular wait
events for the background processes.


> > I'd say client address is also pretty common to identify which set of
> > app servers connections are coming in from -- but client port and
> > client hostname are a lot less interesting. But it'd be kind of weird
> > to split those out.
>
> Yes, I agree that it would be confusing to split the client_* fields
> across multiple views.
>
> > For *interactive use* I'd find pretty much all other columns
> > interesting and commonly used. Probably not that interested in the
> > oids of the database and user, but again they are the cheap ones. We
> > could get rid of the joints if we only showed the oids, but in
> > interactive use it's really the names that are interesting. But if
> > we're just trying to save column count, I'd say get rid of datid and
> > usesysid.
> >
> > I'd hold everything else as interesting.
>
> Yes, you have an argument here about the removal of usesysid and
> datid.  Now I find joins involving OIDs to be much more natural than
> the object names, because that's the base of what we use in the
> catalogs.

Agreed. And I'm not sure the actual gain is that big if we can just
remove oid columns...


> Not sure if we would be able to agree on something here, but the
> barrier to what a session and a connection hold is thin when it comes
> to roles and application_name.  Thinking more about that, I would be
> really tempted to get to do a more straight split with data that's
> associated to a session, to a transaction and to a connection, say:
> 1) pg_stat_session, data that may change.
> - PID
> - leader PID
> - the role name
> - role ID
> - application_name
> - wait_event_type
> - wait_event
> 2) pg_stat_connection, static data associated to a connection.
> - PID
> - database name
> - database OID
> - client_addr
> - client_hostname
> - client_port
> - backend_start
> - authn ID
> - backend_type
> 3) pg_stat_transaction, or pg_stat_activity, for the transactional
> activity.
> - PID
> - xact_start
> - query_start
> - backend_xid
> - state_change
> - query string
> - query ID
> - state

This seems extremely user-unfriendly to me.

I mean, timestamps are now split out between different views, so you
can't track the process without joining them. And surely wait_event info is
*extremely* related to things like what query is running and what
state the session is in. And putting backend_type off in a separate
view means most queries are going to have to join that in anyway. Or
include it in all views. And if we're forcing the majority of queries
to join multiple views, what have we actually gained?

Based on your list above, I'd definitely want at least (1) and (2) to
be in the same one, but they'd have to also gain at least the database
oid/name and backend_type, and maybe also backend_start.

So basically, it would be moving out client_*, and authn_id. If we're
doing that then as you say maybe pg_stat_connection is a good name and
could then *also* gain the information that's currently in the ssl and
gss views for a net simplification.

tl;dr: I think we have to be really careful here or the cure is going
to be way worse than the disease.

> Or I could just drop a new function that fetches the authn ID for a
> given PID, even if this makes things potentially less consistent when
> it comes to the lookup of PgBackendStatus, guarantee given now by
> pg_stat_get_activity().

Well, the authnid will never change so I'm not sure the consistency
part is a big problem? Or maybe I'm misunderstanding the risk you're
referring to?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: FDW and connections

2021-05-17 Thread Phil Godfrin
Apologies, in my mind this was an internals question about the postgres_fdw
code, which is why I came here. I checked that part of the docs and nowhere
does it say anything about defaulting to 5432. However, in the referenced
section, 34.1.2, it says that libpq defaults to the "port number
established when PostgreSQL was built". I'm not well informed about the
internals of libpq nor about the mailing lists; again, I'm sorry. It seems
to me I need to learn more about both. Thanks.


pg

On 5/16/2021 7:43 PM, tsunakawa.ta...@fujitsu.com wrote:


From: Phil Godfrin 

My question is - how does the call to GetConnection() know what port 
to use? Lets say we're using PGBouncer to connect on the local server 
at port 6432, but there is no pgbouncer listening at the foreign 
server, what port gets passed? My first thought is whatever the client 
connects port is, but I believe pgbouncer ultimately hands of the 
connection to whatever port you have defined for the local database...


This gets important when one has an HAProxy instance between the local 
and foreign servers which is interrogating the port number to decide 
which ip:port to send the request to, ultimately the master or 
replicant at the foreign remoter server.


So how does the port number get propagated from local to foreign server???

--

postgres_fdw uses libpq as a client to connect to the foreign server.
So, as the following says, you can specify libpq's "port" parameter in
CREATE SERVER.  If it's omitted, as in your case, the default 5432 will
be used.


F.35.1.1. Connection Options

https://www.postgresql.org/docs/devel/postgres-fdw.html

"A foreign server using the postgres_fdw foreign data wrapper can have 
the same options that libpq accepts in connection strings, as 
described in Section 34.1.2, except that these options are not allowed 
or have special handling:"


I'm afraid it's better to post user-level questions like this to 
pgsql-gene...@lists.postgresql.org.


Regards

Takayuki Tsunakawa



Re: Asynchronous Append on postgres_fdw nodes.

2021-05-17 Thread Etsuro Fujita
On Sun, May 16, 2021 at 11:39 PM Etsuro Fujita  wrote:
> Attached is a new version.

I have committed the patch.

Best regards,
Etsuro Fujita




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Amit Kapila
On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
>
> While doing logical replication testing we encountered a problem which
> causes a deadlock error to be logged when replicating a TRUNCATE
> simultaneously to 2 Subscriptions.
> e.g.
> --
> 2021-05-12 11:30:19.457 AEST [11393] ERROR:  deadlock detected
> 2021-05-12 11:30:19.457 AEST [11393] DETAIL:  Process 11393 waits for
> ShareLock on transaction 597; blocked by process 11582.
> Process 11582 waits for ShareLock on relation 16384 of database 14896;
> blocked by process 11393.
> --
>
> At this time, both the subscriber (apply worker) processes are
> executing inside the ExecuteTruncateGuts function simultaneously and
> they become co-dependent.
>
> e.g.
> --
> Process 11582
> (gdb) bt
> #0  0x7fa1979515e3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x0093e5d0 in WaitEventSetWaitBlock (set=0x2facac8,
> cur_timeout=-1, occurred_events=0x7ffed5fdff00, nevents=1) at
> latch.c:1450
> #2  0x0093e468 in WaitEventSetWait (set=0x2facac8, timeout=-1,
> occurred_events=0x7ffed5fdff00, nevents=1, wait_event_info=50331648)
> at latch.c:1396
> #3  0x0093d8cd in WaitLatch (latch=0x7fa191042654,
> wakeEvents=33, timeout=0, wait_event_info=50331648) at latch.c:473
> #4  0x009660f0 in ProcSleep (locallock=0x2fd06d8,
> lockMethodTable=0xcd90a0 ) at proc.c:1361
..
> --
> Process 11393
> (gdb) bt
> #0  0x7fa197917f90 in __nanosleep_nocancel () from /lib64/libc.so.6
> #1  0x7fa197917e44 in sleep () from /lib64/libc.so.6
> #2  0x00950f84 in DeadLockReport () at deadlock.c:1151
> #3  0x00955013 in WaitOnLock (locallock=0x2fd05d0,
> owner=0x2fd9a48) at lock.c:1873
>
..
> --
>
> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time.
>

Yeah, have you checked it in the back branches?

I am also able to reproduce this and have analyzed the cause of the above
error. In the above, process 11393 waits while updating a pg_class tuple
via RelationSetNewRelfilenode(), a tuple which has already been updated by
process 11582 (in transaction id 597, which is not yet committed).
Meanwhile, process 11582 waits to acquire ShareLock on relation 16384,
which is held in RowExclusive mode by process 11393 in
apply_handle_truncate. So both processes end up waiting on each other,
which causes a deadlock.

>
> PSA a patch adding a test for this scenario.
>

+
+$node_publisher->safe_psql('postgres',
+ "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1, sub5_2)'");
+$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");

Do you really need these steps to reproduce the problem? IIUC, this
has nothing to do with synchronous replication.

-- 
With Regards,
Amit Kapila.




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-05-17 Thread Magnus Hagander
On Fri, Apr 23, 2021 at 12:40 PM Fujii Masao
 wrote:
>
>
>
> On 2021/04/23 19:11, Magnus Hagander wrote:
> > On Fri, Apr 23, 2021 at 12:04 PM Fujii Masao
> >  wrote:
> >>
> >>
> >>
> >> On 2021/04/23 18:46, Magnus Hagander wrote:
> >>> On Fri, Apr 23, 2021 at 9:10 AM Fujii Masao  
> >>> wrote:
> 
> 
> 
>  On 2021/04/22 18:23, Julien Rouhaud wrote:
> > On Thu, Apr 22, 2021 at 12:28:11AM +0900, Fujii Masao wrote:
> >>
> >> I found another small issue in pg_stat_statements docs. The following
> >> description in the docs should be updated so that toplevel is included?
> >>
> >>> This view contains one row for each distinct database ID, user ID and 
> >>> query ID
> >
> > Indeed!  I'm adding Magnus in Cc.
> >
> > PFA a patch to fix at, and also mention that toplevel will only
> > contain True values if pg_stat_statements.track is set to top.
> 
>  Thanks for the patch! LGTM.
> >>>
> >>> Agreed, in general. But going by the example a few lines down, I
> >>> changed the second part to:
> >>>   True if the query was executed as a top level statement
> >>> +   (if pg_stat_statements.track is set to
> >>> +   all, otherwise always false)
> >>
> >> Isn't this confusing? Users may mistakenly read this as that the toplevel
> >> column always indicates false if pg_stat_statements.track is not "all".
> >
> > Hmm. I think you're right. It should say "always true", shouldn't it?
>
> You're thinking something like the following?
>
>  True if the query was executed as a top level statement
>  (if pg_stat_statements.track is set to
>  top, always true)
>
> > So not just confusing, but completely wrong? :)
>
> Yeah :)

Ugh. I completely lost track of this email.

I've applied the change suggested above with another slight reordering
of the words:

+   (always true if pg_stat_statements.track is set to
+   top)


> I'm fine with the original wording by Julien.
> Of course, the parameter name should be corrected as you did, though.
>
> Or what about the following?
>
>  True if the query was executed as a top level statement
>  (this can be false only if
>  pg_stat_statements.track is set to
>  all and nested statements are also tracked)

I found my suggestion, once the final reordering of words was done,
easier to parse.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Schema variables - new implementation for Postgres 15

2021-05-17 Thread Pavel Stehule
st 12. 5. 2021 v 7:37 odesílatel Pavel Stehule 
napsal:

>
>
> st 12. 5. 2021 v 6:17 odesílatel Pavel Stehule 
> napsal:
>
>> Hi
>>
>> only rebase
>>
>
> second try - rebase after serial_scheduler remove
>

only rebase

Regards

Pavel


> Regards
>
> Pavel
>
>
>> Regards
>>
>> Pavel
>>
>


schema-variables-20210517.patch.gz
Description: application/gzip


Re: logical replication worker accesses catalogs in error context callback

2021-05-17 Thread Bharath Rupireddy
On Fri, Apr 16, 2021 at 3:23 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:
>
> On Tue, Mar 16, 2021 at 2:21 AM Tom Lane  wrote:
> >
> > Bharath Rupireddy  writes:
> > > Thanks for pointing to the changes in the commit message. I corrected
> > > them. Attaching v4 patch set, consider it for further review.
> >
> > I took a quick look at this.  I'm quite worried about the potential
> > performance cost of the v4-0001 patch (the one for fixing
> > slot_store_error_callback).  Previously, we didn't pay any noticeable
> > cost for having the callback unless there actually was an error.
> > As patched, we perform several catalog lookups per column per row,
> > even in the non-error code path.  That seems like it'd be a noticeable
> > performance hit.  Just to add insult to injury, it leaks memory.
> >
> > I propose a more radical but simpler solution: let's just not bother
> > with including the type names in the context message.  How much are
> > they really buying?
>
> << Attaching v5-0001 here again for completion >>
> I'm attaching the v5-0001 patch that avoids printing the column type names
> in the context message, so that no cache lookups have to be done in the
> error context callback. I think the column name is enough to know on
> which column the error occurred, and if required its type can be looked
> up by the user. This patch gets rid of printing the local and remote type
> names in slot_store_error_callback and also removes
> logicalrep_typmap_gettypname because it becomes unnecessary. Thoughts?
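With the type names gone, the callback might reduce to something like this
(a sketch only, not necessarily the exact patch; it assumes the existing
SlotErrCallbackArg fields for the relation map entry and remote attnum):

static void
slot_store_error_callback(void *arg)
{
	SlotErrCallbackArg *errarg = (SlotErrCallbackArg *) arg;
	LogicalRepRelMapEntry *rel;

	/* Nothing to do if the remote attribute number is not set. */
	if (errarg->remote_attnum < 0)
		return;

	rel = errarg->rel;

	/* No catalog access needed: all names are already in the map entry. */
	errcontext("processing remote data for replication target relation \"%s.%s\" column \"%s\"",
			   rel->remoterel.nspname, rel->remoterel.relname,
			   rel->remoterel.attnames[errarg->remote_attnum]);
}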
>
> > v4-0002 (for postgres_fdw's conversion_error_callback) has the same
> > problems, although mitigated a bit by not needing to do any catalog
> > lookups in the non-join case.  For the join case, I wonder whether
> > we could push the lookups back to plan setup time, so they don't
> > need to be done over again for each row.  (Not doing lookups at all
> > doesn't seem attractive there; it'd render the context message nearly
> > useless.)  A different idea is to try to get the column names out
> > of the query's rangetable instead of going to the catalogs.
>
> I'm attaching v5-0002, which stores the required attribute information
> for foreign joins in postgresBeginForeignScan, which is a one-time job
> as opposed to the per-row system catalog lookup v4-0001 was doing. I'm
> storing the foreign table relid (as the key), relname and the retrieved
> attributes' attnum and attname in a hash table. Whenever a
> conversion error occurs, the hash table is looked up using the relid to
> fetch the relname and attname. Thoughts?
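Conceptually, the cache is just a small hash keyed by the foreign table's
OID, roughly like this (a sketch with made-up names, not the attached
patch):

/* One entry per foreign relation seen by postgresBeginForeignScan. */
typedef struct ConversionErrorCacheEntry
{
	Oid		relid;		/* hash key: foreign table OID */
	char	   *relname;
	List	   *attnames;	/* attribute names, in attnum order */
} ConversionErrorCacheEntry;

static HTAB *conversion_error_cache = NULL;

/* Used by the error context callback instead of per-row catalog lookups. */
static const char *
lookup_error_attname(Oid relid, int attnum)
{
	bool		found;
	ConversionErrorCacheEntry *entry;

	entry = hash_search(conversion_error_cache, &relid, HASH_FIND, &found);
	if (!found)
		return NULL;	/* fall back to a message without names */
	return (const char *) list_nth(entry->attnames, attnum - 1);
}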

Attaching rebased v6 patches.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From a65e13d591bb151eb71eb8ec42834ce28a3db304 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 14 Apr 2021 13:25:16 +0530
Subject: [PATCH v6] Avoid Catalogue Accesses In slot_store_error_callback

Avoid accessing system catalogues inside slot_store_error_callback
error context callback, because the entire transaction might
have been broken at this point. logicalrep_typmap_gettypname()
and format_type_be could search the sys cache.

As per suggestion from Tom, a simple solution is to just avoid
printing the column type names in the message, just the column
name is enough to know on which column the error occurred.

The above solution makes logicalrep_typmap_gettypname redundant
hence removed it.
---
 src/backend/replication/logical/relation.c | 49 --
 src/backend/replication/logical/worker.c   | 25 +--
 src/include/replication/logicalrelation.h  |  2 -
 3 files changed, 2 insertions(+), 74 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index e861c0ff80..cbc5612dbe 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -490,55 +490,6 @@ logicalrep_typmap_update(LogicalRepTyp *remotetyp)
 	MemoryContextSwitchTo(oldctx);
 }
 
-/*
- * Fetch type name from the cache by remote type OID.
- *
- * Return a substitute value if we cannot find the data type; no message is
- * sent to the log in that case, because this is used by error callback
- * already.
- */
-char *
-logicalrep_typmap_gettypname(Oid remoteid)
-{
-	LogicalRepTyp *entry;
-	bool		found;
-
-	/* Internal types are mapped directly. */
-	if (remoteid < FirstGenbkiObjectId)
-	{
-		if (!get_typisdefined(remoteid))
-		{
-			/*
-			 * This can be caused by having a publisher with a higher
-			 * PostgreSQL major version than the subscriber.
-			 */
-			return psprintf("unrecognized %u", remoteid);
-		}
-
-		return format_type_be(remoteid);
-	}
-
-	if (LogicalRepTypMap == NULL)
-	{
-		/*
-		 * If the typemap is not initialized yet, we cannot possibly attempt
-		 * to search the hash table; but there's no way we know the type
-		 * locally yet, since we haven't received a message about this type,
-		 * so this is the best we can do.
-		 */
-		return psprintf("unrecognized %u", remoteid);
-	}

Re: Winflex docs and distro

2021-05-17 Thread Daniel Gustafsson
> On 17 May 2021, at 10:17, Magnus Hagander  wrote:

> Since PostgreSQL 9.3, in commit a266f7dd93b, we've added the text:
> 
> +   The obsolete "winflex" binaries distributed on the PostgreSQL FTP site

Which was slightly updated in 0a9ae44288d.

> At this point. I suggest we simply stop distributing winflex on our
> download site, and just remove this note from the documentation.

Sounds reasonable, are there (easily retrieved) logs/tracking for when it was
accessed by anyone last?

--
Daniel Gustafsson   https://vmware.com/





Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-17 Thread Amit Kapila
On Fri, May 14, 2021 at 12:44 PM Amit Langote  wrote:
>
> On Thu, May 13, 2021 at 7:43 PM Amit Kapila  wrote:
>
> > Also, don't we need to free the
> > entire map as suggested by me?
>
> Yes, I had missed that too.  Addressed in the updated patch.
>

+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }

I think the patch frees these descriptors when the map is NULL but not
otherwise, and free_conversion_map won't free these descriptors either.
BTW, have you tried this patch in the back branches? I think we
should backpatch this fix.

-- 
With Regards,
Amit Kapila.




RE: making update/delete of inheritance trees scale better

2021-05-17 Thread houzj.f...@fujitsu.com
> On Mon, May 17, 2021 at 3:07 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > Hi
> >
> > After 86dc900, in "src/include/nodes/pathnodes.h", I noticed that it
> > uses the term "partitioned UPDATE" in the comment above struct
> > RowIdentityVarInfo.
> >
> > But it seems "inherited UPDATE" is used in the rest of the places.
> > Is it better to keep them consistent by using "inherited UPDATE"?
> 
> Yeah, I would not be opposed to fixing that.  Like this maybe (patch 
> attached)?

> - * In partitioned UPDATE/DELETE it's important for child partitions to share
> + * In an inherited UPDATE/DELETE it's important for child tables to share

Thanks for the change, it looks good to me.

Best regards,
houzj


Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Dilip Kumar
On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:

> The essence of the trouble seems to be that the apply_handle_truncate
> function never anticipated it may end up truncating the same table
> from 2 separate workers (subscriptions) like this test case is doing.
> Probably this is quite an old problem because the
> apply_handle_truncate code has not changed much for a long time. The
> code of apply_handle_truncate function (worker.c) has a very similar
> pattern to the ExecuteTruncate function (tablecmds.c) but the
> ExecuteTruncate is using a more powerful AcccessExclusiveLock than the
> apply_handle_truncate was using.

Right, that's a problem.

>
> PSA a patch to make the apply_handle_truncate use AccessExclusiveLock
> same as the ExecuteTruncate function does.

I think the fix makes sense to me.

> PSA a patch adding a test for this scenario.

I am not sure this test case is exactly targeting the problematic
behavior, because that will depend upon the order of execution of the
apply workers, right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread tanghy.f...@fujitsu.com
On Monday, May 17, 2021 5:47 PM, Amit Kapila  wrote
> +$node_publisher->safe_psql('postgres',
> + "ALTER SYSTEM SET synchronous_standby_names TO 'any 2(sub5_1,
> sub5_2)'");
> +$node_publisher->safe_psql('postgres', "SELECT pg_reload_conf()");
> 
> Do you really need these steps to reproduce the problem? IIUC, this
> has nothing to do with synchronous replication.

Agreed.
I tested in asynchronous mode, and could reproduce this problem, too.

The attached patch removes the steps for setting synchronous replication,
and the test passed after applying Peter's patch.
Please take it as a reference.

Regards
Tang



v3_test_for_deadlock.patch
Description: v3_test_for_deadlock.patch


Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-17 Thread Amit Kapila
On Fri, May 14, 2021 at 2:20 PM osumi.takami...@fujitsu.com
 wrote:
>
> On Thursday, May 13, 2021 7:21 PM Amit Kapila  wrote:
> > I don't think we can reproduce it with core plugins as they don't lock user
> > catalog tables.
> OK. My current understanding about how the deadlock happens is below.
>
> 1. TRUNCATE command is performed on user_catalog_table.
> 2. TRUNCATE command locks the table and index with ACCESS EXCLUSIVE LOCK.
> 3. TRUNCATE waits for the subscriber's synchronization
> when synchronous_standby_names is set.
> 4. Here, the walsender stops *if* it tries to acquire a lock on the
> user_catalog_table, because the table it wants to look at is already
> locked by the TRUNCATE.
>
> If this is right,
>

Yeah, the above steps are correct, so if we take a lock on
user_catalog_table when walsender is processing the WAL, it would lead
to a problem.

> we need to go back to a somewhat higher-level discussion, since whether
> we should hack any plugin to simulate the deadlock caused by a
> user_catalog_table reference with locking depends on the assumption of
> whether the plugin takes a lock on the user_catalog_table or not.
> In other words, if the plugin does read-only access to that type of table
> with no lock (by RelationIdGetRelation, for example?), the deadlock
> concern disappears and we don't need to add anything on the plugin side,
> IIUC.
>

True, if the plugin doesn't acquire any lock on a user_catalog_table,
then it is fine, but we don't prohibit plugins from acquiring locks on
user_catalog_tables. This is similar to system catalogs: the plugins
and decoding code do acquire locks on those.

> Here, we haven't gotten any response about whether an output plugin takes
> (or should take) a lock on the user_catalog_table. But I would like to
> reach a consensus about this point before the implementation.
>
> By the way, Amit-san already mentioned that the main reason for this
> is that we allow decoding of the TRUNCATE operation for user_catalog_tables
> in synchronous mode.
> The choices were already provided by Amit-san in a past email [1]:
> (1) disallow decoding of TRUNCATE operation for user_catalog_tables
> (2) disallow decoding of any operation for user_catalog_tables like system 
> catalog tables
>
> Yet, I'm not sure if either option solves the deadlock concern completely.
> If an application takes an ACCESS EXCLUSIVE lock with the LOCK command
> (not with TRUNCATE!) on the user_catalog_table in a transaction, and the
> plugin tries to take a lock on it, I think the deadlock happens. Of
> course, having a consensus that the plugin takes no lock at all would
> remove this concern, though.
>

This is true for system catalogs as well. See the similar report [1]

> Like this, I'd like to discuss those two items in question together at first.
> * the plugin should take a lock on user_catalog_table or not
> * the range of decoding related to user_catalog_table
>
> To me, taking no lock on the user_catalog_table from plugin is fine
>

We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to
acquire the lock on user_catalog_tables then we should either prohibit
decoding of such relations or do something else to avoid deadlock
hazards.

[1] - 
https://www.postgresql.org/message-id/CALDaNm1UB%3D%3DgL9Poad4ETjfcyGdJBphWEzEZocodnBd--kJpVw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Winflex docs and distro

2021-05-17 Thread Magnus Hagander
On Mon, May 17, 2021 at 11:11 AM Daniel Gustafsson  wrote:
>
> > On 17 May 2021, at 10:17, Magnus Hagander  wrote:
>
> > Since PostgreSQL 9.3, in commit a266f7dd93b, we've added the text:
> >
> > +   The obsolete "winflex" binaries distributed on the PostgreSQL FTP 
> > site
>
> Which was slightly updated in 0a9ae44288d.

It's been touched a couple of times, but not in any material fashion.


> > At this point. I suggest we simply stop distributing winflex on our
> > download site, and just remove this note from the documentation.
>
> Sounds reasonable, are there (easily retrieved) logs/tracking for when it was
> accessed by anyone last?

Not really. We don't keep logs going very far back. I can see it being
accessed a handful of times in the past 14 days. But AFAICT from the
limited information we have, it's all bots.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-17 Thread Amit Langote
On Mon, May 17, 2021 at 6:13 PM Amit Kapila  wrote:
> On Fri, May 14, 2021 at 12:44 PM Amit Langote  wrote:
> > On Thu, May 13, 2021 at 7:43 PM Amit Kapila  wrote:
> >
> > > Also, don't we need to free the
> > > entire map as suggested by me?
> >
> > Yes, I had missed that too.  Addressed in the updated patch.
> >
>
> + relentry->map = convert_tuples_by_name(indesc, outdesc);
> + if (relentry->map == NULL)
> + {
> + /* Map not necessary, so free the TupleDescs too. */
> + FreeTupleDesc(indesc);
> + FreeTupleDesc(outdesc);
> + }
>
> I think the patch frees these descriptors when the map is NULL but not
> otherwise because free_conversion_map won't free these descriptors.

You're right.  I have fixed that by making the callback free the
TupleDescs explicitly.
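Roughly, the cleanup then looks like this (a simplified sketch; see the
attached patches for the exact code, and note that TupleConversionMap
keeps references to both descriptors while free_conversion_map() does not
free them):

if (entry->map != NULL)
{
	/* Free the descriptors first, since free_conversion_map() won't. */
	FreeTupleDesc(entry->map->indesc);
	FreeTupleDesc(entry->map->outdesc);
	free_conversion_map(entry->map);
}
entry->map = NULL;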

> BTW, have you tried this patch in back branches because I think we
> should backpatch this fix?

I have created a version of the patch for v13, the only older release
that has this code, and can see that tests, including the newly added
one, pass.

Both patches are attached.

--
Amit Langote
EDB: http://www.enterprisedb.com


PG13-pgoutput-tupeconv-map-leak_v3.patch
Description: Binary data


HEAD-pgoutput-tupeconv-map-leak_v3.patch
Description: Binary data


postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-17 Thread Bharath Rupireddy
Hi,

It looks like values such as '123.456', '789.123', '100$%$#$#', and
'9,223,372,' are accepted and treated as valid integers for the
postgres_fdw options batch_size and fetch_size, whereas this is not the
case for the fdw_startup_cost and fdw_tuple_cost options, for which an
error is thrown. Attaching a patch to fix that.
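One way to validate these options strictly (whether the attached patch does
exactly this is not shown here) is to run the string through parse_int()
from guc.h and reject anything that is not a plain positive integer, for
example (def being the DefElem under validation):

/* Sketch of a stricter check for the batch_size/fetch_size options. */
int			intval;
const char *value = defGetString(def);	/* e.g. "123.456" */

if (!parse_int(value, &intval, 0, NULL) || intval <= 0)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("%s requires an integer value greater than zero",
					def->defname)));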

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-Tighten-up-batch_size-fetch_size-options-against-.patch
Description: Binary data


Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Amit Kapila
On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
>
> On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
>
> > PSA a patch adding a test for this scenario.
>
> I am not sure this test case is exactly targeting the problematic
> behavior because that will depends upon the order of execution of the
> apply workers right?
>

Yeah, so we can't guarantee that this test will always reproduce the
problem but OTOH, I have tried two times and it reproduced both times.
I guess we don't have a similar test where Truncate will replicate to
two subscriptions, otherwise, we would have caught such a problem.
Having said that, I am fine with leaving this test if others feel so
on the grounds that it won't always lead to the problem reported.

-- 
With Regards,
Amit Kapila.




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Dilip Kumar
On Mon, May 17, 2021 at 3:43 PM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
> >
> > On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
> >
> > > PSA a patch adding a test for this scenario.
> >
> > I am not sure this test case is exactly targeting the problematic
> > behavior because that will depends upon the order of execution of the
> > apply workers right?
> >
>
> Yeah, so we can't guarantee that this test will always reproduce the
> problem but OTOH, I have tried two times and it reproduced both times.
> I guess we don't have a similar test where Truncate will replicate to
> two subscriptions, otherwise, we would have caught such a problem.
> Having said that, I am fine with leaving this test if others feel so
> on the grounds that it won't always lead to the problem reported.

Although it is not guaranteed to reproduce the scenario every time, it
is testing a new scenario, so +1 for the test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Inaccurate error message when set fdw batch_size to 0

2021-05-17 Thread Amit Kapila
On Tue, May 11, 2021 at 11:28 AM Bharath Rupireddy
 wrote:
>
> On Mon, May 10, 2021 at 7:39 PM Tom Lane  wrote:
> >
> > Bharath Rupireddy  writes:
> > > On Mon, May 10, 2021 at 12:00 PM Tom Lane  wrote:
> > >> Yeah, this error message seems outright buggy.  However, it's a minor
> > >> matter.  Also, some people think "positive" is the same thing as
> > >> "non-negative", so maybe we need less ambiguous wording?
> >
> > > Since value 0 can't be considered as either a positive or negative
> > > integer, I think we can do as following(roughly):
> >
> > > if (value < 0) "requires a zero or positive integer value"
> > > if (value <= 0) "requires a positive integer value"
> >
> > I was thinking of avoiding the passive voice and writing
> >
> > "foo must be greater than zero"
>
> +1 for "foo must be greater than zero" if (foo <= 0) kind of errors.
> But, we also have some values for which zero is accepted, see below
> error messages. How about the error message "foo must be greater than
> or equal to zero"?
>

+1 for your proposed message for the cases where we have a check of the
form if (foo < 0). Tom, Michael, do you see any problem with the proposed
message? We would like to make a similar change at another place [1], so
we wanted to be consistent.

[1] - 
https://www.postgresql.org/message-id/CALj2ACWGB9oHCR5ygkc8u6_QDqecObf9j2MxtOgsjZMMKsLj%3DQ%40mail.gmail.com
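Concretely, the two wordings under discussion would come out roughly like
this (illustrative only; the option names, error code, and surrounding
checks are made up for the example):

if (batch_size <= 0)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("%s must be greater than zero", "batch_size")));

if (some_option < 0)		/* hypothetical option where zero is allowed */
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("%s must be greater than or equal to zero",
					"some_option")));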

-- 
With Regards,
Amit Kapila.




Re: Multiple hosts in connection string failed to failover in non-hot standby mode

2021-05-17 Thread Daniel Gustafsson
> On 11 May 2021, at 09:29, Michael Paquier  wrote:

> FWIW, I think that the case of getting some information about any
> failed connections while a connection has been successfully made
> within the scope of the connection string parameters provided by the
> user is rather thin, and I really feel that this is going to cause
> more pain to users than this is worth it.  So my vote would be to
> clean up conn->errorMessage after a successful connection.

Agreed, given how conservative we typically are with backwards compatibility,
it seems too thin a benefit to warrant potential breakage.

My vote too would be to restore the behavior by clearing conn->errorMessage.

--
Daniel Gustafsson   https://vmware.com/



RE: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-17 Thread Pengchengliu
Hi Tom & Robert,
  Could you review this Assert(TransactionIdFollowsOrEquals(xid,
TransactionXmin)) in SubTransGetTopmostTransaction?
  I think this assert is unsuitable for parallel worker processes.

  We discussed it earlier in
https://www.postgresql-archive.org/Parallel-scan-with-SubTransGetTopmostTransaction-assert-coredump-td6197408.html
 
Thanks
Pengcheng

-Original Message-
From: Greg Nancarrow  
Sent: 2021年5月15日 0:44
To: Pengchengliu 
Cc: Andres Freund ; PostgreSQL-development 

Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

On Fri, May 14, 2021 at 8:36 PM Pengchengliu  wrote:
>   Did you use pgbench with the script sub_120.sql which I provide in 
> attachment?

yes

>
>   Did you increase the number PGPROC_MAX_CACHED_SUBXIDS? Please don't change 
> any codes, now we just use the origin codes in PG13.2.
>

No, I have made no source code changes at all.
That was my suggestion for you to try - because if the problem is avoided by
increasing PGPROC_MAX_CACHED_SUBXIDS (to say 128) then it probably indicates
the overflow condition is affecting the xmin/xmax of the two snapshots such
that it invalidates the condition that is asserted.


I think one problem is that in your settings, you haven't set 
"max_worker_processes", yet have set "max_parallel_workers = 128".
I'm finding no more than 8 parallel workers are actually active at any one time.
On top of this, you've got pgbench running with 200 concurrent clients.
So many queries are actually executing parallel plans without using parallel 
workers, as the workers can't actually be launched (and this is probably why 
I'm finding it hard to reproduce the issue, if the problem involves snapshot 
suboverflow and parallel workers).
I find that the following settings improve the parallelism per query and the 
whole test runs very much faster:

max_connections = 2000
parallel_setup_cost=0
parallel_tuple_cost=0
min_parallel_table_scan_size=0
max_parallel_workers_per_gather=4
max_parallel_workers = 100
max_worker_processes = 128

and adjust the pgbench command-line:   pgbench  -d postgres -p 33550
-n -r -f sub_120.sql   -c 25 -j 25 -T 1800

Problem is, I still get no coredump when using this.
Can you try these settings and let me know whether the crash still happens?

I also tried:

max_connections = 2000
parallel_setup_cost=0
parallel_tuple_cost=0
min_parallel_table_scan_size=0
max_parallel_workers_per_gather=2
max_parallel_workers = 280
max_worker_processes = 300

and the pgbench command-line:   pgbench  -d postgres -p 33550  -n -r
-f sub_120.sql   -c 140 -j 140 -T 1800

- but I still get no coredump.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Possible memory corruption (src/timezone/zic.c b/src/timezone/zic.c)

2021-05-17 Thread Ranier Vilela
On Sun, May 16, 2021 at 22:37, Kyotaro Horiguchi
<horikyota@gmail.com> wrote:

> At Sat, 15 May 2021 11:35:13 -0300, Ranier Vilela 
> wrote in
> > On Fri, May 14, 2021 at 19:52, Tom Lane
> wrote:
> >
> > > I wrote:
> > > > So the question for us is whether it's worth trying to make
> pgreadlink
> > > > conform to the letter of the POSIX spec in this detail.  TBH, I can't
> > > > get excited about that, at least not so far as zic's usage is
> concerned.
> > >
> > > Hmmm ... on closer inspection, though, it might not be that hard.
> > > pgreadlink is already using a fixed-length buffer (with only enough
> > > room for MAX_PATH WCHARs) for the input of WideCharToMultiByte.  So
> > > it could use a fixed-length buffer of say 4 * MAX_PATH bytes for the
> > > output, and then transfer just the appropriate amount of data to the
> > > caller's buffer.
> > >
> > Following your directions, maybe something like this will solve?
>
> -   DWORD   attr;
> -   HANDLE  h;
>
> Why does the patch move the definitions of "attr" and "h"?
>
Hi Kyotaro, thank you for reviewing this.

I changed the variable declarations for consistency and to avoid memory
fragmentation, following the same principles used when declaring structures.


>
> +   Assert(path != NULL && buf != NULL);
>
> I don't think it's required.  Even if we want to imitate readlink,
> they should (maybe) return EFAULT in that case.
>
Yes, it is not a requirement.
But I try to take every chance to prevent bugs,
and always validating the inputs, sooner or later, helps to find errors.


>
>
> +   buf[r] = '\0';
>
> readlink is defined as not appending a terminator.  In the first place
> the "buf[r] = '\0'" is overrunning the given buffer.
>
Ok. I will remove this.


>
>
> -   return 0 <= readlink(name, &c, 1);
> +   return 0 <= readlink(name, linkpath, sizeof(linkpath));
>
> According to the discussion, we don't want to modify zic.c at
> all. (Maybe forgot to remove?)
>
I haven't forgotten.

I just don't agree with passing a single char where a char buffer is expected.
But I can remove it from the patch too.

regards,
Ranier Vilela


v3_fix_possible_memory_corruption_zic.patch
Description: Binary data


Skip partition tuple routing with constant partition key

2021-05-17 Thread houzj.f...@fujitsu.com
Hi,



When loading some data into a partitioned table for testing purposes,
I found that even if I specify a constant value for the partition key [1],
the tuple routing is still done for each row.



[1]-

UPDATE partitioned set part_key = 2 , …

INSERT into partitioned(part_key, ...) select 1, …

-



I have seen such SQL automatically generated by some programs,
so, personally, I think it would be better to skip the tuple routing in this case.



IMO, we can use the following steps to skip the tuple routing (a rough
sketch follows below):

1) collect the columns that have constant values in the targetList.

2) compare those constant columns with the columns used in the partition key.

3) if all the columns used in the key are constant, then we cache the routed
partition and do not do the tuple routing again.
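
As a rough sketch of the above (every name below is hypothetical, just to
illustrate the caching, and is not taken from the attached POC patch):

#include <stdbool.h>

/* per-statement routing state; all fields are illustrative only */
typedef struct RoutingState
{
    bool    partkey_all_constant;   /* steps 1+2: every key column is constant */
    bool    cached_partition_valid;
    int     cached_partition;       /* index of the partition routed to */
} RoutingState;

/* stands in for the existing per-row routing logic */
extern int do_tuple_routing(RoutingState *rs, const void *tuple);

static int
route_one_tuple(RoutingState *rs, const void *tuple)
{
    if (rs->partkey_all_constant)
    {
        /* step 3: route the first row, then reuse the cached result */
        if (!rs->cached_partition_valid)
        {
            rs->cached_partition = do_tuple_routing(rs, tuple);
            rs->cached_partition_valid = true;
        }
        return rs->cached_partition;
    }

    /* otherwise fall back to routing every row as before */
    return do_tuple_routing(rs, tuple);
}

The idea is simply that steps 1 and 2 run once per statement, and step 3
turns the per-row routing into a single cached lookup.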



In this approach, I did some simple and basic performance tests:



For a plain single-column partition key (partition by range(col)/list(a)...):
when loading 1 rows into the table, I can see about a 5-7% performance gain
for both cross-partition UPDATE and INSERT if a constant is specified for the
partition key.



For a more complicated expression partition key (partition by
range(UDF_func(col)+x)…): when loading 1 rows into the table, the gain is
bigger, about > 20%.



Besides, I did not see noticeable performance degradation for other cases
(small data sets).



Attaching a POC patch for this improvement.

Thoughts?



Best regards,

houzj




0001-skip-tuple-routing-for-constant-partition-key.patch
Description: 0001-skip-tuple-routing-for-constant-partition-key.patch


Re: subscriptioncheck failure

2021-05-17 Thread vignesh C
On Mon, May 17, 2021 at 10:40 AM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 9:06 AM Amit Kapila  wrote:
> >
> > On Thu, May 13, 2021 at 7:06 PM vignesh C  wrote:
> > >
> > > On Thu, May 13, 2021 at 4:41 PM Michael Paquier  
> > > wrote:
> >
> > Few comments:
> > 1.
> > + # Ensure a transactional logical decoding message shows up on the slot
> > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > DISABLE");
> >
> > After you have encapsulated this command in the function, the above
> > comment doesn't make sense because we do this for both transactional
> > and non-transactional messages. I suggest we can change it to
> > something like: "This is done to ensure a logical decoding message
> > shows up on the slot".
> >
> > 2.
> > +# Setup the subscription before checking 
> > pg_logical_slot_peek_binary_changes
> > +sub setup_subscription
> >
> > I think here the functionality is more for the catchup of
> > subscription, so it might be better to name the function as
> > subscription_catchup or catchup_subscription. I think you can expand
> > the comments atop this function a bit as suggested by Michael.
> >
>
> One more point:
> + $node_publisher->wait_for_catchup('tap_sub');
> +
> + # Ensure a transactional logical decoding message shows up on the slot
> + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> DISABLE");
> +
> + # wait for the replication connection to drop from the publisher
> + $node_publisher->poll_query_until('postgres',
> + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> slot_name = 'tap_sub' AND active='f'", 1);
>
> In the above sequence, wait_for_catchup will query pg_stat_replication
> whereas after disabling subscription we are checking
> pg_replication_slots. I understand from your explanation why we can't
> rely on pg_stat_replication after DISABLE but it might be better to
> check that the slot is active before disabling it. I think currently,
> the test assumes that, isn't it better to have an explicit check for
> that?

I felt this is not required: wait_for_catchup will poll_query_until
the state is 'streaming'. Even if START_REPLICATION takes time and the
state is still 'startup', poll_query_until will take care of handling
this.

On further analysis I found that we would need to do "ALTER SUBSCRIPTION
tap_sub ENABLE" and "ALTER SUBSCRIPTION tap_sub DISABLE" multiple
times. Instead, we can change pg_logical_slot_peek_binary_changes to
pg_logical_slot_get_binary_changes at the appropriate steps. This way the
common function can be removed, along with the repeated enable/disable.
If we go ahead with this approach, the above comments no longer apply.

I have made the changes along these lines in the attached patch. If you
are ok with it, we can go ahead with the new approach, which will simplify
the changes required.
Thoughts?

Regards,
Vignesh
From 33fdfd312a9b652a40891c09dc9e075d2dd41e44 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 17 May 2021 17:23:35 +0530
Subject: [PATCH v3] Messages test failure fix.

This test was failing because there was no wait for catchup after
creating the subscription. After disabling the subscription,
pg_stat_replication was used to verify that the walsender had exited.
The steps of walsender exit include cleaning up the walsender and then
releasing the replication slot, so there will be random test failures if
pg_logical_slot_peek_binary_changes is called in that narrow time window.
Fixed it by checking the active column in pg_replication_slots instead of
pg_stat_replication, which is more reliable.
---
 src/test/subscription/t/020_messages.pl | 27 +
 1 file changed, 5 insertions(+), 22 deletions(-)

diff --git a/src/test/subscription/t/020_messages.pl b/src/test/subscription/t/020_messages.pl
index 0940d0f45f..6a76d96d12 100644
--- a/src/test/subscription/t/020_messages.pl
+++ b/src/test/subscription/t/020_messages.pl
@@ -35,12 +35,14 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr' PUBLICATION tap_pub"
 );
 
+$node_publisher->wait_for_catchup('tap_sub');
+
 # Ensure a transactional logical decoding message shows up on the slot
 $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub DISABLE");
 
 # wait for the replication connection to drop from the publisher
 $node_publisher->poll_query_until('postgres',
-	'SELECT COUNT(*) FROM pg_catalog.pg_stat_replication', 0);
+	"SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub' AND active='f'", 1);
 
 $node_publisher->safe_psql('postgres',
 	"SELECT pg_logical_emit_message(true, 'pgoutput', 'a transactional message')"
@@ -77,7 +79,7 @@ is($result, qq(1|pgoutput),
 $result = $node_publisher->safe_psql(
 	'postgres', qq(
 		SELECT get_byte(data, 0)
-		FROM pg_logical_slot_peek_binary_changes('tap_sub', NULL, NULL,
+		FROM pg_logical_slot_get_binary_changes('tap_sub', NULL, NULL,
 			'proto_version', '1',
 			'publication_names', 

RE: Forget close an open relation in ReorderBufferProcessTXN()

2021-05-17 Thread osumi.takami...@fujitsu.com
On Monday, May 17, 2021 6:52 PM Amit Langote  wrote:
> On Mon, May 17, 2021 at 6:13 PM Amit Kapila 
> wrote:
> > On Fri, May 14, 2021 at 12:44 PM Amit Langote
>  wrote:
> > > On Thu, May 13, 2021 at 7:43 PM Amit Kapila 
> wrote:
> > >
> > > > Also, don't we need to free the
> > > > entire map as suggested by me?
> > >
> > > Yes, I had missed that too.  Addressed in the updated patch.
> > >
> >
> > + relentry->map = convert_tuples_by_name(indesc, outdesc);
> > + if (relentry->map == NULL)
> > + {
> > + /* Map not necessary, so free the TupleDescs too. */
> > + FreeTupleDesc(indesc); FreeTupleDesc(outdesc); }
> >
> > I think the patch frees these descriptors when the map is NULL but not
> > otherwise because free_conversion_map won't free these descriptors.
> 
> You're right.  I have fixed that by making the callback free the TupleDescs
> explicitly.
This fix looks correct. Also, RT of v3 didn't fail.

Further, I've checked the newly added tests.
As you said, with v1's contents the test from v2 failed, but we are fine
with v2 and v3, which proves that we adjust the DDL in the right way.

> > BTW, have you tried this patch in back branches because I think we
> > should backpatch this fix?
>
> I have created a version of the patch for v13, the only older release that has
> this code, and can see that tests, including the newly added one, pass.
> 
> Both patches are attached.
The patch for PG13 applies cleanly and the regression tests succeeded.

I have a few really minor comments on the comments in the patch.

(1) schema_sent's comment

@@ -94,7 +94,8 @@ typedef struct RelationSyncEntry

/*
 * Did we send the schema?  If ancestor relid is set, its schema must 
also
-* have been sent for this to be true.
+* have been sent and the map to convert the relation's tuples into the
+* ancestor's format created before this can be set to be true.
 */
boolschema_sent;
List   *streamed_txns;  /* streamed toplevel transactions with 
this


I suggest inserting a comma between 'created' and 'before',
because the sentence is a bit long and confusing.

Or, here is another comment idea for this part,
because the original one doesn't mention when the flag gets reset:

"To avoid sending the schema repeatedly, this is set true after its first
transmission, and reset whenever the relation definition may have changed.
If ancestor relid is set, its schema must also have been sent, and the map
to convert the relation's tuples into the ancestor's format created, before
this flag is set true."

(2) comment in rel_sync_cache_relation_cb()

@@ -1190,13 +1208,25 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)

  HASH_FIND, NULL);

/*
-* Reset schema sent status as the relation definition may have changed.
+* Reset schema sent status as the relation definition may have changed,
+* also freeing any objects that depended on the earlier definition.

How about dividing this sentence into two sentences, like:
"Reset schema sent status as the relation definition may have changed.
Also, free any objects that depended on the earlier definition."


Best Regards,
Takamichi Osumi



Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-17 Thread Ashutosh Bapat
On Mon, May 17, 2021 at 3:29 PM Bharath Rupireddy
 wrote:
>
> Hi,
>
> It looks like values such as '123.456', '789.123', '100$%$#$#', and
> '9,223,372,' are accepted and treated as valid integers for the
> postgres_fdw options batch_size and fetch_size, whereas this is not
> the case with the fdw_startup_cost and fdw_tuple_cost options, for which
> an error is thrown. Attaching a patch to fix that.

This looks like a definite improvement. I wonder if we should modify the
defGetInt variants to convert strings into integers, so that there's a
consistent error message for such errors. We could define defGetUInt so
that we could mention "non-negative" in the error message. Whether or
not we do that, this looks good.
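
Something along these lines, perhaps (defGetUInt doesn't exist today; the
name, the error wording, and the exact checks below are only a sketch
modelled on the existing defGetString()-based option parsing, and assume
<limits.h> for UINT_MAX):

unsigned int
defGetUInt(DefElem *def)
{
    char       *sval = defGetString(def);
    char       *endp;
    long        val;

    errno = 0;
    val = strtol(sval, &endp, 10);

    /* reject empty strings, trailing junk, overflow, and negative values */
    if (sval[0] == '\0' || *endp != '\0' || errno == ERANGE ||
        val < 0 || val > UINT_MAX)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("%s requires a non-negative integer value",
                        def->defname)));

    return (unsigned int) val;
}

Callers would then reject negative or garbage values uniformly, with the
same wording everywhere.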




-- 
Best Wishes,
Ashutosh Bapat




Re: Winflex docs and distro

2021-05-17 Thread Andrew Dunstan


On 5/17/21 5:51 AM, Magnus Hagander wrote:
> On Mon, May 17, 2021 at 11:11 AM Daniel Gustafsson  wrote:
>>> On 17 May 2021, at 10:17, Magnus Hagander  wrote:
>>> Since PostgreSQL 9.3, in commit a266f7dd93b, we've added the text:
>>>
>>> +   The obsolete "winflex" binaries distributed on the PostgreSQL FTP 
>>> site
>> Which was slightly updated in 0a9ae44288d.
> It's been touched a couple of times, but not in any material fashion.
>
>
>>> At this point. I suggest we simply stop distributing winflex on our
>>> download site, and just remove this note from the documentation.
>> Sounds reasonable, are there (easily retrieved) logs/tracking for when it was
>> accessed by anyone last?
> Not really. We don't keep logs going very far back. I can see it being
> accessed a handful of time in the past 14 days. But AFAICT from the
> limited information we have it's all bots.
>



+1 for removing the binary and the reference.

These days my setup for MSVC doesn't use msys: it's basically this PS1
fragment (which assumes chocolatey is installed):

    $utils = 'StrawberryPerl', 'git', 'winflexbison', 'diffutils', 'vim'
    choco install -y --no-progress --limit-output @utils
    $cbin = "c:\ProgramData\chocolatey\bin"
    Rename-Item -Path $cbin\win_bison.exe -NewName bison.exe
    Rename-Item -Path $cbin\win_flex.exe -NewName flex.exe


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-17 Thread Greg Nancarrow
On Sat, May 15, 2021 at 12:37 PM 刘鹏程  wrote:
>
>
>   BTW, I tested it on a high-performance server. It is very easily
> reproduced. My colleague and I use different environments and both can
> reproduce it.
>

Hi Pengcheng,

Although the issue won't reproduce easily in my system, I can
certainly see how, for the snapshots used in the parallel worker case,
the Active snapshot used is potentially an earlier snapshot than the
Transaction snapshot. I don't know why it is getting a newer
Transaction snapshot in InitializeParallelDSM(), when it has
previously pushed the return value of GetTransactionSnapshot() as the
Active snapshot.

So I too hope Tom or Robert can explain what is going on here and how
to resolve it (as you requested them to, in your other post).

I actually think that the Assert in SubTransGetTopmostTransaction() is
correct, but in the parallel-worker case, the snapshots are not being
setup correctly.

Can you try the trivial change below and see if it prevents the coredump?

Regards,
Greg Nancarrow
Fujitsu Australia


diff --git a/src/backend/access/transam/parallel.c
b/src/backend/access/transam/parallel.c
index 14a8690019..870889053f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -216,7 +216,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
  int i;
  FixedParallelState *fps;
  dsm_handle session_dsm_handle = DSM_HANDLE_INVALID;
- Snapshot transaction_snapshot = GetTransactionSnapshot();
+ Snapshot transaction_snapshot = GetActiveSnapshot();
  Snapshot active_snapshot = GetActiveSnapshot();

  /* We might be running in a very short-lived memory context. */




Re: Skip partition tuple routing with constant partition key

2021-05-17 Thread Amit Langote
On Mon, May 17, 2021 at 8:37 PM houzj.f...@fujitsu.com
 wrote:
> When loading some data into a partitioned table for testing purposes,
>
> I found that even if I specify a constant value for the partition key [1],
>
> the tuple routing is still done for each row.
>
>
> [1]-
>
> UPDATE partitioned set part_key = 2 , …
>
> INSERT into partitioned(part_key, ...) select 1, …
>
> -
>
> I saw such SQLs automatically generated by some programs,

Hmm, does this seem common enough for the added complexity to be worthwhile?

For an example of what's previously been considered worthwhile for a
project like this, see what 0d5f05cde0 did.  The cases it addressed
are common enough -- a file being loaded into a (time range-)
partitioned table using COPY FROM tends to have lines belonging to the
same partition consecutively placed.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: wal stats questions

2021-05-17 Thread Fujii Masao




On 2021/05/17 16:39, Masahiro Ikeda wrote:


Thanks for your comments!


+ * is executed, wal records aren't nomally generated (although HOT makes


nomally -> normally?


Yes, fixed.


+ * It's not enough to check the number of generated wal records, for
+ * example the walwriter may write/sync the WAL although it doesn't


You use both 'wal' and 'WAL' in the comments; is there any intention behind that?


No, I changed it to use 'WAL' only. Although some other comments have 'wal'
and 'WAL', it seems that 'WAL' is used more often.


Thanks for updating the patch!

+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.

Is it really worth adding these comments here?

+* Note: regarding to WAL statistics counters, it's not enough to check
+* only whether the WAL record is generated or not. If a read 
transaction
+* is executed, WAL records aren't normally generated (although HOT 
makes
+* WAL records). But, just writes and syncs the WAL data may happen when
+* to flush buffers.

Aren't the following comments better?

--
To determine whether any WAL activity has occurred since last time, not only 
the number of generated WAL records but also the numbers of WAL writes and 
syncs need to be checked, because even a transaction that generates no WAL
records can write or sync WAL data when flushing the data pages.
--

-* This function can be called even if nothing at all has happened. In
-* this case, avoid sending a completely empty message to the stats
-* collector.

I think that it's better to leave this comment as it is.

+* First, to check the WAL stats counters were updated.
+*
+* Even if the 'force' is true, we don't need to send the stats if the
+* counters were not updated.
+*
+* We can know whether the counters were updated or not to check only
+* three counters. So, for performance, we don't allocate another memory
+* spaces and check the all stats like pgstat_send_slru().

Is it really worth adding these comments here?

+* The current 'wal_records' is the same as the previous one means that
+* WAL records weren't generated. So, the counters of 'wal_fpi',
+* 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+*
+* It's not enough to check the number of generated WAL records, for
+* example the walwriter may write/sync the WAL although it doesn't
+* generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means 
the
+* counters of time spent are zero too.

Aren't the following comments better?


Check wal_records counter to determine whether any WAL activity has happened 
since last time. Note that other WalUsage counters don't need to be checked 
because they are incremented always together with wal_records counter.

m_wal_buffers_full also doesn't need to be checked because it's incremented 
only when at least one WAL record is generated (i.e., wal_records counter is 
incremented). But for safety, we assert that m_wal_buffers_full is always zero
when no WAL record is generated.

This function can be called by a process like walwriter that normally generates 
no WAL records. To determine whether any WAL activity has happened at that 
process since the last time, the numbers of WAL writes and syncs are also 
checked.
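
Put as code, the early-exit test being described boils down to something
like this (a sketch; the variable and field names are written from memory
and may not match the patch exactly):

/* Quick exit if no WAL activity has happened since the last report. */
if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
    WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
    return;

/* No new WAL records implies the buffers-full counter did not move. */
Assert(pgWalUsage.wal_records != prevWalUsage.wal_records ||
       WalStats.m_wal_buffers_full == 0);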


+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.

Aren't the following comments better?


These counters keep being incremented infinitely, i.e., must never be reset to 
zero, so that we can calculate how much the counters are incremented in an 
arbitrary period.


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PG 14 release notes, first draft

2021-05-17 Thread Bruce Momjian
On Mon, May 17, 2021 at 04:18:27PM +0900, Fujii Masao wrote:
> 
> 
> On 2021/05/10 15:03, Bruce Momjian wrote:
> > I have committed the first draft of the PG 14 release notes.  You can
> > see the most current  build of them here:
> > 
> > https://momjian.us/pgsql_docs/release-14.html
> 
> Thanks a lot!!
> 
> 
> > I need clarification on many items, and the document still needs its
> > items properly ordered, and markup added.  I also expect a lot of
> > feedback.
> 
> > 
> > Add function pg_backend_memory_contexts() to output the memory contexts of 
> > arbitrary backends (Atsushi Torikoshi)
> 
> Could you fix the function name from pg_backend_memory_contexts() to
> pg_log_backend_memory_contexts()? That is, "log_" is necessary just after
> "pg_" in the function name.

Ah, I see my mistake, fixed.

> > 
> > Add lock wait time to pg_locks (Atsushi Torikoshi)
> 
> Could you change "lock wait time" to "lock wait start time"? Because
> the pg_locks.waitstart column that was added in v14 reports the time
> when the server process *started* waiting for the lock.

Yes, good point, fixed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: pg_dumpall misses --no-toast-compression

2021-05-17 Thread Daniel Gustafsson
> On 17 May 2021, at 03:12, Michael Paquier  wrote:

> I have bumped into $subject while playing with this feature, and this
> can be really useful to be able to reset the compression method for
> all the tables at restore.  The patch is simple but that's perhaps too
> late for 14, so I am adding it to the next CF.

I think there is a reasonable case to be made for this fixing an oversight in
bbe0a81db69bd10bd166907c3701492a29aca294 as opposed to adding a brand new
feature.  Save for --no-synchronized-snapshots, all --no-xxx options in pg_dump
are mirrored in pg_dumpall.

--
Daniel Gustafsson   https://vmware.com/





Re: PG 14 release notes, first draft

2021-05-17 Thread Jonathan S. Katz
Hi,

On 5/10/21 2:03 AM, Bruce Momjian wrote:
> I have committed the first draft of the PG 14 release notes.  You can
> see the most current  build of them here:
> 
>   https://momjian.us/pgsql_docs/release-14.html
> 
> I need clarification on many items, and the document still needs its
> items properly ordered, and markup added.  I also expect a lot of
> feedback.
> 
> I plan to work on completing this document this coming week in
> preparation for beta next week.

Again, thanks for compiling the release notes.

I have published the first draft of the release announcement on
-advocacy, you can find the original email here:

https://www.postgresql.org/message-id/f82ccf97-d02c-0974-27b8-e0f5e4c1de02%40postgresql.org

If you are interested in reviewing the release announcement, please
provide feedback on the -advocacy mailing list.

Thanks!

Jonathan





Re: postgres_fdw - should we tighten up batch_size, fetch_size options against non-numeric values?

2021-05-17 Thread Bharath Rupireddy
On Mon, May 17, 2021 at 6:17 PM Ashutosh Bapat
 wrote:
>
> On Mon, May 17, 2021 at 3:29 PM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > It looks like values such as '123.456', '789.123', '100$%$#$#', and
> > '9,223,372,' are accepted and treated as valid integers for the
> > postgres_fdw options batch_size and fetch_size, whereas this is not
> > the case with the fdw_startup_cost and fdw_tuple_cost options, for which
> > an error is thrown. Attaching a patch to fix that.
>
> This looks like a definite improvement. I wonder if we should modify
> defGetInt variants to convert strings into integers, so that there's
> consistent error message for such errors. We could define defGetUInt
> so that we could mention non-negative in the error message.

If we do that, then the options that currently accept only unquoted
integers (i.e. 123, 456, etc.) and throw errors for quoted integers
('123', '456', etc.) will start to accept the quoted form as well. Other
hackers might not agree with that change.

Another way is to have a new API like
defGetInt32_2/defGetInt64_2/defGetNumeric_2 (or some other better
names) which accepts both quoted and unquoted strings; see [1] for a
rough sketch of the function. These APIs can be useful if an option
needs to be parsed in both quoted and unquoted form. Or we can also
have these functions as in [2], for parsing only quoted options. I
prefer [2] so that these APIs can be used without any code duplication.
Thoughts?

> Whether or not we do that, this looks good.

I'm also okay with just fixing the fetch_size and batch_size options
for now, as is done in the patch attached to the first mail. Note
that I have not added any test case since this change is trivial.
If required, I can add one.

[1] -
int32
defGetInt32_2(DefElem *def)
{
    if (def->arg == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("%s requires an integer value",
                        def->defname)));

    switch (nodeTag(def->arg))
    {
        case T_Integer:
            return (int32) intVal(def->arg);
        default:
            {
                char       *sval;
                char       *endp;
                int32       val;

                sval = defGetString(def);
                val = strtol(sval, &endp, 10);

                if (*endp == '\0')
                    return val;
            }
    }

    ereport(ERROR,
            (errcode(ERRCODE_SYNTAX_ERROR),
             errmsg("%s requires an integer value",
                    def->defname)));

    return 0;
}

[2] -
int32
defGetInt32_2(DefElem *def)
{
    char       *sval;
    char       *endp;
    int32       val;

    sval = defGetString(def);
    val = strtol(sval, &endp, 10);

    if (*endp)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("%s requires an integer value",
                        def->defname)));
    return val;
}

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-17 Thread Fujii Masao

If a promotion is triggered while recovery is paused, the paused state ends
and promotion continues. But currently pg_get_wal_replay_pause_state()
returns 'paused' in that case. Isn't this a bug?

The attached patch fixes this issue by resetting the recovery pause state to
'not paused' when standby promotion is triggered.

Thoughts?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 8d163f190f..441a9124cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12825,6 +12825,14 @@ SetPromoteIsTriggered(void)
XLogCtl->SharedPromoteIsTriggered = true;
SpinLockRelease(&XLogCtl->info_lck);
 
+   /*
+* Mark the recovery pause state as 'not paused' because the paused 
state
+* ends and promotion continues if a promotion is triggered while 
recovery
+* is paused. Otherwise pg_get_wal_replay_pause_state() can mistakenly
+* return 'paused' while a promotion is ongoing.
+*/
+   SetRecoveryPause(false);
+
LocalPromoteIsTriggered = true;
 }
 


Re: pg_dumpall misses --no-toast-compression

2021-05-17 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 17 May 2021, at 03:12, Michael Paquier  wrote:
>> I have bumped into $subject while playing with this feature, and this
>> can be really useful to be able to reset the compression method for
>> all the tables at restore.  The patch is simple but that's perhaps too
>> late for 14, so I am adding it to the next CF.

> I think there is a reasonable case to be made for this fixing an oversight in
> bbe0a81db69bd10bd166907c3701492a29aca294 as opposed to adding a brand new
> feature.  Save for --no-synchronized-snapshots all --no-xxx options in pg_dump
> are mirrored in pg_dumpall.

+1, seems more like fixing an oversight than anything else.

regards, tom lane




Re: Removed extra memory allocations from create_list_bounds

2021-05-17 Thread Nitin Jadhav
> > While working on [1], I observed that extra memory is allocated in
> > [1] 
> > https://mail.google.com/mail/u/2/#search/multi+column+list/KtbxLxgZZTjRxNrBWvmHzDTHXCHLssSprg?compose=CllgCHrjDqKgWCBNMmLqhzKhmrvHhSRlRVZxPCVcLkLmFQwrccpTpqLNgbWqKkTkTFCHMtZjWnV

I am really sorry for this. I wanted to point to the thread with the
subject 'Multi-Column List Partitioning'.

> If it's worth counting list elements in advance, then you can also allocate 
> the
> PartitionListValue as a single chunk, rather than palloc in a loop.
> This may help large partition hierarchies.
>
> And the same thing in create_hash_bounds with hbounds.

I agree and thanks for creating those patches. I am not able to apply
the patch on the latest HEAD. Kindly check and upload the modified
patches.

> I'm not able to detect that this is saving more than about ~1% less RAM, to
> create or select from 1000 partitions, probably because other data structures
> are using much more, and savings here are relatively small.

Yes, it does not save a huge amount of memory, but it's an improvement.

> I'm going to add to the next CF.  You can add yourself as an author, and watch
> that the patch passes tests in cfbot.
> https://commitfest.postgresql.org/
> http://cfbot.cputube.org/

Thanks for creating the commitfest entry.

> Since the function returns the total number of non null bound values, should 
> it be named get_non_null_list_bounds_count ?
> It can also be named get_count_of_... but that's longer.

Changed it to 'get_non_null_list_bounds_count'.

> The palloc() call would take place even if ndatums is 0. I think in that 
> case, palloc() doesn't need to be called.

I feel there is no case where 'ndatums' is 0 because, as we can see
below, there is an assert in the 'partition_bounds_create' function,
from which we call the 'create_list_bounds' function. Kindly provide
such a case if I am wrong.

PartitionBoundInfo
partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
PartitionKey key, int **mapping)
{
int i;

Assert(nparts > 0);

Thanks & Regards,
Nitin Jadhav
On Sun, May 16, 2021 at 10:52 PM Zhihong Yu  wrote:
>
>
>
> On Sun, May 16, 2021 at 10:00 AM Justin Pryzby  wrote:
>>
>> On Sat, May 15, 2021 at 02:40:45PM +0530, Nitin Jadhav wrote:
>> > While working on [1], I observed that extra memory is allocated in
>> > [1] 
>> > https://mail.google.com/mail/u/2/#search/multi+column+list/KtbxLxgZZTjRxNrBWvmHzDTHXCHLssSprg?compose=CllgCHrjDqKgWCBNMmLqhzKhmrvHhSRlRVZxPCVcLkLmFQwrccpTpqLNgbWqKkTkTFCHMtZjWnV
>>
>> This is a link to your gmail, not to anything public.
>>
>> If it's worth counting list elements in advance, then you can also allocate 
>> the
>> PartitionListValue as a single chunk, rather than palloc in a loop.
>> This may help large partition hierarchies.
>>
>> And the same thing in create_hash_bounds with hbounds.
>>
>> create_range_bounds() already doesn't call palloc in a loop.  However, then
>> there's an asymmetry in create_range_bounds, which is still takes a
>> double-indirect pointer.
>>
>> I'm not able to detect that this is saving more than about ~1% less RAM, to
>> create or select from 1000 partitions, probably because other data structures
>> are using much more, and savings here are relatively small.
>>
>> I'm going to add to the next CF.  You can add yourself as an author, and 
>> watch
>> that the patch passes tests in cfbot.
>> https://commitfest.postgresql.org/
>> http://cfbot.cputube.org/
>>
>> Thanks,
>> --
>> Justin
>
> Hi,
> For 0001-Removed-extra-memory-allocations-from-create_list_bo.patch :
>
> +static int
> +get_non_null_count_list_bounds(PartitionBoundSpec **boundspecs, int nparts)
>
> Since the function returns the total number of non null bound values, should 
> it be named get_non_null_list_bounds_count ?
> It can also be named get_count_of_... but that's longer.
>
> +   all_values = (PartitionListValue **)
> +   palloc(ndatums * sizeof(PartitionListValue *));
>
> The palloc() call would take place even if ndatums is 0. I think in that 
> case, palloc() doesn't need to be called.
>
> Cheers
>
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 7925fcc..e0a04fe 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -433,6 +433,32 @@ create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
 }
 
 /*
+ * get_non_null_list_bounds_count
+ * 		Calculates the total number of non null bound values of
+ * 		all the partitions.
+ */
+static int
+get_non_null_list_bounds_count(PartitionBoundSpec **boundspecs, int nparts)
+{
+	int	i = 0;
+	int count = 0;
+
+	for (i = 0; i < nparts; i++)
+	{
+		ListCell   *c;
+
+		foreach(c, boundspecs[i]->listdatums)
+		{
+			Const   *val = castNode(Const, lfirst(c));
+
+			if (!val->constisnull)
+				count++;
+		}
+	}
+
+	return count;
+}
+/*
  * create_list_bounds
  *		Create a PartitionBoundInfo for a list partitioned table
  */
@@ -442,1

Re: Corrected documentation of data type for the logical replication message formats.

2021-05-17 Thread vignesh C
On Thu, May 13, 2021 at 5:57 AM Peter Smith  wrote:
>
> On Wed, May 12, 2021 at 11:09 PM vignesh C  wrote:
> ...
> >
> > Thanks for the comments. Attached v4 patch has the fix for the same.
> >
>
> I have not tried this patch so I cannot confirm whether it applies or
> renders OK, but just going by the v4 content this now LGTM.

Thanks for having a look at it.
I have added a commitfest entry for this:
https://commitfest.postgresql.org/33/3117/

Regards,
Vignesh


Re: Removed extra memory allocations from create_list_bounds

2021-05-17 Thread Zhihong Yu
On Mon, May 17, 2021 at 7:52 AM Nitin Jadhav 
wrote:

> > > While working on [1], I observed that extra memory is allocated in
> > > [1]
> https://mail.google.com/mail/u/2/#search/multi+column+list/KtbxLxgZZTjRxNrBWvmHzDTHXCHLssSprg?compose=CllgCHrjDqKgWCBNMmLqhzKhmrvHhSRlRVZxPCVcLkLmFQwrccpTpqLNgbWqKkTkTFCHMtZjWnV
>
> I am really sorry for this. I wanted to point to the thread subjected
> to 'Multi-Column List Partitioning'.
>
> > If it's worth counting list elements in advance, then you can also
> allocate the
> > PartitionListValue as a single chunk, rather than palloc in a loop.
> > This may help large partition hierarchies.
> >
> > And the same thing in create_hash_bounds with hbounds.
>
> I agree and thanks for creating those patches. I am not able to apply
> the patch on the latest HEAD. Kindly check and upload the modified
> patches.
>
> > I'm not able to detect that this is saving more than about ~1% less RAM,
> to
> > create or select from 1000 partitions, probably because other data
> structures
> > are using much more, and savings here are relatively small.
>
> Yes it does not save huge memory but it's an improvement.
>
> > I'm going to add to the next CF.  You can add yourself as an author, and
> watch
> > that the patch passes tests in cfbot.
> > https://commitfest.postgresql.org/
> > http://cfbot.cputube.org/
>
> Thanks for creating the commitfest entry.
>
> > Since the function returns the total number of non null bound values,
> should it be named get_non_null_list_bounds_count ?
> > It can also be named get_count_of_... but that's longer.
>
> Changed it to 'get_non_null_list_bounds_count'.
>
> > The palloc() call would take place even if ndatums is 0. I think in that
> case, palloc() doesn't need to be called.
>
> I feel there is no such case where the 'ndatums' is 0 because as we
> can see below, there is an assert in the 'partition_bounds_create'
> function from where we call the 'create_list_bounds' function. Kindly
> provide such a case if I am wrong.
>
> PartitionBoundInfo
> partition_bounds_create(PartitionBoundSpec **boundspecs, int nparts,
> PartitionKey key, int **mapping)
> {
> int i;
>
> Assert(nparts > 0);
>

Hi,
Thanks for pointing out the assertion.
My corresponding comment can be dropped.

Cheers


>
> Thanks & Regards,
> Nitin Jadhav
> On Sun, May 16, 2021 at 10:52 PM Zhihong Yu  wrote:
> >
> >
> >
> > On Sun, May 16, 2021 at 10:00 AM Justin Pryzby 
> wrote:
> >>
> >> On Sat, May 15, 2021 at 02:40:45PM +0530, Nitin Jadhav wrote:
> >> > While working on [1], I observed that extra memory is allocated in
> >> > [1]
> https://mail.google.com/mail/u/2/#search/multi+column+list/KtbxLxgZZTjRxNrBWvmHzDTHXCHLssSprg?compose=CllgCHrjDqKgWCBNMmLqhzKhmrvHhSRlRVZxPCVcLkLmFQwrccpTpqLNgbWqKkTkTFCHMtZjWnV
> >>
> >> This is a link to your gmail, not to anything public.
> >>
> >> If it's worth counting list elements in advance, then you can also
> allocate the
> >> PartitionListValue as a single chunk, rather than palloc in a loop.
> >> This may help large partition hierarchies.
> >>
> >> And the same thing in create_hash_bounds with hbounds.
> >>
> >> create_range_bounds() already doesn't call palloc in a loop.  However,
> then
> >> there's an asymmetry in create_range_bounds, which is still takes a
> >> double-indirect pointer.
> >>
> >> I'm not able to detect that this is saving more than about ~1% less
> RAM, to
> >> create or select from 1000 partitions, probably because other data
> structures
> >> are using much more, and savings here are relatively small.
> >>
> >> I'm going to add to the next CF.  You can add yourself as an author,
> and watch
> >> that the patch passes tests in cfbot.
> >> https://commitfest.postgresql.org/
> >> http://cfbot.cputube.org/
> >>
> >> Thanks,
> >> --
> >> Justin
> >
> > Hi,
> > For 0001-Removed-extra-memory-allocations-from-create_list_bo.patch :
> >
> > +static int
> > +get_non_null_count_list_bounds(PartitionBoundSpec **boundspecs, int
> nparts)
> >
> > Since the function returns the total number of non null bound values,
> should it be named get_non_null_list_bounds_count ?
> > It can also be named get_count_of_... but that's longer.
> >
> > +   all_values = (PartitionListValue **)
> > +   palloc(ndatums * sizeof(PartitionListValue *));
> >
> > The palloc() call would take place even if ndatums is 0. I think in that
> case, palloc() doesn't need to be called.
> >
> > Cheers
> >
>


Re: Removed extra memory allocations from create_list_bounds

2021-05-17 Thread Justin Pryzby
On Mon, May 17, 2021 at 08:22:25PM +0530, Nitin Jadhav wrote:
> I agree and thanks for creating those patches. I am not able to apply
> the patch on the latest HEAD. Kindly check and upload the modified
> patches.

The CFBOT had no issues with the patches, so I suspect an issue on your side.
http://cfbot.cputube.org/nitin-jadhav.html

-- 
Justin




Re: [HACKERS] logical decoding of two-phase transactions

2021-05-17 Thread vignesh C
On Mon, May 17, 2021 at 6:10 PM Ajin Cherian  wrote:
>
> The above patch had some changes missing which resulted in some tap
> tests failing. Sending an updated patchset. Keeping the patchset
> version the same.

Thanks for the updated patch; it fixes the TAP test failures.

Regards,
Vignesh




Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2021-05-17 Thread Andrew Dunstan


On 5/16/21 9:55 PM, Michael Paquier wrote:
> On Sat, May 15, 2021 at 02:22:24PM -0400, Andrew Dunstan wrote:
>> PostgresNode is currently able to create nodes suitable for upgrade down
>> to release 10. If we add '-w' to the 'pg_ctl start' flags that can
>> extend down to release 9.5. (Just tested) I think we should do that
>> forthwith. '-w' is now the default, but having it there explicitly does
>> no harm.
> Agreed.  When testing manually, I have personally never worked on any
> patches that required binaries older than 9.4, so I would be fine if
> the TAP tests are able to work easily down to 9.5, even if pg_upgrade
> is supported down to 8.4.
>
>> If people are interested in what's incompatible on older versions, they
>> can look at
>> 
>> starting at about line 2764.
> We should really have adjust_conf() at some point in the in-core
> module.



Yes, I'm going to be proposing a series of smallish patches including
these when the tree is branched (which I hope will be in a few weeks).


>> I don't think this will work, though, unless there is enough data to
>> exercise pg_upgrade fairly thoroughly. The approach taken by both
>> test.sh and (somewhat more comprehensively) by the buildfarm cross
>> version upgrade module is to test a cluster where the regression tests
>> have been run.
> Yeah, that's what my patch is doing with pg_regress, FWIW.  This
> requires regress.so from the old version.
>
>> That might be more difficult when testing against older
>> versions, so I have published a snapshot of the dumps of each of the
>> versions we test against in the buildfarm animal crake. These could be
>> loaded into PostgresNode instances and then an upgrade attempted. See
>> . The data
>> goes back to 9.2. These compressed dumps are a couple of megabytes each,
>> not huge.
> I agree that this can be simpler in some cases.  In your experience,
> how much of an issue is it when it becomes necessary to keep around
> binaries that rely on libraries older than what a system can support?
> It is easy to counter issues in this area with OpenSSL and
> non-necessary things, but we had in the past also cases where we had
> code that conflicted with the kernel, aka 3e68686.


That one at least isn't an issue. Old versions of postgres didn't have
pg_rewind.



>
> At the end of this exercise, what I think we should achieve is to:
> 1) Reduce the diff between the buildfarm code and the in-core code.
> 2) Get rid of test.sh.
> 3) Be able to run easily upgrade tests across major versions for
> developers.
>
> As of now, 3) requires some extra facilities depending on if this is
> done by the buildfarm or the in-core tests:
> 1) Path to the source code of the old version.  This is required once
> it becomes necessary to find out src/test/regress/ for the schedule,
> the tests to run and its regress.so.  There is no need to do that if
> you have a dump of the old instance.
> 2) Path to a custom dump to replace the run with pg_regress from 1).
> 3) Location of the old binaries, for pg_upgrade.  When it comes to
> PostgresNode, we require an install path, so we cannot use directly
> the location of the binaries.
>
> Looking at the code of the buildfarm, its code does something smarter
> than what my patch or HEAD's test.sh does now, as these require the
> path for the old source.  The buildfarm code first scans for the
> probin's used in the regression database and then updates any
> references.  What test.sh and my patch do is using the path to the old
> source code and run a single UPDATE.  The approach taken by the
> buildfarm code is more portable, as a path to the old source code
> becomes necessary only if running pg_regress manually.  So, what about
> doing the following thing?
> 1) Update the TAP test so as probin entries are updated in the same way
> as the buildfarm.
> 2) Allow one to specify a path to a custom dump, or a path to the old
> source code for pg_regress.
>
> If we do that, then it should be possible to reduce the code footprint
> in the buildfarm code, while still allowing people to test major
> upgrades in the same old-fashioned way, right?  That's assuming that
> PostgresNode is made compatible down to 9.2, of course, as a first
> step, as that's the range of the dumps you are keeping around for the
> buildfarm.
>


I'm intending to add some older dumps. :-) But for now 9.2 is a good target.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619

2021-05-17 Thread Andres Freund
Hi,

On 2021-05-17 20:14:40 +0200, Drouvot, Bertrand wrote:
> FWIW a patch proposal to copy the oldest unfrozen XID during pg_upgrade (it
> adds a new (-u) parameter to pg_resetwal) was submitted a couple of
> weeks ago, see: https://commitfest.postgresql.org/33/3105/

I'll try to look at it soon.


> I was also wondering if:
> 
>  * We should keep the old behavior in case pg_resetwal -x is being used
>    without -u? (The proposed patch does not set an arbitrary oldestXID
>    anymore in case -x is used.)

I don't think we should. I don't see anything in the old behaviour worth
maintaining.


>  * We should ensure that the xid provided with -x or -u is >=
>    FirstNormalTransactionId? (Currently the only check is that it is > 0.)

Applying TransactionIdIsNormal() seems like a good idea. I think it's
important to verify that the xid provided with -x is within a reasonable
range of the oldest xid.
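
In pg_resetwal's option validation that could look roughly like this (a
sketch only; the variable names follow pg_resetwal's style, but
set_oldest_xid in particular is just an assumed name for the new -u value):

    if (!TransactionIdIsNormal(set_xid))
    {
        pg_log_error("transaction ID (-x) must be greater than or equal to %u",
                     FirstNormalTransactionId);
        exit(1);
    }

    if (!TransactionIdIsNormal(set_oldest_xid))
    {
        pg_log_error("oldest transaction ID (-u) must be greater than or equal to %u",
                     FirstNormalTransactionId);
        exit(1);
    }

    /* a further check that set_xid is not unreasonably far ahead of
     * set_oldest_xid could go here */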

Greetings,

Andres Freund




Re: PG 14 release notes, first draft

2021-05-17 Thread Alvaro Herrera
On 2021-May-16, Bruce Momjian wrote:

> OK, what I was trying to say was that if you dump/restore, and the old
> password was md5, the newly-restored password will be md5, but it was
> very unclear.  I changed it to this:
> 
>   
>   
>   
>   
>   Change the default of the password_encryption server parameter to
>   scram-sha-256 (Peter Eisentraut)
>   
>   
>   
>   Previously it was md5.  All new passwords will be stored as SHA256
>   unless this server variable is changed or the password is specified
>   in md5 format.  Also, the legacy (and undocumented) boolean-like
>   values which were previously synonyms for md5
>   are no longer accepted.
>   
>   

Ah, yeah, now I understand.  WFM.

Thanks

-- 
Álvaro Herrera39°49'30"S 73°17'W
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: PG 14 release notes, first draft

2021-05-17 Thread Peter Geoghegan
I propose that the B-Tree index item (bottom-up index deletion, "Allow
btree index additions to remove expired index entries to prevent page
splits") link back to the documentation for those that want to drill
down:

https://www.postgresql.org/docs/14/btree-implementation.html#BTREE-DELETION

The same approach was used for the Postgres 13 B-Tree deduplication
release note item (same page from the docs, even).

I understand that it's kind of hard to get the point of the feature
across within the confines of the release notes, which have a very
wide audience. I believe that the practical take-away for more
experienced users should be something like "significantly ameliorates
write-amplification problems in indexes with non-HOT updates". But
that's quite a mouthful. Having a hyperlink to the docs seems like a
sensible compromise. I really just want experienced users to be able
to make the connection between this feature and well-publicized
problems with Postgres index bloat/VACUUM.

Thanks
-- 
Peter Geoghegan




Re: PG 14 release notes, first draft

2021-05-17 Thread Peter Geoghegan
On Mon, May 17, 2021 at 12:45 PM Peter Geoghegan  wrote:
> https://www.postgresql.org/docs/14/btree-implementation.html#BTREE-DELETION

This link currently 404s --
https://www.postgresql.org/docs/devel/btree-implementation.html#BTREE-DELETION
works, though.

-- 
Peter Geoghegan




Re: Winflex docs and distro

2021-05-17 Thread Magnus Hagander
On Mon, May 17, 2021 at 2:55 PM Andrew Dunstan  wrote:
>
>
> On 5/17/21 5:51 AM, Magnus Hagander wrote:
> > On Mon, May 17, 2021 at 11:11 AM Daniel Gustafsson  wrote:
> >>> On 17 May 2021, at 10:17, Magnus Hagander  wrote:
> >>> Since PostgreSQL 9.3, in commit a266f7dd93b, we've added the text:
> >>>
> >>> +   The obsolete "winflex" binaries distributed on the PostgreSQL FTP 
> >>> site
> >> Which was slightly updated in 0a9ae44288d.
> > It's been touched a couple of times, but not in any material fashion.
> >
> >
> >>> At this point. I suggest we simply stop distributing winflex on our
> >>> download site, and just remove this note from the documentation.
> >> Sounds reasonable, are there (easily retrieved) logs/tracking for when it 
> >> was
> >> accessed by anyone last?
> > Not really. We don't keep logs going very far back. I can see it being
> > accessed a handful of time in the past 14 days. But AFAICT from the
> > limited information we have it's all bots.
> >
>
>
>
> +1 for removing the binary and the reference.

I think we've collected enough +1's, so I'll go ahead and do it.


> These days my setup for MSVC doesn't use msys: it's basically this PS1
> fragment (which assumes chocolatey is installed):
>
> $utils = 'StrawberryPerl', 'git', 'winflexbison', 'diffutils', 'vim'
> choco install -y --no-progress --limit-output @utils
> $cbin = "c:\ProgramData\chocolatey\bin"
> Rename-Item -Path $cbin\win_bison.exe -NewName bison.exe
> Rename-Item -Path $cbin\win_flex.exe -NewName flex.exe

Perhaps it is, as a separate thing, worth including that in the docs
somewhere? Or maybe as a script in the source tree that is referenced
from the docs?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Race condition in recovery?

2021-05-17 Thread Robert Haas
On Sat, May 15, 2021 at 1:25 AM Dilip Kumar  wrote:
> >  As I understand it, the general issue here was that if
> > XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> > prior to this commit it would have to fail, because the foreach() loop
> > in that function would be iterating over an empty list. Heikki tried
> > to make it not fail in that case, by setting tles =
> > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> > *wouldn't* get an empty list.
>
> I might be missing something, but I don't agree with this logic.  If
> you look at the code flow prior to this commit, it was like below[1].  So
> my point is that if we are calling XLogFileReadAnyTLI() somewhere while
> reading the checkpoint record, then note that expectedTLEs was
> initialized unconditionally before we even try to read that checkpoint
> record.  So how could expectedTLEs be uninitialized in
> XLogFileReadAnyTLI?

Sorry, you're right. It couldn't be uninitialized, but it could be a
fake 1-element list saying there are no ancestors rather than the real
value. So I think the point was to avoid that.

> Another point which I am not sure about, but which I still think is
> somehow related to the purpose of this commit, is that one line
> (expectedTLEs = readTimeLineHistory(receiveTLI);).  Let me explain why
> I think so.  Basically, before this commit, we were initializing
> "expectedTLEs" based on the history file of "recoveryTargetTLI", right
> after calling "readRecoveryCommandFile()" (this function will
> initialize recoveryTargetTLI based on the recovery target, and it
> ensures it reads the respective history file).  Now, right after this
> point, there was a check as shown below[2], which checks whether
> the checkpoint TLI exists in the "expectedTLEs" that was initialized
> based on recoveryTargetTLI.  And it appeared that this check was
> failing in some cases, which this commit tried to fix, and all the
> other code is there to support that.  Because now we do not initialize
> "expectedTLEs" before going to read the checkpoint, after moving this
> line it became possible for "expectedTLEs" to be uninitialized in
> XLogFileReadAnyTLI(), and the remaining code in XLogFileReadAnyTLI()
> is there to handle that case.

I think the issue here is: If expectedTLEs was initialized before the
history file was available, and contained a dummy 1-element list, then
tliOfPointInHistory() is going to say that every LSN is on that
timeline rather than any previous timeline. And if we are supposed to
be switching timelines then that will lead to this sanity check
failing.

> Now, coming to my point about why this one line is related: in this
> one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
> completely avoid recoveryTargetTLI and initialize "expectedTLEs"
> based on the history file of the TLI from which we read the checkpoint.
> So now there is no way for the check below[2] to fail, because note that
> we are not initializing "expectedTLEs" based on "recoveryTargetTLI" but
> from the history of the timeline from which we read the checkpoint.

I agree, but that's actually bad, isn't it? I mean, if we want the
sanity check to never fail we can just take it out. What we want is for
the sanity check to pass if the TLI of the startup checkpoint is in the
history of the recovery target timeline, but fail if it isn't. The only
way to achieve that behavior is if expectedTLEs is initialized from the
recovery target timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_hba.conf.sample wording improvement

2021-05-17 Thread Magnus Hagander
On Thu, Apr 29, 2021 at 5:06 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Thu, Apr 29, 2021 at 7:08 AM Peter Eisentraut
> >  wrote:
> > > On 28.04.21 16:09, Alvaro Herrera wrote:
> > > > Looking at it now, I wonder how well do the "hostno" options work.  If I
> > > > say "hostnogssenc", is an SSL-encrypted socket good?  If I say
> > > > "hostnossl", is a GSS-encrypted socket good?  If so, how does that make
> > > > sense?
> > >
> > > I think for example if you want to enforce SSL connections, then writing
> > > "hostnossl ... reject" would be sensible.  That would also reject
> > > GSS-encrypted connections, but that would be what you want in that 
> > > scenario.
> >
> > I'd say the interface has become a lot less well-matching now that we
> > have two separate settings for it. For example right now it's more
> > complex to say "reject anything not encrypted", which I bet is what a
> > lot of people would want. They don't particularly care if it's gss
> > encrypted or ssl encrypted.
>
> I'm not really sure that I agree it's such an issue, particularly since
> you have to come up with a way to specify the auth method to use somehow
> too as we haven't got any fallback mechanism or anything like that.
> While you might use cert-based auth or SCRAM for TLS connections, it
> isn't the case that you can use SCRAM with a GSS encrypted connection.
>
> > Perhaps what we want to do (obviously not for 14) is to allow you to
> > specify more than one entry in the first column, so you could say
> > "hostssl,hostgssenc" on the same row? That would give some strange
> > results with the "no" mappings, but it might work if used right?
>
> In general, I'm not against the idea of giving more options but I'm just
> not sure that it's a real use-case when you consider that the auth
> method also has to be specified.  I also don't recall anyone showing up
> asking about how they could specify "encrypted but I don't care how".

TBH, I bet that is a lot because people don't use gss encryption, or
even know it exists. Which is sad, because it's darn convenient once
you have Kerberos set up...
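
For illustration only (a hand-written sketch assuming current pg_hba.conf
semantics, IPv4 only, not taken from any patch in this thread), saying
"reject anything that is not encrypted" today takes a stack of entries,
each with its own auth method:

    hostgssenc  all  all  0.0.0.0/0   gss
    hostssl     all  all  0.0.0.0/0   scram-sha-256
    host        all  all  0.0.0.0/0   reject

A hypothetical combined first-column syntax could fold the first two lines
into one only when a single auth method suits both transports, which is
part of the objection above.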

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Magnus Hagander
On Fri, May 14, 2021 at 8:58 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Chapman Flack (c...@anastigmatix.net) wrote:
> > If pg_hba syntax changes are being entertained, I would love to be able
> > to set ssl_min_protocol_version locally in a hostssl rule.
> >
> > Some clients at $work are stuck with ancient SSL libraries, but I would
> > much rather be able to weaken ssl_min_protocol_version just for them
> > than do it globally.
>
> This (unlike what was actually proposed) does seem like it'd be a useful
> improvement.  Not sure exactly how it would work but I'm generally on
> board with the idea.

I agree, but I have no idea how you could do that within the current
pg_hba.conf.

The row is selected by the combination of username/database/ipaddress.
But you have to pick the minimum TLS version before the client has
sent that... Basically we have to make the choice long before we've
even started looking at pg_hba.

It would be good to have a way to do it, but I'm not sure pg_hba.conf
is the place for it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Chapman Flack
On 05/17/21 16:15, Magnus Hagander wrote:
> The row is selected by the combination of username/database/ipaddress.
> But you have to pick the minimum TLS version before the client has
> sent that... Basically we have to make the choice long before we've
> even started looking at pg_hba.

Use the peer IP address to pre-filter the available pg_hba entries to
those pertaining to that address ... choose a min protocol version that's
the min specified among those ... then get the username and database name
(by which point a protocol has been negotiated), then further filter the
list down to those pertaining to that user and database and allowing that
protocol version?

Yes, clunky, but avoids a more ambitious redesign of pg_hba.

I'm not sure a more ambitious redesign would be a bad thing in principle;
the pg_hba.conf syntax seems rather clunky and limiting to begin with,
and I keep wondering why it isn't in shared tables or something. But
I suppose a lot of external admin tools have some knowledge of it?

Regards,
-Chap




Re: PG 14 release notes, first draft

2021-05-17 Thread Bruce Momjian
On Mon, May 17, 2021 at 12:45:44PM -0700, Peter Geoghegan wrote:
> I propose that the B-Tree index item (bottom-up index deletion, "Allow
> btree index additions to remove expired index entries to prevent page
> splits") link back to the documentation for those that want to drill
> down:
> 
> https://www.postgresql.org/docs/14/btree-implementation.html#BTREE-DELETION
> 
> The same approach was used for the Postgres 13 B-Tree deduplication
> release note item (same page from the docs, even).
> 
> I understand that it's kind of hard to get the point of the feature
> across within the confines of the release notes, which have a very
> wide audience. I believe that the practical take-away for more
> experienced users should be something like "significantly ameliorates
> write-amplification problems in indexes with non-HOT updates". But
> that's quite a mouthful. Having a hyperlink to the docs seems like a
> sensible compromise. I really just want experienced users to be able
> to make the connection between this feature and well-publicized
> problems with Postgres index bloat/VACUUM.

Yes, I am going to add links to everything, but haven't had time yet.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Magnus Hagander
On Mon, May 17, 2021 at 10:31 PM Chapman Flack  wrote:
>
> On 05/17/21 16:15, Magnus Hagander wrote:
> > The row is selected by the combination of username/database/ipaddress.
> > But you have to pick the minimum TLS version before the client has
> > sent that... Basically we have to make the choice long before we've
> > even started looking at pg_hba.
>
> Use the peer IP address to pre-filter the available pg_hba entries to
> those pertaining to that address ... choose a min protocol version that's
> the min specified among those ... then get the username and database name
> (by which point a protocol has been negotiated), then further filter the
> list down to those pertaining to that user and database and allowing that
> protocol version?
>
> Yes, clunky, but avoids a more ambitious redesign of pg_hba.

So you're saying that some entries in the parameter section would
depend on the db/user/ip combo and some would depend just on the ip?

That seems like an absolutely terrible idea to me, especially since
this is about security configuration. Way too easy to get wrong by
people who don't know how the internals work. People will *definitely*
set those parameters thinking that they can do it based on the db and
user as well.


> I'm not sure a more ambitious redesign would be a bad thing in principle;
> the pg_hba.conf syntax seems rather clunky and limiting to begin with,
> and I keep wondering why it isn't in shared tables or something. But
> I suppose a lot of external admin tools have some knowledge of it?

I think we'd either need a redesign of that, or a completely different
way of configuring pre-authentication settings.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-17 Thread Andres Freund
Hi,

pg_attribute is one of the biggest tables in a new cluster, and often the
biggest table in production clusters. Its size is also quite relevant in
memory, due to all the TupleDescs we allocate.

I just noticed that the new attcompression increased the size not just
by 1 byte, but by 4, due to padding. While an increase from 112 to 116
bytes isn't the end of the world, it does seem worth considering using
existing unused bytes instead?

If we moved attcompression to all the other bool/char fields, we'd avoid
that size increase, as there's an existing 2 byte hole.

Of course there's the argument that we shouldn't change the column order
for existing SELECT * queries, but the existing placement already breaks
that (the CATALOG_VARLEN columns follow).
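
As a concrete illustration (a standalone toy, not the real
FormData_pg_attribute layout), moving a 1-byte field from the end of a
struct into an existing alignment hole changes sizeof() even though the
set of fields is identical:

#include <stdio.h>

struct attr_new_at_end
{
    int  len_field;          /* 4-byte field */
    char align_flag;         /* two 1-byte fields ... */
    char storage_flag;
    /* 2 bytes of padding here, before the next int */
    int  another_int;
    char compression_flag;   /* appended at the end: +1 byte, padded out to +4 */
};

struct attr_in_hole
{
    int  len_field;
    char align_flag;
    char storage_flag;
    char compression_flag;   /* tucked into the existing padding hole */
    int  another_int;
};

int main(void)
{
    printf("new field at end:  %zu bytes\n", sizeof(struct attr_new_at_end));
    printf("new field in hole: %zu bytes\n", sizeof(struct attr_in_hole));
    return 0;
}

On common platforms this prints 16 vs. 12 bytes, which is the same kind of
saving being discussed for the catalog row.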

Greetings,

Andres Freund




ICU bool problem

2021-05-17 Thread Peter Eisentraut
You might find that ICU 69 (pretty new, see 
http://site.icu-project.org/download/69) will cause compile failures 
with PG 10 (pretty old).  ICU 69 has switched to using stdbool.h, which 
conflicts with the home-made definitions that we used until PG10. 
Compile errors look like this:


pg_collation.c:47:1: error: conflicting types for 'CollationCreate'
   47 | CollationCreate(const char *collname, Oid collnamespace,
  | ^~~
In file included from pg_collation.c:25:
../../../src/include/catalog/pg_collation_fn.h:17:12: note: previous 
declaration of 'CollationCreate' was here

   17 | extern Oid CollationCreate(const char *collname, Oid collnamespace,
  |^~~
pg_collation.c: In function 'CollationCreate':
pg_collation.c:171:41: warning: passing argument 3 of 'heap_form_tuple' 
from incompatible pointer type [-Wincompatible-pointer-types]

  171 |  tup = heap_form_tuple(tupDesc, values, nulls);
  | ^
  | |
  | _Bool *
In file included from pg_collation.c:19:
../../../src/include/access/htup_details.h:802:26: note: expected 'bool 
*' {aka 'char *'} but argument is of type '_Bool *'

  802 | Datum *values, bool *isnull);
  |~~^~

The fix is like what we used to use for plperl back then:

diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index f3e04d4d8c..499ada2b69 100644
--- a/src/include/utils/pg_locale.h
+++ b/src/include/utils/pg_locale.h
@@ -17,6 +17,9 @@
 #endif
 #ifdef USE_ICU
 #include 
+#ifdef bool
+#undef bool
+#endif
 #endif

 #include "utils/guc.h"

I'll prepare a full patch in a bit.




Re: ICU bool problem

2021-05-17 Thread Bruce Momjian
On Mon, May 17, 2021 at 10:56:54PM +0200, Peter Eisentraut wrote:
> The fix is like what we used to use for plperl back then:
> 
> diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
> index f3e04d4d8c..499ada2b69 100644
> --- a/src/include/utils/pg_locale.h
> +++ b/src/include/utils/pg_locale.h
> @@ -17,6 +17,9 @@
>  #endif
>  #ifdef USE_ICU
>  #include 
> +#ifdef bool
> +#undef bool
> +#endif
>  #endif
> 
>  #include "utils/guc.h"
> 
> I'll prepare a full patch in a bit.

Yes, that seems like a good plan.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-17 Thread Justin Pryzby
On Mon, May 17, 2021 at 01:48:03PM -0700, Andres Freund wrote:
> pg_attribute is one of the biggest tables in a new cluster, and often the
> biggest table in production clusters. Its size is also quite relevant in
> memory, due to all the TupleDescs we allocate.
> 
> I just noticed that the new attcompression increased the size not just
> by 1 byte, but by 4, due to padding. While an increase from 112 to 116
> bytes isn't the end of the world, it does seem worth considering using
> existing unused bytes instead?

+1

FYI: attcompression was an OID until a few weeks before the feature was merged,
and there were several issues related to that:
  aa25d1089 - fixed two issues
  226e2be38

-- 
Justin




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-17 Thread Tom Lane
Andres Freund  writes:
> If we moved attcompression to all the other bool/char fields, we'd avoid
> that size increase, as there's an existing 2 byte hole.

+1.  Looks to me like its existing placement was according to the good
old "add new things at the end" anti-pattern.  It certainly isn't
related to the adjacent fields.

Putting it just after attalign seems like a reasonably sane choice
from the standpoint of grouping things affecting physical storage;
and as you say, that wins from the standpoint of using up alignment
padding rather than adding more.

Personally I'd think the most consistent order in that area would
be attbyval, attalign, attstorage, attcompression; but perhaps it's
too late to swap the order of attstorage and attalign.

regards, tom lane




Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Chapman Flack
On 05/17/21 16:35, Magnus Hagander wrote:
> So you're saying that some entries in the parameter section would
> depend on the db/user/ip combo and some would depend just on the ip?

I don't *think* that's what I was saying. What I was thinking was this:

The pg_hba.conf file is an ordered list of entries. Each entry can specify
a (broad or narrow) set of IPs it applies to, a (broad or narrow) set of
databases it applies to, and a (broad or narrow) set of users it applies to.

Also, in this hypothetical, it can specify a min protocol version.

Right now, we're doing something like this:

1. accept an incoming connection, learning the client IP
2. SSLRequest message leads to negotiating TLS
3. StartupMessage supplies the desired database and user name
4. pg_hba entries are consulted once and filtered down to the first one
applicable to the client IP, database, and username (and SSLness)
5. that entry is used for authentication


I suggested only:

Insert step 1½, filter the pg_hba entries down to only those that could
possibly accept a connection from this IP address. This is an improper
subset of the whole list, and an improper superset of the singleton to be
generated later in step 4.

Step 2 still negotiates TLS, but can fail early if the protocol would
be older than the oldest allowed in the pre-filtered list.

Step 4 takes that pre-filtered list and completes the restriction down to
first entry matching the IP, database, and username. This should be the
same singleton it would generate now. But it can fail-fast if that entry
would require a higher protocol version than what's been negotiated,
before sending the corresponding authentication request message, so no
authentication data will be exchanged over a less-secure channel than
intended. However, the user, database name, and options in the Startup
message might have been exposed over a lower TLS version than intended.
Maybe that's not the end of the world?
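
A rough standalone sketch of those steps (invented names and structures,
nothing from hba.c; exact string matching stands in for real CIDR and
keyword handling):

#include <stdio.h>
#include <string.h>

typedef struct
{
    const char *ip;       /* grossly simplified address match */
    const char *db;
    const char *user;     /* "all" acts as a wildcard */
    int         min_tls;  /* e.g. 10 = TLSv1.0, 12 = TLSv1.2 */
} HbaEntry;

static int
matches(const char *pat, const char *val)
{
    return strcmp(pat, "all") == 0 || strcmp(pat, val) == 0;
}

int main(void)
{
    HbaEntry entries[] = {
        { "10.0.0.1", "appdb", "legacy", 10 },  /* old client, allow TLSv1.0 */
        { "all",      "all",   "all",    13 },
    };
    int         n = sizeof(entries) / sizeof(entries[0]);
    const char *client_ip = "10.0.0.1";

    /* step 1.5: prune by address only, and remember the weakest version we
     * might have to accept for the handshake */
    int handshake_min = 99;
    for (int i = 0; i < n; i++)
        if (matches(entries[i].ip, client_ip) && entries[i].min_tls < handshake_min)
            handshake_min = entries[i].min_tls;
    printf("negotiate TLS allowing down to version code %d\n", handshake_min);

    /* step 4: after the startup packet, the first match in the original
     * file order still wins */
    const char *db = "appdb", *user = "legacy";
    int negotiated = 12;            /* whatever the handshake produced */
    for (int i = 0; i < n; i++)
    {
        if (!matches(entries[i].ip, client_ip) ||
            !matches(entries[i].db, db) || !matches(entries[i].user, user))
            continue;
        if (negotiated < entries[i].min_tls)
            printf("entry %d matches but needs a newer TLS: fail fast\n", i);
        else
            printf("entry %d matches and the negotiated TLS is acceptable\n", i);
        break;                      /* ordering principle preserved */
    }
    return 0;
}

The point is only that entries are never reordered: they are pruned, first
by address and later by database/user, and the first surviving match still
decides the outcome.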

Regards,
-Chap




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-17 Thread Andres Freund
Hi,

On 2021-05-17 17:06:32 -0400, Tom Lane wrote:
> Putting it just after attalign seems like a reasonably sane choice
> from the standpoint of grouping things affecting physical storage;
> and as you say, that wins from the standpoint of using up alignment
> padding rather than adding more.

Makes sense to me.


> Personally I'd think the most consistent order in that area would
> be attbyval, attalign, attstorage, attcompression; but perhaps it's
> too late to swap the order of attstorage and attalign.

Given that we've put in new fields in various positions on a fairly
regular basis, I don't think swapping around attalign, attstorage would
cause a meaningful amount of additional pain.  Personally I don't have a
preference for how these are ordered.

Greetings,

Andres Freund




Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Tom Lane
Chapman Flack  writes:
> On 05/17/21 16:35, Magnus Hagander wrote:
>> So you're saying that some entries in the parameter section would
>> depend on the db/user/ip combo and some would depend just on the ip?

> I don't *think* that's what I was saying. What I was thinking was this:
> ...

This seems pretty horrid to me, not only from a complexity standpoint,
but because it would break the principle that pg_hba.conf entries are
applied in order.

On the whole, I'm afraid that this idea is going to create a lot
more problems than it solves.

regards, tom lane




Re: Typo in README.barrier

2021-05-17 Thread David Rowley
On Mon, 17 May 2021 at 17:18, Tatsuo Ishii  wrote:
>
> > On Mon, May 17, 2021 at 09:33:27AM +0900, Tatsuo Ishii wrote:
> >> Me too. Let's backpatch.
> >
> > A README is not directly user-facing, it is here for developers, so I
> > would not really bother with a backpatch.  Now it is not a big deal to
> > do so either, so that's not a -1 from me, more a +0, for "please feel
> > free to do what you think is most adapted".
>
> I think README is similar to code comments. If a code comment is
> wrong, we usually fix to back branches. Why can't we do the same thing
> for README?

Thanks for the votes. Since Michael was on the fence and I was just
leaning over it and Ishii-san was pro-backpatch, I backpatched it.

> > You may want to hold on until 14beta1 is tagged, though.
>
> Of course we can wait till that day but I wonder why.

I imagined that would be a good idea for more risky patches so we
don't break something before a good round of buildfarm testing.
However, since this is just a README, I didn't think it would have
mattered. Maybe there's another reason I'm overlooking?

David




Re: Typo in README.barrier

2021-05-17 Thread Tom Lane
David Rowley  writes:
>>> You may want to hold on until 14beta1 is tagged, though.

>> Of course we can wait till that day but I wonder why.

> I imagined that would be a good idea for more risky patches so we
> don't break something before a good round of buildfarm testing.
> However, since this is just a README, I didn't think it would have
> mattered. Maybe there's another reason I'm overlooking?

Generally it's considered poor form to push any inessential patches
during a release window (which I'd define roughly as 48 hours before
the wrap till after the tag is applied).  It complicates the picture
for the final round of buildfarm testing, and if we have to do a
re-wrap then we're faced with the question of whether to back out
the patch.

In this case, it being just a README, I agree there's no harm done.
But we've been burnt by "low risk" patches before, so I'd tend to
err on the side of caution during a release window.

regards, tom lane




Re: Always bump PG_CONTROL_VERSION?

2021-05-17 Thread Andres Freund
Hi,

On 2021-05-13 17:42:52 -0400, Jan Wieck wrote:
> Also, can someone give me a good reason NOT to bump the version?

There's several types of tools (particularly around backup) that need to
parse control files. Unnecessarily increasing the numbers of versions
that need to be dealt with makes that a bit harder.

Greetings,

Andres Freund




Re: Typo in README.barrier

2021-05-17 Thread Michael Paquier
On Mon, May 17, 2021 at 06:37:56PM -0400, Tom Lane wrote:
> Generally it's considered poor form to push any inessential patches
> during a release window (which I'd define roughly as 48 hours before
> the wrap till after the tag is applied).  It complicates the picture
> for the final round of buildfarm testing, and if we have to do a
> re-wrap then we're faced with the question of whether to back out
> the patch.
> 
> In this case, it being just a README, I agree there's no harm done.
> But we've been burnt by "low risk" patches before, so I'd tend to
> err on the side of caution during a release window.

Yes, I've had this experience once in the past.  So I tend to just
wait until the tag is pushed as long as it is not critical for the
release.
--
Michael


signature.asc
Description: PGP signature


Re: pg_dumpall misses --no-toast-compression

2021-05-17 Thread Michael Paquier
On Mon, May 17, 2021 at 10:38:15AM -0400, Tom Lane wrote:
> Daniel Gustafsson  writes:
>> I think there is a reasonable case to be made for this fixing an oversight in
>> bbe0a81db69bd10bd166907c3701492a29aca294 as opposed to adding a brand new
>> feature.  Save for --no-synchronized-snapshots all --no-xxx options in 
>> pg_dump
>> are mirrored in pg_dumpall.
> 
> +1, seems more like fixing an oversight than anything else.

Okay, thanks.  I don't mind taking care of that on HEAD once beta1 is
shipped, then.
--
Michael


signature.asc
Description: PGP signature


Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Peter Smith
On Mon, May 17, 2021 at 8:13 PM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
> >
> > On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
> >
> > > PSA a patch adding a test for this scenario.
> >
> > I am not sure this test case is exactly targeting the problematic
> > behavior because that will depend upon the order of execution of the
> > apply workers right?
> >
>
> Yeah, so we can't guarantee that this test will always reproduce the
> problem but OTOH, I have tried two times and it reproduced both times.
> I guess we don't have a similar test where Truncate will replicate to
> two subscriptions, otherwise, we would have caught such a problem.
> Having said that, I am fine with leaving this test if others feel so
> on the grounds that it won't always lead to the problem reported.
>

If there is any concern that the problem won't always happen then I
think we should just increase the number of subscriptions.

Having more simultaneous subscriptions (e.g. I have tried 4) will
make it much more likely for the test to encounter the deadlock, and
it probably would also be quite a useful worker stress test in its
own right.

~~

Also, should this test be in 010_truncate.pl, or does it belong in
100_bugs.pl?  (I don't know what the rules are for when a test
gets put into 100_bugs.pl.)

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: wal stats questions

2021-05-17 Thread Masahiro Ikeda


On 2021/05/17 22:34, Fujii Masao wrote:
> 
> 
> On 2021/05/17 16:39, Masahiro Ikeda wrote:
>>
>> Thanks for your comments!
>>
 + * is executed, wal records aren't nomally generated (although HOT 
 makes
>>>
>>> nomally -> normally?
>>
>> Yes, fixed.
>>
 + * It's not enough to check the number of generated wal records, for
 + * example the walwriter may write/sync the WAL although it doesn't
>>>
>>> You use both 'wal' and 'WAL' in the comments, but are there any intension?
>>
>> No, I changed to use 'WAL' only. Although some other comments have 'wal' and
>> 'WAL', it seems that 'WAL' is often used.
> 
> Thanks for updating the patch!

Thanks a lot for the comments!

> + * Buffer and generated WAL usage counters.
> + *
> + * The counters are accumulated values. There are infrastructures
> + * to add the values and calculate the difference within a specific period.
> 
> Is it really worth adding these comments here?

BufferUsage has almost the same comments, so I removed it.

> + * Note: regarding to WAL statistics counters, it's not enough to check
> + * only whether the WAL record is generated or not. If a read transaction
> + * is executed, WAL records aren't normally generated (although HOT makes
> + * WAL records). But, just writes and syncs the WAL data may happen when
> + * to flush buffers.
> 
> Aren't the following comments better?
> 
> --
> To determine whether any WAL activity has occurred since last time, not only
> the number of generated WAL records but also the numbers of WAL writes and
> syncs need to be checked. Because even transaction that generates no WAL
> records can write or sync WAL data when flushing the data pages.
> --

Thanks. Yes, I fixed it.

> - * This function can be called even if nothing at all has happened. In
> - * this case, avoid sending a completely empty message to the stats
> - * collector.
> 
> I think that it's better to leave this comment as it is.

OK. I leave it.

> + * First, to check the WAL stats counters were updated.
> + *
> + * Even if the 'force' is true, we don't need to send the stats if the
> + * counters were not updated.
> + *
> + * We can know whether the counters were updated or not to check only
> + * three counters. So, for performance, we don't allocate another memory
> + * spaces and check the all stats like pgstat_send_slru().
> 
> Is it really worth adding these comments here?

I removed them because the following comments are enough.

* This function can be called even if nothing at all has happened. In
* this case, avoid sending a completely empty message to the stats
* collector.

> + * The current 'wal_records' is the same as the previous one means that
> + * WAL records weren't generated. So, the counters of 'wal_fpi',
> + * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
> + *
> + * It's not enough to check the number of generated WAL records, for
> + * example the walwriter may write/sync the WAL although it doesn't
> + * generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means 
> the
> + * counters of time spent are zero too.
> 
> Aren't the following comments better?
> 
> 
> Check wal_records counter to determine whether any WAL activity has happened
> since last time. Note that other WalUsage counters don't need to be checked
> because they are incremented always together with wal_records counter.
> 
> m_wal_buffers_full also doesn't need to be checked because it's incremented
> only when at least one WAL record is generated (i.e., wal_records counter is
> incremented). But for safely, we assert that m_wal_buffers_full is always zero
> when no WAL record is generated
> 
> This function can be called by a process like walwriter that normally
> generates no WAL records. To determine whether any WAL activity has happened
> at that process since the last time, the numbers of WAL writes and syncs are
> also checked.
> 

Yes, I modified them.

> + * The accumulated counters for buffer usage.
> + *
> + * The reason the counters are accumulated values is to avoid unexpected
> + * reset because many callers may use them.
> 
> Aren't the following comments better?
> 
> 
> These counters keep being incremented infinitely, i.e., must never be reset to
> zero, so that we can calculate how much the counters are incremented in an
> arbitrary period.
> 

Yes, thanks.
I fixed it.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-	

Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-17 Thread Kyotaro Horiguchi
At Mon, 17 May 2021 23:29:18 +0900, Fujii Masao  
wrote in 
> If a promotion is triggered while recovery is paused, the paused state
> ends
> and promotion continues. But currently pg_get_wal_replay_pause_state()
> returns 'paused' in that case. Isn't this a bug?
> 
> Attached patch fixes this issue by resetting the recovery pause state
> to
> 'not paused' when standby promotion is triggered.
> 
> Thought?

Nice catch!

Once the state enters the "paused" state, no more WAL records are expected
to be replayed until exiting the state. I'm not sure, but maybe we are
also expecting that the server promotes without a record being replayed when
triggered while pausing.  However, actually there's a chance for a
record to be replayed before promotion.  Of course it is existing
behavior but I'd like to make sure whether we deliberately allow that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Chapman Flack
On 05/17/21 17:55, Tom Lane wrote:
> This seems pretty horrid to me, not only from a complexity standpoint,
> but because it would break the principle that pg_hba.conf entries are
> applied in order.

This makes twice in a row that I've failed to see how.

If you go through the entries, in order, and simply prune from the list
the ones you can already prove would never apply to this connection, how
does that break the ordering principle?

Regards,
-Chap




Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Peter Smith
On Mon, May 17, 2021 at 6:47 PM Amit Kapila  wrote:
>
> On Mon, May 17, 2021 at 12:30 PM Peter Smith  wrote:
> >
[...]
> > The essence of the trouble seems to be that the apply_handle_truncate
> > function never anticipated it may end up truncating the same table
> > from 2 separate workers (subscriptions) like this test case is doing.
> > Probably this is quite an old problem because the
> > apply_handle_truncate code has not changed much for a long time.
> >
>
> Yeah, have you checked it in the back branches?
>

Yes, the apply_handle_truncate function was introduced in April/2018 [1].

REL_11_0 was tagged in Oct/2018.

The "ERROR:  deadlock detected" log is reproducible in PG 11.0.

--
[1] 
https://github.com/postgres/postgres/commit/039eb6e92f20499ac36cc74f8a5cef7430b706f6

Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Move pg_attribute.attcompression to earlier in struct for reduced size?

2021-05-17 Thread Michael Paquier
On Mon, May 17, 2021 at 02:28:57PM -0700, Andres Freund wrote:
> On 2021-05-17 17:06:32 -0400, Tom Lane wrote:
>> Putting it just after attalign seems like a reasonably sane choice
>> from the standpoint of grouping things affecting physical storage;
>> and as you say, that wins from the standpoint of using up alignment
>> padding rather than adding more.
> 
> Makes sense to me.

+1.

>> Personally I'd think the most consistent order in that area would
>> be attbyval, attalign, attstorage, attcompression; but perhaps it's
>> too late to swap the order of attstorage and attalign.
> 
> Given that we've put in new fields in various positions on a fairly
> regular basis, I don't think swapping around attalign, attstorage would
> cause a meaningful amount of additional pain.  Personally I don't have a
> preference for how these are ordered.

If you switch attcompression, I'd say to go for the others while on
it.  It would not be the first time in history there is a catalog
version bump between betas.
--
Michael


signature.asc
Description: PGP signature


RE: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-05-17 Thread Pengchengliu
Hi Greg,
  > I actually think that the Assert in SubTransGetTopmostTransaction() is 
correct, but in the parallel-worker case, the snapshots are not being setup 
correctly.
  
   I agree with you that the Assert in SubTransGetTopmostTransaction() is correct.
The root cause is that the transaction Xmin is not being set up correctly in the
parallel worker.

   Actually, I am very confused about ActiveSnapshot and TransactionSnapshot. I
don't know why the main process sends the ActiveSnapshot and the
TransactionSnapshot separately.  And what is the exact difference between them?
   If you know, could you explain that for me? It would be very much
appreciated.
   Before we understand them exactly, I think we should not change the
TransactionSnapshot to the ActiveSnapshot in the main process. If we do that,
the main process should send only the ActiveSnapshot.

Thanks
Pengcheng
   
   

-Original Message-
From: Greg Nancarrow  
Sent: 2021年5月17日 20:59
To: 刘鹏程 
Cc: Andres Freund ; PostgreSQL-development 

Subject: Re: Re: Parallel scan with SubTransGetTopmostTransaction assert 
coredump

On Sat, May 15, 2021 at 12:37 PM 刘鹏程  wrote:
>
>
>   BTW, I tested it on a high performance server. It is very easily
> reproduced. My colleague and I used different environments and both could
> reproduce it.
>

Hi Pengcheng,

Although the issue won't reproduce easily in my system, I can certainly see 
how, for the snapshots used in the parallel worker case, the Active snapshot 
used is potentially an earlier snapshot than the Transaction snapshot. I don't
know why it is getting a newer Transaction snapshot in InitializeParallelDSM(), 
when it has previously pushed the return value of GetTransactionSnapshot() as 
the Active snapshot.

So I too hope Tom or Robert can explain what is going on here and how to 
resolve it (as you requested them to, in your other post).

I actually think that the Assert in SubTransGetTopmostTransaction() is correct, 
but in the parallel-worker case, the snapshots are not being setup correctly.

Can you try the trivial change below and see if it prevents the coredump?

Regards,
Greg Nancarrow
Fujitsu Australia


diff --git a/src/backend/access/transam/parallel.c
b/src/backend/access/transam/parallel.c
index 14a8690019..870889053f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -216,7 +216,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
  int i;
  FixedParallelState *fps;
  dsm_handle session_dsm_handle = DSM_HANDLE_INVALID;
- Snapshot transaction_snapshot = GetTransactionSnapshot();
+ Snapshot transaction_snapshot = GetActiveSnapshot();
  Snapshot active_snapshot = GetActiveSnapshot();

  /* We might be running in a very short-lived memory context. */




Re: Skip partition tuple routing with constant partition key

2021-05-17 Thread David Rowley
On Tue, 18 May 2021 at 01:31, Amit Langote  wrote:
> Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For
testing purposes does not seem to be quite a good enough reason.

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call.  I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time.   Since RANGE requires
a binary search there might be some savings there.  I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types.  LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

IIRC, the patch did something like call ExecPartitionCheck() on the
new tuple with the previously routed to ResultRelInfo. I think the
last used partition was cached somewhere like relcache (which seems a
bit questionable).   Likely this would speed up the example case here
a bit. Not as much as the proposed patch, but it would likely apply in
many more cases.
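
A toy standalone sketch of that shape (invented names, not
ExecFindPartition() itself): remember the index of the last chosen range
partition and re-check its bounds first, which stands in for calling
ExecPartitionCheck() on the cached ResultRelInfo, before falling back to
the binary search:

#include <stdio.h>

static const int bounds[] = { 0, 100, 200, 300, 400 };  /* four range partitions */
#define NPARTS 4

static int last_part = -1;      /* cached result of the previous call */

static int
in_partition(int p, int key)
{
    return key >= bounds[p] && key < bounds[p + 1];
}

static int
find_partition(int key)
{
    /* fast path: same partition as last time? */
    if (last_part >= 0 && in_partition(last_part, key))
        return last_part;

    /* slow path: binary search over the range bounds */
    int lo = 0, hi = NPARTS - 1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (in_partition(mid, key))
        {
            last_part = mid;
            return mid;
        }
        if (key < bounds[mid])
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;                  /* no partition accepts this key */
}

int main(void)
{
    int keys[] = { 105, 110, 199, 205, 207 };
    for (int i = 0; i < 5; i++)
        printf("key %d -> partition %d\n", keys[i], find_partition(keys[i]));
    return 0;
}

For monotonically increasing keys the binary search is skipped on almost
every call, while a row that lands in a different partition only pays one
extra bound check.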

I don't think I ever posted the patch to the list, and if so I no
longer have access to it, so it would need to be done again.

David




Re: allow specifying direct role membership in pg_hba.conf

2021-05-17 Thread Chapman Flack
On 05/17/21 21:19, Chapman Flack wrote:
> This makes twice in a row that I've failed to see how.
> 
> If you go through the entries, in order, and simply prune from the list
> the ones you can already prove would never apply to this connection, how
> does that break the ordering principle?


Ok, I see how what I proposed looks out-of-order just in that it lets the
initial TLS negotiation be influenced by the minimum version over all
potentially-applicable entries.

But that's just an optimization anyway. The same ultimate effect would be
achieved by unconditionally allowing anything back to TLSv1 to be negotiated
at SSLRequest time, and then (processing the entries in order as always)
rejecting the connection if the first one that could apply expects a higher
protocol version.

The pre-scan and use of the minimum version encountered has only the effect
of fast-failing a TLS negotiation for a version that won't possibly succeed.

Regards,
-Chap




Re: Rewriting the test of pg_upgrade as a TAP test - take three - remastered set

2021-05-17 Thread Michael Paquier
On Mon, May 17, 2021 at 12:32:13PM -0400, Andrew Dunstan wrote:
> On 5/16/21 9:55 PM, Michael Paquier wrote:
> Yes, I'm going to be proposing a series of smallish patches including
> these when the tree is branched (which I hope will be in a few weeks).

Thanks!  That clearly needs to happen first.  I'll help reviewing
these.

>> If we do that, then it should be possible to reduce the code footprint
>> in the buildfarm code, while still allowing people to test major
>> upgrades in the same old-fashioned way, right?  That's assuming that
>> PostgresNode is made compatible down to 9.2, of course, as a first
>> step, as that's the range of the dumps you are keeping around for the
>> buildfarm.
> 
> I'm intending to add some older dumps. :-) But for now 9.2 is a good target.

Makes sense.  For now, I'll update this patch set so that it is possible
to use custom dumps, as an option in parallel to pg_regress when
specifying a different source code path.  I'll also decouple the
business with probin updates and stick with the approach used by the
buildfarm code.
--
Michael


signature.asc
Description: PGP signature


Re: What is lurking in the shadows?

2021-05-17 Thread Peter Smith
On Fri, May 14, 2021 at 11:16 AM David Rowley  wrote:
>
> On Fri, 14 May 2021 at 12:00, Peter Smith  wrote:
> > That bug led me to wonder if similar problems might be going
> > undetected elsewhere in the code. There is a gcc compiler option [3]
> > -Wshadow which informs about the similar scenario where one variable
> > is "shadowing" another (e.g. redeclaring a variable with the same name
> > as one at an outer scope).
>
> > For now, I am not sure how to proceed with this information. Hence this 
> > post...
>
> I'm inclined to think that since a bug has already been found due to a
> local variable shadowing a global one that it would be good to review
> these and then consider if it's worth doing any renaming. I think the
> process of looking at each warning individually will allow us to
> determine if; a) there are any bugs, or; b) if it's worth doing any
> renaming.
>

Hi David, Michael - Thanks for your replies.

Yeah, I would like to work my way through all of these warnings in my
spare time and report back to this thread (after 1-2 months?) with a
detailed analysis.

After that it should become much clearer what / if any action should
be taken next.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Skip partition tuple routing with constant partition key

2021-05-17 Thread houzj.f...@fujitsu.com
> > Hmm, does this seem common enough for the added complexity to be
> worthwhile?
> 
> I'd also like to know if there's some genuine use case for this. For testing
> purposes does not seem to be quite a good enough reason.

Thanks for the response.

For some big data scenarios, we sometimes transfer data from one table (which
only stores non-expired data) to another table (historical data) for future
analysis.
In this case, we import data into the historical table regularly (perhaps once
a day or every half day), and the data is likely to be imported with a date
label specified, so all of the data imported in one batch belongs to the same
partition of a table partitioned by time range.

So, personally, it would be nice if Postgres could skip tuple routing for each
row in this scenario.

> A slightly different optimization that I have considered and even written
> patches before was to have ExecFindPartition() cache the last routed to
> partition and have it check if the new row can go into that one on the next 
> call.
> I imagined there might be a use case for speeding that up for RANGE
> partitioned tables since it seems fairly likely that most use cases, at least 
> for
> time series ranges will
> always hit the same partition most of the time.   Since RANGE requires
> a binary search there might be some savings there.  I imagine that
> optimisation would never be useful for HASH partitioning since it seems most
> likely that we'll be routing to a different partition each time and wouldn't 
> save
> much since routing to hash partitions are cheaper than other types.  LIST
> partitioning I'm not so sure about. It seems much less likely than RANGE to 
> hit
> the same partition twice in a row.

I think your approach looks good too,
and it does not seem to conflict with the approach proposed here.

Best regards,
houzj


Re: Performance degradation of REFRESH MATERIALIZED VIEW

2021-05-17 Thread Masahiko Sawada
On Wed, May 12, 2021 at 2:32 AM Tomas Vondra
 wrote:
>
>
>
> On 5/11/21 5:56 PM, Masahiko Sawada wrote:
> > On Tue, May 11, 2021 at 11:07 PM Tomas Vondra
> >  wrote:
> >>
> >> On 5/11/21 11:04 AM, Masahiko Sawada wrote:
> >>> On Tue, May 11, 2021 at 4:37 PM Michael Paquier  
> >>> wrote:
> 
>  On Wed, May 05, 2021 at 03:04:53PM +0200, Tomas Vondra wrote:
> > Thanks, that looks promising. I repeated the tests I did on 26/4, and 
> > the
> > results look like this:
> >
> > old (0c7d3bb99): 497ms
> > master:  621ms
> > patched: 531ms
> >
> > So yeah, that's a bit improvement - it does not remove the regression
> > entirely, but +5% is much better than +25%.
> 
>  Hmm.  Is that really something we should do after feature freeze?  A
>  25% degradation for matview refresh may be a problem for a lot of
>  users and could be an upgrade stopper.  Another thing we could do is
>  also to revert 7db0cd2 and 39b66a9 from the v14 tree, and work on a
>  proper solution for this performance problem for matviews for 15~.
> >>>
> >>> I think the approach proposed by Andres eliminates the extra vmbuffer
> >>> reads as much as possible. But even with the patch, there still is 5%
> >>> degradation (and there is no way to disable inserting frozen tuples at
> >>> matview refresh). Which could be a problem for some users. I think
> >>> it’s hard to completely eliminate the overhead so we might need to
> >>> consider another approach like having matview refresh use
> >>> heap_multi_insert() instead of heap_insert().
> >>>
> >>
> >> I think it's way too late to make such significant change (switching to
> >> heap_multi_insert) for v14 :-(
> >
> > Right.
> >
> >> Moreover, I doubt it affects just matview
> >> refresh - why wouldn't it affect other similar use cases? More likely
> >> it's just the case that was discovered.
> >
> > I've not tested yet but I guess COPY FROM …  FREEZE using heap_insert
> > would similarly be affected since it also uses heap_insert() with
> > TABLE_INSERT_FROZEN.
> >
>
> I'd say that's somewhat acceptable, as it's a trade-off between paying a
> bit of time during COPY vs. paying much more later (when freezing the
> rows eventually).
>
>  From my POV the problem here is we've not asked to freeze the rows
> (unless I'm missing something and REFRESH freezes them?), but it's still
> a bit slower. However, 5% might also be just noise due to changes in
> layout of the binary.
>
> >>
> >>> I think the changes for heap_multi_insert() are fine so we can revert
> >>> only heap_insert() part if we revert something from the v14 tree,
> >>> although we will end up not inserting frozen tuples into toast tables.
> >>>
> >>
> >> I'd be somewhat unhappy about reverting just this bit, because it'd mean
> >> that we freeze rows in the main table but not rows in the TOAST tables
> >> (that was kinda why we concluded we need the heap_insert part too).
> >>
> >> I'm still a bit puzzled where does the extra overhead (in cases when
> >> freeze is not requested) come from, TBH.
> >
> > Which cases do you mean? Doesn't matview refresh always request to
> > freeze tuples even after applying the patch proposed on this thread?
> >
>
> Oh, I didn't realize that! That'd make this much less of an issue, I'd
> say, because if we're intentionally freezing the rows it's reasonable to
> pay a bit of time (in exchange for not having to do it later). The
> original +25% was a bit too much, of course, but +5% seems reasonable.

Yes. It depends on how much slower the matview refresh gets, but I
think the problem here is that users are always forced to pay the cost
of freezing tuples while refreshing the matview. There is no way to
disable it, unlike the FREEZE option of the COPY command.

I’ve done benchmarks for matview refresh on my machine (FreeBSD 12.1,
AMD Ryzen 5 PRO 3400GE, 24GB RAM) with four codes: HEAD, HEAD +
Andres’s patch, one before 39b66a91b, and HEAD without
TABLE_INSERT_FROZEN.

The workload is to refresh the matview that simply selects 50M tuples
(about 1.7 GB). Here are the average execution times of three trials
for each code:

1) head: 42.263 sec
2) head w/ Andres’s patch: 40.194 sec
3) before 39b66a91b commit: 38.143 sec
4) head w/o freezing tuples: 32.413 sec

I also observed 5% degradation by comparing 1 and 2 but am not sure
where the overhead came from. I agree with Andres’s proposal. It’s a
straightforward approach, and I think it’s a reasonable degradation
compared to the cost of freezing tuples later. But I’m a bit concerned
about whether it’s reasonable to force all users to pay the cost
during matview refresh without any choice. So we need to find the
remaining differences after applying a polished version of the patch.

FYI I’ve attached flame graphs for each evaluation. Looking at
1_head.svg, we can see the CPU spent much time in visibilitymap_pin() and
it disappeared in 2_head_w_Andreas_patch.svg. There is no big
difference at a glance between

Re: Addition of authenticated ID to pg_stat_activity

2021-05-17 Thread Michael Paquier
On Mon, May 17, 2021 at 10:28:49AM +0200, Magnus Hagander wrote:
> On Mon, May 17, 2021 at 6:35 AM Michael Paquier  wrote:
>> Not sure if we would be able to agree on something here, but the
>> barrier to what a session and a connection hold is thin when it comes
>> to roles and application_name.  Thinking more about that, I would be
>> really tempted to get to do a more straight split with data that's
>> associated to a session, to a transaction and to a connection, say:
>> 1) pg_stat_session, data that may change.
>> - PID
>> - leader PID
>> - the role name
>> - role ID
>> - application_name
>> - wait_event_type
>> - wait_event
>> 2) pg_stat_connection, static data associated to a connection.
>> - PID
>> - database name
>> - database OID
>> - client_addr
>> - client_hostname
>> - client_port
>> - backend_start
>> - authn ID
>> - backend_type
>> 3) pg_stat_transaction, or pg_stat_activity, for the transactional
>> activity.
>> - PID
>> - xact_start
>> - query_start
>> - backend_xid
>> - state_change
>> - query string
>> - query ID
>> - state
> 
> This seems extremely user-unfriendly to me.
> 
> I mean. Timestamps are now split out between different views so you
> can't track the process without them. And surely wait_event info is
> *extremely* related to things like what query is running and what
> state the session is in. And putting backend_type off in a separate
> view means most queries are going to have to join that in anyway. Or
> include it in all views. And if we're forcing the majority of queries
> to join multiple views, what have we actually gained?
> 
> Based on your list above, I'd definitely want at least (1) and (2) to
> be in the same one, but they'd have to also gain at least the database
> oid/name and backend_type, and maybe also backend_start.

Okay.

> So basically, it would be moving out client_*, and authn_id.

So that would mean the addition of one new catalog view, called
pg_stat_connection, with the following fields:
- PID
- all three client_*
- authn ID
I can live with this split.  Thoughts from others?

> If we're
> doing that then as you say maybe pg_stat_connection is a good name and
> could then *also* gain the information that's currently in the ssl and
> gss views for a net simplification.

I am less enthusiastic about this addition.  SSL and GSSAPI have no
fields in common, so that would bloat the view for connection data
with mostly NULL fields most of the time.

> tld;dr; I think we have to be really careful here or the cure is going
> to be way worse than the disease.

Agreed.

>> Or I could just drop a new function that fetches the authn ID for a
>> given PID, even if this makes things potentially less consistent when
>> it comes to the lookup of PgBackendStatus, guarantee given now by
>> pg_stat_get_activity().
> 
> Well, the authnid will never change so I'm not sure the consistency
> part is a big problem? Or maybe I'm misunderstanding the risk you're
> referring to?

I just mean to keep the consistency we have now with one single call
of pg_stat_get_activity() for each catalog view, so that we still fetch
a consistent copy of all PgBackendStatus entries once in this code
path.
--
Michael


signature.asc
Description: PGP signature


Re: What is lurking in the shadows?

2021-05-17 Thread Michael Paquier
On Tue, May 18, 2021 at 12:08:57PM +1000, Peter Smith wrote:
> Yeah, I would like to work my way through all of these warnings in my
> spare time and report back to this thread (after 1-2 months?) with a
> detailed analysis.

The next commit fest is at the beginning of July, so there are a
couple of weeks of margin here.

> After that it should become much clearer what / if any action should
> be taken next.

If you can dive into the details, that would be nice!  My take would
be to look first at the local-local conflicts and rename all the
variables that conflict so that any backpatch done in the changed areas
causes a compilation failure.  Some of the global-local conflicts are
simple enough to solve, these could go second.  Each of them requires
a case-by-case lookup, of course.
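
For anyone unfamiliar with the warning, a trivial illustration of a
global-local conflict that -Wshadow reports (toy code, not from the tree):

#include <stdio.h>

int counter = 0;                 /* file-scope variable */

static void
bump(void)
{
    int counter = 5;             /* -Wshadow: this shadows the global */

    counter++;                   /* updates the local, not the global */
    (void) counter;
}

int main(void)
{
    bump();
    printf("global counter is still %d\n", counter);    /* prints 0 */
    return 0;
}

The bug class is exactly this: code that thinks it is updating the outer
variable while it is only touching the shadowing one.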
--
Michael


signature.asc
Description: PGP signature


Re: What is lurking in the shadows?

2021-05-17 Thread David Rowley
On Tue, 18 May 2021 at 14:09, Peter Smith  wrote:
> Yeah, I would like to work my way through all of these warnings in my
> spare time and report back to this thread (after 1-2 months?) with a
> detailed analysis.

I'd recommend for any patches that they come in bite-sized chunks. A
committer is going to have to re-review each change. For me
personally, I'll probably run for the hills if I see a patch that
renames 200 variables.

I'd think about a dozen would be good. Starting with ones that are
least likely to raise objection also seems like a good idea.  That way
you'll have an idea if you want to trouble yourself with the more
questionable ones when the less questionable ones raised too many
questions.  Like I mentioned, start with ones like i, buf, tmp, lc. If
those are accepted then move on to the more difficult ones.  Unless
you discover bugs, then there's not really any urgency to fix these.
Doing it in bite-sized chunks is less likely going to cause
frustration for you if some of the work is rejected after you've gone
to all the trouble.

Also, going by what's mentioned in [1], in particular [2], I'm not so
certain that these changes will be received well by everyone.  So I
recommend just taking it slow.

David

[1] 
https://www.postgresql.org/message-id/flat/877k1psmpf.fsf%40mailbox.samurai.com
[2] https://www.postgresql.org/message-id/22920.1069708226%40sss.pgh.pa.us




Re: Skip partition tuple routing with constant partition key

2021-05-17 Thread Michael Paquier
On Tue, May 18, 2021 at 01:27:48PM +1200, David Rowley wrote:
> A slightly different optimization that I have considered and even
> written patches before was to have ExecFindPartition() cache the last
> routed to partition and have it check if the new row can go into that
> one on the next call.  I imagined there might be a use case for
> speeding that up for RANGE partitioned tables since it seems fairly
> likely that most use cases, at least for time series ranges will
> always hit the same partition most of the time.   Since RANGE requires
> a binary search there might be some savings there.  I imagine that
> optimisation would never be useful for HASH partitioning since it
> seems most likely that we'll be routing to a different partition each
> time and wouldn't save much since routing to hash partitions are
> cheaper than other types.  LIST partitioning I'm not so sure about. It
> seems much less likely than RANGE to hit the same partition twice in a
> row.

It depends a lot on the schema used and the load pattern, but I'd like
to think that a similar argument can be made in favor of LIST
partitioning here.
--
Michael


signature.asc
Description: PGP signature


Re: Always bump PG_CONTROL_VERSION?

2021-05-17 Thread Michael Paquier
On Mon, May 17, 2021 at 03:47:01PM -0700, Andres Freund wrote:
> There's several types of tools (particularly around backup) that need to
> parse control files. Unnecessarily increasing the numbers of versions
> that need to be dealt with makes that a bit harder.

I am digressing here, sorry for that..

But it is worth noting that it is fun to debug issues where a user
changes the system locale and breaks some logic that parses the output
of pg_controldata because of the translations of the text fields.
I've really wondered over the years whether there should be more
switches to pick up only the field values, for the popular ones like
TLIs for example.  pg_config does that.
--
Michael


signature.asc
Description: PGP signature


Re: subscriptioncheck failure

2021-05-17 Thread Amit Kapila
On Mon, May 17, 2021 at 5:48 PM vignesh C  wrote:
>
> On Mon, May 17, 2021 at 10:40 AM Amit Kapila  wrote:
> >
> > One more point:
> > + $node_publisher->wait_for_catchup('tap_sub');
> > +
> > + # Ensure a transactional logical decoding message shows up on the slot
> > + $node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION tap_sub 
> > DISABLE");
> > +
> > + # wait for the replication connection to drop from the publisher
> > + $node_publisher->poll_query_until('postgres',
> > + "SELECT COUNT(*) FROM pg_catalog.pg_replication_slots WHERE
> > slot_name = 'tap_sub' AND active='f'", 1);
> >
> > In the above sequence, wait_for_catchup will query pg_stat_replication
> > whereas after disabling subscription we are checking
> > pg_replication_slots. I understand from your explanation why we can't
> > rely on pg_stat_replication after DISABLE but it might be better to
> > check that the slot is active before disabling it. I think currently,
> > the test assumes that, isn't it better to have an explicit check for
> > that?
>
> I felt this is not required, wait_for_catchup will poll_query_until
> the state = 'streaming', even if START_REPLICATION takes time, state
> will be in 'startup' state, this way poll_query_until will take care
> of handling this.
>

makes sense, but let's add some comments to clarify the same.

> On further analysis I found that we need to do "Alter subscription
> tap_sub ENABLE" and "ALTER subscription tap_sub DISABLE" multiple
> times. Instead we can change pg_logical_slot_peek_binary_changes to
> pg_logical_slot_get_binary_changes at appropriate steps. This way the
> common function can be removed and the enable/disable multiple times
> can be removed.
>

I think that is a valid point. This was probably kept so that we can
peek multiple times for the same message to test various things but
that can be achieved with the way you have changed the test.

One more thing, shouldn't we make autovacuum = off for this test by
using 'append_conf' before starting the publisher? That will avoid the
risk of empty transactions.

-- 
With Regards,
Amit Kapila.




Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-17 Thread Fujii Masao




On 2021/05/18 9:58, Kyotaro Horiguchi wrote:

> At Mon, 17 May 2021 23:29:18 +0900, Fujii Masao  
> wrote in 
>> If a promotion is triggered while recovery is paused, the paused state
>> ends
>> and promotion continues. But currently pg_get_wal_replay_pause_state()
>> returns 'paused' in that case. Isn't this a bug?
>>
>> Attached patch fixes this issue by resetting the recovery pause state
>> to
>> 'not paused' when standby promotion is triggered.
>>
>> Thought?
>
> Nice catch!
>
> Once the state enters the "paused" state, no more WAL records are expected
> to be replayed until exiting the state. I'm not sure, but maybe we are
> also expecting that the server promotes without a record being replayed when
> triggered while pausing.


Currently a promotion causes all available WAL to be replayed before
a standby becomes a primary whether it was in paused state or not.
OTOH, something like immediate promotion (i.e., standby becomes
a primary without replaying outstanding WAL) might be useful for
some cases. I don't object to that.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Bug in query rewriter - hasModifyingCTE not getting set

2021-05-17 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> In view of this, maybe the right thing is to disallow modifying CTEs
> in rule actions in the first place.  I see we already do that for
> views (i.e. ON SELECT rules), but they're not really any safer in
> other types of rules.

You meant by views something like the following, didn't you?

postgres=# create view myview as with t as (delete from b) select * from a;
ERROR:  views must not contain data-modifying statements in WITH

OTOH, the examples Greg-san showed do not contain CTE in the rule action, but 
in the query that the rule is applied to.  So, I think the solution would be 
something different.


>  Given that non-SELECT rules are an undertested
> legacy thing, I'm not that excited about moving mountains to make
> this case possible.

> That semantic issue doesn't get any less pressing just because the query
> was generated by rewrite.  So I now think that what we have to do is
> throw an error if we have a modifying CTE and sub_action is different
> from rule_action.  Not quite sure how to phrase the error though.

So, how about just throwing an error when the original query (not the rule 
action) has a data-modifying CTE?  The error message would be something like "a 
query containing a data-modifying CTE cannot be executed because there is some 
rule applicable to the relation".  This may be overkill and too many regression 
tests might fail, so we may have to add some condition to determine if we error 
out.

Or, I thought Greg-san's patch would suffice.  What problem do you see in it?

I couldn't imagine what the "mountains" are.  Could you tell me what that refers to?


Regards
Takayuki Tsunakawa





Re: "ERROR: deadlock detected" when replicating TRUNCATE

2021-05-17 Thread Amit Kapila
On Tue, May 18, 2021 at 6:19 AM Peter Smith  wrote:
>
> On Mon, May 17, 2021 at 8:13 PM Amit Kapila  wrote:
> >
> > On Mon, May 17, 2021 at 3:05 PM Dilip Kumar  wrote:
> > >
> > > On Mon, May 17, 2021 at 12:30 PM Peter Smith  
> > > wrote:
> > >
> > > > PSA a patch adding a test for this scenario.
> > >
> > > I am not sure this test case is exactly targeting the problematic
> > > behavior because that depends upon the order of execution of the
> > > apply workers, right?
> > >
> >
> > Yeah, so we can't guarantee that this test will always reproduce the
> > problem, but OTOH, I have tried it twice and it reproduced both times.
> > I guess we don't have a similar test where Truncate will replicate to
> > two subscriptions; otherwise, we would have caught such a problem.
> > Having said that, I am fine with leaving this test if others feel so,
> > on the grounds that it won't always lead to the problem reported.
> >
>
> If there is any concern that the problem won't always happen then I
> think we should just increase the number of subscriptions.
>
> Having more simultaneous subscriptions (e.g. I have tried 4) will
> make it much more likely for the test to encounter the deadlock, and
> it probably would also be quite a useful worker stress test in its
> own right.
>

I don't think we need to go that far.

> ~~
>
> Also, should this test be in the 010_truncate.pl,
>

+1 for keeping it in 010_truncate.pl, but remove the synchronous part
of it from the test case and comments, as that is not required.

-- 
With Regards,
Amit Kapila.




Re: Testing autovacuum wraparound (including failsafe)

2021-05-17 Thread Masahiko Sawada
On Sat, Apr 24, 2021 at 11:16 AM Peter Geoghegan  wrote:
>
> On Fri, Apr 23, 2021 at 5:29 PM Andres Freund  wrote:
> > On 2021-04-23 16:12:33 -0700, Peter Geoghegan wrote:
> > > The only reason that I chose 4GB for FAILSAFE_MIN_PAGES is because the
> > > related VACUUM_FSM_EVERY_PAGES constant was 8GB -- the latter limits
> > > how often we'll consider the failsafe in the single-pass/no-indexes
> > > case.
> >
> > I don't really understand why it makes sense to tie FAILSAFE_MIN_PAGES
> > and VACUUM_FSM_EVERY_PAGES together? They seem pretty independent to me?
>
> VACUUM_FSM_EVERY_PAGES controls how often VACUUM does work that
> usually takes place right after the two pass case finishes a round of
> index and heap vacuuming. This is work that we certainly don't want to
> do every time we process a single heap page in the one-pass/no-indexes
> case. Initially this just meant FSM vacuuming, but it now includes a
> failsafe check.
>
> Of course all of the precise details here are fairly arbitrary
> (including VACUUM_FSM_EVERY_PAGES, which has been around for a couple
> of releases now). The overall goal that I had in mind was to make the
> one-pass case's use of the failsafe have analogous behavior to the
> two-pass/has-indexes case -- a goal which was itself somewhat
> arbitrary.
>
> > The failsafe mode affects the table scan itself by disabling cost
> > limiting. As far as I can see the ways it triggers for the table scan (vs
> > truncation or index processing) are:
> >
> > 1) Before vacuuming starts, for heap phases and indexes, if already
> >necessary at that point
> > 2) For a table with indexes, before/after each index vacuum, if now
> >necessary
> > 3) On a table without indexes, every 8GB, iff there are dead tuples, if now 
> > necessary
> >
> > Why would we want to trigger the failsafe mode during a scan of a table
> > with dead tuples and no indexes, but not on a table without dead tuples
> > or with indexes but fewer than m_w_m dead tuples? That makes little
> > sense to me.
>
> What alternative does make sense to you?
>
> It seemed important to put the failsafe check at points where we do
> other analogous work in all cases. We made a pragmatic trade-off. In
> theory almost any scheme might not check often enough, and/or might
> check too frequently.
>
> > It seems that for the no-index case the warning message is quite off?
>
> I'll fix that up at some point soon. FWIW this happened because the
> support for one-pass VACUUM was added quite late, at Robert's request.

+1 to fix this. Are you already working on fixing this? If not, I'll
post a patch.

>
> Another issue with the failsafe commit is that we haven't considered
> the autovacuum_multixact_freeze_max_age table reloption -- we only
> check the GUC. That might have accidentally been the right thing to
> do, though, since the reloption is interpreted as lower than the GUC
> in all cases anyway -- arguably the
> autovacuum_multixact_freeze_max_age GUC should be all we care about
> anyway. I will need to think about this question some more, though.

FWIW, I intentionally ignored the reloption there since it's
interpreted as lower than the GUC, as you mentioned, and the situation
where we need to enter the failsafe mode is not a table-specific
problem but a system-wide one.
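
To make the cadence discussed earlier in this thread concrete, here is a
simplified sketch of the one-pass/no-indexes path, using the constant and
helper names from the discussion (VACUUM_FSM_EVERY_PAGES,
FreeSpaceMapVacuumRange(), lazy_check_wraparound_failsafe()); the real
vacuumlazy.c code has more bookkeeping around this spot.

	/* Roughly 8GB worth of heap pages, as mentioned above */
	#define VACUUM_FSM_EVERY_PAGES \
		((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))

	if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
	{
		/* Vacuum the FSM for the range of heap pages just processed ... */
		FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum, blkno);
		next_fsm_block_to_vacuum = blkno;

		/* ... and piggy-back the wraparound failsafe check on the same spot */
		lazy_check_wraparound_failsafe(vacrel);
	}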

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Race condition in recovery?

2021-05-17 Thread Dilip Kumar
On Tue, May 18, 2021 at 1:28 AM Robert Haas  wrote:
>
> Sorry, you're right. It couldn't be uninitialized, but it could be a
> fake 1-element list saying there are no ancestors rather than the real
> value. So I think the point was to avoid that.

Yeah, it will be a fake 1-element list.  But just to be clear, that one
element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
nothing else; do you agree with this?  Because we initialize
recoveryTargetTLI to this value and we might change it in
readRecoveryCommandFile(), but for that, we must get the history file.
So if we are talking about the case where we don't have the history
file, then "recoveryTargetTLI" will still be
"ControlFile->checkPointCopy.ThisTimeLineID".

>
> I think the issue here is: If expectedTLEs was initialized before the
> history file was available, and contained a dummy 1-element list, then
> tliOfPointInHistory() is going to say that every LSN is on that
> timeline rather than any previous timeline. And if we are supposed to
> be switching timelines then that will lead to this sanity check
> failing.

You are talking about the sanity check that validates the timeline of
the checkpoint record, right?  But as I mentioned earlier, the only
entry in expectedTLEs will be the TLE of the checkpoint record, so how
would the sanity check fail?
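
For reference, this is the check in question, lightly simplified from
StartupXLOG() (the real code also emits an errdetail with the fork point).
With a fake one-element expectedTLEs, tliOfPointInHistory() can only ever
return the checkpoint's own TLI, so the FATAL branch can never be reached.

	if (tliOfPointInHistory(checkPointLoc, expectedTLEs) !=
		checkPoint.ThisTimeLineID)
		ereport(FATAL,
				(errmsg("requested timeline %u is not a child of this server's history",
						recoveryTargetTLI)));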

>
> I agree, but that's actually bad, isn't it?

Yes, it is bad.

> I mean if we want the
> sanity check to never fail we can just take it out. What we want to
> happen is that the sanity check should pass if the TLI of the startup
> checkpoint is in the history of the recovery
> target timeline, but fail if it isn't. The only way to achieve that
> behavior is if expectedTLEs is initialized from the recovery target
> timeline.

Yes, I agree with this.  So initializing expectedTLEs with the
recovery target timeline is the right fix.
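
A minimal sketch of that direction, assuming the xlog.c names discussed
above (expectedTLEs, recoveryTargetTLI, readTimeLineHistory()); the exact
place this would happen in the final fix may well differ:

	/*
	 * Build expectedTLEs from the recovery target timeline's history file
	 * instead of a fake one-element list based on the checkpoint's TLI.
	 */
	if (expectedTLEs == NIL)
		expectedTLEs = readTimeLineHistory(recoveryTargetTLI);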

Conclusion:
- I think now we agree on the point that initializing expectedTLEs
with the recovery target timeline is the right fix.
- We still have some differences of opinion about what was the
original problem in the base code which was fixed by the commit
(ee994272ca50f70b53074f0febaec97e28f83c4e).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Testing autovacuum wraparound (including failsafe)

2021-05-17 Thread Peter Geoghegan
On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada  wrote:
> +1 to fix this. Are you already working on fixing this? If not, I'll
> post a patch.

I posted a patch recently (last Thursday my time). Perhaps you can review it?

-- 
Peter Geoghegan




Re: Testing autovacuum wraparound (including failsafe)

2021-05-17 Thread Masahiko Sawada
On Tue, May 18, 2021 at 2:42 PM Peter Geoghegan  wrote:
>
> On Mon, May 17, 2021 at 10:29 PM Masahiko Sawada  
> wrote:
> > +1 to fix this. Are you already working on fixing this? If not, I'll
> > post a patch.
>
> I posted a patch recently (last Thursday my time). Perhaps you can review it?

Oh, I missed that the patch includes that fix. I'll review the patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: pg_get_wal_replay_pause_state() should not return 'paused' while a promotion is ongoing.

2021-05-17 Thread Dilip Kumar
On Mon, May 17, 2021 at 7:59 PM Fujii Masao  wrote:
>
> If a promotion is triggered while recovery is paused, the paused state ends
> and promotion continues. But currently pg_get_wal_replay_pause_state()
> returns 'paused' in that case. Isn't this a bug?
>
> Attached patch fixes this issue by resetting the recovery pause state to
> 'not paused' when standby promotion is triggered.
>
> Thought?
>

I think that prior to commit 496ee647ecd2917369ffcf1eaa0b2cdca07c8730
(Prefer standby promotion over recovery pause.) this behavior was fine
because the pause continued, but after this commit we are giving
preference to promotion, so this is a bug that needs to be fixed.

The fix looks fine, but I think that along with this we should also return
immediately from the pause loop if promotion is requested.  Because if
we recheck the recovery pause, then someone can pause again and we will
be stuck in the loop, so it is better to exit as soon as promotion is
requested; see the attached patch, which should be applied along with
your patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f1..1d81c23 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6358,6 +6358,8 @@ RecoveryRequiresIntParameter(const char *param_name, int currValue, int minValue
 		   minValue),
  errhint("Restart the server after making the necessary configuration changes.")));
 	warned_for_promote = true;
+
+	return;
 }
 
 /*

