Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-15 09:42:00 -0700, Andres Freund wrote:
> On 2020-08-15 11:10:51 -0400, Tom Lane wrote:
> > We have two essentially identical buildfarm failures since these patches
> > went in:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14
> >
> > They're both in the same place in the freeze-the-dead isolation test:
> 
> > TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", 
> > File: "heapam.c", Line: 6051)
> > 0x9613eb  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x52d586  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x53bc7e  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x6949bb  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x694532  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x693d1c  at 
> > /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x8324b3
> > ...
> > 2020-08-14 22:16:41.783 CDT [78410:4] LOG:  server process (PID 80395) was 
> > terminated by signal 6: Abort trap
> > 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL:  Failed process was running: 
> > VACUUM FREEZE tab_freeze;
> >
> > peripatus has successes since this failure, so it's not fully reproducible
> > on that machine.  I'm suspicious of a timing problem in computing vacuum's
> > cutoff_xid.
> 
> Hm, maybe it's something around what I observed in
> https://www.postgresql.org/message-id/20200723181018.neey2jd3u7rfrfrn%40alap3.anarazel.de
> 
> I.e. that somehow we end up with hot pruning and freezing coming to a
> different determination, and trying to freeze a hot tuple.
> 
> I'll try to add a few additional asserts here, and burn some CPU trying
> to trigger the issue.
> 
> I gotta escape the heat in the house for a few hours though (no AC
> here), so I'll not look at the results till later this afternoon, unless
> it triggers soon.

690 successful runs later, it didn't trigger for me :(. Seems pretty
clear that there's another variable than pure chance, otherwise it seems
like that number of runs should have hit the issue, given the number of
bf hits vs bf runs.

My current plan is to push a bit of additional instrumentation to
help narrow down the issue. We can afterwards decide what of that we'd
like to keep longer term, and what not.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Tom Lane
Andres Freund  writes:
> 690 successful runs later, it didn't trigger for me :(. Seems pretty
> clear that there's another variable than pure chance, otherwise it seems
> like that number of runs should have hit the issue, given the number of
> bf hits vs bf runs.

It seems entirely likely that there's a timing component in this, for
instance autovacuum coming along at just the right time.  It's not too
surprising that some machines would be more prone to show that than
others.  (Note peripatus is FreeBSD, which we've already learned has
significantly different kernel scheduler behavior than Linux.)

> My current plan is to push a bit of additional instrumentation to
> help narrow down the issue.

+1

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
On 2020-08-16 14:30:24 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > 690 successful runs later, it didn't trigger for me :(. Seems pretty
> > clear that there's another variable than pure chance, otherwise it seems
> > like that number of runs should have hit the issue, given the number of
> > bf hits vs bf runs.
> 
> It seems entirely likely that there's a timing component in this, for
> instance autovacuum coming along at just the right time.  It's not too
> surprising that some machines would be more prone to show that than
> others.  (Note peripatus is FreeBSD, which we've already learned has
> significantly different kernel scheduler behavior than Linux.)

Yea. Interestingly there was a reproduction on linux since the initial
reports you'd dug up:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=butterflyfish&dt=2020-08-15%2019%3A54%3A53

but that's likely a virtualized environment, so I guess the host
scheduler behaviour could play a similar role.

I'll run a few iterations with rr's chaos mode too, which tries to
randomize scheduling decisions...
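(Something along the lines of

    rr record --chaos make -C src/test/isolation check

with the exact invocation depending on how the test run is wrapped.)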


I noticed that it's quite hard to actually hit the hot tuple path I
mentioned earlier on my machine. It would probably be good to have a test
hitting it more reliably, but I'm not immediately seeing how we could
force the necessary serialization.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Tom Lane
I wrote:
> It seems entirely likely that there's a timing component in this, for
> instance autovacuum coming along at just the right time.

D'oh.  The attached seems to make it 100% reproducible.

regards, tom lane

diff --git a/src/test/isolation/specs/freeze-the-dead.spec b/src/test/isolation/specs/freeze-the-dead.spec
index 915bf15b92..4100d9fc6f 100644
--- a/src/test/isolation/specs/freeze-the-dead.spec
+++ b/src/test/isolation/specs/freeze-the-dead.spec
@@ -32,6 +32,7 @@ session "s2"
 step "s2_begin"		{ BEGIN; }
 step "s2_key_share"	{ SELECT id FROM tab_freeze WHERE id = 3 FOR KEY SHARE; }
 step "s2_commit"	{ COMMIT; }
+step "s2_wait"		{ select pg_sleep(60); }
 step "s2_vacuum"	{ VACUUM FREEZE tab_freeze; }
 
 session "s3"
@@ -49,6 +50,7 @@ permutation "s1_begin" "s2_begin" "s3_begin" # start transactions
"s1_update" "s2_key_share" "s3_key_share" # have xmax be a multi with an updater, updater being oldest xid
"s1_update" # create additional row version that has multis
"s1_commit" "s2_commit" # commit both updater and share locker
+   "s2_wait"
"s2_vacuum" # due to bug in freezing logic, we used to *not* prune updated row, and then froze it
"s1_selectone" # if hot chain is broken, the row can't be found via index scan
"s3_commit" # commit remaining open xact


Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 16:17:23 -0400, Tom Lane wrote:
> I wrote:
> > It seems entirely likely that there's a timing component in this, for
> > instance autovacuum coming along at just the right time.
> 
> D'oh.  The attached seems to make it 100% reproducible.

Great!  Interestingly it didn't work as the first item on the schedule,
where I had duplicated it out of impatience. I guess there might be some
need of concurrent autovacuum activity or something like that.

I now luckily have a rr trace of the problem, so I hope I can narrow it
down to the original problem fairly quickly.

Thanks,

Andres




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> I now luckily have a rr trace of the problem, so I hope I can narrow it
> down to the original problem fairly quickly.

Gna, I think I see the problem.  In at least one place I wrongly
accessed the 'dense' array of in-progress xids using the 'pgprocno',
instead of directly using the [0...procArray->numProcs) index.
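To spell out the distinction (simplified sketch, not the actual code): the
per-slot state is addressed via pgprocno, while the dense mirrored arrays
have to be addressed by position, roughly

	for (int index = 0; index < arrayP->numProcs; index++)
	{
		int			pgprocno = arrayP->pgprocnos[index];	/* sparse slot number */
		PGPROC	   *proc = &allProcs[pgprocno];				/* per-slot state: use pgprocno */

		/* dense arrays (xids, subxidStates, ...) are mirrored by position: use index */
		TransactionId xid = UINT32_ACCESS_ONCE(other_xids[index]);
	}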

Working on a fix, together with some improved asserts.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 13:52:58 -0700, Andres Freund wrote:
> On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> > I now luckily have a rr trace of the problem, so I hope I can narrow it
> > down to the original problem fairly quickly.
> 
> Gna, I think I see the problem.  In at least one place I wrongly
> accessed the 'dense' array of in-progress xids using the 'pgprocno',
> instead of directly using the [0...procArray->numProcs) index.
> 
> Working on a fix, together with some improved asserts.

diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
index 8262abd42e6..96e4a878576 100644
--- i/src/backend/storage/ipc/procarray.c
+++ w/src/backend/storage/ipc/procarray.c
@@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 TransactionId xmin;
 
 /* Fetch xid just once - see GetNewTransactionId */
-xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
+xid = UINT32_ACCESS_ONCE(other_xids[index]);
 xmin = UINT32_ACCESS_ONCE(proc->xmin);
 
 /*

indeed fixes the issue based on a number of iterations of your modified
test, and fixes a clear bug.

WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
initialize currently unused procArray->pgprocnos,
procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
declare them as uninitialized using the valgrind helpers.

For the first, one issue is that there's no obviously good candidate for
an uninitialized xid. We could use something like FrozenTransactionId,
which may never be in the procarray. But it's not exactly pretty.

Opinions?

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 14:11:46 -0700, Andres Freund wrote:
> On 2020-08-16 13:52:58 -0700, Andres Freund wrote:
> > On 2020-08-16 13:31:53 -0700, Andres Freund wrote:
> > Gna, I think I see the problem.  In at least one place I wrongly
> > accessed the 'dense' array of in-progress xids using the 'pgprocno',
> > instead of directly using the [0...procArray->numProcs) index.
> > 
> > Working on a fix, together with some improved asserts.
> 
> diff --git i/src/backend/storage/ipc/procarray.c w/src/backend/storage/ipc/procarray.c
> index 8262abd42e6..96e4a878576 100644
> --- i/src/backend/storage/ipc/procarray.c
> +++ w/src/backend/storage/ipc/procarray.c
> @@ -1663,7 +1663,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
>  TransactionId xmin;
>  
>  /* Fetch xid just once - see GetNewTransactionId */
> -xid = UINT32_ACCESS_ONCE(other_xids[pgprocno]);
> +xid = UINT32_ACCESS_ONCE(other_xids[index]);
>  xmin = UINT32_ACCESS_ONCE(proc->xmin);
>  
>  /*
> 
> indeed fixes the issue based on a number of iterations of your modified
> test, and fixes a clear bug.

Pushed that much.


> WRT better asserts: We could make ProcArrayRemove() and InitProcGlobal()
> initialize currently unused procArray->pgprocnos,
> procGlobal->{xids,subxidStates,vacuumFlags} to invalid values and/or
> declare them as uninitialized using the valgrind helpers.
> 
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.
> 
> Opinions?

So we get some buildfarm results while thinking about this.

Greetings,

Andres Freund




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Peter Geoghegan
On Sun, Aug 16, 2020 at 2:11 PM Andres Freund  wrote:
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.

Maybe it would make sense to mark the fields as inaccessible or
undefined to Valgrind. That has advantages and disadvantages that are
obvious.

If that isn't enough, it might not hurt to do this on top of whatever
becomes the primary solution. An undefined value has the advantage of
"spreading" when the value gets copied around.

-- 
Peter Geoghegan




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Tom Lane
Andres Freund  writes:
> For the first, one issue is that there's no obviously good candidate for
> an uninitialized xid. We could use something like FrozenTransactionId,
> which may never be in the procarray. But it's not exactly pretty.

Huh?  What's wrong with using InvalidTransactionId?

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-08-16 Thread Andres Freund
Hi,

On 2020-08-16 17:28:46 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > For the first, one issue is that there's no obviously good candidate for
> > an uninitialized xid. We could use something like FrozenTransactionId,
> > which may never be in the procarray. But it's not exactly pretty.
> 
> Huh?  What's wrong with using InvalidTransactionId?

It's a normal value for a backend when it doesn't have an xid assigned.

Greetings,

Andres Freund




One-off failure in "cluster" test

2020-08-16 Thread Thomas Munro
Hi,

I wonder what caused this[1] one-off failure to see tuples in clustered order:

diff -U3 /home/pgbfarm/buildroot/REL_13_STABLE/pgsql.build/src/test/regress/expected/cluster.out /home/pgbfarm/buildroot/REL_13_STABLE/pgsql.build/src/test/regress/results/cluster.out
--- /home/pgbfarm/buildroot/REL_13_STABLE/pgsql.build/src/test/regress/expected/cluster.out	2020-06-11 07:58:23.738084255 +0300
+++ /home/pgbfarm/buildroot/REL_13_STABLE/pgsql.build/src/test/regress/results/cluster.out	2020-07-05 02:35:06.396023210 +0300
@@ -462,7 +462,8 @@
 where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
  hundred | lhundred | thousand | lthousand | tenthous | ltenthous
 -+--+--+---+--+---
-(0 rows)
+   0 |   99 |0 |   999 |0 |  
+(1 row)

I guess a synchronised scan could cause that, but I wouldn't expect one here.

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2020-07-04%2023:10:22




Re: One-off failure in "cluster" test

2020-08-16 Thread Tom Lane
Thomas Munro  writes:
> I wonder what caused this[1] one-off failure to see tuples in clustered order:
> ...
> I guess a synchronised scan could cause that, but I wouldn't expect one here.

Looking at its configuration, chipmunk uses

 'extra_config' => {
 ...
  'shared_buffers = 10MB',

which I think means that clstr_4 would be large enough to trigger a
syncscan.  Ordinarily that's not a problem since no other session would
be touching clstr_4 ... but I wonder whether (a) autovacuum had decided
to look at clstr_4 and (b) syncscan can trigger on vacuum-driven scans.
(a) would explain the non-reproducibility.
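(For the arithmetic: synchronized scans only kick in when the table is
bigger than a quarter of shared_buffers, i.e. NBuffers / 4.  With 10MB of
shared_buffers that threshold is about 320 pages, and clstr_4, a copy of
tenk1, is somewhere around 350 pages, so it qualifies; at the default
128MB it would not come close.)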

I kinda think that (b), if true, is a bad idea and should be suppressed.
autovacuum would typically fail to keep up with other syncscans thanks
to vacuum delay settings, so letting it participate seems unhelpful.

regards, tom lane




Re: One-off failure in "cluster" test

2020-08-16 Thread Thomas Munro
On Mon, Aug 17, 2020 at 1:20 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > I wonder what caused this[1] one-off failure to see tuples in clustered 
> > order:
> > ...
> > I guess a synchronised scan could cause that, but I wouldn't expect one 
> > here.
>
> Looking at its configuration, chipmunk uses
>
>  'extra_config' => {
>  ...
>   'shared_buffers = 10MB',
>
> which I think means that clstr_4 would be large enough to trigger a
> syncscan.  Ordinarily that's not a problem since no other session would
> be touching clstr_4 ... but I wonder whether (a) autovacuum had decided
> to look at clstr_4 and (b) syncscan can trigger on vacuum-driven scans.
> (a) would explain the non-reproducibility.
>
> I kinda think that (b), if true, is a bad idea and should be suppressed.
> autovacuum would typically fail to keep up with other syncscans thanks
> to vacuum delay settings, so letting it participate seems unhelpful.

Yeah, I wondered that as well and found my way to historical
discussions concluding that autovacuum should not participate in sync
scans.  Now I'm wondering if either table AM refactoring or parallel
vacuum refactoring might have inadvertently caused that to become a
possibility in REL_13_STABLE.




Re: Fix an old description in high-availability.sgml

2020-08-16 Thread Michael Paquier
On Fri, Aug 14, 2020 at 05:15:20PM +0900, Michael Paquier wrote:
> Good catch it is.  Your phrasing looks good to me.

Fixed as b4f1639.  Thanks.
--
Michael




Re: One-off failure in "cluster" test

2020-08-16 Thread Thomas Munro
On Mon, Aug 17, 2020 at 1:27 PM Thomas Munro  wrote:
>
> On Mon, Aug 17, 2020 at 1:20 PM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > I wonder what caused this[1] one-off failure to see tuples in clustered 
> > > order:
> > > ...
> > > I guess a synchronised scan could cause that, but I wouldn't expect one 
> > > here.
> >
> > Looking at its configuration, chipmunk uses
> >
> >  'extra_config' => {
> >  ...
> >   'shared_buffers = 10MB',

Ahh, I see what's happening.  You don't need a concurrent process
scanning *your* table for scan order to be nondeterministic.  The
preceding CLUSTER command can leave the start block anywhere if its
call to ss_report_location() fails to acquire SyncScanLock
conditionally.  So I think we just need to disable that for this test,
like in the attached.
From b12d7952feb6153cc760cd8c028182927e8b0198 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 17 Aug 2020 14:41:42 +1200
Subject: [PATCH] Fix rare failure in cluster test.

Don't allow synchronized scans of the table used in the "cluster"
regression test, because the conditional locking strategy used for
synchronization means that even a non-concurrent scan of a table from
the same session causes nondeterminism in subsequent scans.

Back-patch to 9.6 when the test arrived.

Discussion: https://postgr.es/m/CA%2BhUKGLTK6ZuEkpeJ05-MEmvmgZveCh%2B_w013m7%2ByKWFSmRcDA%40mail.gmail.com
---
 src/test/regress/expected/cluster.out | 3 +++
 src/test/regress/sql/cluster.sql  | 4 
 2 files changed, 7 insertions(+)

diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index bdae8fe00c..6e3191b84e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -452,6 +452,8 @@ create table clstr_4 as select * from tenk1;
 create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
 -- ensure we don't use the index in CLUSTER nor the checking SELECTs
 set enable_indexscan = off;
+-- make sure our test will always read blocks from the start of the table
+set synchronize_seqscans = off;
 -- Use external sort:
 set maintenance_work_mem = '1MB';
 cluster clstr_4 using cluster_sort;
@@ -464,6 +466,7 @@ where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
 -+--+--+---+--+---
 (0 rows)
 
+reset synchronize_seqscans;
 reset enable_indexscan;
 reset maintenance_work_mem;
 -- test CLUSTER on expression index
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index 188183647c..1bd5e933e8 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -210,6 +210,9 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
 -- ensure we don't use the index in CLUSTER nor the checking SELECTs
 set enable_indexscan = off;
 
+-- make sure our test will always read blocks from the start of the table
+set synchronize_seqscans = off;
+
 -- Use external sort:
 set maintenance_work_mem = '1MB';
 cluster clstr_4 using cluster_sort;
@@ -219,6 +222,7 @@ select * from
 tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
 where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
 
+reset synchronize_seqscans;
 reset enable_indexscan;
 reset maintenance_work_mem;
 
-- 
2.20.1



Re: OpenSSL 3.0.0 compatibility

2020-08-16 Thread Michael Paquier
On Fri, May 29, 2020 at 12:16:47AM +0200, Daniel Gustafsson wrote:
> SSL_CTX_load_verify_locations and X509_STORE_load_locations are deprecated and
> replaced by individual calls to load files and directories.  These are quite
> straightforward, and are implemented like how we handle the TLS protocol API.
> DH_check has been discouraged to use for quite some time, and is now marked
> deprecated without a 1:1 replacement.  The OpenSSL docs recommends using the
> EVP API, which is also done in 0001.  For now I just stuck it in with version
> gated ifdefs, something cleaner than that can clearly be done.  0001 is
> clearly not proposed for review/commit yet, it's included in case someone
> else is interested as well.

Leaving the problems with pgcrypto aside for now, we also have two
parts involving the DH_check() deprecation and the load-file APIs for
the backend.

From what I can see, the new APIs to load files are new as of 3.0.0,
but these are not marked as deprecated yet, so I am not sure that it
is worth having one extra API compatibility layer just for that now,
as proposed in cert_openssl.c.  Most of the EVP counterparts, though,
are much older than 1.0.1, except for EVP_PKEY_param_check(),
introduced in 1.1.1 :(

By the way, in the previous patch, EVP_PKEY_CTX_new_from_pkey() was
getting used but it is new as of 3.0.  We could just use
EVP_PKEY_CTX_new() which does the same work (see
crypto/evp/pmeth_lib.c as we have no library context or engine to pass
down), and is much older, but it does not reduce the diffs.  Then the
actual problem is EVP_PKEY_param_check(), new as of 1.1.1, which looks
to be the expected replacement for DH_check().

It seems to me that it would not be a bad thing to switch to the EVP
APIs on HEAD to be prepared for the future, but I would switch to
EVP_PKEY_CTX_new() instead of EVP_PKEY_CTX_new_from_pkey() and add a
configure check to see if EVP_PKEY_param_check() is defined or not.
OPENSSL_VERSION_NUMBER cannot be used because of LibreSSL overriding
it, and I guess that OPENSSL_VERSION_MAJOR, as used in the original
patch, would not work with LibreSSL either.
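Concretely, I am thinking of nothing fancier than probing for the function
(sketch, not tested): add EVP_PKEY_param_check to the AC_CHECK_FUNCS list
already used for the other OpenSSL symbols in configure, and then in
be-secure-openssl.c do something like

	#ifdef HAVE_EVP_PKEY_PARAM_CHECK
		/* validate the DH parameters via the EVP API */
	#else
		/* fall back to DH_check() */
	#endif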

Any thoughts?
--
Michael




Re: Parallel copy

2020-08-16 Thread Greg Nancarrow
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50


"This value should be divisible by RINGSIZE" is not a correct
statement (since obviously 50 is not divisible by 1).
It should say something like "This value should evenly divide into
RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".


(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+   /*
+* If the data is present in current block
lineInfo. line_size
+* will be updated. If the data is spread
across the blocks either

Somehow a space has been put between "lineinfo." and "line_size".
It should be: "If the data is present in current block
lineInfo.line_size will be updated"

(ii)

>This is not possible because of pg_atomic_compare_exchange_u32, this
>will succeed only for one of the workers whose line_state is
>LINE_LEADER_POPULATED, for other workers it will fail. This is
>explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
aren't you separately reading line_state and line_size, so that
between those reads, it may have transitioned from leader to another
worker, such that the read line state ("cur_line_state", being checked
in the if block) may not actually match what is now in the line_state
and/or the read line_size ("dataSize") doesn't actually correspond to
the read line state?

(sorry, still not 100% convinced that the synchronization and checks
are safe in all cases)

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

>raw_buf is not used in parallel copy, instead raw_buf will be pointing
>to shared memory data blocks. This memory was allocated as part of
>BeginCopyFrom, until this point we cannot be 100% sure as copy can be
>performed sequentially like in case max_worker_processes is not
>available, if it switches to sequential mode raw_buf will be used
>while performing copy operation. At this place we can safely free this
>memory that was allocated

So the following code (which checks raw_buf, which still points to
memory that has been pfreed) is still valid?

  In the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ?
&pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in
the debugger when I debugged an instance of Parallel Copy, so it
definitely gets executed.
It makes me wonder what other code could possibly be checking raw_buf
and using it in some way, when in fact what it points to has been
pfreed.

Are you able to add the following line of code, or will it (somehow)
break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;   <=== I suggest that this line is added

Regards,
Greg Nancarrow
Fujitsu Australia




Re: new heapcheck contrib module

2020-08-16 Thread Amul Sul
On Thu, Jul 30, 2020 at 11:29 PM Robert Haas  wrote:
>
> On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger
>  wrote:
> > Not at all!  I appreciate all the reviews.
>
> Reviewing 0002, reading through verify_heapam.c:
>
> +typedef enum SkipPages
> +{
> + SKIP_ALL_FROZEN_PAGES,
> + SKIP_ALL_VISIBLE_PAGES,
> + SKIP_PAGES_NONE
> +} SkipPages;
>
> This looks inconsistent. Maybe just start them all with SKIP_PAGES_.
>
> + if (PG_ARGISNULL(0))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("missing required parameter for 'rel'")));
>
> This doesn't look much like other error messages in the code. Do
> something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study
> the comparables.
>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("unrecognized parameter for 'skip': %s", skip),
> + errhint("please choose from 'all-visible', 'all-frozen', or 'none'")));
>
> Same problem. Check pg_prewarm's handling of the prewarm type, or
> EXPLAIN's handling of the FORMAT option, or similar examples. Read the
> message style guidelines concerning punctuation of hint and detail
> messages.
>
> + * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572)
> + * to have sometimes rendered the oldest xid value for a database invalid.
> + * It seems unwise to report rows as corrupt for failing to be newer than
> + * a value which itself may be corrupt.  We instead use the oldest xid for
> + * the entire cluster, which must be at least as old as the oldest xid for
> + * our database.
>
> This kind of reference to another comment will not age well; line
> numbers and files change a lot. But I think the right thing to do here
> is just rely on relfrozenxid and relminmxid. If the table is
> inconsistent with those, then something needs fixing. datfrozenxid and
> the cluster-wide value can look out for themselves. The corruption
> detector shouldn't be trying to work around any bugs in setting
> relfrozenxid itself; such problems are arguably precisely what we're
> here to find.
>
> +/*
> + * confess
> + *
> + *   Return a message about corruption, including information
> + *   about where in the relation the corruption was found.
> + *
> + *   The msg argument is pfree'd by this function.
> + */
> +static void
> +confess(HeapCheckContext *ctx, char *msg)
>
> Contrary to what the comments say, the function doesn't return a
> message about corruption or anything else. It returns void.
>
> I don't really like the name, either. I get that it's probably
> inspired by Perl, but I think it should be given a less-clever name
> like report_corruption() or something.
>
> + * corrupted table from using workmem worth of memory building up the
>
> This kind of thing destroys grep-ability. If you're going to refer to
> work_mem, you gotta spell it the same way we do everywhere else.
>
> + * Helper function to construct the TupleDesc needed by verify_heapam.
>
> Instead of saying it's the TupleDesc somebody needs, how about saying
> that it's the TupleDesc that we'll use to report problems that we find
> while scanning the heap, or something like that?
>
> + * Given a TransactionId, attempt to interpret it as a valid
> + * FullTransactionId, neither in the future nor overlong in
> + * the past.  Stores the inferred FullTransactionId in *fxid.
>
> It really doesn't, because there's no such thing as 'fxid' referenced
> anywhere here. You should really make the effort to proofread your
> patches before posting, and adjust comments and so on as you go.
> Otherwise reviewing takes longer, and if you keep introducing new
> stuff like this as you fix other stuff, you can fail to ever produce a
> committable patch.
>
> + * Determine whether tuples are visible for verification.  Similar to
> + *  HeapTupleSatisfiesVacuum, but with critical differences.
>
> Not accurate, because it also reports problems, which is not mentioned
> anywhere in the function header comment that purports to be a detailed
> description of what the function does.
>
> + else if (TransactionIdIsCurrentTransactionId(raw_xmin))
> + return true; /* insert or delete in progress */
> + else if (TransactionIdIsInProgress(raw_xmin))
> + return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */
> + else if (!TransactionIdDidCommit(raw_xmin))
> + {
> + return false; /* HEAPTUPLE_DEAD */
> + }
>
> One of these cases is not punctuated like the others.
>
> + pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor
> has a valid xmax"));
>
> 1. I don't think that's very grammatical.
>
> 2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and
> HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should
> be referencing C constant names here at all, and if you are I don't
> think you should abbreviate, and if you do abbreviate I don't think
> you should omit different numbers of words depending on which constant
> it is.
>
> I wonder what the intended division of responsibility is here,
> exactly. It seems like you've 

Re: tab completion of IMPORT FOREIGN SCHEMA

2020-08-16 Thread Michael Paquier
On Sun, Aug 09, 2020 at 12:33:43PM -0400, Tom Lane wrote:
> I don't see how psql could obtain a "real" list of foreign schemas
> from an arbitrary FDW, even if it magically knew which server the
> user would specify later in the command.  So this behavior seems fine.
> It has some usefulness, while not completing at all would have none.

Sounds fine to me as well.  The LIMIT TO and EXCEPT clauses are
optional, so using TailMatches() looks fine.

+   else if (TailMatches("FROM", "SERVER", MatchAny, "INTO", MatchAny))
+   COMPLETE_WITH("OPTIONS")
Shouldn't you complete with "OPTIONS (" here?

It would be good to complete with "FROM SERVER" after specifying
EXCEPT or LIMIT TO; you can just use "(*)" to include the list of
tables in the list of elements checked.
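Something along these lines, perhaps (untested):

	else if (TailMatches("IMPORT", "FOREIGN", "SCHEMA", MatchAny,
						 "LIMIT", "TO", "(*)") ||
			 TailMatches("IMPORT", "FOREIGN", "SCHEMA", MatchAny,
						 "EXCEPT", "(*)"))
		COMPLETE_WITH("FROM SERVER");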
--
Michael




Re: Fix an old description in high-availability.sgml

2020-08-16 Thread Masahiko Sawada
On Mon, 17 Aug 2020 at 10:32, Michael Paquier  wrote:
>
> On Fri, Aug 14, 2020 at 05:15:20PM +0900, Michael Paquier wrote:
> > Good catch it is.  Your phrasing looks good to me.
>
> Fixed as b4f1639.  Thanks.

Thank you!

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-16 Thread Ashutosh Sharma
Hello Masahiko-san,

Thanks for the review. Please check the comments inline below:

On Fri, Aug 14, 2020 at 10:07 AM Masahiko Sawada
 wrote:

> Thank you for updating the patch! Here are my comments on v5 patch:
>
> --- a/contrib/Makefile
> +++ b/contrib/Makefile
> @@ -35,6 +35,7 @@ SUBDIRS = \
> pg_standby  \
> pg_stat_statements \
> pg_trgm \
> +   pg_surgery  \
> pgcrypto\
>
> I guess we use alphabetical order here. So pg_surgery should be placed
> before pg_trgm.
>

Okay, will take care of this in the next version of patch.

> ---
> +   if (heap_force_opt == HEAP_FORCE_KILL)
> +   ItemIdSetDead(itemid);
>
> I think that if the page is an all-visible page, we should clear an
> all-visible bit on the visibility map corresponding to the page and
> PD_ALL_VISIBLE on the page header. Otherwise, index only scan would
> return the wrong results.
>

I think we should let VACUUM do that. Please note that this module is
intended to be used only on a damaged relation, and only on the damaged
tuples of such a relation. The execution of any of the functions provided
by this module on a damaged relation must be followed by VACUUM with the
DISABLE_PAGE_SKIPPING option on that relation. This is necessary to bring
the damaged relation back to a sane state once surgery has been performed
on it. I will try to add this note to the documentation for this module.
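For example, the intended usage is roughly the following (table name
illustrative):

	-- repair only the TIDs for which VACUUM reported corruption
	select heap_force_freeze('damaged_tbl', array['(0,1)'::tid]);
	-- then bring the relation back to a sane state
	vacuum (disable_page_skipping) damaged_tbl;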

> ---
> +   /*
> +* We do not mark the buffer dirty or do WAL logging for unmodifed
> +* pages.
> +*/
> +   if (!did_modify_page)
> +   goto skip_wal;
> +
> +   /* Mark buffer dirty before we write WAL. */
> +   MarkBufferDirty(buf);
> +
> +   /* XLOG stuff */
> +   if (RelationNeedsWAL(rel))
> +   log_newpage_buffer(buf, true);
> +
> +skip_wal:
> +   END_CRIT_SECTION();
> +
>
> s/unmodifed/unmodified/
>

okay, will fix this typo.

> Do we really need to use goto? I think we can modify it as follows:
>
> if (did_modify_page)
> {
>/* Mark buffer dirty before we write WAL. */
>MarkBufferDirty(buf);
>
>/* XLOG stuff */
>if (RelationNeedsWAL(rel))
>log_newpage_buffer(buf, true);
> }
>
> END_CRIT_SECTION();
>

No, we don't need it. We can achieve the same by checking the status
of did_modify_page flag as you suggested. I will do this change in the
next version.

> ---
> pg_force_freeze() can revive a tuple that is already deleted but not
> vacuumed yet. Therefore, the user might need to reindex indexes after
> using that function. For instance, with the following script, the last
> two queries: index scan and seq scan, will return different results.
>
> set enable_seqscan to off;
> set enable_bitmapscan to off;
> set enable_indexonlyscan to off;
> create table tbl (a int primary key);
> insert into tbl values (1);
>
> update tbl set a = a + 100 where a = 1;
>
> explain analyze select * from tbl where a < 200;
>
> -- revive deleted tuple on heap
> select heap_force_freeze('tbl', array['(0,1)'::tid]);
>
> -- index scan returns 2 tuples
> explain analyze select * from tbl where a < 200;
>
> -- seq scan returns 1 tuple
> set enable_seqscan to on;
> explain analyze select * from tbl;
>

I am not sure if this is the right use-case for the pg_force_freeze
function. I think we should only be running the pg_force_freeze function
on a tuple for which VACUUM reports a "found xmin ABC from before
relfrozenxid PQR" sort of error, otherwise it might worsen things
instead of making them better. Now, the question is - can VACUUM report
this type of error for a deleted tuple, or would it only report it for
a live tuple? AFAIU this won't be reported for deleted tuples,
because VACUUM wouldn't consider freezing a tuple that has been
deleted.

> Also, if a tuple updated and moved to another partition is revived by
> heap_force_freeze(), its ctid still has special values:
> MovedPartitionsOffsetNumber and MovedPartitionsBlockNumber. I don't
> see a problem yet caused by a visible tuple having the special ctid
> value, but it might be worth considering either to reset ctid value as
> well or to not freezing already-deleted tuple.
>

For this as well, the answer remains the same as above.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: display offset along with block number in vacuum errors

2020-08-16 Thread Masahiko Sawada
On Sat, 15 Aug 2020 at 12:19, Amit Kapila  wrote:
>
> On Fri, Aug 14, 2020 at 4:06 PM Amit Kapila  wrote:
> >
> > On Mon, Aug 10, 2020 at 10:24 AM Masahiko Sawada
> >  wrote:
> > >
> > > It's true that heap_page_is_all_visible() is called from only
> > > lazy_vacuum_page() but I'm concerned it would be misleading since
> > > it's not actually removing tuples but just checking after vacuum. I
> > > guess that the errcontext should show what the process is actually
> > > doing and therefore help the investigation, so I thought VACUUM_HEAP
> > > might not be appropriate for this case. But I see also your point.
> > > Other vacuum error context phases match with vacuum progress
> > > information phases. So in heap_page_is_all_visible (), I agree it's
> > > better to update the offset number and keep the phase VACUUM_HEAP
> > > rather than do nothing.
> > >
> >
> > Okay, I have changed accordingly and this means that the offset will
> > be displayed for the vacuum phase as well. Apart from this, I have
> > fixed all the comments raised by me in the attached patch. One thing
> > we need to think about is whether we want to set offset during heap_page_prune
> > when called from lazy_scan_heap? I think the code path for
> > heap_prune_chain is quite deep, so an error can occur in that path.
> > What do you think?
> >
>
> The reason why I have not included heap_page_prune related change in
> the patch is that I don't want to sprinkle this in every possible
> function (code path) called via vacuum especially if the probability
> of an error in that code path is low. But, I am fine if you and or
> others think that it is a good idea to update offset in
> heap_page_prune as well.

I agree that we should not try to sprinkle it in more places than necessary.

Regarding heap_page_prune(), I'm a bit concerned that
heap_page_prune() is typically the first function to check the tuple
visibility within the vacuum code. I've sometimes observed an error
with a message like "DETAIL: could not open file "pg_xact/00AB": No
such file or directory" due to a tuple header corruption. I suspect
this message was emitted while checking tuple visibility in
heap_page_prune(). So I guess the likelihood of an error in that code
is not so low.

On the other hand, if we want to update the offset number in
heap_page_prune() we will need to expose some vacuum structs that are
currently defined as static, including LVRelStats. But I don't think
that's a good idea. The second idea I came up with is to set up another
errcontext for heap_page_prune(). Since heap_page_prune() can also be
called during regular page scans it would work fine for both cases,
although there will be extra overhead for both. What do you think?
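For illustration, that second option would just be the usual errcontext
pattern wrapped around the pruning work (sketch, with made-up names):

	ErrorContextCallback errcallback;

	errcallback.callback = heap_page_prune_error_callback;	/* hypothetical */
	errcallback.arg = (void *) &errinfo;	/* carries relation, block, offset */
	errcallback.previous = error_context_stack;
	error_context_stack = &errcallback;

	/* ... prune the page, updating errinfo.offnum as we go ... */

	error_context_stack = errcallback.previous;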

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Autovacuum on partitioned table (autoanalyze)

2020-08-16 Thread yuzuko
I'm sorry for the late reply.

> This version seems to fail under Werror which is used in the Travis builds:
>
> autovacuum.c: In function ‘relation_needs_vacanalyze’:
> autovacuum.c:3117:59: error: ‘reltuples’ may be used uninitialized in this 
> function [-Werror=maybe-uninitialized]
>anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
>^
> autovacuum.c:2972:9: note: ‘reltuples’ was declared here
>   float4  reltuples;  /* pg_class.reltuples */
>  ^
>

I've attached the latest patch, which fixes the above -Werror failure.
Could you please check it again?

-- 
Best regards,
Yuzuko Hosoya
NTT Open Source Software Center


v9_autovacuum_on_partitioned_table.patch
Description: Binary data


proposal: enhancing plpgsql debug API - returns text value of variable content

2020-08-16 Thread Pavel Stehule
Hi

I am working on tracing support for plpgsql_check

https://github.com/okbob/plpgsql_check

I would like to print the content of variables - and right now I have to go
somewhat deeper than I would like. I need to distinguish between scalar,
row, and record variables. PLpgSQL has code for this - but it is private.

The plpgsql debug API does have an API for expression evaluation - and it
works fine, but it requires knowing the relevant namespace. Unfortunately,
plpgsql variables do not have any info assigned about their related
namespaces. This increases the work necessary for implementing conditional
breakpoints or just printing all variables (and means maintaining a lot of
plpgsql code outside the plpgsql core).

So my proposals:

1. enhance the debug API with a method

char *get_cstring_value(PLpgSQL_variable *var, bool *isnull)

(a usage sketch follows below)

2. enhance the PLpgSQL_var structure with its related namespace ("struct
PLpgSQL_nsitem *ns") and a PLpgSQL_stmt *scope statement (the statement that
limits the scope of the variable's visibility). For use in debuggers and
tracers it would also be nice to have info about the kind of variable
(function argument, local variable, automatic custom variable (FORC),
automatic internal variable (SQLERRM, FOUND, TG_OP, ...)).
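For illustration, a tracer could then print a variable with something like
this (sketch; assumes the method proposed in (1)):

	bool	isnull;
	char   *value = get_cstring_value(var, &isnull);

	elog(NOTICE, "%s => %s", var->refname, isnull ? "NULL" : value);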

Comments, notes?

Regards

Pavel


proposal - reference to plpgsql_check from plpgsql doc

2020-08-16 Thread Pavel Stehule
Hi

The plpgsql_check extension is almost complete now. This extension is
available for all environments and all supported Postgres releases. It is
probably too big to be part of contrib, but I think it could be referenced
in the
https://www.postgresql.org/docs/current/plpgsql-development-tips.html
chapter.

What do you think about it?

Regards

Pavel


Re: More tests with USING INDEX replident and dropped indexes

2020-08-16 Thread Michael Paquier
On Wed, Jun 03, 2020 at 12:08:56PM -0300, Euler Taveira wrote:
> Consistency is a good goal. Why don't we clear the relreplident from the
> relation while dropping the index? relation_mark_replica_identity() already
> does that but does other things too. Let's move the first code block from
> relation_mark_replica_identity to another function and call this new
> function while dropping the index.

I have looked at your suggestion, and finished with the attached.
There are a couple of things to be aware of:
- For DROP INDEX CONCURRENTLY, pg_class.relreplident of the parent
table is set in the last transaction doing the drop.  It would be
tempting to reset the flag in the same transaction as the one marking
the index as invalid, but there could be a point in reindexing the
invalid index whose drop has failed, and this adds more complexity to
the patch.
- relreplident is switched to REPLICA_IDENTITY_NOTHING, which is the
behavior we have now after an index is dropped, even if there is a
primary key.
- The CCI done even if ri_type is not updated in index_drop() may look
useless, but the CCI will happen in any case as switching the replica
identity of a table to NOTHING resets pg_index.indisreplident for an
index previously used. 

Thoughts?
--
Michael
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index c1581ad178..b163a82c52 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -61,6 +61,8 @@ extern void ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_
 
 extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass);
 
+extern void SetRelationReplIdent(Oid relationId, char ri_type);
+
 extern ObjectAddress renameatt(RenameStmt *stmt);
 
 extern ObjectAddress RenameConstraint(RenameStmt *stmt);
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1be27eec52..d344a070ce 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2195,6 +2195,18 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)
 	if (userIndexRelation->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
 		RelationDropStorage(userIndexRelation);
 
+	/*
+	 * If the index to be dropped is marked as indisreplident, reset
+	 * pg_class.relreplident for the parent table.
+	 */
+	if (userIndexRelation->rd_index->indisreplident)
+	{
+		SetRelationReplIdent(heapId, REPLICA_IDENTITY_NOTHING);
+
+		/* make the update visible */
+		CommandCounterIncrement();
+	}
+
 	/*
 	 * Close and flush the index's relcache entry, to ensure relcache doesn't
 	 * try to rebuild it while we're deleting catalog entries. We keep the
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index cd989c95e5..e11e8baf89 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2976,6 +2976,41 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
 	table_close(relationRelation, RowExclusiveLock);
 }
 
+/*
+ * SetRelationReplIdent
+ *		Set the value of the relation's relreplident field in pg_class.
+ *
+ * NOTE: caller must be holding an appropriate lock on the relation.
+ * ShareUpdateExclusiveLock is sufficient.
+ */
+void
+SetRelationReplIdent(Oid relationId, char ri_type)
+{
+	Relation	relationRelation;
+	HeapTuple	tuple;
+	Form_pg_class classtuple;
+
+	/*
+	 * Fetch a modifiable copy of the tuple, modify it, update pg_class.
+	 */
+	relationRelation = table_open(RelationRelationId, RowExclusiveLock);
+	tuple = SearchSysCacheCopy1(RELOID,
+ObjectIdGetDatum(relationId));
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache lookup failed for relation \"%s\"",
+			 get_rel_name(relationId));
+	classtuple = (Form_pg_class) GETSTRUCT(tuple);
+
+	if (classtuple->relreplident != ri_type)
+	{
+		classtuple->relreplident = ri_type;
+		CatalogTupleUpdate(relationRelation, &tuple->t_self, tuple);
+	}
+
+	heap_freetuple(tuple);
+	table_close(relationRelation, RowExclusiveLock);
+}
+
 /*
  *		renameatt_check			- basic sanity checks before attribute rename
  */
@@ -14384,31 +14419,12 @@ relation_mark_replica_identity(Relation rel, char ri_type, Oid indexOid,
 			   bool is_internal)
 {
 	Relation	pg_index;
-	Relation	pg_class;
-	HeapTuple	pg_class_tuple;
 	HeapTuple	pg_index_tuple;
-	Form_pg_class pg_class_form;
 	Form_pg_index pg_index_form;
-
 	ListCell   *index;
 
-	/*
-	 * Check whether relreplident has changed, and update it if so.
-	 */
-	pg_class = table_open(RelationRelationId, RowExclusiveLock);
-	pg_class_tuple = SearchSysCacheCopy1(RELOID,
-		 ObjectIdGetDatum(RelationGetRelid(rel)));
-	if (!HeapTupleIsValid(pg_class_tuple))
-		elog(ERROR, "cache lookup failed for relation \"%s\"",
-			 RelationGetRelationName(rel));
-	pg_class_form = (Form_pg_class) GETSTRUCT(pg_class_tuple);
-	if (pg_class_form->relreplident != ri_type)
-	{
-		pg_class_form->relreplident = ri_type;
-		CatalogTupleUpdate(pg_class, &pg_class_tuple->t_self, pg_class_tuple);
-