Re: Asynchronous Append on postgres_fdw nodes.

2021-02-14 Thread Etsuro Fujita
On Fri, Feb 12, 2021 at 5:30 PM Kyotaro Horiguchi
 wrote:
> It seems too specific to async Append, so I look at it as a PoC of the
> mechanism.

Are you saying that the patch only reconsiders async ForeignScans?

> It creates a hash table keyed by connection umid to record the planids
> run on the connection, triggered by the core planner via a dedicated
> API function.  It seems to me that ConnCacheEntry.state can hold that,
> and the hash is not needed at all.

I think a good thing about the hash table is that it can be used by
other FDWs that support async execution in a similar way to
postgres_fdw, so they don’t need to create their own hash tables.  But
I’d like to know about the idea of using ConnCacheEntry.  Could you
elaborate a bit more about that?

> | postgresReconsiderAsyncForeignScan(ForeignScanState *node, AsyncContext *acxt)
> | {
> | ...
> |   /*
> |    * If the connection used for the ForeignScan node is used in other parts
> |    * of the query plan tree except async subplans of the parent Append node,
> |    * disable async execution of the ForeignScan node.
> |    */
> |   if (!bms_is_subset(fsplanids, asyncplanids))
> |           return false;
>
> This would be a reasonable restriction.

Cool!

> |   /*
> |    * If the subplans of the Append node are all async-capable, and use the
> |    * same connection, then we won't execute them asynchronously.
> |    */
> |   if (requestor->as_nasyncplans == requestor->as_nplans &&
> |       !bms_nonempty_difference(asyncplanids, fsplanids))
> |           return false;
>
> Is this the correct restriction?  I understand that the currently
> intended restriction is that one connection accepts at most one FDW-scan
> node.  This looks like something different...

In sharding setups, people place multiple partitions on the same remote
PostgreSQL server, so the patch allows multiple postgres_fdw ForeignScans
beneath an Append that use the same connection to be executed
asynchronously, like this:

postgres=# create table t1 (a int, b int, c text);
postgres=# create table t2 (a int, b int, c text);
postgres=# create table t3 (a int, b int, c text);
postgres=# create foreign table p1 (a int, b int, c text) server server1 options (table_name 't1');
postgres=# create foreign table p2 (a int, b int, c text) server server2 options (table_name 't2');
postgres=# create foreign table p3 (a int, b int, c text) server server2 options (table_name 't3');
postgres=# create table pt (a int, b int, c text) partition by range (a);
postgres=# alter table pt attach partition p1 for values from (10) to (20);
postgres=# alter table pt attach partition p2 for values from (20) to (30);
postgres=# alter table pt attach partition p3 for values from (30) to (40);
postgres=# insert into p1 select 10 + i % 10, i, to_char(i, 'FM')
from generate_series(0, 99) i;
postgres=# insert into p2 select 20 + i % 10, i, to_char(i, 'FM')
from generate_series(0, 99) i;
postgres=# insert into p3 select 30 + i % 10, i, to_char(i, 'FM')
from generate_series(0, 99) i;
postgres=# analyze pt;

postgres=# explain verbose select count(*) from pt;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=314.25..314.26 rows=1 width=8)
   Output: count(*)
   ->  Append  (cost=100.00..313.50 rows=300 width=0)
         ->  Async Foreign Scan on public.p1 pt_1  (cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t1
         ->  Async Foreign Scan on public.p2 pt_2  (cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t2
         ->  Async Foreign Scan on public.p3 pt_3  (cost=100.00..104.00 rows=100 width=0)
               Remote SQL: SELECT NULL FROM public.t3
(9 rows)

For this query, p2 and p3, which use the same connection, are scanned
asynchronously!

But if all the subplans of an Append are async postgres_fdw
ForeignScans that use the same connection, they won’t be parallelized
at all, and the overhead of async execution may cause a performance
degradation.  So the patch disables async execution for them in that
case, using the code bit quoted above.

Thanks for the review!

Best regards,
Etsuro Fujita




Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-14 Thread Michael Paquier
On Sat, Feb 13, 2021 at 09:33:48PM -0300, Ranier Vilela wrote:
> On Sat, Feb 13, 2021 at 8:32 PM Michael Paquier  wrote:
> 
>> You are missing the point here.  What you are proposing here would not
>> be backpatched.  However, reusing the same words as upthread, this has
>> a cost in terms of *future* maintenance.  In short, any *future*
>> bug fix that would need to be back-patched and that uses this function
>> or touches its area would result in a conflict.
>
> Ok. +1 for back-patching.

Please take the time to read my previous email again.

And also, please take the time to actually test the patches you send,
because the list of things getting broken is impressive.  At least
this confirms that the internals of cryptohash.c generate an error as
they should when given those incorrect sizes :)

git diff --check complains, in various places.

@@ -330,7 +330,8 @@ SendBackupManifest(backup_manifest_info *manifest)
 * twice.
 */
manifest->still_checksumming = false;
-   if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf) < 0)
+   if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf,
+                           sizeof(checksumbuf) - 1) < 0)
elog(ERROR, "failed to finalize checksum of backup manifest");
This breaks backup manifests, due to an incorrect calculation.

@@ -78,7 +78,8 @@ pg_md5_hash(const void *buff, size_t len, char *hexsum)
if (pg_cryptohash_init(ctx) < 0 ||
pg_cryptohash_update(ctx, buff, len) < 0 ||
-   pg_cryptohash_final(ctx, sum) < 0)
+   pg_cryptohash_final(ctx, sum,
+                       sizeof(sum) - 1) < 0)
This one breaks MD5 hashing, due to an incorrect size calculation,
again.

@@ -51,7 +51,8 @@ scram_HMAC_init(scram_HMAC_ctx *ctx, const uint8 *key, int keylen)
return -1;
if (pg_cryptohash_init(sha256_ctx) < 0 ||
pg_cryptohash_update(sha256_ctx, key, keylen) < 0 ||
-   pg_cryptohash_final(sha256_ctx, keybuf) < 0)
+   pg_cryptohash_final(sha256_ctx, keybuf,
+                       sizeof(keybuf) - 1) < 0)
[...]
-   if (pg_cryptohash_final(ctx->sha256ctx, h) < 0)
+   if (pg_cryptohash_final(ctx->sha256ctx, h,
+                           sizeof(h) - 1) < 0)
This breaks SCRAM authentication, for the same reason.  In three
places.

I think that in pg_checksum_final() we had better save the digest
length in "retval" before calling pg_cryptohash_final(), and use it
for the size passed down.

contrib/uuid-ossp/ fails to compile.

contrib/pgcrypto/ fix is incorrect, requiring h->result_size(h) in
three places.

I think that as a whole we should try to minimize the number of times
we use any DIGEST_LENGTH variable, relying as much as possible on sizeof().
This v4 is a mixed bag in that regard.  Once you switch to that, there is an
interesting result with uuid-ossp, where you can notice that there is
a match between the size of dce_uuid_t and MD5_DIGEST_LENGTH.

No need to send a new patch; the attached takes care of those
issues, and it is correctly indented.  I'll look at it again
tomorrow, it is already late here.
--
Michael
diff --git a/src/include/common/cryptohash.h b/src/include/common/cryptohash.h
index 32d7784ca5..541dc844c8 100644
--- a/src/include/common/cryptohash.h
+++ b/src/include/common/cryptohash.h
@@ -32,7 +32,7 @@ typedef struct pg_cryptohash_ctx pg_cryptohash_ctx;
 extern pg_cryptohash_ctx *pg_cryptohash_create(pg_cryptohash_type type);
 extern int	pg_cryptohash_init(pg_cryptohash_ctx *ctx);
 extern int	pg_cryptohash_update(pg_cryptohash_ctx *ctx, const uint8 *data, size_t len);
-extern int	pg_cryptohash_final(pg_cryptohash_ctx *ctx, uint8 *dest);
+extern int	pg_cryptohash_final(pg_cryptohash_ctx *ctx, uint8 *dest, size_t len);
 extern void pg_cryptohash_free(pg_cryptohash_ctx *ctx);
 
 #endif			/* PG_CRYPTOHASH_H */
diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index 8d857f39df..b9b6d464a0 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -1429,7 +1429,7 @@ scram_mock_salt(const char *username)
 	if (pg_cryptohash_init(ctx) < 0 ||
 		pg_cryptohash_update(ctx, (uint8 *) username, strlen(username)) < 0 ||
 		pg_cryptohash_update(ctx, (uint8 *) mock_auth_nonce, MOCK_AUTH_NONCE_LEN) < 0 ||
-		pg_cryptohash_final(ctx, sha_digest) < 0)
+		pg_cryptohash_final(ctx, sha_digest, sizeof(sha_digest)) < 0)
 	{
 		pg_cryptohash_free(ctx);
 		return NULL;
diff --git a/src/backend/replication/backup_manifest.c b/src/backend/replication/backup_manifest.c
index 0cefd181b5..32bb0efb3d 100644
--- a/src/backend/replication/backup_manifest.c
+++ b/src/backend/replication/backup_manifest.c
@@ -330,12 +330,13 @@ SendBackupManifest(backup_manifest_info *manifest)
 	 * twice.
 	 */
 	manifest->still_checksumming = false;
-	if (pg

Re: How to get Relation tuples in C function

2021-02-14 Thread Michael Paquier
On Sun, Feb 14, 2021 at 09:29:08AM +0800, Andy Fan wrote:
> Thank you Tom for the reply.  What would be the difference between
> SPI and "write a pure SQL UDF" and calling it with DirectFunctionCall1? I
> just ran into a similar situation a few days ago.  Currently I think
> DirectFunctionCall1 doesn't need to maintain a connection but SPI has to
> do that.

Hard to say without knowing your use case.  A PL function is simpler
to maintain than a C function, though usually less performant
purely in terms of its operations.  A SQL function could
end up being inlined, allowing the planner to apply optimizations
as it would know the function body.  Going with SPI has the advantage
that the code works without any changes across major versions,
which is a no-brainer when it comes to long-term maintenance.
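
For concreteness, here is a minimal sketch of the SPI route (the function
and table names are made up for illustration, and error handling is
abbreviated):

    #include "postgres.h"
    #include "fmgr.h"
    #include "executor/spi.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(count_rows);

    Datum
    count_rows(PG_FUNCTION_ARGS)
    {
        int64       nrows;

        if (SPI_connect() != SPI_OK_CONNECT)
            elog(ERROR, "SPI_connect failed");

        /* Read-only query; SPI_processed holds the row count afterwards. */
        if (SPI_execute("SELECT * FROM my_table", true, 0) != SPI_OK_SELECT)
            elog(ERROR, "SPI_execute failed");

        nrows = (int64) SPI_processed;

        SPI_finish();

        PG_RETURN_INT64(nrows);
    }

The SPI interface itself is what stays stable across major versions, which
is the long-term maintenance advantage mentioned above.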
--
Michael




Re: Some regular-expression performance hacking

2021-02-14 Thread Joel Jacobson
On Sat, Feb 13, 2021, at 22:11, Tom Lane wrote:
>0001-invent-rainbow-arcs-2.patch
>0002-recognize-matchall-NFAs-2.patch

I've successfully tested both patches against the 1.5M regexes-in-the-wild 
dataset.

Out of the 1489489 (pattern, text string) pairs tested,
there was only one single deviation:

This 100577-byte regex (pattern_id = 207811)...

\.(ac|com\.ac|edu\.ac|gov\.ac|net\.ac|mil\.ac| ... 
|wmflabs\.org|yolasite\.com|za\.net|za\.org)$

...previously raised...

error invalid regular expression: regular expression is too complex

...but now goes through:

is_match  => t
captured  => {de}
error invalid regular expression: regular expression is too complex => 

Nice. The patched regex engine is apparently capable of handling even more 
complex regexes than before.

The test that found the deviation checks each (pattern, text string) pair
individually, to catch errors. But I've also made a separate query that tests
only the regexes known not to cause errors; that allows testing all regexes
in one big query, which fully utilizes the CPU cores and also runs quicker.

Below is the result of the performance test query:

\timing

SELECT
  tests.is_match IS NOT DISTINCT FROM (subjects.subject ~ patterns.pattern),
  tests.captured IS NOT DISTINCT FROM regexp_match(subjects.subject, patterns.pattern),
  COUNT(*)
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
GROUP BY 1,2
ORDER BY 1,2;

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:

 ?column? | ?column? |  count
----------+----------+---------
 t        | t        | 1448212
(1 row)

Time: 592196.145 ms (09:52.196)

-- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:

 ?column? | ?column? |  count
----------+----------+---------
 t        | t        | 1448212
(1 row)

Time: 461739.364 ms (07:41.739)

That's an impressive 22% speed-up!

/Joel

Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-14 Thread Ranier Vilela
On Sun, Feb 14, 2021 at 8:22 AM Michael Paquier  wrote:

> On Sat, Feb 13, 2021 at 09:33:48PM -0300, Ranier Vilela wrote:
> > On Sat, Feb 13, 2021 at 8:32 PM Michael Paquier <mich...@paquier.xyz> wrote:
> >
> >> You are missing the point here.  What you are proposing here would not
> >> be backpatched.  However, reusing the same words as upthread, this has
> >> a cost in terms of *future* maintenance.  In short, any *future*
> >> bug fix that would need to be back-patched and that uses this function
> >> or touches its area would result in a conflict.
> >
> > Ok. +1 for back-patching.
>
> Please take the time to read my previous email again.
>
> And also, please take the time to actually test the patches you send,
> because the list of things getting broken is impressive.  At least
> this confirms that the internals of cryptohash.c generate an error as
> they should when given those incorrect sizes :)
>
> git diff --check complains, in various places.
>
> @@ -330,7 +330,8 @@ SendBackupManifest(backup_manifest_info *manifest)
>  * twice.
>  */
> manifest->still_checksumming = false;
> -   if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf) < 0)
> +   if (pg_cryptohash_final(manifest->manifest_ctx, checksumbuf,
> +                           sizeof(checksumbuf) - 1) < 0)
> elog(ERROR, "failed to finalize checksum of backup manifest");
> This breaks backup manifests, due to an incorrect calculation.
>
Bad habit on my part: sizeof - 1 is what I use with strings.


> I think that in pg_checksum_final() we had better save the digest
> length in "retval" before calling pg_cryptohash_final(), and use it
> for the size passed down.
>
I would like to see pg_checksum_final() look like this:

		case CHECKSUM_TYPE_SHA224:
			retval = PG_SHA224_DIGEST_LENGTH;
			break;
		case CHECKSUM_TYPE_SHA256:
			retval = PG_SHA256_DIGEST_LENGTH;
			break;
		case CHECKSUM_TYPE_SHA384:
			retval = PG_SHA384_DIGEST_LENGTH;
			break;
		case CHECKSUM_TYPE_SHA512:
			retval = PG_SHA512_DIGEST_LENGTH;
			break;
		default:
			return -1;
	}

	if (pg_cryptohash_final(context->raw_context.c_sha2,
							output, retval) < 0)
		return -1;
	pg_cryptohash_free(context->raw_context.c_sha2);
What do you think?

regards,
Ranier Vilela


Re: Some regular-expression performance hacking

2021-02-14 Thread Tom Lane
"Joel Jacobson"  writes:
> I've successfully tested both patches against the 1.5M regexes-in-the-wild 
> dataset.
> Out of the 1489489 (pattern, text string) pairs tested,
> there was only one single deviation:
> This 100577 bytes big regex (pattern_id = 207811)...
> ...
> ...previously raised...
> error invalid regular expression: regular expression is too complex
> ...but now goes through:

> Nice. The patched regex engine is apparently capable of handling even more 
> complex regexes than before.

Yeah.  There are various limitations that can lead to REG_ETOOBIG, but the
main ones are "too many states" and "too many arcs".  The RAINBOW change
directly reduces the number of arcs and thus makes larger regexes feasible.
I'm sure it's coincidental that the one such example you captured happens
to be fixed by this change, but hey I'll take it.

regards, tom lane




Re: Extensibility of the PostgreSQL wire protocol

2021-02-14 Thread Dave Cramer
On Thu, 11 Feb 2021 at 09:28, Robert Haas  wrote:

> On Wed, Feb 10, 2021 at 2:04 PM Tom Lane  wrote:
> > That is a spot-on definition of where I do NOT want to end up.  Hooks
> > everywhere and enormous extensions that break anytime we change anything
> > in the core.  It's not really clear that anybody is going to find that
> > more maintainable than a straight fork, except to the extent that it
> > enables the erstwhile forkers to shove some of their work onto the PG
> > community.
>
> +1.
>
> Making the lexer and parser extensible seems desirable to me. It would
> be beneficial not only for companies like EDB and Amazon that might
> want to extend the grammar in various ways, but also for extension
> authors. However, it's vastly harder than Jan's proposal to make the
> wire protocol pluggable. The wire protocol is pretty well-isolated
> from the rest of the system. As long as you can get queries out of the
> packets the client sends and package up the results to send back, it's
> all good.


I would have to disagree that the wire protocol is well-isolated. Sending
and receiving are not handled in a single file, and the message codes are
not even named constants, so trying to find a specific one is difficult.

Anything that would clean this up would be a benefit.
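
For context, the protocol message types are single-character codes matched
directly in switch statements, roughly of this shape (an illustration, not
an exact excerpt from the backend):

    switch (firstchar)
    {
        case 'Q':               /* simple query */
            /* ... */
            break;
        case 'P':               /* parse, extended query protocol */
            /* ... */
            break;
        case 'X':               /* terminate */
            /* ... */
            break;
    }

so there is no single header of named constants to search for.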


> That being said, I'm not in favor of transferring maintenance work to
> the community for this set of hooks any more than I am for something
> on the parsing side. In general, I'm in favor of as much extensibility
> as we can reasonably create, but with a complicated proposal like this
> one, the community should expect to be able to get something out of
> it. And so far what I hear Jan saying is that these hooks could in
> theory be used for things other than Amazon's proprietary efforts and
> those things could in theory bring benefits to the community, but
> there are no actual plans to do anything with this that would benefit
> anyone other than Amazon. Which seems to bring us right back to
> expecting the community to maintain things for the benefit of
> third-party forks.
>

If this proposal brought us the ability to stream results, that would be a
huge plus!

Dave Cramer
www.postgres.rocks



Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()

2021-02-14 Thread Peter Geoghegan
On Sat, Feb 6, 2021 at 7:40 PM Andres Freund  wrote:
> Looks like a mistake on my part... Probably a rename regex that somehow
> went wrong - I went back and forth on those names way too many
> times. Want me to push the fix?

Spotted another one: Shouldn't ReadNextFullTransactionId() actually be
called ReadNewFullTransactionId()? Actually, the inverse approach
looks like it produces fewer inconsistencies: you could instead rename
ReadNewTransactionId() to ReadNextTransactionId().

-- 
Peter Geoghegan




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-14 Thread Justin Pryzby
On Tue, Feb 09, 2021 at 04:35:03AM +, tsunakawa.ta...@fujitsu.com wrote:
> Rebased to HEAD with the following modifications.  It passes make check in 
> the top directory and contrib/postgres_fdw.
> That said, with the reviews from some people and good performance results, I 
> think this can be ready for committer.

This is crashing during fdw check.
http://cfbot.cputube.org/andrey-lepikhov.html

Maybe it's related to this patch:
|commit 6214e2b2280462cbc3aa1986e350e167651b3905
|Fix permission checks on constraint violation errors on partitions.
|Security: CVE-2021-3393

TRAP: FailedAssertion("n >= 0 && n < list->length", File: "../../src/include/nodes/pg_list.h", Line: 259, PID: 19780)

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7fd33a557801 in __GI_abort () at abort.c:79
#2  0x55f7f53bbc88 in ExceptionalCondition (conditionName=conditionName@entry=0x7fd33b81bc40 "n >= 0 && n < list->length", errorType=errorType@entry=0x7fd33b81b698 "FailedAssertion", fileName=fileName@entry=0x7fd33b81be70 "../../src/include/nodes/pg_list.h", lineNumber=lineNumber@entry=259) at assert.c:69
#3  0x7fd33b816b54 in list_nth_cell (n=<optimized out>, list=<optimized out>) at ../../src/include/nodes/pg_list.h:259
#4  list_nth (n=<optimized out>, list=<optimized out>) at ../../src/include/nodes/pg_list.h:281
#5  exec_rt_fetch (estate=<optimized out>, rti=<optimized out>) at ../../src/include/executor/executor.h:558
#6  postgresBeginForeignCopy (mtstate=<optimized out>, resultRelInfo=<optimized out>) at postgres_fdw.c:2208
#7  0x55f7f5114bb4 in ExecInitRoutingInfo 
(mtstate=mtstate@entry=0x55f7f710a508, estate=estate@entry=0x55f7f71a7d50, 
proute=proute@entry=0x55f7f710a720, dispatch=dispatch@entry=0x55f7f710a778, 
partRelInfo=partRelInfo@entry=0x55f7f710eb20, partidx=partidx@entry=0) at 
execPartition.c:1004
#8  0x55f7f511618d in ExecInitPartitionInfo (partidx=0, 
rootResultRelInfo=0x55f7f710a278, dispatch=0x55f7f710a778, 
proute=0x55f7f710a720, estate=0x55f7f71a7d50, mtstate=0x55f7f710a508) at 
execPartition.c:742
#9  ExecFindPartition () at execPartition.c:400
#10 0x55f7f50a2718 in CopyFrom () at copyfrom.c:857
#11 0x55f7f50a1b06 in DoCopy () at copy.c:299

(gdb) up
#7  0x55f7f5114bb4 in ExecInitRoutingInfo 
(mtstate=mtstate@entry=0x55f7f710a508, estate=estate@entry=0x55f7f71a7d50, 
proute=proute@entry=0x55f7f710a720, dispatch=dispatch@entry=0x55f7f710a778, 
partRelInfo=partRelInfo@entry=0x55f7f710eb20, partidx=partidx@entry=0) at 
execPartition.c:1004
1004
partRelInfo->ri_FdwRoutine->BeginForeignCopy(mtstate, partRelInfo);
(gdb) p partRelInfo->ri_RangeTableIndex
$7 = 0
(gdb) p *estate->es_range_table
$9 = {type = T_List, length = 1, max_length = 5, elements = 0x55f7f717a2c0, 
initial_elements = 0x55f7f717a2c0}

-- 
Justin




Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()

2021-02-14 Thread Thomas Munro
On Mon, Feb 15, 2021 at 10:02 AM Peter Geoghegan  wrote:
> On Sat, Feb 6, 2021 at 7:40 PM Andres Freund  wrote:
> > Looks like a mistake on my part... Probably a rename regex that somehow
> > went wrong - I went back and forth on those names way too many
> > times. Want me to push the fix?
>
> Spotted another one: Shouldn't ReadNextFullTransactionId() actually be
> called ReadNewFullTransactionId()? Actually, the inverse approach
> looks like it produces fewer inconsistencies: you could instead rename
> ReadNewTransactionId() to ReadNextTransactionId().

I prefer "next", because that's in the name of the variable it reads,
and the variable name seemed to me to have a more obvious meaning.
That's why I went for that name in commit 2fc7af5e966.  I do agree
that it's slightly strange that the 32 and 64 bit versions differ
here.  I'd vote for renaming the 32 bit version to match...




Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()

2021-02-14 Thread Peter Geoghegan
On Sun, Feb 14, 2021 at 2:08 PM Thomas Munro  wrote:
> I prefer "next", because that's in the name of the variable it reads,
> and the variable name seemed to me to have a more obvious meaning.
> That's why I went for that name in commit 2fc7af5e966.  I do agree
> that it's slightly strange that the 32 and 64 bit versions differ
> here.  I'd vote for renaming the 32 bit version to match...

I was just going to say the same thing myself.

Please do the honors if you have time...

-- 
Peter Geoghegan




Re: WIP: WAL prefetch (another approach)

2021-02-14 Thread Tomas Vondra




On 2/13/21 10:39 PM, Stephen Frost wrote:

> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
>>> Yeah, that's a good point. I think it'd make sense to keep track of recent
>>> FPIs and skip prefetching such blocks. But how exactly should we implement
>>> that, how many blocks do we need to track? If you get an FPI, how long
>>> should we skip prefetching of that block?
>>>
>>> I don't think the history needs to be very long, for two reasons. Firstly,
>>> the usual pattern is that we have FPI + several changes for that block
>>> shortly after it. Secondly, maintenance_io_concurrency limits this naturally
>>> - after crossing that, redo should place the FPI into shared buffers,
>>> allowing us to skip the prefetch.
>>>
>>> So I think using maintenance_io_concurrency is sufficient. We might track
>>> more buffers to allow skipping prefetches of blocks that were evicted from
>>> shared buffers, but that seems like an overkill.
>>>
>>> However, maintenance_io_concurrency can be quite high, so just a simple
>>> queue is not very suitable - searching it linearly for each block would be
>>> too expensive. But I think we can use a simple hash table, tracking
>>> (relfilenode, block, LSN), over-sized to minimize collisions.
>>>
>>> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
>>> and whenever we prefetch a block or find an FPI, we simply add the block to
>>> the array as determined by hash(relfilenode, block)
>>>
>>>     hashtable[hash(...)] = {relfilenode, block, LSN}
>>>
>>> and then when deciding whether to prefetch a block, we look at that one
>>> position. If the (relfilenode, block) match, we check the LSN and skip the
>>> prefetch if it's sufficiently recent. Otherwise we prefetch.
>>
>> I'm a bit doubtful this is really needed at this point. Yes, the
>> prefetching will do a buffer table lookup - but it's a lookup that
>> already happens today. And the patch already avoids doing a second
>> lookup after prefetching (by optimistically caching the last Buffer id,
>> and re-checking).
>
> I agree that when a page is looked up, and found, in the buffer table
> that the subsequent cacheing of the buffer id in the WAL records does a
> good job of avoiding having to re-do that lookup.  However, that isn't
> the case which was being discussed here or what Tomas's suggestion was
> intended to address.
>
> What I pointed out up-thread and what's being discussed here is what
> happens when the WAL contains a few FPIs and a few regular WAL records
> which are mixed up and not in ideal order.  When that happens, with this
> patch, the FPIs will be ignored, the regular WAL records will reference
> blocks which aren't found in shared buffers (yet) and then we'll both
> issue pre-fetches for those and end up having spent effort doing a
> buffer lookup that we'll later re-do.



The question is how common this pattern actually is - I don't know. As 
noted, the non-FPI would have to be fairly close to the FPI, i.e. within 
the wal_decode_buffer_size, to actually cause measurable harm.



> To address the unnecessary syscalls we really just need to keep track of
> any FPIs that we've seen between the point where the prefetching
> is happening and the point where the replay is being done- once replay
> has replayed an FPI, our buffer lookup will succeed and we'll cache the
> buffer that the FPI is at- in other words, only wal_decode_buffer_size
> amount of WAL needs to be considered.



Yeah, that's essentially what I proposed.
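
To make the proposal concrete, here is a rough sketch of such a lossy,
fixed-size table. The struct and function names, the hash function, and the
exact LSN test are my own illustrative assumptions, not code from the patch:

    typedef struct XLogPrefetchSkipEntry
    {
        RelFileNode rnode;          /* relation */
        BlockNumber blkno;          /* block within the relation */
        XLogRecPtr  lsn;            /* LSN at which the FPI/prefetch was seen */
    } XLogPrefetchSkipEntry;

    /* over-sized array: 2 * maintenance_io_concurrency single-entry buckets */
    static XLogPrefetchSkipEntry *skip_table;
    static int  skip_table_size;

    static inline uint32
    skip_hash(RelFileNode rnode, BlockNumber blkno)
    {
        /* any cheap hash will do; collisions simply overwrite the bucket */
        return (rnode.relNode * 2654435761u + blkno) % skip_table_size;
    }

    /* Record an FPI (or an issued prefetch) for this block. */
    static void
    skip_table_note(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
    {
        XLogPrefetchSkipEntry *e = &skip_table[skip_hash(rnode, blkno)];

        e->rnode = rnode;
        e->blkno = blkno;
        e->lsn = lsn;
    }

    /* Decide whether prefetching this block can be skipped. */
    static bool
    skip_table_check(RelFileNode rnode, BlockNumber blkno, XLogRecPtr replay_lsn)
    {
        XLogPrefetchSkipEntry *e = &skip_table[skip_hash(rnode, blkno)];

        /*
         * If an FPI for this block lies between the replay position and the
         * prefetch position, replay will load the page into shared buffers
         * before any later record needs it, so prefetching is pointless.
         */
        return RelFileNodeEquals(e->rnode, rnode) &&
            e->blkno == blkno &&
            e->lsn >= replay_lsn;
    }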


> We could further leverage this tracking of FPIs, to skip the prefetch
> syscalls, by cacheing what later records address the blocks that have
> FPIs earlier in the queue with the FPI record and then when replay hits
> the FPI and loads it into shared_buffers, it could update the other WAL
> records in the queue with the buffer id of the page, allowing us to very
> likely avoid having to do another lookup later on.



This seems like over-engineering, at least for v1.


>> I think there's potential for some significant optimization going
>> forward, but I think it's basically optimization over what we're doing
>> today. As this is already a nontrivial patch, I'd argue for doing so
>> separately.


> This seems like a great optimization, albeit a fair bit of code, for a
> relatively uncommon use-case, specifically where full page writes are
> disabled or very large checkpoints.  As that's the case though, I would
> think it's reasonable to ask that it go out of its way to avoid slowing
> down the more common configurations, particularly since it's proposed to
> have it on by default (which I agree with, provided it ends up improving
> the common cases, which I think the suggestions above would certainly
> make it more likely to do).



I'm OK to do some benchmarking, but it's not quite clear to me why it
matters whether the checkpoints are smaller than shared buffers. IMO what
matters is how "localized" the updates are, i.e. how likely it is to hit
the same page repeatedly (in a short amount of time). Regular pgbench is
not very suitable for that, but some non-uniform distribution should do
the trick.

Re: WIP: WAL prefetch (another approach)

2021-02-14 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 2/13/21 10:39 PM, Stephen Frost wrote:
> >* Andres Freund (and...@anarazel.de) wrote:
> >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote:
> >>>Yeah, that's a good point. I think it'd make sense to keep track of recent
> >>>FPIs and skip prefetching such blocks. But how exactly should we implement
> >>>that, how many blocks do we need to track? If you get an FPI, how long
> >>>should we skip prefetching of that block?
> >>>
> >>>I don't think the history needs to be very long, for two reasons. Firstly,
> >>>the usual pattern is that we have FPI + several changes for that block
> >>>shortly after it. Secondly, maintenance_io_concurrency limits this 
> >>>naturally
> >>>- after crossing that, redo should place the FPI into shared buffers,
> >>>allowing us to skip the prefetch.
> >>>
> >>>So I think using maintenance_io_concurrency is sufficient. We might track
> >>>more buffers to allow skipping prefetches of blocks that were evicted from
> >>>shared buffers, but that seems like an overkill.
> >>>
> >>>However, maintenance_io_concurrency can be quite high, so just a simple
> >>>queue is not very suitable - searching it linearly for each block would be
> >>>too expensive. But I think we can use a simple hash table, tracking
> >>>(relfilenode, block, LSN), over-sized to minimize collisions.
> >>>
> >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements,
> >>>and whenever we prefetch a block or find an FPI, we simply add the block to
> >>>the array as determined by hash(relfilenode, block)
> >>>
> >>> hashtable[hash(...)] = {relfilenode, block, LSN}
> >>>
> >>>and then when deciding whether to prefetch a block, we look at that one
> >>>position. If the (relfilenode, block) match, we check the LSN and skip the
> >>>prefetch if it's sufficiently recent. Otherwise we prefetch.
> >>
> >>I'm a bit doubtful this is really needed at this point. Yes, the
> >>prefetching will do a buffer table lookup - but it's a lookup that
> >>already happens today. And the patch already avoids doing a second
> >>lookup after prefetching (by optimistically caching the last Buffer id,
> >>and re-checking).
> >
> >I agree that when a page is looked up, and found, in the buffer table
> >that the subsequent cacheing of the buffer id in the WAL records does a
> >good job of avoiding having to re-do that lookup.  However, that isn't
> >the case which was being discussed here or what Tomas's suggestion was
> >intended to address.
> >
> >What I pointed out up-thread and what's being discussed here is what
> >happens when the WAL contains a few FPIs and a few regular WAL records
> >which are mixed up and not in ideal order.  When that happens, with this
> >patch, the FPIs will be ignored, the regular WAL records will reference
> >blocks which aren't found in shared buffers (yet) and then we'll both
> >issue pre-fetches for those and end up having spent effort doing a
> >buffer lookup that we'll later re-do.
> 
> The question is how common this pattern actually is - I don't know. As
> noted, the non-FPI would have to be fairly close to the FPI, i.e. within the
> wal_decode_buffer_size, to actually cause measurable harm.

Yeah, so it'll depend on how big wal_decode_buffer_size is.  Increasing
that would certainly help to show if there ends up being a degradation
with this patch due to the extra prefetching being done.

> >To address the unnecessary syscalls we really just need to keep track of
> >any FPIs that we've seen between the point where the prefetching
> >is happening and the point where the replay is being done- once replay
> >has replayed an FPI, our buffer lookup will succeed and we'll cache the
> >buffer that the FPI is at- in other words, only wal_decode_buffer_size
> >amount of WAL needs to be considered.
> 
> Yeah, that's essentially what I proposed.

Glad I captured it correctly.

> >We could further leverage this tracking of FPIs, to skip the prefetch
> >syscalls, by cacheing what later records address the blocks that have
> >FPIs earlier in the queue with the FPI record and then when replay hits
> >the FPI and loads it into shared_buffers, it could update the other WAL
> >records in the queue with the buffer id of the page, allowing us to very
> >likely avoid having to do another lookup later on.
> 
> This seems like over-engineering, at least for v1.

Perhaps, though it didn't seem like it'd be very hard to do with the
already proposed changes to stash the buffer id in the WAL records.

> >>I think there's potential for some significant optimization going
> >>forward, but I think it's basically optimization over what we're doing
> >>today. As this is already a nontrivial patch, I'd argue for doing so
> >>separately.
> >
> >This seems like a great optimization, albeit a fair bit of code, for a
> >relatively uncommon use-case, specifically where full page writes are
> >disabled or very large checkpoints.  As that's the cas

Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()

2021-02-14 Thread Thomas Munro
On Mon, Feb 15, 2021 at 11:33 AM Peter Geoghegan  wrote:
> On Sun, Feb 14, 2021 at 2:08 PM Thomas Munro  wrote:
> > I prefer "next", because that's in the name of the variable it reads,
> > and the variable name seemed to me to have a more obvious meaning.
> > That's why I went for that name in commit 2fc7af5e966.  I do agree
> > that it's slightly strange that the 32 and 64 bit versions differ
> > here.  I'd vote for renaming the 32 bit version to match...
>
> I was just going to say the same thing myself.
>
> Please do the honors if you have time...

Done.




Re: Preventing free space from being reused

2021-02-14 Thread Noah Bergbauer
> I would suggest to take a look at the BRIN opclass multi-minmax currently
> in development.

Thank you, this does look like it could help a lot with BRIN performance in
this situation!

But again, if index performance alone was the only issue, then I would
simply accept the space overhead and switch to btree. However, the disk
fragmentation issue still remains and is significant. It is also amplified
in my use case due to using ZFS, mostly for compression. But it is worth
it: I am currently observing a 13x compression ratio (when comparing disk
space reported by du and select sum(octet_length(x)), so this does not
include the false gains from compressing padding). But in general, any
variable-sized append-only workload suffers from this fragmentation
problem. It's just that with filesystem compression, there is no longer a
good reason to fill up those holes and accept the fragmentation.

To be clear, the main reason why I even brought my questions to this
mailing list is that I don't know how to (correctly) get past the check in
heap_getnext (see my first email) when implementing the workaround as a
custom table access method. A reloption could theoretically disable free
space maps entirely for some added efficiency, but I'm inclined to agree
that this is not really needed.
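
For what it's worth, the insert-side half of that workaround can be sketched
as a table AM callback that mirrors heapam_tuple_insert() but forces the
flag. The names here are illustrative, and the delegation back to
heap_insert() is my assumption about how such an AM would be built (it does
not address the heap_getnext() question):

    static void
    appendam_tuple_insert(Relation relation, TupleTableSlot *slot,
                          CommandId cid, int options, BulkInsertState bistate)
    {
        bool        shouldFree = true;
        HeapTuple   tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);

        /* Never reuse space found via the FSM; always extend the relation. */
        options |= HEAP_INSERT_SKIP_FSM;

        slot->tts_tableOid = RelationGetRelid(relation);
        tuple->t_tableOid = slot->tts_tableOid;

        heap_insert(relation, tuple, cid, options, bistate);
        ItemPointerCopy(&tuple->t_self, &slot->tts_tid);

        if (shouldFree)
            pfree(tuple);
    }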



On Sat, Feb 13, 2021 at 1:36 PM John Naylor 
wrote:

> On Fri, Feb 12, 2021 at 6:21 PM Noah Bergbauer 
> wrote:
> >
> > A btree index on the same column is 700x the size of BRIN, or 10% of
> relation itself. It does not perform significantly better than BRIN. The
> issue here is twofold: not only does slotting these tuples into older pages
> significantly reduce the effectiveness of BRIN, it also causes
> fragmentation on disk. Ultimately, this is why CLUSTER exists. One way to
> look at this situation is that my data is inserted exactly in index order,
> but Postgres keeps un-clustering it for reasons that are valid in general
> (don't waste disk space) but don't apply at all in this case (the file
> system uses compression, no space is wasted).
> >
> > Any alternative ideas would of course be much appreciated! But at the
> moment HEAP_INSERT_SKIP_FSM seems like the most practical solution to me.
>
> I would suggest to take a look at the BRIN opclass multi-minmax currently
> in development. It's designed to address that exact situation, and more
> review would be welcome:
>
> https://commitfest.postgresql.org/32/2523/
>
> --
> John Naylor
> EDB: http://www.enterprisedb.com
>


Re: GlobalVisIsRemovableFullXid() vs GlobalVisCheckRemovableXid()

2021-02-14 Thread Peter Geoghegan
On Sun, Feb 14, 2021 at 4:21 PM Thomas Munro  wrote:
> Done.

Thanks.

-- 
Peter Geoghegan




GCC warning in back branches

2021-02-14 Thread Thomas Munro
Hi,

guc.c: In function ‘RestoreGUCState’:
guc.c:9455:4: error: ‘varsourceline’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 9455 |    set_config_sourcefile(varname, varsourcefile, varsourceline);
      |    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I propose the attached.
From d0ce66b0a24c99ed94f8caac32084b44990863ce Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 15 Feb 2021 14:04:58 +1300
Subject: [PATCH 1/2] Fix compiler warning in back branches.

Back-patch a tiny bit of commit fbb2e9a0 into 9.6 and 10, to silence an
uninitialized variable warning from GCC 10.2.
---
 src/backend/utils/misc/guc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9f47bd0aea..9e26cbd18e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -9437,6 +9437,8 @@ RestoreGUCState(void *gucstate)
 		if (varsourcefile[0])
 			read_gucstate_binary(&srcptr, srcend,
  &varsourceline, sizeof(varsourceline));
+		else
+			varsourceline = 0;
 		read_gucstate_binary(&srcptr, srcend,
 			 &varsource, sizeof(varsource));
 		read_gucstate_binary(&srcptr, srcend,
-- 
2.30.0



RE: [PoC] Non-volatile WAL buffer

2021-02-14 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> I've done some performance benchmarks with the master and NTT v4
> patch. Let me share the results.
> 
...
>         master   NTT     master-unlogged
> 32      113209   67107   154298
> 64      144880   54289   178883
> 96      151405   50562   180018
>
> "master-unlogged" is the same setup as "master" except for using
> unlogged tables (using the --unlogged-tables pgbench option). The TPS
> increased by about 20% compared to the "master" case (i.e., the logged
> table case). The reason why I experimented with the unlogged table case
> as well is that we can treat these results as the ideal performance if
> we were able to write WAL records in 0 sec. IOW, even if the PMEM patch
> significantly improved WAL logging performance, I think it could not
> exceed this performance. But the hope is that if we currently have a
> performance bottleneck in WAL logging (e.g., locking and writing
> WAL), removing or minimizing WAL logging would bring a chance to
> further improve performance by removing the next bottleneck.

Could you tell us the specifics of the storage used for WAL: e.g., SSD/HDD,
whether the interface is NVMe/SAS/SATA, the read-write throughput and latency
(from the product catalog), and the product model?

Was the WAL stored on a storage device separate from the other files?  I want 
to know if the comparison is as fair as possible.  I guess that in the NTT 
(PMEM) case, the WAL traffic is not affected by the I/Os of the other files.

What would the comparison look like between master and unlogged-master if you 
place WAL on a DAX-aware filesystem like xfs or ext4 on PMEM, which Oracle 
recommends as REDO log storage?  That is, if we place the WAL on the fastest 
storage configuration possible, what would be the difference between the logged 
and unlogged?

I'm asking these questions to judge whether we consider it worthwhile to put
further effort into special code for WAL on PMEM.


> Besides, I've checked the main wait events on each experiment using
> pg_wait_sampling. Here are the top 5 wait events on "master" case
> excluding wait events on the main function of auxiliary processes:
> 
>  event_type |        event         |  sum
> ------------+----------------------+-------
>  Client     | ClientRead           | 46902
>  LWLock     | WALWrite             | 33405
>  IPC        | ProcArrayGroupUpdate |  8855
>  LWLock     | WALInsert            |  3215
>  LWLock     | ProcArray            |  3022
> 
> We can see that waits on WALWrite lwlock acquisition happened many
> times, and that it was the primary wait event.
> 
> The result of "ntt" case is:
> 
>  event_type |        event         |  sum
> ------------+----------------------+--------
>  LWLock     | WALInsert            | 126487
>  Client     | ClientRead           |  12173
>  LWLock     | BufferContent        |   4480
>  Lock       | transactionid        |   2017
>  IPC        | ProcArrayGroupUpdate |    924
> 
> The wait event on WALWrite lwlock disappeared. Instead, there were
> many wait events on WALInsert lwlock. I've not investigated this
> result yet. This could be because the v4 patch acquires WALInsert lock
> more than necessary or writing WAL records to PMEM took more time than
> writing to DRAM as Tomas mentioned before.

Increasing NUM_XLOGINSERT_LOCKS might improve the result, but I don't have much 
hope because PMEM appears to have limited concurrency...


Regards
Takayuki Tsunakawa




Re: doing something about the broken dynloader.h symlink

2021-02-14 Thread Thomas Munro
On Fri, Jun 19, 2020 at 11:38 PM Julien Rouhaud  wrote:
> On Fri, Jun 19, 2020 at 12:08 PM Thomas Munro  wrote:
> > On Fri, Jun 19, 2020 at 8:02 PM Peter Eisentraut
> >  wrote:
> > > +[# Remove links created by old versions of configure, so that there
> > > +# are no broken symlinks in the tree
> > > +rm -f src/include/dynloader.h])
> >
> > +1
>
> +1

+1




Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-14 Thread Michael Paquier
On Sun, Feb 14, 2021 at 11:39:47AM -0300, Ranier Vilela wrote:
> What do you think?

That's not a good idea for two reasons:
1) There is CRC32 to worry about, which relies on a different logic.
2) It would become easier to miss the new option as compilation would
not warn anymore if a new checksum type is added.

I have reviewed my patch this morning, tweaked a comment, and applied
it.
--
Michael




Re: shared tempfile was not removed on statement_timeout

2021-02-14 Thread Thomas Munro
On Fri, Feb 5, 2021 at 5:47 PM Thomas Munro  wrote:
> On Sun, Jan 31, 2021 at 6:07 PM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > So that gives a very simple back-patchable patch.
> >
> > Hmm, so is the *rest* of that function perfectly okay with being
> > interrupted?
>
> It looks OK to me.  There aren't any CFI()s in there.

Pushed.  That closes CF #2657.




Re: GCC warning in back branches

2021-02-14 Thread Michael Paquier
On Mon, Feb 15, 2021 at 02:15:51PM +1300, Thomas Munro wrote:
> guc.c: In function ‘RestoreGUCState’:
> guc.c:9455:4: error: ‘varsourceline’ may be used uninitialized in this
> function [-Werror=maybe-uninitialized]
>  9455 |set_config_sourcefile(varname, varsourcefile, varsourceline);
>   |^~~~
> 
> I propose the attached.

We usually don't bother much about compilation warnings in stable
branches as long as they are not real bugs, and these are the oldest
stable ones.  So why here?  I would have patched the top of the
function if it were me, btw.
--
Michael




Re: GCC warning in back branches

2021-02-14 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Feb 15, 2021 at 02:15:51PM +1300, Thomas Munro wrote:
>> I propose the attached.

> We usually don't bother much about compilation warnings in stable
> branches as long as they are not real bugs, and these are the oldest
> stable ones.  So why here?  I would have patched the top of the
> function if it were me, btw.

If somebody were running a buildfarm member with recent gcc
and -Werror, we'd pretty much have to fix it.

I'd say the real policy is that we don't worry about
uninitialized-variable warnings from old compiler versions,
on the theory that they're probably compiler shortcomings.
But I'd be inclined to fix anything from a current gcc version.

regards, tom lane




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-14 Thread Justin Pryzby
On Thu, Feb 04, 2021 at 03:38:39PM +0900, Michael Paquier wrote:
> On Wed, Feb 03, 2021 at 07:54:42PM +0900, Michael Paquier wrote:
> > Not sure I like that.  Here is a proposal:
> > "it is recommended to separately use ALTER TABLE ONLY on them so as
> > any new partitions attached inherit the new tablespace value."
> 
> So, I have done more work on this stuff today, and applied that as of
> c5b2860.

> A second thing I have come back to is allow_system_table_mods for
> toast relations, and decided to just forbid TABLESPACE if attempting
> to use it directly on a system table even if allow_system_table_mods
> is true.  This was leading to inconsistent behaviors and weirdness in
> the concurrent case because all the indexes are processed in series
> after building a list.  As we want to ignore the move of toast indexes
> when moving the indexes of the parent table, this was leading to extra
> conditions that are not really worth supporting after thinking about
> it.  One other issue was the lack of consistency when using pg_global
> that was a no-op for the concurrent case but failed in the 
> non-concurrent case.  I have put in place more regression tests for
> all that.

Isn't this dead code?

postgres=# REINDEX (CONCURRENTLY, TABLESPACE pg_global) TABLE pg_class;
ERROR:  0A000: cannot reindex system catalogs concurrently
LOCATION:  ReindexRelationConcurrently, indexcmds.c:3276

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 127ba7835d..c77a9b2563 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3260,73 +3260,66 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 	if (IsCatalogRelationOid(relationOid))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot reindex system catalogs concurrently")));

...

-	if (OidIsValid(params->tablespaceOid) &&
-		IsSystemRelation(heapRelation))
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot move system relation \"%s\"",
-						RelationGetRelationName(heapRelation))));
-
@@ -3404,73 +3397,66 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 	if (IsCatalogRelationOid(heapId))
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot reindex system catalogs concurrently")));

...

-	if (OidIsValid(params->tablespaceOid) &&
-		IsSystemRelation(heapRelation))
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("cannot move system relation \"%s\"",
-						get_rel_name(relationOid))));
-
From b4347c18bc732d30295c065ef71edaac65e68fe6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 14 Feb 2021 19:49:52 -0600
Subject: [PATCH] Dead code: REINDEX (CONCURRENTLY, TABLESPACE ..):
 c5b286047cd698021e57a527215b48865fd4ad4e

---
 src/backend/commands/indexcmds.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 127ba7835d..c77a9b2563 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3260,73 +3260,66 @@ ReindexRelationConcurrently(Oid relationOid, ReindexParams *params)
 			{
 /*
  * In the case of a relation, find all its indexes including
  * toast indexes.
  */
 Relation	heapRelation;
 
 /* Save the list of relation OIDs in private context */
 oldcontext = MemoryContextSwitchTo(private_context);
 
 /* Track this relation for session locks */
 heapRelationIds = lappend_oid(heapRelationIds, relationOid);
 
 MemoryContextSwitchTo(oldcontext);
 
 if (IsCatalogRelationOid(relationOid))
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 			 errmsg("cannot reindex system catalogs concurrently")));
 
 /* Open relation to get its indexes */
 if ((params->options & REINDEXOPT_MISSING_OK) != 0)
 {
 	heapRelation = try_table_open(relationOid,
   ShareUpdateExclusiveLock);
 	/* leave if relation does not exist */
 	if

Re: Use pg_pwrite() in pg_test_fsync

2021-02-14 Thread Thomas Munro
On Sun, Jan 24, 2021 at 1:50 PM Thomas Munro  wrote:
> On Sun, Jan 10, 2021 at 9:21 AM Thomas Munro  wrote:
> > I left the fsync-after-closing and non-sync'd tests using write(),
> > because they weren't using lseek().  The latter case is arguably a bit
> > odd because it's not overwriting pre-allocated blocks, unlike the
> > earlier tests.
>
> On closer inspection, the weird thing about that final test is that
> it's opening and closing the file every time.  That doesn't seem to
> make any sense.  Perhaps it's a copy and paste error from the previous
> test?  In v2 I changed it to pg_pwrite(), and moved the open and close
> calls out of the loop.

Pushed.




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-02-14 Thread Masahiro Ikeda

On 2021-02-08 13:01, Fujii Masao wrote:
> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>> I pgindented the patches.
>
> Thanks for updating the patches!

Thanks for checking the patches.

> +   XLogWrite, which nomally called by an
> +   issue_xlog_fsync, which nomally called by an
>
> Typo: "nomally" should be "normally"?

Yes, I'll fix it.

> +   XLogFlush request(see <xref linkend="wal-configuration"/>)
> +   XLogFlush request(see <xref linkend="wal-configuration"/>),
>
> Isn't it better to add a space character just after "request"?

Thanks, I'll fix it.

> +   INSTR_TIME_SET_CURRENT(duration);
> +   INSTR_TIME_SUBTRACT(duration, start);
> +   WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
>
> If several cycles happen in the do-while loop, m_wal_write_time should be
> updated with the sum of "duration" in those cycles instead of "duration"
> in the last cycle? If yes, "+=" should be used instead of "=" when updating
> m_wal_write_time?
>
> +   INSTR_TIME_SET_CURRENT(duration);
> +   INSTR_TIME_SUBTRACT(duration, start);
> +   WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>
> Also "=" should be "+=" in the above?

Yes, those are my mistakes from changing the unit from milliseconds to
microseconds.
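
For clarity, the corrected accumulation would presumably look like this
(a sketch of the fix under discussion, not the committed code):

    /* accumulate across all cycles of the write loop, don't overwrite */
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);

and likewise "+=" for m_wal_sync_time in the sync path.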


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-02-14 Thread Masahiro Ikeda

On 2021-02-08 14:26, Fujii Masao wrote:
> On 2021/02/08 13:01, Fujii Masao wrote:
>> On 2021/02/05 8:45, Masahiro Ikeda wrote:
>>> I pgindented the patches.
>>
>> Thanks for updating the patches!
>>
>> +   XLogWrite, which nomally called by an
>> +   issue_xlog_fsync, which nomally called by an
>>
>> Typo: "nomally" should be "normally"?
>>
>> +   XLogFlush request(see <xref linkend="wal-configuration"/>)
>> +   XLogFlush request(see <xref linkend="wal-configuration"/>),
>>
>> Isn't it better to add a space character just after "request"?
>>
>> +    INSTR_TIME_SET_CURRENT(duration);
>> +    INSTR_TIME_SUBTRACT(duration, start);
>> +    WalStats.m_wal_write_time = INSTR_TIME_GET_MICROSEC(duration);
>>
>> If several cycles happen in the do-while loop, m_wal_write_time should be
>> updated with the sum of "duration" in those cycles instead of "duration"
>> in the last cycle? If yes, "+=" should be used instead of "=" when updating
>> m_wal_write_time?
>>
>> +    INSTR_TIME_SET_CURRENT(duration);
>> +    INSTR_TIME_SUBTRACT(duration, start);
>> +    WalStats.m_wal_sync_time = INSTR_TIME_GET_MICROSEC(duration);
>>
>> Also "=" should be "+=" in the above?
>
> +   /* Send WAL statistics */
> +   pgstat_send_wal();
>
> This may cause overhead in WAL-writing by walwriter because it's called
> every cycle, even when walwriter needs to write more WAL in the next cycle
> (and doesn't need to sleep on WaitLatch)? If this is right, pgstat_send_wal()
> should be called only when WaitLatch() returns with WL_TIMEOUT?


Thanks, I didn't notice that.
I'll fix it.
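
For illustration, the suggested shape in walwriter.c's main loop might be
something like the following (my sketch of the review comment, not the
actual patch; "cur_timeout" stands for whatever sleep time the loop computes):

    int         rc;

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   cur_timeout,
                   WAIT_EVENT_WAL_WRITER_MAIN);

    /* Send WAL statistics only when going idle, not on every cycle. */
    if (rc & WL_TIMEOUT)
        pgstat_send_wal();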


> -   XLogFlush request(see <xref linkend="wal-configuration"/>)
> +   XLogFlush request(see <xref linkend="wal-configuration"/>),
> +   or WAL data written out to disk by WAL receiver.
>
> So regarding walreceiver, only wal_write, wal_write_time, wal_sync, and
> wal_sync_time are updated even while the other values are not. Isn't this
> confusing to users? If so, what about reporting those walreceiver stats in
> pg_stat_wal_receiver?

OK, I'll add new infrastructure code to interact with the WAL receiver
and the stats collector, and show the stats in pg_stat_wal_receiver.

> 		if (endofwal)
> +		{
> +			/* Send WAL statistics to the stats collector */
> +			pgstat_send_wal();
> 			break;
>
> You added pgstat_send_wal() so that it's called in some cases where
> walreceiver exits. But ISTM that there are other walreceiver-exit cases,
> for example, where SIGTERM is received. Instead, shouldn't
> pgstat_send_wal() be called in WalRcvDie() to cover all those cases?

Thanks, I forgot that case.
I'll fix it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Snapshot scalability patch issue

2021-02-14 Thread Peter Geoghegan
The call to heap_page_prune() within lazy_scan_heap() passes a bool
literal ('false') as its fourth argument. But the fourth argument is
of type TransactionId, not bool. This has been the case since the
snapshot scalability work performed by commit dc7420c2c92. Surely
something is amiss here.
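
For the record, the reason the compiler accepts this silently is that
'false' is just integer zero, which converts implicitly to the uint32-based
TransactionId, i.e. InvalidTransactionId. A self-contained illustration
(not the actual call site):

    #include <stdbool.h>

    typedef unsigned int TransactionId;     /* uint32, as in c.h */

    static void
    prune(TransactionId old_snap_xmin)      /* name is illustrative */
    {
        /* ... */
    }

    void
    caller(void)
    {
        /* compiles cleanly: false == 0 == InvalidTransactionId */
        prune(false);
    }

So no warning is raised, and InvalidTransactionId is what actually gets
passed.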

I also notice some inconsistencies in the heap_page_prune() prototype
names vs the corresponding definition names. Might be worth doing
something about in passing.

-- 
Peter Geoghegan




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-02-14 Thread Bharath Rupireddy
On Sat, Feb 13, 2021 at 11:41 AM japin  wrote:
> > IIUC, with the current patch, the new ALTER SUBSCRIPTION ... ADD/DROP
> > errors out on the first publication that already exists/that doesn't
> > exist right? What if there are multiple publications given in the
> > ADD/DROP list, and few of them exist/don't exist. Isn't it good if we
> > loop over the subscription's publication list and show all the already
> > existing/not existing publications in the error message, instead of
> > just erroring out for the first existing/not existing publication?
> >
>
> Yes, you are right. Agree with you, I modified it. Please consider v5
> for further review.

Thanks for the updated patches. I have a comment about the code that
reports the existing/not-existing publications. How about something like
the attached delta patch on v5-0002? Sorry for attaching

I also think that we could merge 0002 into 0001 and have only 4
patches in the patch set.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v1-0001-delta-patch-for-reporting-error.patch
Description: Binary data


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-14 Thread Michael Paquier
On Sun, Feb 14, 2021 at 08:10:50PM -0600, Justin Pryzby wrote:
> Isn't this dead code ?

Nope, it's not dead.  Those two code paths can be hit when attempting
a reindex with a tablespace move directly on toast tables and indexes,
see:
=# create table aa (a text);
CREATE TABLE
=# select relname from pg_class where oid > 16000;
       relname
----------------------
 aa
 pg_toast_16385
 pg_toast_16385_index
(3 rows)
=# reindex (concurrently, tablespace pg_default) table
  pg_toast.pg_toast_16385;
ERROR:  0A000: cannot move system relation "pg_toast_16385"
LOCATION:  ReindexRelationConcurrently, indexcmds.c:3295
=# reindex (concurrently, tablespace pg_default) index
  pg_toast.pg_toast_16385_index;
ERROR:  0A000: cannot move system relation "pg_toast_16385_index"
LOCATION:  ReindexRelationConcurrently, indexcmds.c:3439

It is easy to save the relation name using \gset in a regression test,
but we had better keep a reference to the relation name in the error
message, so this would not really be portable.  Using a PL function to
do that with a CATCH block would not work either, as CONCURRENTLY
cannot be run in a transaction block.  This leaves 090_reindexdb.pl,
but I was not really convinced that this was worth the extra test
cycles (I am aware of the --tablespace option missing in reindexdb;
someone I know was trying to get that done for the next CF).
--
Michael




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-02-14 Thread Masahiro Ikeda

On 2021-02-10 00:51, David G. Johnston wrote:

On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda
 wrote:


I pgindented the patches.


... XLogWrite, which is invoked during an
XLogFlush request (see ...).  This is also
incremented by the WAL receiver during replication.

("which normally called" should be "which is normally called" or
"which normally is called" if you want to keep true to the original)
You missed the adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either
"This parameter is off by default as it will repeatedly query the
operating system..."
", because" -> "as"


Thanks, I fixed them.


wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."


I skipped changing it since I separated the stats for the WAL receiver
in pg_stat_wal_receiver.


"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in) This is undesirable because ..."


Thanks, I fixed it.


I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics.  This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that
the XLogWrite code path calls pgstat_report_wait_*() while the WAL
receiver path does not.  It seems technically straight-forward to
refactor here to avoid the almost-duplicated logic in the two places,
though I suspect there may be a trade-off for not adding another
function call to the stack given the importance of WAL processing
(though that seems marginalized compared to the cost of actually
writing the WAL).  Or, as Fujii noted, go the other way and don't have
any shared code between the two but instead implement the WAL receiver
one to use pg_stat_wal_receiver instead.  In either case, this
half-and-half implementation seems undesirable.


OK, as Fujii-san mentioned, I separated the WAL receiver stats.
(v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch)

I added the infrastructure code to communicate the WAL receiver stats
messages between the WAL receiver and the stats collector, and the
stats for the WAL receiver are now counted in pg_stat_wal_receiver.
What do you think?
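
For reference, a quick way to eyeball the new counters with the patch
applied (my example; column names as in the 0001 patch, and the *_time
columns stay zero unless track_wal_io_timing is enabled):

-- cumulative WAL write/sync counts and, if timing is enabled,
-- milliseconds spent in those calls
SELECT wal_write, wal_write_time, wal_sync, wal_sync_time
  FROM pg_stat_wal;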

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

From 4ecbc467f88a2f923b3bc2eb6fe2c4f7725c02be Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Fri, 12 Feb 2021 11:19:59 +0900
Subject: [PATCH 1/2] Add statistics related to write/sync wal records.

This patch adds the following statistics to the pg_stat_wal view
to track WAL I/O activity by backends and background processes
except the WAL receiver:

- the total number of times WAL data was written/synced.
- the total amount of time spent writing/syncing WAL data.

Since tracking I/O timing may lead to significant overhead, a
GUC parameter "track_wal_io_timing" is introduced.  The I/O
timing is measured only when it is on.

The statistics related to sync are zero when "wal_sync_method"
is "open_datasync" or "open_sync", because no separate sync
call is issued in those cases.

(This requires a catversion bump, as well as an update to
 PGSTAT_FILE_FORMAT_ID)

Author: Masahiro Ikeda
Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David
Johnston, Fujii Masao
Discussion:
https://postgr.es/m/0509ad67b585a5b86a83d445dfa75...@oss.nttdata.co
---
 doc/src/sgml/config.sgml  | 23 +++-
 doc/src/sgml/monitoring.sgml  | 50 
 doc/src/sgml/wal.sgml | 12 +++-
 src/backend/access/transam/xlog.c | 57 +++
 src/backend/catalog/system_views.sql  |  4 ++
 src/backend/postmaster/pgstat.c   |  4 ++
 src/backend/postmaster/walwriter.c| 16 --
 src/backend/utils/adt/pgstatfuncs.c   | 24 +++-
 src/backend/utils/misc/guc.c  |  9 +++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 src/include/access/xlog.h |  1 +
 src/include/catalog/pg_proc.dat   |  6 +-
 src/include/pgstat.h  | 10 
 src/test/regress/expected/rules.out   |  6 +-
 14 files changed, 208 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..0bb03d0717 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7404,7 +7404,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
   

 Enables timing of database I/O calls.  This parameter is off by
-default, because it will repeatedly query the operating system for
+default, as it will repeatedly query the operating system for
 the current time, which may cause significant overhead on some
 platforms.  You can use the  tool to
 measure the overhead of timing on your system.
@@ -7418,6 +7418,27 @@ COPY postgres_log FROM '/full/path/to/logfil

Re: Some regular-expression performance hacking

2021-02-14 Thread Tom Lane
"Joel Jacobson"  writes:
> Below is the result of the performance test query:
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a:
> Time: 592196.145 ms (09:52.196)
> -- 8facf1ea00b7a0c08c755a0392212b83e04ae28a+patches:
> Time: 461739.364 ms (07:41.739)
> That's an impressive 22% speed-up!

I've been doing some more hacking over the weekend, and have a couple
of additional improvements to show.  The point of these two additional
patches is to reduce the number of "struct subre" sub-regexps that
the regex parser creates.  The subre's themselves aren't that large,
so this might seem like it would have only small benefit.  However,
each subre requires its own NFA for the portion of the RE that it
matches.  That adds space, and it also adds compilation time because
we run the "optimize()" pass separately for each such NFA.  Maybe
there'd be a way to share some of that work, but I'm not very clear
how.  In any case, not having a subre at all is clearly better where
we can manage it.

0003 is a small patch that fixes up parseqatom() so that it doesn't
emit no-op subre's for empty portions of a regexp branch that are
adjacent to a "messy" regexp atom (that is, a capture node, a
backref, or an atom with greediness different from what preceded it).

0004 is a rather larger patch whose result is to get rid of extra
subre's associated with alternation subre's.  If we have a|b|c
and any of those alternation branches are messy, we end up with

      *
     / \
    a   *
       / \
      b   *
         / \
        c   NULL

where each "*" is an alternation subre node, and all those "*"'s have
identical NFAs that match the whole a|b|c construct.  This means that
for an N-way alternation we're going to need something like O(N^2)
work to optimize all those NFAs.  That's embarrassing (and I think
it's my fault --- if memory serves, I put in this representation
of messy alternations years ago).
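
To make the shape concrete (my illustration, not from the thread):

-- each parenthesized branch is a capture node, i.e. "messy", so the
-- parser builds a chain of alternation subres whose NFAs all match
-- the whole alternation
SELECT regexp_match('needle',
                    '(alpha)|(bravo)|(charlie)|(delta)|(needle)');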

We can improve matters by having just one parent node for an
alternation:

*
 \
  a -> b -> c

That requires replacing the binary-tree structure of subre's
with a child-and-sibling arrangement, which is not terribly
difficult but accounts for most of the bulk of the patch.
(I'd wanted to do that for years, but up till now I did not
think it would have any real material benefit.)

There might be more that can be done in this line, but that's
as far as I got so far.

I did some testing on this using your dataset (thanks for
giving me a copy) and this query:

SELECT
  pattern,
  subject,
  is_match AS is_match_head,
  captured AS captured_head,
  subject ~ pattern AS is_match_patch,
  regexp_match(subject, pattern) AS captured_patch
FROM subjects
WHERE error IS NULL
AND (is_match <> (subject ~ pattern)
 OR captured IS DISTINCT FROM regexp_match(subject, pattern));

I got these runtimes (non-cassert builds):

HEAD    313661.149 ms (05:13.661)
+0001   297397.293 ms (04:57.397)    5% better than HEAD
+0002   151995.803 ms (02:31.996)   51% better than HEAD
+0003   139843.934 ms (02:19.844)   55% better than HEAD
+0004    95034.611 ms (01:35.035)   69% better than HEAD

Since I don't have all the tables used in your query, I can't
try to reproduce your results exactly.  I suspect the reason
I'm getting a better percentage improvement than you did is
that the joining/grouping/ordering involved in your query
creates a higher baseline query cost.

Anyway, I'm feeling pretty pleased with these results ...

regards, tom lane

diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
index 1e4f0121f3..fcf03de32d 100644
--- a/contrib/pg_trgm/trgm_regexp.c
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -282,8 +282,8 @@ typedef struct
 typedef int TrgmColor;
 
 /* We assume that colors returned by the regexp engine cannot be these: */
-#define COLOR_UNKNOWN	(-1)
-#define COLOR_BLANK		(-2)
+#define COLOR_UNKNOWN	(-3)
+#define COLOR_BLANK		(-4)
 
 typedef struct
 {
@@ -780,7 +780,8 @@ getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
 		palloc0(colorsCount * sizeof(TrgmColorInfo));
 
 	/*
-	 * Loop over colors, filling TrgmColorInfo about each.
+	 * Loop over colors, filling TrgmColorInfo about each.  Note we include
+	 * WHITE (0) even though we know it'll be reported as non-expandable.
 	 */
 	for (i = 0; i < colorsCount; i++)
 	{
@@ -1098,9 +1099,9 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
 			/* Add enter key to this state */
 			addKeyToQueue(trgmNFA, &destKey);
 		}
-		else
+		else if (arc->co >= 0)
 		{
-			/* Regular color */
+			/* Regular color (including WHITE) */
 			TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];
 
 			if (colorInfo->expandable)
@@ -1156,6 +1157,14 @@ addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
 addKeyToQueue(trgmNFA, &destKey);
 			}
 		}
+		else
+		{
+			/* RAINBOW: treat as unexpandable color */
+			destKey.prefix.colors[0] = COLOR_UNKNOWN;
+			destKey.prefix.colors[1] = C

Re: Default wal_sync_method on FreeBSD

2021-02-14 Thread Thomas Munro
On Fri, Jan 8, 2021 at 2:19 PM Thomas Munro  wrote:
> The system header change has one interesting consequence for existing
> releases of PostgreSQL, though: xlogdefs.h now sees that there is an
> O_DSYNC macro that is distinct from O_SYNC, and defaults to
> wal_sync_method=open_datasync.  That's not a great default setting,
> because it gets you O_DIRECT | O_DSYNC, which performs terribly when
> you're writing 8KB blocks on UFS's default 32KB logical block size (it
> triggers read-before-write, quite visibly destroying performance with
> eg pg_test_fsync), and for all I know, it might even not work at all
> on some other file systems.  I suspect it might come out very slightly
> ahead on a UFS filesystem created with 8KB blocks, but in any case,
> that seems like something you should have to opt in to, as you do on
> Linux.

Done.
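
For anyone on an affected existing release who wants to opt out of the
open_datasync default explicitly, a minimal sketch (fdatasync is the
usual alternative, but verify with pg_test_fsync on the actual system):

SHOW wal_sync_method;                            -- open_datasync on affected builds
ALTER SYSTEM SET wal_sync_method = 'fdatasync';  -- opt out of the default
SELECT pg_reload_conf();                         -- wal_sync_method is reloadable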




Re: repeated decoding of prepared transactions

2021-02-14 Thread Amit Kapila
On Sat, Feb 13, 2021 at 10:23 PM Andres Freund  wrote:
>
> On 2021-02-13 17:37:29 +0100, Petr Jelinek wrote:
>
> > AFAIK this is exactly why origins are WAL-logged along with the
> > transaction, it allows us to guarantee never getting anything that has
> > been durably written.
>
> I think you'd need something like origins in that case, because
> something could still go wrong before the other side has received the
> flush (network disconnect, primary crash, ...).
>

We are already using origins in the apply worker to guarantee that,
and with each commit the origin's LSN position is also WAL-logged.
That helps us to send the start location for a slot after a restart.
As far as I understand, this is how it works from the apply-worker
side. Am I missing something here?
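
For reference (my example), the per-origin progress that gets
WAL-logged this way can be inspected on the downstream node:

-- remote_lsn is the confirmed position on the origin side; local_lsn
-- is the local WAL position where that progress was durably recorded
SELECT external_id, remote_lsn, local_lsn
  FROM pg_replication_origin_status;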

-- 
With Regards,
Amit Kapila.




Re: 64-bit XIDs in deleted nbtree pages

2021-02-14 Thread Peter Geoghegan
On Sat, Feb 13, 2021 at 10:47 PM Peter Geoghegan  wrote:
> It will be rare. But more importantly, the fact that scenario is now
> an extreme case justifies treating it as an extreme case. We can teach
> _bt_vacuum_needs_cleanup() to recognize it as an extreme case, too. In
> particular, I think that it will now be okay to increase the threshold
> applied when considering deleted pages inside
> _bt_vacuum_needs_cleanup(). It was 2.5% of the index size in v3 of the
> patch. But in v4, which has the new recycling enhancement, I think
> that it would be sensible to make it 5%, or maybe even 10%. This
> naturally makes Masahiko's problem scenario unlikely to actually
> result in a truly wasted call to btvacuumscan().

Attached is v4, which has the "recycle pages that we ourselves deleted
during this same VACUUM operation" enhancement. It also doubles the
_bt_vacuum_needs_cleanup() threshold applied to deleted pages -- it
goes from 2.5% to 5%. The new patch in the series (v4-0002-*)
certainly needs more polishing. I'm posting what I have now because v3
has bitrot.
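
As a side note (my example, not from the patch), the deleted-page count
that this threshold is applied against can be observed with the
pgstattuple extension ('my_btree_index' is a placeholder name):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- deleted_pages is the quantity compared against the threshold
SELECT leaf_pages, deleted_pages, avg_leaf_density
  FROM pgstatindex('my_btree_index');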

Benchmarking has shown that the enhancement in v4-0002-* can
significantly reduce the amount of index bloat in two of the
BenchmarkSQL/TPC-C indexes.

-- 
Peter Geoghegan


v4-0002-Recycle-pages-deleted-during-same-VACUUM.patch
Description: Binary data


v4-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
Description: Binary data


v4-0003-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch
Description: Binary data


RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-14 Thread tsunakawa.ta...@fujitsu.com
From: Justin Pryzby 
> This is crashing during fdw check.
> http://cfbot.cputube.org/andrey-lepikhov.html
> 
> Maybe it's related to this patch:
> |commit 6214e2b2280462cbc3aa1986e350e167651b3905
> |Fix permission checks on constraint violation errors on partitions.
> |Security: CVE-2021-3393

Thank you for your kind and detailed investigation.  The rebased version is
attached.


Regards
Takayuki Tsunakawa




v15-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v15-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


Re: GCC warning in back branches

2021-02-14 Thread Thomas Munro
On Mon, Feb 15, 2021 at 2:35 PM Michael Paquier  wrote:
> ... I would have patched the top of the
> function if it were me, btw.

I just copied the way it is coded in master (due to commit fbb2e9a0
which fixed this warning in 11+).




Re: PG vs LLVM 12 on seawasp, next round

2021-02-14 Thread Thomas Munro
On Sat, Dec 12, 2020 at 8:45 AM Fabien COELHO  wrote:
> > +ERROR:  could not load library 
> > "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
> >  libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or 
> > directory

Bonjour Fabien,

Here is the creation of llvmjit.so:

g++ ... -o llvmjit.so ... -L/home/fabien/clgtk/lib ... -lLLVMOrcJIT ...

That'd be from llvm-config --ldflags or similar, from this binary:

checking for llvm-config... (cached) /home/fabien/clgtk/bin/llvm-config

So what does ls -slap /home/fabien/clgtk/lib show?  What does ldd
/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so
show?  What do you have in seawasp's LD_LIBRARY_PATH, and is the
problem that you need to add /home/fabien/clgtk/lib to it (or is it
supposed to be making it into the rpath, or is it in your ld.so.conf)?

PS Could you try blowing away the accache directory so we can rule out
bad cached configure stuff?




Re: adding wait_start column to pg_locks

2021-02-14 Thread Fujii Masao




On 2021/02/10 10:43, Fujii Masao wrote:



On 2021/02/09 23:31, torikoshia wrote:

On 2021-02-09 22:54, Fujii Masao wrote:

On 2021/02/09 19:11, Fujii Masao wrote:



On 2021/02/09 18:13, Fujii Masao wrote:



On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx and
pg_atomic_write_xxx only support unsigned integers, so I cast the type.

I may not be using these functions correctly, so if something is wrong, I would
appreciate any comments.


About the documentation, since your suggestion seems better than v6, I used it 
as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));

pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+    /*
+ * Record waitStart reusing the deadlock timeout timer.
+ *
+ * It would be ideal if this could be done synchronously with updating
+ * lock information.  However, since holding partitionLock longer would
+ * hurt performance, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is a virtual machine (CentOS 8), so this is
just for reference, but there was no significant difference
in latency or tps (both changed by less than 1%).


Thanks for the test! I pushed the patch.


But I reverted the patch because buildfarm members rorqual and
prion don't like the patch. I'm trying to investigate the cause
of these failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10


-    relation | locktype |    mode
--+--+-
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that the "waitStart"
atomic variable in the dummy proc created at the end of PREPARE TRANSACTION
was not initialized. I updated the patch so that pg_atomic_init_u64() is
called for "waitStart" in the dummy proc for a prepared transaction.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passed all the regression tests.


Thanks for fixing the bug. I also tested v9.patch configured with
--disable-atomics --disable-spinlocks on my environment and confirmed
that all tests have passed.


Thanks for the test!

I found another bug in the patch. InitProcess() initializes "waitStart",
but previously InitAuxiliaryProcess() did not. This could cause "invalid
spinlock number" error when reading pg_locks in the standby server.
I fixed that. Attached is the updated version of the patch.


I pushed this version. Thanks!
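
With the committed patch, the new column can be used like this (my
example; column name as in the patch):

-- show how long each ungranted lock request has been waiting
SELECT pid, locktype, mode, now() - wait_start AS waiting_for
  FROM pg_locks
 WHERE NOT granted
 ORDER BY waiting_for DESC;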

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: logical replication seems broken

2021-02-14 Thread vignesh C
On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers  wrote:
>
> > On 02/13/2021 11:49 AM Amit Kapila  wrote:
> >
> > On Fri, Feb 12, 2021 at 10:00 PM  wrote:
> > >
> > > > On 02/12/2021 1:51 PM Amit Kapila  wrote:
> > > >
> > > > On Fri, Feb 12, 2021 at 6:04 PM Erik Rijkers  wrote:
> > > > >
> > > > > I am seeing errors in replication in a test program that I've been 
> > > > > running for years with very little change (since 2017, really [1]).
> > >
> > > Hi,
> > >
> > > Here is a test program.  Careful, it deletes stuff.  And it will need 
> > > some changes:
> > >
> >
> > Thanks for sharing the test. I think I have found the problem.
> > Actually, it was an existing code problem exposed by the commit
> > ce0fdbfe97. In pgoutput_begin_txn(), we were sometimes sending the
> > prepare_write ('w') message but then the actual message was not being
> > sent. This was the case when we didn't find the origin of a txn. This
> > can happen after that commit because we have now started using origins
> > for tablesync workers as well and those origins are removed once the
> > tablesync workers are finished. We might want to change the behavior
> > related to the origin messages as indicated in the comments but for
> > now, fixing the existing code.
> >
> > Can you please test if the attached fixes the problem at your end as well?
>
> > [fix_origin_message_1.patch]
>
> I compiled just now a binary from HEAD, and a binary from HEAD+patch
>
> HEAD is still broken; your patch rescues it, so yes, fixed.
>
> Maybe a test (check or check-world) should be added to run a second replica?  
> (Assuming that would have caught this bug)
>

+1 for the idea of having a test for this; I have written one.
Thanks for the fix, Amit. I could reproduce the issue without your fix
and verified that it gets fixed with the patch you shared.
Attached is a patch for the test. Thoughts?

Regards,
Vignesh
From eff13970ae34bcdc8590a9602eddce5eb6200195 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 15 Feb 2021 11:41:55 +0530
Subject: [PATCH v1] Test for verifying data is replicated in cascaded setup.

Test for verifying data is replicated in cascaded setup.
---
 src/test/subscription/t/100_bugs.pl | 65 +
 1 file changed, 65 insertions(+)

diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index d1e407a..afb2d08 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -153,3 +153,68 @@ is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t"),
 	$rows * 2, "2x$rows rows in t");
 is($node_twoways->safe_psql('d2', "SELECT count(f) FROM t2"),
 	$rows * 2, "2x$rows rows in t2");
+
+# Verify table data is synced with cascaded replication setup.
+my $node_pub = get_new_node('testpublisher1');
+$node_pub->init(allows_streaming => 'logical');
+$node_pub->start;
+
+my $node_pub_sub = get_new_node('testpublisher_subscriber');
+$node_pub_sub->init(allows_streaming => 'logical');
+$node_pub_sub->start;
+
+my $node_sub = get_new_node('testsubscriber1');
+$node_sub->init(allows_streaming => 'logical');
+$node_sub->start;
+
+# Create the tables in all nodes.
+$node_pub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_pub_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+$node_sub->safe_psql('postgres', "CREATE TABLE tab1 (a int)");
+
+# Create a cascaded replication setup like:
+# N1 - Create publication testpub1.
+# N2 - Create publication testpub2 and also include subscriber which subscribes
+#  to testpub1.
+# N3 - Create subscription testsub2 subscribes to testpub2.
+$node_pub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub1 FOR TABLE tab1");
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE PUBLICATION testpub2 FOR TABLE tab1");
+
+my $publisher1_connstr = $node_pub->connstr . ' dbname=postgres';
+my $publisher2_connstr = $node_pub_sub->connstr . ' dbname=postgres';
+
+$node_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub2 CONNECTION '$publisher2_connstr' PUBLICATION testpub2"
+);
+
+$node_pub_sub->safe_psql('postgres',
+	"CREATE SUBSCRIPTION testsub1 CONNECTION '$publisher1_connstr' PUBLICATION testpub1"
+);
+
+$node_pub->safe_psql('postgres',
+	"INSERT INTO tab1 values(generate_series(1,10))");
+
+# Verify that the data is cascaded from testpub1 to testsub1 and further from
+# testpub2 (which had testsub1) to testsub2.
+$node_pub->wait_for_catchup('testsub1');
+$node_pub_sub->wait_for_catchup('testsub2');
+
+# Drop subscriptions as we don't need them anymore
+$node_pub_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub1");
+$node_sub->safe_psql('postgres', "DROP SUBSCRIPTION testsub2");
+
+# Drop publications as we don't need them anymore
+$node_pub->safe_psql('postgres', "DROP PUBLICATION testpub1");
+$node_pub_sub->safe_psql('postgres', "DROP PUBLICATION testpub2");
+
+# Clean up the tables on both publisher and subscriber as we don't need them
+$node_pub->safe_psql('postgres', "DROP TABLE 

Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses

2021-02-14 Thread Ian Lawrence Barwick
On Sat, Feb 13, 2021 at 11:52, Michael Paquier wrote:

> On Fri, Feb 12, 2021 at 10:32:14AM +0900, Ian Lawrence Barwick wrote:
> > In the documentation, the "[ NO ]" option is listed in the synopsis for
> > ALTER TRIGGER and ALTER FUNCTION, but not the others.
> > Trivial patch attached.
>
> There are two flavors to cover for 6 commands per gram.y, and you are
> covering all of them.  So this looks good to me.  I'll apply and
> backpatch in a bit.  It is worth noting that tab-complete.c does a bad
> job in completing those clauses.
> --
> Michael
>


-- 
EnterpriseDB: https://www.enterprisedb.com


Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses

2021-02-14 Thread Ian Lawrence Barwick
On Sat, Feb 13, 2021 at 11:52, Michael Paquier wrote:

> On Fri, Feb 12, 2021 at 10:32:14AM +0900, Ian Lawrence Barwick wrote:
> > In the documentation, the "[ NO ]" option is listed in the synopsis for
> > ALTER TRIGGER and ALTER FUNCTION, but not the others.
> > Trivial patch attached.
>
> There are two flavors to cover for 6 commands per gram.y, and you are
> covering all of them.  So this looks good to me.  I'll apply and
> backpatch in a bit.


Thanks! (Apologies for the preceding blank mail).

> It is worth noting that tab-complete.c does a bad
> job in completing those clauses.

Indeed it does. Not the most exciting of use cases, though I imagine it
might come in handy for anyone developing an extension, and the
existing implementation is inconsistent (in place for ALTER INDEX,
and partially for ALTER MATERIALIZED VIEW, but not the others).
Patch suggestion attached.

Regards

Ian Barwick


-- 
EnterpriseDB: https://www.enterprisedb.com
commit 5ffe7228c6789cbf29084daae6aa8f4638996a98
Author: Ian Barwick 
Date:   Mon Feb 15 15:27:25 2021 +0900

psql: improve [ NO ] DEPENDS tab completion support

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 1e1c315bae..f99564937d 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1614,14 +1614,24 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER AGGREGATE,FUNCTION,PROCEDURE,ROUTINE  */
 	else if (Matches("ALTER", "AGGREGATE|FUNCTION|PROCEDURE|ROUTINE", MatchAny))
 		COMPLETE_WITH("(");
-	/* ALTER AGGREGATE,FUNCTION,PROCEDURE,ROUTINE  (...) */
-	else if (Matches("ALTER", "AGGREGATE|FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny))
+	/* ALTER AGGREGATE  (...) */
+	else if (Matches("ALTER", "AGGREGATE", MatchAny, MatchAny))
 	{
 		if (ends_with(prev_wd, ')'))
 			COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA");
 		else
 			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
 	}
+	/* ALTER FUNCTION,PROCEDURE,ROUTINE  (...) */
+	else if (Matches("ALTER", "FUNCTION|PROCEDURE|ROUTINE", MatchAny, MatchAny))
+	{
+		if (ends_with(prev_wd, ')'))
+			COMPLETE_WITH("OWNER TO", "RENAME TO", "SET SCHEMA",
+		  "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION");
+		else
+			COMPLETE_WITH_FUNCTION_ARG(prev2_wd);
+	}
+
 	/* ALTER PUBLICATION  */
 	else if (Matches("ALTER", "PUBLICATION", MatchAny))
 		COMPLETE_WITH("ADD TABLE", "DROP TABLE", "OWNER TO", "RENAME TO", "SET");
@@ -1735,7 +1745,8 @@ psql_completion(const char *text, int start, int end)
 	/* ALTER INDEX  */
 	else if (Matches("ALTER", "INDEX", MatchAny))
 		COMPLETE_WITH("ALTER COLUMN", "OWNER TO", "RENAME TO", "SET",
-	  "RESET", "ATTACH PARTITION", "DEPENDS", "NO DEPENDS",
+	  "RESET", "ATTACH PARTITION",
+	  "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION",
 	  "ALTER COLLATION");
 	else if (Matches("ALTER", "INDEX", MatchAny, "ATTACH"))
 		COMPLETE_WITH("PARTITION");
@@ -1782,10 +1793,6 @@ psql_completion(const char *text, int start, int end)
 	  "buffering =",	/* GiST */
 	  "pages_per_range =", "autosummarize ="	/* BRIN */
 			);
-	else if (Matches("ALTER", "INDEX", MatchAny, "NO", "DEPENDS"))
-		COMPLETE_WITH("ON EXTENSION");
-	else if (Matches("ALTER", "INDEX", MatchAny, "DEPENDS"))
-		COMPLETE_WITH("ON EXTENSION");
 	/* ALTER INDEX  ALTER COLLATION */
 	else if (Matches("ALTER", "INDEX", MatchAny, "ALTER", "COLLATION"))
 	{
@@ -1919,7 +1926,7 @@ psql_completion(const char *text, int start, int end)
 
 	/* ALTER MATERIALIZED VIEW  */
 	else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny))
-		COMPLETE_WITH("ALTER COLUMN", "CLUSTER ON", "DEPENDS ON EXTENSION",
+		COMPLETE_WITH("ALTER COLUMN", "CLUSTER ON", "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION",
 	  "OWNER TO", "RENAME", "RESET (", "SET");
 	/* ALTER MATERIALIZED VIEW xxx RENAME */
 	else if (Matches("ALTER", "MATERIALIZED", "VIEW", MatchAny, "RENAME"))
@@ -1997,7 +2004,7 @@ psql_completion(const char *text, int start, int end)
 
 	/* ALTER TRIGGER  ON  */
 	else if (Matches("ALTER", "TRIGGER", MatchAny, "ON", MatchAny))
-		COMPLETE_WITH("RENAME TO");
+		COMPLETE_WITH("RENAME TO", "DEPENDS ON EXTENSION", "NO DEPENDS ON EXTENSION");
 
 	/*
 	 * If we detect ALTER TABLE , suggest sub commands


Re: a misbehavior of partition row movement (?)

2021-02-14 Thread Masahiko Sawada
On Wed, Jan 20, 2021 at 7:04 PM Amit Langote  wrote:
>
> On Wed, Jan 20, 2021 at 4:13 PM Peter Eisentraut
>  wrote:
> > On 2021-01-08 09:54, Amit Langote wrote:
> > >>> I don't quite recall if the decision to implement it like this was
> > >>> based on assuming that this is what users would like to see happen in
> > >>> this case or the perceived difficulty of implementing it the other way
> > >>> around, that is, of firing AFTER UPDATE triggers in this case.
> > >> I tried to look that up, but I couldn't find any discussion about this. 
> > >> Do you have any ideas in which thread that was handled?
> > > It was discussed here:
> > >
> > > https://www.postgresql.org/message-id/flat/CAJ3gD9do9o2ccQ7j7%2BtSgiE1REY65XRiMb%3DyJO3u3QhyP8EEPQ%40mail.gmail.com
> > >
> > > It's a huge discussion, so you'll have to ctrl+f "trigger" to spot
> > > relevant emails.  You might notice that the developers who
> > > participated in that discussion gave various opinions and what we have
> > > today got there as a result of a majority of them voting for the
> > > current approach.  Someone also said this during the discussion:
> > > "Regarding the trigger issue, I can't claim to have a terribly strong
> > > opinion on this. I think that practically anything we do here might
> > > upset somebody, but probably any halfway-reasonable thing we choose to
> > > do will be OK for most people." So what we've got is that
> > > "halfway-reasonable" thing, YMMV. :)
> >
> > Could you summarize here what you are trying to do with respect to what
> > was decided before?  I'm a bit confused, looking through the patches you
> > have posted.  The first patch you posted hard-coded FK trigger OIDs
> > specifically, other patches talk about foreign key triggers in general
> > or special case internal triggers or talk about all triggers.
>
> The original problem statement is this: the way we generally fire
> row-level triggers of a partitioned table can lead to some unexpected
> behaviors of the foreign keys pointing to that partitioned table
> during its cross-partition updates.
>
> Let's start with an example that shows how triggers are fired during a
> cross-partition update:
>
> create table p (a numeric primary key) partition by list (a);
> create table p1 partition of p for values in (1);
> create table p2 partition of p for values in (2);
> create or replace function report_trig_details() returns trigger as $$
> begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
> 'DELETE' then return old; end if; return new; end; $$ language
> plpgsql;
> create trigger trig1 before update on p for each row execute function
> report_trig_details();
> create trigger trig2 after update on p for each row execute function
> report_trig_details();
> create trigger trig3 before delete on p for each row execute function
> report_trig_details();
> create trigger trig4 after delete on p for each row execute function
> report_trig_details();
> create trigger trig5 before insert on p for each row execute function
> report_trig_details();
> create trigger trig6 after insert on p for each row execute function
> report_trig_details();
>
> insert into p values (1);
> update p set a = 2;
> NOTICE:  BEFORE UPDATE on p1
> NOTICE:  BEFORE DELETE on p1
> NOTICE:  BEFORE INSERT on p2
> NOTICE:  AFTER DELETE on p1
> NOTICE:  AFTER INSERT on p2
> UPDATE 1
>
> (AR update triggers are not fired.)
>
> For contrast, here is an intra-partition update:
>
> update p set a = a;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  AFTER UPDATE on p2
> UPDATE 1
>
> Now, the trigger machinery makes no distinction between user-defined
> and internal triggers, which has implications for the foreign key
> enforcing triggers on partitions.  Consider the following example:
>
> create table q (a bigint references p);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> ERROR:  update or delete on table "p2" violates foreign key constraint
> "q_a_fkey2" on table "q"
> DETAIL:  Key (a)=(2) is still referenced from table "q".
>
> So the RI delete trigger (NOT update) on p2 prevents the DELETE that
> occurs as part of the row movement.  One might make the updates
> cascade and expect that to prevent the error:
>
> drop table q;
> create table q (a bigint references p on update cascade);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> NOTICE:  BEFORE DELETE on p2
> NOTICE:  BEFORE INSERT on p1
> ERROR:  update or delete on table "p2" violates foreign key constraint
> "q_a_fkey2" on table "q"
> DETAIL:  Key (a)=(2) is still referenced from table "q".
>
> No luck, because again it's the RI delete trigger on p2 that gets
> fired.  If you make deletes cascade too, an even worse thing happens:
>
> drop table q;
> create table q (a bigint references p on update cascade on delete cascade);
> insert into q values (2);
> update p set a = 1;
> NOTICE:  BEFORE UPDATE on p2
> N

Re: Fallback table AM for relkinds without storage

2021-02-14 Thread Michael Paquier
On Tue, Feb 09, 2021 at 04:27:34PM +0900, Michael Paquier wrote:
> Putting sanity checks within all the table_* functions of tableam.h
> would not be a good idea, as nothing prevents the call of what's
> stored in rel->rd_tableam.  

I have been playing with this idea, and finished with the attached,
which is not the sexiest patch around.  The table AM used as fallback
for tables without storage is called no_storage (this could be called
virtual_am?).  Reverting e786be5 or dd705a0 leads to an error coming
from no_storage instead of a crash.

One thing to note is that this simplifies a bit slot_callbacks as
views, foreign tables and partitioned tables can grab their slot type
directly from this new table AM.
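
For context, these are the relations the fallback AM would cover;
without the patch they simply have no table AM at all (my illustration
of current behavior):

-- views ('v'), foreign tables ('f') and partitioned tables ('p')
-- currently have relam = 0, i.e. no table access method assigned
SELECT relname, relkind, relam
  FROM pg_class
 WHERE relkind IN ('v', 'f', 'p');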
--
Michael
From 0611ef60b39907a3bd405ece990d661d4d63900d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 15 Feb 2021 16:21:01 +0900
Subject: [PATCH] Add no_storage, fallback table AM for relations without
 storage

---
 src/include/access/tableam.h  |   6 +-
 src/include/catalog/pg_am.dat |   4 +
 src/include/catalog/pg_proc.dat   |   4 +
 src/backend/access/Makefile   |   5 +-
 src/backend/access/no_storage/Makefile|  18 +
 src/backend/access/no_storage/README  |   9 +
 .../access/no_storage/no_storage_handler.c| 518 ++
 src/backend/access/table/tableam.c|  33 --
 src/backend/utils/cache/relcache.c|  37 +-
 src/test/regress/expected/create_am.out   |  11 +-
 src/test/regress/expected/psql.out|  96 ++--
 11 files changed, 639 insertions(+), 102 deletions(-)
 create mode 100644 src/backend/access/no_storage/Makefile
 create mode 100644 src/backend/access/no_storage/README
 create mode 100644 src/backend/access/no_storage/no_storage_handler.c

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 33bffb6815..999e05bb89 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -832,7 +832,11 @@ typedef struct TableAmRoutine
  * for the relation.  Works for tables, views, foreign tables and partitioned
  * tables.
  */
-extern const TupleTableSlotOps *table_slot_callbacks(Relation rel);
+static inline const TupleTableSlotOps *
+table_slot_callbacks(Relation rel)
+{
+	return rel->rd_tableam->slot_callbacks(rel);
+}
 
 /*
  * Returns slot using the callbacks returned by table_slot_callbacks(), and
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 6082f0e6a8..e0e32a681f 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -15,6 +15,10 @@
 { oid => '2', oid_symbol => 'HEAP_TABLE_AM_OID',
   descr => 'heap table access method',
   amname => 'heap', amhandler => 'heap_tableam_handler', amtype => 't' },
+{ oid => '8381', oid_symbol => 'NO_STORAGE_TABLE_AM_OID',
+  descr => 'no_storage table access method',
+  amname => 'no_storage', amhandler => 'no_storage_tableam_handler',
+  amtype => 't' },
 { oid => '403', oid_symbol => 'BTREE_AM_OID',
   descr => 'b-tree index access method',
   amname => 'btree', amhandler => 'bthandler', amtype => 'i' },
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4e0c9be58c..5a796324a5 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -905,6 +905,10 @@
   proname => 'heap_tableam_handler', provolatile => 'v',
   prorettype => 'table_am_handler', proargtypes => 'internal',
   prosrc => 'heap_tableam_handler' },
+{ oid => '8382', descr => 'no_storage access method handler',
+  proname => 'no_storage_tableam_handler', provolatile => 'v',
+  prorettype => 'table_am_handler', proargtypes => 'internal',
+  prosrc => 'no_storage_tableam_handler' },
 
 # Index access method handlers
 { oid => '330', descr => 'btree index access method handler',
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 0880e0a8bb..29d3937ada 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/access
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS	= brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  table tablesample transam
+SUBDIRS	= brin common gin gist hash heap index nbtree no_storage \
+	rmgrdesc spgist \
+	table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/no_storage/Makefile b/src/backend/access/no_storage/Makefile
new file mode 100644
index 00..8d1227b428
--- /dev/null
+++ b/src/backend/access/no_storage/Makefile
@@ -0,0 +1,18 @@
+#-
+#
+# Makefile--
+#Makefile for access/no_storage
+#
+# IDENTIFICATION
+#src/backend/access/no_storage/Makefile
+#
+#-
+
+subdir = src/backend/access/no_storage
+top_builddir = ../../../..
+include $(top_bui

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-14 Thread Greg Nancarrow
On Sat, Feb 13, 2021 at 12:17 AM Amit Langote  wrote:
>
> On Thu, Feb 11, 2021 at 4:43 PM Greg Nancarrow  wrote:
> > On Thu, Feb 11, 2021 at 5:33 PM Greg Nancarrow  wrote:
> > > On Tue, Feb 9, 2021 at 1:04 AM Amit Langote  
> > > wrote:
> > > >
> > > > * I think that the concerns raised by Tsunakawa-san in:
> > > >
> > > > https://www.postgresql.org/message-id/TYAPR01MB2990CCB6E24B10D35D28B949FEA30%40TYAPR01MB2990.jpnprd01.prod.outlook.com
> > > >
> > > > regarding how this interacts with plancache.c deserve a look.
> > > > Specifically, a plan that uses parallel insert may fail to be
> > > > invalidated when partitions are altered directly (that is without
> > > > altering their root parent).  That would be because we are not adding
> > > > partition OIDs to PlannerGlobal.invalItems despite making a plan
> > > > that's based on checking their properties.  See this (tested with all
> > > > patches applied!):
> > > >
> > >
> > > Does any current Postgres code add partition OIDs to
> > > PlannerGlobal.invalItems for a similar reason?
>
> Currently, the planner opens partitions only for SELECT queries and
> also adds them to the query's range table.  And because they are added
> to the range table, their OIDs do get added to
> PlannerGlobal.relationOids (not invalItems, sorry!) by way of
> CompleteCachedPlan() calling extract_query_dependencies(), which looks
> at Query.rtable to decide which tables/partitions to add.
>
> > > I would have thought that, for example,  partitions with a default
> > > column expression, using a function that is changed from SAFE to
> > > UNSAFE, would suffer the same plancache issue (for current parallel
> > > SELECT functionality) as we're talking about here - but so far I
> > > haven't seen any code handling this.
>
> AFAIK, default column expressions don't affect plans for SELECT
> queries.  OTOH, consider a check constraint expression as an example.
> The planner may use one to exclude a partition from the plan with its
> constraint exclusion algorithm (separate from "partition pruning").
> If the check constraint is dropped, any cached plans that used it will
> be invalidated.
>

Sorry, I got that wrong, default column expressions are relevant to
INSERT, not SELECT.

> >
> > Actually, I tried adding the following in the loop that checks the
> > parallel-safety of each partition and it seemed to work:
> >
> > glob->relationOids =
> > lappend_oid(glob->relationOids, pdesc->oids[i]);
> >
> > Can you confirm, is that what you were referring to?
>
> Right.  I had mistakenly mentioned PlannerGlobal.invalItems, sorry.
>
> Although it gets the job done, I'm not sure if manipulating
> relationOids from max_parallel_hazard() or its subroutines is okay,
> but I will let the committer decide that.  As I mentioned above, the
> person who designed this decided for some reason that it is
> extract_query_dependencies()'s job to populate
> PlannerGlobal.relationOids/invalItems.
>

Yes, it doesn't really seem right doing it within max_parallel_hazard().
I tried doing it in extract_query_dependencies() instead - see
attached patch - and it seems to work, but I'm not sure if there might
be any unintended side-effects.
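
To make the hazard concrete, a hypothetical reproduction sketch (table
and function names are made up; parallel_unsafe_fn() stands for any
function whose parallel-safety the plan depended on):

-- plan an INSERT ... SELECT; with the patch set applied, the planner
-- inspects each partition's parallel-safety while planning this
PREPARE ins AS INSERT INTO parted_tbl SELECT * FROM source_tbl;
EXECUTE ins;   -- may pick a parallel insert plan

-- alter a partition directly, without going through the root parent;
-- unless the partition OIDs were recorded in PlannerGlobal.relationOids,
-- this does not invalidate the cached plan above
ALTER TABLE parted_tbl_part1
  ALTER COLUMN b SET DEFAULT parallel_unsafe_fn();

EXECUTE ins;   -- without the fix, the stale parallel plan may still run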

Regards,
Greg Nancarrow
Fujitsu Australia


setrefs.patch
Description: Binary data